PUMiner: Mining Security Posts from Developer Question and Answer Websites with PU Learning

Authors: Triet Huynh Minh Le, David Hin, Roland Croft, M. Ali Babar

Abstract: Security is an increasing concern in software development. Developer Question and Answer (Q&A) websites [...] however, the required negative (non-security) class is too expensive to obtain. We propose a novel learning framework, PUMiner, to automatically mine security posts from Q&A websites. PUMiner builds a context-aware embedding model to extract features of posts, then develops a two-stage PU model to identify security content using labelled Positive and Unlabelled posts. We evaluate PUMiner on more than 17.2 million Stack Overflow posts and 52,611 StackExchange posts. We show that PUMiner is effective, with validation performance of at least 0.85 across all model configurations. Moreover, the Matthews Correlation Coefficient (MCC) of PUMiner is 0.906, 0.534 and 0.084 points higher than one-class SVM, positive-similarity filtering, and one-stage PU models on unseen testing posts, respectively. PUMiner also performs well, with an MCC of 0.745, for scenarios where string matching totally fails. Even when the ratio of labelled positive posts to unlabelled ones is only 1:100, PUMiner still achieves a strong MCC of 0.65, which is 160% better than fully-supervised learning. Using PUMiner, we provide the largest and up-to-date security content on Q&A websites for practitioners and researchers.
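To make the two ideas in the abstract concrete, the following is a minimal sketch of a generic two-stage PU (Positive-Unlabelled) learner together with the MCC metric. It is not the PUMiner implementation: the centroid-distance heuristic for picking reliable negatives, the nearest-centroid classifier, the `neg_fraction` parameter, and the toy 2-D vectors are all assumptions chosen only for illustration, standing in for the post embeddings and models described in the paper.

```python
import math

def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def distance(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def two_stage_pu(positives, unlabelled, neg_fraction=0.5):
    """Generic two-stage PU sketch (hypothetical, not the authors' code).

    Stage 1: treat the unlabelled vectors farthest from the positive
             centroid as reliable negatives.
    Stage 2: build a nearest-centroid classifier from the positives and
             those reliable negatives, returned as a predict function.
    """
    pos_c = centroid(positives)
    ranked = sorted(unlabelled, key=lambda v: distance(v, pos_c), reverse=True)
    k = max(1, int(len(ranked) * neg_fraction))
    neg_c = centroid(ranked[:k])

    def predict(v):
        # 1 = positive (e.g. security post), 0 = negative
        return 1 if distance(v, pos_c) <= distance(v, neg_c) else 0

    return predict

def mcc(y_true, y_pred):
    """Matthews Correlation Coefficient for binary labels in {0, 1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Conventionally report 0.0 when any marginal count is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Toy data: labelled positives plus a mixed unlabelled pool.
positives = [[1.0, 1.0], [1.2, 0.9], [0.9, 1.1]]
unlabelled = [[1.1, 1.0], [0.0, 0.0], [0.1, -0.1], [-0.2, 0.1]]

predict = two_stage_pu(positives, unlabelled)

# Held-out points with known labels, used only for evaluation.
test_x = [[1.0, 0.95], [0.05, 0.05], [0.9, 1.2], [-0.1, 0.0]]
y_true = [1, 0, 1, 0]
y_pred = [predict(v) for v in test_x]
score = mcc(y_true, y_pred)
```

MCC ranges from -1 to 1 and uses all four confusion-matrix cells, so it remains informative when positives are rare relative to the unlabelled pool, which is the setting the abstract describes.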


