PUMiner: Mining Security Posts from Developer Question and Answer Websites with PU Learning

Authors: Triet Huynh Minh Le, David Hin, Roland Croft, M. Ali Babar

DOI: 10.1145/3379597.3387443

Abstract: Security is an increasing concern in software development. Developer Question and Answer (Q&A) websites [...]; however, the required negative (non-security) class is too expensive to obtain. We propose a novel learning framework, PUMiner, to automatically mine security posts from Q&A websites. PUMiner builds a context-aware embedding model to extract features of the posts, then develops a two-stage PU model to identify security content using the labelled Positive and Unlabelled posts. We evaluate PUMiner on more than 17.2 million Stack Overflow posts and 52,611 posts on Security StackExchange. We show that PUMiner is effective, with validation performance of at least 0.85 across all model configurations. Moreover, the Matthews Correlation Coefficient (MCC) of PUMiner is 0.906, 0.534 and 0.084 points higher than that of one-class SVM, positive-similarity filtering, and one-stage PU models on unseen testing posts, respectively. PUMiner also performs well, with an MCC of 0.745, in scenarios where string matching totally fails. Even when the ratio of labelled positive posts to unlabelled ones is only 1:100, PUMiner still achieves a strong MCC of 0.65, which is 160% better than that of fully-supervised learning. Using PUMiner, we provide the largest and up-to-date security content on Q&A websites for practitioners and researchers.
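The abstract does not spell out what the two PU stages are, so the following is only a minimal sketch of one common two-step PU scheme, not PUMiner itself: stage 1 picks "reliable negatives" from the unlabelled pool, and stage 2 trains a binary classifier on the positives versus those reliable negatives. The distance-to-centroid selection rule, the nearest-centroid classifier, the `fraction` parameter, and the 2-D toy vectors are all assumptions made for illustration; they stand in for the paper's context-aware embeddings and models. The `mcc` helper implements the standard Matthews Correlation Coefficient used as the evaluation metric.

```python
import math


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def reliable_negatives(positives, unlabelled, fraction=0.5):
    """Stage 1 (assumed rule): keep the unlabelled points that lie
    farthest from the positive centroid as reliable negatives."""
    c = centroid(positives)
    ranked = sorted(unlabelled, key=lambda v: distance(v, c), reverse=True)
    keep = max(1, int(len(ranked) * fraction))
    return ranked[:keep]


def train_nearest_centroid(positives, negatives):
    """Stage 2 (assumed model): nearest-centroid classifier trained on
    positives versus the reliable negatives from stage 1."""
    pos_c, neg_c = centroid(positives), centroid(negatives)

    def predict(v):
        return 1 if distance(v, pos_c) <= distance(v, neg_c) else 0

    return predict


def mcc(y_true, y_pred):
    """Matthews Correlation Coefficient for binary labels (0 or 1).
    Returns 0.0 when the denominator is zero, a common convention."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


# Hypothetical 2-D "embeddings": labelled positives plus an unlabelled pool
# that mixes one hidden positive with three negatives.
positives = [[1.0, 1.0], [1.2, 0.9], [0.9, 1.1]]
unlabelled = [[1.1, 1.0], [0.0, 0.0], [0.1, -0.1], [-0.2, 0.1]]

negatives = reliable_negatives(positives, unlabelled, fraction=0.5)
predict = train_nearest_centroid(positives, negatives)

# Held-out toy points with known labels, used only to exercise the metric.
test_points = [[1.0, 0.95], [0.05, 0.05], [1.1, 1.0], [0.0, 0.0]]
test_labels = [1, 0, 1, 0]
predictions = [predict(v) for v in test_points]
score = mcc(test_labels, predictions)
```

On real posts, the toy vectors would be replaced by learned document embeddings, and both the selection rule and the classifier would need tuning on validation data. The toy data here is far too small to say anything about real performance.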
