Multi-Factor Duplicate Question Detection in Stack Overflow

Authors: Yun Zhang, David Lo, Xin Xia, Jian-Ling Sun

DOI: 10.1007/S11390-015-1576-4

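Abstract: Stack Overflow is a popular on-line question and answer site for software developers to share their experience and expertise. Among the numerous questions posted in Stack Overflow, two or more of them may express the same point and thus are duplicates of one another. Duplicate questions make site maintenance harder, waste resources that could have been used to answer other questions, and cause developers to wait unnecessarily for answers that are already available. To reduce the problem of duplicate questions, Stack Overflow allows questions to be manually marked as duplicates of others. Since there are thousands of questions submitted to Stack Overflow every day, manually identifying duplicate questions is difficult work. Thus, there is a need for an automated approach that can help in detecting these duplicate questions. To address the above-mentioned need, in this paper, we propose an automated approach named DupPredictor that takes a new question as input and detects potential duplicates of this question by considering multiple factors. DupPredictor extracts the title and description of a question and also the tags that are attached to the question. These pieces of information (title, description, and a few tags) are mandatory information that a user needs to input when posting a question. DupPredictor then computes the latent topics of each question by using a topic model. Next, for each pair of questions, it computes four similarity scores by comparing their titles, descriptions, latent topics, and tags. These four similarity scores are finally combined together to result in a new similarity score that comprehensively considers the multiple factors. To examine the benefit of DupPredictor, we perform an experiment on a Stack Overflow dataset which contains a total of more than two million questions. The result shows that DupPredictor can achieve a recall-rate@20 score of 63.8%. We compare our approach with the standard search engine of Stack Overflow, and DupPredictor improves its recall-rate@10 score by 40.63%. We also compare our approach with approaches that only use title, description, topic, and tag similarity, and with Runeson et al.'s approach that has been used to detect duplicate bug reports; DupPredictor improves their recall-rate@10 scores by 27.2%, 97.4%, 746.0%, 231.1%, and 16.4% respectively.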
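The pipeline described in the abstract (four per-factor similarity scores combined into one score, then candidates ranked and evaluated with recall-rate@k) can be illustrated with a minimal Python sketch. The abstract does not say which similarity measure is used or how the four scores are combined, so everything below is an assumption for illustration: cosine similarity for every factor, a weighted sum with equal weights, topic distributions supplied as precomputed lists rather than produced by a topic model, and the hypothetical names `combined_score`, `rank_candidates`, and `recall_rate_at_k`. Recall-rate@k is taken here in its common sense: the fraction of duplicate questions whose known original appears among the top k candidates.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse vectors given as dicts."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def bag_of_words(text):
    """Lower-cased whitespace tokens with their counts (no stemming)."""
    return Counter(text.lower().split())


def combined_score(q1, q2, weights=(0.25, 0.25, 0.25, 0.25)):
    """Weighted sum of title, description, topic, and tag similarity.

    Equal weights are an assumption, not the paper's setting.
    """
    scores = (
        cosine(bag_of_words(q1["title"]), bag_of_words(q2["title"])),
        cosine(bag_of_words(q1["description"]),
               bag_of_words(q2["description"])),
        # Topic distributions are assumed to be precomputed lists of floats.
        cosine(dict(enumerate(q1["topics"])), dict(enumerate(q2["topics"]))),
        cosine(Counter(q1["tags"]), Counter(q2["tags"])),
    )
    return sum(w * s for w, s in zip(weights, scores))


def rank_candidates(new_q, existing, k=20):
    """Return the k existing questions most similar to new_q."""
    ranked = sorted(existing,
                    key=lambda q: combined_score(new_q, q),
                    reverse=True)
    return ranked[:k]


def recall_rate_at_k(queries, existing, k):
    """Fraction of (question, original_id) pairs whose original is in the top k."""
    hits = 0
    for new_q, original_id in queries:
        top_ids = [q["id"] for q in rank_candidates(new_q, existing, k)]
        hits += original_id in top_ids
    return hits / len(queries)


existing = [
    {"id": 1, "title": "how to reverse a list in python",
     "description": "i want to reverse a python list",
     "topics": [0.9, 0.1], "tags": ["python", "list"]},
    {"id": 2, "title": "null pointer exception in java",
     "description": "my java program throws a null pointer exception",
     "topics": [0.1, 0.9], "tags": ["java", "exception"]},
]
new_q = {"id": 3, "title": "reverse python list",
         "description": "how do i reverse a list",
         "topics": [0.8, 0.2], "tags": ["python"]}

best = rank_candidates(new_q, existing, k=1)[0]
```

In this toy data, `best` is the Python question (id 1), since it shares title words, description words, a tag, and a similar topic distribution with the new question. A real system would replace the equal weights with tuned ones and obtain the topic vectors from a trained topic model.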

References (43)
Berthier A. Ribeiro-Neto, Ricardo A. Baeza-Yates. Modern Information Retrieval: The Concepts and Technology Behind Search, second edition. Pearson Education Ltd., Harlow, England (2011).
Hinrich Schütze, Christopher D. Manning, Prabhakar Raghavan. Introduction to Information Retrieval (2005).
David M. Blei, Andrew Y. Ng, Michael I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, vol. 3, pp. 993-1022 (2003). DOI: 10.5555/944919.944937
Swapna Gottipati, David Lo, Jing Jiang. Finding relevant answers in software forums. Automated Software Engineering, pp. 323-332 (2011). DOI: 10.1109/ASE.2011.6100069
Pavneet Singh Kochhar, Ferdian Thung, David Lo. Automatic fine-grained issue report reclassification. International Conference on Engineering of Complex Computer Systems, pp. 126-135 (2014). DOI: 10.1109/ICECCS.2014.25
Didi Surian, Nian Liu, David Lo, Hanghang Tong, Ee-Peng Lim, Christos Faloutsos. Recommending people in developers' collaboration network. Working Conference on Reverse Engineering, pp. 379-388 (2011). DOI: 10.1109/WCRE.2011.53
Anahita Alipour, Abram Hindle, Eleni Stroulia. A contextual approach towards more accurate duplicate bug report detection. Mining Software Repositories, pp. 183-192 (2013). DOI: 10.1109/MSR.2013.6624026
Qiaona Hong, Sunghun Kim, S.C. Cheung, Christian Bird. Understanding a developer social network and its evolution. International Conference on Software Maintenance, pp. 323-332 (2011). DOI: 10.1109/ICSM.2011.6080799
John Anvik, Lyndon Hiew, Gail C. Murphy. Coping with an open bug repository. Eclipse Technology Exchange, pp. 35-39 (2005). DOI: 10.1145/1117696.1117704
Alina Lazar, Sarah Ritchey, Bonita Sharif. Improving the accuracy of duplicate bug report detection using textual similarity measures. Mining Software Repositories, pp. 308-311 (2014). DOI: 10.1145/2597073.2597088