Authors: Yun Zhang, David Lo, Xin Xia, Jian-Ling Sun
DOI: 10.1007/S11390-015-1576-4
Keywords:
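Abstract: Stack Overflow is a popular online question and answer site for software developers to share their experience and expertise. Among the numerous questions posted in Stack Overflow, two or more of them may express the same point and thus are duplicates of one another. Duplicate questions make Stack Overflow site maintenance harder, waste resources that could have been used to answer other questions, and cause developers to unnecessarily wait for answers that are already available. To reduce the problem of duplicate questions, Stack Overflow allows questions to be manually marked as duplicates of others. Since there are thousands of questions submitted to Stack Overflow every day, manually identifying duplicate questions is a difficult task. Thus, there is a need for an automated approach that can help in detecting these duplicate questions. To address the above-mentioned need, in this paper, we propose an automated approach named DupPredictor that takes a new question as input and detects potential duplicates of this question by considering multiple factors. DupPredictor extracts the title and description of a question and also the tags that are attached to the question. These pieces of information (title, description, and a few tags) are mandatory information that a user needs to input when posting a question. DupPredictor then computes the latent topics of each question by using a topic model. Next, for each pair of questions, it computes four similarity scores by comparing their titles, descriptions, latent topics, and tags. These four similarity scores are finally combined together to result in a new similarity score that comprehensively considers the multiple factors. To examine the benefit of DupPredictor, we perform an experiment on a Stack Overflow dataset which contains a total of more than two million questions. The result shows that DupPredictor can achieve a recall-rate@20 score of 63.8%. We compare our approach with the standard search engine of Stack Overflow, and DupPredictor improves its recall-rate@10 score by 40.63%. We also compare our approach with approaches that only use title, description, topic, and tag similarity and with Runeson et al.'s approach that has been used to detect duplicate bug reports, and DupPredictor improves their recall-rate@10 scores by 27.2%, 97.4%, 746.0%, 231.1%, and 16.4% respectively.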
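
The scoring step described in the abstract (four per-factor similarities combined into one score, then used to rank candidate duplicates) can be sketched as follows. This is a minimal illustration, not the authors' implementation. The abstract does not specify the similarity measures, the combination rule, or the weights, so the choices here are assumptions: cosine similarity over bag-of-words vectors for titles, descriptions, and tags, cosine similarity over topic-probability vectors, and a weighted sum with hypothetical weights. The function names and the question dictionary layout are likewise invented for the example, and the topic vectors are assumed to come from a separately trained topic model.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def bag_of_words(text):
    """Lowercased whitespace tokens with counts (assumed preprocessing)."""
    return Counter(text.lower().split())


def combined_score(q1, q2, weights):
    """Weighted sum of title, description, topic, and tag similarities.

    `weights` is a hypothetical 4-tuple; the abstract does not say how
    the four scores are weighted or combined.
    """
    w_title, w_desc, w_topic, w_tag = weights
    title = cosine(bag_of_words(q1["title"]), bag_of_words(q2["title"]))
    desc = cosine(bag_of_words(q1["description"]),
                  bag_of_words(q2["description"]))
    # Topic vectors are dense lists of probabilities from a topic model.
    topic = cosine(dict(enumerate(q1["topics"])),
                   dict(enumerate(q2["topics"])))
    tag = cosine(Counter(q1["tags"]), Counter(q2["tags"]))
    return w_title * title + w_desc * desc + w_topic * topic + w_tag * tag


def rank_candidates(new_question, candidates, weights, k):
    """Return the ids of the k candidates most similar to new_question."""
    scored = sorted(
        candidates,
        key=lambda c: combined_score(new_question, c, weights),
        reverse=True,
    )
    return [c["id"] for c in scored[:k]]


def recall_rate_at_k(ranked_lists, true_duplicates):
    """Fraction of queries whose known duplicate appears in its top-k list."""
    hits = sum(1 for ranked, dup in zip(ranked_lists, true_duplicates)
               if dup in ranked)
    return hits / len(ranked_lists)
```

With a ranking of this kind, recall-rate@k is the share of new questions whose known duplicate appears among the top k returned candidates, which is the measure the abstract reports at k = 10 and k = 20.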