Clustering More than Two Million Biomedical Publications: Comparing the Accuracies of Nine Text-Based Similarity Approaches

Authors: Kevin W. Boyack, David Newman, Russell J. Duhon, Richard Klavans, Michael Patek

DOI: 10.1371/JOURNAL.PONE.0018029

Abstract:

Background: We investigate the accuracy of different similarity approaches for clustering over two million biomedical documents. Clustering large sets of text documents is important for a variety of information needs and applications such as collection management and navigation, summary and analysis. The few comparisons of clustering results from different similarity approaches have focused on small literature sets and have given conflicting results. Our study was designed to seek a robust answer to the question of which similarity approach would generate the most coherent clusters of a biomedical literature set of over two million documents.

Methodology: We used a corpus of 2.15 million recent (2004-2008) records from MEDLINE, and generated nine different document-document similarity matrices from information extracted from their bibliographic records, including titles, abstracts and subject headings. The nine approaches comprised five different analytical techniques with two data sources. The five analytical techniques are cosine similarity using term frequency-inverse document frequency vectors (tf-idf cosine), latent semantic analysis (LSA), topic modeling, and two Poisson-based language models, BM25 and PMRA (PubMed Related Articles). The two data sources were a) MeSH subject headings, and b) words from titles and abstracts. Each similarity matrix was filtered to keep the top-n highest similarities per document and then clustered using a combination of graph layout and average-link clustering. Cluster results from the nine similarity approaches were compared using (1) within-cluster textual coherence based on the Jensen-Shannon divergence, and (2) two concentration measures based on grant-to-article linkages indexed in MEDLINE.

Conclusions: PubMed's own related article approach (PMRA) generated the most coherent and most concentrated cluster solution of the nine text-based similarity approaches tested, followed closely by the BM25 approach using titles and abstracts. Approaches using only MeSH subject headings were not competitive with those based on titles and abstracts.
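
To make three of the building blocks named in the abstract concrete (tf-idf cosine similarity, keeping the top-n highest similarities per document, and the Jensen-Shannon divergence), here is a minimal Python sketch. It is an illustration under stated assumptions and not the authors' implementation: the function names, the raw-count tf weighting, the unsmoothed log idf, the tie-breaking rule, and the toy documents are all hypothetical choices, and the brute-force pairwise comparison would not scale to the 2.15 million records used in the study.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build sparse tf-idf vectors (dicts) from tokenized documents.

    Assumption: tf is the raw term count and idf is log(N / df);
    other weighting variants are equally plausible.
    """
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_n_neighbors(vectors, n):
    """For each document, keep only its n most similar other documents.

    Ties are broken by the lower document index (an arbitrary choice).
    """
    neighbors = {}
    for i, u in enumerate(vectors):
        scored = [(cosine(u, v), j) for j, v in enumerate(vectors) if j != i]
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        neighbors[i] = [(j, score) for score, j in scored[:n]]
    return neighbors


def jensen_shannon_divergence(p, q):
    """Base-2 Jensen-Shannon divergence between two word distributions.

    p and q map words to probabilities; the result lies in [0, 1].
    """
    def kl(a, m):
        return sum(pa * math.log2(pa / m[t]) for t, pa in a.items() if pa > 0)

    terms = set(p) | set(q)
    mix = {t: 0.5 * (p.get(t, 0.0) + q.get(t, 0.0)) for t in terms}
    return 0.5 * kl(p, mix) + 0.5 * kl(q, mix)


if __name__ == "__main__":
    # Hypothetical toy corpus standing in for title/abstract words.
    toy_docs = [
        ["gene", "expression", "cancer"],
        ["gene", "expression", "tumor"],
        ["heart", "surgery", "outcome"],
    ]
    vecs = tfidf_vectors(toy_docs)
    print(top_n_neighbors(vecs, 1))
    print(jensen_shannon_divergence({"gene": 1.0}, {"heart": 1.0}))
```

In this sketch the top-n filter turns a dense similarity matrix into a sparse neighbor list that a clustering step could consume, and a lower divergence between a document's word distribution and its cluster's word distribution would indicate a more textually coherent cluster. How the study aggregates such divergences into a coherence score is not specified in the abstract, so that step is left out.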
