Authors: Kevin W. Boyack, David Newman, Russell J. Duhon, Richard Klavans, Michael Patek
DOI: 10.1371/JOURNAL.PONE.0018029
Keywords:
Abstract: Background: We investigate the accuracy of different similarity approaches for clustering over two million biomedical documents. Clustering large sets of text documents is important for a variety of information needs and applications such as collection management and navigation, summary and analysis. The few comparisons of clustering results from different similarity approaches have focused on small literature sets and have given conflicting results. Our study was designed to seek a robust answer to the question of which similarity approach would generate the most coherent clusters of a biomedical literature set of over two million documents. Methodology: We used a corpus of 2.15 million recent (2004-2008) records from MEDLINE, and generated nine different document-document similarity matrices from information extracted from their bibliographic records, including titles, abstracts and subject headings. The nine approaches were comprised of five different analytical techniques with two data sources. The five analytical techniques are cosine similarity using term frequency-inverse document frequency vectors (tf-idf cosine), latent semantic analysis (LSA), topic modeling, and two Poisson-based language models: BM25 and PMRA (PubMed Related Articles). The two data sources were a) MeSH subject headings, and b) words from titles and abstracts. Each similarity matrix was filtered to keep the top-n highest similarities per document and then clustered using a combination of graph layout and average-link clustering. Cluster results from the nine similarity approaches were compared using (1) within-cluster textual coherence based on the Jensen-Shannon divergence, and (2) two concentration measures based on grant-to-article linkages indexed in MEDLINE. Conclusions: PubMed's own related article approach (PMRA) generated the most coherent and most concentrated cluster solution of the nine text-based similarity approaches tested, followed closely by the BM25 approach using titles and abstracts. Approaches using only MeSH subject headings were not competitive with those based on titles and abstracts.
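
Three of the steps named in the abstract (tf-idf cosine similarity, top-n filtering of each document's similarities, and the Jensen-Shannon divergence) can be illustrated with a small, self-contained sketch. This is not the authors' implementation: the function names, the plain `log(N/df)` idf weighting, the tie-breaking rule, and the use of base-2 logarithms are all assumptions made for illustration, and a corpus of millions of records would need sparse-matrix tooling rather than these nested loops.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build one sparse tf-idf vector (dict: term -> weight) per tokenized document.

    Assumption: raw term counts times log(N / df); the study's exact weighting
    is not stated in the abstract.
    """
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append(
            {t: c * math.log(n_docs / doc_freq[t]) for t, c in counts.items()}
        )
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_n_similarities(vectors, n):
    """For each document, keep only its n most similar other documents.

    Ties are broken by the lower document index (an arbitrary choice).
    """
    kept = {}
    for i, u in enumerate(vectors):
        scored = [(cosine(u, v), j) for j, v in enumerate(vectors) if j != i]
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        kept[i] = [(j, score) for score, j in scored[:n]]
    return kept


def jensen_shannon(p, q):
    """Jensen-Shannon divergence (in bits) between two discrete distributions.

    p and q are dicts mapping outcomes to probabilities that each sum to 1.
    """
    outcomes = set(p) | set(q)
    mix = {t: 0.5 * (p.get(t, 0.0) + q.get(t, 0.0)) for t in outcomes}

    def kl_to_mix(a):
        return sum(a[t] * math.log2(a[t] / mix[t]) for t in a if a[t] > 0.0)

    return 0.5 * kl_to_mix(p) + 0.5 * kl_to_mix(q)
```

With base-2 logarithms the divergence ranges from 0 for identical distributions to 1 for distributions with no shared outcomes. One way to turn it into a coherence score, offered here only as an assumption, is to compare each document's word distribution with that of its cluster, so that lower divergence indicates a more coherent cluster.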