Authors: David R. Blair, Kanix Wang, Svetlozar Nestorov, James A. Evans, Andrey Rzhetsky
DOI: 10.1371/JOURNAL.PCBI.1003799
Keywords:
Abstract: Synonymous relationships among biomedical terms are extensively annotated within specialized terminologies, implying that synonymy is important for practical computational applications within this field. It remains unclear, however, whether text mining actually benefits from documented synonymy and whether existing biomedical thesauri provide adequate coverage of these linguistic relationships. In this study, we examine the impact and extent of undocumented synonymy within a very large compendium of biomedical thesauri. First, we demonstrate that missing synonymy has a significant negative impact on named entity normalization, an important problem within the field of biomedical text mining. To estimate the amount of synonymy currently missing from thesauri, we develop a probabilistic model for the construction of synonym terminologies that is capable of handling a wide range of potential biases, and we evaluate its performance using the broader domain of near-synonymy among general English words. Our model predicts that over 90% of these relationships are currently undocumented, a result that we support experimentally through "crowd-sourcing." Finally, we apply our model to biomedical terminologies and predict that they are missing the vast majority (>90%) of the synonymous relationships they intend to document. Overall, our results expose the dramatic incompleteness of current biomedical thesauri and suggest the need for "next-generation," high-coverage lexical terminologies.