Using large clinical corpora for query expansion in text-based cohort identification

Authors: Dongqing Zhu, Stephen Wu, Ben Carterette, Hongfang Liu

DOI: 10.1016/J.JBI.2014.03.010

Keywords: Text Retrieval Conference; Query expansion; Resource (project management); Task (computing); Term (time); Query likelihood model; Polysemy; Relevance (information retrieval); Computer science; Information retrieval

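Abstract: Highlights: (1) Demonstrated the utility of an in-domain collection (clinical text) for query expansion. (2) Analyzed the effect of external collection size on a mixture of relevance models. (3) Any existing query expansion configuration can benefit from an in-domain collection.

In light of the heightened problems of polysemy, synonymy, and hyponymy in clinical text, we hypothesize that patient cohort identification can be improved by using a large, in-domain clinical corpus for query expansion. We evaluate the utility of four auxiliary collections for the Text REtrieval Conference task of IR-based cohort retrieval, considering the effects of collection size, the inherent difficulty of a query, and the interaction between the collections. Each collection was applied to aid cohort retrieval from the Pittsburgh NLP Repository using a mixture of relevance models. Measured by mean average precision, performance using any auxiliary resource (MAP=0.386 and above) is shown to improve over the baseline query likelihood model (MAP=0.373). Considering subsets of the Mayo Clinic collection, we found that after including 2.5 billion term instances, retrieval was not improved by adding more instances. However, adding the Mayo Clinic collection did significantly improve over any existing setup, with a system using all four auxiliary collections obtaining the best results (MAP=0.4223). Because optimal results in the mixture of relevance models would require selective sampling of the collections, the common sense approach of "use all available data" is inappropriate. However, we found that it was still beneficial to add the Mayo Clinic collection to any mixture of relevance models. On the task of IR-based cohort identification, query expansion with the Mayo Clinic collection resulted in consistent and significant improvements. As such, any IR query expansion with access to a large clinical corpus could benefit from the additional resource. Additionally, we have shown that more data is not necessarily better, implying that there is value in collection curation.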
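The abstract names two pieces of machinery without showing them: a mixture of relevance models built from several auxiliary collections, and mean average precision (MAP) as the evaluation measure. The following is a minimal sketch of both, not the authors' implementation. It assumes each collection's relevance model is already available as a term-to-probability dictionary and that the mixture is a simple weighted linear interpolation; the function names, weights, and toy terms are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def mix_relevance_models(models, weights):
    """Linearly interpolate per-collection relevance models.

    models:  list of dicts mapping term -> P(term | relevance model)
    weights: list of non-negative mixture weights, one per model
             (assumed values; they are normalized to sum to 1 here)
    """
    if len(models) != len(weights):
        raise ValueError("need exactly one weight per relevance model")
    total = sum(weights)
    if total <= 0:
        raise ValueError("mixture weights must sum to a positive value")
    mixed = defaultdict(float)
    for model, weight in zip(models, weights):
        for term, prob in model.items():
            mixed[term] += (weight / total) * prob
    return dict(mixed)


def average_precision(ranked_ids, relevant_ids):
    """Average precision of one ranked list against a set of relevant ids."""
    if not relevant_ids:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant hit
    return precision_sum / len(relevant_ids)


def mean_average_precision(runs):
    """Mean of average precision over (ranked_ids, relevant_ids) pairs."""
    if not runs:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


if __name__ == "__main__":
    # Toy relevance models from two hypothetical auxiliary collections.
    clinical = {"mi": 0.6, "heart": 0.4}
    literature = {"mi": 0.2, "infarction": 0.8}
    print(mix_relevance_models([clinical, literature], [0.5, 0.5]))
    print(mean_average_precision([(["d1", "d2", "d3"], {"d1", "d3"})]))
```

In this sketch, changing the weights is what would correspond to favoring one auxiliary collection over another; how the paper actually sets or samples those weights is not stated in the abstract.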
