Authors: Dongqing Zhu, Stephen Wu, Ben Carterette, Hongfang Liu
DOI: 10.1016/j.jbi.2014.03.010
Keywords: Text Retrieval Conference, Query expansion, Resource (project management), Task (computing), Term (time), Query likelihood model, Polysemy, Relevance (information retrieval), Computer science, Information retrieval
Abstract: Highlights: (1) Demonstrated the utility of an in-domain collection (clinical text) for query expansion. (2) Analyzed the effect of external collection size on a mixture of relevance models. (3) Any existing expansion configuration can benefit from the in-domain collection. In light of the heightened problems of polysemy, synonymy, and hyponymy in clinical text, we hypothesize that patient cohort identification can be improved by query expansion using large, in-domain corpora. We evaluate four auxiliary collections on the Text REtrieval Conference (TREC) task of IR-based cohort retrieval, considering the effects of collection size, the inherent difficulty of each query, and the interaction between collections. Each collection was applied to aid retrieval on the Pittsburgh NLP Repository using mixture of relevance models. Measured by mean average precision (MAP), performance with any auxiliary resource (MAP = 0.386 and above) is shown to improve over the baseline query likelihood model (MAP = 0.373). Considering subsets of the Mayo Clinic collection, we found that after including 2.5 billion term instances, performance did not benefit from adding more instances. However, results did improve significantly in a mixed setup, with the system using all collections obtaining the best results (MAP = 0.4223). Because optimal mixture models would require selective sampling of the collections, the common sense approach of "use all available data" is inappropriate, though it is still beneficial to add in-domain data. On cohort identification, expansion with these collections resulted in consistent and significant improvements. As such, IR access to large in-domain collections could serve as an additional resource. Additionally, having more data is not necessarily better, implying there is value in curation.
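The expansion method named in the abstract, a mixture of relevance models, linearly interpolates expansion-term distributions estimated from several auxiliary collections and then selects the top-weighted terms to expand the query. A minimal sketch of that interpolation step (function names, weights, and toy term probabilities are illustrative assumptions, not the paper's implementation):

```python
# Toy sketch of a mixture of relevance models for query expansion:
# P(w | R) = sum_c  lambda_c * P(w | R_c), one relevance model per
# auxiliary collection c, with mixture weights lambda_c summing to 1.

def mix_models(models, weights):
    """Linearly interpolate per-collection term distributions."""
    mixed = {}
    for model, w in zip(models, weights):
        for term, p in model.items():
            mixed[term] = mixed.get(term, 0.0) + w * p
    return mixed

def top_expansion_terms(mixed, k):
    """Pick the k highest-weighted terms as query expansion terms."""
    return [t for t, _ in sorted(mixed.items(), key=lambda x: -x[1])[:k]]

# Hypothetical relevance models from two auxiliary collections.
models = [
    {"fever": 0.5, "cough": 0.5},     # e.g. an in-domain clinical corpus
    {"fever": 0.2, "dyspnea": 0.8},   # e.g. a web/news corpus
]
weights = [0.6, 0.4]

print(top_expansion_terms(mix_models(models, weights), 2))
# -> ['fever', 'dyspnea']
```

The paper's finding that optimal mixtures require selective sampling corresponds here to tuning the `weights`: uniform weighting over all available collections ("use all available data") is not guaranteed to be the best configuration.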