Authors: Robert Leaman, Zhiyong Lu
DOI: 10.1093/BIOINFORMATICS/BTW343
Keywords:
Abstract: Motivation: Text mining is increasingly used to manage the accelerating pace of the biomedical literature. Many text mining applications depend on accurate named entity recognition (NER) and normalization (grounding). While high performing machine learning methods trainable for many entity types exist for NER, normalization methods are usually specialized to a single entity type. NER and normalization systems are also typically used in a serial pipeline, causing cascading errors and limiting the ability of the NER system to directly exploit the lexical information provided by the normalization. Methods: We propose the first machine learning model for joint NER and normalization during both training and prediction. The model is trainable for arbitrary entity types and consists of a semi-Markov structured linear classifier, with a rich feature approach for NER and supervised semantic indexing for normalization. We also introduce TaggerOne, a Java implementation of our model as a general toolkit for joint NER and normalization. TaggerOne is not specific to any entity type, requiring only annotated training data and a corresponding lexicon, and has been optimized for high throughput. Results: We validated TaggerOne with multiple gold-standard corpora containing both mention- and concept-level annotations. Benchmarking results show that TaggerOne achieves high performance on diseases (NCBI Disease corpus, NER f-score: 0.829, normalization f-score: 0.807) and chemicals (BioCreative 5 CDR corpus, NER f-score: 0.914, normalization f-score: 0.895). These results compare favorably to the previous state of the art, notwithstanding the greater flexibility of the model. We conclude that jointly modeling NER and normalization greatly improves performance. Availability and Implementation: The TaggerOne source code and an online demonstration are available at: http://www.ncbi.nlm.nih.gov/bionlp/taggerone Contact: zhiyong.lu@nih.gov Supplementary information: Supplementary data are available at Bioinformatics online.