TaggerOne: joint named entity recognition and normalization with semi-Markov Models

Authors: Robert Leaman, Zhiyong Lu

DOI: 10.1093/BIOINFORMATICS/BTW343

Abstract: Motivation: Text mining is increasingly used to manage the accelerating pace of the biomedical literature. Many text mining applications depend on accurate named entity recognition (NER) and normalization (grounding). While high-performing machine learning methods trainable for many entity types exist for NER, normalization methods are usually specialized to a single entity type. NER and normalization systems are also typically used in a serial pipeline, causing cascading errors and limiting the ability of the NER system to directly exploit the lexical information provided by the normalization. Methods: We propose the first machine learning model for joint NER and normalization during both training and prediction. The model is trainable for arbitrary entity types and consists of a semi-Markov structured linear classifier, with a rich feature approach for NER and supervised semantic indexing for normalization. We also introduce TaggerOne, a Java implementation of our model as a general toolkit for joint NER and normalization. TaggerOne is not specific to any entity type, requiring only annotated training data and a corresponding lexicon, and has been optimized for high throughput. Results: We validated TaggerOne with multiple gold-standard corpora containing both mention- and concept-level annotations. Benchmarking results show that TaggerOne achieves high performance on diseases (NCBI Disease corpus, NER f-score: 0.829, normalization f-score: 0.807) and chemicals (BioCreative 5 CDR corpus, NER f-score: 0.914, normalization f-score: 0.895). These results compare favorably to the previous state of the art, notwithstanding the greater flexibility of the model. We conclude that jointly modeling NER and normalization greatly improves performance. Availability and Implementation: The TaggerOne source code and an online demonstration are available at: http://www.ncbi.nlm.nih.gov/bionlp/taggerone Contact: zhiyong.lu@nih.gov Supplementary information: Supplementary data are available at Bioinformatics online.
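
To illustrate what joint decoding with a semi-Markov model can look like, here is a minimal Python sketch. It is not TaggerOne's implementation (TaggerOne is written in Java and learns its weights from annotated data). The lexicon, the concept IDs (`TOY:1`, `TOY:2`), the function names, and the hand-written overlap score are all assumptions made for illustration. The only idea taken from the abstract is that each candidate segment receives a single score covering both its boundaries and its concept, so the segmentation and the normalization are chosen together rather than in a pipeline.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical toy lexicon (concept ID -> name tokens); not from the paper.
LEXICON: Dict[str, List[str]] = {
    "TOY:1": ["breast", "cancer"],
    "TOY:2": ["diabetes", "mellitus"],
}

# (start, end exclusive, concept ID, or None for a non-entity token)
Segment = Tuple[int, int, Optional[str]]


def score_segment(tokens: List[str], start: int, end: int) -> Tuple[float, Optional[str]]:
    """Toy joint score for one segment: returns (score, best concept or None).

    A real system would use learned feature weights; this hand-written
    overlap heuristic is only a stand-in.
    """
    span = {t.lower() for t in tokens[start:end]}
    # Non-entity segments are restricted to single tokens (score 0).
    best_score = 0.0 if end - start == 1 else float("-inf")
    best_id: Optional[str] = None
    for concept_id, name in LEXICON.items():
        overlap = len(span & set(name))
        # Reward shared tokens; penalize tokens unmatched on either side.
        s = 2.0 * overlap - (len(span) - overlap) - (len(name) - overlap)
        if s > best_score:
            best_score, best_id = s, concept_id
    return best_score, best_id


def decode(tokens: List[str], max_len: int = 3) -> List[Segment]:
    """Semi-Markov Viterbi: best segmentation with a concept per segment."""
    n = len(tokens)
    best = [float("-inf")] * (n + 1)  # best[i]: best score covering tokens[:i]
    back: List[Optional[Segment]] = [None] * (n + 1)
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            seg_score, concept = score_segment(tokens, start, end)
            total = best[start] + seg_score
            if total > best[end]:
                best[end] = total
                back[end] = (start, end, concept)
    # Follow back-pointers from the end of the sentence.
    segments: List[Segment] = []
    pos = n
    while pos > 0:
        seg = back[pos]
        assert seg is not None
        segments.append(seg)
        pos = seg[0]
    return segments[::-1]


tokens = "patients with breast cancer and diabetes mellitus".split()
entities = [seg for seg in decode(tokens) if seg[2] is not None]
print(entities)
```

In this toy setting the two-token span "breast cancer" scores higher as one segment than "breast" and "cancer" tagged separately, so the lexicon match directly influences where the entity boundaries fall.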
