TSE-NER: An Iterative Approach for Long-Tail Entity Extraction in Scientific Publications

Authors: Sepideh Mesbah, Christoph Lofi, Manuel Valle Torre, Alessandro Bozzon, Geert-Jan Houben

DOI: 10.1007/978-3-030-00671-6_8

Keywords: Natural language processing, Set (abstract data type), Training set, Named-entity recognition, Specific knowledge, Semantic expansion, Task (project management), Workaround, Artificial intelligence, Entity type, Computer science

Abstract: Named Entity Recognition and Typing (NER/NET) is a challenging task, especially with long-tail entities such as the ones found in scientific publications. These entities (e.g. “WebKB”, “StatSnowball”) are rare, often relevant only in specific knowledge domains, yet important for retrieval and exploration purposes. State-of-the-art NER approaches employ supervised machine learning models, trained on expensive type-labeled data laboriously produced by human annotators. A common workaround is the generation of labeled training data from knowledge bases; this approach is not suitable for long-tail entity types that are, by definition, scarcely represented in KBs. This paper presents an iterative approach for training NER and NET classifiers in scientific publications that relies on minimal human input, namely a small seed set of instances for the targeted entity type. We introduce different strategies for training data extraction, semantic expansion, and result filtering. We evaluate our approach on scientific publications, focusing on the long-tail entity types Datasets and Methods in computer science publications, and Proteins in biomedical publications.
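The loop the abstract describes (start from a small seed set, extract training data, expand, filter, repeat) can be illustrated with a toy sketch. This is not the authors' implementation: all function names, the example sentences, and the stopword list are hypothetical, and a simple preceding-word heuristic stands in for both the trained classifier and the semantic expansion step.

```python
import re
from collections import Counter

# Hypothetical toy corpus and stopword list, for illustration only.
SENTENCES = [
    "We evaluate on WebKB and report accuracy.",
    "Experiments on StatSnowball show similar trends.",
    "We evaluate on the same split.",
]
STOPWORDS = {"the", "a", "an", "and", "on", "we"}


def tokenize(sentence):
    """Split a sentence into alphanumeric tokens (hyphens allowed inside)."""
    return re.findall(r"[A-Za-z0-9][A-Za-z0-9\-]*", sentence)


def extract_training_sentences(sentences, known):
    """Training data extraction: keep sentences mentioning a known entity."""
    return [s for s in sentences if any(t in known for t in tokenize(s))]


def learn_contexts(train_sentences, known):
    """Stand-in for classifier training: count the word before each mention."""
    contexts = Counter()
    for sentence in train_sentences:
        tokens = tokenize(sentence)
        for i, token in enumerate(tokens):
            if token in known and i > 0:
                contexts[tokens[i - 1].lower()] += 1
    return contexts


def propose_candidates(sentences, contexts, known):
    """Propose unseen tokens that follow a learned context word."""
    candidates = set()
    for sentence in sentences:
        tokens = tokenize(sentence)
        for i in range(1, len(tokens)):
            if tokens[i - 1].lower() in contexts and tokens[i] not in known:
                candidates.add(tokens[i])
    return candidates


def filter_candidates(candidates, stopwords):
    """Toy result filter: drop stopwords and tokens without a capital or digit."""
    return {
        c for c in candidates
        if c.lower() not in stopwords
        and any(ch.isupper() or ch.isdigit() for ch in c)
    }


def iterate(sentences, seeds, stopwords, rounds=2):
    """Grow the entity set from the seeds until no new entities survive filtering."""
    known = set(seeds)
    for _ in range(rounds):
        train = extract_training_sentences(sentences, known)
        contexts = learn_contexts(train, known)
        proposed = propose_candidates(sentences, contexts, known)
        new = filter_candidates(proposed, stopwords)
        if not new:
            break
        known |= new
    return known


print(sorted(iterate(SENTENCES, {"WebKB"}, STOPWORDS)))
# prints ['StatSnowball', 'WebKB']
```

Starting from the single seed "WebKB", the first round learns that entities tend to follow "on", proposes "StatSnowball" and "the", and the filter discards "the". A realistic system would replace the heuristic with a trained sequence tagger and a proper expansion step; the sketch only shows the control flow of the iteration.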
