A crowd-efficient learning approach for NER based on online encyclopedia

Authors: Maolong Li, Zhixu Li, Qiang Yang, Zhigang Chen, Pengpeng Zhao

DOI: 10.1007/S11280-019-00736-3

Keywords: Task (project management), Sample (statistics), Named-entity recognition, Online encyclopedia, Selection (linguistics), Computer science, Empirical research, Set (abstract data type), Machine learning, Artificial intelligence

Abstract: Named Entity Recognition (NER) is a core task of NLP. State-of-the-art supervised NER models rely heavily on a large amount of high-quality annotated data, which is quite expensive to obtain. Various ways have been proposed to reduce the heavy reliance on training data, but only with limited effect. In this paper, we propose a crowd-efficient learning approach for NER by making full use of online encyclopedia pages. In our approach, we first define three criteria (representativeness, informativeness, diversity) to help select a much smaller set of samples for crowd labeling. We then propose a data augmentation method, which could generate a lot more training data with the help of the structured knowledge of the online encyclopedia to greatly augment the training effect. After conducting model training on the augmented sample set, we re-select some new samples for crowd labeling for model refinement. We perform the training and selection procedure iteratively until the model could not be further improved or its performance meets the requirement. Our empirical study, conducted on several real data collections, shows that our approach could reduce manual annotations by 50% with almost the same performance as the fully trained model.
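
The select, label, augment, train loop described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract gives no formulas for the three criteria, so the sketch assumes entropy of the predicted label distribution for informativeness, mean cosine similarity to the other samples for representativeness, and one minus the maximum similarity to already selected samples for diversity, combined by simple multiplication. All function and parameter names (`select_batch`, `iterative_training`, `label`, `augment`, `train`, `evaluate`, `target`) are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy of a label distribution (assumed informativeness proxy)."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_batch(vectors, probs, k):
    """Greedily pick up to k sample indices to send to the crowd.

    vectors: one feature vector per unlabeled sample (assumed non-negative).
    probs:   the current model's predicted label distribution per sample.
    """
    n = len(vectors)
    sim = [[cosine(vectors[i], vectors[j]) for j in range(n)] for i in range(n)]
    # Representativeness: mean similarity to every other sample in the pool.
    rep = [(sum(sim[i]) - sim[i][i]) / max(n - 1, 1) for i in range(n)]
    # Informativeness: how uncertain the current model is about the sample.
    info = [entropy(p) for p in probs]
    chosen = []
    while len(chosen) < min(k, n):
        def score(i):
            # Diversity: penalize samples close to ones already chosen.
            div = 1.0 - max((sim[i][j] for j in chosen), default=0.0)
            return info[i] * rep[i] * div
        best = max((i for i in range(n) if i not in chosen), key=score)
        chosen.append(best)
    return chosen


def iterative_training(pool, select, label, augment, train, evaluate,
                       target, max_rounds=10):
    """Repeat selection, crowd labeling, augmentation, and training.

    Stops when the score reaches `target`, stops improving, the pool is
    empty, or `max_rounds` is hit. Every callable is supplied by the caller.
    """
    pool = list(pool)          # work on a copy of the unlabeled pool
    labeled = []
    model, best = None, float("-inf")
    for _ in range(max_rounds):
        if not pool:
            break
        picked = select(model, pool)
        pool = [s for s in pool if s not in picked]
        labeled.extend(label(picked))           # crowd labeling step
        model = train(augment(labeled))         # train on augmented data
        score = evaluate(model)
        if score >= target or score <= best:    # good enough, or no gain
            break
        best = score
    return model, labeled
```

In this sketch `select` would typically wrap `select_batch` around the current model's predictions, and `augment` stands in for the encyclopedia-based data augmentation, whose details the abstract does not specify.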
