Named Entity Extraction via Automatic Labeling and Tri-training: Comparison of Selection Methods

Authors: Chien-Lung Chou, Chia-Hui Chang

DOI: 10.1007/978-3-319-12844-3_21

Keywords: Pattern recognition; Task (project management); Initialization; Knowledge engineering; Named entity; Natural language processing; Artificial intelligence; Personal name; Named-entity recognition; Computer science; Sequence labeling; Selection method

Abstract: Detecting named entities in documents is one of the most important tasks in knowledge engineering. Previous studies rely on annotated training data, and large annotated data sets are quite expensive to obtain, which limits the effectiveness of recognition. In this research, we propose a semi-supervised learning approach for named entity recognition (NER) via automatic labeling and tri-training, which makes use of unlabeled data and structured resources containing known named entities. By modifying tri-training for sequence labeling and deriving a proper initialization, we can train an NER model for Web news articles automatically with satisfactory performance. In the task of Chinese personal name extraction from 8,672 news articles (with 364,685 sentences and 54,449 (11,856 distinct) person names), an F-measure of 90.4% can be achieved.
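
The automatic labeling step mentioned in the abstract can be illustrated with a minimal sketch. It assumes that "automatic labeling" means tagging unlabeled sentences by matching them against a list of known names. The record does not give the authors' actual procedure, so the function name `auto_label`, the BIO tag scheme, the longest-match rule, and the sample name are all hypothetical choices made for illustration.

```python
def auto_label(tokens, known_entities, tag="PER"):
    """Assign BIO tags by longest-match lookup against known entity strings.

    tokens: list of tokens (single characters for Chinese text).
    known_entities: set of token tuples, e.g. {("王", "小", "明")}.
    Hypothetical helper; not the procedure from the paper.
    """
    labels = ["O"] * len(tokens)
    max_len = max((len(e) for e in known_entities), default=0)
    i = 0
    while i < len(tokens):
        matched = 0
        # Try the longest candidate span first so a full name
        # wins over any shorter name that is its prefix.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            if tuple(tokens[i:i + n]) in known_entities:
                matched = n
                break
        if matched:
            labels[i] = "B-" + tag
            for j in range(i + 1, i + matched):
                labels[j] = "I-" + tag
            i += matched
        else:
            i += 1
    return labels


# Usage: character-level tokens and one made-up known name.
sentence = list("王小明昨天抵達台北")
names = {tuple("王小明")}
print(list(zip(sentence, auto_label(sentence, names))))
```

Sentences labeled this way could serve as noisy training data for a sequence labeler. Exact string matching misses names absent from the list and can mislabel ambiguous strings, which is one reason a semi-supervised step such as tri-training may be applied afterwards.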

References (14)
Andrew McCallum, Wei Li. Semi-supervised sequence modeling with syntactic topic models. National Conference on Artificial Intelligence, pp. 813-818 (2005).
Wenliang Chen, Yujie Zhang, Hitoshi Isahara. Chinese chunking with tri-training learning. International Conference on the Computer Processing of Oriental Languages, pp. 466-473 (2006). DOI: 10.1007/11940098_49
Sally A. Goldman, Yan Zhou. Enhancing Supervised Learning with Unlabeled Data. International Conference on Machine Learning, pp. 327-334 (2000).
Kamal Nigam, Rayid Ghani. Analyzing the effectiveness and applicability of co-training. Proceedings of the Ninth International Conference on Information and Knowledge Management (CIKM '00), pp. 86-93 (2000). DOI: 10.1145/354756.354805
Avrim Blum, Tom Mitchell. Combining labeled and unlabeled data with co-training. Conference on Learning Theory, pp. 92-100 (1998). DOI: 10.1145/279943.279962
Lei Zheng, Shaojun Wang, Yan Liu, Chi-Hoon Lee. Information theoretic regularization for semi-supervised boosting. Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '09), pp. 1017-1026 (2009). DOI: 10.1145/1557019.1557129
Andrew McCallum, Gideon S. Mann. Generalized Expectation Criteria for Semi-Supervised Learning with Weakly Labeled Data. Journal of Machine Learning Research, vol. 11, pp. 955-984 (2010). DOI: 10.5555/1756006.1756038
Zhi-Hua Zhou, Ming Li. Tri-training: exploiting unlabeled data using three classifiers. IEEE Transactions on Knowledge and Data Engineering, vol. 17, pp. 1529-1541 (2005). DOI: 10.1109/TKDE.2005.186
Andrew McCallum, Wei Li. Early results for named entity recognition with conditional random fields, feature induction and web-enhanced lexicons. North American Chapter of the Association for Computational Linguistics, pp. 188-191 (2003). DOI: 10.3115/1119176.1119206
Yves Grandvalet, Yoshua Bengio. Semi-supervised Learning by Entropy Minimization. Neural Information Processing Systems, vol. 17, pp. 529-536 (2004).