Fusion of multiple features and supervised learning for Chinese OOV term detection and POS guessing

Authors: Yuejie Zhang, Xiangyang Xue, Lei Cen, Cheng Jin, Wei Wu

DOI: 10.5591/978-1-57735-516-8/IJCAI11-321

Keywords: Fusion, Mathematics, Machine learning, Supervised learning, Heuristic, Artificial intelligence, Term (time)

Abstract: In this paper, to support more precise Chinese Out-of-Vocabulary (OOV) term detection and Part-of-Speech (POS) guessing, a unified mechanism is proposed, formulated on the fusion of multiple features and supervised learning. Besides all the traditional features, new features for statistical information and global contexts are introduced, as well as some constraints and heuristic rules, which reveal the relationships among OOV candidates. Our experiments on corpora from both People's Daily and SIGHAN 2005 have achieved consistent results, better than those acquired by pure rule-based or statistics-based models. From the experimental results of combining our model with monolingual retrieval on the TREC-9 data sets, it is found that an obvious improvement in performance can also be obtained.
