Chinese Word Segmentation for Terrorism-Related Contents

Authors: Daniel Zeng, Donghua Wei, Michael Chau, Feiyue Wang

DOI: 10.1007/978-3-540-69304-8_1

Keywords: Computer science, Suffix tree, Segmentation, Mutual information, Speech recognition, Text segmentation, n-gram, Ukkonen's algorithm, Bigram, Precision and recall

Abstract: In order to analyze security and terrorism related content in Chinese, it is important to perform word segmentation on Chinese documents. There are many previous studies on Chinese word segmentation. The two major approaches are statistic-based and dictionary-based approaches. The pure statistic methods have lower precision, while the pure dictionary-based method cannot deal with new words and is restricted to the coverage of the dictionary. In this paper, we propose a hybrid method that avoids the limitations of both approaches. Through the use of a suffix tree and mutual information (MI) with the dictionary, our segmenter, called IASeg, achieves high accuracy in word segmentation when domain training is available. It can identify new words through MI-based token merging and dictionary update. In addition, with the Improved Bigram method it can also process N-grams. To evaluate the performance of our segmenter, we compare it with the Hylanda segmenter and the ICTCLAS segmenter using a terrorism-related corpus. The experiment results show that IASeg performs better than the two benchmarks in both precision and recall.
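The abstract mentions MI-based token merging but gives no formula or procedure. As a rough illustration of the general idea only, the sketch below scores adjacent token pairs with pointwise mutual information, MI(a, b) = log2(P(a, b) / (P(a) P(b))), and greedily merges pairs whose score exceeds a threshold. This is not IASeg's algorithm: the function names `pmi_table` and `merge_tokens`, the frequency-based probability estimates, the greedy left-to-right strategy, and the threshold are all assumptions made for this example, and the suffix tree, dictionary lookup, and Improved Bigram steps are omitted.

```python
import math
from collections import Counter


def pmi_table(sentences):
    """Estimate pointwise mutual information for every adjacent token pair.

    `sentences` is a list of token lists (for Chinese, typically single
    characters or the output of an initial dictionary pass).
    Probabilities are plain relative frequencies (an assumption of this sketch).
    """
    unigrams = Counter()
    bigrams = Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))

    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    table = {}
    for (a, b), count in bigrams.items():
        p_ab = count / n_bi
        p_a = unigrams[a] / n_uni
        p_b = unigrams[b] / n_uni
        table[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return table


def merge_tokens(tokens, table, threshold):
    """Greedily merge adjacent tokens whose PMI is above `threshold`.

    Pairs never seen in the training data are treated as having PMI of
    negative infinity, so they are never merged.
    """
    merged = []
    i = 0
    while i < len(tokens):
        pair = (tokens[i], tokens[i + 1]) if i + 1 < len(tokens) else None
        if pair is not None and table.get(pair, float("-inf")) > threshold:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2  # both tokens consumed by the merge
        else:
            merged.append(tokens[i])
            i += 1
    return merged
```

In a hybrid segmenter of the kind described, merged tokens that recur often enough could then be added to the dictionary as new-word candidates. A suitable threshold would have to be tuned on domain training data.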
