Chinese Word Segmentation for Terrorism-Related Contents

Authors: Daniel Zeng, Donghua Wei, Michael Chau, Feiyue Wang

Keywords: Computer science, Suffix tree, Segmentation, Mutual information, Speech recognition, Text segmentation, n-gram, Ukkonen's algorithm, Bigram, Precision and recall

Abstract: In order to analyze security- and terrorism-related content in Chinese, it is important to perform word segmentation on Chinese documents. There are many previous studies on Chinese word segmentation. The two major approaches are statistic-based and dictionary-based approaches. Pure statistic methods have lower precision, while the pure dictionary-based method cannot deal with new words and is restricted to the coverage of the dictionary. In this paper, we propose a hybrid method that avoids the limitations of both approaches. Through the use of a suffix tree and mutual information (MI) with a dictionary, our segmenter, called IASeg, achieves high accuracy in word segmentation when domain training is available. It can identify new words through MI-based token merging and dictionary update. In addition, with the Improved Bigram method it can also process N-grams. To evaluate the performance of our segmenter, we compare it with the Hylanda segmenter and the ICTCLAS segmenter using a terrorism-related corpus. The experiment results show that IASeg performs better than the two benchmarks in both precision and recall.
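The idea of MI-based token merging mentioned in the abstract can be illustrated with a minimal sketch. This is not the IASeg implementation: it assumes pointwise mutual information over adjacent tokens, a greedy left-to-right merge, and an arbitrary threshold. The function names `pmi_table` and `merge_tokens`, the toy corpus, and the threshold value are all hypothetical choices made for illustration.

```python
import math
from collections import Counter


def pmi_table(sentences):
    """Pointwise mutual information (log base 2) for each adjacent token pair."""
    unigrams = Counter()
    bigrams = Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    table = {}
    for (a, b), count in bigrams.items():
        p_ab = count / n_bi
        p_a = unigrams[a] / n_uni
        p_b = unigrams[b] / n_uni
        table[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return table


def merge_tokens(tokens, table, threshold):
    """Greedily join adjacent tokens whose PMI is at least `threshold`."""
    if not tokens:
        return []
    merged = [tokens[0]]
    prev = tokens[0]  # last original token, used as the left side of the lookup
    for tok in tokens[1:]:
        # Unseen pairs get -inf so they are never merged.
        if table.get((prev, tok), float("-inf")) >= threshold:
            merged[-1] += tok
        else:
            merged.append(tok)
        prev = tok
    return merged


# Toy corpus (assumed for illustration): each sentence is a list of characters.
corpus = [list("恐怖袭击"), list("恐怖分子"), list("袭击事件")]
table = pmi_table(corpus)
print(merge_tokens(list("恐怖袭击"), table, threshold=2.5))
```

On this toy corpus the pairs 恐/怖 and 袭/击 score above the assumed threshold while 怖/袭 does not, so the characters are grouped into 恐怖 and 袭击. A realistic segmenter would estimate these statistics from a large domain corpus and combine them with dictionary lookup, as the abstract describes.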

