Authors: Daniel Zeng, Donghua Wei, Michael Chau, Feiyue Wang
DOI: 10.1007/978-3-540-69304-8_1
Keywords: Computer science, Suffix tree, Segmentation, Mutual information, Speech recognition, Text segmentation, n-gram, Ukkonen's algorithm, Bigram, Precision and recall
Abstract: In order to analyze security and terrorism related content in Chinese, it is important to perform word segmentation on Chinese documents. There are many previous studies on Chinese word segmentation. The two major approaches are statistic-based and dictionary-based approaches. The pure statistic methods have lower precision, while the pure dictionary-based method cannot deal with new words and is restricted by the coverage of the dictionary. In this paper, we propose a hybrid method that avoids the limitations of both approaches. Through the use of a suffix tree and mutual information (MI) with the dictionary, our segmenter, called IASeg, achieves high accuracy in word segmentation when domain training is available. It can identify new words through MI-based token merging and dictionary update. In addition, with the Improved Bigram method, it can also process N-grams. To evaluate the performance of our segmenter, we compare it with the Hylanda segmenter and the ICTCLAS segmenter using a terrorism-related corpus. The experiment results show that IASeg performs better than the two benchmarks in both precision and recall.
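The abstract names MI-based token merging but does not give the formula or the merging procedure. The sketch below is only an illustration of the general idea and is not IASeg's implementation: it assumes pointwise mutual information over adjacent tokens, a greedy left-to-right merge, and a hand-picked threshold. The function names `pmi_table` and `merge_tokens`, the toy corpus, and the threshold value are all hypothetical.

```python
import math
from collections import Counter


def pmi_table(corpus):
    """Estimate pointwise MI for every adjacent token pair seen in `corpus`.

    `corpus` is a list of token lists. Probabilities are plain relative
    frequencies (an assumption; no smoothing is applied).
    """
    unigrams = Counter()
    bigrams = Counter()
    for tokens in corpus:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    table = {}
    for (a, b), count in bigrams.items():
        p_ab = count / n_bi
        p_a = unigrams[a] / n_uni
        p_b = unigrams[b] / n_uni
        table[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return table


def merge_tokens(tokens, table, threshold):
    """Greedily merge adjacent tokens whose MI is at least `threshold`."""
    merged = []
    i = 0
    while i < len(tokens):
        pair_mi = float("-inf")
        if i + 1 < len(tokens):
            # Pairs never seen in training get -inf and are never merged.
            pair_mi = table.get((tokens[i], tokens[i + 1]), float("-inf"))
        if pair_mi >= threshold:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


# Toy, hand-made training data split into single characters (hypothetical).
corpus = [
    ["恐", "怖", "袭", "击"],
    ["恐", "怖", "组", "织"],
    ["的", "组", "织"],
    ["的", "袭", "击"],
    ["的", "人"],
]
table = pmi_table(corpus)
print(merge_tokens(["恐", "怖", "袭", "击"], table, threshold=3.0))
```

In this sketch a merged pair could then be added to the dictionary as a candidate new word, which is one plausible reading of the "dictionary update" step; the abstract does not say how IASeg performs that update.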