GeoSegmenter: A statistically learned Chinese word segmenter for the geoscience domain

Authors: Lan Huang, Youfu Du, Gongyang Chen

DOI: 10.1016/J.CAGEO.2014.11.005

Abstract: Unlike English, the Chinese language has no space between words. Segmenting texts into words, known as the Chinese word segmentation (CWS) problem, thus becomes a fundamental issue for processing Chinese documents and the first step in many text mining applications, including information retrieval, machine translation and knowledge acquisition. However, for the geoscience subject domain, the CWS problem remains unsolved. Although generic segmenters can be applied to process geoscience documents, they lack domain-specific knowledge, and consequently their segmentation accuracy drops dramatically. This motivated us to develop a segmenter specifically for the geoscience subject domain: GeoSegmenter. We first proposed a generic two-step framework for domain-specific CWS. Following this framework, we built GeoSegmenter using conditional random fields, a principled statistical framework for sequence learning. Specifically, GeoSegmenter first identifies general terms by using a generic baseline segmenter. Then it recognises geoscience terms by learning and applying a model that can transform the initial segmentation into the goal segmentation. Empirical experimental results on geoscience documents and benchmark datasets showed that GeoSegmenter could effectively recognise both geoscience terms and general terms.
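The two-step idea in the abstract can be illustrated with a minimal sketch of how training data for the second step might be prepared. Everything below is an assumption for illustration and is not taken from the paper: the BMES character-tagging scheme, the choice of using the baseline tag as the only extra feature, the function names, and the example sentence are all hypothetical. Training an actual conditional random field on these pairs would require a CRF library and is not shown.

```python
from typing import List, Tuple


def to_bmes(words: List[str]) -> List[str]:
    """Map a segmentation to per-character tags: B(egin), M(iddle), E(nd), S(ingle)."""
    tags: List[str] = []
    for w in words:
        if not w:
            continue  # ignore empty tokens
        if len(w) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(w) - 2) + ["E"])
    return tags


def from_bmes(chars: str, tags: List[str]) -> List[str]:
    """Rebuild a word list from characters and their BMES tags."""
    words: List[str] = []
    current = ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):
            words.append(current)
            current = ""
    if current:  # flush an unterminated trailing word
        words.append(current)
    return words


def training_pairs(
    initial: List[str], goal: List[str]
) -> Tuple[List[Tuple[str, str]], List[str]]:
    """Pair (character, baseline tag) observations with goal tags.

    `initial` is the baseline segmenter's output and `goal` is the
    reference segmentation of the same sentence (assumed setup).
    """
    chars = "".join(initial)
    if chars != "".join(goal):
        raise ValueError("initial and goal must cover the same characters")
    observations = list(zip(chars, to_bmes(initial)))
    labels = to_bmes(goal)
    return observations, labels


# Hypothetical example: a generic segmenter splits a geoscience term
# that the goal segmentation keeps as one word.
initial = ["花岗", "岩", "分布", "广泛"]
goal = ["花岗岩", "分布", "广泛"]
observations, labels = training_pairs(initial, goal)
```

A sequence model trained on such pairs would learn, for instance, to relabel a baseline `E` followed by `S` as `M` followed by `E` when the characters form a domain term; `from_bmes` then converts the predicted tags back into words.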

References (22)
Thomas Emerson, The Second International Chinese Word Segmentation Bakeoff. Proceedings of the Fourth SIGHAN Workshop on Chinese Language Processing (2005)
Olena Medelyan, Steve Manion, Jeen Broekstra, Anna Divoli, Anna-Lan Huang, Ian H. Witten, Constructing a Focused Taxonomy from a Document Collection. Extended Semantic Web Conference, pp. 367-381 (2013), 10.1007/978-3-642-38288-8_25
Chang-Ning Huang, Jianfeng Gao, Mu Li, Ashley X. Chang, Chinese word segmentation (2004)
Xiaogang Ma, Emmanuel John M. Carranza, Chonglong Wu, Freek D. van der Meer, Gang Liu, A SKOS-based multilingual thesaurus of geological time scale for interoperability of online geological maps. Computers & Geosciences, vol. 37, pp. 1602-1615 (2011), 10.1016/J.CAGEO.2011.02.011
Daniel Zeng, Donghua Wei, Michael Chau, Feiyue Wang, Domain-specific Chinese word segmentation using suffix tree and mutual information. Information Systems Frontiers, vol. 13, pp. 115-125 (2011), 10.1007/S10796-010-9278-5
Hua-Ping Zhang, Hong-Kui Yu, De-Yi Xiong, Qun Liu, HHMM-based Chinese Lexical Analyzer ICTCLAS. Proceedings of the Second SIGHAN Workshop on Chinese Language Processing, pp. 184-187 (2003), 10.3115/1119250.1119280
Lan Huang, David Milne, Eibe Frank, Ian H. Witten, Learning a concept-based document similarity measure. Journal of the Association for Information Science and Technology, vol. 63, pp. 1593-1608 (2012), 10.1002/ASI.22689
Hassan A. Babaie, M. Broda Cindi, Jafar Hadizadeh, Anuj Kumar, SAFOD Brittle Microstructure and Mechanics Knowledge Base (BM2KB). Computers & Geosciences, vol. 56, pp. 83-91 (2013), 10.1016/J.CAGEO.2013.03.004
Fuchun Peng, Fangfang Feng, Andrew McCallum, Chinese segmentation and new word detection using conditional random fields. Proceedings of the 20th international conference on Computational Linguistics - COLING '04, pp. 562-568 (2004), 10.3115/1220355.1220436
Xiaogang Ma, Chonglong Wu, Emmanuel John M. Carranza, Ernst M. Schetselaar, Freek D. van der Meer, Gang Liu, Xinqing Wang, Xialin Zhang, Development of a controlled vocabulary for semantic interoperability of mineral exploration geodata for mining projects. Computers & Geosciences, vol. 36, pp. 1512-1522 (2010), 10.1016/J.CAGEO.2010.05.014