Authors: Lan Huang, Youfu Du, Gongyang Chen
DOI: 10.1016/J.CAGEO.2014.11.005
Keywords:
Abstract: Unlike English, the Chinese language has no space between words. Segmenting texts into words, known as the Chinese word segmentation (CWS) problem, thus becomes a fundamental issue for processing Chinese documents and the first step in many text mining applications, including information retrieval, machine translation and knowledge acquisition. However, for the geoscience subject domain, the CWS problem remains unsolved. Although generic segmenters can be applied to process geoscience documents, they lack domain-specific knowledge and consequently their segmentation accuracy drops dramatically. This motivated us to develop a segmenter specifically for the geoscience domain: GeoSegmenter. We proposed a two-step framework for domain-specific CWS. Following this framework, we built GeoSegmenter using conditional random fields, a principled statistical framework for sequence learning. Specifically, GeoSegmenter first identifies general terms by using a generic baseline segmenter. Then it recognises geoscience terms by learning and applying a model that can transform the initial segmentation into the goal segmentation. Empirical experimental results on geoscience documents and benchmark datasets showed that GeoSegmenter could effectively recognise both geoscience terms and general terms.
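The two-step idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the baseline step here is plain forward maximum matching, and the second step swaps the learned conditional random field model for a simple lexicon lookup that merges adjacent baseline words. The function names, the `max_len` and `max_span` parameters, and the toy lexicons are all hypothetical.

```python
from typing import List, Set


def baseline_segment(text: str, lexicon: Set[str], max_len: int = 4) -> List[str]:
    """Step 1 (assumed baseline): forward maximum matching on a general lexicon."""
    words: List[str] = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; a single character always matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


def merge_domain_terms(words: List[str], domain_terms: Set[str],
                       max_span: int = 4) -> List[str]:
    """Step 2 (stand-in for the learned model): merge adjacent baseline
    words whose concatenation is a known domain term."""
    merged: List[str] = []
    i = 0
    while i < len(words):
        for span in range(min(max_span, len(words) - i), 1, -1):
            candidate = "".join(words[i:i + span])
            if candidate in domain_terms:
                merged.append(candidate)
                i += span
                break
        else:
            # No multi-word span matched; keep the baseline word unchanged.
            merged.append(words[i])
            i += 1
    return merged


# Toy lexicons, invented for illustration only.
general_lexicon = {"分布", "广泛"}
geoscience_terms = {"花岗岩"}  # "granite"

initial = baseline_segment("花岗岩分布广泛", general_lexicon)
final = merge_domain_terms(initial, geoscience_terms)
```

With these toy inputs the baseline splits the unknown term into single characters, and the second pass rejoins them into one geoscience term. A learned sequence model would make that decision from features of the initial segmentation rather than from a fixed term list.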