Authors: Andrew McCallum, Wei Li
DOI:
Keywords:
Abstract: Although there has been significant previous work on semi-supervised learning for classification, there has been relatively little in sequence modeling. This paper presents an approach that leverages recent work on manifold learning for sequences to discover word clusters from language data, including both syntactic classes and semantic topics. From unlabeled data we form a smooth, low-dimensional feature space, where each token is projected based on its underlying role as a function or content word. We then use this projection as additional input features to a linear-chain conditional random field trained on limited labeled training data. On standard part-of-speech tagging and Chinese segmentation data sets, we show as much as 14% error reduction due to the unlabeled data, and also statistically significant improvements over a related method due to Miller et al.