Authors: Tom M. Mitchell, Kamal Paul Nigam
DOI:
Keywords:
Abstract: One key difficulty with text classification learning algorithms is that they require many hand-labeled examples to learn accurately. This dissertation demonstrates that supervised learning algorithms that use a small number of labeled examples and many inexpensive unlabeled examples can create high-accuracy text classifiers. By assuming that documents are created by a parametric generative model, Expectation-Maximization (EM) finds local maximum a posteriori models and classifiers from all the data, labeled and unlabeled. These generative models do not capture the intricacies of text; however, on some domains this technique substantially improves classification accuracy, especially when labeled data are sparse.

Two problems arise from this basic approach. First, unlabeled data can hurt performance in domains where the generative modeling assumptions are too strongly violated. In this case the assumptions can be made more representative in two ways: by modeling sub-topic class structure, and by modeling super-topic hierarchical class relationships. By doing so, model probability and classification accuracy come into correspondence, allowing unlabeled data to improve classification performance. The second problem is that even with a representative model, the improvements given by unlabeled data may not sufficiently compensate for a paucity of labeled data. Here, the limited labeled data provide EM initializations that lead to low-probability models. Performance can be significantly improved by using active learning to select high-quality initializations, and by using alternatives to EM that avoid poor local maxima.
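The core procedure the abstract describes (a naive Bayes generative model, labeled documents kept at their hard labels, unlabeled documents re-weighted by soft class responsibilities each round) can be illustrated compactly. The following is a minimal sketch under stated assumptions, not the dissertation's implementation: the function name, the Laplace smoothing parameter alpha, the iteration count, and the count-matrix data layout are all illustrative choices.

```python
# Minimal sketch of semi-supervised EM with a multinomial naive Bayes
# generative model. All names and parameters here are illustrative.
import numpy as np

def em_naive_bayes(X_l, y_l, X_u, n_classes, n_iters=20, alpha=1.0):
    """Fit naive Bayes from labeled (X_l, y_l) and unlabeled X_u.

    X_l, X_u: document-term count matrices (n_docs x n_words).
    y_l: integer class labels for the labeled documents.
    Returns class log-priors and per-class word log-probabilities.
    """
    n_words = X_l.shape[1]
    # Labeled documents keep fixed hard labels; this labeled-only start
    # is the "EM initialization" the abstract refers to.
    resp_l = np.eye(n_classes)[y_l]
    resp_u = np.full((X_u.shape[0], n_classes), 1.0 / n_classes)
    X = np.vstack([X_l, X_u])

    for _ in range(n_iters):
        resp = np.vstack([resp_l, resp_u])
        # M-step: MAP estimates with Dirichlet (Laplace) smoothing.
        class_counts = resp.sum(axis=0)
        log_prior = np.log((class_counts + alpha) /
                           (class_counts.sum() + alpha * n_classes))
        word_counts = resp.T @ X  # expected word counts per class
        log_theta = np.log(
            (word_counts + alpha) /
            (word_counts.sum(axis=1, keepdims=True) + alpha * n_words))
        # E-step: recompute class responsibilities for the unlabeled
        # documents only, via the naive Bayes log-posterior.
        log_post = X_u @ log_theta.T + log_prior
        log_post -= log_post.max(axis=1, keepdims=True)  # stabilize exp
        resp_u = np.exp(log_post)
        resp_u /= resp_u.sum(axis=1, keepdims=True)
    return log_prior, log_theta
```

A resulting classifier labels a new count vector x by argmax over `x @ log_theta.T + log_prior`; the sketch omits the sub-topic and hierarchical model extensions and the active-learning initialization strategies that the dissertation develops.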