The Power of Word Clusters for Text Classification

Authors: Naftali Tishby, Noam Slonim

DOI:

Keywords:

Abstract: The recently introduced Information Bottleneck method [21] provides an information-theoretic framework for extracting features of one variable that are relevant for the values of another variable. Several previous works have already suggested applying this method to document clustering, gene expression data analysis, spectral analysis and more. In this work we present a novel implementation of this method for supervised text classification. Specifically, we apply the information bottleneck method to find word clusters that preserve the information about document categories, and use these clusters as features for classification. Previous work [1] used a similar clustering procedure to show that word clusters can significantly reduce the feature space dimensionality, with only a minor change in classification accuracy. In this work we reproduce these results and go further to show that when the training sample is small, word clusters can yield a significant improvement in classification accuracy (up to 18%) over the performance using the words directly.
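
The idea of word clusters that preserve information about categories can be illustrated with a small sketch. The abstract does not specify the clustering procedure, so the code below is only an assumption-laden illustration: a greedy agglomerative scheme that repeatedly merges the two word clusters whose merger loses the least mutual information I(T;Y) between clusters T and categories Y. The function names and the toy joint distribution are hypothetical and are not taken from the paper.

```python
import math

def mutual_information(joint):
    """I(T;Y) in bits for a joint table joint[t][y] whose entries sum to 1."""
    pt = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    mi = 0.0
    for t, row in enumerate(joint):
        for y, p in enumerate(row):
            if p > 0:
                mi += p * math.log2(p / (pt[t] * py[y]))
    return mi

def merge_rows(joint, i, j):
    """Return a new table in which rows i and j are summed into one cluster row."""
    merged = [a + b for a, b in zip(joint[i], joint[j])]
    return [row for k, row in enumerate(joint) if k not in (i, j)] + [merged]

def greedy_word_clusters(joint, n_clusters):
    """Greedily merge word rows until n_clusters remain.

    joint[w][y] is an (assumed) empirical joint probability of word w and
    category y. Each step picks the pair whose merger loses the least I(T;Y).
    """
    clusters = [[w] for w in range(len(joint))]
    joint = [list(row) for row in joint]
    while len(joint) > n_clusters:
        base = mutual_information(joint)
        best = None
        for i in range(len(joint)):
            for j in range(i + 1, len(joint)):
                loss = base - mutual_information(merge_rows(joint, i, j))
                if best is None or loss < best[0]:
                    best = (loss, i, j)
        _, i, j = best
        joint = merge_rows(joint, i, j)
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters, joint

# Toy data (hypothetical): words 0 and 1 favor category 0, words 2 and 3 favor category 1.
toy_joint = [
    [0.20, 0.05],
    [0.20, 0.05],
    [0.05, 0.20],
    [0.05, 0.20],
]
toy_clusters, toy_clustered_joint = greedy_word_clusters(toy_joint, 2)
```

In this toy case the words within each pair have identical category distributions, so merging them reduces four features to two without losing information about the categories. The resulting cluster rows could then serve as features for a classifier in place of the individual words.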

References (19)
S. T. Dumais. Using LSI for information filtering: TREC-3 experiments. Text Retrieval Conference, 1995.
Ken Lang. NewsWeeder: Learning to Filter Netnews. Machine Learning Proceedings 1995, pp. 331-339, 1995. doi:10.1016/B978-1-55860-377-6.50048-7
David D. Lewis, Jason Catlett. Heterogeneous Uncertainty Sampling for Supervised Learning. Machine Learning Proceedings 1994, pp. 148-156, 1994. doi:10.1016/B978-1-55860-335-6.50026-X
Robert E. Schapire, Yoram Singer. BoosTexter: A Boosting-based System for Text Categorization. Machine Learning, vol. 39, pp. 135-168, 2000. doi:10.1023/A:1007649029923
David D. Lewis, Robert E. Schapire, James P. Callan, Ron Papka. Training algorithms for linear text classifiers. International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 298-306, 1996. doi:10.1145/243199.243277
Noam Slonim, Naftali Tishby. Document clustering using word clusters via the information bottleneck method. International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 208-215, 2000. doi:10.1145/345508.345578
L. Douglas Baker, Andrew Kachites McCallum. Distributional clustering of words for text classification. International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 96-103, 1998. doi:10.1145/290941.290970
N. Slonim, R. Somerville, N. Tishby, O. Lahav. Objective classification of galaxy spectra using the information bottleneck method. Monthly Notices of the Royal Astronomical Society, vol. 323, pp. 270-284, 2001. doi:10.1046/J.1365-8711.2001.04125.X
Thorsten Joachims. A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization. International Conference on Machine Learning, pp. 143-151, 1997.
Thomas M. Cover, Joy A. Thomas. Elements of Information Theory. 1991.