Improving text classification accuracy using topic modeling over an additional corpus

Author: Somnath Banerjee

DOI: 10.1145/1390334.1390546

Keywords: Information retrieval, Task (project management), Set (abstract data type), Topic model, Semi-supervised learning, Baseline (configuration management), Computer science

Abstract: The World Wide Web has many document repositories that can act as valuable sources of additional data for various machine learning tasks. In this paper, we propose a method for improving text classification accuracy by using such an additional corpus, which can easily be obtained from the web. This additional corpus can be unlabeled and independent of the given classification task. The method proposed here uses topic modeling to extract a set of topics from the additional corpus. Those extracted topics then act as additional features for the data of the given classification task. An evaluation on the RCV1 dataset shows significant improvement over a baseline method.
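The pipeline the abstract describes (learn topics on an unlabeled side corpus, then use topic proportions as extra features for the labeled task) can be sketched roughly as follows. This is a minimal illustration and not the paper's implementation: the abstract does not say which inference procedure was used, so the collapsed Gibbs sampler, the fold-in step, the function names `fit_lda`, `topic_features`, and `augment`, and the hyperparameter values `alpha` and `beta` are all assumptions made here for the example.

```python
import random


def fit_lda(docs, num_topics, iters=100, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA on tokenized docs with collapsed Gibbs sampling (illustrative choice).

    Returns (vocab, n_kw, n_k): word -> id map, topic-word counts, topic totals.
    """
    rng = random.Random(seed)
    vocab = {w: i for i, w in enumerate(sorted({w for d in docs for w in d}))}
    V = len(vocab)
    n_dk = [[0] * num_topics for _ in docs]        # document-topic counts
    n_kw = [[0] * V for _ in range(num_topics)]    # topic-word counts
    n_k = [0] * num_topics                         # tokens assigned to each topic
    z = []                                         # topic assignment of every token
    for d, doc in enumerate(docs):
        z.append([])
        for w in doc:
            k = rng.randrange(num_topics)          # random initial assignment
            z[d].append(k)
            n_dk[d][k] += 1
            n_kw[k][vocab[w]] += 1
            n_k[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v, k = vocab[w], z[d][i]
                # Remove the token, then resample its topic from the conditional.
                n_dk[d][k] -= 1
                n_kw[k][v] -= 1
                n_k[k] -= 1
                weights = [
                    (n_dk[d][t] + alpha) * (n_kw[t][v] + beta) / (n_k[t] + V * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights)[0]
                z[d][i] = k
                n_dk[d][k] += 1
                n_kw[k][v] += 1
                n_k[k] += 1
    return vocab, n_kw, n_k


def topic_features(doc, model, alpha=0.1, beta=0.01, iters=20):
    """Estimate topic proportions for a task document, keeping the topics fixed."""
    vocab, n_kw, n_k = model
    K, V = len(n_k), len(vocab)
    ids = [vocab[w] for w in doc if w in vocab]    # words unseen in the side corpus are skipped
    theta = [1.0 / K] * K
    for _ in range(iters):
        counts = [alpha] * K
        for v in ids:
            p = [theta[k] * (n_kw[k][v] + beta) / (n_k[k] + V * beta) for k in range(K)]
            s = sum(p)
            for k in range(K):
                counts[k] += p[k] / s
        total = sum(counts)
        theta = [c / total for c in counts]
    return theta


def augment(base_features, theta):
    """Append topic proportions to the task's existing feature vector."""
    return list(base_features) + list(theta)


# Toy unlabeled side corpus standing in for a web-derived collection.
side_corpus = [
    ["stock", "market", "shares"],
    ["goal", "match", "team"],
    ["market", "shares", "trade"],
    ["team", "goal", "league"],
]
model = fit_lda(side_corpus, num_topics=2)
theta = topic_features(["market", "trade", "rally"], model)
features = augment([1.0, 0.0, 2.0], theta)  # hypothetical bag-of-words values + topic features
```

The augmented vector would then be passed to whatever classifier the task already uses; the side corpus never needs labels, which matches the setting the abstract describes.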

References (5)
David M. Blei, Andrew Y. Ng, Michael I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, vol. 3, pp. 993-1022 (2003). 10.5555/944919.944937
T. L. Griffiths, M. Steyvers. Finding scientific topics. Proceedings of the National Academy of Sciences of the United States of America, vol. 101, pp. 5228-5235 (2004). 10.1073/PNAS.0307752101
Evgeniy Gabrilovich, Shaul Markovitch. Overcoming the brittleness bottleneck using Wikipedia: enhancing text categorization with encyclopedic knowledge. National Conference on Artificial Intelligence, pp. 1301-1306 (2006)
Fei-Fei Li, P. Perona. A Bayesian hierarchical model for learning natural scene categories. Computer Vision and Pattern Recognition, vol. 2, pp. 524-531 (2005). 10.1109/CVPR.2005.16
Yiming Yang, Fan Li, David D. Lewis, Tony G. Rose. RCV1: A New Benchmark Collection for Text Categorization Research. Journal of Machine Learning Research, vol. 5, pp. 361-397 (2004). 10.5555/1005332.1005345