Author: Somnath Banerjee
Keywords: Information retrieval, Task (project management), Set (abstract data type), Topic model, Semi-supervised learning, Baseline (configuration management), Computer science
Abstract: The World Wide Web has many document repositories that can act as valuable sources of additional data for various machine learning tasks. In this paper, we propose a method for improving text classification accuracy by using such an additional corpus, which can easily be obtained from the web. This corpus can be unlabeled and independent of the given task. The method proposed here uses topic modeling to extract a set of topics from the corpus. Those extracted topics are then used as additional features for the classification task. An evaluation on the RCV1 dataset shows significant improvement over a baseline method.
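
The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not name a specific topic model, so LDA with collapsed Gibbs sampling is assumed here. The function names (`fit_topics`, `topic_features`, `augment`), the hyperparameters, and the toy corpus are all hypothetical. Topics are learned from an unlabeled background corpus, and each task document's topic proportions are appended to its bag-of-words vector.

```python
import random


def fit_topics(corpus, num_topics, iters=100, alpha=0.1, beta=0.01, seed=0):
    """Learn topics from an unlabeled corpus (lists of tokens).

    Assumption: LDA fitted by collapsed Gibbs sampling.
    Returns (word -> id map, topic-word counts, per-topic token totals).
    """
    rng = random.Random(seed)
    vocab = {w: i for i, w in enumerate(sorted({w for d in corpus for w in d}))}
    V = len(vocab)
    topic_word = [[0] * V for _ in range(num_topics)]
    topic_total = [0] * num_topics
    doc_topic = [[0] * num_topics for _ in corpus]
    assign = []
    # Random initial topic assignment for every token.
    for d, doc in enumerate(corpus):
        zs = []
        for w in doc:
            z = rng.randrange(num_topics)
            zs.append(z)
            topic_word[z][vocab[w]] += 1
            topic_total[z] += 1
            doc_topic[d][z] += 1
        assign.append(zs)
    # Resample each token's topic conditioned on all other assignments.
    for _ in range(iters):
        for d, doc in enumerate(corpus):
            for i, w in enumerate(doc):
                v, z = vocab[w], assign[d][i]
                topic_word[z][v] -= 1
                topic_total[z] -= 1
                doc_topic[d][z] -= 1
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][v] + beta) / (topic_total[k] + V * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights)[0]
                assign[d][i] = z
                topic_word[z][v] += 1
                topic_total[z] += 1
                doc_topic[d][z] += 1
    return vocab, topic_word, topic_total


def topic_features(doc, model, iters=50, alpha=0.1, beta=0.01, seed=0):
    """Fold a task document into the fixed topics; return topic proportions."""
    vocab, topic_word, topic_total = model
    rng = random.Random(seed)
    K, V = len(topic_word), len(vocab)
    # Words never seen in the background corpus carry no topic evidence.
    words = [vocab[w] for w in doc if w in vocab]
    zs = [rng.randrange(K) for _ in words]
    counts = [0] * K
    for z in zs:
        counts[z] += 1
    for _ in range(iters):
        for i, v in enumerate(words):
            counts[zs[i]] -= 1
            weights = [
                (counts[k] + alpha)
                * (topic_word[k][v] + beta) / (topic_total[k] + V * beta)
                for k in range(K)
            ]
            zs[i] = rng.choices(range(K), weights)[0]
            counts[zs[i]] += 1
    total = len(words) + K * alpha
    return [(counts[k] + alpha) / total for k in range(K)]


def augment(doc, task_vocab, model):
    """Bag-of-words counts followed by the document's topic proportions."""
    bow = [doc.count(w) for w in task_vocab]
    return bow + topic_features(doc, model)


# Toy unlabeled background corpus (hypothetical data, for illustration only).
background = [
    ["stock", "market", "shares", "market"],
    ["shares", "stock", "trading"],
    ["match", "goal", "team", "goal"],
    ["team", "match", "league"],
]
model = fit_topics(background, num_topics=2)
task_vocab = ["market", "team"]
features = augment(["market", "shares", "rally"], task_vocab, model)
```

Here `features` holds two bag-of-words counts followed by two topic proportions. Any standard classifier could then be trained on such augmented vectors and compared against one trained on the bag-of-words part alone, which mirrors the baseline comparison mentioned in the abstract.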