An Evaluation of Bag-of-Concepts Representations in Automatic Text Classification

Author: Oscar Täckström

DOI:

Keywords: Curse of dimensionality, Pattern recognition, Decision stump, AdaBoost, Feature selection, Computer science, Information gain ratio, Artificial intelligence, Support vector machine, Representation (mathematics), Dimensionality reduction

Abstract: Automatic text classification is the process of automatically classifying documents into pre-defined document classes. Traditionally, documents are represented in the so-called bag-of-words model. In this model, documents are represented simply as vectors in which the dimensions correspond to words. In this project, a representation called bag-of-concepts has been evaluated. This representation is based on models for representing the meanings of words in a vector space. Documents are then represented as linear combinations of the words' meaning vectors. The resulting vectors are high-dimensional and very dense. We have investigated two different methods for reducing the dimensionality of the vectors: feature selection based on gain ratio, and random mapping. Two domains of text have been used: abstracts of medical articles in English, and texts from Internet newsgroups. The former has been of primary interest, while the latter has been used for comparison. The classification has been performed by use of three machine learning methods: Support Vector Machine, AdaBoost and Decision Stump. Results of the evaluation are difficult to interpret, but suggest that the new representation gives significantly better results on document classes for which the classical method fails. The representations seem to give equal results on document classes for which the classical method works fine. Both dimensionality reduction methods are robust. Random mapping, being much less computationally expensive, shows greater variance.
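The two central ideas of the abstract, documents as linear combinations of word meaning vectors and dimensionality reduction by random mapping, can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the thesis: the toy `word_vectors`, the use of raw term counts as combination weights, and the choice of a dense Gaussian projection matrix as the random mapping. The function names `bag_of_concepts` and `random_mapping` are hypothetical.

```python
import numpy as np

# Hypothetical 4-dimensional "meaning vectors" for a toy vocabulary.
# In practice these would come from a word-space model trained on a corpus.
word_vectors = {
    "heart":   np.array([1.0, 0.0, 0.0, 0.5]),
    "attack":  np.array([0.0, 1.0, 0.0, 0.5]),
    "cardiac": np.array([0.9, 0.1, 0.0, 0.4]),
}

def bag_of_concepts(tokens, word_vectors, dim):
    """Represent a document as a linear combination of word meaning vectors.

    Assumption: each occurrence of a word contributes weight 1, so the
    weights are raw term counts. Unknown words are skipped.
    """
    doc_vec = np.zeros(dim)
    for tok in tokens:
        if tok in word_vectors:
            doc_vec += word_vectors[tok]
    return doc_vec

def random_mapping(X, k, seed=0):
    """Project the rows of X from d to k dimensions with a random matrix.

    Assumption: a dense Gaussian matrix scaled by 1/sqrt(k); other
    constructions (e.g. sparse random matrices) are equally possible.
    """
    rng = np.random.default_rng(seed)
    R = rng.standard_normal((X.shape[1], k)) / np.sqrt(k)
    return X @ R

docs = [["heart", "attack", "heart"], ["cardiac", "attack"]]
X = np.vstack([bag_of_concepts(d, word_vectors, 4) for d in docs])
X_reduced = random_mapping(X, k=2)
```

The reduced matrix `X_reduced` could then be passed to any standard classifier. Feature selection by gain ratio, the other reduction method named in the abstract, is not shown here.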



