An Evaluation of Bag-of-Concepts Representations in Automatic Text Classification

Author: Oscar Täckström

Keywords: Curse of dimensionality, Pattern recognition, Decision stump, AdaBoost, Feature selection, Computer science, Information gain ratio, Artificial intelligence, Support vector machine, Representation (mathematics), Dimensionality reduction

Abstract: Automatic text classification is the process of automatically classifying text documents into pre-defined document classes. Traditionally, documents are represented in the so-called bag-of-words model. In this model, documents are simply represented as vectors in which dimensions correspond to words. In this project, a representation called bag-of-concepts has been evaluated. This representation is based on models for representing the meanings of words in a vector space. Documents are then represented as linear combinations of the words' meaning vectors. The resulting vectors are high-dimensional and very dense. We have investigated two different methods for reducing the dimensionality of the document vectors: feature selection based on gain ratio, and random mapping. Two domains of text have been used: abstracts of medical articles in English, and texts from Internet newsgroups. The former has been of primary interest, while the latter has been used for comparison. The classification has been performed by use of three different machine learning methods: Support Vector Machine, AdaBoost and Decision Stump. Results of the evaluation are difficult to interpret, but suggest that the new representation gives significantly better results on document classes for which the classical method fails. The representations seem to give equal results on document classes for which the classical method works fine. Both dimensionality reduction methods are robust. Random mapping, being much less computationally expensive, shows greater variance.
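
As a rough illustration of the representation described in the abstract, the sketch below builds bag-of-concepts document vectors as count-weighted sums of word meaning vectors and then reduces their dimensionality with a random mapping. The toy counts, the 8-dimensional random word vectors, the Gaussian projection matrix, and the helper name `random_mapping` are all assumptions made for illustration; they are not taken from the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy bag-of-words counts: 3 documents over a 5-word vocabulary (made-up data).
bow = np.array([
    [2, 0, 1, 0, 0],
    [0, 1, 0, 3, 0],
    [1, 1, 0, 0, 2],
], dtype=float)

# Hypothetical word meaning vectors: one dense 8-dimensional row per word.
# In practice these would come from a word-space model, not random numbers.
word_vectors = rng.normal(size=(5, 8))

# Bag-of-concepts: each document is a linear combination (here, the
# count-weighted sum) of the meaning vectors of the words it contains.
boc = bow @ word_vectors  # shape (3, 8)


def random_mapping(x, target_dim, rng):
    """Project the rows of x to target_dim dimensions with a random Gaussian matrix."""
    projection = rng.normal(size=(x.shape[1], target_dim)) / np.sqrt(target_dim)
    return x @ projection


# Assumed target dimensionality of 4, chosen only to keep the example small.
reduced = random_mapping(boc, 4, rng)  # shape (3, 4)
```

Random mapping needs only one matrix multiplication and no class labels, which is consistent with the abstract's remark that it is much less computationally expensive than feature selection; its dependence on a random draw is also one plausible source of the greater variance it reports.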
