Author: Oscar Täckström
DOI:
Keywords: Curse of dimensionality, Pattern recognition, Decision stump, AdaBoost, Feature selection, Computer science, Information gain ratio, Artificial intelligence, Support vector machine, Representation (mathematics), Dimensionality reduction
Abstract: Automatic text classification is the process of automatically classifying documents into pre-defined document classes. Traditionally, documents are represented in the so-called bag-of-words model. In this model, documents are represented simply as vectors, in which dimensions correspond to words. In this project, a representation called bag-of-concepts has been evaluated. This representation is based on models for representing the meanings of words in a vector space. Documents are then represented as linear combinations of the words' meaning vectors. The resulting vectors are high-dimensional and very dense. We have investigated two different methods for reducing the dimensionality of the document vectors: feature selection based on gain ratio, and random mapping. Two domains of text have been used: abstracts of medical articles in English and texts from Internet newsgroups. The former has been of primary interest, while the latter has been used for comparison. The classification has been performed by use of three machine learning methods: Support Vector Machine, AdaBoost and Decision Stump. Results of the evaluation are difficult to interpret, but suggest that the new representation gives significantly better results on document classes for which the classical method fails. The representations seem to give equal results on document classes for which the classical method works fine. Both dimensionality reduction methods are robust. Random mapping, being much less computationally expensive, shows greater variance.
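The two central ideas of the abstract can be illustrated with a minimal sketch: a document vector built as a linear combination of word meaning vectors, followed by random mapping to a lower dimension. This is not the author's implementation. The toy meaning vectors, the function names (`bag_of_concepts`, `random_projection_matrix`, `project`), the use of raw term counts as weights, and the Gaussian projection matrix are all assumptions made for illustration.

```python
import random


def bag_of_concepts(doc_tokens, word_vectors, dim):
    """Sum the meaning vectors of a document's words.

    Assumption: each occurrence contributes weight 1 (raw term counts);
    the abstract does not specify a weighting scheme.
    """
    doc_vec = [0.0] * dim
    for tok in doc_tokens:
        vec = word_vectors.get(tok)
        if vec is None:
            continue  # words without a meaning vector are skipped
        for i, v in enumerate(vec):
            doc_vec[i] += v
    return doc_vec


def random_projection_matrix(in_dim, out_dim, seed=0):
    """Build an out_dim x in_dim matrix of scaled Gaussian entries.

    Assumption: a Gaussian matrix is one common choice for random
    mapping; other distributions are possible.
    """
    rng = random.Random(seed)
    scale = 1.0 / out_dim ** 0.5
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(in_dim)]
            for _ in range(out_dim)]


def project(vec, matrix):
    """Map a vector to the lower-dimensional space (matrix-vector product)."""
    return [sum(r * v for r, v in zip(row, vec)) for row in matrix]


# Hypothetical 4-dimensional meaning vectors for three words.
word_vectors = {
    "heart": [1.0, 0.0, 0.5, 0.0],
    "cardiac": [0.9, 0.1, 0.4, 0.0],
    "kernel": [0.0, 1.0, 0.0, 0.7],
}

doc = ["heart", "cardiac", "heart", "unknown"]
doc_vec = bag_of_concepts(doc, word_vectors, dim=4)

# Reduce the 4-dimensional document vector to 2 dimensions.
R = random_projection_matrix(in_dim=4, out_dim=2, seed=42)
reduced = project(doc_vec, R)
```

Because the projection is linear, projecting the document vector gives the same result as projecting each word vector first and summing afterwards, so the reduction can be applied at either stage. Real meaning vectors would have thousands of dimensions rather than four.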