Vector Space Representations of Documents in Classifying Finnish Social Media Texts

Authors: Viljami Venekoski, Samir Puuska, Jouko Vankka

DOI: 10.1007/978-3-319-46254-7_42

Keywords: Word2vec, Social media, Term (logic), Vector space, Artificial intelligence, Lemmatisation, Mathematics, tf–idf, Natural language processing, Word embedding, Context (language use)

Abstract: Computational analysis of linguistic data requires that texts are transformed into numeric representations. The aim of this research is to evaluate different methods for building vector representations of text documents from social media. The methods are compared with respect to their performance in a classification task. Namely, the traditional count-based term frequency-inverse document frequency (TFIDF) representation is compared to semantic distributed word embedding representations. Unlike previous research, we investigate document representations in the context of morphologically rich Finnish. Based on the results, we suggest a framework for building vector space representations of texts in social media, applicable to language technologies for morphologically rich languages. In the current study, lemmatization of tokens increased classification accuracy, while lexical filtering generally hindered performance. Finally, we report that distributed embeddings and TFIDF perform at comparable levels with our data.
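
The two families of document representation named in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the toy documents, the 2-dimensional embeddings, and the helper names `tfidf_vectors` and `mean_embedding` are all hypothetical, the tokens are assumed to be lemmatized already, and averaging word vectors is only one common way to build a document vector from word embeddings.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, TF-IDF vectors) for a list of tokenized documents."""
    vocab = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(tok for doc in docs for tok in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        # Relative term frequency times log inverse document frequency.
        vec = [
            (counts[t] / len(doc)) * math.log(n_docs / df[t]) if doc else 0.0
            for t in vocab
        ]
        vectors.append(vec)
    return vocab, vectors


def mean_embedding(doc, embeddings, dim):
    """Average the embedding vectors of the tokens that have one."""
    known = [embeddings[t] for t in doc if t in embeddings]
    if not known:
        return [0.0] * dim
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


# Toy, already-lemmatized documents (hypothetical data).
docs = [["kissa", "istua", "matto"], ["koira", "istua", "piha"]]
# Hypothetical 2-dimensional embeddings; real ones would come from a trained model.
toy_embeddings = {"kissa": [1.0, 0.0], "koira": [0.0, 1.0], "istua": [0.5, 0.5]}

vocab, tfidf = tfidf_vectors(docs)
dense = [mean_embedding(d, toy_embeddings, dim=2) for d in docs]
```

Either set of vectors (`tfidf` or `dense`) could then be passed to any standard classifier. The sparse TF-IDF vectors grow with the vocabulary, while the averaged embeddings keep a fixed, small dimensionality.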
