Authors: Viljami Venekoski, Samir Puuska, Jouko Vankka
DOI: 10.1007/978-3-319-46254-7_42
Keywords: Word2vec, Social media, Term (logic), Vector space, Artificial intelligence, Lemmatisation, Mathematics, tf–idf, Natural language processing, Word embedding, Context (language use)
Abstract: Computational analysis of linguistic data requires that texts are transformed into numeric representations. The aim of this research is to evaluate different methods for building vector representations of text documents from social media. The methods are compared with respect to their performance in a classification task. Namely, traditional count-based term frequency-inverse document frequency (TFIDF) is compared to semantic distributed word embedding representations. Unlike previous research, we investigate document representations in the context of morphologically rich Finnish. Based on the results, we suggest a framework for building vector space representations of texts in social media, applicable to language technologies for morphologically rich languages. In the current study, lemmatization of tokens increased classification accuracy, while lexical filtering generally hindered performance. Finally, we report that distributed embeddings and TFIDF perform at comparable levels with our data.
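
The count-based TFIDF representation named in the abstract can be illustrated with a minimal sketch. It assumes the textbook unsmoothed weighting tf(t, d) · log(N / df(t)), which is not necessarily the variant used in the study. The function name `tfidf_vectors` and the toy Finnish tokens are hypothetical, and the tokens are assumed to be lemmatised already.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, vectors) for a list of tokenised documents.

    Assumption: unsmoothed textbook TF-IDF, with relative term frequency
    and idf = log(N / df). Real toolkits often use smoothed variants.
    """
    vocab = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(tok for doc in docs for tok in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        length = len(doc) or 1  # guard against empty documents
        vectors.append(
            [(counts[t] / length) * math.log(n_docs / df[t]) for t in vocab]
        )
    return vocab, vectors


# Hypothetical toy corpus of already-lemmatised tokens.
docs = [
    ["kissa", "istuu", "matolla"],
    ["koira", "istuu", "pihalla"],
    ["kissa", "ja", "koira"],
]
vocab, vectors = tfidf_vectors(docs)
```

Each document becomes a vector with one dimension per vocabulary term. A term that occurs in fewer documents (here "matolla") receives a higher weight than one shared across documents (here "kissa").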