Text Augmentation Techniques for Document Vector Generation from Russian News Articles

Authors: Christoffer Aminoff, Aleksei Romanenko, Onni Kosomaa, Jouko Vankka

DOI: 10.1007/978-3-319-99972-2_47

Abstract: In this paper, a document classification system is enhanced through the construction of a text augmentation technique, by testing various Part-of-Speech filters and word vector weighting methods with nine different models for document representation. Subject/object tagging is introduced as a new form of augmentation, along with a novel approach grounded in a method based on the distribution of words among the classes of documents. When an augmentation including subject/object tagging, a nouns+adjectives filter, and Inverse Document Frequency weighting was applied, an average increase in accuracy of 4.1 percentage points was observed.
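
As a rough illustration of two ingredients named in the abstract, the sketch below applies a nouns+adjectives Part-of-Speech filter and then averages word vectors weighted by Inverse Document Frequency to obtain one vector per document. Everything in it is an assumption made for illustration and is not taken from the paper: the toy documents (in English for readability), the Universal POS tag names, the two-dimensional word vectors, the helper names, and the IDF variant `log(N / df) + 1`. Subject/object tagging and the class-distribution weighting are not shown.

```python
import math
from collections import Counter

# Hypothetical toy corpus: each document is a list of (token, POS tag) pairs.
docs = [
    [("president", "NOUN"), ("signed", "VERB"), ("new", "ADJ"), ("law", "NOUN")],
    [("team", "NOUN"), ("won", "VERB"), ("final", "ADJ"), ("match", "NOUN")],
    [("new", "ADJ"), ("team", "NOUN"), ("signed", "VERB"), ("player", "NOUN")],
]

# Hypothetical 2-dimensional word vectors (real embeddings would be far larger).
word_vectors = {
    "president": [0.9, 0.1], "law": [0.8, 0.2], "new": [0.5, 0.5],
    "team": [0.1, 0.9], "final": [0.2, 0.7], "match": [0.1, 0.8],
    "player": [0.2, 0.9],
}

KEEP_TAGS = {"NOUN", "ADJ"}  # the nouns+adjectives filter


def pos_filter(doc, keep=KEEP_TAGS):
    """Keep only tokens whose POS tag is in `keep`."""
    return [tok for tok, tag in doc if tag in keep]


def idf_table(token_docs):
    """IDF per token, using log(N / df) + 1 (one common variant, assumed here)."""
    n_docs = len(token_docs)
    doc_freq = Counter(tok for doc in token_docs for tok in set(doc))
    return {tok: math.log(n_docs / df) + 1.0 for tok, df in doc_freq.items()}


def doc_vector(tokens, idf, vectors):
    """IDF-weighted average of the word vectors of `tokens`."""
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for tok in tokens:
        if tok not in vectors or tok not in idf:
            continue  # skip out-of-vocabulary tokens
        w = idf[tok]
        for i, x in enumerate(vectors[tok]):
            total[i] += w * x
        weight_sum += w
    if weight_sum == 0.0:
        return total  # no usable tokens: return the zero vector
    return [x / weight_sum for x in total]


filtered = [pos_filter(doc) for doc in docs]
idf = idf_table(filtered)
doc_vectors = [doc_vector(tokens, idf, word_vectors) for tokens in filtered]
```

The resulting `doc_vectors` could then be passed to any standard classifier. The `+ 1` in the IDF keeps a token that occurs in every document from receiving zero weight, which is a design choice of this sketch rather than something stated in the abstract.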
