Word Embeddings: Stability and Semantic Change

Author: Lucas Rettenmeier

Abstract: Word embeddings are computed by a class of techniques within natural language processing (NLP) that create continuous vector representations of words from a large text corpus. The stochastic nature of the training process of most embedding techniques can lead to surprisingly strong instability: applying the same technique to the same data twice can produce entirely different results. In this work, we present an experimental study on the instability of three influential embedding techniques of the last decade: word2vec, GloVe and fastText. Based on the results, we propose a statistical model to describe this instability and introduce a novel metric to measure the stability of the representation of an individual word. Finally, we present a method to minimize the instability - computing a modified average over multiple runs - and apply it to a specific linguistic problem: the detection and quantification of semantic change, i.e. measuring changes in word meaning and usage over time.
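To make the two ideas in the abstract concrete, the following is a minimal sketch (not the thesis's actual metric or averaging method, whose exact definitions are not given here): instability is illustrated as the mean Jaccard overlap of each word's nearest-neighbor sets across two training runs, and the "average over multiple runs" is illustrated with a plain orthogonal Procrustes alignment before averaging. The toy vectors, vocabulary, and noise level are all invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["king", "queen", "apple", "banana", "car"]

# Two toy "runs": a shared embedding space plus run-specific noise,
# standing in for two independent trainings of e.g. word2vec.
base = rng.normal(size=(len(vocab), 8))
run_a = base + 0.01 * rng.normal(size=base.shape)
run_b = base + 0.01 * rng.normal(size=base.shape)

def top_k_neighbors(emb, i, k):
    """Indices of the k nearest neighbors of word i by cosine similarity."""
    normed = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sims = normed @ normed[i]
    sims[i] = -np.inf  # exclude the word itself
    return set(np.argsort(sims)[-k:])

def stability(emb_a, emb_b, k=2):
    """Mean Jaccard overlap of each word's k-NN sets across two runs.

    1.0 means every word keeps exactly the same neighbors; values near 0
    indicate strong instability between the two runs.
    """
    overlaps = []
    for i in range(emb_a.shape[0]):
        na = top_k_neighbors(emb_a, i, k)
        nb = top_k_neighbors(emb_b, i, k)
        overlaps.append(len(na & nb) / len(na | nb))
    return float(np.mean(overlaps))

def procrustes_align(src, tgt):
    """Rotate src onto tgt with the orthogonal Procrustes solution."""
    u, _, vt = np.linalg.svd(src.T @ tgt)
    return src @ (u @ vt)

# Averaging runs only makes sense after aligning their coordinate systems,
# since each run lives in an arbitrary rotation of the embedding space.
averaged = 0.5 * (run_a + procrustes_align(run_b, run_a))

print("stability(a, b):", round(stability(run_a, run_b), 3))
print("stability(a, avg):", round(stability(run_a, averaged), 3))
```

With the small noise level used here both scores come out close to 1; with independently trained real embeddings, the neighbor overlap is typically well below 1, which is exactly the instability the thesis studies.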
