k-NN Embedding Stability for word2vec Hyper-Parametrisation in Scientific Text

Authors: Amna Dridi, Mohamed Medhat Gaber, R. Muhammad Atif Azad, Jagdev Bhogal

DOI: 10.1007/978-3-030-01771-2_21

Keywords:

Abstract: Word embeddings are increasingly attracting the attention of researchers dealing with semantic similarity and analogy tasks. However, finding the optimal hyper-parameters remains an important challenge due to their impact on the revealed analogies, mainly for domain-specific corpora. While analogies are widely used for hypothesis synthesis, it is crucial to optimise word embedding hyper-parameters for precise hypothesis synthesis. Therefore, we propose in this paper a methodological approach to hyper-parameter tuning that uses the stability of the k-nearest neighbors of word vectors within scientific corpora, more specifically Computer Science corpora, with machine learning adopted as a case study. The approach is tested on a dataset created from NIPS (Conference on Neural Information Processing Systems) publications, and evaluated against a curated ACM hierarchy and the Wikipedia Machine Learning outline as a gold standard. Our quantitative and qualitative analyses indicate that our approach not only reliably captures interesting patterns like "unsupervised_learning is to kmeans as supervised_learning is to knn", but also preserves the analogical structure and consistently outperforms the state-of-the-art on syntactic accuracy, \(68\%\) compared to \(61\%\).
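The abstract's core idea is to score a hyper-parameter setting by how stable each word's k-nearest-neighbor set is across embeddings. The paper does not give its exact stability formula, so the following is only a minimal sketch of that idea, assuming two embedding matrices over the same vocabulary and mean Jaccard overlap of k-NN sets as the stability measure:

```python
import numpy as np

def knn_indices(emb, k):
    """Indices of the k nearest neighbors (cosine similarity) for each row."""
    normed = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sims = normed @ normed.T
    np.fill_diagonal(sims, -np.inf)  # exclude each word from its own neighbors
    return np.argsort(-sims, axis=1)[:, :k]

def knn_stability(emb_a, emb_b, k=5):
    """Mean Jaccard overlap of each word's k-NN sets across two embeddings.

    Returns a value in [0, 1]; 1.0 means every word keeps exactly the
    same k neighbors under both hyper-parameter settings.
    """
    nn_a, nn_b = knn_indices(emb_a, k), knn_indices(emb_b, k)
    overlaps = []
    for a, b in zip(nn_a, nn_b):
        sa, sb = set(a.tolist()), set(b.tolist())
        overlaps.append(len(sa & sb) / len(sa | sb))
    return float(np.mean(overlaps))

# Toy check: an embedding compared with itself is perfectly stable.
rng = np.random.default_rng(0)
emb = rng.normal(size=(50, 16))
print(knn_stability(emb, emb, k=5))  # → 1.0
```

In a tuning loop, one would train word2vec under several hyper-parameter settings (e.g. with different window sizes or dimensionalities) and prefer settings whose neighborhoods remain stable across runs; the matrices and the Jaccard measure above are illustrative assumptions, not the authors' exact procedure.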

References (24)
Holger Hoos, Kevin Leyton-Brown, Frank Hutter, "An Efficient Approach for Assessing Hyperparameter Importance", International Conference on Machine Learning, pp. 754-762 (2014)
Kilian Weinberger, Matt Kusner, Nicholas Kolkin, Yu Sun, "From Word Embeddings To Document Distances", International Conference on Machine Learning, pp. 957-966 (2015)
Hinrich Schütze, Christopher D. Manning, Prabhakar Raghavan, "Introduction to Information Retrieval" (2005)
Larry Wasserman, Aarti Singh, Alessandro Rinaldo, Rebecca Nugent, "Stability of density-based clustering", Journal of Machine Learning Research, vol. 13, pp. 905-948 (2012)
Tomas Mikolov, Greg S. Corrado, Kai Chen, Jeffrey Dean, "Efficient Estimation of Word Representations in Vector Space", International Conference on Learning Representations (2013)
Matthias Samwald, Oscar Marín-Alonso, José Antonio Miñarro-Giménez, "Applying deep learning techniques on medical corpora from the World Wide Web: a prototypical system and evaluation", arXiv: Computation and Language (2015)
Kevin Lund, Curt Burgess, "Producing high-dimensional semantic spaces from lexical co-occurrence", Behavior Research Methods, Instruments, & Computers, vol. 28, pp. 203-208 (1996), 10.3758/BF03204766
Yoshua Bengio, Réjean Ducharme, Pascal Vincent, Christian Jauvin, "A neural probabilistic language model", Journal of Machine Learning Research, vol. 3, pp. 1137-1155 (2003), 10.1162/153244303322533223
Tomas Mikolov, Geoffrey Zweig, Wen-tau Yih, "Linguistic Regularities in Continuous Space Word Representations", North American Chapter of the Association for Computational Linguistics, pp. 746-751 (2013)
Shravan M. Narayanamurthy, Alex J. Smola, Tibério S. Caetano, James Petterson, Wray Buntine, "Word Features for Latent Dirichlet Allocation", Neural Information Processing Systems, vol. 23, pp. 1921-1929 (2010)