Latent semantics in language models

Authors: Tomáš Brychcín, Miloslav Konopík

DOI: 10.1016/j.csl.2015.01.004

Keywords: Machine learning, Universal Networking Language, Latent Dirichlet allocation, Natural language processing, Machine translation, Computer science, Language model, Perplexity, Random indexing, Semantics, Multilingualism, Artificial intelligence

Abstract:

Highlights:
- The unsupervised techniques of language modelling are investigated.
- We use global semantics, local semantics, and morphology information in our models.
- We experiment with six different languages.
- The final models dramatically reduce perplexity and improve machine translation.

This paper investigates three different sources of information and their integration into language modelling. Global semantics is modelled by Latent Dirichlet allocation and brings long-range dependencies into the models. Word clusters given by semantic spaces enrich the models with short-range semantics. Finally, our own stemming algorithm is used to further enhance the performance for inflectional languages. Our research shows that each source of information improves the models, and so does their combination. All investigated models are acquired in a fully unsupervised manner. We show the efficiency of our methods on several languages (Czech, Slovenian, Slovak, Polish, Hungarian, and English), proving their multilingual applicability. The perplexity tests are accompanied by machine translation experiments that prove the ability of the proposed models to improve a real-world application.
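The abstract's central ideas, combining component language models by interpolation and evaluating with perplexity, can be sketched in a few lines. This is an illustrative toy, not the authors' models: the LDA-based and cluster-based components are replaced here by simple unigram and bigram estimates, and the interpolation weight `lam` and the unseen-word floor are arbitrary assumptions.

```python
import math
from collections import Counter

def train(tokens):
    """Estimate unigram and bigram relative frequencies from a token list."""
    uni = Counter(tokens)
    bi = Counter(zip(tokens, tokens[1:]))
    total = len(tokens)
    p_uni = {w: c / total for w, c in uni.items()}
    p_bi = {(h, w): c / uni[h] for (h, w), c in bi.items()}
    return p_uni, p_bi

def interp_prob(history, word, p_uni, p_bi, lam=0.7, floor=1e-6):
    """Linear interpolation of two component models:
    lam * P(w | h) + (1 - lam) * P(w), with a small floor for unseen words."""
    return lam * p_bi.get((history, word), 0.0) + (1 - lam) * p_uni.get(word, floor)

def perplexity(test_tokens, p_uni, p_bi, lam=0.7):
    """Perplexity = 2 ** (-average log2 probability per predicted token)."""
    logp = 0.0
    n = 0
    for h, w in zip(test_tokens, test_tokens[1:]):
        logp += math.log2(interp_prob(h, w, p_uni, p_bi, lam))
        n += 1
    return 2 ** (-logp / n)

# Tiny illustrative corpus (hypothetical data).
train_text = "the cat sat on the mat the cat ran".split()
p_uni, p_bi = train(train_text)
pp_interp = perplexity(train_text, p_uni, p_bi, lam=0.7)
pp_uni_only = perplexity(train_text, p_uni, p_bi, lam=0.0)
print(pp_interp < pp_uni_only)  # adding bigram context lowers perplexity here
```

In the paper's setting the interpolated components carry global (LDA topic) and local (word cluster) semantics rather than raw n-gram counts, but the evaluation logic, lower perplexity on held-out text indicating a better model, is the same.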

References (60)
Georgiana Dinu, Mirella Lapata, Measuring Distributional Similarity in Context. Empirical Methods in Natural Language Processing, pp. 1162-1172 (2010)
Tomáš Brychcín, Pavel Král, Novel Unsupervised Features for Czech Multi-label Document Classification. Mexican International Conference on Artificial Intelligence, pp. 70-79 (2014). DOI: 10.1007/978-3-319-13647-9_8
George Karypis, CLUTO - A Clustering Toolkit. Defense Technical Information Center (2002). DOI: 10.21236/ADA439508
Magnus Sahlgren, Pentti Kanerva, Anders Holst, Permutations as a means to encode order in word space. The 30th Annual Meeting of the Cognitive Science Society (CogSci'08), 23-26 July 2008, Washington D.C., USA (2008)
Amruta Purandare, Ted Pedersen, Word Sense Discrimination by Clustering Contexts in Vector and Similarity Spaces. Conference on Computational Natural Language Learning, pp. 41-48 (2004)
George Karypis, Ying Zhao, Ding-Zhu Du, Criterion Functions for Document Clustering. University of Minnesota (2005)
Tomas Mikolov, Martin Karafiát, Sanjeev Khudanpur, Jan Černocký, Lukáš Burget, Recurrent Neural Network Based Language Model. Conference of the International Speech Communication Association, pp. 1045-1048 (2010)
Magnus Sahlgren, An Introduction to Random Indexing. Terminology and Knowledge Engineering (2005)
Vassilios Digalakis, Dimitris Oikonomidis, Stem-Based Maximum Entropy Language Models for Inflectional Languages. Conference of the International Speech Communication Association (2003)