Authors: Tomáš Brychcín, Miloslav Konopík
DOI: 10.1016/J.CSL.2015.01.004
Keywords: Machine learning, Universal Networking Language, Latent Dirichlet allocation, Natural language processing, Machine translation, Computer science, Language model, Perplexity, Random indexing, Semantics, Multilingualism, Artificial intelligence
Abstract: Highlights: The unsupervised techniques of language modelling are investigated. We use global semantics, local semantics, and morphology information in our models. We experiment with six different languages. The final models dramatically reduce perplexity and improve machine translation.
This paper investigates three sources of information (global semantics, local semantics, and morphology) and their integration into language modelling. Global semantics is modelled by Latent Dirichlet allocation and brings long-range dependencies into the models. Word clusters given by semantic spaces enrich these models with short-range semantics. Finally, our own stemming algorithm is used to further enhance the performance for inflectional languages. Our research shows that each source of information, and each combination of them, improves the models. All investigated models are acquired in a fully unsupervised manner. We show the efficiency of the methods on several languages (Czech, Slovenian, Slovak, Polish, Hungarian, and English), proving their multilingualism. The perplexity tests are accompanied by machine translation experiments that prove the ability of the proposed models to improve a real-world application.
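The abstract describes blending a global (topic-based) semantic estimate into a local language model and evaluating by perplexity. The following is a minimal toy sketch of that general idea, not the authors' actual method or code: `interpolated_prob`, the probability values, and the interpolation weight `lam` are all hypothetical, chosen only to illustrate linear interpolation and the standard perplexity formula.

```python
import math

# Hypothetical illustration: linearly interpolate a local n-gram estimate
# with a "global semantics" (topic-based) unigram estimate, then score a
# held-out sequence by perplexity. All numbers below are made up.

def interpolated_prob(p_ngram, p_topic, lam=0.7):
    """Linear interpolation: lam * local estimate + (1 - lam) * global estimate."""
    return lam * p_ngram + (1.0 - lam) * p_topic

def perplexity(probs):
    """Perplexity = exp(-(1/N) * sum(log p_i)) over the N test words."""
    return math.exp(-sum(math.log(p) for p in probs) / len(probs))

# Toy per-word probabilities for a 4-word test sequence (assumed values).
ngram_probs = [0.10, 0.05, 0.20, 0.08]
topic_probs = [0.12, 0.15, 0.10, 0.20]

mixed = [interpolated_prob(pn, pt) for pn, pt in zip(ngram_probs, topic_probs)]
print(round(perplexity(mixed), 2))
```

A lower perplexity on held-out text indicates a better predictive model; the paper's claim is that adding the global-semantic component lowers it relative to the n-gram model alone.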