Semantic spaces for improving language modeling

Authors: Tomáš Brychcín, Miloslav Konopík

DOI: 10.1016/J.CSL.2013.05.001

Keywords:

Abstract: Language models are crucial for many tasks in NLP (Natural Language Processing) and n-grams are the best way to build them. Huge effort is being invested in improving n-gram language models. By introducing external information (morphology, syntax, partitioning into documents, etc.) a significant improvement can be achieved. The models can, however, be improved with no external information, and smoothing is an excellent example of such an improvement. In this article we show another improvement that also requires no external information. We examine patterns found in large corpora by building semantic spaces (HAL, COALS, BEAGLE and others described in this article). These semantic spaces have never been tested in language modeling before. Our method uses clustering to derive word classes for a class-based language model. The class-based model is then coupled with a standard n-gram model to create a very effective language model. Our experiments show that our models reduce perplexity and improve accuracy with no external information added. Training is fully unsupervised. The models are particularly effective for inflectional languages, which are especially hard to model. We show results for five different semantic spaces with various settings and numbers of classes. The perplexity tests are accompanied by machine translation tests that prove the ability of the proposed models to improve the performance of a real-world application.
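The abstract outlines a pipeline: build a semantic space from corpus co-occurrence statistics, cluster the word vectors into classes, and interpolate the resulting class-based model with a standard n-gram model. The snippet below is a minimal sketch of that idea, not the authors' implementation: the toy corpus, the HAL-style window weighting, scikit-learn's KMeans, the number of classes, the smoothing constant and the interpolation weight `lam` are all illustrative assumptions.

```python
# Minimal sketch (illustrative only): HAL-style co-occurrence space,
# k-means clustering into word classes, and linear interpolation of a
# class-based bigram model with a word bigram model.
from collections import Counter
import numpy as np
from sklearn.cluster import KMeans

corpus = "the cat sat on the mat the dog sat on the rug".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}
V = len(vocab)

# 1) HAL-style semantic space: distance-weighted co-occurrence counts
#    collected in a sliding window to the right of each word.
window = 2
M = np.zeros((V, V))
for i, w in enumerate(corpus):
    for d in range(1, window + 1):
        if i + d < len(corpus):
            M[idx[w], idx[corpus[i + d]]] += window - d + 1

# 2) Cluster the word vectors into classes (the inventory for the class-based LM).
n_classes = 3
labels = KMeans(n_clusters=n_classes, n_init=10, random_state=0).fit_predict(M)
cls = {w: labels[idx[w]] for w in vocab}

# 3) Collect word bigram/unigram and class bigram/unigram counts.
word_bi, word_uni = Counter(zip(corpus, corpus[1:])), Counter(corpus)
class_seq = [cls[w] for w in corpus]
class_bi, class_uni = Counter(zip(class_seq, class_seq[1:])), Counter(class_seq)

def p_word_bigram(w2, w1, alpha=0.1):
    # Add-alpha smoothed word bigram probability P(w2 | w1).
    return (word_bi[(w1, w2)] + alpha) / (word_uni[w1] + alpha * V)

def p_class_bigram(w2, w1, alpha=0.1):
    # Class-based estimate: P(w2 | w1) ~= P(c2 | c1) * P(w2 | c2).
    c1, c2 = cls[w1], cls[w2]
    p_cc = (class_bi[(c1, c2)] + alpha) / (class_uni[c1] + alpha * n_classes)
    p_wc = word_uni[w2] / class_uni[c2]
    return p_cc * p_wc

def p_interpolated(w2, w1, lam=0.5):
    # Linear interpolation of the word-based and class-based estimates.
    return lam * p_word_bigram(w2, w1) + (1 - lam) * p_class_bigram(w2, w1)

print(p_interpolated("sat", "cat"))
```

The class-based component backs off to coarser statistics, which is where the gains for sparse (e.g. inflectional) data would come from; in practice the clustering would run over vectors from a full semantic space such as HAL, COALS or BEAGLE rather than raw counts from a toy corpus.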

References (41)
Jianfeng Gao, Jiangbo Miao, Joshua T. Goodman. The Use of Clustering Techniques for Language Modeling--Application to Asian Languages. International Journal of Computational Linguistics & Chinese Language Processing, vol. 6, no. 1, pp. 27-60 (2001). DOI: 10.30019/IJCLCLP.200102.0002

Bhuvana Ramabhadran, Abhinav Sethy, Hong-Kwang Jeff Kuo, Sangyun Hahn. A study of unsupervised clustering techniques for language modeling. Conference of the International Speech Communication Association, pp. 1598-1601 (2008).

George Karypis. CLUTO: A Clustering Toolkit. Defense Technical Information Center (2002). DOI: 10.21236/ADA439508

Magnus Sahlgren, Pentti Kanerva, Anders Holst. Permutations as a means to encode order in word space. The 30th Annual Meeting of the Cognitive Science Society (CogSci'08), 23-26 July 2008, Washington D.C., USA (2008).

Amruta Purandare, Ted Pedersen. Word Sense Discrimination by Clustering Contexts in Vector and Similarity Spaces. Conference on Computational Natural Language Learning, pp. 41-48 (2004).

Magnus Sahlgren. An Introduction to Random Indexing. Terminology and Knowledge Engineering (2005).

Airenas Vaičiūnas, Vytautas Kaminskas, Gailius Raškinis. Statistical Language Models of Lithuanian Based on Word Clustering and Morphological Decomposition. Informatica (Lithuanian Academy of Sciences), vol. 15, pp. 565-580 (2004). DOI: 10.15388/INFORMATICA.2004.079

Thomas Hofmann. Probabilistic latent semantic analysis. Uncertainty in Artificial Intelligence, vol. 15, pp. 289-296 (1999).

Daniel Gildea, Thomas Hofmann. Topic-based language models using EM. Conference of the International Speech Communication Association, pp. 2167-2170 (1999).