Statistical Language Models of Lithuanian Based on Word Clustering and Morphological Decomposition

Authors: Airenas Vaičiūnas, Vytautas Kaminskas, Gailius Raškinis

DOI: 10.15388/INFORMATICA.2004.079

Abstract: This paper describes our research on statistical language modeling of Lithuanian. We investigated the idea of improving sparse n-gram models of highly inflected Lithuanian by interpolating them with more complex models based on word clustering and morphological decomposition. Words, base forms, and part-of-speech tags were clustered into 50 to 5000 automatically generated classes. Multiple 3-gram and 4-gram class-based models were built and evaluated on a text corpus containing 85 million words. Class-based models linearly interpolated with the baseline model led to a reduction in perplexity of up to 13% compared with the baseline model. Morphological models decreased the out-of-vocabulary rate from 1.5% to 1.02%.
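
The abstract names three quantities: linear interpolation of two models, perplexity, and out-of-vocabulary rate. The Python sketch below illustrates how each is commonly computed. It is not the authors' implementation. The function names, the interpolation weight lam, and all probabilities are hypothetical toy values chosen for illustration, not data from the paper.

```python
import math

def interpolate(p_word, p_class, lam):
    """Linearly interpolate two probability estimates of the same event.

    lam is the weight of the word-based model (an assumed parameter name).
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * p_word + (1.0 - lam) * p_class

def perplexity(probs):
    """Perplexity of a test sequence from its per-word probabilities."""
    log_sum = sum(math.log(p) for p in probs)
    return math.exp(-log_sum / len(probs))

def oov_rate(tokens, vocabulary):
    """Fraction of test tokens that are missing from the vocabulary."""
    unknown = sum(1 for token in tokens if token not in vocabulary)
    return unknown / len(tokens)

# Toy per-word probabilities (hypothetical, not from the paper).
# The word model is sharp on frequent words but very low on rare ones;
# the class model is smoother.
p_word_model = [0.10, 0.001, 0.05, 0.0005]
p_class_model = [0.08, 0.010, 0.04, 0.0080]
lam = 0.6  # assumed weight; in practice it is tuned on held-out data

p_mixed = [interpolate(w, c, lam) for w, c in zip(p_word_model, p_class_model)]

baseline_ppl = perplexity(p_word_model)
mixed_ppl = perplexity(p_mixed)
relative_reduction = 1.0 - mixed_ppl / baseline_ppl

print(f"baseline perplexity:     {baseline_ppl:.2f}")
print(f"interpolated perplexity: {mixed_ppl:.2f}")
print(f"relative reduction:      {relative_reduction:.1%}")
```

In this toy setting the interpolated model assigns more probability to the rare words, so its perplexity is lower than the baseline. The size of the reduction reflects only the made-up numbers and says nothing about the results reported in the paper.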
