Authors: Airenas Vaičiūnas, Vytautas Kaminskas, Gailius Raškinis
DOI: 10.15388/INFORMATICA.2004.079
Keywords:
Abstract: This paper describes our research on statistical language modeling of Lithuanian. We investigated the idea of improving sparse n-gram models of highly inflected Lithuanian by interpolating them with complex models based on word clustering and morphological decomposition. Words, base forms, and part-of-speech tags were clustered into 50 to 5000 automatically generated classes. Multiple 3-gram and 4-gram class-based models were built and evaluated on a text corpus that contained 85 million words. Linearly interpolating the class-based models with the baseline model led to up to a 13% reduction in perplexity compared with the baseline model alone. Morphological decomposition decreased the out-of-vocabulary rate from 1.5% to 1.02%.
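
The two quantities the abstract relies on, linear interpolation of model probabilities and perplexity, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the function names, the per-token probabilities, and the interpolation weight `lam` are hypothetical toy values, not figures or code from the paper.

```python
import math


def interpolate(p_word, p_class, lam):
    """Linearly interpolate two probability estimates for the same token."""
    return lam * p_word + (1.0 - lam) * p_class


def perplexity(probs):
    """Perplexity of a test sequence from its per-token model probabilities."""
    log_sum = sum(math.log(p) for p in probs)
    return math.exp(-log_sum / len(probs))


# Hypothetical per-token probabilities assigned to the same test text
# by a word n-gram model and by a class-based model (toy values).
word_model = [0.20, 0.01, 0.05, 0.10]
class_model = [0.10, 0.04, 0.08, 0.05]
lam = 0.6  # hypothetical weight given to the word n-gram model

mixed = [interpolate(w, c, lam) for w, c in zip(word_model, class_model)]

baseline_ppl = perplexity(word_model)
mixed_ppl = perplexity(mixed)

# Relative perplexity reduction of the interpolated model, in percent.
reduction = 100.0 * (baseline_ppl - mixed_ppl) / baseline_ppl
```

In practice the weight would be tuned on held-out data rather than fixed by hand. The toy values only show how a class-based estimate can raise the probability of tokens that the sparse word model scores poorly.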