Random forests and the data sparseness problem in language modeling

Authors: Peng Xu, Frederick Jelinek

DOI: 10.1016/j.csl.2006.01.003

Abstract: Language modeling is the problem of predicting words based on histories containing words already hypothesized. Two key aspects of language modeling are effective history equivalence classification and robust probability estimation. The solution of both is hindered by the data sparseness problem. Applying random forests (RFs) to language modeling deals with the two aspects simultaneously. We develop a new smoothing technique based on randomly grown decision trees (DTs) and apply the resulting RF language models to automatic speech recognition. This method is complementary to many existing techniques for dealing with data sparseness. We study our approach in the context of n-gram type language modeling, in which n − 1 words are present in each history. Unlike regular n-gram language models, RF language models have the potential to generalize well to unseen data, even when histories are longer than four words. We show that our approach is superior to the best known smoothing technique, interpolated Kneser–Ney smoothing, in reducing both perplexity (PPL) and word error rate (WER) in large vocabulary state-of-the-art speech recognition systems. In particular, we show statistically significant improvements in a contemporary conversational telephony speech recognition system by applying the RF approach to only one of its language models.
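To make the abstract's idea concrete, here is a minimal toy sketch (not the authors' algorithm): each forest member assigns n-gram histories to equivalence classes — a random hash-based classification stands in for the paper's randomly grown decision trees — estimates P(word | class) with simple add-one smoothing, and the forest averages its members' probabilities. All class and function names here are illustrative assumptions.

```python
import random
from collections import defaultdict

class RandomHistoryTree:
    """One forest member: a random equivalence classification of histories.

    A toy stand-in for a randomly grown decision tree over history words;
    the salt makes each member's classification different.
    """

    def __init__(self, seed, num_classes=4):
        self.num_classes = num_classes
        self.salt = random.Random(seed).random()
        self.counts = defaultdict(lambda: defaultdict(int))  # class -> word -> count
        self.totals = defaultdict(int)                       # class -> total count
        self.vocab = set()

    def _cls(self, history):
        # Random equivalence class of the history (tuple of n-1 words).
        return hash((history, self.salt)) % self.num_classes

    def train(self, ngrams):
        for history, word in ngrams:
            c = self._cls(history)
            self.counts[c][word] += 1
            self.totals[c] += 1
            self.vocab.add(word)

    def prob(self, history, word):
        # Add-one smoothing within the history's class.
        c = self._cls(history)
        return (self.counts[c][word] + 1) / (self.totals[c] + len(self.vocab))

def forest_prob(trees, history, word):
    # The forest probability is the average over member trees.
    return sum(t.prob(history, word) for t in trees) / len(trees)

# Tiny bigram example (history length n - 1 = 1).
words = "the cat sat on the mat . the dog sat on the rug .".split()
bigrams = [((words[i],), words[i + 1]) for i in range(len(words) - 1)]
trees = [RandomHistoryTree(seed=s) for s in range(10)]
for t in trees:
    t.train(bigrams)

p = forest_prob(trees, ("the",), "cat")
assert 0.0 < p < 1.0
```

Because each member distributes mass over the whole vocabulary within a class, the averaged distribution still sums to one over the vocabulary; the randomization across members is what gives the forest its robustness to unseen histories.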

References (56)
J. Kent Martin, "An Exact Probability Metric for Decision Tree Splitting and Stopping," Machine Learning, vol. 28, pp. 257–291 (1997). DOI: 10.1023/A:1007367629006
Joshua T. Goodman, "Exponential priors for maximum entropy models," Proc. NAACL, pp. 305–312 (2005)
Leo Breiman, Jerome H. Friedman, Richard A. Olshen, Charles J. Stone, Classification and Regression Trees (1983)
Andreas Stolcke, "SRILM – An Extensible Language Modeling Toolkit," Proc. International Conference on Spoken Language Processing (2002)
Mitchell Marcus, Beatrice Santorini, Mary Ann Marcinkiewicz, "Building a large annotated corpus of English: the Penn Treebank," Computational Linguistics, vol. 19, pp. 313–330 (1993). DOI: 10.21236/ADA273556
Xiaojin Zhu, R. Rosenfeld, "Improving trigram language modeling with the World Wide Web," Proc. ICASSP, vol. 1, pp. 533–536 (2001). DOI: 10.1109/ICASSP.2001.940885