Authors: Peng Xu, Frederick Jelinek
DOI: 10.1016/J.CSL.2006.01.003
Keywords:
Abstract: Language modeling is the problem of predicting words based on histories containing words already hypothesized. Two key aspects of language modeling are effective history equivalence classification and robust probability estimation. The solution of these aspects is hindered by the data sparseness problem. Application of random forests (RFs) to language modeling deals with the two aspects simultaneously. We develop a new smoothing technique based on randomly grown decision trees (DTs) and apply the resulting RF language models to automatic speech recognition. This new method is complementary to many existing ones dealing with the data sparseness problem. We study our RF approach in the context of n-gram type language modeling, in which n − 1 words are present in a history. Unlike regular n-gram language models, RF language models have the potential to generalize well to unseen data, even when histories are longer than four words. We show that our RF language models are superior to the best known smoothing technique, interpolated Kneser–Ney smoothing, in reducing both the perplexity (PPL) and word error rate (WER) in large vocabulary state-of-the-art speech recognition systems. In particular, we show statistically significant improvements in a contemporary conversational telephony speech recognition system by applying the RF approach to only one of its many language models.
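As a rough sketch of the approach the abstract describes (the symbols M for the number of trees and \Phi_j for the j-th tree's history equivalence classification are our notation, not necessarily the paper's), a random forest language model averages the predictions of M randomly grown decision trees, each of which maps the (n − 1)-word history into one of its own equivalence classes:

P_{\mathrm{RF}}(w_i \mid w_{i-n+1}, \ldots, w_{i-1}) = \frac{1}{M} \sum_{j=1}^{M} P_{\mathrm{DT}_j}\bigl(w_i \mid \Phi_j(w_{i-n+1}, \ldots, w_{i-1})\bigr)

Because each tree clusters histories differently, a history unseen in training can still fall into well-populated equivalence classes in many of the trees, which is one intuition for why such models can generalize beyond the n-grams observed in the training data.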