Knowledge integration into language models: a random forest approach

Authors: Frederick Jelinek, Yi Su

Abstract: A language model (LM) is a probability distribution over all possible word sequences. It is a vital component of many natural language processing tasks, such as automatic speech recognition, statistical machine translation, and information retrieval. The art of language modeling has been dominated by a simple yet powerful model family, the n-gram language models. Many attempts have been made to go beyond n-grams, either by proposing a new mathematical framework or by integrating more knowledge of human language, preferably both. The random forest language model (RFLM), a collection of randomized decision tree language models, has distinguished itself as a successful effort of the former kind; we explore its potential for the latter. We start our quest by advancing our understanding of the RFLM through explorative experimentation. To facilitate further investigation, we address the problem of training the RFLM on large amounts of data with an efficient disk swapping algorithm. We formalize our method of integrating various knowledge sources into language models with random forests and illustrate its applicability with three innovative applications: morphological LMs of Arabic, prosodic LMs for speech recognition, and the combination of syntactic and topic information in LMs.
