Authors: Frederick Jelinek, Yi Su
DOI:
Keywords:
Abstract: A language model (LM) is a probability distribution over all possible word sequences. It is a vital component of many natural language processing tasks, such as automatic speech recognition, statistical machine translation, and information retrieval. The art of language modeling has long been dominated by a simple yet powerful family, the n-gram models. Many attempts have been made to go beyond n-grams, either by proposing a new mathematical framework or by integrating more knowledge of human language, preferably both. The random forest language model (RFLM), a collection of randomized decision tree language models, has distinguished itself as a successful effort of the former kind; we explore its potential for the latter. We start our quest by advancing the understanding of the RFLM through explorative experimentation. To facilitate further investigation, we address the problem of training on large amounts of data with an efficient disk-swapping algorithm. We then formalize a method of integrating various knowledge sources into language models with random forests and illustrate its applicability with three innovative applications: morphological LMs for Arabic, prosodic LMs for speech recognition, and the combination of syntactic and topic information in LMs.
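To make the abstract's opening definition concrete, the following is a minimal sketch (not from the source work) of the n-gram family it refers to: a maximum-likelihood bigram model that estimates P(word | previous word) from counts. The corpus, function names, and the `<s>`/`</s>` boundary markers are illustrative assumptions.

```python
from collections import defaultdict

def train_bigram_lm(sentences):
    """Collect unigram and bigram counts from tokenized sentences,
    padded with <s>/</s> sentence-boundary markers."""
    unigram = defaultdict(int)
    bigram = defaultdict(int)
    for words in sentences:
        tokens = ["<s>"] + words + ["</s>"]
        for w in tokens:
            unigram[w] += 1
        for a, b in zip(tokens, tokens[1:]):
            bigram[(a, b)] += 1
    return unigram, bigram

def bigram_prob(unigram, bigram, history, word):
    """Maximum-likelihood estimate of P(word | history) = c(history, word) / c(history)."""
    if unigram[history] == 0:
        return 0.0
    return bigram[(history, word)] / unigram[history]

# Toy corpus: "the" is followed by "cat" in one of its two occurrences.
corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
uni, bi = train_bigram_lm(corpus)
print(bigram_prob(uni, bi, "the", "cat"))  # 0.5
```

A decision-tree LM of the kind the RFLM aggregates replaces the fixed (n-1)-word history above with learned equivalence classes of histories; smoothing (absent from this maximum-likelihood sketch) is essential in practice to avoid zero probabilities for unseen bigrams.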