Abstract: Topic models like latent Dirichlet allocation (LDA) provide a framework for analyzing large datasets where observations are collected into groups. Although topic modeling has been fruitfully applied to problems in social science, biology, and computer vision, it is most widely used to model documents, which are treated as exchangeable groups of words. In this context, topic models discover topics: distributions over words that express a coherent theme such as “business” or “politics.” While one of the strengths of topic models is that they make few assumptions about the underlying data, such a general approach sometimes limits the type of problems they can solve. When we restrict our focus to natural language datasets, we can use insights from linguistics to create models that capture richer patterns. In this thesis, we extend LDA in three different ways: adding knowledge of word meaning, adding multiple languages, and incorporating local syntactic context. These extensions apply topic models to new problems, such as discovering the meaning of ambiguous words and analyzing unaligned multilingual corpora, and they combine topic models with other sources of information about documents' content.

In Chapter 2, we present latent Dirichlet allocation with WordNet (LDAWN), an unsupervised probabilistic topic model that includes word sense as a hidden variable. LDAWN replaces the multinomial topics of LDA with Abney and Light's distribution over meanings. Thus, posterior inference discovers not only the topical domains of each token, as in LDA, but also the meaning associated with each token. We show that considering more topics improves the problem of word sense disambiguation. LDAWN allows us to separate the representation of meaning from how meaning is expressed in word forms.

In Chapter 3, we allow meanings to be expressed using word forms from multiple languages. In addition to the disambiguation provided by LDAWN, this offers a method for using topic models on multilingual corpora.

In Chapter 4, we relax the assumptions of multilingual LDAWN and present multilingual topics for unaligned text (MuTo). Like multilingual LDAWN, MuTo is designed to analyze corpora composed of multiple languages. Unlike multilingual LDAWN, which requires a painstakingly annotated correspondence between languages, MuTo uses stochastic EM to simultaneously discover the matching between languages while it learns topics. We demonstrate that similar topics are recovered across languages.

In Chapter 5, we address a recurring problem that hindered the performance of the models presented in the previous chapters: the lack of syntax. We develop the syntactic topic model (STM), a non-parametric Bayesian model of parsed documents. The STM generates words that are both thematically and syntactically constrained, combining the semantic insights of topic models with the syntactic information available in parse trees. Each sentence's words are generated from distributions that combine document-specific topic weights and parse-tree-specific transitions, and words are assumed to be generated in an order that respects the parse tree. We derive an approximate posterior inference method based on variational methods for hierarchical Dirichlet processes, and we report qualitative and quantitative results on both synthetic data and hand-parsed documents.

In Chapter 6, we conclude with a discussion of how the work in this thesis could be used in real-world applications such as sentiment analysis, and how it could be extended to capture even richer linguistic structure in text.
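For readers unfamiliar with the base model, the standard LDA generative process that the extensions above modify can be sketched as follows. This is the usual formulation with symmetric Dirichlet priors; the hyperparameter symbols $\alpha$ and $\beta$ are generic notation rather than the thesis's own.

\begin{align*}
\phi_k \mid \beta &\sim \mathrm{Dirichlet}(\beta)
  && \text{topic $k$: a distribution over the vocabulary} \\
\theta_d \mid \alpha &\sim \mathrm{Dirichlet}(\alpha)
  && \text{document $d$'s mixture over topics} \\
z_{d,n} \mid \theta_d &\sim \mathrm{Multinomial}(\theta_d)
  && \text{topic assignment of token $n$ in document $d$} \\
w_{d,n} \mid z_{d,n}, \phi &\sim \mathrm{Multinomial}(\phi_{z_{d,n}})
  && \text{observed word}
\end{align*}

Chapters 2 through 5 replace or augment the per-topic word distributions $\phi_k$ and the per-token draw of $w_{d,n}$ while keeping this document-level mixture structure.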
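As a rough, hypothetical illustration of the Chapter 2 idea, the sketch below replaces LDA's multinomial topic with a topic-specific random walk down a toy hypernym hierarchy, in the spirit of Abney and Light: the word is emitted from the synset where the walk stops, so the stopping synset doubles as the token's sense. The hierarchy, probabilities, and function names here are invented for illustration and are not the thesis's actual parameterization.

import random

# Toy hypernym hierarchy: each internal synset lists its children.
# Leaves are synsets that emit word forms. All of this is invented;
# LDAWN itself walks over WordNet.
HIERARCHY = {
    "entity": ["organization", "activity"],
    "organization": ["bank.finance", "company"],
    "activity": ["bank.river", "politics"],
}
WORDS = {
    "bank.finance": ["bank"],
    "company": ["firm", "company"],
    "bank.river": ["bank", "shore"],
    "politics": ["election", "vote"],
}

# Topic-specific transition probabilities over each synset's children.
# A real model learns these from data; here they are hand-picked.
TRANSITIONS = {
    "finance-topic": {
        "entity": [0.9, 0.1],        # mostly walks toward "organization"
        "organization": [0.8, 0.2],
        "activity": [0.5, 0.5],
    },
    "politics-topic": {
        "entity": [0.2, 0.8],        # mostly walks toward "activity"
        "organization": [0.5, 0.5],
        "activity": [0.1, 0.9],
    },
}


def generate_token(topic, rng=random):
    """Walk from the root until reaching a leaf synset, then emit a word.

    Returns (word, sense): the surface form and the synset where the walk
    stopped, mirroring how LDAWN ties sense disambiguation to topics.
    """
    node = "entity"
    while node in HIERARCHY:
        children = HIERARCHY[node]
        weights = TRANSITIONS[topic][node]
        node = rng.choices(children, weights=weights, k=1)[0]
    word = rng.choice(WORDS[node])
    return word, node


if __name__ == "__main__":
    for topic in ("finance-topic", "politics-topic"):
        print(topic, [generate_token(topic) for _ in range(3)])

Because the two topics prefer different branches of the hierarchy, the same surface form “bank” tends to be generated from (and therefore disambiguated to) different senses depending on the token's topic assignment, which is the behavior the abstract describes.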
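For Chapter 5, one way to read “combines document-specific topic weights and parse-tree-specific transitions” is the following hedged sketch of the syntactic topic model's topic draw; the exact parameterization is given in the thesis, and the notation below is mine. The topic of a node $n$ is drawn from the renormalized element-wise product of the document's topic proportions and the transition distribution of its parent's topic,

\[
  z_n \mid z_{\mathrm{parent}(n)}, \theta_d
    \;\sim\; \mathrm{Multinomial}\!\left(
      \frac{\theta_d \odot \pi_{z_{\mathrm{parent}(n)}}}
           {\theta_d^{\top}\, \pi_{z_{\mathrm{parent}(n)}}}
    \right),
\]

where $\pi_k$ is topic $k$'s transition distribution over the topics of its syntactic children and $\odot$ denotes element-wise multiplication. The word at node $n$ is then drawn from topic $z_n$'s distribution over words, as in LDA, so each word is constrained both by what the document is about and by where it sits in the parse tree.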