Linguistic extensions of topic models

Authors: Jordan Boyd-Graber, David Blei


Abstract: Topic models like latent Dirichlet allocation (LDA) provide a framework for analyzing large datasets where observations are collected into groups. Although topic modeling has been fruitfully applied to problems in social science, biology, and computer vision, it is most widely used to model text, where documents are modeled as exchangeable groups of words. In this context, topic models discover topics, distributions over words, that express a coherent theme such as "business" or "politics." While one of the strengths of topic models is that they make few assumptions about the underlying data, such a general approach sometimes limits the kinds of problems they can solve. When we restrict our focus to natural language datasets, we can use insights from linguistics to create models that understand richer linguistic patterns. In this thesis, we extend LDA in three different ways: adding knowledge of word meaning, accounting for multiple languages, and incorporating local syntactic context. These extensions apply topic models to new problems, such as discovering the meaning of ambiguous words and analyzing unaligned multilingual corpora, and combine topic models with other sources of information about documents.

In Chapter 2, we present latent Dirichlet allocation with WordNet (LDAWN), an unsupervised probabilistic topic model that includes word sense as a hidden variable. LDAWN replaces the multinomial topics of LDA with Abney and Light's distribution over meanings. Thus, posterior inference discovers not only the topical domains of each token, as in LDA, but also the meaning associated with each token. We show that considering more topics improves performance on the problem of word sense disambiguation. LDAWN allows us to separate the representation of meaning from how meaning is expressed in word forms.

In Chapter 3, we allow meanings to be expressed using word forms from multiple languages. In addition to the disambiguation provided by LDAWN, this offers a method for topic modeling on multilingual corpora.

In Chapter 4, we relax the assumptions made by multilingual LDAWN with multilingual topics for unaligned text (MuTo). Like multilingual LDAWN, MuTo is designed to analyze corpora composed of multiple languages. Unlike multilingual LDAWN, which requires a correspondence between languages that has been painstakingly annotated, MuTo uses stochastic EM to simultaneously discover a matching between languages while it learns topics. We demonstrate that similar topics are recovered across languages.

In Chapter 5, we address a recurring problem that hindered the performance of the models presented in the previous chapters: the lack of syntax. We develop the syntactic topic model (STM), a non-parametric Bayesian model of parsed documents. The STM generates words that are both thematically and syntactically constrained, combining the semantic insights of topic models with the syntactic information available from parse trees. Each word of each sentence is generated by a distribution that combines document-specific topic weights and parse-tree-specific transitions. Words are assumed to be generated in an order that respects the parse tree. We derive an approximate posterior inference method based on variational methods for hierarchical Dirichlet processes, and we report qualitative and quantitative results on both synthetic data and hand-parsed documents.

In Chapter 6, we conclude with a discussion of how the models presented in this thesis can be applied to real-world problems such as sentiment analysis, and how they could be extended to capture even richer linguistic structure in text.
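The abstract describes topics as distributions over words from which exchangeable documents are drawn. As a minimal sketch of that generative view (the vocabulary, corpus sizes, and Dirichlet hyperparameters below are invented for illustration, not taken from the thesis):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative vocabulary and sizes (assumptions for this sketch).
vocab = ["market", "stock", "trade", "vote", "law", "senate"]
n_topics, n_docs, doc_len = 2, 4, 20
alpha, eta = 0.5, 0.5  # symmetric Dirichlet hyperparameters

# Topics: distributions over the vocabulary.
topics = rng.dirichlet(np.full(len(vocab), eta), size=n_topics)

corpus = []
for _ in range(n_docs):
    theta = rng.dirichlet(np.full(n_topics, alpha))  # per-document topic mix
    doc = []
    for _ in range(doc_len):
        z = rng.choice(n_topics, p=theta)        # topic assignment for a token
        w = rng.choice(len(vocab), p=topics[z])  # word drawn from topic z
        doc.append(vocab[w])
    corpus.append(doc)

print(len(corpus), len(corpus[0]))  # 4 documents of 20 words each
```

Because words within a document are drawn independently given the topic mix, documents are exchangeable bags of words, which is exactly the assumption the thesis's extensions relax.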
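Chapter 5's syntactic topic model is summarized as drawing each word's topic from a distribution that combines document-specific weights with a transition keyed to the parse tree. A toy sketch of that combination (the renormalized product and every parameter here are illustrative assumptions, not the thesis's exact construction):

```python
import numpy as np

rng = np.random.default_rng(1)

n_topics = 3
doc_weights = rng.dirichlet(np.ones(n_topics))           # document preferences
trans = rng.dirichlet(np.ones(n_topics), size=n_topics)  # per-topic transitions

# A parse tree as parent pointers: node 0 is the root.
parents = [-1, 0, 0, 1]

# Generate topics in an order that respects the tree (parents first).
topic_of = {}
for node, parent in enumerate(parents):
    if parent == -1:
        probs = doc_weights
    else:
        # Combine document weights with the parent's transition, renormalized.
        probs = doc_weights * trans[topic_of[parent]]
        probs = probs / probs.sum()
    topic_of[node] = int(rng.choice(n_topics, p=probs))

print(topic_of)
```

The point of the sketch is the coupling: a node's topic depends both on what the document is about and on its syntactic parent, so the generated words are thematically and syntactically constrained at once.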
