Smoothing techniques for adaptive online language models

Authors: Jimmy Lin, Rion Snow, William Morgan

DOI: 10.1145/2020408.2020476

Abstract: We are interested in the problem of tracking broad topics such as "baseball" and "fashion" in continuous streams of short texts, exemplified by tweets from the microblogging service Twitter. The task is conceived as a language modeling problem where per-topic models are trained using hashtags in the tweet stream, which serve as proxies for topic labels. Simple perplexity-based classifiers are then applied to filter the tweet stream for topics of interest. Within this framework, we evaluate, both intrinsically and extrinsically, smoothing techniques for integrating "foreground" models (to capture recency) and "background" models (to combat sparsity), as well as different techniques for retaining history. Experiments show that unigram language models smoothed using a normalized extension of stupid backoff and a simple queue for history retention perform well on the task.
