Twitter Normalization via 1-to-N Recovering

Authors: Yafeng Ren, Jiayuan Deng, Donghong Ji

DOI: 10.1007/978-3-319-48740-3_2

Abstract: Twitter messages are written in an informal style, which hinders many information retrieval and natural language processing applications. Existing normalization systems have two major drawbacks. The first is that these methods largely require large-scale annotated training data. The second is that these methods assume a nonstandard token is recovered to one standard word. However, there are many nonstandard tokens that should be recovered to two or more standard words, so the problem remains highly challenging. To address the above issues, we propose an unsupervised normalization system based on context similarity. The proposed system does not require any annotated training data. Meanwhile, a nonstandard token will be recovered to one or more standard words. Results show that the proposed approach achieves state-of-the-art performance.
