Tokenizing micro-blogging messages using a text classification approach

Authors: Gustavo Laboreiro, Luís Sarmento, Jorge Teixeira, Eugénio Oliveira

DOI: 10.1145/1871840.1871853

Abstract: The automatic processing of microblogging messages may be problematic, even in the case of very elementary operations such as tokenization. The problems arise from the use of non-standard language, including media-specific words (e.g. "2day", "gr8", "tl;dr", "loool"), emoticons (e.g. "(o_o)", "(=^-^=)"), non-standard letter casing (e.g. "dr. Fred") and unusual punctuation (e.g. ".... ..", "!??!!!?", ",,,"). Additionally, spelling errors are abundant (e.g. "I;m"), and we can frequently find more than one language (with different tokenization requirements) in the same short message. To be efficient in such an environment, manually developed rule-based tokenizer systems have to deal with many conditions and exceptions, which makes them difficult to build and maintain. We present a text classification approach for tokenizing Twitter messages, which addresses complex cases successfully and is relatively simple to set up and maintain. For that, we created a corpus consisting of 2500 manually tokenized Twitter messages (a task that is simple for human annotators) and trained an SVM classifier for separating tokens at certain discontinuity characters. For comparison, we created a baseline rule-based system designed specifically for dealing with typical problematic situations. Results show that we can achieve F-measures of 96% with the classification-based approach, much above the performance obtained by the baseline rule-based tokenizer (85%). Also, subsequent analysis allowed us to identify typical tokenization errors, which can be partially solved by adding some additional descriptive examples to the training corpus and re-training the classifier.
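The framing described in the abstract, deciding whether to split at each discontinuity character, can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the definition of a discontinuity (a boundary between two non-space characters where at least one is not alphanumeric), the context window of 3 characters, and the hypothetical helper names `decision_points`, `gold_boundaries`, `features` and `training_examples`. The SVM itself is left out; the sketch only shows how a manually tokenized message might be turned into labeled decision points.

```python
def decision_points(text):
    """Offsets i where a split between text[i-1] and text[i] is a candidate.

    Assumption: a candidate is any boundary between two non-space characters
    where at least one of them is not alphanumeric.
    """
    points = []
    for i in range(1, len(text)):
        left, right = text[i - 1], text[i]
        if left.isspace() or right.isspace():
            continue  # whitespace already separates tokens
        if not left.isalnum() or not right.isalnum():
            points.append(i)
    return points


def gold_boundaries(text, tokens):
    """Offsets in text where each manually annotated token starts."""
    starts = set()
    pos = 0
    for tok in tokens:
        pos = text.index(tok, pos)  # tokens appear in order in the raw text
        starts.add(pos)
        pos += len(tok)
    return starts


def features(text, i, window=3):
    """Illustrative context features around decision point i (assumed, not the paper's)."""
    return {
        "left": text[max(0, i - window):i],
        "right": text[i:i + window],
        "left_char": text[i - 1],
        "right_char": text[i],
    }


def training_examples(text, tokens):
    """Pair each decision point's features with a split / no-split label."""
    starts = gold_boundaries(text, tokens)
    return [(features(text, i), i in starts) for i in decision_points(text)]


message = "I;m gr8 2day!!!"
gold = ["I;m", "gr8", "2day", "!!!"]  # a hypothetical manual tokenization
for feats, split in training_examples(message, gold):
    print(feats["left"], "|", feats["right"], "->", "split" if split else "keep")
```

In this toy message only the boundary between "2day" and "!!!" is labeled as a split; the ";" inside "I;m" and the boundaries inside "!!!" are labeled as non-splits. Feature dictionaries of this kind could then be vectorized and passed to any binary classifier, for example an SVM implementation of the reader's choice.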
