Tokenizing micro-blogging messages using a text classification approach

Authors: Gustavo Laboreiro, Luís Sarmento, Jorge Teixeira, Eugénio Oliveira

Abstract: The automatic processing of microblogging messages may be problematic, even in the case of very elementary operations such as tokenization. The problems arise from the use of non-standard language, including media-specific words (e.g. "2day", "gr8", "tl;dr", "loool"), emoticons (e.g. "(o_o)", "(=^-^=)"), non-standard letter casing (e.g. "dr. Fred") and unusual punctuation (e.g. ".... ..", "!??!!!?", ",,,"). Additionally, spelling errors are abundant (e.g. "I;m"), and we can frequently find more than one language (with different tokenization requirements) in the same short message. To be effective in such an environment, manually developed rule-based tokenizer systems have to deal with many conditions and exceptions, which makes them difficult to build and maintain. We present a text classification approach for tokenizing Twitter messages, which addresses complex cases successfully and which is relatively simple to set up and to maintain. For that, we created a corpus consisting of 2500 manually tokenized Twitter messages (a task that is simple for human annotators to complete) and we trained an SVM classifier for separating tokens at certain discontinuity characters. For comparison, we created a baseline rule-based system designed specifically for dealing with typical problematic situations. Results show that we can achieve F-measures of 96% with the classification-based approach, much above the performance obtained by the baseline rule-based tokenizer (85%). Also, subsequent analysis allowed us to identify typical tokenization errors, which we show can be partially solved by adding some additional descriptive examples to the training corpus and re-training the classifier.
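The idea of tokenizing by classifying at discontinuity characters can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the definition of a discontinuity character (any character that is neither alphanumeric nor whitespace), the character-window features, the two-way "split before / split after" decision, and the names `is_discontinuity`, `context_features`, `tokenize` and `toy_decide`. In particular, `toy_decide` is a hand-written stand-in for a trained SVM so that the example runs without external libraries.

```python
from typing import Callable, Dict, List, Tuple

# A decision function maps the features of one discontinuity character to
# (split_before, split_after). In the paper's setting this role would be
# played by a trained classifier; here it is a hypothetical interface.
Decision = Callable[[Dict[str, str]], Tuple[bool, bool]]


def is_discontinuity(ch: str) -> bool:
    """Assumed definition: neither alphanumeric nor whitespace."""
    return not ch.isalnum() and not ch.isspace()


def context_features(text: str, i: int, window: int = 3) -> Dict[str, str]:
    """Characters around position i, keyed by relative offset (c-3 .. c+3)."""
    feats = {}
    for off in range(-window, window + 1):
        j = i + off
        feats[f"c{off:+d}"] = text[j] if 0 <= j < len(text) else "<pad>"
    return feats


def tokenize(text: str, decide: Decision) -> List[str]:
    """Insert spaces around discontinuity characters as `decide` dictates."""
    out = []
    for i, ch in enumerate(text):
        if is_discontinuity(ch):
            before, after = decide(context_features(text, i))
            if before:
                out.append(" ")
            out.append(ch)
            if after:
                out.append(" ")
        else:
            out.append(ch)
    return "".join(out).split()


def _is_alnum(c: str) -> bool:
    # The "<pad>" marker has length 5, so it is never treated as a character.
    return len(c) == 1 and c.isalnum()


def toy_decide(feats: Dict[str, str]) -> Tuple[bool, bool]:
    """Toy stand-in for a classifier: split where punctuation touches a word,
    except when the character sits between two alphanumerics (e.g. "tl;dr")."""
    prev_c, next_c = feats["c-1"], feats["c+1"]
    inside_word = _is_alnum(prev_c) and _is_alnum(next_c)
    return (_is_alnum(prev_c) and not inside_word,
            _is_alnum(next_c) and not inside_word)


print(tokenize("gr8 news!!! tl;dr later", toy_decide))
# ['gr8', 'news', '!!!', 'tl;dr', 'later']
```

The toy rule handles punctuation runs and word-internal characters but would wrongly split "dr. Fred" into three tokens. Cases like this are where a classifier trained on annotated examples, as the abstract describes, would replace `toy_decide` while `tokenize` stays unchanged.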

