Authors: Gustavo Laboreiro, Luís Sarmento, Jorge Teixeira, Eugénio Oliveira
Keywords:
Abstract: The automatic processing of microblogging messages may be problematic, even in the case of very elementary operations such as tokenization. The problems arise from the use of non-standard language, including media-specific words (e.g. "2day", "gr8", "tl;dr", "loool"), emoticons (e.g. "(o_o)", "(=^-^=)"), non-standard letter casing (e.g. "dr. Fred") and unusual punctuation (e.g. ".... ..", "!??!!!?", ",,,"). Additionally, spelling errors are abundant (e.g. "I;m"), and we can frequently find more than one language (with different tokenization requirements) in the same short message. To be effective in such an environment, manually developed rule-based tokenizer systems have to deal with many conditions and exceptions, which makes them difficult to build and maintain. We present a text classification approach for tokenizing Twitter messages, which addresses complex cases successfully and is relatively simple to set up and maintain. For that, we created a corpus consisting of 2500 manually tokenized Twitter messages (a task that is simple for human annotators) and we trained an SVM classifier for separating tokens at certain discontinuity characters. For comparison, we created a baseline rule-based system designed specifically for dealing with typical problematic situations. Results show that we can achieve F-measures of 96% with the classification-based approach, much above the performance obtained by the baseline rule-based tokenizer (85%). Also, subsequent analysis allowed us to identify typical tokenization errors, which we show can be partially solved by adding some additional descriptive examples to the training corpus and re-training the classifier.
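
The idea of treating tokenization as a per-character classification problem can be illustrated with a small Python sketch. It is not the authors' implementation: the definition of a "discontinuity character" (any character that is neither alphanumeric nor whitespace), the character-window features, and all function names below are assumptions made for illustration. The sketch only enumerates the candidate decision points and shows how split decisions would be turned into tokens; in the approach described in the abstract, those decisions would come from a trained SVM classifier.

```python
def is_candidate(ch):
    # Assumption: a "discontinuity character" is anything that is
    # neither alphanumeric nor whitespace (punctuation, symbols).
    return not ch.isalnum() and not ch.isspace()


def decision_points(text):
    """Return boundary offsets where a split/no-split decision is needed.

    A boundary b sits between text[b-1] and text[b]. It is a decision
    point if at least one neighbour is a candidate character and neither
    neighbour is whitespace (whitespace already separates tokens).
    """
    points = []
    for b in range(1, len(text)):
        left, right = text[b - 1], text[b]
        if left.isspace() or right.isspace():
            continue
        if is_candidate(left) or is_candidate(right):
            points.append(b)
    return points


def window_features(text, b, size=3):
    """Hypothetical features: the `size` characters on each side of b."""
    feats = {}
    for k in range(1, size + 1):
        feats[f"L{k}"] = text[b - k] if b - k >= 0 else "<s>"
        feats[f"R{k}"] = text[b + k - 1] if b + k - 1 < len(text) else "</s>"
    return feats


def apply_decisions(text, split_at):
    """Insert a space at every boundary in split_at, then split on whitespace."""
    out = []
    for i, ch in enumerate(text):
        if i in split_at:
            out.append(" ")
        out.append(ch)
    return "".join(out).split()


message = "I;m gr8, thx"
points = decision_points(message)
# A classifier would label each point as "split" or "keep"; here the
# labels are written by hand: keep "I;m" whole, detach the comma.
tokens = apply_decisions(message, {7})
```

With the hand-written decision above, the misspelled "I;m" stays a single token while the comma after "gr8" is separated, which is the kind of case a fixed punctuation rule tends to get wrong.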