Towards Shared Datasets for Normalization Research

Authors: Orphee De Clercq, Bart Desmet, Sarah Schulz, Veronique Hoste

Abstract: In this paper we present a Dutch and English dataset that can serve as a gold standard for evaluating text normalization approaches. With the combination of text messages, message board posts and tweets, these datasets represent a variety of user-generated content. All data was manually normalized to its standard form using newly developed guidelines. We perform automatic lexical normalization experiments on these datasets using statistical machine translation techniques. We focus on both the word and the character level and find that we can improve the BLEU score by ca. 20% for both languages. In order for this user-generated content to be released publicly to the research community, some issues first need to be resolved. These are discussed in closer detail by focussing on the current legislation and by investigating previous similar data collection projects. With this discussion we hope to shed some light on the various difficulties researchers are facing when trying to share social media data.
