Authors: Orphee De Clercq, Bart Desmet, Sarah Schulz, Veronique Hoste
DOI:
Keywords:
Abstract: In this paper we present a Dutch and English dataset that can serve as a gold standard for evaluating text normalization approaches. With the combination of text messages, message board posts and tweets, these datasets represent a variety of user generated content. All data was manually normalized to their standard form using newly-developed guidelines. We perform automatic lexical normalization experiments on these datasets using statistical machine translation techniques. We focus on both the word and character level and find that we can improve the BLEU score with ca. 20% for both languages. In order for this user generated content to be released publicly to the research community, some issues first need to be resolved. These are discussed in closer detail by focussing on the current legislation and by investigating previous similar data collection projects. With this discussion we hope to shed some light on the various difficulties researchers are facing when trying to share social media data.