Comparing Rule-based and SMT-based Spelling Normalisation for English Historical Texts

Authors: Eva Pettersson, Gerold Schneider, Michael Percillier

Keywords: Preprocessor, Drag and drop, Spelling, Natural language processing, Machine translation, Computational linguistics, Computer science, Parsing, Register (sociolinguistics), Rule-based system, Artificial intelligence

Abstract: To be able to use existing natural language processing tools for analysing historical text, an important preprocessing step is spelling normalisation, converting the original spelling to present-day spelling, before applying tools such as taggers and parsers. In this paper, we compare a probabilistic, language-independent approach to spelling normalisation based on statistical machine translation (SMT) techniques, to a rule-based system combining dictionary lookup with rules and non-probabilistic weights. The rule-based system reaches the best accuracy, up to 94% precision at 74% recall, while the SMT system improves each tested period.
