Comparing Rule-based and SMT-based Spelling Normalisation for English Historical Texts

Authors: Eva Pettersson, Gerold Schneider, Michael Percillier

DOI: 10.5167/UZH-137462

Keywords: Preprocessor, Drag and drop, Spelling, Natural language processing, Machine translation, Computational linguistics, Computer science, Parsing, Register (sociolinguistics), Rule-based system, Artificial intelligence

Abstract: To be able to use existing natural language processing tools for analysing historical text, an important preprocessing step is spelling normalisation, converting the original spelling to present-day spelling, before applying tools such as taggers and parsers. In this paper, we compare a probabilistic, language-independent approach to spelling normalisation based on statistical machine translation (SMT) techniques with a rule-based system combining dictionary lookup with rules and non-probabilistic weights. The rule-based system reaches the best accuracy, up to 94% precision at 74% recall, while the SMT system improves each tested period.

References (13)
Nicola Bertoldi, Marcello Federico, Mauro Cettolo. IRSTLM: an open source toolkit for handling large scale language models. Conference of the International Speech Communication Association, pp. 1618-1621 (2008).
Paul Rayson, Jonathan Culpeper, Dawn Archer, Alistair Baron, Nicholas Smith. Tagging the Bard: Evaluating the accuracy of a modern POS tagger on Early Modern English corpora. UCREL (2007).
Martin Durrell, Paul Bennett, Richard J. Whitt, Silke Scheible. Evaluating an 'off-the-shelf' POS-tagger on Early Modern German text. Meeting of the Association for Computational Linguistics, pp. 19-23 (2011).
Jörg Tiedemann. Character-based PSMT for Closely Related Languages. Proceedings of the 13th Annual Conference of the European Association for Machine Translation, pp. 12-19 (2009).
Yves Scherrer, Tomaž Erjavec. Modernizing historical Slovene words with character-based SMT. Meeting of the Association for Computational Linguistics, pp. 58-62 (2013).
Jörg Tiedemann, Eva Pettersson, Beáta Megyesi. An SMT Approach to Automatic Annotation of Historical Text. Proceedings of the Workshop on Computational Historical Linguistics at NODALIDA 2013, May 22-24, 2013, Oslo, Norway. NEALT Proceedings Series 18, pp. 54-69 (2013).
Vincent J. Della Pietra, Stephen A. Della Pietra, Robert L. Mercer, Peter F. Brown. The mathematics of statistical machine translation: parameter estimation. Computational Linguistics, vol. 19, pp. 263-311 (1993).
Hrafn Loftsson. Tagging Icelandic text: A linguistic rule-based approach. Nordic Journal of Linguistics, vol. 31, pp. 47-72 (2008). DOI: 10.1017/S0332586508001820
Philipp Koehn, Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra Constantin, Evan Herbst, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran. Moses: Open Source Toolkit for Statistical Machine Translation. Meeting of the Association for Computational Linguistics, pp. 177-180 (2007). DOI: 10.3115/1557769.1557821