An Unsupervised method for OCR Post-Correction and Spelling Normalisation for Finnish

Authors: Mika Hämäläinen, Simon Hengchen, Quan Duong

Keywords: Optical character recognition, Computer science, Character (computing), Artificial intelligence, Natural language processing, Spelling, Source code, Process (computing), Machine translation

Abstract: Historical corpora are known to contain errors introduced by the OCR (optical character recognition) methods used in the digitization process, and these errors are often said to degrade the performance of NLP systems. Correcting these errors manually is a time-consuming process, and a great part of the automatic approaches have relied on rules or supervised machine learning. We build on previous work on fully unsupervised extraction of parallel data to train a character-based sequence-to-sequence NMT (neural machine translation) model to conduct OCR error correction, a method designed for English, and adapt it to Finnish by proposing solutions that take the rich morphology of the language into account. Our new method shows increased performance while remaining fully unsupervised, with the added benefit of spelling normalisation. The source code and models are available on GitHub and Zenodo.
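The abstract does not spell out how the parallel data is extracted, so the following is only a minimal sketch of the general idea, not the authors' pipeline. It assumes a simplified stand-in: each out-of-lexicon token is paired with its closest lexicon word by edit distance, and both sides are then split into characters as input for a character-level sequence-to-sequence model. The names `levenshtein`, `extract_pairs`, `to_char_sequence`, and the `max_distance` threshold are all hypothetical.

```python
from typing import Iterable, List, Set, Tuple


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def extract_pairs(tokens: Iterable[str], lexicon: Set[str],
                  max_distance: int = 2) -> List[Tuple[str, str]]:
    """Pair each out-of-lexicon token with its nearest lexicon word.

    Assumption: a known-word lexicon stands in for whatever signal the
    real method uses to decide which word an OCR error corresponds to.
    """
    pairs = []
    candidates = sorted(lexicon)  # fixed order keeps tie-breaking stable
    for token in tokens:
        if token in lexicon:
            continue  # already a known word, nothing to correct
        scored = [(levenshtein(token, word), word) for word in candidates]
        distance, best = min(scored)
        if distance <= max_distance:
            pairs.append((token, best))
    return pairs


def to_char_sequence(word: str) -> str:
    """Split a word into space-separated characters for a char-level model."""
    return " ".join(word)


if __name__ == "__main__":
    lexicon = {"talo", "kirja", "koulu"}
    tokens = ["talo", "ta1o", "kiria", "xyzzyq"]
    for noisy, clean in extract_pairs(tokens, lexicon):
        print(to_char_sequence(noisy), "->", to_char_sequence(clean))
```

Pairs produced this way need no manual annotation, which is what keeps the setup unsupervised. A morphologically rich language such as Finnish has far more surface forms than a plain word list covers, so a lexicon lookup this naive would miss many valid inflected words.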
