An Unsupervised method for OCR Post-Correction and Spelling Normalisation for Finnish

Authors: Mika Hämäläinen, Simon Hengchen, Quan Duong

DOI:

Keywords: Optical character recognition, Computer science, Character (computing), Artificial intelligence, Natural language processing, Spelling, Source code, Process (computing), Machine translation

Abstract: Historical corpora are known to contain errors introduced by the OCR (optical character recognition) methods used in the digitization process, and these errors are often said to degrade the performance of NLP systems. Correcting them manually is a time-consuming process, and a large part of the automatic approaches have relied on rules or supervised machine learning. We build on previous work on fully unsupervised extraction of parallel data to train a character-based sequence-to-sequence NMT (neural machine translation) model to conduct OCR error correction. That work was designed for English; we adapt it to Finnish by proposing solutions that take the rich morphology of the language into account. Our new method shows increased performance while remaining unsupervised, with the added benefit of spelling normalisation. The source code and models are available on GitHub and Zenodo.
