Authors: Mika Hämäläinen, Simon Hengchen, Quan Duong
DOI:
Keywords: Optical character recognition, Computer science, Character (computing), Artificial intelligence, Natural language processing, Spelling, Source code, Process (computing), Machine translation
Abstract: Historical corpora are known to contain errors introduced by the OCR (optical character recognition) methods used in the digitization process, often said to be degrading the performance of NLP systems. Correcting these errors manually is a time-consuming process, and a great part of the automatic approaches have been relying on rules or supervised machine learning. We build on previous work on fully automatic unsupervised extraction of parallel data to train a character-based sequence-to-sequence NMT (neural machine translation) model to conduct OCR error correction designed for English, and adapt it to Finnish by proposing solutions that take the rich morphology of the language into account. Our new method shows increased performance while remaining fully unsupervised, with the added benefit of spelling normalisation. The source code and models are available on GitHub and Zenodo.