Can Morphological Analyzers Improve the Quality of Optical Character Recognition?

Authors: Miikka Silfverberg, Jack Rueter

DOI: 10.7557/5.3467

Abstract: Optical Character Recognition (OCR) can substantially improve the usability of digitized documents. Language modeling using word lists is known to improve OCR quality for English. For morphologically rich languages, however, even large word lists do not reach high coverage on unseen text. Morphological analyzers offer a more sophisticated approach, which is useful in many language processing applications. This paper investigates language modeling in the open-source OCR engine Tesseract using morphological analyzers. We present experiments on two Uralic languages, Finnish and Erzya. According to our experiments, word lists may still be superior to morphological analyzers, even for languages with rich morphology. Our error analysis indicates that morphological analyzers can cause a large amount of real-word errors.
