Authors: Miikka Silfverberg, Jack Rueter
DOI: 10.7557/5.3467
Keywords:
Abstract: Optical Character Recognition (OCR) can substantially improve the usability of digitized documents. Language modeling using word lists is known to improve OCR quality for English. For morphologically rich languages, however, even large word lists do not reach high coverage on unseen text. Morphological analyzers offer a more sophisticated approach, which is useful in many language processing applications. This paper investigates language modeling in the open-source OCR engine Tesseract using morphological analyzers. We present experiments on two Uralic languages, Finnish and Erzya. According to our experiments, word lists may still be superior to morphological analyzers in OCR, even for languages with rich morphology. Our error analysis indicates that morphological analyzers can cause a large amount of real-word OCR errors.