Authors: Felix Stahlberg, Stephan Vogel
DOI: 10.1109/DAS.2016.81
Keywords: Intelligent word recognition, Natural language processing, Graphical user interface, Document processing, Web API, Tesseract, Intelligent character recognition, Computer science, Artificial intelligence, Optical character recognition, Character (computing)
Abstract: Nowadays, commercial optical character recognition (OCR) software achieves very high accuracy on high-quality scans of modern Arabic documents. However, a large fraction of heritage collections in libraries is usually more challenging, e.g. consisting of typewritten documents, early prints, and historical manuscripts. In this paper, we present our end-user oriented QATIP system for OCR on such documents. The system is based on the Kaldi toolkit and sophisticated text image normalization. This paper contains two main contributions: First, we describe the QATIP interface, which consists of both a graphical user interface for adding and monitoring jobs and a web API for automated access. Second, we suggest novel approaches for language modelling and ligature modelling for continuous Arabic OCR. We test QATIP on an early print and a historical manuscript and report substantial improvements: a 12.6% error rate with QATIP compared to 51.8% with the best OCR product in our experimental setup (Tesseract).