QATIP -- An Optical Character Recognition System for Arabic Heritage Collections in Libraries

Authors: Felix Stahlberg, Stephan Vogel

DOI: 10.1109/DAS.2016.81

Keywords: Intelligent word recognition; Natural language processing; Graphical user interface; Document processing; Web API; Tesseract; Intelligent character recognition; Computer science; Artificial intelligence; Optical character recognition; Character (computing)

Abstract: Nowadays, commercial optical character recognition (OCR) software achieves very high accuracy on high-quality scans of modern Arabic documents. However, a large fraction of the heritage collections in libraries is usually more challenging, e.g. consisting of typewritten documents, early prints, and historical manuscripts. In this paper, we present our end-user oriented QATIP system for OCR on such documents. The system is based on the Kaldi toolkit and sophisticated text image normalization. This paper contains two main contributions: First, we describe the interface, which consists of both a graphical user interface for adding and monitoring jobs and a web API for automated access. Second, we suggest novel approaches to language modelling and ligature modelling for continuous OCR. We test the system on an early print and a historical manuscript and report substantial improvements: a 12.6% error rate with QATIP compared to 51.8% with the best product in our experimental setup (Tesseract).
