QATIP -- An Optical Character Recognition System for Arabic Heritage Collections in Libraries

Authors: Felix Stahlberg, Stephan Vogel

DOI: 10.1109/DAS.2016.81

Keywords: Intelligent word recognition, Natural language processing, Graphical user interface, Document processing, Web API, Tesseract, Intelligent character recognition, Computer science, Artificial intelligence, Optical character recognition, Character (computing)

Abstract: Nowadays, commercial optical character recognition (OCR) software achieves very high accuracy on high-quality scans of modern Arabic documents. However, a large fraction of heritage collections in libraries is usually more challenging, e.g. consisting of typewritten documents, early prints, and historical manuscripts. In this paper, we present our end-user oriented QATIP system for OCR in such documents. The recognition is based on the Kaldi toolkit and sophisticated text image normalization. This paper contains two main contributions: First, we describe the interface, which consists of both a graphical user interface for adding and monitoring jobs and a web API for automated access. Second, we suggest novel approaches for language modelling and ligature modelling for continuous OCR. We test QATIP on an early print and a historical manuscript and report substantial improvements: a 12.6% error rate with QATIP compared to 51.8% with the best product in our experimental setup (Tesseract).

Similar articles (4)

Arabic word decomposition techniques for offline Arabic text transcription

Arabic calligraphy, typewritten and handwritten using optical character recognition (OCR) system

Automatic processing of Historical Arabic Documents: A comprehensive Survey

Arabic Documents Information Retrieval for Printed, Handwritten, and Calligraphy Image
