Enabling search over large collections of Telugu document images – an automatic annotation based approach

Authors: K. Pramod Sankar, C. V. Jawahar

DOI: 10.1007/11949619_75

Abstract: For the first time, search is enabled over a massive collection of 21 million word images from digitized document images. This work advances the state of the art on multiple fronts: i) Indian language document images are made searchable by textual queries, ii) interactive content-level access is provided to document images for search and retrieval, iii) a novel recognition-free approach, which does not require an OCR, is adapted and validated, iv) a suite of image processing and pattern classification algorithms is proposed to efficiently automate the process, and v) the scalability of the solution is demonstrated over a large collection of 500 digitised books consisting of 75,000 pages. Character recognition based approaches yield poor results for developing search engines for Indian language document images, due to the complexity of the script and the poor quality of the documents. Recognition-free approaches, like word-spotting, are not directly scalable to large collections, due to the computational complexity of matching images in the feature space. For example, if it requires 1 ms to match two images, the retrieval of documents for a single query, from a large collection like ours, would require close to a day's time. In this paper we propose a novel automatic annotation based approach to provide a textual description of document images. With a one-time offline computational effort, we are able to build a text-based retrieval system over annotated images. This system has an interactive response time of about 0.01 second. However, we pay the price in the form of a massive offline computation, which is performed on a cluster of 35 computers, for about a month. Our procedure is highly automatic, requiring minimal human intervention.
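
The trade-off described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the toy annotations, and the dictionary-based inverted index are all assumptions made for illustration. It contrasts the cost of comparing a query against every stored word image with a lookup over text annotations that were computed offline. Under the simplest reading (one comparison per stored word image, 1 ms each), the scan alone already takes hours per query; the abstract's own estimate is close to a day.

```python
from collections import defaultdict


def linear_scan_seconds(num_word_images, seconds_per_match=1e-3):
    """Time to compare one query against every stored word image once."""
    return num_word_images * seconds_per_match


def build_index(annotations):
    """Map each annotation string to the sorted ids of word images that carry it.

    `annotations` is a hypothetical dict {word_image_id: annotated_text},
    standing in for the output of the offline annotation step.
    """
    index = defaultdict(list)
    for image_id, text in annotations.items():
        index[text].append(image_id)
    return {text: sorted(ids) for text, ids in index.items()}


def search(index, query):
    """Return the ids of word images annotated with `query` (empty if none)."""
    return index.get(query, [])


# Toy data, invented for illustration only.
annotations = {0: "telugu", 1: "search", 2: "telugu", 3: "image"}
index = build_index(annotations)

scan_hours = linear_scan_seconds(21_000_000) / 3600
print(f"Single-pass scan over 21 million word images: {scan_hours:.1f} hours")
print("Ids annotated 'telugu':", search(index, "telugu"))
```

Once the annotations exist, a query costs a single dictionary lookup regardless of collection size, which is consistent with the interactive response time the abstract reports for the text-based system.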
