Authors: K. Pramod Sankar, C. V. Jawahar
DOI: 10.1007/11949619_75
Keywords:
Abstract: For the first time, search is enabled over a massive collection of 21 million word images from digitized document images. This work advances the state of the art on multiple fronts: (i) Indian language document images are made searchable by textual queries; (ii) interactive content-level access is provided to document images for search and retrieval; (iii) a novel recognition-free approach, which does not require an OCR, is adapted and validated; (iv) a suite of image processing and pattern classification algorithms is proposed to efficiently automate the process; and (v) the scalability of the solution is demonstrated over a large collection of 500 digitized books consisting of 75,000 pages. Character-recognition-based approaches yield poor results for developing search engines for Indian language document images, due to the complexity of the script and the poor quality of the documents. Recognition-free approaches, like word-spotting, are not directly scalable to large collections, due to the computational complexity of matching images in the feature space. For example, if it requires 1 ms to match two images, the retrieval of documents for a single query from a large collection like ours would require close to a day's time. In this paper we propose a novel automatic annotation based approach to provide a textual description of document images. With a one-time offline computational effort, we are able to build a text-based retrieval system over annotated images. This system has an interactive response time of about 0.01 second. However, we pay the price in the form of massive offline computation, which is performed on a cluster of 35 computers for about a month. Our procedure is highly automatic, requiring minimal human intervention.
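The trade-off described in the abstract, a costly offline annotation pass in exchange for fast text-based lookup at query time, can be illustrated with a minimal sketch. This is not the authors' implementation: the image identifiers, the text labels, and the helper names `build_index` and `search` are all hypothetical, and the annotation step itself (which the abstract does not detail) is assumed to have already produced one text label per word image.

```python
from collections import defaultdict

# Hypothetical annotated word images as (image_id, text_label) pairs.
# The labels are assumed to come from an earlier offline annotation step;
# how that step works is not described in the abstract.
annotated = [
    ("book1/p1/w1", "rama"),
    ("book1/p1/w2", "sita"),
    ("book2/p7/w3", "rama"),
]


def build_index(annotated_words):
    """Offline step: map each text label to the word images that carry it."""
    index = defaultdict(list)
    for image_id, label in annotated_words:
        index[label].append(image_id)
    return dict(index)


def search(index, query):
    """Online step: one dictionary lookup, with no image-to-image matching."""
    return index.get(query, [])


index = build_index(annotated)
print(search(index, "rama"))
```

Under these assumptions, query cost no longer depends on comparing the query against every word image in feature space, which is the bottleneck the abstract attributes to word-spotting.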