An Image Based Approach for Content Analysis in Document Collections

Authors: Reinhold Huber-Mörk, Alexander Schindler

DOI: 10.1007/978-3-642-41939-3_27

Abstract: We consider the task of content-based analysis and categorization in large-scale historical book scanning projects. Mixed content, deprecated language, noise and unexpected distortions suggest an image-based approach. Keypoint extractors combined with a bag of features approach are applied to scanned text documents. In order to incorporate spatial information into the bag of features approach, we compare three methods for spatial verification. An approach based on comparison of statistical properties of local features, such as size, orientation and scale, showed comparable quality while being computationally much more efficient. Cluster analysis delivers groups of pages characterized by common properties; in particular, duplicated page content is detected with high reliability.
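
The bag of features idea named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the two-dimensional "descriptors", the fixed four-word vocabulary, the cosine similarity measure and all function names below are assumptions chosen for illustration. A real pipeline would extract keypoint descriptors from the page images and learn the vocabulary by clustering. Each page is reduced to a normalized histogram of visual-word occurrences, and pages with near-identical histograms become duplicate candidates.

```python
import math
import random


def nearest_word(descriptor, vocabulary):
    """Return the index of the vocabulary entry closest to the descriptor."""
    return min(range(len(vocabulary)),
               key=lambda i: math.dist(descriptor, vocabulary[i]))


def bag_of_features(descriptors, vocabulary):
    """Build an L1-normalized histogram of visual-word counts for one page."""
    hist = [0.0] * len(vocabulary)
    for d in descriptors:
        hist[nearest_word(d, vocabulary)] += 1.0
    total = sum(hist)
    return [h / total for h in hist] if total else hist


def cosine_similarity(a, b):
    """Cosine similarity between two histograms (0.0 if either is empty)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def synthetic_page(rng, word_weights, vocabulary, n=200, noise=0.05):
    """Hypothetical stand-in for keypoint descriptors extracted from a scan:
    noisy copies of vocabulary entries drawn with the given weights."""
    words = rng.choices(range(len(vocabulary)), weights=word_weights, k=n)
    return [[c + rng.gauss(0, noise) for c in vocabulary[w]] for w in words]


rng = random.Random(0)
# Assumed toy vocabulary; in practice it is learned from many descriptors.
vocabulary = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

page_a = synthetic_page(rng, [8, 1, 1, 0], vocabulary)      # original page
page_a_dup = synthetic_page(rng, [8, 1, 1, 0], vocabulary)  # rescan of it
page_b = synthetic_page(rng, [0, 1, 1, 8], vocabulary)      # different page

h_a, h_dup, h_b = (bag_of_features(p, vocabulary)
                   for p in (page_a, page_a_dup, page_b))
sim_dup = cosine_similarity(h_a, h_dup)
sim_other = cosine_similarity(h_a, h_b)
print(f"duplicate pair: {sim_dup:.3f}, unrelated pair: {sim_other:.3f}")
```

A plain histogram discards where on the page each feature occurred, which is why the abstract adds a spatial verification step on top of it. That step is not shown here.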
