Automatic classification of defect page content in scanned document collections

Authors: Reinhold Huber-Mork, Alexander Schindler

DOI: 10.1109/ISPA.2013.6703735

Abstract: We describe a method for defect detection and classification in collections of digital images of historical book documents. Undistorted text from various books, characterized by strong variation in language, font and layout properties, is discriminated from typical errors in digitization processes such as occlusion by an operator's hand, a visible book edge or image warping artifacts. A bag of local features approach is compared to a global characterization of the location, size and orientation of detected keypoints. Machine learning is used to discriminate between those classes. Results for the different features are compared on the task of discriminating between undistorted text and the major distortion class, which is the presence of an operator's hand, where classification based on histograms derived from local features achieved a cross-validation accuracy better than 99 percent on a representative data set. Taking into account up to three classes of distortions still resulted in accuracies beyond 90 percent using visual histograms as classifier input.
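The bag of local features idea mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not state which keypoint detector, descriptor, codebook size or classifier the authors used, so everything below is an assumption for illustration: the two-dimensional toy descriptors, the three-entry `codebook`, and the helper names `nearest_codeword` and `visual_histogram` are hypothetical. The sketch only shows how a set of local descriptors from one page could be turned into a fixed-length histogram suitable as classifier input.

```python
import math


def nearest_codeword(descriptor, codebook):
    """Return the index of the codebook entry closest to the descriptor (Euclidean distance)."""
    best_index, best_dist = 0, math.inf
    for index, word in enumerate(codebook):
        dist = math.dist(descriptor, word)
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


def visual_histogram(descriptors, codebook):
    """Build an L1-normalised histogram of codeword assignments for one page image."""
    counts = [0] * len(codebook)
    for descriptor in descriptors:
        counts[nearest_codeword(descriptor, codebook)] += 1
    total = sum(counts)
    # An image without keypoints yields an all-zero histogram.
    return [c / total for c in counts] if total else [0.0] * len(codebook)


# Toy data (assumed): a real codebook would typically come from clustering
# many local descriptors, and descriptors would have far more dimensions.
codebook = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
page_descriptors = [(0.1, -0.1), (0.9, 1.2), (1.1, 0.8), (4.7, 5.2)]

histogram = visual_histogram(page_descriptors, codebook)
print(histogram)  # [0.25, 0.5, 0.25]
```

Histograms of this kind, one per page, would then be passed to a classifier and evaluated with cross-validation; that step is omitted here because the abstract does not name the learning method.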
