Online template matching over a stream of digitized documents

Authors: Michael Stockerl, Christoph Ringlstetter, Matthias Schubert, Eirini Ntoutsi, Hans-Peter Kriegel

DOI: 10.1145/2791347.2791354

Abstract: Although we have been living in the information age for decades, paperwork is still a tedious part of everybody's life. Assistance systems that implement techniques of digitization and document understanding may offer considerable reductions of time and effort for users. A large portion of paper documents, like invoices, delivery receipts or admonitions, are based on a fixed, company-specific template and therefore exhibit a high degree of similarity. In this work, we propose a template extraction method over a stream of incoming documents and a template allocation method for assigning new instances from the stream to the most suitable templates. Our method employs text augmented by layout information to represent the digital image of a document. Document similarity is assessed with respect to both the textual and the layout parts of the document; the matching terms contribute according to their distance to the query terms. To be more robust against distortions due to the digitization process, the templates are not static; rather, they are maintained in an online fashion based on their assigned documents. Experiments on real data show that the combination of text and layout, together with the continuous template adaptation through updates, improves the identification quality over earlier proposed methods.
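The pipeline outlined in the abstract (score a document against each template by text and layout, assign it to the best match or open a new template, then update the chosen template) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `Template`, `similarity`, `update` and `assign`, the exponential distance weighting, the `scale` and `threshold` parameters, and the representation of a document as a mapping from term to normalized page position are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from math import exp, hypot


@dataclass
class Template:
    # Hypothetical representation: term -> [mean x, mean y, count],
    # with positions in normalized page coordinates (0..1).
    terms: dict = field(default_factory=dict)


def similarity(doc, template, scale=0.1):
    """Score in [0, 1]: shared terms count more the closer their positions.

    The exponential decay is an assumed weighting, not taken from the paper.
    """
    if not doc or not template.terms:
        return 0.0
    score = 0.0
    for term, (x, y) in doc.items():
        if term in template.terms:
            tx, ty, _ = template.terms[term]
            score += exp(-hypot(x - tx, y - ty) / scale)
    return score / max(len(doc), len(template.terms))


def update(template, doc):
    """Fold a document into the template with a per-term running mean."""
    for term, (x, y) in doc.items():
        entry = template.terms.setdefault(term, [x, y, 0])
        entry[2] += 1
        entry[0] += (x - entry[0]) / entry[2]
        entry[1] += (y - entry[1]) / entry[2]


def assign(doc, templates, threshold=0.5):
    """Assign doc to the best template, or create a new one; return its index."""
    best_idx, best_score = -1, 0.0
    for idx, template in enumerate(templates):
        score = similarity(doc, template)
        if score > best_score:
            best_idx, best_score = idx, score
    if best_idx < 0 or best_score < threshold:
        templates.append(Template())
        best_idx = len(templates) - 1
    update(templates[best_idx], doc)
    return best_idx


# Toy stream: two scans of the same invoice layout, then a different document.
templates = []
stream = [
    {"invoice": (0.10, 0.10), "total": (0.80, 0.90)},
    {"invoice": (0.12, 0.10), "total": (0.80, 0.88)},
    {"delivery": (0.10, 0.10), "receipt": (0.50, 0.50)},
]
labels = [assign(doc, templates) for doc in stream]
```

In this toy stream the second document lands in the first document's template despite small position shifts, and the stored term positions move toward the mean of the assigned documents, which is the kind of online adaptation the abstract motivates as a defence against digitization distortions.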
