Automatic indexing of scanned documents: a layout-based approach

Authors: Daniel Esser, Daniel Schuster, Klemens Muthmann, Michael Berger, Alexander Schill

DOI: 10.1117/12.908542

Keywords: Full text search, Automatic indexing, Information extraction, Computer science, Well-formed document, Index term, Document clustering, Information retrieval, Index (publishing), Document management system

Abstract: Archiving official written documents such as invoices, reminders and account statements in business and private areas is becoming more important. Creating appropriate index entries for document archives, such as the sender's name, creation date or document number, is tedious manual work. We present a novel approach to automatic indexing of documents based on generic positional extraction of index terms. For this purpose we apply the knowledge of document templates stored in a common full-text search index to find positions of index terms that were successfully extracted in the past.
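The idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that each archived document keeps its OCR words with page-relative coordinates plus the positions at which index values were found, and it uses a simple Jaccard word overlap as a stand-in for the full-text search index. All names (`Word`, `IndexedDocument`, `index_document`, the `tolerance` parameter) are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Word:
    text: str
    x: float  # left edge, relative to page width (0..1)
    y: float  # top edge, relative to page height (0..1)


@dataclass
class IndexedDocument:
    words: List[Word]
    # Index field name -> position where its value was extracted in the past.
    fields: Dict[str, Tuple[float, float]]


def jaccard(a, b) -> float:
    """Word-set overlap; an assumed stand-in for a full-text similarity query."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 0.0


def most_similar(doc: List[Word], archive: List[IndexedDocument]) -> IndexedDocument:
    """Pick the archived document whose vocabulary overlaps most with doc."""
    tokens = [w.text.lower() for w in doc]
    return max(archive, key=lambda d: jaccard(tokens, (w.text.lower() for w in d.words)))


def extract_at(doc: List[Word], pos: Tuple[float, float], tolerance: float = 0.02) -> Optional[str]:
    """Return the word closest to pos, or None if it lies outside the tolerance."""
    if not doc:
        return None
    px, py = pos
    best = min(doc, key=lambda w: (w.x - px) ** 2 + (w.y - py) ** 2)
    if abs(best.x - px) <= tolerance and abs(best.y - py) <= tolerance:
        return best.text
    return None


def index_document(doc: List[Word], archive: List[IndexedDocument]) -> Dict[str, Optional[str]]:
    """Reuse the field positions of the most similar archived document."""
    if not archive:
        return {}
    template = most_similar(doc, archive)
    return {name: extract_at(doc, pos) for name, pos in template.fields.items()}


# Made-up example data: one archived invoice, one archived account statement.
archive = [
    IndexedDocument(
        words=[Word("ACME", 0.1, 0.1), Word("Invoice", 0.1, 0.2), Word("No.", 0.7, 0.2),
               Word("1001", 0.8, 0.2), Word("Total", 0.1, 0.9)],
        fields={"sender": (0.1, 0.1), "number": (0.8, 0.2)},
    ),
    IndexedDocument(
        words=[Word("Bank", 0.5, 0.05), Word("Account", 0.1, 0.2),
               Word("Statement", 0.3, 0.2), Word("2011-03", 0.8, 0.1)],
        fields={"sender": (0.5, 0.05), "date": (0.8, 0.1)},
    ),
]
new_doc = [Word("ACME", 0.1, 0.1), Word("Invoice", 0.1, 0.2), Word("No.", 0.7, 0.2),
           Word("1002", 0.805, 0.2), Word("Total", 0.1, 0.9)]
result = index_document(new_doc, archive)
```

In this sketch the new invoice matches the archived invoice by vocabulary, so the stored positions for `sender` and `number` are reused to read the values from the new page. A real system would likely need multi-word values, OCR noise handling and a ranked search over many templates, none of which is covered here.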
