Automatic indexing of scanned documents: a layout-based approach

Authors: Daniel Esser, Daniel Schuster, Klemens Muthmann, Michael Berger, Alexander Schill

Keywords: Full text search, Automatic indexing, Information extraction, Computer science, Well-formed document, Index term, Document clustering, Information retrieval, Index (publishing), Document management system

Abstract: Archiving official written documents such as invoices, reminders and account statements is becoming increasingly important in both business and private settings. Creating appropriate index entries for document archives, such as the sender's name, creation date or document number, is tedious manual work. We present a novel approach to automatic indexing of documents based on generic positional extraction of index terms. For this purpose, we apply knowledge of document templates stored in a common full-text search index to find the positions of index terms that were successfully extracted in the past.
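
The idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `Word` and `IndexedDocument` types, the `extract_fields` function and the `max_distance` threshold are hypothetical names introduced here, and a simple Jaccard word overlap stands in for the full-text search index that the abstract mentions. The sketch looks up the most similar previously indexed document and reads the words found at the positions where index terms were extracted before.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    text: str
    x: float  # left edge, as a fraction of page width (assumed convention)
    y: float  # top edge, as a fraction of page height (assumed convention)


@dataclass
class IndexedDocument:
    words: list[Word]
    # field name -> (x, y) position where the index term was found earlier
    field_positions: dict[str, tuple[float, float]]


def similarity(a: list[Word], b: list[Word]) -> float:
    """Jaccard overlap of word sets; a stand-in for a full-text query score."""
    set_a = {w.text.lower() for w in a}
    set_b = {w.text.lower() for w in b}
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def extract_fields(
    new_words: list[Word],
    archive: list[IndexedDocument],
    max_distance: float = 0.05,  # hypothetical tolerance for layout jitter
) -> dict[str, str]:
    """Reuse field positions from the most similar archived document."""
    if not new_words or not archive:
        return {}
    template = max(archive, key=lambda doc: similarity(new_words, doc.words))
    result: dict[str, str] = {}
    for name, (fx, fy) in template.field_positions.items():
        # Pick the word in the new document closest to the known position.
        nearest = min(new_words, key=lambda w: (w.x - fx) ** 2 + (w.y - fy) ** 2)
        distance = ((nearest.x - fx) ** 2 + (nearest.y - fy) ** 2) ** 0.5
        if distance <= max_distance:
            result[name] = nearest.text
    return result


# Illustrative data only: one invoice-like template and one unrelated document.
archive = [
    IndexedDocument(
        words=[Word("ACME", 0.1, 0.05), Word("Invoice", 0.1, 0.2),
               Word("No.", 0.6, 0.2), Word("1001", 0.7, 0.2)],
        field_positions={"sender": (0.1, 0.05), "number": (0.7, 0.2)},
    ),
    IndexedDocument(
        words=[Word("Globex", 0.5, 0.05), Word("Statement", 0.5, 0.3)],
        field_positions={"sender": (0.5, 0.05)},
    ),
]
new_words = [Word("ACME", 0.1, 0.05), Word("Invoice", 0.1, 0.2),
             Word("No.", 0.6, 0.2), Word("1002", 0.71, 0.2)]
extracted = extract_fields(new_words, archive)
```

In this toy example the new document shares most of its words with the first archived document, so that document's stored positions are used to read the sender and the number from the new page. A real system would also need OCR, a proper search index and handling for fields that move between documents, none of which is covered here.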
