Analysis of Documents Born Digital.

Authors: Jianying Hu, Ying Liu

Abstract: While traditional document analysis has focused on printed media, an increasingly large portion of the documents today are generated in digital form from the start. Such “documents born digital” range from plain text such as emails to more sophisticated forms such as PDF and Web documents. On the one hand, the existence of digital encoding eliminates the need for scanning, image processing, and character recognition in most situations (a notable exception being the prevalent use of text embedded in images within documents, as elaborated upon in section “Analysis of Text in Images”). On the other hand, many higher-level processing tasks remain, because almost all existing digital document encoding systems (such as HTML and PDF) are designed for display or printing for human consumption, not for machine-level information exchange and extraction. As such, a significant amount of processing is still required for automatic information extraction, indexing, and content repurposing, and many challenges exist in this process. This chapter describes in detail the key technologies for processing documents born digital, with a focus on PDF and Web document processing.
