Method and apparatus for detecting pagination constructs including a header and a footer in legacy documents

Authors: Jean-Luc Meunier, Herve Dejean

DOI:

Keywords: Line (text file), Similarity (network science), Natural language processing, Cluster analysis, Artificial intelligence, Data mining, Computer science, Header, Pagination

Abstract: A method and apparatus for identifying header/footer content of a document, in order to sequence text fragments comprising recognizable text blocks as derived from the document. The textual variability of text lines comprised of text blocks, including different kinds of text within a text line, is analyzed for assessment of textual variability. Header/footer zones are defined by text lines having low textual variability. An alternative embodiment identifies pagination constructs by comparing selected text-boxes for similarity and proximity and clustering text-boxes satisfying a predetermined value, wherein clustered text-boxes are deemed to comprise pagination constructs.
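The first embodiment in the abstract (header/footer zones as lines of low textual variability) can be illustrated with a minimal Python sketch. This is not the patented implementation: the function names, the digit-masking normalization, the ratio used as a variability measure, and the `threshold` and `max_lines` parameters are all assumptions chosen for illustration. Each page is assumed to be already available as a list of text lines in reading order.

```python
import re


def normalize(line):
    """Mask digit runs and collapse whitespace (an assumed normalization),
    so that 'Page 3' and 'Page 12' compare as the same text."""
    masked = re.sub(r"\d+", "#", line)
    return re.sub(r"\s+", " ", masked).strip().lower()


def variability(pages, position):
    """Ratio of distinct normalized lines to pages at one line position.

    position >= 0 counts from the top of each page; a negative position
    counts from the bottom. Values near 0 mean the line repeats across
    pages; 1.0 means every page has different text at that position.
    """
    texts = [
        normalize(page[position])
        for page in pages
        if -len(page) <= position < len(page)
    ]
    if not texts:
        return 1.0  # no evidence: treat as fully variable
    return len(set(texts)) / len(texts)


def header_footer_zones(pages, threshold=0.5, max_lines=5):
    """Return (header_line_count, footer_line_count).

    Lines are taken from the top (or bottom) of the page for as long as
    their variability stays at or below the hypothetical threshold.
    """
    header = 0
    while header < max_lines and variability(pages, header) <= threshold:
        header += 1
    footer = 0
    while footer < max_lines and variability(pages, -(footer + 1)) <= threshold:
        footer += 1
    return header, footer
```

On a document where every page starts with the same running title and ends with a page number, the sketch reports one header line and one footer line, because only those positions repeat once digits are masked. Very short pages can make the two zones overlap; a fuller treatment would need to handle that case.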

References (31)
Stefan Klink, Thomas Kieninger, Andreas Dengel, Document Structure Analysis Based on Layout and Textual Features, (2000)
Elizabeth Roberts, Gregory Thayer, Terry Gibson, System for generating a structured document, (2001)
Jean-Luc Meunier, Herve Dejean, Olivier Fambon, Method and apparatus for detecting a table of contents and reference determination, (2005)
Robert Haralick, Ihsin Phillips, Christian Shin, Dov Dori, David Doermann, Mitchell Buchman, David Ross, The representation of document structure: a generic object-process analysis, University of Maryland at College Park, (1995)
Jeffrey E. Young, Michael C. Wexler, Structure extraction on electronic documents, (1997)
Anthony Joseph Donnelly, David Keenan, Method and system for producing documents in a structured format, (1999)
Masaki Kyojima, Noriyuki Kamibayashi, Makoto Takeoka, Koji Kusumoto, Method and system for automatically generating logical structures of electronic documents, (1994)