Authors: Jean-Luc Meunier, Herve Dejean
DOI:
Keywords: Line (text file), Similarity (network science), Natural language processing, Cluster analysis, Artificial intelligence, Data mining, Computer science, Header, Pagination
Abstract: A method for identifying header/footer content of a document, in order to sequence text fragments comprising recognizable blocks as derived from the document. The textual variability of text lines comprised of the blocks, including different kinds of text within a line, is analyzed for an assessment of the variability. Header/footer zones are defined by text lines having low textual variability. An alternative embodiment identifies pagination constructs by comparing selected text-boxes for similarity and proximity and clustering text boxes satisfying a predetermined value, wherein the clustered text boxes are deemed to comprise pagination constructs.
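
The low-variability idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' method: the abstract gives no formula, so the variability score (distinct normalized texts divided by number of pages), the digit-collapsing normalization, and the names `normalize`, `variability`, `header_footer_zones`, `max_depth`, and `threshold` are all assumptions made for illustration. Pages are assumed to be lists of text lines in reading order.

```python
import re


def normalize(line: str) -> str:
    # Assumption: collapse digit runs so "Page 3" and "Page 4" count as the same text.
    return re.sub(r"\d+", "#", line.strip().lower())


def variability(pages, position):
    # pages: list of pages, each a list of text lines in reading order.
    # position: index from the top (0, 1, ...) or from the bottom (-1, -2, ...).
    texts = [normalize(p[position]) for p in pages if -len(p) <= position < len(p)]
    if not texts:
        return 1.0  # no evidence at this position: treat as fully variable
    # Hypothetical score: share of distinct texts among pages having this line.
    return len(set(texts)) / len(texts)


def header_footer_zones(pages, max_depth=3, threshold=0.5):
    # Grow the header zone downward from the top while variability stays low.
    header = 0
    for i in range(max_depth):
        if variability(pages, i) > threshold:
            break
        header = i + 1
    # Grow the footer zone upward from the bottom in the same way.
    footer = 0
    for i in range(1, max_depth + 1):
        if variability(pages, -i) > threshold:
            break
        footer = i
    return header, footer


pages = [
    ["Annual Report 2004", "Introduction text", "body one", "Page 1"],
    ["Annual Report 2004", "More different text", "body two", "Page 2"],
    ["Annual Report 2004", "Yet other text", "body three", "Page 3"],
]
print(header_footer_zones(pages))  # (1, 1): one header line, one footer line
```

With these sample pages, the first line is identical on every page and the last line differs only by its page number, so each gets a low score and is treated as a header or footer zone of depth one. The threshold of 0.5 is arbitrary and would need tuning on real documents.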