A graph-based method of newspaper article reconstruction

Authors: Liangcai Gao, Zhi Tang, Xiaoyan Lin, Yongtao Wang

DOI:

Keywords:

Abstract: The primary information units in a newspaper are the articles. Article reconstruction from newspapers, including article aggregation and reading order recovery, is known to be a quite challenging task due to the complexity of multi-article page layout. In this paper, we propose a novel approach for newspaper article reconstruction using a bipartite graph framework, which models the complex relationships between text blocks as one-to-one correspondences and accomplishes article reconstruction by finding an optimal match on the bipartite graph. During the optimization process, various information sources, including geometric layout, linguistic information, and semantic content, are deeply mined in the bipartite graph model to deal with a wide range of newspaper layouts. Moreover, different from existing methods, we perform the two sub-tasks in reverse order, that is, we detect the reading orders first and then use the reading orders to aggregate the text blocks belonging to the same article. Experimental results on 3312 newspaper pages with 23184 articles demonstrate that our method outperforms the state-of-the-art methods of article reconstruction. In addition, our method has been adopted in several large-scale newspaper digitalization projects.
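The pipeline outlined in the abstract (match each text block to at most one successor, then follow the matched links to aggregate articles) can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the block names, the toy cost that adds a geometric distance to a Jaccard word distance, the END_COST constant, and the exhaustive search, which stands in for a polynomial-time assignment solver such as the Hungarian algorithm.

```python
from itertools import product

# Hypothetical text blocks: (column index, vertical position, bag of words).
BLOCKS = {
    "A1": (0, 0, {"election", "vote", "mayor"}),
    "A2": (0, 1, {"vote", "mayor", "council"}),
    "B1": (1, 0, {"football", "league", "goal"}),
    "B2": (1, 1, {"goal", "league", "coach"}),
}
END_COST = 2.0  # assumed price of declaring that a block ends its article


def link_cost(a, b):
    """Toy cost of reading block b immediately after block a."""
    (ca, ya, wa), (cb, yb, wb) = BLOCKS[a], BLOCKS[b]
    if (cb, yb) <= (ca, ya):  # never read backwards, so chains stay acyclic
        return float("inf")
    geometric = abs(cb - ca) + abs(yb - ya)
    semantic = 1.0 - len(wa & wb) / len(wa | wb)  # Jaccard distance
    return geometric + semantic


def best_matching(names):
    """Exhaustively find the cheapest one-to-one successor assignment."""
    best, best_cost = None, float("inf")
    for choice in product([None, *names], repeat=len(names)):
        used = [c for c in choice if c is not None]
        if len(used) != len(set(used)):  # each block has at most one predecessor
            continue
        cost = sum(END_COST if c is None else link_cost(n, c)
                   for n, c in zip(names, choice))
        if cost < best_cost:
            best, best_cost = dict(zip(names, choice)), cost
    return best


def articles_from(successor):
    """Follow successor links from each chain head to aggregate articles."""
    heads = [n for n in successor if n not in successor.values()]
    result = []
    for head in heads:
        chain, node = [], head
        while node is not None:
            chain.append(node)
            node = successor[node]
        result.append(chain)
    return result


successor = best_matching(list(BLOCKS))  # step 1: reading order
articles = articles_from(successor)      # step 2: article aggregation
print(articles)  # [['A1', 'A2'], ['B1', 'B2']]
```

In this toy setup the reading order is recovered first and the articles fall out as the chains of matched blocks, mirroring the reversed order of sub-tasks described in the abstract. The exhaustive search is only practical for a handful of blocks.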
