Machine Learning for Document Structure Recognition

Authors: Gerhard Paaß, Iuliu Konya

DOI: 10.1007/978-3-642-22613-7_12

Abstract: The backbone of the information age is digital information, which may be searched, accessed, and transferred instantaneously. The digitization of paper documents is therefore of great interest. This chapter describes approaches to document structure recognition, which detect the hierarchy of physical components in document images, such as pages, paragraphs, and figures, and transform it into a hierarchy of logical components, such as titles, authors, and sections. This structural information improves readability and is useful for indexing and retrieving the information contained in documents. First, we present a rule-based system that segments the document image and estimates the logical role of the resulting zones. It is used extensively for processing newspaper collections and shows world-class performance. In the second part, we introduce several machine learning approaches that explore large numbers of interrelated features. They can be adapted to geometrical models of the document structure, which may be set up as a linear sequence or a general graph. These advanced models require far more computational resources but perform better than simpler alternatives and might be used in the future.
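As an illustration of the "linear sequence" case mentioned in the abstract, the following minimal sketch assigns logical roles to a sequence of page zones by Viterbi decoding. It is not the chapter's system: the label set, the function name `viterbi`, and all scores are hypothetical and hand-set rather than learned, and the input is assumed to contain at least one zone.

```python
# Hypothetical sketch: label a linear sequence of page zones with logical roles.
# Scores are hand-set for illustration; a real model would learn them from data.

LABELS = ["title", "author", "section"]


def viterbi(emissions, transitions, labels=LABELS):
    """Return the highest-scoring label sequence for a non-empty zone sequence.

    emissions:   list of dicts, one per zone, mapping label -> score
    transitions: dict mapping (previous_label, label) -> score;
                 pairs that are not listed score 0.0
    """
    # best[i][y] = best score of any labeling of zones 0..i that ends in label y
    best = [{y: emissions[0][y] for y in labels}]
    back = []  # back[i][y] = best previous label for zone i+1 labeled y
    for em in emissions[1:]:
        scores, pointers = {}, {}
        for y in labels:
            prev = max(
                labels,
                key=lambda p: best[-1][p] + transitions.get((p, y), 0.0),
            )
            scores[y] = best[-1][prev] + transitions.get((prev, y), 0.0) + em[y]
            pointers[y] = prev
        best.append(scores)
        back.append(pointers)

    # Trace the best path backwards from the best final label.
    path = [max(labels, key=lambda y: best[-1][y])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


# Toy scores for three zones read top to bottom (assumed values).
emissions = [
    {"title": 2.0, "author": 0.5, "section": 0.1},
    {"title": 1.0, "author": 0.9, "section": 0.2},
    {"title": 0.3, "author": 0.2, "section": 1.5},
]
# Reward title -> author and author -> section; penalize two titles in a row.
transitions = {
    ("title", "author"): 1.0,
    ("author", "section"): 1.0,
    ("title", "title"): -1.0,
}

roles = viterbi(emissions, transitions)
print(roles)  # ['title', 'author', 'section']
```

Taken zone by zone, the second zone would be labeled "title" because that is its highest emission score. The transition scores make the joint decoding prefer "author" there, which is the kind of interdependence between neighbouring zones that a sequence model can capture and an independent per-zone classifier cannot.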
