Authors: Huanfeng Ma, David S. Doermann
DOI: 10.1117/12.476058
Keywords:
Abstract: In this paper, we present an approach to the bootstrap learning of a page segmentation model. The idea evolves from attempts to segment dictionaries that often have a consistent page structure, and is extended to the segmentation of more general structured documents. In cases of highly regular structure, the layout can be learned from examples of only a few pages. The system is first trained using a small number of samples, and a larger test set is processed based on the training result. After making corrections to a selected subset of the test set, these corrected samples are combined with the original training samples to generate more training samples. The newly created samples are used to retrain the system, refine the learned features, and resegment the test set. This procedure is applied iteratively until the learned parameters are stable. Using this approach, we do not need to initially provide a large set of training samples. We have applied this segmentation to many structured documents such as dictionaries, phone books, and spoken language transcripts, and obtained satisfying segmentation performance.
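The iterative procedure described in the abstract (train on a few samples, segment a larger test set, correct a selected subset, retrain, and repeat until the parameters stabilize) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the "model" is a single indent threshold separating two hypothetical line classes (`"headword"` and `"body"`), the `oracle` callable stands in for the human operator who corrects samples, and the names `train`, `segment`, and `bootstrap` are invented here.

```python
from statistics import mean


def train(samples):
    """Fit a toy one-parameter model (assumption: not the paper's model).

    samples: list of (indent, label) pairs with label "headword" or "body".
    Returns the midpoint between the mean indents of the two classes.
    Both classes must be present in the samples.
    """
    heads = [x for x, label in samples if label == "headword"]
    bodies = [x for x, label in samples if label == "body"]
    return (mean(heads) + mean(bodies)) / 2.0


def segment(threshold, indents):
    """Label each line by comparing its indent with the learned threshold."""
    return [(x, "headword" if x < threshold else "body") for x in indents]


def bootstrap(seed, test_indents, oracle, batch=2, tol=1e-3, max_rounds=20):
    """Iteratively retrain until the learned parameter is stable.

    seed:         small initial training set of (indent, label) pairs
    test_indents: larger unlabeled test set
    oracle:       hypothetical stand-in for manual correction; maps an
                  indent to its true label
    """
    training = list(seed)
    threshold = train(training)
    remaining = list(test_indents)
    rounds = 0
    while remaining and rounds < max_rounds:
        rounds += 1
        # Process the not-yet-corrected part of the test set with the current model.
        predicted = segment(threshold, remaining)
        # Correct a selected subset and merge it into the training set.
        corrected = [(x, oracle(x)) for x, _guess in predicted[:batch]]
        remaining = remaining[batch:]
        training.extend(corrected)
        # Retrain and check whether the parameter has stopped moving.
        new_threshold = train(training)
        stable = abs(new_threshold - threshold) < tol
        threshold = new_threshold
        if stable:
            break
    # Resegment the full test set with the final model.
    return threshold, segment(threshold, test_indents), rounds
```

A possible usage, with made-up indent values: `bootstrap([(0.0, "headword"), (4.0, "body")], [0.5, 3.5, 0.2, 4.2, 1.0, 3.0], lambda x: "headword" if x < 2.0 else "body")` returns the final threshold, the relabeled test set, and the number of rounds used. The stopping rule on parameter change mirrors the abstract's "until the learned parameters are stable"; the tolerance and batch size are arbitrary choices.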