Author: Mehdi Haji
DOI:
Keywords:
Abstract: Despite the existence of electronic media in today’s world, a considerable amount of written communication is still in paper form, such as books, bank cheques, and contracts. There is an increasing demand for the automation of information extraction, classification, search, and retrieval of documents. The goal of this research is to develop a complete methodology for spotting arbitrary keywords in handwritten document images. We propose a top-down approach. Our approach is composed of two major steps: segmentation and decision. In the former, we generate the word hypotheses. In the latter, we decide whether a generated word hypothesis is a specific keyword or not. We carry out the decision step through a two-level classification where we first assign an input image to a keyword or non-keyword class, and then transcribe the image if it is passed as a keyword. By reducing the problem from the image domain to the text domain, we address not only the search problem in handwritten documents, but also classification and retrieval, without the need for transcription of the whole document image. The main contribution of this thesis is the development of a generalized minimum edit distance for handwritten words, which we prove is equivalent to an Ergodic Hidden Markov Model (EHMM). To the best of our knowledge, this work is the first to present an exact 2D model for the temporal information in handwriting while satisfying practical constraints. Some other contributions of this research include: 1) removal of page margins based on corner detection in projection profiles; 2) removal of noise patterns in handwritten images using expectation maximization and fuzzy inference systems; 3) extraction of text lines based on fast Fourier-based steerable filtering; 4) segmentation of characters based on skeletal graphs; and 5) merging of broken characters based on graph partitioning. Our experiments with a benchmark database of handwritten English documents and a real-world collection of handwritten French documents indicate that, even without any word/document-level training, our results are comparable with those of state-of-the-art systems.
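As a rough illustration of the text-domain decision step mentioned in the abstract, the sketch below computes the classical (unit-cost) Levenshtein edit distance between a transcribed word hypothesis and each keyword. This is not the generalized minimum edit distance developed in the thesis, which operates on handwritten words and is not specified here. The function names `edit_distance` and `match_keyword`, the `max_distance` threshold, and the sample keywords are assumptions made for this example only.

```python
from typing import Iterable, Optional


def edit_distance(a: str, b: str) -> int:
    """Classical Levenshtein distance with unit costs (illustrative only)."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def match_keyword(transcription: str,
                  keywords: Iterable[str],
                  max_distance: int = 1) -> Optional[str]:
    """Return the closest keyword within max_distance, or None.

    Hypothetical helper: the threshold and case folding are assumptions,
    not part of the method described in the abstract.
    """
    best, best_dist = None, None
    for keyword in keywords:
        dist = edit_distance(transcription.lower(), keyword.lower())
        if best_dist is None or dist < best_dist:
            best, best_dist = keyword, dist
    if best_dist is not None and best_dist <= max_distance:
        return best
    return None
```

For instance, under these assumptions `match_keyword("contrat", ["contract", "cheque"])` would accept the hypothesis as the keyword `contract`, since a single insertion separates the two strings.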