Authors: D. Lewis, G. Agam, S. Argamon, O. Frieder, D. Grossman
Keywords:
Abstract: Research and development of information access technology for scanned paper documents has been hampered by the lack of public test collections of realistic scope and complexity. As part of a project to create a prototype system for search and mining of masses of document images, we are assembling a 1.5 terabyte dataset to support evaluation of both end-to-end complex document information processing (CDIP) tasks (e.g., text retrieval and data mining) as well as component technologies such as optical character recognition (OCR), document structure analysis, signature matching, and authorship attribution.