Building a test collection for complex document information processing

Authors: D. Lewis, G. Agam, S. Argamon, O. Frieder, D. Grossman

DOI: 10.1145/1148170.1148307

Abstract: Research and development of information access technology for scanned paper documents has been hampered by the lack of public test collections of realistic scope and complexity. As part of a project to create a prototype system for search and mining of masses of document images, we are assembling a 1.5 terabyte dataset to support evaluation of both end-to-end complex document information processing (CDIP) tasks (e.g., text retrieval and data mining) as well as component technologies such as optical character recognition (OCR), document structure analysis, signature matching, and authorship attribution.

References (3)
Kazem Taghva, Julie Borsack, Allen Condit, Srinivas Erva. The effects of noisy data on text retrieval. Journal of the American Society for Information Science, vol. 45, pp. 50-58 (1994). DOI: 10.1002/(SICI)1097-4571(199401)45:1<50::AID-ASI6>3.0.CO;2-B
S. Argamon, G. Agam, O. Frieder, D. Grossman, D. Lewis, G. Sohn, K. Voorhees. A complex document information processing prototype. Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '06), pp. 599-600 (2006). DOI: 10.1145/1148170.1148274