Authors: Daniel D. Walker, Eric K. Ringger
DOI:
Keywords:
Abstract: Modern document collections are too large to annotate and curate manually. As increasingly large amounts of data become available, historians, librarians, and other scholars need to rely on automated systems to efficiently and accurately analyze the contents of their collections and to find new and interesting patterns therein. Modern techniques in Bayesian text analytics are becoming widespread and have the potential to revolutionize the way that research is conducted. Much work has been done in the modeling community towards this end, though most of it has focused on modern, relatively clean data. We present research for improved modeling of document collections that may contain textual noise or include real-valued metadata associated with the documents. This class of documents includes many historical collections. Indeed, our specific motivation is to help improve the modeling of historical documents, which are often noisy and/or have context represented by metadata. Many historical documents are digitized by means of Optical Character Recognition (OCR) from images of old and degraded original documents. Historical documents also include metadata, such as timestamps, which can be incorporated into an analysis of their topical content. Many techniques, such as topic models, have been developed to automatically discover meaning in text. While these methods are useful, they can break down in the presence of OCR errors. We show the extent to which this performance breakdown occurs. The types of analyses covered in this dissertation are clustering, feature selection, unsupervised and supervised topic modeling for documents with and without OCR errors, and a model that uses Bayesian nonparametrics. We present results in each of these areas, with an emphasis on studying the effects of noise on the algorithms. In this research we effectively: improve the state of the art in both clustering and topic modeling; introduce a useful synthetic dataset for researchers; and show empirically how existing