Bayesian text analytics for document collections

Authors: Daniel D. Walker, Eric K. Ringger

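Abstract: Modern document collections are too large to annotate and curate manually. As increasingly large amounts of data become available, historians, librarians and other scholars need to rely on automated systems to efficiently and accurately analyze the contents of their collections and to find new and interesting patterns therein. Modern techniques in Bayesian text analytics are becoming widespread and have the potential to revolutionize the way that research is conducted. Much work has been done in the document modeling community towards this end, though most of it is focused on modern, relatively clean text data. We present research for improved modeling of document collections that may contain textual noise or that may include real-valued metadata associated with the documents. This class of documents includes many historical document collections. Indeed, our specific motivation is to help improve the modeling of historical documents, which are often noisy and/or have historical context represented by metadata. Many historical documents are digitized by means of Optical Character Recognition (OCR) from document images of old and degraded original documents. Historical documents also often include associated metadata, such as timestamps, which can be incorporated in an analysis of their topical content. Many techniques, such as topic models, have been developed to automatically discover patterns of meaning in large collections of text. While these methods are useful, they can break down in the presence of OCR errors. We show the extent to which this performance breakdown occurs. The specific types of analyses covered in this dissertation are document clustering, feature selection, and unsupervised and supervised topic modeling for documents with and without OCR errors, together with a new supervised topic model that uses Bayesian nonparametrics to improve the modeling of document metadata. We present results in each of these areas, with an emphasis on studying the effects of noise on the performance of the algorithms and on modeling the metadata associated with the documents. In this research we effectively: improve the state of the art in both document clustering and topic modeling; introduce a useful synthetic dataset for historical document researchers; and empirically show how existing ...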
