Redundancy-Aware Topic Modeling for Patient Record Notes

Authors: Raphael Cohen, Iddo Aviram, Michael Elhadad, Noémie Elhadad

DOI: 10.1371/JOURNAL.PONE.0087555

Abstract: The clinical notes in a given patient record contain much redundancy, in large part due to clinicians’ documentation habit of copying from previous notes in the record and pasting into a new note. Previous work has shown that this redundancy has a negative impact on the quality of text mining and topic modeling in particular. In this paper we describe a novel variant of Latent Dirichlet Allocation (LDA) topic modeling, Red-LDA, which takes into account the inherent redundancy of patient records when modeling the content of clinical notes. To assess the value of Red-LDA, we experiment with three baselines and our novel redundancy-aware topic modeling method: given a large collection of patient records, (i) apply vanilla LDA to all documents in all input records; (ii) identify and remove all redundancy by choosing a single representative document for each record as input to LDA; (iii) identify and remove all redundant paragraphs in each record, leaving partial, non-redundant documents as input to LDA; and (iv) apply Red-LDA to all documents in all input records. Both quantitative evaluation, carried out through log-likelihood on held-out data and topic coherence of the produced topics, and qualitative assessment, carried out by physicians, show that Red-LDA produces superior models to all three baseline strategies. This research contributes to the emerging field of understanding the characteristics of the electronic health record and how to account for them in the framework of data mining. The code for the two redundancy-elimination baselines and Red-LDA is made publicly available to the community.
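As a rough illustration of baseline (iii) above, the sketch below drops paragraphs that repeat earlier text within one patient record. It is not the authors' released code: the abstract does not say how redundancy is detected, so exact matching after whitespace and case normalization is an assumption made here, as is the blank-line paragraph separator. The names `normalize` and `drop_redundant_paragraphs` are hypothetical.

```python
import re


def normalize(paragraph: str) -> str:
    """Lowercase and collapse whitespace so trivially reformatted copies match."""
    return re.sub(r"\s+", " ", paragraph.strip().lower())


def drop_redundant_paragraphs(notes: list[str]) -> list[str]:
    """Keep only the first occurrence of each paragraph within one record.

    `notes` holds one patient's notes in chronological order. Paragraphs are
    assumed to be separated by blank lines (an assumption of this sketch).
    """
    seen = set()
    reduced = []
    for note in notes:
        kept = []
        for paragraph in re.split(r"\n\s*\n", note):
            key = normalize(paragraph)
            # Skip empty paragraphs and anything already seen in this record.
            if key and key not in seen:
                seen.add(key)
                kept.append(paragraph.strip())
        reduced.append("\n\n".join(kept))
    return reduced


# Toy record: the second note copies the first paragraph of the first note.
record = [
    "Patient has hypertension.\n\nStarted lisinopril.",
    "Patient has  hypertension.\n\nBlood pressure improved.",
]
reduced_record = drop_redundant_paragraphs(record)
```

The reduced notes would then be passed to an ordinary LDA implementation. Baseline (ii) is simpler still: keep one representative note per record and discard the rest.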
