Using statistical text classification to identify health information technology incidents.

Authors: Kevin E K Chai, Stephen Anthony, Enrico Coiera, Farah Magrabi

DOI: 10.1136/AMIAJNL-2012-001409

Abstract: Objective To examine the feasibility of using statistical text classification to automatically identify health information technology (HIT) incidents in the USA Food and Drug Administration (FDA) Manufacturer and User Facility Device Experience (MAUDE) database. Design We used a subset of 570 272 incidents including 1534 HIT incidents reported to MAUDE between 1 January 2008 and 1 July 2010. Text classifiers using regularized logistic regression were evaluated with both ‘balanced’ (50% HIT) and ‘stratified’ (0.297% HIT) datasets for training, validation, and testing. Dataset preparation, feature extraction, feature selection, cross-validation, classification, performance evaluation, and error analysis were performed iteratively to further improve the classifiers. Feature-selection techniques such as removing short words and stop words, stemming, lemmatization, and principal component analysis were examined. Measurements κ statistic, F1 score, precision and recall. Results Classification performance was similar on both the stratified (0.954 F1 score) and balanced (0.995 F1 score) datasets. Stemming was the most effective technique, reducing the feature set size to 79% while maintaining comparable performance. Training with balanced datasets improved recall (0.989) but reduced precision (0.165). Conclusions Statistical text classification appears to be a feasible method for identifying HIT reports within large databases of incidents. Automated identification should enable more HIT problems to be detected, analyzed, and addressed in a timely manner. Semi-supervised learning may be necessary when applying machine learning to big data analysis of patient safety incidents and requires further investigation.
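The abstract reports recall (0.989) and precision (0.165) for training on balanced data but gives no F1 score for that configuration. As a rough illustration of how the two figures combine, the sketch below computes F1 as the harmonic mean of precision and recall. The function name `f1_score` is a hypothetical helper, not code from the paper, and the result is only what the two reported numbers would imply if they come from the same evaluation.

```python
def f1_score(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall quoted in the abstract for training on balanced data.
# Assumption: both values describe the same evaluation run.
implied_f1 = f1_score(precision=0.165, recall=0.989)
print(round(implied_f1, 3))  # prints 0.283
```

The low implied value shows why high recall alone is not enough on rare-event data: when precision is low, most reports flagged as HIT are false positives.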

References (34)
Rosa L Figueroa, Qing Zeng-Treitler, Long H Ngo, Sergey Goryachev, Eduardo P Wiechmann. Active learning for clinical text classification: is it better than random sampling? Journal of the American Medical Informatics Association, vol. 19, pp. 809-816 (2012). DOI: 10.1136/AMIAJNL-2011-000648
Paul Komarek, Andrew W. Moore. Fast Robust Logistic Regression for Large Sparse Datasets with Binary Outputs. International Conference on Artificial Intelligence and Statistics (2003).
Joan S Ash, Marc Berg, Enrico Coiera. Some Unintended Consequences of Information Technology in Health Care: The Nature of Patient Care Information System-related Errors. Journal of the American Medical Informatics Association, vol. 11, pp. 104-112 (2003). DOI: 10.1197/JAMIA.M1471
G. K. Savova, J. E. Olson, S. P. Murphy, V. L. Cafourek, F. J. Couch, M. P. Goetz, J. N. Ingle, V. J. Suman, C. G. Chute, R. M. Weinshilboum. Automated discovery of drug treatment patterns for endocrine therapy of breast cancer within an electronic medical record. Journal of the American Medical Informatics Association, vol. 19 (2012). DOI: 10.1136/AMIAJNL-2011-000295
Gary M. Weiss. Mining with rarity. ACM SIGKDD Explorations Newsletter, vol. 6, pp. 7-19 (2004). DOI: 10.1145/1007730.1007734
Jesse Davis, Mark Goadrich. The relationship between Precision-Recall and ROC curves. Proceedings of the 23rd international conference on Machine learning - ICML '06, vol. 148, pp. 233-240 (2006). DOI: 10.1145/1143844.1143874
Chris Seiffert, Taghi M. Khoshgoftaar, Jason Van Hulse, Amri Napolitano. Comparative Study of Clustering Techniques for the Organization of Software Repositories. International Conference on Tools with Artificial Intelligence, vol. 1, pp. 210-214 (2007). DOI: 10.1109/ICTAI.2007.71
Kilian Weinberger, Anirban Dasgupta, John Langford, Alex Smola, Josh Attenberg. Feature hashing for large scale multitask learning. Proceedings of the 26th Annual International Conference on Machine Learning - ICML '09, pp. 1113-1120 (2009). DOI: 10.1145/1553374.1553516