Authors: Kevin E K Chai, Stephen Anthony, Enrico Coiera, Farah Magrabi
DOI: 10.1136/AMIAJNL-2012-001409
Keywords:
Abstract: Objective: To examine the feasibility of using statistical text classification to automatically identify health information technology (HIT) incidents in the USA Food and Drug Administration (FDA) Manufacturer and User Facility Device Experience (MAUDE) database. Design: We used a subset of 570 272 incidents, including 1534 HIT incidents, reported to MAUDE between 1 January 2008 and July 2010. Text classifiers using regularized logistic regression were evaluated with both ‘balanced’ (50% HIT) and ‘stratified’ (0.297% HIT) datasets for training, validation, and testing. Dataset preparation, feature extraction, feature selection, cross-validation, classification, performance evaluation, and error analysis were performed iteratively to further improve the classifiers. Feature-selection techniques such as removing short words and stop words, stemming, lemmatization, and principal component analysis were examined. Measurements: κ statistic, F1 score, precision, and recall. Results: Classification performance was similar on both the stratified (0.954 F1 score) and balanced (0.995 F1 score) datasets. Stemming was the most effective technique, reducing the feature set size (79%) while maintaining comparable performance. Training with balanced datasets improved recall (0.989) but reduced precision (0.165). Conclusions: Statistical text classification appears to be a feasible method for identifying HIT reports within large databases of incidents. Automated identification should enable more HIT problems to be detected, analyzed, and addressed in a timely manner. Semi-supervised learning may be necessary when applying machine learning to big data in patient safety and requires further investigation.
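The measurements named in the abstract (κ statistic, F1 score, precision, recall) can all be derived from a 2x2 confusion matrix. The following is a minimal sketch, not the authors' evaluation code: the function names `binary_metrics` and `f1_from` are hypothetical, and it assumes a binary HIT versus non-HIT labelling. It also shows why a recall of 0.989 paired with a precision of 0.165 corresponds to a low F1 score of roughly 0.28; that figure is plain arithmetic on the two reported values, not a result stated in the abstract.

```python
def binary_metrics(tp, fp, fn, tn):
    """Precision, recall, F1 and Cohen's kappa from 2x2 confusion counts.

    tp/fp/fn/tn are counts of true positives, false positives,
    false negatives and true negatives for the positive (HIT) class.
    """
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = f1_from(precision, recall)
    # Observed agreement between predicted and true labels.
    observed = (tp + tn) / total
    # Agreement expected by chance, from the row and column marginals.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / total ** 2
    kappa = (observed - expected) / (1 - expected) if expected != 1 else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "kappa": kappa}


def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Arithmetic on the precision and recall reported in the abstract.
    print(round(f1_from(0.165, 0.989), 3))
    # Hypothetical counts, for illustration only (not data from the study).
    print(binary_metrics(tp=90, fp=10, fn=10, tn=890))
```

Because F1 is a harmonic mean, it is pulled toward the smaller of the two inputs, which is why high recall cannot compensate for low precision on a dataset where HIT incidents are rare.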