Context Similarity for Retrieval-Based Imputation

Authors: Ahmad Ahmadov, Maik Thiele, Wolfgang Lehner, Robert Wrembel

DOI: 10.1145/3110025.3110161

Keywords: Missing data, Data quality, Matching methods, Imputation (statistics), Data mining, External data, Web tables, Computer science, Inference

Abstract: Completeness, one of the four major dimensions of data quality, is a pervasive issue in modern databases. Although data imputation has been studied extensively in the literature, most of the research has focused on inference-based approaches. We propose to harness Web tables as an external data source to effectively and efficiently retrieve missing data while taking into account the inherent uncertainty and lack of veracity that they contain. Existing approaches mostly rely on standard retrieval techniques and out-of-the-box matching methods, which result in very low precision, especially when dealing with numerical data. We therefore propose a novel data imputation approach that applies numerical context similarity measures, which results in a significant increase in the precision of the imputation procedure by ensuring that the imputed values are of the same domain and magnitude as the local values, thus resulting in an accurate imputation. We use the Dresden Web Table Corpus, which is comprised of more than 125 million web tables extracted from the Common Crawl, as our knowledge source. The comprehensive experimental results demonstrate that the proposed method well outperforms the default out-of-the-box retrieval approach.
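The abstract's idea of keeping only candidates whose numerical context is of the same magnitude as the local values can be illustrated with a minimal sketch. This is not the authors' measure: the function names, the log-median comparison, and the 0.5 threshold are all assumptions chosen for illustration, and the sample numbers are made up.

```python
import math
from statistics import median


def magnitude_similarity(local_values, candidate_values):
    """Hypothetical score in (0, 1]: 1.0 when both columns share the same
    median order of magnitude, lower as the magnitudes drift apart.
    Assumes each column holds at least one non-zero number."""
    def log_median(values):
        non_zero = [abs(v) for v in values if v != 0]
        return math.log10(median(non_zero))

    gap = abs(log_median(local_values) - log_median(candidate_values))
    return 1.0 / (1.0 + gap)


def filter_candidates(local_values, candidates, threshold=0.5):
    """Keep the retrieved values whose source column (the 'context')
    resembles the local column. `candidates` is a list of
    (value, context_column) pairs; the threshold is an assumed cutoff."""
    return [
        value
        for value, context in candidates
        if magnitude_similarity(local_values, context) >= threshold
    ]


# Illustrative data: a local column of city populations in persons, and
# two retrieved web-table columns, one in persons and one in thousands.
local = [3_500_000, 1_800_000, 1_500_000]
candidates = [
    (550_000, [8_400_000, 2_700_000, 3_900_000]),  # same unit as local
    (550, [8_400, 2_700, 3_900]),                  # reported in thousands
]
print(filter_candidates(local, candidates))
```

In this toy case the candidate drawn from the column reported in thousands is rejected, because its context sits roughly three orders of magnitude below the local column. A string-based match on the attribute name alone would have accepted both.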
