Sharpening the BLADE: Missing Data Imputation Using Supervised Machine Learning

Authors: Marcus Suresh, Ronnie Taib, Yanchang Zhao, Warren Jin

DOI: 10.1007/978-3-030-35288-2_18

Abstract: Incomplete data are quite common and can deteriorate statistical inference, often affecting evidence-based policymaking. A typical example is the Business Longitudinal Analysis Data Environment (BLADE), an Australian Government national data asset. In this paper, motivated by helping BLADE practitioners select and implement advanced imputation methods with a solid understanding of the impact different methods will have on data accuracy and reliability, we examine the performance of data imputation techniques based on 12 machine learning algorithms. They range from linear regression to neural networks. We compare the performance of these algorithms and assess the impact of various settings, including the number of input features and the length of time spans. To examine generalisability, we also impute two features with distinct characteristics. Experimental results show that three ensemble algorithms (extra trees regressor, bagging regressor and random forest) consistently maintain high imputation performance over the benchmark linear regression across the performance metrics. Among them, we would recommend the extra trees regressor for its accuracy and computational efficiency.
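
The comparison described in the abstract can be illustrated with a minimal sketch, which is not the authors' implementation. It treats imputation as supervised regression: a model is trained on rows where the target feature is observed and then predicts the rows where it is missing. The synthetic data, the helper names make_synthetic and compare_imputers, the 30% missingness rate and all hyperparameters are assumptions made for illustration; only the scikit-learn estimators themselves are named in the abstract.

```python
import numpy as np
from sklearn.ensemble import (
    BaggingRegressor,
    ExtraTreesRegressor,
    RandomForestRegressor,
)
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score


def make_synthetic(n_rows=600, n_features=5, seed=0):
    """Build a toy dataset (hypothetical; it does not resemble BLADE)."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n_rows, n_features))
    # Target feature with linear, periodic and interaction components.
    y = (
        2.0 * X[:, 0]
        + np.sin(3.0 * X[:, 1])
        + X[:, 2] * X[:, 3]
        + rng.normal(scale=0.1, size=n_rows)
    )
    return X, y


def compare_imputers(X, y, missing_fraction=0.3, seed=0):
    """Hide part of y at random, impute it with each model, score the result."""
    rng = np.random.default_rng(seed)
    missing = rng.random(len(y)) < missing_fraction  # rows treated as missing

    models = {
        "linear_regression": LinearRegression(),  # benchmark
        "extra_trees": ExtraTreesRegressor(n_estimators=100, random_state=seed),
        "bagging": BaggingRegressor(n_estimators=50, random_state=seed),
        "random_forest": RandomForestRegressor(n_estimators=100, random_state=seed),
    }

    results = {}
    for name, model in models.items():
        # Train only on rows where the target is observed.
        model.fit(X[~missing], y[~missing])
        # Impute the hidden values and compare with the held-back truth.
        imputed = model.predict(X[missing])
        results[name] = {
            "mae": mean_absolute_error(y[missing], imputed),
            "r2": r2_score(y[missing], imputed),
        }
    return results


if __name__ == "__main__":
    X, y = make_synthetic()
    for name, scores in compare_imputers(X, y).items():
        print(f"{name:18s} MAE={scores['mae']:.3f} R2={scores['r2']:.3f}")
```

Here values are hidden completely at random, which is the simplest missingness mechanism to simulate; real missingness in administrative data may depend on other variables, so scores from a sketch like this should not be read as indicative of the paper's results.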

References
Roderick J. A. Little, Donald B. Rubin, Statistical Analysis with Missing Data (1987)
Robert Merton Solow, A Contribution to the Theory of Economic Growth. The Quarterly Journal of Economics, vol. 70, pp. 65-94 (1956), 10.2307/1884513
Donald B. Rubin, Inference and missing data. Biometrika, vol. 63, pp. 581-592 (1976), 10.1093/BIOMET/63.3.581
Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Andreas Müller, Joel Nothman, Gilles Louppe, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, Édouard Duchesnay, Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, vol. 12, pp. 2825-2830 (2011)
Huidong Jin, Man-Leung Wong, K-S Leung, Scalable model-based clustering for large databases based on data summarization. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, pp. 1710-1719 (2005), 10.1109/TPAMI.2005.226
Sasan Bakhtiari, Entrepreneurship Dynamics in Australia: Lessons from Micro‐data. Economic Record, vol. 95, pp. 114-140 (2019), 10.1111/1475-4932.12460
Alex Mihailidis, Shehroz S. Khan, Amir Ahmad, Bootstrapping and Multiple Imputation Ensemble Approaches for Missing Data. arXiv: Learning (2018)
K. Shuvo Bakar, Huidong Jin, Areal prediction of survey data using Bayesian spatial generalised linear models. Communications in Statistics - Simulation and Computation, vol. 49, pp. 2963-2978 (2020), 10.1080/03610918.2018.1530787