Authors: Maytal Saar-Tsechansky, Foster Provost
DOI:
Keywords:
Abstract: Much work has studied the effect of different treatments of missing values on model induction, but little work has analyzed treatments for the common case of missing values at prediction time. This paper first compares several methods (predictive value imputation, the distribution-based imputation used by C4.5, and reduced models) for applying classification trees to instances with missing values, and also shows evidence that the results generalize to bagged trees and to logistic regression. The results show that, of the two most popular treatments, each is preferable under different conditions. Strikingly, the reduced-models approach, seldom mentioned or used, consistently outperforms the other two methods, sometimes by a large margin. The lack of attention to reduced modeling may be due in part to its (perceived) expense in terms of computation or storage. Therefore, we then introduce and evaluate alternative, hybrid approaches that allow users to balance between more accurate but computationally expensive reduced modeling and the other, less accurate but less computationally expensive treatments. The results show that the hybrid methods can scale gracefully to the amount of investment in computation/storage, and that they outperform imputation even for small investments.