A Reality Check for Data Snooping

Author: Halbert White

DOI: 10.1111/1468-0262.00152

Keywords: Null hypothesis, Model selection, Artificial intelligence, Statistical theory, Set (abstract data type), Benchmark (computing), Mathematics, Machine learning, Inference, Statistical hypothesis testing, Econometrics, Mistake

Abstract: Data snooping occurs when a given set of data is used more than once for purposes of inference or model selection. When such data reuse occurs, there is always the possibility that any satisfactory results obtained may simply be due to chance rather than to any merit inherent in the method yielding the results. This problem is practically unavoidable in the analysis of time-series data, as typically only a single history measuring a given phenomenon of interest is available for analysis. It is widely acknowledged by empirical researchers that data snooping is a dangerous practice to be avoided, but in fact it is endemic. The main problem has been a lack of sufficiently simple practical methods capable of assessing the potential dangers of data snooping in a given situation. Our purpose here is to provide such methods by specifying a straightforward procedure for testing the null hypothesis that the best model encountered in a specification search has no predictive superiority over a given benchmark model. This permits data snooping to be undertaken with some degree of confidence that one will not mistake results that could have been generated by chance for genuinely good results.
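
The abstract names the null hypothesis but not the mechanics of the test, so the following is only a minimal sketch of how such a test might look, not the paper's exact procedure. It assumes that each candidate model is scored by its per-period performance differential against the benchmark, that the test statistic is the largest scaled mean differential across models, and that its null distribution is approximated by resampling time indices in blocks of random length (a stationary-bootstrap-style scheme, chosen here because the data are time series). The function names, `n_boot`, `mean_block`, and `seed` are all hypothetical and purely illustrative.

```python
import math
import random


def stationary_bootstrap_indices(n, mean_block, rng):
    """Draw n time indices in blocks of geometric length, wrapping around the sample."""
    p = 1.0 / mean_block
    idx = [rng.randrange(n)]
    while len(idx) < n:
        if rng.random() < p:
            idx.append(rng.randrange(n))    # start a new block at a random date
        else:
            idx.append((idx[-1] + 1) % n)   # continue the current block
    return idx


def reality_check_pvalue(diffs, n_boot=500, mean_block=10, seed=0):
    """Bootstrap p-value for H0: the best model does not beat the benchmark.

    diffs[k][t] is the performance of model k minus that of the benchmark
    in period t (larger is better). Illustrative sketch only.
    """
    rng = random.Random(seed)
    n = len(diffs[0])
    means = [sum(d) / n for d in diffs]
    # Observed statistic: the best scaled mean differential over all models.
    observed = max(math.sqrt(n) * m for m in means)
    exceed = 0
    for _ in range(n_boot):
        idx = stationary_bootstrap_indices(n, mean_block, rng)
        # Recentre each resampled mean on its sample mean, then take the max,
        # using the same resampled dates for every model.
        boot = max(
            math.sqrt(n) * (sum(d[t] for t in idx) / n - m)
            for d, m in zip(diffs, means)
        )
        if boot > observed:
            exceed += 1
    return exceed / n_boot
```

Taking the maximum over all models inside every bootstrap replication is what accounts for the specification search: the best model is compared with the distribution of the best of many chance results, not with a single-model null distribution. A small p-value then suggests that the best model's advantage is unlikely to be an artifact of searching over many specifications.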
