Random Forests for Generating Partially Synthetic, Categorical Data

Authors: Jerome P. Reiter, Gregory Caiola

DOI: 10.5555/1747335.1747337

Keywords: Statistical model, Empirical research, Synthetic data, Statistical data type, Random forest, Categorical variable, Parametric statistics, Computer science, Microdata (statistics), Data mining

Abstract: Several national statistical agencies are now releasing partially synthetic, public use microdata. These comprise the units in the original database with sensitive or identifying values replaced with values simulated from statistical models. Specifying synthesis models can be daunting in databases that include many variables of diverse types. These variables may be related in ways that can be difficult to capture with standard parametric tools. In this article, we describe how random forests can be adapted to generate partially synthetic data for categorical variables. Using an empirical study, we illustrate that the random forest synthesizer can preserve relationships reasonably well while providing low disclosure risks. The random forest synthesizer has some appealing features for statistical agencies: it can be applied with minimal tuning, easily incorporates numerical, categorical, and mixed variables as predictors, operates efficiently in high dimensions, and automatically fits non-linear relationships.
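The abstract does not spell out the synthesis procedure, so the following is only a minimal sketch of one plausible reading: grow a forest of classification trees for the sensitive categorical variable, run each record down every tree, and draw the replacement value at random in proportion to the trees' votes. The function names (`grow_tree`, `predict`, `synthesize`), the parameters `n_trees` and `min_size`, and the restriction to categorical predictors with multiway splits are all assumptions made for illustration. They are not the authors' implementation, which the abstract says also accepts numerical and mixed predictors.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of category labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def grow_tree(rows, target, features, rng, min_size=5):
    """Grow one classification tree using multiway splits on categorical predictors."""
    labels = [r[target] for r in rows]
    majority = Counter(labels).most_common(1)[0][0]
    if len(rows) < min_size or len(set(labels)) == 1 or not features:
        return majority  # leaf: majority category

    # Random-forest step: only a random subset of predictors is considered.
    k = max(1, int(len(features) ** 0.5))
    candidates = rng.sample(features, k)

    def split_impurity(f):
        groups = {}
        for r in rows:
            groups.setdefault(r[f], []).append(r[target])
        return sum(len(g) * gini(g) for g in groups.values()) / len(rows)

    best = min(candidates, key=split_impurity)
    groups = {}
    for r in rows:
        groups.setdefault(r[best], []).append(r)
    if len(groups) == 1:
        return majority  # the chosen predictor cannot separate the rows

    rest = [f for f in features if f != best]
    children = {v: grow_tree(g, target, rest, rng, min_size) for v, g in groups.items()}
    return {"feature": best, "default": majority, "children": children}


def predict(tree, row):
    """Run one record down a tree and return the predicted category."""
    while isinstance(tree, dict):
        tree = tree["children"].get(row[tree["feature"]], tree["default"])
    return tree


def synthesize(rows, target, features, n_trees=50, seed=0):
    """Return a copy of rows with the target replaced by draws from the forest's votes."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        bootstrap = [rng.choice(rows) for _ in rows]  # resample with replacement
        forest.append(grow_tree(bootstrap, target, list(features), rng))

    synthetic = []
    for r in rows:
        votes = Counter(predict(t, r) for t in forest)
        categories, counts = zip(*sorted(votes.items()))
        new_row = dict(r)  # non-sensitive values are kept as they are
        new_row[target] = rng.choices(categories, weights=counts)[0]
        synthetic.append(new_row)
    return synthetic
```

Calling `synthesize(records, "status", ["region", "age"])` on a list of dictionaries would return records whose `region` and `age` values are unchanged and whose `status` values are simulated. The column names here are hypothetical. Sampling from the vote distribution, rather than taking the majority vote, is what keeps the released values from being a deterministic function of the predictors.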

References (21)
J. P. Reiter. Using CART to generate partially synthetic public use microdata. Journal of Official Statistics, vol. 21, pp. 441-462 (2005).
Mi-Ja Woo, Jerome P. Reiter, Anna Oganian, Alan F. Karr. Global Measures of Data Utility for Microdata Masked for Disclosure Limitation. Journal of Privacy and Confidentiality, vol. 1, pp. 7- (2009). DOI: 10.29012/JPC.V1I1.568
John M. Abowd, Simon D. Woodcock. Multiply-Imputing Confidential Characteristics and File Links in Longitudinal Linked Data. Privacy in Statistical Databases, pp. 290-297 (2004). DOI: 10.1007/978-3-540-25955-8_23
Roderick J. A. Little, Fang Liu, Trivellore E. Raghunathan. Statistical Disclosure Techniques Based on Multiple Imputation. John Wiley & Sons, Ltd., pp. 141-152 (2005). DOI: 10.1002/0470090456.CH13
Leon Cornelis Roelof Johannes Willenborg, Ton de Waal. Elements of Statistical Disclosure Control (2000).
Richard A. Olshen, Charles J. Stone, Leo Breiman, Jerome H. Friedman. Classification and Regression Trees (1983).
Jerome P. Reiter, Anna Oganian, Alan F. Karr. Verification servers: Enabling analysts to assess the quality of inferences from public use data. Computational Statistics & Data Analysis, vol. 53, pp. 1475-1482 (2009). DOI: 10.1016/J.CSDA.2008.10.006