Authors: Jerome P. Reiter, Gregory Caiola
Keywords: Statistical model, Empirical research, Synthetic data, Statistical data type, Random forest, Categorical variable, Parametric statistics, Computer science, Microdata (statistics), Data mining
Abstract: Several national statistical agencies are now releasing partially synthetic, public use microdata. These comprise the units in the original database with sensitive or identifying values replaced with values simulated from models. Specifying synthesis models can be daunting in databases that include many variables of diverse types. These variables may be related in ways that are difficult to capture with standard parametric tools. In this article, we describe how random forests can be adapted to generate synthetic data for categorical variables. Using an empirical study, we illustrate that the random forest synthesizer can preserve relationships reasonably well while providing low disclosure risks. The synthesizer has some appealing features for agencies: it can be applied with minimal tuning, easily incorporates numerical, categorical, and mixed variables as predictors, operates efficiently in high dimensions, and automatically fits non-linear relationships.
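As a rough illustration of the idea in the abstract, the sketch below fits a small random forest to one categorical variable and draws each synthetic value from the per-tree predictions for that record. It is an assumption about how such a synthesizer could work, not the authors' implementation. The function names (`fit_forest`, `synthesize`), the equality-based splits, the Gini criterion, and all parameter defaults are hypothetical choices made for the example, and only categorical predictors are handled.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def build_tree(rows, labels, n_try, rng):
    """Grow one classification tree on categorical predictors.

    Each split asks whether row[feature] == value. Only a random subset of
    n_try features is searched at each node (a simplification of the usual
    random forest recipe: if none of them yields a split, the node is a leaf).
    """
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1:
        return {"leaf": majority}
    best = None
    for f in rng.sample(range(len(rows[0])), n_try):
        for v in sorted(set(row[f] for row in rows)):
            left = [i for i, row in enumerate(rows) if row[f] == v]
            right = [i for i, row in enumerate(rows) if row[f] != v]
            if not left or not right:
                continue  # this test does not separate any records
            # Size-weighted impurity of the two child nodes (lower is better).
            score = sum(len(side) * gini([labels[i] for i in side])
                        for side in (left, right))
            if best is None or score < best[0]:
                best = (score, f, v, left, right)
    if best is None:
        return {"leaf": majority}
    _, f, v, left, right = best
    return {
        "feature": f,
        "value": v,
        "left": build_tree([rows[i] for i in left],
                           [labels[i] for i in left], n_try, rng),
        "right": build_tree([rows[i] for i in right],
                            [labels[i] for i in right], n_try, rng),
    }


def predict_tree(tree, row):
    """Drop one record down a tree and return the label of its leaf."""
    while "leaf" not in tree:
        branch = "left" if row[tree["feature"]] == tree["value"] else "right"
        tree = tree[branch]
    return tree["leaf"]


def fit_forest(rows, labels, n_trees=25, seed=0):
    """Fit n_trees trees, each on a bootstrap sample of the records."""
    rng = random.Random(seed)
    n, p = len(rows), len(rows[0])
    n_try = max(1, round(p ** 0.5))  # assumed default: about sqrt(p) features
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        forest.append(build_tree([rows[i] for i in idx],
                                 [labels[i] for i in idx], n_try, rng))
    return forest


def synthesize(forest, rows, seed=1):
    """Replace each record's value with a draw from its per-tree votes."""
    rng = random.Random(seed)
    return [rng.choice([predict_tree(tree, row) for tree in forest])
            for row in rows]


if __name__ == "__main__":
    # Toy records: two categorical predictors and one sensitive variable.
    rows = [("a", "u"), ("a", "v"), ("b", "u"), ("b", "v")] * 10
    labels = ["yes", "yes", "no", "no"] * 10
    forest = fit_forest(rows, labels)
    print(synthesize(forest, rows)[:8])
```

Drawing from the tree votes, rather than releasing the majority vote, keeps some of the uncertainty in the replaced values. Whether the resulting data carry acceptably low disclosure risk would still have to be assessed separately.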