Authors: Yongjoo Park, Jingyi Qing, Xiaoyang Shen, Barzan Mozafari
Keywords: Poisson regression, Probabilistic logic, Speedup, Sampling (statistics), Entropy (information theory), Linear regression, Logistic regression, Computer science, Algorithm, Generalized linear model, Maximum likelihood, Hyperparameter
Abstract: The rising volume of datasets has made training machine learning (ML) models a major computational cost in the enterprise. Given the iterative nature of model and parameter tuning, many analysts use a small sample of their entire data during the initial stage of analysis to make quick decisions (e.g., what features or hyperparameters to use) and use the entire dataset only in later stages (i.e., when they have converged to a specific model). This sampling, however, is performed in an ad-hoc fashion. Most practitioners cannot precisely capture the effect of sampling on the quality of their model, and eventually on their decision-making process during the tuning phase. Moreover, without systematic support for sampling operators, many optimizations and reuse opportunities are lost.

In this paper, we introduce BlinkML, a system for fast, quality-guaranteed ML training. BlinkML allows users to make error-computation tradeoffs: instead of training a model on their full data (i.e., a full model), they can quickly train an approximate model with quality guarantees using a sample. BlinkML ensures that, with high probability, the approximate model makes the same predictions as the full model. BlinkML currently supports any model that relies on maximum likelihood estimation (MLE), which includes Generalized Linear Models (e.g., linear regression, logistic regression, max entropy classifier, Poisson regression) as well as PPCA (Probabilistic Principal Component Analysis). Our experiments show that BlinkML can speed up large-scale ML training tasks by 6.26x-629x while guaranteeing the same predictions, with 95% probability, as the full model.
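The core idea in the abstract, training an MLE-based model on a uniform sample and checking how often it agrees with the model trained on the full data, can be sketched in a few lines. This is only an illustration of the sample-vs-full tradeoff on synthetic data, not BlinkML's actual algorithm (which derives probabilistic agreement guarantees rather than measuring agreement empirically); all names and the synthetic data setup below are our own.

```python
import numpy as np

def train_logreg(X, y, lr=0.1, iters=500):
    """Fit logistic regression by gradient ascent on the log-likelihood (MLE)."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ w))      # predicted probabilities
        w += lr * X.T @ (y - p) / len(y)      # log-likelihood gradient step
    return w

# Synthetic data: labels drawn from a true logistic model.
rng = np.random.default_rng(0)
n, d = 20000, 5
X = rng.normal(size=(n, d))
true_w = rng.normal(size=d)
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-X @ true_w))).astype(float)

w_full = train_logreg(X, y)                       # "full model" on all n rows
idx = rng.choice(n, size=n // 10, replace=False)  # 10% uniform sample
w_samp = train_logreg(X[idx], y[idx])             # "approximate model"

# Fraction of points where the two models make identical predictions.
agree = np.mean((X @ w_full > 0) == (X @ w_samp > 0))
print(f"prediction agreement: {agree:.3f}")
```

Even this naive 10% sample typically yields well over 90% prediction agreement here; BlinkML's contribution is choosing the sample size so that a target agreement level (e.g., 95%) holds with a stated probability, without training the full model first.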