Authors: Sahaana Suri, Peter Bailis
Keywords: Dimensionality reduction, Drop (telecommunication), Computer science, Data mining, Principal component analysis, Scaling, Fast Fourier transform
Abstract: Dimensionality reduction is a critical step in scaling machine learning pipelines. Principal component analysis (PCA) is a standard tool for dimensionality reduction, but performing PCA over a full dataset can be prohibitively expensive. As a result, theoretical work has studied the effectiveness of iterative, stochastic PCA methods that operate over data samples. However, termination conditions for these methods either execute for a predetermined number of iterations or run until convergence of the solution, frequently sampling too many or too few datapoints to deliver end-to-end runtime improvements. We show how accounting for downstream analytics operations during dimensionality reduction via PCA allows stochastic methods to efficiently terminate after operating over small (e.g., 1%) subsamples of the input data, reducing whole-workload runtime. Leveraging this, we propose DROP, an optimizer that enables speedups of up to 5x over Singular-Value-Decomposition-based techniques, and exceeds conventional approaches such as FFT and PAA by up to 16x in end-to-end workloads.
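To make the sample-and-terminate idea concrete, the following is a minimal sketch, not DROP's actual algorithm or cost model: it fits PCA on progressively larger random subsamples and stops once a crude quality proxy (variance explained by the top-k components) stabilizes. The function name, batch schedule, and variance-based stopping criterion are all illustrative assumptions.

```python
import numpy as np

def sample_based_pca(X, k, batch=100, tol=1e-3, seed=0):
    """Fit PCA on growing random subsamples of X, terminating once the
    fraction of variance captured by the top-k components stabilizes.
    (A stand-in quality proxy; DROP uses a richer downstream-aware
    cost model.)"""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    Xc = X - X.mean(axis=0)
    total = np.linalg.norm(Xc) ** 2
    prev = None
    for n in range(batch, len(X) + 1, batch):
        S = X[idx[:n]]
        S = S - S.mean(axis=0)
        # Top-k principal components of the current subsample via SVD.
        _, _, Vt = np.linalg.svd(S, full_matrices=False)
        V = Vt[:k].T
        # Fraction of total variance those components explain on all of X.
        explained = np.linalg.norm(Xc @ V) ** 2 / total
        if prev is not None and abs(explained - prev) < tol:
            return V, n  # terminated early, using only n of len(X) rows
        prev = explained
    return V, len(X)

# Low-rank synthetic data: termination should trigger on a small subsample.
rng = np.random.default_rng(1)
X = rng.normal(size=(5000, 2)) @ rng.normal(size=(2, 20)) \
    + 0.01 * rng.normal(size=(5000, 20))
V, n_used = sample_based_pca(X, k=2)
print(f"used {n_used} of {len(X)} rows")
```

On near-low-rank data like the synthetic example above, the stopping rule fires after a small fraction of the rows, which is the behavior the abstract's "small (e.g., 1%) subsamples" claim formalizes.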