DROP: Dimensionality Reduction Optimization for Time Series.

Authors: Sahaana Suri, Peter Bailis

DOI: 10.1145/3329486.3329490

Keywords: Dimensionality reduction, Drop (telecommunication), Computer science, Data mining, Principal component analysis, Scaling, Fast Fourier transform

Abstract: Dimensionality reduction (DR) is a critical step in scaling machine learning pipelines. Principal component analysis (PCA) is a standard tool for dimensionality reduction, but performing PCA over a full dataset can be prohibitively expensive. As a result, theoretical work has studied the effectiveness of iterative, stochastic PCA methods that operate over data samples. However, termination conditions for these methods either execute for a predetermined number of iterations or run until convergence of the solution, frequently sampling too many or too few datapoints for end-to-end runtime improvements. We show how accounting for downstream analytics operations during DR via PCA allows stochastic methods to efficiently terminate after operating over small (e.g., 1%) subsamples of input data, reducing whole-workload runtime. Leveraging this, we propose DROP, a DR optimizer that enables speedups of up to 5x over Singular-Value-Decomposition-based PCA techniques, and exceeds conventional approaches like FFT and PAA by up to 16x in end-to-end workloads.
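The idea in the abstract, fitting PCA on a small sample and stopping as soon as the result is good enough for the downstream task, can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the downstream requirement is modeled as preserving pairwise Euclidean distances on a set of random probe pairs, the sample doubles each round, and the names `sampled_pca`, `pca_basis`, `preservation`, `target`, and `max_dims` are hypothetical.

```python
import numpy as np


def pca_basis(sample, k):
    """Return up to k principal directions (as columns) of `sample`."""
    centered = sample - sample.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:k].T


def preservation(diffs, norms, basis):
    """Mean ratio of projected to original distance over the probe pairs."""
    return float(np.mean(np.linalg.norm(diffs @ basis, axis=1) / norms))


def sampled_pca(data, target=0.95, max_dims=8, start_frac=0.01,
                seed=0, n_pairs=500):
    """Grow a random sample until a small PCA basis meets `target`.

    Returns (basis, sample_size, score). The stopping rule is an
    illustrative assumption, not the authors' optimizer.
    """
    rng = np.random.default_rng(seed)
    n = len(data)

    # Probe pairs stand in for the downstream task (assumption):
    # we ask how well distances between them survive the projection.
    i, j = rng.integers(0, n, size=(2, n_pairs))
    diffs = data[i] - data[j]
    norms = np.linalg.norm(diffs, axis=1)
    diffs, norms = diffs[norms > 0], norms[norms > 0]

    m = max(2, int(start_frac * n))
    while True:
        m = min(m, n)
        sample = data[rng.choice(n, size=m, replace=False)]
        basis = pca_basis(sample, max_dims)
        # Take the smallest dimension that satisfies the target.
        for k in range(1, basis.shape[1] + 1):
            score = preservation(diffs, norms, basis[:, :k])
            if score >= target:
                return basis[:, :k], m, score
        if m == n:
            # Full data used; return the best basis available.
            return basis, m, score
        m *= 2  # otherwise double the sample and retry
```

On data with strong low-dimensional structure, a loop like this tends to stop at the first, smallest sample, which is where the runtime savings described in the abstract would come from. Because an orthonormal projection never lengthens a distance, the score lies between 0 and 1.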
