Visualization-aware sampling for very large databases

Authors: Yongjoo Park, Michael Cafarella, Barzan Mozafari

DOI: 10.1109/ICDE.2016.7498287

Keywords: Data mining, Sampling (statistics), Computer science, Database, Scatter plot, Cluster analysis, Density estimation, Visualization, Stratified sampling, Set (abstract data type)

Abstract: Interactive visualizations are crucial in ad hoc data exploration and analysis. However, with the growing number of massive datasets, generating visualizations in interactive timescales is increasingly challenging. One approach for improving the speed of the visualization tool is via data reduction in order to reduce the computational overhead, but at a potential cost in visualization accuracy. Common data reduction techniques, such as uniform and stratified sampling, do not exploit the fact that the sampled tuples will be transformed into a visualization for human consumption. We propose visualization-aware sampling (VAS), which guarantees high-quality visualizations with a small subset of the entire dataset. We validate our method when applied to scatter and map plots for three common visualization goals: regression, density estimation, and clustering. The key to our sampling method's success is in choosing a set of tuples that minimizes a visualization-inspired loss function. While existing sampling approaches minimize the error of aggregation queries, we focus on a loss function that maximizes the visual fidelity of scatter plots. Our user study confirms that our proposed loss function correlates strongly with user success in using the resulting visualizations. Our experiments show that (i) VAS improves user's success by up to 35% in various visualization tasks, and (ii) VAS can achieve a required visualization quality up to 400× faster.
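The abstract describes choosing a subset of tuples that minimizes a visualization-inspired loss, but it does not state the loss or the optimization procedure. The sketch below is therefore only an illustration of the general idea, under loudly stated assumptions: the loss is taken to be the sum of Gaussian-kernel proximities between selected points (so tightly packed, visually redundant points are penalized), and the subset is built greedily. The names `kernel`, `pairwise_loss`, `greedy_spread_sample`, and the `bandwidth` parameter are hypothetical and do not come from the paper.

```python
import math
import random


def kernel(p, q, bandwidth=1.0):
    """Gaussian proximity between two 2-D points (an assumed choice of kernel)."""
    d2 = (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2
    return math.exp(-d2 / (2.0 * bandwidth ** 2))


def pairwise_loss(sample, bandwidth=1.0):
    """Assumed stand-in loss: sum of kernel values over all unordered pairs."""
    total = 0.0
    for i in range(len(sample)):
        for j in range(i + 1, len(sample)):
            total += kernel(sample[i], sample[j], bandwidth)
    return total


def greedy_spread_sample(points, k, bandwidth=1.0):
    """Greedily pick up to k points, each time adding the point that
    increases the pairwise loss the least (hypothetical helper)."""
    if k <= 0 or not points:
        return []
    chosen = [points[0]]  # arbitrary starting point
    taken = {0}
    # cost[i] = total kernel value between points[i] and the chosen points
    cost = [kernel(p, points[0], bandwidth) for p in points]
    while len(chosen) < min(k, len(points)):
        best = min((i for i in range(len(points)) if i not in taken),
                   key=lambda i: cost[i])
        taken.add(best)
        chosen.append(points[best])
        # Update every candidate's cost with its proximity to the new point.
        for i, p in enumerate(points):
            cost[i] += kernel(p, points[best], bandwidth)
    return chosen


if __name__ == "__main__":
    rng = random.Random(0)
    # Skewed synthetic data: one dense cluster plus a few scattered points.
    data = [(rng.gauss(0, 0.3), rng.gauss(0, 0.3)) for _ in range(500)]
    data += [(rng.uniform(-5, 5), rng.uniform(-5, 5)) for _ in range(20)]
    uniform = rng.sample(data, 20)
    spread = greedy_spread_sample(data, 20)
    print("uniform sample loss:", pairwise_loss(uniform))
    print("greedy sample loss: ", pairwise_loss(spread))
```

On skewed data such as the synthetic set above, a uniform sample tends to draw most of its points from the dense cluster, whereas a loss-driven selection spends its budget on regions that would otherwise be invisible in a scatter plot. The greedy loop costs O(nk) kernel evaluations; it is a simple baseline for illustration, not the algorithm evaluated in the paper.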

References (43)
Xing Xie, Wei-Ying Ma, Yu Zheng. GeoLife: A Collaborative Social Networking Service among User, Location and Trajectory. IEEE Data(base) Engineering Bulletin, vol. 33, pp. 32-39 (2010).
Albert Kim, Eric Blais, Aditya Parameswaran, Piotr Indyk, Sam Madden, Ronitt Rubinfeld. Rapid sampling for visualizations with ordering guarantees. Proceedings of the VLDB Endowment, vol. 8, pp. 521-532 (2015). DOI: 10.14778/2735479.2735485
G. L. Nemhauser, L. A. Wolsey, M. L. Fisher. An analysis of approximations for maximizing submodular set functions--I. Mathematical Programming, vol. 14, pp. 265-294 (1978). DOI: 10.1007/BF01588971
Jeffrey Heer, Sean Kandel. Interactive analysis of big data. ACM Crossroads Student Magazine, vol. 19, pp. 50-54 (2012). DOI: 10.1145/2331042.2331058
E. A. Nadaraya. On Estimating Regression. Theory of Probability and Its Applications, vol. 9, pp. 141-142 (1964). DOI: 10.1137/1109020
Joseph Cottam, Andrew Lumsdaine, Peter Wang. Overplotting: Unified solutions under Abstract Rendering. International Conference on Big Data, pp. 9-16 (2013). DOI: 10.1109/BIGDATA.2013.6691712
Zhicheng Liu, Jeffrey Heer. The Effects of Interactive Latency on Exploratory Visual Analysis. IEEE Transactions on Visualization and Computer Graphics, vol. 20, pp. 2122-2131 (2014). DOI: 10.1109/TVCG.2014.2346452
Mike Barnett, Badrish Chandramouli, Robert DeLine, Steven Drucker, Danyel Fisher, Jonathan Goldstein, Patrick Morrison, John Platt. Stat!: an interactive analytics environment for big data. International Conference on Management of Data, pp. 1013-1016 (2013). DOI: 10.1145/2463676.2463683
U. Feige, D. Peleg, G. Kortsarz. The Dense k-Subgraph Problem. Algorithmica, vol. 29, pp. 410-421 (2001). DOI: 10.1007/S004530010050
Brian Babcock, Surajit Chaudhuri, Gautam Das. Dynamic sample selection for approximate query processing. International Conference on Management of Data, pp. 539-550 (2003). DOI: 10.1145/872757.872822