A Sketch-based Sampling Algorithm on Sparse Data

作者: Trevor J. Hastie , Kenneth W. Church , Ping Li

DOI:

关键词: Hash functionSketchSparse approximationK-SVDSparse matrixRandomized algorithmHistogramComputer scienceSampling (statistics)Algorithm

摘要: We propose a sketch-based sampling algorithm, which effectively exploits the data sparsity. Sampling methods have become popular in large-scale mining and information retrieval, where high sparsity is norm. A distinct feature of our algorithm that it combines advantages both conventional random more modern randomized algorithms such as local sensitive hashing (LSH). While most are designed for specific summary statistics, proposed general purpose technique, useful estimating any statistics including two-way multi-way distances joint histograms.

参考文章(78)
Christos H. Papadimitriou, Hisao Tamaki, Prabhakar Raghavan, Santosh Vempala, Latent semantic indexing: a probabilistic analysis symposium on principles of database systems. pp. 159- 168 ,(1998) , 10.1145/275487.275505
Charu C. Aggarwal, Joel L. Wolf, Philip S. Yu, Cecilia Procopiuc, Jong Soo Park, Fast algorithms for projected clustering international conference on management of data. ,vol. 28, pp. 61- 72 ,(1999) , 10.1145/304181.304188
Andrei Z. Broder, Moses Charikar, Michael Mitzenmacher, A derandomization using min-wise independent permutations Journal of Discrete Algorithms. ,vol. 1, pp. 11- 20 ,(2003) , 10.1016/S1570-8667(03)00003-0
Sergey Brin, Lawrence Page, The anatomy of a large-scale hypertextual Web search engine the web conference. ,vol. 30, pp. 107- 117 ,(1998) , 10.1016/S0169-7552(98)00110-X
Warren R. Greiff, A theory of term weighting based on exploratory data analysis international acm sigir conference on research and development in information retrieval. pp. 11- 19 ,(1998) , 10.1145/290941.290948
Bengt Rosen, Asymptotic Theory for Successive Sampling with Varying Probabilities Without Replacement, II Annals of Mathematical Statistics. ,vol. 43, pp. 373- 397 ,(1972) , 10.1214/AOMS/1177692543
David Aldous, Persi Diaconis, Shuffling Cards and Stopping Times The American Mathematical Monthly. ,vol. 93, pp. 333- 348 ,(1986) , 10.2307/2323590
Andrei Z Broder, Moses Charikar, Alan M Frieze, Michael Mitzenmacher, Min-Wise Independent Permutations symposium on the theory of computing. ,vol. 60, pp. 630- 659 ,(2000) , 10.1006/JCSS.1999.1690
Bing Liu, Yiming Ma, Philip S. Yu, Discovering unexpected information from your competitors' web sites knowledge discovery and data mining. pp. 144- 153 ,(2001) , 10.1145/502512.502534
Ella Bingham, Heikki Mannila, Random projection in dimensionality reduction: applications to image and text data knowledge discovery and data mining. pp. 245- 250 ,(2001) , 10.1145/502512.502546