b-Bit Minwise Hashing in Practice: Large-Scale Batch and Online Learning and Using GPUs for Fast Preprocessing with Simple Hash Functions

作者： Anshumali Shrivastava , Arnd Christian Konig , Ping Li

DOI:

关键词:

摘要: In this paper, we study several critical issues which must be tackled before one can apply b-bit minwise hashing to the volumes of data often used industrial applications, especially in context search. 1. (b-bit) Minwise requires an expensive preprocessing step that computes k (e.g., 500) minimal values after applying corresponding permutations for each vector. We developed a parallelization scheme using GPUs and observed time reduced by factor 20-80 becomes substantially smaller than loading time. 2. One major advantage is it reduce amount memory required batch learning. However, as online algorithms become increasingly popular large-scale learning search, not clear if yields significant improvements them. This paper demonstrates $b$-bit provides effective size/dimension reduction hence dramatically epoch training process. because many 10 100) epochs reach sufficient accuracy. 3. Another issue very large sets impossible store (fully) random permutation matrix, due its space requirements. Our first demonstrate implemented simple hash functions, e.g., 2-universal (2U) 4-universal (4U) families, produce similar results fully permutations. Experiments on datasets up 200GB are presented.

参考文章(35)

Léon Bottou, Large-Scale Machine Learning with Stochastic Gradient Descent Proceedings of COMPSTAT'2010. pp. 177- 186 ,(2010) , 10.1007/978-3-7908-2604-3_16

Mikkel Thorup, Yin Zhang, Tabulation based 5-universal hashing and linear probing algorithm engineering and experimentation. pp. 62- 76 ,(2010)

Martin Dietzfelbinger, Universal Hashing and k-Wise Independent Random Variables via Integer Arithmetic without Primes symposium on theoretical aspects of computer science. pp. 569- 580 ,(1996) , 10.1007/3-540-60922-9_46

Wei Xu, Towards Optimal One Pass Large Scale Learning with Averaged Stochastic Gradient Descent arXiv: Learning. ,(2011)

Mihai Pǎtraşcu, Mikkel Thorup, On the k-Independence Required by Linear Probing and Minwise Independence Automata, Languages and Programming. pp. 715- 726 ,(2010) , 10.1007/978-3-642-14165-2_60

Kai-Wei Chang, Dan Roth, Selective block minimization for faster convergence of limited memory large-scale linear models Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining - KDD '11. pp. 699- 707 ,(2011) , 10.1145/2020408.2020517

Andrei Z. Broder, Moses Charikar, Alan M. Frieze, Michael Mitzenmacher, Min-wise independent permutations (extended abstract) symposium on the theory of computing. pp. 327- 336 ,(1998) , 10.1145/276698.276781

Piotr Indyk, A small approximately min-wise independent family of hash functions symposium on discrete algorithms. ,vol. 38, pp. 454- 456 ,(1999) , 10.1006/JAGM.2000.1131

Ludmila Cherkasova, Kave Eshghi, Charles B. Morrey, Joseph Tucek, Alistair Veitch, Applying syntactic similarity algorithms for enterprise information management Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining - KDD '09. pp. 1087- 1096 ,(2009) , 10.1145/1557019.1557137

10.

Massimiliano Ciaramita, Vanessa Murdock, Vassilis Plachouras, Online learning from click data for sponsored search Proceeding of the 17th international conference on World Wide Web - WWW '08. pp. 227- 236 ,(2008) , 10.1145/1367497.1367529

b-Bit Minwise Hashing in Practice: Large-Scale Batch and Online Learning and Using GPUs for Fast Preprocessing with Simple Hash Functions

来源期刊

我的账户

b-Bit Minwise Hashing in Practice: Large-Scale Batch and Online Learning and Using GPUs for Fast Preprocessing with Simple Hash Functions

来源期刊

相似文章 6

b-Bit Minwise Hashing for Large-Scale Learning

GPU-based minwise hashing: GPU-based minwise hashing

Min-wise hashing for large-scale regression and classication with sparse data

One Permutation Hashing

On b-bit min-wise hashing for large-scale regression and classification with sparse data

Probabilistic Hashing Techniques For Big Data

我的账户