b-Bit Minwise Hashing in Practice: Large-Scale Batch and Online Learning and Using GPUs for Fast Preprocessing with Simple Hash Functions

作者: Anshumali Shrivastava , Arnd Christian Konig , Ping Li

DOI:

关键词:

摘要: In this paper, we study several critical issues which must be tackled before one can apply b-bit minwise hashing to the volumes of data often used industrial applications, especially in context search. 1. (b-bit) Minwise requires an expensive preprocessing step that computes k (e.g., 500) minimal values after applying corresponding permutations for each vector. We developed a parallelization scheme using GPUs and observed time reduced by factor 20-80 becomes substantially smaller than loading time. 2. One major advantage is it reduce amount memory required batch learning. However, as online algorithms become increasingly popular large-scale learning search, not clear if yields significant improvements them. This paper demonstrates $b$-bit provides effective size/dimension reduction hence dramatically epoch training process. because many 10 100) epochs reach sufficient accuracy. 3. Another issue very large sets impossible store (fully) random permutation matrix, due its space requirements. Our first demonstrate implemented simple hash functions, e.g., 2-universal (2U) 4-universal (4U) families, produce similar results fully permutations. Experiments on datasets up 200GB are presented.

参考文章(35)
Léon Bottou, Large-Scale Machine Learning with Stochastic Gradient Descent Proceedings of COMPSTAT'2010. pp. 177- 186 ,(2010) , 10.1007/978-3-7908-2604-3_16
Mikkel Thorup, Yin Zhang, Tabulation based 5-universal hashing and linear probing algorithm engineering and experimentation. pp. 62- 76 ,(2010)
Martin Dietzfelbinger, Universal Hashing and k-Wise Independent Random Variables via Integer Arithmetic without Primes symposium on theoretical aspects of computer science. pp. 569- 580 ,(1996) , 10.1007/3-540-60922-9_46
Mihai Pǎtraşcu, Mikkel Thorup, On the k-Independence Required by Linear Probing and Minwise Independence Automata, Languages and Programming. pp. 715- 726 ,(2010) , 10.1007/978-3-642-14165-2_60
Kai-Wei Chang, Dan Roth, Selective block minimization for faster convergence of limited memory large-scale linear models Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining - KDD '11. pp. 699- 707 ,(2011) , 10.1145/2020408.2020517
Andrei Z. Broder, Moses Charikar, Alan M. Frieze, Michael Mitzenmacher, Min-wise independent permutations (extended abstract) symposium on the theory of computing. pp. 327- 336 ,(1998) , 10.1145/276698.276781
Piotr Indyk, A small approximately min-wise independent family of hash functions symposium on discrete algorithms. ,vol. 38, pp. 454- 456 ,(1999) , 10.1006/JAGM.2000.1131
Ludmila Cherkasova, Kave Eshghi, Charles B. Morrey, Joseph Tucek, Alistair Veitch, Applying syntactic similarity algorithms for enterprise information management Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining - KDD '09. pp. 1087- 1096 ,(2009) , 10.1145/1557019.1557137
Massimiliano Ciaramita, Vanessa Murdock, Vassilis Plachouras, Online learning from click data for sponsored search Proceeding of the 17th international conference on World Wide Web - WWW '08. pp. 227- 236 ,(2008) , 10.1145/1367497.1367529