Authors: Anshumali Shrivastava, Arnd Christian Konig, Ping Li
DOI:
Keywords:
Abstract: In this paper, we study several critical issues which must be tackled before one can apply b-bit minwise hashing to the volumes of data often used in industrial applications, especially in the context of search. 1. (b-bit) Minwise hashing requires an expensive preprocessing step that computes k (e.g., 500) minimal values after applying the corresponding permutations to each data vector. We developed a parallelization scheme using GPUs and observed that the preprocessing time can be reduced by a factor of 20-80 and becomes substantially smaller than the data loading time. 2. One major advantage of b-bit minwise hashing is that it can substantially reduce the amount of memory required for batch learning. However, as online algorithms become increasingly popular for large-scale learning in the context of search, it is not clear whether b-bit minwise hashing yields significant improvements for them as well. This paper demonstrates that b-bit minwise hashing provides an effective data size/dimension reduction scheme and hence can dramatically reduce the data loading time for each epoch of the online training process. This is significant because online learning often requires many (e.g., 10 to 100) epochs to reach sufficient accuracy. 3. Another critical issue is that for very large data sets it becomes impossible to store a (fully) random permutation matrix, due to its space requirements. We are the first to demonstrate that b-bit minwise hashing implemented using simple hash functions, e.g., the 2-universal (2U) and 4-universal (4U) hash families, produces learning results very similar to those obtained with fully random permutations. Experiments on datasets of up to 200GB are presented.
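To make the technique concrete, the following is a minimal illustrative sketch (not the authors' code) of b-bit minwise hashing where, as the abstract describes, cheap 2-universal (2U) hash functions stand in for fully random permutations; all function names, the choice of prime, and the parameter defaults here are assumptions for illustration.

```python
import random

# Mersenne prime commonly used for 2U hashing over 32-bit keys (an
# illustrative choice, not taken from the paper).
P = (1 << 31) - 1

def make_2u_hash(rng):
    # h(x) = (a*x + c) mod P with random a in [1, P), c in [0, P)
    # is a standard 2-universal family.
    a = rng.randrange(1, P)
    c = rng.randrange(0, P)
    return lambda x: (a * x + c) % P

def b_bit_minhash(feature_set, k=200, b=2, seed=0):
    """Return k signature entries for a set of integer feature IDs,
    keeping only the lowest b bits of each minimum hash value
    (the core b-bit minwise idea)."""
    rng = random.Random(seed)
    sigs = []
    for _ in range(k):
        h = make_2u_hash(rng)
        m = min(h(x) for x in feature_set)
        sigs.append(m & ((1 << b) - 1))  # store only b bits per hash
    return sigs

# Highly similar sets agree on most b-bit signature entries, which is
# what lets the signatures serve as a compact feature representation.
s1 = {1, 5, 9, 12, 40}
s2 = {1, 5, 9, 12, 41}
matches = sum(x == y for x, y in zip(b_bit_minhash(s1), b_bit_minhash(s2)))
print(matches, "of 200 entries match")
```

Each signature entry costs only b bits instead of, say, 32, which is the source of the memory and data-loading reductions the abstract reports for batch and online learning.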