Authors: Wei Tan, Liana L. Fong, Yun Liang, Xiaolong Xie
DOI:
Keywords:
Abstract: Matrix factorization (MF) has been widely used in, e.g., recommender systems, topic modeling and word embedding. Stochastic gradient descent (SGD) is popular for solving MF problems because it can deal with large data sets and is easy to use for incremental learning. We observed that SGD for MF is memory bound. Meanwhile, single-node CPU systems with caching perform well only on small data sets; distributed systems have higher aggregated memory bandwidth but suffer from relatively slow network connections. This observation inspires us to accelerate MF by utilizing GPUs' high memory bandwidth and fast intra-node connection. We present cuMF_SGD, a CUDA-based SGD solution for large-scale MF problems. On a single GPU, we design two workload scheduling schemes, i.e., batch-Hogwild! and wavefront-update, that fully exploit the massive number of cores. In particular, batch-Hogwild!, as a vectorized version of Hogwild!, overcomes the issue of memory discontinuity. We also develop highly-optimized kernels for the SGD update, leveraging cache, warp-shuffle instructions and half-precision floats. We further design a partition scheme to utilize multiple GPUs while addressing the well-known convergence issue when parallelizing SGD. On three data sets with one Maxwell or Pascal GPU, cuMF_SGD runs 3.1X-28.2X as fast compared with state-of-the-art solutions on 1-64 CPU nodes. Evaluations also show that cuMF_SGD scales well to large data sets.
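To make the abstract's core operation concrete, below is a minimal CUDA sketch of one lock-free (Hogwild!-style) SGD update for MF, using a warp-shuffle reduction for the dot product. It assumes k = 32 latent features so one warp owns one rating sample; all names (sgd_update_kernel, P, Q, lr, lambda) are illustrative, not cuMF_SGD's actual API, and the real kernels additionally use half-precision storage and the batch-Hogwild!/wavefront-update scheduling described above.

```cuda
#include <cuda_runtime.h>

// One SGD step per observed rating r_uv:
//   e   = r_uv - p_u . q_v
//   p_u += lr * (e * q_v - lambda * p_u)
//   q_v += lr * (e * p_u - lambda * q_v)
// Sketch only: assumes k == 32 so each lane holds one feature.
__global__ void sgd_update_kernel(const int*   user_idx,  // user id per sample
                                  const int*   item_idx,  // item id per sample
                                  const float* ratings,   // observed r_uv
                                  float* P, float* Q,     // m x k and n x k factors, row-major
                                  int n_samples, int k,
                                  float lr, float lambda) {
    int warp_id = (blockIdx.x * blockDim.x + threadIdx.x) / 32;
    int lane    = threadIdx.x % 32;
    if (warp_id >= n_samples) return;

    int u = user_idx[warp_id];
    int v = item_idx[warp_id];
    float p = P[u * k + lane];  // this lane's feature of p_u
    float q = Q[v * k + lane];  // this lane's feature of q_v

    // Warp-shuffle reduction for p_u . q_v (no shared memory needed).
    float dot = p * q;
    for (int offset = 16; offset > 0; offset >>= 1)
        dot += __shfl_down_sync(0xffffffff, dot, offset);
    dot = __shfl_sync(0xffffffff, dot, 0);  // broadcast lane 0's sum

    float e = ratings[warp_id] - dot;

    // Hogwild!-style lock-free writes: occasional races are tolerated.
    P[u * k + lane] = p + lr * (e * q - lambda * p);
    Q[v * k + lane] = q + lr * (e * p - lambda * q);
}
```

A host launch such as sgd_update_kernel<<<(n_samples * 32 + 255) / 256, 256>>>(...) assigns one warp per rating sample; repeating the launch over shuffled samples yields one SGD epoch.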