Matrix Factorization on GPUs with Memory Optimization and Approximate Computing

Authors: Wei Tan, Shiyu Chang, Liana Fong, Cheng Li, Zijun Wang

DOI: 10.1145/3225058.3225096

Abstract: Matrix factorization (MF) discovers latent features from observations, and has shown great promise in collaborative filtering, data compression, feature extraction, word embedding, and other fields. While many problem-specific optimization techniques have been proposed, alternating least squares (ALS) remains popular due to its general applicability (e.g., it easily handles positive-unlabeled inputs), fast convergence, and amenability to parallelization. Current MF implementations are optimized either for a single machine, which provides limited compute power for large-scale problems, or for a cluster of machines, which suffers from network communication bottlenecks. To address this challenge, accelerating ALS on graphics processing units (GPUs) is a promising direction. We propose a novel approach that enhances efficiency via both memory optimization and approximate computing. The former exploits the GPU memory hierarchy to increase data reuse, while the latter reduces unnecessary computing without hurting the convergence of the learning algorithms. Extensive experiments on large-scale datasets show that our solution not only outperforms competing CPU solutions by a large margin but also achieves a 2x-4x performance gain over state-of-the-art GPU solutions. Our implementation is open-sourced and publicly available.
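For context, the ALS procedure the abstract refers to alternates closed-form regularized least-squares solves for the two factor matrices. A minimal, dense, single-threaded NumPy sketch is shown below; it illustrates the algorithm only and is not the authors' GPU implementation (the function name `als` and all parameter choices are ours):

```python
import numpy as np

def als(R, mask, f=8, lam=0.1, iters=10, seed=0):
    """Illustrative alternating least squares for R ~ X @ Y.T.

    R    : m x n matrix of observations
    mask : m x n boolean matrix marking observed entries
    f    : number of latent features
    lam  : L2 regularization weight
    """
    rng = np.random.default_rng(seed)
    m, n = R.shape
    X = rng.standard_normal((m, f)) * 0.1  # user factors
    Y = rng.standard_normal((n, f)) * 0.1  # item factors
    I = lam * np.eye(f)
    for _ in range(iters):
        # Fix Y, solve a small f x f system per row of X.
        for u in range(m):
            obs = mask[u]
            Yu = Y[obs]
            X[u] = np.linalg.solve(Yu.T @ Yu + I, Yu.T @ R[u, obs])
        # Fix X, solve a small f x f system per row of Y.
        for i in range(n):
            obs = mask[:, i]
            Xi = X[obs]
            Y[i] = np.linalg.solve(Xi.T @ Xi + I, Xi.T @ R[obs, i])
    return X, Y
```

Because each per-row solve is independent and the inner products `Yu.T @ Yu` dominate the cost, the two loops expose exactly the parallelism and memory-reuse opportunities that a GPU implementation can exploit.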
