Authors: Wei Tan, Shiyu Chang, Liana Fong, Cheng Li, Zijun Wang
Keywords:
Abstract: Matrix factorization (MF) discovers latent features from observations and has shown great promise in collaborative filtering, data compression, feature extraction, word embedding, etc. While many problem-specific optimization techniques have been proposed, alternating least squares (ALS) remains popular due to its general applicability (e.g., it easily handles positive-unlabeled inputs), fast convergence, and parallelization capability. Current MF implementations are either optimized for a single machine or require a large computer cluster, yet both remain insufficient: a single machine provides limited compute power for large-scale problems, while multiple machines suffer from a network communication bottleneck. To address this challenge, accelerating ALS on graphics processing units (GPUs) is a promising direction. We propose a novel approach that enhances efficiency via both memory optimization and approximate computing. The former exploits the GPU memory hierarchy to increase data reuse, while the latter reduces unnecessary computing without hurting the convergence of the learning algorithms. Extensive experiments on large-scale datasets show that our solution not only outperforms competing CPU solutions by a large margin but also achieves a 2x-4x performance gain compared with state-of-the-art GPU solutions. Our implementation is open-sourced and publicly available.