Authors: Jing Chen, Jianbin Fang, Weifeng Liu, Tao Tang, Xuhao Chen
Keywords: Thread (computing), Recommender system, Sparse matrix, Instruction set, Solver, Software portability, Parallel computing, Matrix decomposition, Computer science, Speedup
Abstract: Alternating least squares (ALS) has proven to be an effective solver for matrix factorization in recommender systems. To speed up factorization, various parallel ALS solvers have been proposed to leverage modern multi-core CPUs and many-core GPUs/MICs. Existing implementations are limited in either speed or portability (they are constrained to certain platforms). In this paper, we present an efficient and portable ALS solver. On the one hand, we diagnose the baseline implementation and observe that it lacks awareness of the hierarchical thread organization on the hardware. To achieve high performance, we apply a batching technique and three architecture-specific optimizations. On the other hand, we implement the solver in OpenCL so that it can run on various platforms (CPUs, GPUs, and MICs). Based on the architectural specifics, we select a suitable code variant for each platform, efficiently mapping it to the underlying hardware. The experimental results show that our implementation performs 5.5× faster on a 16-core CPU and 21.2× faster on a K20c than the baseline implementation. Our implementation also outperforms cuMF on various datasets.