Authors: Jing Chen, Jianbin Fang, Weifeng Liu, Tao Tang, Canqun Yang
DOI: 10.1016/J.FUTURE.2018.04.071
Keywords: Computer science, Parallel computing, Matrix decomposition, Solver, Factorization, Leverage (statistics), Linear algebra, Speedup
Abstract: Alternating least squares (ALS) has proved to be an effective solver for matrix factorization in recommender systems. To speed up factorization, various parallel ALS solvers have been proposed to leverage modern multi-cores and many-cores. Existing implementations are limited in either performance or portability. In this paper, we present an efficient and portable ALS solver (clMF) for recommender systems. On one hand, we diagnose the baseline implementation and observe that it lacks awareness of the hierarchical thread organization on modern hardware. To achieve high performance, we apply a batching technique, a fine-grained tiling technique, and three architecture-specific optimizations. On the other hand, we implement the ALS solver in OpenCL so that it can run on various platforms (CPUs, GPUs, and MICs). Based on the architectural specifics, we select a suitable code variant for each platform to efficiently map it to the underlying hardware. The experimental results show that our implementation performs 2.8×–15.7× faster on an Intel 16-core CPU, 23.9×–87.9× faster on an NVIDIA K20C GPU, and 34.6×–97.1× faster on an AMD Fury X GPU than the baseline implementation. On the K20C GPU, it also outperforms cuMF over different numbers of latent features, ranging from 10 to 100, with real-world recommendation datasets.