Fast Conjugate Gradients with Multiple GPUs

Authors: Ali Cevahir, Akira Nukada, Satoshi Matsuoka

DOI: 10.1007/978-3-642-01970-8_90

Abstract: The limiting factor for the efficiency of sparse linear solvers is memory bandwidth. In this work, we describe a fast Conjugate Gradient (CG) solver for unstructured problems, which runs on multiple GPUs installed on a single mainboard. The solver achieves double precision accuracy with single precision GPUs, using a mixed precision iterative refinement algorithm. To achieve high computation speed, we propose a fast sparse matrix-vector multiplication algorithm, which is the core operation of iterative solvers. The proposed multiplication algorithm efficiently utilizes GPU resources via caching, coalesced memory accesses, and load balance between running threads. Experiments on a wide range of matrices show that our matrix-vector multiplication algorithm achieves up to 11.6 Gflops on a single GeForce 8800 GTS card, and our CG implementation achieves up to 24.6 Gflops with four GPUs.
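The mixed precision iterative refinement idea named in the abstract can be illustrated with a small CPU-only sketch. This is not the authors' GPU implementation: it assumes a CSR matrix layout, simulates single precision by rounding through `struct`, and models none of the caching, coalescing, or multi-GPU load balancing the abstract describes. All function names (`f32`, `csr_matvec`, `cg_single`, `refine`, `laplacian_1d`) are hypothetical and chosen only for illustration.

```python
import struct

def f32(x):
    """Round a Python float (double) to the nearest single precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def identity(x):
    return x

def csr_matvec(indptr, indices, data, x, rnd=identity):
    """y = A x for a CSR matrix; rnd is applied to every intermediate result."""
    y = []
    for row in range(len(indptr) - 1):
        acc = 0.0
        for k in range(indptr[row], indptr[row + 1]):
            acc = rnd(acc + rnd(data[k] * x[indices[k]]))
        y.append(acc)
    return y

def dot(u, v, rnd=identity):
    acc = 0.0
    for a, b in zip(u, v):
        acc = rnd(acc + rnd(a * b))
    return acc

def cg_single(indptr, indices, data, b, tol=1e-4, max_iter=200):
    """Inner solver: plain CG with all arithmetic rounded to single precision."""
    data32 = [f32(v) for v in data]
    x = [0.0] * len(b)
    r = [f32(v) for v in b]
    p = r[:]
    rs = rs0 = dot(r, r, f32)
    if rs0 == 0.0:
        return x
    for _ in range(max_iter):
        Ap = csr_matvec(indptr, indices, data32, p, f32)
        pAp = dot(p, Ap, f32)
        if pAp == 0.0:
            break
        alpha = f32(rs / pAp)
        x = [f32(xi + f32(alpha * pi)) for xi, pi in zip(x, p)]
        r = [f32(ri - f32(alpha * ai)) for ri, ai in zip(r, Ap)]
        rs_new = dot(r, r, f32)
        if rs_new <= tol * tol * rs0:
            break
        beta = f32(rs_new / rs)
        p = [f32(ri + f32(beta * pi)) for ri, pi in zip(r, p)]
        rs = rs_new
    return x

def refine(indptr, indices, data, b, tol=1e-12, max_outer=20):
    """Outer loop: double precision residual, single precision correction."""
    x = [0.0] * len(b)
    b_norm = dot(b, b) ** 0.5
    for outer in range(max_outer):
        Ax = csr_matvec(indptr, indices, data, x)
        r = [bi - ai for bi, ai in zip(b, Ax)]       # residual in double
        if dot(r, r) ** 0.5 <= tol * b_norm:
            return x, outer
        d = cg_single(indptr, indices, data, r)      # solve A d = r in single
        x = [xi + di for xi, di in zip(x, d)]        # update in double
    return x, max_outer

def laplacian_1d(n):
    """Tridiagonal (-1, 2, -1) test matrix in CSR form; symmetric positive definite."""
    indptr, indices, data = [0], [], []
    for i in range(n):
        for j, v in ((i - 1, -1.0), (i, 2.0), (i + 1, -1.0)):
            if 0 <= j < n:
                indices.append(j)
                data.append(v)
        indptr.append(len(indices))
    return indptr, indices, data
```

Calling `refine(*laplacian_1d(20), b)` for a right-hand side `b` returns the solution and the number of outer iterations used. The design point is that only the cheap residual and update steps run in double precision, while the expensive inner CG iterations, dominated by sparse matrix-vector products, run entirely in single precision.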
