A High Performance and Memory Efficient LU Decomposer on FPGAs

Authors: Guiming Wu, Yong Dou, Junqing Sun, Gregory D. Peterson

DOI: 10.1109/TC.2010.278

Keywords: Matrix (mathematics), Hardware architecture, FLOPS, Field-programmable gate array, Block LU decomposition, Computer science, Loop tiling, Parallel computing, LU decomposition, Matrix decomposition

Abstract: LU decomposition for dense matrices is an important linear algebra kernel that is widely used in both scientific and engineering applications. To efficiently perform large matrix LU decomposition on FPGAs with limited local memory, a block LU decomposition algorithm applicable to arbitrary matrix size is proposed. Our algorithm applies a series of transformations, including loop blocking and space-time mapping, onto sequential nonblocking LU decomposition. We also introduce a high performance and memory efficient hardware architecture, which mainly consists of a linear array of processing elements (PEs), to implement our block LU decomposition algorithm. Our design can achieve optimum performance under various hardware resource constraints. Furthermore, our algorithm and design can be easily extended to the multi-FPGA platform by using a block-cyclic data distribution and an inter-FPGA communication scheme. A total of 36 PEs can be integrated into a Xilinx Virtex-5 XC5VLX330 FPGA on our self-designed PCI-Express card, reaching a sustained performance of 8.50 GFLOPS at 133 MHz for a matrix size of 16,384, which outperforms several general-purpose processors. For a Xilinx Virtex-6 XC6VLX760, a newer FPGA, we predict that a total of 180 PEs can be integrated, reaching 70.66 GFLOPS at 200 MHz. Compared to the previous work, our design can integrate twice the number of PEs into the same FPGA and has significantly higher performance.
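
The loop blocking mentioned in the abstract can be illustrated with a minimal software sketch. The code below is not the authors' hardware algorithm. It is a generic right-looking block LU factorization without pivoting, written in plain Python, which shows how the factorization can be split into steps that each touch only a few blocks at a time. The function names (`lu_unblocked`, `block_lu`, `reconstruct`) and the block-size parameter `b` are hypothetical and chosen for illustration only. The sketch assumes the matrix needs no pivoting (for example, a diagonally dominant matrix).

```python
def lu_unblocked(a, lo, hi):
    """In-place LU (no pivoting) of the diagonal block a[lo:hi][lo:hi]."""
    for k in range(lo, hi):
        for i in range(k + 1, hi):
            a[i][k] /= a[k][k]                    # multiplier, stored as L[i][k]
            for j in range(k + 1, hi):
                a[i][j] -= a[i][k] * a[k][j]


def block_lu(a, b):
    """In-place blocked LU of a square list-of-lists matrix with block size b.

    Afterwards the strict lower triangle holds L (unit diagonal implied)
    and the upper triangle holds U. Assumes no pivoting is required.
    """
    n = len(a)
    for k0 in range(0, n, b):
        k1 = min(k0 + b, n)                       # handles n not divisible by b
        # Step 1: factor the diagonal block A11 = L11 * U11.
        lu_unblocked(a, k0, k1)
        # Step 2: row panel, U12 = L11^{-1} * A12 (forward substitution).
        for j in range(k1, n):
            for k in range(k0, k1):
                for i in range(k + 1, k1):
                    a[i][j] -= a[i][k] * a[k][j]
        # Step 3: column panel, L21 = A21 * U11^{-1}.
        for i in range(k1, n):
            for k in range(k0, k1):
                a[i][k] /= a[k][k]
                for j in range(k + 1, k1):
                    a[i][j] -= a[i][k] * a[k][j]
        # Step 4: trailing update, A22 -= L21 * U12 (a matrix multiply).
        for i in range(k1, n):
            for j in range(k1, n):
                s = 0.0
                for k in range(k0, k1):
                    s += a[i][k] * a[k][j]
                a[i][j] -= s
    return a


def reconstruct(lu):
    """Multiply the packed L and U factors back together (for checking)."""
    n = len(lu)
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(min(i, j) + 1):
                l_ik = 1.0 if k == i else lu[i][k]
                out[i][j] += l_ik * lu[k][j]
    return out
```

In this sketch, step 4 is a matrix multiplication and accounts for most of the arithmetic when the matrix is much larger than the block. Each step reads and writes only the blocks of the current panel, so the working set is bounded by the block size rather than the full matrix size, which is the property that makes blocking attractive when local memory is limited.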
