Authors: Guiming Wu, Yong Dou, Junqing Sun, Gregory D. Peterson
DOI: 10.1109/TC.2010.278
Keywords: Matrix (mathematics), Hardware architecture, FLOPS, Field-programmable gate array, Block LU decomposition, Computer science, Loop tiling, Parallel computing, LU decomposition, Matrix decomposition
Abstract: LU decomposition for dense matrices is an important linear algebra kernel that is widely used in both scientific and engineering applications. To efficiently perform large matrix LU decomposition on FPGAs with limited local memory, a block LU decomposition algorithm applicable to arbitrary matrix size is proposed. Our algorithm applies a series of transformations, including loop blocking and space-time mapping, onto sequential nonblocking LU decomposition. We also introduce a high performance and memory efficient hardware architecture, which mainly consists of an array of processing elements (PEs), to implement our block LU decomposition algorithm. Our design can achieve optimum performance under various hardware resource constraints. Furthermore, our algorithm and design can be easily extended to the multi-FPGA platform by using a block-cyclic data distribution and an inter-FPGA communication scheme. A total of 36 PEs can be integrated into a Xilinx Virtex-5 XC5VLX330 FPGA on our self-designed PCI-Express card, reaching a sustained performance of 8.50 GFLOPS at 133 MHz for a matrix size of 16,384, which outperforms several general-purpose processors. For a Xilinx Virtex-6 XC6VLX760, a newer FPGA, we predict that a total of 180 PEs can be integrated, reaching 70.66 GFLOPS at 200 MHz. Compared to the previous work, our design can integrate twice the number of PEs into the same FPGA and has significantly higher performance.
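
Illustration: the loop blocking mentioned in the abstract can be sketched in software as a right-looking blocked LU factorization. This is a minimal sketch under stated assumptions, not the paper's hardware design: it assumes no pivoting (the abstract does not say whether pivoting is used), it runs sequentially rather than on a PE array, and the names `block_lu`, `reconstruct`, and `block_cyclic_owner` are hypothetical. The last helper shows the usual one-dimensional block-cyclic rule, which may differ from the distribution actually used across FPGAs.

```python
def block_lu(a, b):
    """Blocked LU factorization in place, without pivoting (assumption).

    a: square matrix as a list of lists of floats.
    b: block size; it does not need to divide the matrix size.
    On return, the strict lower triangle of a holds L (unit diagonal
    implied) and the upper triangle holds U.
    """
    n = len(a)
    for k0 in range(0, n, b):
        k1 = min(k0 + b, n)  # last block may be smaller than b
        # Step 1: unblocked LU on the panel (columns k0..k1-1, rows k0..n-1).
        for k in range(k0, k1):
            pivot = a[k][k]
            for i in range(k + 1, n):
                a[i][k] /= pivot
                for j in range(k + 1, k1):
                    a[i][j] -= a[i][k] * a[k][j]
        # Step 2: block row of U, forward substitution with the unit-lower L11.
        for k in range(k0, k1):
            for i in range(k + 1, k1):
                for j in range(k1, n):
                    a[i][j] -= a[i][k] * a[k][j]
        # Step 3: trailing submatrix update, A22 -= L21 * U12.
        for i in range(k1, n):
            for k in range(k0, k1):
                lik = a[i][k]
                for j in range(k1, n):
                    a[i][j] -= lik * a[k][j]
    return a


def reconstruct(lu):
    """Multiply the packed L and U factors back together (for checking)."""
    n = len(lu)
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            total = 0.0
            for k in range(min(i, j) + 1):
                l_ik = 1.0 if k == i else lu[i][k]
                total += l_ik * lu[k][j]
            out[i][j] = total
    return out


def block_cyclic_owner(block_index, num_devices):
    """Standard 1-D block-cyclic rule: block i goes to device i mod P."""
    return block_index % num_devices


if __name__ == "__main__":
    m = [[4.0, 3.0, 2.0, 1.0],
         [8.0, 10.0, 7.0, 5.0],
         [12.0, 15.0, 20.0, 9.0],
         [4.0, 9.0, 11.0, 30.0]]
    lu = block_lu([row[:] for row in m], 2)
    print(reconstruct(lu))  # should match m up to rounding
```

Only the current panel and one block row need to be resident while the trailing submatrix is updated, which is the kind of locality a device with limited local memory relies on.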