Authors: P. E. Bjørstad, T. Sørevik
DOI: 10.1007/978-94-015-8196-7_2
Keywords:
Abstract: We consider a data-parallel implementation of LU-factorization based on the LAPACK routine DGETRF. We analyze the performance of the required BLAS routines and show that high performance is inhibited by current compiler limitations. In particular, we show that optimal data movement when performing rank-1 updates is crucial. The rank-1 update is available as a BLAS-2 routine and can also easily be expressed using the intrinsic SPREAD in Fortran 90. However, in order to minimize processor communication, this operation should be explicitly inlined in the computational kernels. Using this observation, we identify the need for an explicit implementation of the LU-factorization applied to a single block. With the freedom to adjust the block-size to the hardware, this is a much simpler task than writing the full code in a low level machine language. By extension, high performance is achievable without modifying the block structure of the LAPACK routine. We expect similar observations to hold for other modules of LAPACK.
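As a rough illustration of the single-block kernel the abstract refers to, the following is a minimal sketch of an unblocked LU-factorization with partial pivoting in which the rank-1 update is written inline in the elimination loop rather than delegated to a separate BLAS-2 call. It is plain serial Python, not the authors' data-parallel Fortran 90 code; the names `lu_unblocked` and `reconstruct` are hypothetical and chosen only for this example.

```python
def lu_unblocked(a):
    """Factor a square matrix with partial pivoting.

    Returns (lu, piv): lu holds the multipliers of L below the diagonal and
    U on and above it; piv[i] is the original index of the row now at row i.
    Illustrative sketch only, not the routine described in the paper.
    """
    n = len(a)
    lu = [row[:] for row in a]  # work on a copy of the input
    piv = list(range(n))
    for k in range(n):
        # Partial pivoting: largest entry in column k at or below the diagonal.
        p = max(range(k, n), key=lambda i: abs(lu[i][k]))
        if lu[p][k] == 0.0:
            raise ValueError("matrix is singular")
        if p != k:
            lu[k], lu[p] = lu[p], lu[k]
            piv[k], piv[p] = piv[p], piv[k]
        pivot_row = lu[k]
        for i in range(k + 1, n):
            row = lu[i]
            row[k] /= pivot_row[k]  # multiplier l_ik, stored in place
            m = row[k]
            # Inlined rank-1 update of the trailing submatrix:
            # A[i, j] -= l_ik * u_kj for all j > k.
            for j in range(k + 1, n):
                row[j] -= m * pivot_row[j]
    return lu, piv


def reconstruct(lu):
    """Multiply the packed factors back together, returning L * U."""
    n = len(lu)
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            s = 0.0
            for k in range(min(i, j) + 1):
                l_ik = 1.0 if k == i else lu[i][k]  # unit diagonal of L
                s += l_ik * lu[k][j]
            out[i][j] = s
    return out
```

In a blocked routine of the DGETRF kind, a kernel like this would be applied to one block at a time, with the block size left as the parameter to tune for the hardware.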