Provably Efficient Algorithms for Numerical Tensor Algebra

作者: Edgar Solomonik

DOI:

关键词:

摘要: Author(s): Solomonik, Edgar | Advisor(s): Demmel, James Abstract: This thesis targets the design of parallelizable algorithms and communication-efficient parallel schedules for numerical linear algebra as well computations with higher-order tensors. Communication is a growing bottleneck in execution most on computers, which manifests itself data movement both through network connecting different processors memory hierarchy each processor synchronization between processors. We provide rigorous theoretical model communication derive lower bounds this model. Our analysis concerns two broad areas tensor contractions. demonstrate practical quality new theoretically-improved by presenting results show that our implementations outperform standard libraries traditional algorithms. costs associated local computation, interprocessor synchronization, to cache transfers schedule based expensive path schedule. introduce technique deriving tradeoffs these apply them dense sparse graph These are attained what we refer 2.5D algorithms, give matrix multiplication, Gaussian elimination, QR factorization, symmetric eigenvalue problem, Floyd-Warshall all-pairs shortest-paths algorithm. achieve bandwidth cost exploiting auxiliary memory. Algorithms employing known have been derived BSP LU alternate versions measurable performance improvements over their counterparts, first evaluations performance. also explore network-topology-aware mapping torus networks multiplication LU, showing how can efficiently exploit collective communication, introducing an adaptation Cannon's algorithm better suited dimension larger than two. For additionally solving challenges memory-bandwidth efficiency arise problem. efficient Krylov subspace methods (repeated vector sparse-matrix), motivated application bound techniques The latter half contains tensors, particular motivating work family coupled-cluster methods, solve many-body Schrodinger equation chemically-accurate electronic structure molecules chemical reactions where electron correlation plays significant role. computation dominated contraction antisymmetric Cyclops Tensor Framework, provides automated mechanism decomposition redistribution data. It leverages perform contractions communication-efficiently. framework capable symmetry antisymmetry tensors utilizes distributed packed-symmetric storage format. Finally, consider theoretically novel number multiplications necessary via computing some redundant terms allow preservation then cancelling out low-order cost. analyze stability adaptations Hermitian matrices. has promising potential accelerating cost, improvement BLAS routines complex

参考文章(141)
Sartaj Sahni, Jing-Fu Jenq, ALL PAIRS SHORTEST PATHS ON A HYPERCUBE MULTIPROCESSOR. international conference on parallel processing. pp. 713- 716 ,(1987)
J. A. Pople, R. K. Nesbet, Self‐Consistent Orbitals for Radicals The Journal of Chemical Physics. ,vol. 22, pp. 571- 572 ,(1954) , 10.1063/1.1740120
Xiaobai Sun, C. Bischof, On orthogonal block elimination ,(1996)
Edgar Solomonik, James Demmel, Communication-optimal parallel 2.5D matrix multiplication and LU factorization algorithms international conference on parallel processing. pp. 90- 109 ,(2011) , 10.1007/978-3-642-23397-5_10
L R Ford, NETWORK FLOW THEORY ,(1956)
Erik Elmroth, Fred Gustavson, New Serial and Parallel Recursive QR Factorization Algorithms for SMP Systems parallel computing. ,vol. 1541, pp. 120- 128 ,(1998) , 10.1007/BFB0095328
William Gropp, Ewing Lusk, Anthony Skjellum, Using MPI: Portable Parallel Programming with the Message-Passing Interface ,(1994)
Amedeo R. Odoni, Richard C. Larson, Urban Operations Research ,(1981)