Bilkent University
Department of Computer Engineering


Scaling Iterative Solvers by Avoiding Latency Overhead of Parallel Sparse Matrix Vector Multiplies


Reha Oğuz Selvitopi
PhD Student
Computer Engineering Department
Bilkent University

Iterative solvers are widely used for the solution of large sparse systems of linear equations. In parallelizing these solvers, the sparse matrix used by the solver is partitioned with intelligent partitioning algorithms. A parallel solver contains two basic types of communication operations: point-to-point (P2P) communication incurred by sparse matrix-vector multiplication (SpMxV), and collective communication incurred by reducing the results of inner product computations. These two types of communication cause separate synchronization points in each iteration of the solver. In this work, we propose a novel parallelization method that completely avoids the message latency overheads of SpMxV operations. Our method relies on the observation that, in most Krylov subspace methods, each SpMxV is followed by an inner product computation involving the output vector of the SpMxV. The proposed parallelization method consists of a computational and a communication rearrangement scheme. First, the computational rearrangement resolves the write/read dependency on the output vector of the SpMxV; this also removes the corresponding communication dependency, allowing the P2P and collective communication phases to be performed in a single communication phase. Then, the communication rearrangement embeds the P2P messages into the global collective communication operations. This approach eliminates the message latency costs of P2P communication and reduces synchronization overheads. Moreover, it yields an exact value for the number of messages communicated, which proves invaluable on modern large-scale supercomputer systems. On the downside, our approach increases the communication volume due to the store-and-forward scheme used in the collective communication operations. To mitigate this increase, we propose two mapping heuristics that reduce the extra communication volume.
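As a rough illustration of why embedding P2P messages into a collective bounds the message count (a sketch under assumed details, not the talk's actual implementation): in a recursive-doubling collective over P = 2^d processors, each processor exchanges exactly d = log2(P) messages per collective, so piggybacking P2P payloads onto those exchanges caps the per-processor message count at log2(P). The cost is that a payload routed store-and-forward over several hypercube hops is transmitted once per hop, inflating the total volume. The hypercube routing, unit message sizes, and shift-by-3 communication pattern below are hypothetical choices made only for this sketch.

```python
import math

def hypercube_route(src, dst, dim):
    """Store-and-forward path from src to dst on a dim-dimensional
    hypercube, correcting one differing address bit per step."""
    path = [src]
    cur = src
    for d in range(dim):
        if (cur ^ dst) & (1 << d):
            cur ^= (1 << d)
            path.append(cur)
    return path

P = 8                       # number of processors (power of two)
dim = int(math.log2(P))     # hypercube dimension = collective steps

# Hypothetical P2P pattern: each rank sends one unit-size message
# to rank (p + 3) % P.
# Conventional scheme: one direct message per rank, unit volume each.
volume_direct = P
# Embedded scheme: the payload rides the collective's hypercube steps,
# retransmitted on every hop of its route (store-and-forward).
volume_embedded = sum(
    len(hypercube_route(p, (p + 3) % P, dim)) - 1 for p in range(P)
)
# Messages per processor are fixed by the collective alone.
msgs_per_proc_embedded = dim

print("messages/proc:", msgs_per_proc_embedded)   # log2(P), exact
print("volume direct vs embedded:", volume_direct, volume_embedded)
```

The simulation shows the trade-off stated in the abstract: the per-processor message count drops to the exact value log2(P) regardless of the P2P pattern, while the communicated volume grows with the hop counts of the forwarded payloads, which is precisely what the proposed mapping heuristics aim to reduce.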
Our experiments on two supercomputers, a Cray XE6 and a BlueGene/Q machine, on up to 2048 processors and with several matrices from different domains show that our parallelization method scales significantly better than the conventional parallelization method. The results indicate that the key to scaling parallel sparse iterative solvers is reducing message latency overheads.


DATE: 04 November 2013, Monday @ 16:15