HPC applications make intensive use of large sparse matrices, with the matrix-vector product representing a significant fraction of the total run-time. These matrices are characterized by non-uniform structures and irregular memory accesses, which make it difficult to achieve good scalability on modern HPC platforms featuring multi- or many-core processors, SIMD units, and high-speed communication networks. One cause of this poor scalability is communication overhead arising from imbalance in both message synchronization and message size.
In this work we analyze this load imbalance in the sparse matrix-vector product (SpMV) when running on a multi-node cluster with a high-speed interconnection network. The experimental alternatives for reducing communication load imbalance are evaluated under two programming models (MPI+fork-join and MPI+task-based parallelism) using optimizations such as computation-communication overlap and parallel message sending. For large matrices, the performance improvement can be up to 9%.