Library for 2D pencil decomposition and distributed Fast Fourier Transform
For transpose-based algorithms, if a global data set cannot be evenly distributed among processors, the data redistribution requires MPI_ALLTOALLV communication. However, one has the option to pad the smaller messages with extra bytes so that MPI_ALLTOALL can be used instead. This technique is referred to in some literature as the padded alltoall optimisation. It is particularly useful for small messages (for example, those below the default threshold of 2K bytes on some Cray XT/XE systems), for which a so-called store-and-forward algorithm can be used. Such an algorithm trades increased bandwidth for reduced latency, hopefully improving the overall communication performance.
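The padding idea can be illustrated without MPI at all. The sketch below (plain Python, hypothetical helper names; not part of the 2DECOMP&FFT API) shows the core bookkeeping: each per-rank message is padded to the length of the longest one so that every exchanged message has a fixed size, as MPI_ALLTOALL requires, and the receiver discards the padding using the known true counts.

```python
# Hypothetical sketch: pad uneven per-rank messages to a uniform
# size so a fixed-message-size exchange (MPI_ALLTOALL style) can
# replace the variable-size MPI_ALLTOALLV.

def pad_messages(messages, fill=0.0):
    """Pad each per-rank message to the length of the longest one."""
    max_len = max(len(m) for m in messages)
    padded = [m + [fill] * (max_len - len(m)) for m in messages]
    return padded, max_len

def strip_padding(padded, true_lengths):
    """Receiver side: drop the padding using the known true counts."""
    return [m[:n] for m, n in zip(padded, true_lengths)]

# Example: one rank holds three outgoing messages of uneven length,
# as happens when a global array does not divide evenly among ranks.
msgs = [[1.0, 2.0, 3.0], [4.0], [5.0, 6.0]]
padded, n = pad_messages(msgs)
# Every message now has length n, so a fixed-size alltoall works.
recovered = strip_padding(padded, [len(m) for m in msgs])
assert recovered == msgs
```

The extra bytes sent are the bandwidth cost; the saving is that the equal-size exchange can use latency-optimised alltoall algorithms for small messages.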
For an in-depth discussion of this optimisation technique, the best reference is probably the comments in the MPICH2 source code [1], which describe in detail a number of communication algorithms used in MPICH2's alltoall and alltoallv implementations.
To turn on this feature, append the -DEVEN flag to the OPTIONS variable in src/Makefile.inc before compiling the 2DECOMP&FFT library.
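As an illustration, the resulting line in src/Makefile.inc might look like the following (the other flags shown are placeholders for whatever the existing build already uses; only -DEVEN is the addition described here):

```makefile
# src/Makefile.inc -- hypothetical example: keep the existing flags
# and append -DEVEN to enable the padded alltoall code path.
OPTIONS = -DDOUBLE_PREC -DEVEN
```

Rebuild the library after changing this variable so the preprocessor flag takes effect.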
In practice, this optimisation is unlikely to work well for CFD-type applications with large distributed arrays, but it can be highly beneficial for Molecular Dynamics-type applications [2] where, for example, a large number of small distributed FFTs are required.
1. Check the comments in source files alltoall.c and alltoallv.c.
2. One successful application of this technique is reported at http://www.hector.ac.uk/cse/distributedcse/reports/gww.