Library for 2D pencil decomposition and distributed Fast Fourier Transform
When using 2DECOMP&FFT, application users are free to choose the 2D processor grid P_row*P_col. Depending on the hardware, in particular the network layout, some processor grids deliver much better performance than others. Application users are strongly advised to benchmark the available options before running large simulations.
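As a starting point for such a benchmark, the candidate grids for a given number of MPI ranks can be enumerated as factor pairs. The following is a minimal sketch, not part of the library: the module and routine names (grid_candidates, list_grids) are hypothetical, and the mesh-size check assumes the constraint P_row <= min(nx, ny) and P_col <= min(ny, nz).

```fortran
module grid_candidates
  implicit none
contains
  ! Collect every (p_row, p_col) pair with p_row*p_col == nproc.
  ! Assumption: a grid is usable only if p_row <= min(nx, ny)
  ! and p_col <= min(ny, nz).
  subroutine list_grids(nproc, nx, ny, nz, p_rows, p_cols, n)
    integer, intent(in) :: nproc, nx, ny, nz
    integer, allocatable, intent(out) :: p_rows(:), p_cols(:)
    integer, intent(out) :: n
    integer :: p_row, p_col

    ! nproc is a safe upper bound on the number of factor pairs
    allocate(p_rows(nproc), p_cols(nproc))
    n = 0
    do p_row = 1, nproc
      if (mod(nproc, p_row) /= 0) cycle       ! not a factor of nproc
      p_col = nproc / p_row
      if (p_row > min(nx, ny)) cycle          ! too many rows for the mesh
      if (p_col > min(ny, nz)) cycle          ! too many columns for the mesh
      n = n + 1
      p_rows(n) = p_row
      p_cols(n) = p_col
    end do
  end subroutine list_grids
end module grid_candidates
```

For 256 ranks and a 2048^3 mesh this yields nine candidates, from 1*256 to 256*1; each one can then be passed to decomp_2d_init in a short timing run.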
The figure below shows the performance of a test application using 256 MPI ranks on the old HECToR (Cray XT4 with quad-core nodes and SeaStar interconnect). In this particular test, P_row<<P_col gives the best performance. There are several technical reasons for this behaviour. First, the hardware was equipped with quad-core processors (4-way SMP). When P_row is smaller than or equal to 4, half of the MPI_ALLTOALLV communications are done entirely within physical nodes, which can be very fast (at the MPI library level the communication could be implemented as local memory copying). Second, as the communication library handles ijk-ordered arrays, a small P_row (and therefore a larger nx/P_row, the inner-most loop count of 3D loops) tends to offer better cache efficiency for the computational parts of the code.
Performance as a function of the processor grid.
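The cache-efficiency argument can be made concrete by computing the local pencil shapes for a given grid. This is a minimal sketch under stated assumptions: the names (pencil_shapes, pencil_sizes, xsz, ysz, zsz) are hypothetical, every mesh dimension is assumed to divide evenly, and the layout is assumed to be nx × ny/P_row × nz/P_col for x-pencils, nx/P_row × ny × nz/P_col for y-pencils and nx/P_row × ny/P_col × nz for z-pencils.

```fortran
module pencil_shapes
  implicit none
contains
  ! Local array extents (i, j, k) of the x-, y- and z-pencils.
  ! Assumes nx, ny and nz are exact multiples of p_row / p_col as needed.
  subroutine pencil_sizes(nx, ny, nz, p_row, p_col, xsz, ysz, zsz)
    integer, intent(in)  :: nx, ny, nz, p_row, p_col
    integer, intent(out) :: xsz(3), ysz(3), zsz(3)

    xsz = [nx,         ny / p_row, nz / p_col]   ! x-pencil: full extent in x
    ysz = [nx / p_row, ny,         nz / p_col]   ! y-pencil: full extent in y
    zsz = [nx / p_row, ny / p_col, nz        ]   ! z-pencil: full extent in z
  end subroutine pencil_sizes
end module pencil_shapes
```

For a 256^3 mesh on 256 ranks, a 4*64 grid gives y- and z-pencils with 64 points along the inner-most (i) index, whereas a 64*4 grid leaves only 4, which makes the inner loops much shorter.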
Please note that the behaviour reported above is by no means representative. In fact, application performance also depends strongly on several other factors: the time-varying system workload, the size and shape of the global mesh, and of course the hardware. For example, on Cray XE6 systems, the benefit of doing a lot of intra-node communication is seen to be much smaller, possibly because of the much improved Gemini interconnect (faster inter-node communication) and the fairly complex NUMA arrangement within each 24-core node (relatively less efficient intra-node communication).
To help application users address this issue to a certain degree, an auto-tuning algorithm is included in the library and can be switched on as follows:
call decomp_2d_init(nx, ny, nz, 0, 0)
When initialising the 2D decomposition library, if the processor grid is specified as 0*0, an auto-tuning algorithm is used to determine the best processor grid at runtime. This algorithm, however, takes into account only the communication costs. Computation-intensive codes might benefit more from the cache-efficiency factor, so application users are ultimately responsible for selecting the best processor grid.