Library for 2D pencil decomposition and distributed Fast Fourier Transform |
HECToR is the UK's front-line national supercomputing service. The system started its phase 1 service in 2007 - a 60 Tera FLOPs Cray XT4 with 5660 AMD Opteron dual-core processors. In summer 2009 it was upgraded to phase 2a - a 208 TFLOPs XT4 with 5664 Opteron quad-core processors. In late 2010, it was upgraded again to phase 2b - using 1856 24-core nodes and the new Gemini interconnect, making it world's first production Cray XE6 system (374 TFLOPs).
Small-scale Tests
The performance of a distributed FFT library is determined by both the computation efficiency of the underlying 1D FFT algorithm and the communication efficiency of the data transposition algorithm. The following table shows the speed-up this library can achieve over the serial runs using FFTW's 3D FFT interface. The times reported (on HECToR phase 2a hardware) are for forward c2c transforms and all the transforms were planned using FFTW_ESTIMATE.
Data size N^{3} | Serial FFTW | Parallel 2DECOMP&FFT | |||
Planning | Execution | 16 cores | 64 cores | 256 cores | |
64^{3} | 0.359 | 0.00509 | 0.00222 | - | - |
128^{3} | 1.98 | 0.0525 | 0.0223 | 0.00576 | 0.00397 |
256^{3} | 8.03 | 0.551 | 0.179 | 0.0505 | 0.0138 |
512^{3} | 37.5 | 5.38 | 1.74 | 0.536 | 0.249 |
1024^{3} | - | - | - | 4.59 | 1.27 |
2048^{3} | - | - | - | - | 17.9 |
It can be seen that due to the communication cost, the absolute speed-up over the serial library is not great (only about 20-40 times on 256 cores). However, the parallel library does allow much larger problems to be computed quite efficiently. In particular, for smaller core count (16 and 64), each time the problem size is increased by 8 times, the computing time increases by 8-10 times, following the trend of the underlying serial library very well.
2DECOMP&FFT Scaling on Phase 2a System
This set of benchmarks was performed in March 2010 on the HECToR phase 2a system. A small number of runs were also performed on Jaguar^{1} - the world No. 1 system at that time - for reference.
Large-scale parallel benchmarks of the FFT interface were performed, using problem size up to 8192^3. The results presented are the time spent to compute a pair of forward and backward transforms on random signals. Both c2c and r2c/c2r transforms were tested. The underlying FFT engine is the ACML FFT (version 4.3). In all cases, the original signals were recovered to machine accuracy after the backward transforms - a good validation for the library. Up to 16384 cores were used on HECToR and each case was repeated 3 times and the fastest results were recorded. On Jaguar, a few very large tests were arranged using up to 131072 cores. Note that the runtime performance does vary a lot for such communication intensive applications, particularly on busy systems.
It can be seen that the FFT interface scales almost perfectly on HECToR for all the tests. As expected, r2c/c2r transforms are twice as fast as c2c transforms. On Jaguar, the scaling is less good for larger core counts but the efficiency is still at a respectable 81% for the largest test. For a particular configuration - 4096^3 mesh on 16384 cores - the time spent on Jaguar is almost twice of that on HECToR. This is not unexpected. Jaguar had two 6-core chips. In order to achieve better load balance, the problem sizes need to have a factor of 6 which was not the case in these tests. Also the problem size 8192^3 , while quite large for real-world applications, is indeed too small when distributing over 10^5 cores.
Scaling of the Share-memory Implementation
Please refer to this page.
Stretch HECToR to Its Limits
In April 2011, one of my collaborators in China (who is lucky enough to have access to the whole Tianhe-1A system there for his research) asked me to perform some reference FFT computations on really large problem sizes, such as 12288*12288*12288 and 14336*14336*7168.
The largest tests done before was on problem size 8192^3, but on either much larger systems (such as Jaguar) or on systems with lots of memory. Memory per core on HECToR has decreased from the initial 3 GB to only 1.33 GB with the arrival of 24-core nodes, making it much more difficult to run larger tests.
The library code had to be optimised first to minimise the memory footprint. The ACML implementation of the library was optimised by using inplace transforms wherever possible. A software option was also introduced to allow the FFT input to be overwritten.^{2} In order to increase the problem size further, the 24-core nodes can be under-populated - by using only 16 cores per node, each core will have access to about 50% more memory.
Problem Size | Fully-packed Node | Under-populated Node |
4096*4096*4096 | 7.27 | |
8192*4096*4096 | 15.82 | 12.81 |
8192*8192*4096 | 34.10 | 27.20 |
8192*8192*8192 | 70.76 | 56.39 |
12288*8192*8192 | out of memory | 82.76 |
The table summarises all the test cases done using 16384 cores. For under-populated cases, 24576 cores (1024 nodes, the largest-possible HECToR job) had to be reserved. The figures reported are number of seconds to perform a pair (forward+backward) of single-precision complex-to-complex FFTs. As shown, the largest problem size achieved is 12288*8192*8192. The scaling of the library is very good - each time the problem size is doubled, the time required is only slightly more than doubled. Also shown is that when running in under-populated mode, the code is consistently 20% faster.
Footnotes
1. This research used resources of the National Center for Computational Sciences at Oak Ridge National Laboratory, which is supported by the Office of Science of the Department of Energy under Contract DE-AC05-00OR22725.
2. The FFT library itself is not supporting inplace transforms at the moment (the input and output point to separate memory addresses). But the sub-steps (1D FFTs) can be implemented using inplace transforms provided by the FFT engines. Allowing the input to be overwritten makes it possible to reuse the memory as scratch space.