
Case Study -- Vortex generation using FFT in a CFD code

As already covered elsewhere in the documentation, one major strength of the 2DECOMP&FFT library is its user-friendliness. This case study demonstrates how a real-world CFD application was parallelised with the library.

Background information

I worked for the CSE team supporting HECToR, the UK's national supercomputing facility. One day I was contacted by Dr. Shankar Balakrishnan of Southampton University regarding the use of FFT in his CFD code. He later provided the following summary:

"For direct numerical simulations of fluid flow phenomenon involving vortex rings, the distribution of vorticity within the core of the vortex ring is initialised by an analytical expression usually a Gaussian distribution. From this known vorticity field the corresponding velocity field needs to be obtained to initialise the flow. A three-dimensional Fourier-transform is performed on the vorticity field and the Fourier coefficients are obtained in 'wave-space'. The Fourier components of the velocity field are then obtained and an inverse Fourier transform is performed to obtain the velocity field in real-space.

The computational time for this initialisation process is typically small compared to the overall runtime for the simulation of the evolution of the fluid flow with time. Hence the gain in computational time does not justify the parallelisation of the code used for the initialisation process. However, the memory required for the process becomes too large for a single node as the domain size increases. For this reason the initialisation process needs to be parallelised to distribute the work over multiple nodes. Using a code that performs fast Fourier transforms in parallel, vorticity-to-velocity field transformations can be performed for computational domains of indefinitely large sizes."
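
The wave-space step mentioned above is standard for incompressible flow, although the summary does not spell it out: because the vorticity is the curl of a divergence-free velocity field, the Fourier coefficients of the velocity follow from those of the vorticity via the Fourier-space form of the Biot-Savart law,

\hat{\mathbf{u}}(\mathbf{k}) = \frac{i\,\mathbf{k}\times\hat{\boldsymbol{\omega}}(\mathbf{k})}{|\mathbf{k}|^{2}}, \qquad \mathbf{k}\neq\mathbf{0},

with the mean mode \mathbf{k}=\mathbf{0} set separately, typically to zero. This is presumably what the omitted 'wave-space calculation' in the code below evaluates.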

Without knowing many of the scientific details of his work, it was possible to parallelise his code using the 2DECOMP&FFT library in a matter of days. The essential parts of the serial and the parallel code are shown below to demonstrate how easy the library is to use.

Serial code:

program vort

implicit none

! use FFTW for the Fourier transforms
include "fftw3.f"


integer, parameter :: mytype = 8

! global size
integer, parameter :: n1=8, n2=8, n3=8

! grid origin, spacing and FFT scaling factor
! (set-up omitted in this excerpt)
real(mytype) :: xgs, ygs, zgs, dx, dy, dz, scale
integer :: i, j, k, n, iol

real(mytype), allocatable, dimension(:,:,:) :: &
ox,oy,oz, uu,vv,ww
complex(mytype), allocatable, dimension(:,:,:) :: &
Ax,Ay,Az,U,V,W

integer*8 :: plan

! allocate space for real-space vorticity field
allocate(ox(0:n1-1,0:n2-1,0:n3-1))
allocate(oy(0:n1-1,0:n2-1,0:n3-1))
allocate(oz(0:n1-1,0:n2-1,0:n3-1))

! initialise vorticity in real-space
do k=0,n3-1
do j=0,n2-1
do i=0,n1-1

! init ox, oy, oz
call vortexOmega( &
ox(i,j,k),oy(i,j,k),oz(i,j,k), &
xgs+dx*(i+0.5-1), ygs+dy*(j+0.5-1), &
zgs+dz*(k+0.5-1))
end do
end do
end do

! allocate space for wave-space variables
allocate(Ax(0:n1/2,0:n2-1,0:n3-1))
allocate(Ay(0:n1/2,0:n2-1,0:n3-1))
allocate(Az(0:n1/2,0:n2-1,0:n3-1))
allocate(U(0:n1/2,0:n2-1,0:n3-1))
allocate(V(0:n1/2,0:n2-1,0:n3-1))
allocate(W(0:n1/2,0:n2-1,0:n3-1))

! Using FFTW,
! forward FFT - ox,oy,oz -> Ax,Ay,Az
call dfftw_plan_dft_r2c_3d( &
plan,n1,n2,n3,ox,Ax,FFTW_ESTIMATE)
call dfftw_execute_dft_r2c(plan,ox,Ax)
call dfftw_plan_dft_r2c_3d( &
plan,n1,n2,n3,oy,Ay,FFTW_ESTIMATE)
call dfftw_execute_dft_r2c(plan,oy,Ay)
call dfftw_plan_dft_r2c_3d( &
plan,n1,n2,n3,oz,Az,FFTW_ESTIMATE)
call dfftw_execute_dft_r2c(plan,oz,Az)

deallocate(ox,oy,oz)

! Fourier domain stuff
do k=0,n3-1
do j=0,n2-1
do i=0,n1/2

! wave-space calculation omitted
U(i,j,k) = ......
V(i,j,k) = ......
W(i,j,k) = ......

end do
end do
end do

deallocate(Ax,Ay,Az)

! allocate real-space velocity variables
allocate(uu(0:n1-1,0:n2-1,0:n3-1))
allocate(vv(0:n1-1,0:n2-1,0:n3-1))
allocate(ww(0:n1-1,0:n2-1,0:n3-1))

! Using FFTW,
! inverse FFT U,V,W -> uu,vv,ww
call dfftw_plan_dft_c2r_3d( &
plan,n1,n2,n3,U,uu,FFTW_ESTIMATE)
call dfftw_execute_dft_c2r(plan,U,uu)
call dfftw_plan_dft_c2r_3d( &
plan,n1,n2,n3,V,vv,FFTW_ESTIMATE)
call dfftw_execute_dft_c2r(plan,V,vv)
call dfftw_plan_dft_c2r_3d( &
plan,n1,n2,n3,W,ww,FFTW_ESTIMATE)
call dfftw_execute_dft_c2r(plan,W,ww)

! rescale
uu = uu*scale
vv = vv*scale
ww = ww*scale

deallocate(U,V,W)

! write out using Fortran direct access
inquire(iolength=iol) uu(1,1,1)
open(10, FILE='data_serial', &
FORM='unformatted', &
ACCESS='DIRECT', RECL=iol)
n=1
do k=0,n3-1
do j=0,n2-1
do i=0,n1-1
write(10,rec=n) uu(i,j,k)
n=n+1
end do
end do
end do
do k=0,n3-1
do j=0,n2-1
do i=0,n1-1
write(10,rec=n) vv(i,j,k)
n=n+1
end do
end do
end do
do k=0,n3-1
do j=0,n2-1
do i=0,n1-1
write(10,rec=n) ww(i,j,k)
n=n+1
end do
end do
end do
close(10)

deallocate(uu,vv,ww)

end program vort

Parallel code:

program vort

! use the 2DECOMP&FFT instead of FFTW
use decomp_2d
use decomp_2d_fft
use decomp_2d_io
use MPI

implicit none

! global size
integer, parameter :: n1=8, n2=8, n3=8
! 2D processor grid: p_row*p_col must match the number of MPI ranks
integer, parameter :: p_row=2, p_col=2

integer (kind=MPI_OFFSET_KIND) :: filesize, disp
integer, dimension(3) :: fft_start, fft_end, fft_size

! mytype is supplied by the decomp_2d module
real(mytype), allocatable, dimension(:,:,:) :: &
ox,oy,oz, uu,vv,ww
complex(mytype), allocatable, dimension(:,:,:) :: &
Ax,Ay,Az,U,V,W

! grid origin, spacing and FFT scaling factor
! (set-up omitted in this excerpt)
real(mytype) :: xgs, ygs, zgs, dx, dy, dz, scale
integer :: i, j, k, ierror, fh

! initialisation of 2DECOMP&FFT
call MPI_INIT(ierror)
call decomp_2d_init(n1,n2,n3,p_row,p_col)

! allocate space for real-space vorticity field
allocate(ox(xstart(1)-1:xend(1)-1, &
xstart(2)-1:xend(2)-1, &
xstart(3)-1:xend(3)-1))
allocate(oy(xstart(1)-1:xend(1)-1, &
xstart(2)-1:xend(2)-1, &
xstart(3)-1:xend(3)-1))
allocate(oz(xstart(1)-1:xend(1)-1, &
xstart(2)-1:xend(2)-1, &
xstart(3)-1:xend(3)-1))

! initialise vorticity in real-space
do k=xstart(3)-1,xend(3)-1
do j=xstart(2)-1,xend(2)-1
do i=xstart(1)-1,xend(1)-1
! subroutine identical in serial
! and parallel version
call vortexOmega( &
ox(i,j,k),oy(i,j,k),oz(i,j,k), &
xgs+dx*(i+0.5-1), ygs+dy*(j+0.5-1), &
zgs+dz*(k+0.5-1))
end do
end do
end do

! initialise FFT library
call decomp_2d_fft_init
call decomp_2d_fft_get_size( &
fft_start,fft_end,fft_size)

! allocate space for wave-space variables
allocate(Ax(fft_start(1)-1:fft_end(1)-1, &
fft_start(2)-1:fft_end(2)-1, &
fft_start(3)-1:fft_end(3)-1) )
allocate(Ay(fft_start(1)-1:fft_end(1)-1, &
fft_start(2)-1:fft_end(2)-1, &
fft_start(3)-1:fft_end(3)-1) )
allocate(Az(fft_start(1)-1:fft_end(1)-1, &
fft_start(2)-1:fft_end(2)-1, &
fft_start(3)-1:fft_end(3)-1) )
allocate(U(fft_start(1)-1:fft_end(1)-1, &
fft_start(2)-1:fft_end(2)-1, &
fft_start(3)-1:fft_end(3)-1) )
allocate(V(fft_start(1)-1:fft_end(1)-1, &
fft_start(2)-1:fft_end(2)-1, &
fft_start(3)-1:fft_end(3)-1) )
allocate(W(fft_start(1)-1:fft_end(1)-1, &
fft_start(2)-1:fft_end(2)-1, &
fft_start(3)-1:fft_end(3)-1) )

! Using 2DECOMP&FFT,
! forward FFT - ox,oy,oz -> Ax,Ay,Az
call decomp_2d_fft_3d(ox, Ax)
call decomp_2d_fft_3d(oy, Ay)
call decomp_2d_fft_3d(oz, Az)

deallocate(ox,oy,oz)

! Fourier domain stuff
do k=fft_start(3)-1,fft_end(3)-1
do j=fft_start(2)-1,fft_end(2)-1
do i=fft_start(1)-1,fft_end(1)-1

! wave-space calculation omitted
U(i,j,k) = ......
V(i,j,k) = ......
W(i,j,k) = ......

end do
end do
end do

deallocate(Ax,Ay,Az)

! allocate real-space velocity variables
allocate(uu(xstart(1)-1:xend(1)-1, &
xstart(2)-1:xend(2)-1, &
xstart(3)-1:xend(3)-1))
allocate(vv(xstart(1)-1:xend(1)-1, &
xstart(2)-1:xend(2)-1, &
xstart(3)-1:xend(3)-1))
allocate(ww(xstart(1)-1:xend(1)-1, &
xstart(2)-1:xend(2)-1, &
xstart(3)-1:xend(3)-1))

! Using 2DECOMP&FFT,
! inverse FFT U,V,W -> uu,vv,ww
call decomp_2d_fft_3d(U,uu)
call decomp_2d_fft_3d(V,vv)
call decomp_2d_fft_3d(W,ww)

! rescale
uu = uu*scale
vv = vv*scale
ww = ww*scale

deallocate(U,V,W)

! write out using 2DECOMP&FFT MPI-IO routines

call MPI_FILE_OPEN(MPI_COMM_WORLD, &
'data_parallel', &
MPI_MODE_CREATE+MPI_MODE_WRONLY, &
MPI_INFO_NULL, &
fh, ierror)

! guarantee overwriting
filesize = 0_MPI_OFFSET_KIND
call MPI_FILE_SET_SIZE(fh,filesize,ierror)
disp = 0_MPI_OFFSET_KIND
call decomp_2d_write_var(fh,disp,n1,n2,n3,1,uu)
call decomp_2d_write_var(fh,disp,n1,n2,n3,1,vv)
call decomp_2d_write_var(fh,disp,n1,n2,n3,1,ww)

call MPI_FILE_CLOSE(fh,ierror)

deallocate(uu,vv,ww)

! clean up
call decomp_2d_fft_finalize
call decomp_2d_finalize
call MPI_FINALIZE(ierror)

end program vort
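
For concreteness, here is one possible shape of the omitted wave-space calculation in the parallel version - a sketch of my own under stated assumptions, not Dr Balakrishnan's actual code. It assumes a 2*pi-periodic box (so the wavenumbers are integers), introduces the local variables kx, ky, kz and k2, and relies on the fact that the loop indices already run over global positions in the r2c spectral space:

! Sketch only: u_hat = i k x omega_hat / |k|^2 evaluated on the
! local portion of wave-space; 2*pi-periodic box assumed.
! (kx, ky, kz, k2 would join the other declarations.)
integer :: kx, ky, kz
real(mytype) :: k2

do k=fft_start(3)-1,fft_end(3)-1
   kz = k
   if (k > n3/2) kz = k - n3   ! signed wavenumber in z
   do j=fft_start(2)-1,fft_end(2)-1
      ky = j
      if (j > n2/2) ky = j - n2   ! signed wavenumber in y
      do i=fft_start(1)-1,fft_end(1)-1
         kx = i   ! r2c output stores only 0..n1/2 in x
         k2 = real(kx*kx + ky*ky + kz*kz, mytype)
         if (k2 > 0.0_mytype) then
            U(i,j,k) = cmplx(0,1,mytype)*(ky*Az(i,j,k)-kz*Ay(i,j,k))/k2
            V(i,j,k) = cmplx(0,1,mytype)*(kz*Ax(i,j,k)-kx*Az(i,j,k))/k2
            W(i,j,k) = cmplx(0,1,mytype)*(kx*Ay(i,j,k)-ky*Ax(i,j,k))/k2
         else
            ! mean (k=0) mode set to zero
            U(i,j,k) = 0; V(i,j,k) = 0; W(i,j,k) = 0
         end if
      end do
   end do
end do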

As can be seen, the code structure remains unchanged in the parallel version. The main effort in the parallel code is to properly define the portion of data that resides locally on each processor (in the allocation statements and the loop bounds) - all such decomposition information is available through public variables once the library has been initialised. The code looks messier than it needs to because it was translated from C, which uses 0-based indexing. The actual FFT computations become much simpler because the use of FFTW (or any other third-party FFT library) is handled internally by 2DECOMP&FFT. It is also noticeable that very few MPI routines are called in the parallel code - the library hides as many of the communication details as possible.
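
To illustrate that last point about indexing: with Fortran's natural 1-based indices the decomposition bookkeeping becomes almost invisible. A minimal sketch of my own (using the same public variables from the library) of the real-space allocation and loop pattern:

! Natural 1-based version of the allocation/loop pattern above:
! no "-1" offsets are needed anywhere.
allocate(ox(xstart(1):xend(1), xstart(2):xend(2), xstart(3):xend(3)))

do k = xstart(3), xend(3)
   do j = xstart(2), xend(2)
      do i = xstart(1), xend(1)
         ! i, j, k are now global 1-based grid indices;
         ! loop body as before, with the index arithmetic shifted by one
      end do
   end do
end do

The library also provides helper routines such as alloc_x for exactly this kind of pencil-sized allocation in a single call.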

In the end, Dr Balakrishnan reported back that the data generated and written out by the MPI-IO routines in the parallel code was identical to that written out by the Fortran direct-access I/O in the serial code. He is now able to initialise much larger data sets for his direct numerical simulations.