P3DFFT at XSEDE

P3DFFT is a highly scalable numerical library for programs running on parallel computers. P3DFFT implements Fourier Transforms in three dimensions as well as related algorithms, in highly efficient manner. It is typically used in simulations requiring large number of cores, in fields ranging from turbulence to astrophysics, oceanography, material science and molecular dynamics. It has a simple interface that makes it easy to incorporate in any C or Fortran program.

P3DFFT has been tested and used on many high-end computational systems. It uses two-dimensional domain decomposition in order to overcome a scaling bottleneck of one-dimensional decomposition. This allows programs compiled with the library to scale well to a large number of cores, consistent with bisection bandwidth scaling of the interconnect of the underlying hardware system.

P3DFFT v. 2.7.1 features include:

  • Real-to-complex and complex-to-real Fast Fourier Transforms (FFT) in 3D.
  • Cosine, sine, and combined Fourier-Chebyshev transform (FFT in 2D and Chebyshev in the third dimension). Alternatively users can substitute their own transform in the third dimension, for example a compact scheme.
  • Fortran and C interfaces
  • Built for performance and scalability
  • In-place and out-of-place transforms
  • Pruned input/output transforms
  • Multivariable transforms

P3DFFT is available as an open source package here: http://code.google.com/p/p3dfft.

XSEDE and P3DFFT

P3DFFT is installed on SDSC's Gordon and TACC's Stampede HPC systems.

Easy-to-follow instructions are available if a custom installation is needed.

Because the P3DFFT library is dependent upon the FFTW (version 3.0 and higher) and MPI libraries, you must load those modules prior to loading the P3DFFT module.

login1$ module load fftw 
login1$ module load mpi 
login1$ module load p3dfft 

This module defines an environment variable defining the P3DFFT Home Directory. For example, on Gordon this is P3DFFTHOME variable. The Home Directory contains /include and /lib subdirectories.

If you wish to use P3DFFT with your program, compile by adding the following options to your compile/link command:

-I($P3DFFTHOME)/include -L($P3DFFTHOME)/lib -lp3dfft

The above will link with the default version of P3DFFT, which is double precision and with STRIDE1 flag turned on (this flag allows for contiguous stride-1 data access on both input and output). If single precision and/or noncontiguous versions are desired, please link with "-lp3dfft-single", "-lp3dfft-double-noncontiguous" or "-lp3dfft-single-noncontiguous" respectively instead of the last argument in the string above.

The P3DFFT interface is defined at length in the User Guide.

Some brief instructions:

  • the user must initiate P3DFFT by defining the dimensions of the 3D grid. Below are the Fortran and C prototypes:

    call p3dfft_setup(proc_dims,nx,ny,nz,MPI_COMM_WORLD)
    Cp3dfft_setup(dims,nx,ny,nz,MPI_Comm_c2f(MPI_COMM_WORLD));
  • After this, one can obtain local array dimensions on input and output by calling get_dims. Based on this information arrays can be allocated, initialized as needed. Then you are ready to call forward or backward 3D FFT routines (p3dfft_ftran_r2c and p3dfft_btran_c2r). Presently only forward real-to-complex and backward (inverse) complex-to-real FFT transforms are available, as well as sine/cosine/Chebyshev/user defined variants in the third dimension.

Example programs are provided in the distribution (can be checked out from SVN repository as described at http://code.google.com/p/p3dfft/source/checkout or browsed online). Executables of the test programs are installed in $P3DFFTHOME/share. These programs provide examples of various uses of P3DFFT, including working with array data structures, using forward, backward 3DFFT transforms, Chebyshev transforms. These programs can also be used to time the FFT routines, for example to benchmark them on various compute systems.

Multivariable transforms are available in version 2.7.1, through routines p3dfft_ftran_r2c_many and p3dfft_btran_c2r_many. These transforms can be used when there are several independent variables to be transformed (such as velocity components etc).

Pruned input/output transforms are available for cases when a portion of output is not needed in subsequent calculations, and/or a portion of input is a series of zeros. An example of this is 2/3 rule to implement dealiasing in DNS turbulence simulations.

P3DFFT Tips & Tricks

  • P3DFFT uses 2D domain decomposition in order to achieve scalability at high number of core counts. In practice it operates on a virtual Cartesian 2D grid in processor space, defined by two dimensions "P1" and "P2" (such that P1 x P2 = P, the total number of cores). It is up to the user to define P1 and P2. It is worth experimenting with the choice of P1/P2 (the aspect ratio of the grid) to achieve optimal performance. As a rough guideline, consider setting P1 to be equal or less than the number of cores on a node of the system the software is running on. Doing so will result in one of the two all-to-all exchanges proceeding entirely within each node and will therefore cut in half the total volume of the expensive internode exchanges and save communication time.

  • P3DFFT relies on MPI_alltoall routine for the bulk of inter-processor communication. Therefore its performance is determined by two main factors: the efficiency of implementation of MPI_Alltoall in the underlying MPI implementation (system software factor), and the available bisection bandwidth of the underlying interconnect fabric (system hardware factor). The latter factor has to do with topology of the network. On systems preserving full bisection bandwidth (such as fat trees) in principle it is possible to obtain almost perfect scaling on a large number of core counts (this assumes ideal conditions such as contention-free networks). On other systems (such as tori networks) bisection bandwidth is not scaling linearly with the system size (for example it scales as P2/3 in the case of 3D torus). Therefore scaling of P3DFFT even in the ideal case will be subject to this limitation.

References

Last update: May 7, 2015