Application of memory reduced NUFFT to multi-dimensional non-Cartesian MRI
Jyh-Miin Lin1, Grzegorz Kowalik 2, Javier Montalt Tordera1, Benoit Sarthou3, Philippe Ciuciu3, Jennifer Steeden4, and Vivek Muthurangu4,5

1Institute of Cardiovascular Science, University College London, London, United Kingdom, 2University College London, London, United Kingdom, 3CEA/NeuroSpin & INRIA-CEA Parietal team, Gif-sur-Yvette, France, 4Children's Cardiovascular Disease, University College London, London, United Kingdom, 5Great Ormond Street Hospital, London, United Kingdom


A precomputed interpolation matrix on a GPU has been commonly used for fast iterative NUFFT MRI reconstructions, but the size of a 3D interpolation matrix may exceed the memory available on a single GPU. We propose a memory reduced interpolation method that would reduce the size of a multidimensional non-Cartesian interpolation matrix on a GPU. The memory reduced NUFFT reduces the matrix size by more than 90%, while allowing a performance of 38% - 106% of the precomputed version. We also apply the memory reduced NUFFT to a large scale 3D generalized basis pursuit denoising algorithm (GBPDNA) reconstruction.


Fully precomputed multi-dimensional non-uniform fast Fourier transform (NUFFT) on a GPU has been routinely used to accelerate iterative 2D MRI reconstructions.1,2 However, a very large 3D interpolation matrix may exceed the memory capacity of a single GPU. This issue of insufficient memory is one kind of the "curse of dimensionality" in the high-dimensional interpolation problem. Here we propose a new memory reduced interpolation on a GPU, which reuses the 1D interpolator and reduces the required memory on a GPU. We test the new interpolation method using simulated and real-world 3D trajectories. We compare the storage and run-time performance of the fully precomputed version and the memory reduced version on a 3D simulated phantom.


The standard multi-dimensional min-max interpolator of NUFFT is built from the Kronecker product of all 1D interpolators:3 V = uxuyuz, where V is the 3D interpolator, which is the Kronecker products of ux, uy, uz. It is admissible to move the Kronecker product to a GPU, since this change causes only a moderate degree of performance degradation. For the adjoint NUFFT, the data and the mesh-indices can be reused without additional storage cost and we distribute the gridding to different GPU blocks. This redistribution lowers the chance of collision of the atomicAdd(). The pseudocodes are listed in Figure 1.

We tested the primal-dual generalized basis pursuit denoising algorithm (GBPDNA) reconstruction method using the new algorithm on a Nvidia GTX 1060 6GB, an NVIDIA Quadro P6000, and an NVIDIA Pascal Quadro GP100 (NVIDIA Inc., Santa Clara, CA, USA). We also measured the performance on a system equipped with a deca-core CPU (Intel Core i9 7900-X, Intel Inc., Santa Clara, CA, USA), 128GB memory and one GPU (NVIDIA Pascal Quadro GP100 HBM2 16GB, NVIDIA Inc., Santa Clara, CA, USA). A 3D Shepp-Logan phantom was used for testing purposes.4


The memory reduced NUFFT uses less than 10% of the storage required for the fully precomputed NUFFT, and its runtimes are 94%-260% of the fully precomputed NUFFT (equivalent to the performance 38% - 106% of the precomputed NUFFT on a GPU). In Figure 2, a memory reduced forward NUFFT on GPU achieves an acceleration of 10× - 20× faster than the precomputed mode on a deca-core CPU (Figure 2(A)), while the adjoint NUFFT on GPU is 2× - 5× faster than the precomputed mode on a CPU (Figure 2(B)). The iterative GBPDNA total-variation algorithm using memory reduced NUFFT is 53% faster than the precomputed mode on the deca-core CPU, but it is18% slower than the precomputed mode on a GPU. The image quality improves with the increased number of iterations (Figure 3).

Discussion and conclusion

The memory reduced NUFFT allows for a larger number of samples to be computed on a single GPU. The advantage of precomputation lies in its ability to adopt a non-analytic interpolator, which may improve the run-time performance and accuracy in iterative MRI reconstructions. However, a fully precomputed 3D interpolation matrix can exceed the memory capacity on a single GPU. In this case, the full interpolation matrix cannot be moved to GPU for fast reconstructions. Given the reduced memory and the moderate performance degradation of the memory reduced NUFFT, we are currently developing reconstruction algorithms for high dimensional image reconstruction algorithms, such as 3D compressed sensing image reconstructions and 4D spatio-temporal reconstructions. The future improvement may involve a well designed lookup table for multidimensional interpolation, which may further reduce the storage and at the same time attain a high run-time performance beyond 3D.5


We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Quadro P6000 GPU used for this research.


1. Jyh-Miin Lin. Python non-uniform fast Fourier transform (PyNUFFT): An acceleratednon-Cartesian mri package on a heterogeneous platform (CPU/GPU).

2. Grzegorz Tomasz Kowalik, Jennifer Anne Steeden, Bejal Pandya, FreddyOdille, David Atkinson, Andrew Taylor, and Vivek Muthurangu. Real-timeflow with fast GPU reconstruction for continuous assessment of cardiac out-put. Journal of Magnetic Resonance Imaging, 36(6):1477–1482, 2012.

3. Jeffrey A Fessler and Bradley P Sutton. Nonuniform fast Fourier transforms using min-max interpolation. IEEE Transactions on Signal Process-ing, 51(2):560–574, 2003.

4. M Schabel. 3D Shepp-Logan phantom. MATLAB Central File Exchange,pages 1–35, 2006.

5. Florian Knoll, Andreas Schwarzl, Clemens Diwoky, and Daniel K Sodickson. gpuNUFFT-an open source GPU library for 3D regridding with direct Matlab interface. In International Society for Magnetic Resonance in Medicine: Scientific Meeting & Exhibition. , 2014.


Pseudocode for the memory reduced forward interpolation and adjoint interpolation

Runtimes of the memory reduced (A) forward NUFFT and (B) adjoint NUFFT

One slice of a simulated 3D phantom, and the images reconstructed on an NVIDIA GTX 1060 6GB from 5 million samples with 20 iterations and 200 iterations.

Proc. Intl. Soc. Mag. Reson. Med. 27 (2019)