Reduce GPU memory usage

Pierre Paleo requested to merge reduce_memory_usage into master

About

Reduce GPU memory usage for the pipeline, and update the required memory estimations.

To do

  • Update estimate_required_memory
  • Update estimate_max_chunk_size
  • Update CudaVerticalShifts (closes #414)
    • Implement in-place mul-add with x-y subregion
      • MulAdd
      • CudaMulAdd
      • Unit test
    • Use it in CudaVerticalShifts
  • Reduce memory usage for padding
    • Avoid creating huge index arrays for CudaPadding and OpenCLPadding
  • Reduce memory usage for FBP
  • End-to-end reconstruction test

Notes

Vertical shifts Cuda implementation

CudaVerticalShifts performed many operations of the form self._d_radio_tmp[:-s0] += radio[s0:] * f, where the right-hand side allocates a new temporary array at each iteration. To avoid this, a new "mul-add" kernel was added.
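
A minimal sketch of such a kernel with pycuda is given below. It is not nabu's actual MulAdd / CudaMulAdd API: the function cuda_mul_add, the kernel name mul_add_subregion, and the offsets/shape parameters are hypothetical, and only illustrate an in-place mul-add on an x-y sub-region.

import numpy as np
import pycuda.autoinit  # noqa: F401 (initializes a Cuda context)
from pycuda.compiler import SourceModule

# dst[y0d + y, x0d + x] += fac * src[y0s + y, x0s + x] over a (ny, nx) sub-region,
# computed in one pass, without any intermediate array
_mul_add_src = """
__global__ void mul_add_subregion(
    float* dst, float* src, float fac,
    int dst_width, int src_width,
    int x0_dst, int y0_dst, int x0_src, int y0_src,
    int nx, int ny)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= nx || y >= ny) return;
    dst[(y0_dst + y) * dst_width + x0_dst + x] += fac * src[(y0_src + y) * src_width + x0_src + x];
}
"""
_mul_add = SourceModule(_mul_add_src).get_function("mul_add_subregion")

def cuda_mul_add(d_dst, d_src, fac, dst_offsets=(0, 0), src_offsets=(0, 0), shape=None):
    # In-place d_dst[sub-region] += fac * d_src[sub-region] on 2D float32 GPU arrays
    ny, nx = shape if shape is not None else d_dst.shape
    block = (16, 16, 1)
    grid = ((nx + block[0] - 1) // block[0], (ny + block[1] - 1) // block[1], 1)
    _mul_add(
        d_dst.gpudata, d_src.gpudata, np.float32(fac),
        np.int32(d_dst.shape[1]), np.int32(d_src.shape[1]),
        np.int32(dst_offsets[1]), np.int32(dst_offsets[0]),
        np.int32(src_offsets[1]), np.int32(src_offsets[0]),
        np.int32(nx), np.int32(ny),
        block=block, grid=grid,
    )

# Roughly equivalent to  self._d_radio_tmp[:-s0] += radio[s0:] * f :
#   cuda_mul_add(d_radio_tmp, d_radio, f, src_offsets=(s0, 0), shape=(n_y - s0, n_x))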

Padding Cuda/OpenCL implementation

XXPadding provides generic padding through a coordinate transform, but it currently allocates two 2D images (coords_rows and coords_cols) for this transform. Only a 1D array is needed for each, since each coordinates array is constant along one axis:

import numpy as np
from nabu.processing.padding_base import PaddingBase

# Check that each 2D coordinates array is constant along one axis,
# i.e. a 1D array carries the same information
for mode in set(PaddingBase.supported_modes) - set(["constant"]):
    pad = PaddingBase((12, 13), ((5, 6), (7, 8)), mode=mode)
    assert np.max(np.std(pad.coords_cols, axis=0)) == 0
    assert np.max(np.std(pad.coords_rows, axis=1)) == 0
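
As an illustration of why 1D arrays suffice, generic padding via a coordinate transform can be done by padding the 1D row/column indices and fancy-indexing the image. This is a host-side numpy sketch of the idea, not nabu's GPU implementation, and it only covers the separable modes listed in the loop:

import numpy as np

n_y, n_x = 12, 13
pad_width = ((5, 6), (7, 8))
img = np.random.rand(n_y, n_x).astype("f")

for mode in ("edge", "reflect", "symmetric", "wrap"):
    # Pad the 1D index arrays instead of building two 2D coordinate images
    rows = np.pad(np.arange(n_y), pad_width[0], mode=mode)  # shape (n_y + 5 + 6,)
    cols = np.pad(np.arange(n_x), pad_width[1], mode=mode)  # shape (n_x + 7 + 8,)
    padded = img[np.ix_(rows, cols)]
    assert np.allclose(padded, np.pad(img, pad_width, mode=mode))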

FBP

Given a sinogram of shape (n_a, n_x), the memory footprint of FBP is roughly n_x * (5*n_a + n_x) elements (multiply by the item size to get bytes), assuming R2C transforms for filtering and no FFT plans stored:

  • sino: (n_a, n_x)
  • sino_padded: (n_a, 2*n_x) (in the best case! usually next_power(2*n_x) > 2*n_x)
  • sino_padded_fourier: (n_a, 2*n_x//2 + 1) complex values
  • reco: (n_x, n_x)
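
As a rough cross-check of this breakdown, the footprint can be computed with a small helper (a sketch; fbp_memory_footprint is not an actual nabu function):

import numpy as np

def fbp_memory_footprint(n_a, n_x, dtype=np.float32):
    # Ideal-case FBP footprint in bytes: padded width exactly 2*n_x,
    # R2C filtering, no FFT plans stored
    itemsize = np.dtype(dtype).itemsize
    n_elements = (
        n_a * n_x                       # sino
        + n_a * 2 * n_x                 # sino_padded
        + 2 * n_a * (2 * n_x // 2 + 1)  # sino_padded_fourier (complex = 2 floats)
        + n_x * n_x                     # reco
    )
    return n_elements * itemsize

# e.g. fbp_memory_footprint(43200, 16384) / 1e9  ->  about 15.2 GB in this idealized breakdown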

This memory usage could be reduced to the minimum (sinogram + reconstructed slice) if the filtering is done in-place. This would have two drawbacks:

  • The user "loses" the input sinogram, though this is probably acceptable in most cases
  • Filtering is less efficient: a batched 1D FFT becomes a series of 1D FFTs, with the overhead of Python loops. A compromise could be to process batches of a few hundred lines at a time (see the sketch after this list).
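
A minimal host-side sketch of this batched, in-place filtering, using numpy's rfft/irfft in place of the actual VKFFT-based filtering (the function filter_sino_inplace and its batch_size parameter are illustrative, not nabu's API):

import numpy as np

def filter_sino_inplace(sino, fourier_filter, batch_size=200):
    # Filter the sinogram in-place, by batches of lines, so that only
    # batch-sized temporaries are allocated instead of a full padded copy
    n_a, n_x = sino.shape
    for start in range(0, n_a, batch_size):
        batch = sino[start:start + batch_size]
        # Zero-pad each line to 2*n_x, filter in Fourier domain (R2C), crop back
        padded = np.pad(batch, ((0, 0), (0, n_x)))
        sino_f = np.fft.rfft(padded, axis=1)
        sino_f *= fourier_filter  # shape (n_x + 1,), e.g. a ramp filter np.abs(np.fft.rfftfreq(2 * n_x))
        batch[:] = np.fft.irfft(sino_f, n=2 * n_x, axis=1)[:, :n_x]
    return sino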

With these modifications, a test (using the VKFFT backend) on a sinogram of shape (n_a, n_x) = (43200, 16384) uses 21 GB. On the other hand, 5 * sino.nbytes/1e9 + rec_big.nbytes/1e9 gives 18 GB.
