# Reduce GPU memory usage

## About

Reduce GPU memory usage for the pipeline, and update the required memory estimations.

## To do

- Update `estimate_required_memory`
- Update `estimate_max_chunk_size`
- Update `CudaVerticalShifts` (close #414)
  - Implement in-place mul-add with x-y subregion
    - `MulAdd`
    - `CudaMulAdd`
    - Unit test
  - Use it in `CudaVerticalShifts`
- Reduce memory usage for padding
  - Avoid creating huge indices arrays for `CudaPadding` and `OpenCLPadding`
- Reduce memory usage for FBP
- End-to-end reconstruction test

## Notes

### Vertical shifts CUDA implementation

`CudaVerticalShifts` performed many updates of the form `self._d_radio_tmp[:-s0] += radio[s0:] * f`, where the right-hand side allocates a new temporary array at each iteration. To avoid this, a new "mul-add" kernel was added.

### Padding CUDA/OpenCL implementation

`XXPadding` provides generic padding through a coordinate transform, but it currently allocates two 2D images (`coords_rows` and `coords_cols`) for this transform. Only one 1D array is needed for each:
```python
import numpy as np
from nabu.processing.padding_base import PaddingBase

# "constant" mode excluded: the check below concerns coordinate-transform modes only
for mode in set(PaddingBase.supported_modes) - set(["constant"]):
    pad = PaddingBase((12, 13), ((5, 6), (7, 8)), mode=mode)
    assert np.max(np.std(pad.coords_cols, axis=0)) == 0
    assert np.max(np.std(pad.coords_rows, axis=1)) == 0
```
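As a sketch of the proposed reduction (the `rows_1d`/`cols_1d` names are illustrative, not an existing nabu API): both 2D coordinate images can be rebuilt from one 1D array per axis, so only the 1D arrays need to live on the GPU:

```python
import numpy as np
from nabu.processing.padding_base import PaddingBase

# Pick any non-constant mode supported by PaddingBase
mode = next(iter(set(PaddingBase.supported_modes) - set(["constant"])))
pad = PaddingBase((12, 13), ((5, 6), (7, 8)), mode=mode)

# One 1D array per axis carries the same information as the two 2D images
rows_1d = pad.coords_rows[:, 0]
cols_1d = pad.coords_cols[0, :]
assert np.array_equal(pad.coords_rows, np.broadcast_to(rows_1d[:, None], pad.coords_rows.shape))
assert np.array_equal(pad.coords_cols, np.broadcast_to(cols_1d[None, :], pad.coords_cols.shape))

# Memory reduction factor for the coordinate arrays
print((pad.coords_rows.nbytes + pad.coords_cols.nbytes) / (rows_1d.nbytes + cols_1d.nbytes))
```

A padding kernel can then index the input with `rows_1d[i]` and `cols_1d[j]` instead of reading two full 2D index images.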

### FBP

Given a sinogram of shape `(n_a, n_x)`, the memory footprint for FBP is `n_x * (5*n_a + n_x)` elements, assuming R2C transforms for filtering and no FFT plans stored:
- `sino`: `(n_a, n_x)`
- `sino_padded`: `(n_a, 2*n_x)` (in the best case! usually `next_power(2*n) > 2*n`)
- `sino_padded_fourier`: `(n_a, 2*n_x//2 + 1)` complex values
- `reco`: `(n_x, n_x)`
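As a quick sanity check of this count (a sketch; `fbp_footprint_bytes` is not an existing nabu helper), assuming float32 data and the best-case padded width of exactly `2*n_x`:

```python
def fbp_footprint_bytes(n_a, n_x, itemsize=4):
    """Best-case FBP footprint in bytes: R2C filtering, no FFT plans,
    padded width exactly 2*n_x (hypothetical helper)."""
    sino = n_a * n_x
    sino_padded = n_a * 2 * n_x
    sino_padded_fourier = n_a * (2 * n_x // 2 + 1) * 2  # complex = 2 scalars per value
    reco = n_x * n_x
    return (sino + sino_padded + sino_padded_fourier + reco) * itemsize

# ~ n_x * (5*n_a + n_x) * 4 bytes; about 15 GB for (n_a, n_x) = (43200, 16384),
# a best-case lower bound compared to the figures quoted below
print(fbp_footprint_bytes(43200, 16384) / 1e9)
```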
This memory usage could be reduced to the minimum (sinogram + reconstructed slice) if the filtering is done in-place. This would have two drawbacks:
- The user "loses" the input sinogram, though this is probably fine in most cases
- Filtering is not as efficient: a batched 1D FFT becomes a series of 1D FFTs, with the overhead of Python loops. A compromise could be to process batches of hundreds of lines, as sketched below.
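A minimal CPU/NumPy sketch of that compromise (the function name, batch size and the plain `|f|` ramp are illustrative, not nabu's actual filter):

```python
import numpy as np

def filter_sino_inplace_batched(sino, batch_size=256):
    """Filter a sinogram in place, a few hundred lines at a time, so that only
    one small padded batch exists at any time instead of a full padded copy."""
    n_a, n_x = sino.shape
    n_pad = 2 * n_x  # best case, see above
    ramp = np.abs(np.fft.rfftfreq(n_pad))  # simplistic ramp filter
    for start in range(0, n_a, batch_size):
        chunk = sino[start:start + batch_size]
        chunk_padded = np.zeros((chunk.shape[0], n_pad), dtype="f")
        chunk_padded[:, :n_x] = chunk
        filtered = np.fft.irfft(np.fft.rfft(chunk_padded, axis=1) * ramp, n=n_pad, axis=1)
        chunk[:] = filtered[:, :n_x]  # overwrite the input lines
    return sino
```

The same loop structure applies on the GPU, where the Python-loop overhead is amortized over batches of hundreds of lines.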
With these modifications, a test (using the VKFFT backend) on a sinogram of shape `(n_a, n_x) = (43200, 16384)` will use 21 GB. On the other hand, `5 * sino.nbytes/1e9 + rec_big.nbytes/1e9` gives 18 GB.