Performance issue when subsampling projections
Need for data subsampling
For a quick assessment of a reconstruction, it would be useful to subsample projections (e.g. read one projection out of 10). Recent example: a POA dataset with 24k projections, each 16k pixels wide.
The problem
Unfortunately, the HDF5 library seems to suffer from performance issues when subsampling.
Works:
- slicing vertically (`[:, :100, :]`), for both virtual and non-virtual datasets, if `chunk=None`
Does not work (awfully slow):
- subsampling (`[::10, :, :]`), for both virtual and non-virtual datasets, when the chunking is image-wise. Even trivial subsampling (along the chunking axis) is slow! (A quick way to check a dataset's layout is sketched below.)
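As a first diagnostic, the layout of a given dataset can be inspected from h5py. A minimal sketch, using a hypothetical file and dataset path:

```python
import h5py

# Minimal sketch (hypothetical file/dataset paths): check whether a dataset is
# virtual and how it is chunked, both of which affect strided-read performance.
with h5py.File("scan.h5", "r") as f:               # hypothetical file name
    dset = f["/entry_0000/data"]                   # hypothetical dataset path
    print("virtual:", dset.is_virtual)             # True for HDF5 virtual datasets
    print("chunks :", dset.chunks)                 # None means contiguous storage
    print("shape  :", dset.shape, "dtype:", dset.dtype)
```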
I initially thought the culprit was virtual datasets (see https://github.com/h5py/h5py/issues/1597 and https://github.com/h5py/h5py/issues/2155)
Consider this piece of code:
```python
import h5py
from time import time

# Strided read: one projection out of 10, from a virtual dataset with image-wise chunks
t0 = time()
with h5py.File("/data/scisofttmp/tomo_datasets/rings_1/S59_68p41b_16mm_6p5mm_F8_0001_1_1.nx", "r") as f:
    d = f["entry0000/instrument/detector/data"][slice(151, 7751, 10)]
print(time() - t0)
```
- It takes 600 seconds on scisoft15, i.e. 14 MB/s (!).
- If `slice(151, 7751)` is used (no subsampling), the above takes 57 seconds, i.e. 1.4 GB/s.
- If a non-virtual dataset is used, the above piece of code takes 5 seconds (instead of 600).
Possible solutions
- Read the whole data range and subsample in-memory with numpy. In this case we gain nothing in I/O nor in memory! (See the first sketch after this list.)
- Only accept subsampling when dealing with non-virtual datasets.
- Try to bypass the virtual layer and go straight to the actual data sources. This entails cumbersome code on our side. (See the second sketch after this list.)
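A minimal sketch of the first option, reusing the dataset from the snippet above: read the contiguous range in one go (which is fast), then subsample in memory. Nothing is gained in I/O volume or peak memory.

```python
import h5py
import numpy as np

# Sketch of the "read everything, subsample with numpy" workaround: the
# contiguous read stays fast, but the I/O volume and peak memory are the same
# as reading the full range.
with h5py.File("/data/scisofttmp/tomo_datasets/rings_1/S59_68p41b_16mm_6p5mm_F8_0001_1_1.nx", "r") as f:
    dset = f["entry0000/instrument/detector/data"]
    block = dset[151:7751]                    # contiguous read (~1.4 GB/s above)
    d = np.ascontiguousarray(block[::10])     # subsample afterwards in memory
```

For the third option, h5py exposes the mapping between a virtual dataset and its source files, which could be used to read the raw data directly. A minimal sketch; translating a global slice into per-source slices is the cumbersome part and is not shown:

```python
import h5py

with h5py.File("/data/scisofttmp/tomo_datasets/rings_1/S59_68p41b_16mm_6p5mm_F8_0001_1_1.nx", "r") as f:
    dset = f["entry0000/instrument/detector/data"]
    if dset.is_virtual:
        # Each entry maps a region of the virtual dataset to a region of a source file
        for vsource in dset.virtual_sources():
            print(vsource.file_name, vsource.dset_name)
```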
EDIT
It looks like even non-virtual datasets are crazy slow with strided slicing when the chunking is image-wise.
This takes forever:
```python
import h5py

with h5py.File("/data/scisofttmp/tomo_datasets/rings_1/scan0004/pco2linux_0000.h5", "r") as f:
    d00 = f["/entry_0000/instrument/pco2linux/data"][::10, :, :]
```
The above works (at decent speed) when `chunk=None`.
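A minimal self-contained reproducer sketch, with synthetic data and assumed sizes (much smaller than the real scans): the same stack is stored once with image-wise chunks and once contiguously, then a strided read is timed on each.

```python
import numpy as np
import h5py
from time import time

# Synthetic reproducer sketch (assumed sizes, not the real POA data).
data = np.random.rand(500, 256, 256).astype("float32")

with h5py.File("bench.h5", "w") as f:                             # hypothetical file
    f.create_dataset("chunked", data=data, chunks=(1, 256, 256))  # image-wise chunking
    f.create_dataset("contiguous", data=data)                     # chunk=None (contiguous)

for name in ("chunked", "contiguous"):
    with h5py.File("bench.h5", "r") as f:
        t0 = time()
        d = f[name][::10, :, :]     # strided read along the stacking axis
        print("%-10s %.3f s" % (name, time() - t0))
```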