Performance issue when subsampling projections
Need for data subsampling
For a quick assessment of a reconstruction, it would be useful to subsample projections (e.g. read one projection out of 10). Recent example: a POA dataset with 24k projections, each 16k pixels wide.
The problem
Unfortunately, the HDF5 library seems to suffer from performance issues when subsampling.
Works:
- slicing vertically (`[:, :100, :]`), for both virtual and non-virtual datasets, if `chunk=None`
Does not work (awfully slow):
- subsampling (`[::10, :, :]`), for both virtual and non-virtual datasets, when the chunking is image-wise. Even trivial subsampling (along the chunking axis) is slow! (A quick way to check a dataset's layout is sketched below.)
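As a first diagnostic, the layout of a given dataset can be inspected from h5py. A minimal sketch, using a hypothetical file and dataset path:

```python
import h5py

# Minimal sketch (hypothetical file/dataset paths): check whether a dataset is
# virtual and how it is chunked, both of which affect strided-read performance.
with h5py.File("scan.h5", "r") as f:               # hypothetical file name
    dset = f["/entry_0000/data"]                   # hypothetical dataset path
    print("virtual:", dset.is_virtual)             # True for HDF5 virtual datasets
    print("chunks :", dset.chunks)                 # None means contiguous storage
    print("shape  :", dset.shape, "dtype:", dset.dtype)
```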
I initially thought the culprit was virtual datasets (see https://github.com/h5py/h5py/issues/1597 and https://github.com/h5py/h5py/issues/2155)
Consider this piece of code:
```python
import h5py
from time import time

# Strided read: one projection out of 10, from a virtual dataset with image-wise chunks
t0 = time()
with h5py.File("/data/scisofttmp/tomo_datasets/rings_1/S59_68p41b_16mm_6p5mm_F8_0001_1_1.nx", "r") as f:
    d = f["entry0000/instrument/detector/data"][slice(151, 7751, 10)]
print(time() - t0)
```
- It takes 600 seconds on scisoft15, i.e. 14 MB/s (!).
- If `slice(151, 7751)` is used (no subsampling), the above takes 57 seconds, i.e. 1.4 GB/s.
- If a non-virtual dataset is used, the above piece of code takes 5 seconds (instead of 600).
Possible solutions
- Read the whole data range and subsample in-memory with numpy. In this case we gain nothing in I/O nor in memory! (See the first sketch after this list.)
- Only accept subsampling when dealing with non-virtual datasets.
- Try to bypass the virtual layer and go straight to the actual data sources. This entails cumbersome code on our side. (See the second sketch after this list.)
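A minimal sketch of the first option, reusing the dataset from the snippet above: read the contiguous range in one go (which is fast), then subsample in memory. Nothing is gained in I/O volume or peak memory.

```python
import h5py
import numpy as np

# Sketch of the "read everything, subsample with numpy" workaround: the
# contiguous read stays fast, but the I/O volume and peak memory are the same
# as reading the full range.
with h5py.File("/data/scisofttmp/tomo_datasets/rings_1/S59_68p41b_16mm_6p5mm_F8_0001_1_1.nx", "r") as f:
    dset = f["entry0000/instrument/detector/data"]
    block = dset[151:7751]                    # contiguous read (~1.4 GB/s above)
    d = np.ascontiguousarray(block[::10])     # subsample afterwards in memory
```

For the third option, h5py exposes the mapping between a virtual dataset and its source files, which could be used to read the raw data directly. A minimal sketch; translating a global slice into per-source slices is the cumbersome part and is not shown:

```python
import h5py

with h5py.File("/data/scisofttmp/tomo_datasets/rings_1/S59_68p41b_16mm_6p5mm_F8_0001_1_1.nx", "r") as f:
    dset = f["entry0000/instrument/detector/data"]
    if dset.is_virtual:
        # Each entry maps a region of the virtual dataset to a region of a source file
        for vsource in dset.virtual_sources():
            print(vsource.file_name, vsource.dset_name)
```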
EDIT
It looks like even non-virtual datasets are crazy slow with strided slicing when the chunking is image-wise.
This takes forever:
```python
import h5py

with h5py.File("/data/scisofttmp/tomo_datasets/rings_1/scan0004/pco2linux_0000.h5", "r") as f:
    d00 = f["/entry_0000/instrument/pco2linux/data"][::10, :, :]
```
The above works (at decent speed) when `chunk=None`.
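A minimal self-contained reproducer sketch, with synthetic data and assumed sizes (much smaller than the real scans): the same stack is stored once with image-wise chunks and once contiguously, then a strided read is timed on each.

```python
import numpy as np
import h5py
from time import time

# Synthetic reproducer sketch (assumed sizes, not the real POA data).
data = np.random.rand(500, 256, 256).astype("float32")

with h5py.File("bench.h5", "w") as f:                             # hypothetical file
    f.create_dataset("chunked", data=data, chunks=(1, 256, 256))  # image-wise chunking
    f.create_dataset("contiguous", data=data)                     # chunk=None (contiguous)

for name in ("chunked", "contiguous"):
    with h5py.File("bench.h5", "r") as f:
        t0 = time()
        d = f[name][::10, :, :]     # strided read along the stacking axis
        print("%-10s %.3f s" % (name, time() - t0))
```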