Speed up binning
Binning is supposed to speed up the process; this is especially important for the BM18 project, where volumes have to be reconstructed as fast as possible.
However, in practice:
- Full radios must still be loaded (the subsampling is done after averaging if we want to do things correctly), so the I/O cost is the same
- The binning operation is slow: even the "clever" numpy approaches are single-threaded and take several ms per frame
So binning actually makes the data loading slower: we have `I/O + binning` instead of `I/O` alone.
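For reference, the single-threaded numpy binning mentioned above is a reshape-and-mean trick along these lines (a sketch only; the actual nabu `binning()` implementation may differ):

```python
import numpy as np

def binning_2x2(img):
    # 2x2 binning by averaging: crop to even dimensions, then reshape so that
    # each 2x2 block gets its own pair of axes, and average over those axes.
    ny, nx = img.shape
    cropped = img[:ny - ny % 2, :nx - nx % 2]
    return cropped.reshape(ny // 2, 2, nx // 2, 2).mean(axis=(1, 3))
```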
```python
# No binning, 4600 projs
C0 = ChunkReader(proc.dataset_info.projections, sub_region=(None, None, 0, 510), convert_float=True)
%time C0.load_data()  # 5.88 s for 12 GB => 2.04 GB/s

# Horizontal binning
C1 = ChunkReader(proc.dataset_info.projections, sub_region=(None, None, 0, 510), convert_float=True, binning=(2, 1))
%time C1.load_data()  # 12 s for 6 "final GB"

# (2, 2) binning
C1 = ChunkReader(proc.dataset_info.projections, sub_region=(None, None, 0, 510), convert_float=True, binning=2)
%time C1.load_data()  # 11.2 s for 3 "final GB"
```
The only way to actually speed up I/O is to use rough subsampling at the HDF5 level; it is not clear whether this is acceptable, as it may introduce aliasing.
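To make the idea concrete, "rough subsampling at the HDF5 level" would be a strided read directly on the dataset, which skips pixels instead of averaging them (file name and dataset path below are placeholders):

```python
import h5py

# Strided hyperslab read: keep every other row/column of each projection.
# No averaging is done, hence the possible aliasing.
with h5py.File("scan.h5", "r") as f:                      # placeholder file name
    projs = f["/entry_0000/data/data"][:, ::2, ::2]       # placeholder dataset path
```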
Therefore, `binning` has to be sped up. I tried:
- Numba, `@numba.njit(parallel=True)` (multi-threading `binning()` on each image): no real speed-up, and it does not work on power9 (needs `llvmlite`)
- Cython (multi-threading `binning()` on each image): good, 5.87 s with 16 threads
- ThreadPool (distribute binning on threads): simple and efficient, 4.25 s with 32 threads (see the sketch after this list)
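A minimal sketch of the ThreadPool approach (function and parameter names are illustrative, not the actual nabu code):

```python
import numpy as np
from multiprocessing.pool import ThreadPool

def binning_2x2(img):
    # Same reshape-and-mean trick as above
    ny, nx = img.shape
    return img[:ny - ny % 2, :nx - nx % 2].reshape(ny // 2, 2, nx // 2, 2).mean(axis=(1, 3))

def bin_stack(frames, n_threads=32):
    # Distribute the per-frame binning over a thread pool; numpy releases the
    # GIL in its inner loops, so several frames are effectively binned in parallel.
    with ThreadPool(n_threads) as pool:
        return np.array(pool.map(binning_2x2, frames))
```

Each thread works on a different frame, so no per-frame synchronization is needed; the 4.25 s figure above was obtained with 32 threads.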