Resolve "nexus writer: chunking and compression"

Wout De Nolf requested to merge 2362-nexus-writer-chunking-and-compression into master

Closes #2362

Needs !3184

Proper HDF5 chunking and compression:

  • The chunk shape is calculated with h5py's guess_chunk (see the sketch after this list)
    • it can handle variable-length dimensions, marked by a zero (variable-length scans, or variable detector dimensions such as a sampling diode in SAMPLES mode)
    • the border case of scalar datasets (a 0D detector in a ct scan) cannot be handled by guess_chunk, but we don't need chunking in that case anyway
  • gzip compression is used when the total dataset size exceeds 10 KB
    • variable-length scans like a timescan always use gzip compression
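
A minimal sketch of how these rules could translate into h5py dataset creation options. The `dataset_options` helper is hypothetical, not the writer's actual API; `guess_chunk` lives in the private `h5py._hl.filters` module:

```python
import numpy
from h5py._hl.filters import guess_chunk  # h5py's chunk-shape heuristic (private module)

COMPRESSION_THRESHOLD = 10 * 1024  # 10 KB: the proposed (uneducated) guess

def dataset_options(shape, dtype):
    """Return chunking/compression keyword arguments for create_dataset.

    Zeros in `shape` mark variable-length dimensions. The helper name and
    the exact rule set are illustrative only.
    """
    if not shape:
        # Border case: scalar dataset (e.g. 0D detector of a ct scan).
        # guess_chunk cannot handle it, but chunking is not needed anyway.
        return {}
    itemsize = numpy.dtype(dtype).itemsize
    # Variable-length dimensions become unlimited (None) in maxshape:
    maxshape = tuple(n if n else None for n in shape)
    options = {"chunks": guess_chunk(shape, maxshape, itemsize)}
    if maxshape != shape:
        options["maxshape"] = maxshape
    nbytes = numpy.prod(shape, dtype=int) * itemsize
    if 0 in shape or nbytes > COMPRESSION_THRESHOLD:
        # Variable-length scans (e.g. a timescan) always get gzip;
        # fixed-size datasets only when they exceed the size threshold.
        options["compression"] = "gzip"
    return options
```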

The writer buffers data and saves it in multiples of the chunk shape (so datasets are also resized in multiples of the chunk shape). When data arrives too slowly (it takes longer than 3 seconds to accumulate one chunk of data), the writer saves data that is not aligned to the chunks. This reduces write performance, but since it only happens at slow data rates it shouldn't be a problem. A sketch of this buffering logic is given below.
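
A minimal sketch of the buffering idea, assuming a dataset that grows along its first axis; the `ChunkedBuffer` class and its `add`/`flush` API are illustrative, not the writer's actual implementation:

```python
import time
import numpy

FLUSH_TIMEOUT = 3.0  # seconds allowed to accumulate one chunk of data

class ChunkedBuffer:
    """Buffer points and write them chunk-aligned along the first axis.

    Assumes a chunked, resizable h5py dataset (maxshape[0] is None).
    """

    def __init__(self, dataset):
        self._dataset = dataset
        self._chunk_rows = dataset.chunks[0]  # rows per chunk along the scan axis
        self._buffer = []
        self._last_write = time.time()

    def add(self, points):
        self._buffer.extend(points)
        nchunks = len(self._buffer) // self._chunk_rows
        if nchunks:
            # Enough data for whole chunks: chunk-aligned write.
            self._write(nchunks * self._chunk_rows)
        elif self._buffer and time.time() - self._last_write > FLUSH_TIMEOUT:
            # Data arrives too slowly: accept a non-aligned write.
            self.flush()

    def flush(self):
        # Write whatever is buffered, ignoring chunk alignment.
        self._write(len(self._buffer))

    def _write(self, nrows):
        if not nrows:
            return
        data = numpy.asarray(self._buffer[:nrows])
        del self._buffer[:nrows]
        start = self._dataset.shape[0]
        self._dataset.resize(start + nrows, axis=0)
        self._dataset[start:] = data
        self._last_write = time.time()
```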

In addition, buffered data is flushed (ignoring chunk-aligned writing) as part of finalization and error handling, as in the usage sketch below.
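
In terms of the sketch above, this amounts to a flush in a `finally` clause (`acquisition()` is a stand-in for the writer's data source):

```python
buffer = ChunkedBuffer(dataset)
try:
    for points in acquisition():  # hypothetical data source
        buffer.add(points)
finally:
    # Finalization and error handling: flush the remaining buffered data,
    # ignoring chunk-aligned writing.
    buffer.flush()
```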

Note that none of this affects the Lima data. It applies to 0D detectors (diodes) and 1D detectors (MCAs).

@wright @sole @jerome.kieffer What do you think? The 10 KB and 3 second thresholds are currently an uneducated guess.

Edit: the rules for chunking and compression changed (see the discussion below)
