nexus writer: chunking and compression
Reported by @jerome.kieffer @sole
We need to change the way chunking/compression/maxshape is handled in the writer. Currently the writer is set up to handle variable dataset shapes that are not known beforehand, but this results in files that are inefficient to read.
2D data (lima) is not involved in this discussion: lima/dectris does the chunking/compression itself and the writer creates a VDS at the end of the scan (a VDS cannot grow dynamically as far as I know).
Let's take the example of a known-length scan (i.e. a scan that publishes the requested number of points in Redis). The writer currently creates datasets with the following arguments:
```python
# 0D detector (all kinds of counters)
{'shape': (0,), 'chunks': True, 'maxshape': (None,), 'compression': 'gzip'}
# 1D detector with variable size (e.g. sampling diodes which take x samples per point)
{'shape': (0, 1), 'chunks': True, 'maxshape': (None, None), 'compression': 'gzip'}
# 1D detector with fixed size (e.g. MCA, mythen)
{'shape': (0, 1024), 'chunks': True, 'maxshape': (None, None), 'compression': 'gzip'}
```
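For reference, these argument sets correspond to plain h5py `create_dataset` calls like the sketch below (the file name and dataset paths are made up, not the writer's actual layout):

```python
import h5py

with h5py.File("scan.h5", "w") as f:
    # 0D detector (counters): only the scan dimension, resizable
    f.create_dataset("measurement/diode1", shape=(0,), chunks=True,
                     maxshape=(None,), compression="gzip")
    # 1D detector with variable size (sampling diodes)
    f.create_dataset("measurement/sampling1", shape=(0, 1), chunks=True,
                     maxshape=(None, None), compression="gzip")
    # 1D detector with fixed size (MCA, mythen)
    f.create_dataset("measurement/mca1", shape=(0, 1024), chunks=True,
                     maxshape=(None, None), compression="gzip")
```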
Notes on `shape` and `maxshape`:

- The first dimension is the scan dimension and its `maxshape` is `None` as it needs to grow (see the append sketch below). It needs to be variable even for a known-length scan, not only because the scan can be aborted, but mostly because not all detectors produce n points when the scan asks for n points (e.g. MUSST counters, so all zap scans).
- The next dimensions are the detector dimensions and their `maxshape` is also `None` because the writer doesn't know whether the detector shape is variable or not.
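A minimal sketch of why the scan dimension has to stay resizable: points arrive one by one and the dataset is grown with `resize()` as they come in (the dataset name, dtype and the `append_points` helper are illustrative, not the writer's code):

```python
import h5py
import numpy as np

def append_points(dset, new_points):
    """Grow the dataset along the scan axis and write the new points."""
    n_old = dset.shape[0]
    n_new = n_old + len(new_points)
    dset.resize(n_new, axis=0)
    dset[n_old:n_new] = new_points

with h5py.File("scan.h5", "w") as f:
    diode = f.create_dataset("measurement/diode1", shape=(0,),
                             maxshape=(None,), chunks=True, dtype=float)
    # the writer would call this every time new points are read from Redis
    append_points(diode, np.random.random(5))
```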
So this is nice and flexible for writing, but not efficient for reading:

- As I use auto-chunking, the chunk shape (which is calculated from shape/maxshape at dataset creation time) comes out too small, because at that point the dataset is still small (even zero-length in the scan dimension); see the comparison sketch below.
- As all dimensions are variable, compression is always enabled in anticipation of lots of data.
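To illustrate the auto-chunking point, a small comparison sketch (the dataset names and the 100000-point scan length are made up): it simply prints what h5py picks when the dataset is created nearly empty versus with the full shape known up front:

```python
import h5py
import numpy as np

npoints = 100000  # hypothetical known scan length

with h5py.File("compare.h5", "w") as f:
    # created the way the writer does now: empty scan dimension
    small = f.create_dataset("created_empty", shape=(0, 1024),
                             maxshape=(None, None), chunks=True,
                             dtype=np.float64)
    # created with the final shape known at creation time
    full = f.create_dataset("created_full", shape=(npoints, 1024),
                            chunks=True, dtype=np.float64)
    print("chunks when created empty:", small.chunks)
    print("chunks when created full :", full.chunks)
```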
What I propose to do:
- Buffer data in the writer before creating the dataset so that the auto-chunking calculates a reasonable chunk shape (I still have to think about the buffer size), or perhaps calculate the chunk shape myself. Once the dataset is created, buffering of subsequent data can be based on the chunk shape, so writes are aligned with the chunks. A rough sketch follows after this list.
- Compression for known-length scans: ignore the fact that all dimensions could be variable and compress only when the total expected size is > 1 MB.
- Compression for unknown-length scans: this is tricky. Either the length of the scan is fixed but not published in Redis, or the length is really variable (like a timescan). I only need to decide on compression after buffering, upon dataset creation, so I can check the total data size at that point. This could mean that the data of long scans is not compressed.
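A rough sketch of the proposed flow for a fixed-size 1D detector, assuming data is buffered per scan point before the dataset exists. The function name, buffer handling and dataset path are made up; only the 1 MB compression threshold comes from the proposal above. Variable-size detectors and chunk-aligned writing of subsequent data are not covered here.

```python
import h5py
import numpy as np

COMPRESSION_LIMIT = 1 << 20  # compress only when the expected data exceeds ~1 MB

def create_scan_dataset(h5group, name, buffered, npoints_expected=None):
    """Create the dataset only after enough data has been buffered.

    `buffered` is a list of per-point numpy arrays collected before creation;
    `npoints_expected` is the scan length from Redis, or None for an
    unknown-length scan (in that case the buffered size is all we know).
    """
    first = np.asarray(buffered[0])
    detector_shape = first.shape
    npoints = npoints_expected if npoints_expected else len(buffered)

    # Decide compression from the total expected size instead of assuming
    # "variable shape -> lots of data -> always gzip".
    expected_bytes = npoints * first.size * first.dtype.itemsize
    compression = "gzip" if expected_bytes > COMPRESSION_LIMIT else None

    # Pick a chunk covering the detector dimensions and a reasonable number
    # of scan points (here simply the number of buffered points).
    chunks = (len(buffered),) + detector_shape

    dset = h5group.create_dataset(
        name,
        shape=(len(buffered),) + detector_shape,
        maxshape=(None,) + detector_shape,  # only the scan dimension grows
        chunks=chunks,
        dtype=first.dtype,
        compression=compression,
    )
    dset[:] = np.stack(buffered)
    return dset

with h5py.File("scan.h5", "w") as f:
    points = [np.random.random(1024) for _ in range(50)]  # buffered MCA spectra
    create_scan_dataset(f, "measurement/mca1", points, npoints_expected=1000)
```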