Abysmal performance when reading EDF datasets
Problem description
On EDF datasets, reading the data is extremely slow: it takes more than 95% of the total processing time.
Example on ha_refill (reconstruction took 2h15):
This seems quite recent.
Hypothesis 1: EDF reader should be optimized
Nabu uses the EdfImage module from silx (which is a stripped-down version of the one from pymca).
Let's try with a supposedly faster reader, the one from fabio, which offers a "fast mode".
from time import time
import numpy as np
from silx.third_party.EdfFile import EdfFile as silx_EdfImage
from fabio.edfimage import EdfImage as fabio_EdfImage

def bench_read_chunk_fabio(files, chunk):
    # chunk is a (row_slice, column_slice) tuple
    reader = fabio_EdfImage()
    # Read one file completely first, so that the header layout is known
    # before calling fast_read_roi on the others
    reader.read(files[0])
    data = np.zeros(
        (len(files), ) + (chunk[0].stop - chunk[0].start, chunk[1].stop - chunk[1].start),
        dtype="f"
    )
    t0 = time()
    for i, fname in enumerate(files):
        data[i] = reader.fast_read_roi(fname, chunk)
    el = time() - t0
    return el, data

def bench_read_chunk_silx(files, chunk):
    # EdfFile expects Pos and Size in (column, row) order
    pos = (chunk[1].start, chunk[0].start)
    size = (chunk[1].stop - chunk[1].start, chunk[0].stop - chunk[0].start)
    data = np.zeros(
        (len(files), ) + size[::-1],
        "f"
    )
    t0 = time()
    for i, fname in enumerate(files):
        reader = silx_EdfImage(fname, access="r", fastedf=True)
        data[i] = reader.GetData(0, Pos=pos, Size=size)
    el = time() - t0
    return el, data

from glob import glob
fl = glob("/data/scisofttmp/tomo_datasets/ha_refill/HA-800_2.25um_FO-20.122ULZ-OLZ-OLP_008_/HA-800_2.25um_FO-20.122ULZ-OLZ-OLP_008_????.edf")
fl.sort()
# Read the same chunk (first 100 rows, 2048 columns) from every file with both readers
el_s, d_s = bench_read_chunk_silx(fl, (slice(0, 100), slice(0, 2048)))
el_f, d_f = bench_read_chunk_fabio(fl, (slice(0, 100), slice(0, 2048)))
Results:
silx: 357 s
fabio: 376 s
This does not seem to help.
Hypothesis 2: regression in nabu
It might also be a regression somewhere in nabu. But I tested with both 2021.2.0-beta1 and 2020.5.0 (a version from almost one year ago) and the behavior is the same.
Hypothesis 3: the file system has an issue and/or is not optimized for such an access pattern
To be investigated.
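One possible way to start: time plain byte reads on the same files, without any EDF parsing, and compare with the timings above. The sketch below is only an illustration; `bench_raw_read` and its `offset` and `nbytes` parameters are hypothetical names, not part of nabu, silx or fabio.

```python
from time import perf_counter


def bench_raw_read(files, offset=0, nbytes=None):
    """Time plain byte reads from each file, with no EDF decoding at all.

    offset and nbytes are illustrative parameters: they select which part
    of each file is read (nbytes=None reads up to the end of the file).
    Returns (elapsed_seconds, total_bytes_read).
    """
    total = 0
    t0 = perf_counter()
    for fname in files:
        with open(fname, "rb") as f:
            f.seek(offset)
            buf = f.read() if nbytes is None else f.read(nbytes)
            total += len(buf)
    elapsed = perf_counter() - t0
    return elapsed, total
```

For example, `el_raw, n = bench_raw_read(fl)` on the file list used above. If the raw reads alone take a time comparable to the silx and fabio figures, the bottleneck is probably on the file system side rather than in the EDF readers. Note that a second run on the same files can be much faster because of the operating system's page cache, so only the first (cold) run is representative.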
Current solution
Use the "grouped pipeline" when full volumes are to be reconstructed.
Edited by Pierre Paleo