
Resolve "icat-sync-raw: needs caching"

Wout De Nolf requested to merge 29-icat-sync-raw-needs-caching into main

Closes #29 (closed)

This MR improves icat-sync-raw:

  • Replace raw dictionaries with dataclasses
  • These dataclasses are JSON serializable, which makes it easier to implement caching with invalidation
  • The dataclasses are also used to simplify the test_icat_sync tests
  • The dataclasses in sync_types are covered by tests but ExperimentalSessionStore is not

icat-sync-raw --save-dir ./ --cache-dir /data/scisoft/icat_sync

When --save-dir is provided, CSV files and bash scripts are generated to resolve datasets that were not properly registered.

When --cache-dir is provided, JSON files are created to cache session information from ICAT and HDF5. The cache of a session is invalidated when the session directory no longer exists or when --register is run on that session.
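To illustrate the caching described above, here is a minimal sketch of a per-session JSON cache built on a JSON-serializable dataclass. It is not the code in this MR: `SessionInfo`, its fields, the cache file naming and the helper functions are all hypothetical and only show the general idea of invalidating a cached session when its directory is gone or after registration.

```python
import json
from dataclasses import asdict, dataclass, field
from pathlib import Path
from typing import List, Optional


@dataclass
class SessionInfo:
    # Hypothetical stand-in for the dataclasses in sync_types.
    session_dir: str
    registered_datasets: List[str] = field(default_factory=list)
    unregistered_datasets: List[str] = field(default_factory=list)


def cache_file(cache_dir: Path, session_dir: str) -> Path:
    # Assumed naming scheme: one JSON file per session directory.
    return cache_dir / (session_dir.strip("/").replace("/", "_") + ".json")


def save_session(cache_dir: Path, info: SessionInfo) -> None:
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file(cache_dir, info.session_dir).write_text(json.dumps(asdict(info)))


def invalidate_session(cache_dir: Path, session_dir: str) -> None:
    # Would be called, for example, after registering datasets of the session.
    cache_file(cache_dir, session_dir).unlink(missing_ok=True)  # Python >= 3.8


def load_session(cache_dir: Path, session_dir: str) -> Optional[SessionInfo]:
    path = cache_file(cache_dir, session_dir)
    if not path.exists():
        return None  # cache miss: the caller has to query ICAT and HDF5
    if not Path(session_dir).is_dir():
        # The session directory no longer exists: drop the stale cache entry.
        invalidate_session(cache_dir, session_dir)
        return None
    return SessionInfo(**json.loads(path.read_text()))
```

A caller would try `load_session` first and only fall back to the slow ICAT and HDF5 scan (followed by `save_session`) when it returns `None`.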

Currently there are ~2300 sessions on /data/visitor. On my computer the first run took 12 hours and subsequent runs take ~1 min. It creates about 1200 mount points. There are about 40 sessions marked as "todo". We can use /data/scisoft/icat_sync as a shared cache.

We can probably speed up the first run through parallelization, but it might stress ICAT+ and/or the disk too much (we open every single HDF5 file and ask ICAT for the list of datasets of every investigation). We could do this in another MR if needed.
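If we go that way, one option to limit the load is a small, fixed-size worker pool. The sketch below is only an assumption of how it could look and is not part of this MR: `scan_sessions` and `scan_one` are hypothetical names, where `scan_one` stands for the per-session work (opening the HDF5 files and querying ICAT+).

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, TypeVar

T = TypeVar("T")


def scan_sessions(
    session_dirs: Iterable[str],
    scan_one: Callable[[str], T],
    max_workers: int = 4,
) -> Dict[str, T]:
    """Scan sessions with at most `max_workers` scans running concurrently."""
    session_dirs = list(session_dirs)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in input order; an exception raised by
        # scan_one is re-raised here when its result is consumed.
        results = pool.map(scan_one, session_dirs)
        return dict(zip(session_dirs, results))
```

Keeping `max_workers` low bounds the number of simultaneous ICAT+ requests and open HDF5 files, at the cost of a smaller speed-up.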

On every run, cached sessions whose directory no longer exists are invalidated. For example:

Invalidate /data/visitor/hg188/id11/20221201 (no longer exists)
$ ls /data/visitor/hg188/id11/20221201
ls: cannot access '/data/visitor/hg188/id11/20221201': No such file or directory

$ ls /data/visitor/hg188/id11/
20221201.to_delete  20240118
Edited by Wout De Nolf
