Resolve "icat-sync-raw: needs caching"
Closes #29 (closed)
This MR improves icat-sync-raw:
- Replace raw dictionaries with dataclasses
- These dataclasses facilitate the implementation of the caching with invalidation (they are JSON serializable)
- In addition, the dataclasses are also used to simplify the test_icat_sync tests
- The dataclasses in sync_types are covered by tests but ExperimentalSessionStore is not
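As a rough illustration of why dataclasses make the cache easier to implement, here is a minimal sketch of a JSON round trip. The class and field names (SessionInfo, Dataset, session_dir, registered) are hypothetical and are not the actual types in sync_types.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class Dataset:
    """Hypothetical dataset record (not the real sync_types class)."""

    path: str
    registered: bool = False


@dataclass
class SessionInfo:
    """Hypothetical per-session record that can be cached as JSON."""

    session_dir: str
    datasets: List[Dataset] = field(default_factory=list)

    def to_json(self) -> str:
        # asdict() recursively converts nested dataclasses to plain dicts.
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, text: str) -> "SessionInfo":
        raw = json.loads(text)
        # Nested dataclasses must be rebuilt explicitly when loading.
        datasets = [Dataset(**item) for item in raw.pop("datasets")]
        return cls(datasets=datasets, **raw)
```

Because dataclasses compare by value, a test can simply assert that SessionInfo.from_json(info.to_json()) == info, which is harder to express cleanly with raw dictionaries of mixed content.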
icat-sync-raw --save-dir ./ --cache-dir /data/scisoft/icat_sync
When --save-dir is provided, CSV files and bash scripts are generated to resolve datasets that were not properly registered.
When --cache-dir is provided, JSON files are created to cache session information from ICAT and HDF5. A session's cache entry is invalidated when needed: when the session directory no longer exists or when --register is run on the session.
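The invalidation rules can be sketched as follows. This is only an illustration under assumptions: the SessionCache class, its method names and the one-file-per-session layout are hypothetical and do not describe how ExperimentalSessionStore is actually implemented.

```python
import hashlib
import json
from pathlib import Path
from typing import List, Optional


class SessionCache:
    """Hypothetical file-based cache: one JSON file per session directory."""

    def __init__(self, cache_dir: str) -> None:
        self._cache_dir = Path(cache_dir)
        self._cache_dir.mkdir(parents=True, exist_ok=True)

    def _cache_file(self, session_dir: str) -> Path:
        # Hash the session path to get a flat, filesystem-safe file name.
        digest = hashlib.sha256(session_dir.encode()).hexdigest()
        return self._cache_dir / f"{digest}.json"

    def save(self, session_dir: str, info: dict) -> None:
        payload = {"session_dir": session_dir, "info": info}
        self._cache_file(session_dir).write_text(json.dumps(payload))

    def load(self, session_dir: str) -> Optional[dict]:
        cache_file = self._cache_file(session_dir)
        if not cache_file.exists():
            return None  # cache miss: caller must query ICAT and HDF5 again
        return json.loads(cache_file.read_text())["info"]

    def invalidate(self, session_dir: str) -> None:
        # Called, for example, after registering datasets of this session.
        cache_file = self._cache_file(session_dir)
        if cache_file.exists():
            cache_file.unlink()

    def invalidate_missing(self) -> List[str]:
        """Drop entries whose session directory no longer exists."""
        removed = []
        for cache_file in self._cache_dir.glob("*.json"):
            session_dir = json.loads(cache_file.read_text())["session_dir"]
            if not Path(session_dir).is_dir():
                cache_file.unlink()
                removed.append(session_dir)
        return removed
```

In this sketch, invalidate_missing() would run at the start of every run and invalidate() after a registration, so the next run rebuilds only the affected sessions.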
Currently there are ~2300 sessions on /data/visitor. On my computer the first run took 12 hours and subsequent runs take ~1 min. It creates about 1200 mount points. There are about 40 sessions marked as "todo". We can use /data/scisoft/icat_sync as a shared cache.
We can probably speed up the first run by parallelizing it, but that might stress ICAT+ and/or the disk too much (we open every single HDF5 file and ask ICAT for the list of datasets for every investigation). We could do it in another MR if needed.
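If parallelization is attempted later, one option is a small, bounded thread pool so that the number of concurrent ICAT+ requests and HDF5 reads stays capped. The sketch below assumes a hypothetical scan_one callable that gathers the information for a single session; neither the function names nor the default of 4 workers come from this MR.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, TypeVar

T = TypeVar("T")


def scan_sessions(
    session_dirs: Iterable[str],
    scan_one: Callable[[str], T],
    max_workers: int = 4,
) -> Dict[str, T]:
    """Scan sessions concurrently with at most `max_workers` in flight.

    `scan_one` is a hypothetical per-session function (ICAT query plus
    HDF5 inspection). A small worker count limits the load on ICAT+ and
    on the disk.
    """
    dirs = list(session_dirs)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with dirs.
        results = list(pool.map(scan_one, dirs))
    return dict(zip(dirs, results))
```

Threads are used here because the work is expected to be I/O bound; whether that holds for the HDF5 part would need to be measured.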
On every run, cached sessions whose directory no longer exists are invalidated. For example:
Invalidate /data/visitor/hg188/id11/20221201 (no longer exists)
$ ls /data/visitor/hg188/id11/20221201
ls: cannot access '/data/visitor/hg188/id11/20221201': No such file or directory
$ ls /data/visitor/hg188/id11/
20221201.to_delete 20240118