NeXus writer: dynamic disk space warnings
The 200 TB disk quota of /data/visitor/in1172 was exceeded and our disk space checks failed to catch this, most likely because the fixed disk space limits were too low for the data rate.
https://requests.esrf.fr/browse/SCHLP-24728
Bliss and the NeXus writer already monitor disk space and notify the user or prevent starting a scan:
Do not start a scan when "free_space < required_disk_space"
During a scan, print a warning when "free_space < recommended_disk_space" (checked every 3 seconds)
The NeXus writer stops writing when "free_space < required_disk_space" (checked every 3 seconds)
In all three cases, the user should get an easy-to-understand warning or error message. We check disk space like this:
import os
required_disk_space = 200 # MB, configurable
recommended_disk_space = 1024 # MB, configurable
dataset_directory = "/data/visitor/in1172/bm18/20240221/RAW_DATA/helical_HA2200_33.94um_Hauser_pte/helical_HA2200_33.94um_Hauser_pte_0001/"
stat = os.statvfs(dataset_directory)
free_space = stat.f_frsize * stat.f_bavail / 1024**2 # MB
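The three checks listed above can be summarised in one small function. This is only a sketch: `free_space_mb`, `check_disk_space` and the returned labels are hypothetical names for illustration, not the actual Bliss or NeXus writer API.

```python
import os

MB = 1024**2


def free_space_mb(path):
    """Free space available to unprivileged users on the filesystem of `path`, in MB."""
    stat = os.statvfs(path)
    return stat.f_frsize * stat.f_bavail / MB


def check_disk_space(free_space, required_disk_space=200, recommended_disk_space=1024):
    """Classify free space against the two limits (all values in MB)."""
    if free_space < required_disk_space:
        return "error"  # do not start a scan / stop writing
    if free_space < recommended_disk_space:
        return "warning"  # print a warning during the scan
    return "ok"
```

For example, `check_disk_space(free_space_mb(dataset_directory))` would return "warning" when between 200 MB and 1024 MB are left.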
The problem is that when the data rate is larger than "required_disk_space/3" MB/s (about 67 MB/s with the default of 200 MB, which is most likely exceeded at BM18), the disk can fill up between two consecutive checks, so the check comes too late. I think that is why the check did not work and the users did not get an informative error message. Another reason could be that the os.statvfs call took several seconds (we know this can happen on NFS). The call is executed in a separate thread, so in that case the scan keeps running until the call returns.
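To tell a slow os.statvfs call apart from a healthy disk, the call could be given a deadline. A minimal sketch, assuming a hypothetical helper `free_space_mb_with_timeout`: the 3 second default mirrors the check interval, and a return value of None means "unknown", which the caller should treat as a reason to warn rather than as enough space.

```python
import os
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

# A single worker: a hung statvfs call blocks later calls, which then also time out.
_executor = ThreadPoolExecutor(max_workers=1)


def _free_space_mb(path):
    stat = os.statvfs(path)
    return stat.f_frsize * stat.f_bavail / 1024**2


def free_space_mb_with_timeout(path, timeout=3.0, statvfs=_free_space_mb):
    """Return the free space in MB, or None when the call did not return in time.

    `statvfs` is injectable so that the timeout path can be tested.
    """
    future = _executor.submit(statvfs, path)
    try:
        return future.result(timeout=timeout)
    except FutureTimeout:
        return None
```

Note that the worker thread cannot be interrupted: a call that hangs on NFS keeps running in the background, the sketch only stops the caller from waiting for it.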
The error message the user saw was:
RuntimeError: ('GroupingMaster', 'Nexus writer is in FAULT state (Driver truncate request failed (slist already enabled?))')
In the writer error logs I see:
blissadm@lbm18ctrl:/var/log$ grep FAULT nexus_writer.log
...
ERROR 2024-02-21 20:29:14,239 nexus_writer_service.subscribers.session_writer: [MRTOMO-4 (RUNNING)] [2_dark images-4 (FAULT)] [Errno 122] Unable to open file (unable to close file, errno = 122, error message = 'Disk quota exceeded')
...
ERROR 2024-02-21 20:29:14,350 nexus_writer_service.subscribers.session_writer: [MRTOMO-3 (RUNNING)] [fullturn-3 (FAULT)] Driver truncate request failed (slist already enabled?)
So the users saw the second error but not the first. What happened is that the scan "2_dark images" failed with the 'Disk quota exceeded' error, which was ignored (I don't understand how), and then the scan "fullturn" was started and failed with a cryptic error message about a failing truncate request.
What we can try is to make the recommended_disk_space and required_disk_space limits dynamic, depending on the data rate during the scan (which could be different for every scan). However, if the os.statvfs call takes several seconds to return (I am not sure how often this happens), we cannot do anything.
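One way to derive dynamic limits is to estimate the data rate from consecutive free space readings and require enough space to survive several check intervals. A sketch under stated assumptions: `DynamicLimits`, `safety_intervals` and the factor 5 between the recommended and required limits are hypothetical choices, and any decrease in free space is attributed to the running scan (other writers on the same quota inflate the estimate, which errs on the safe side).

```python
import time


class DynamicLimits:
    """Scale the disk space limits (MB) with the peak data rate seen so far (MB/s)."""

    def __init__(self, min_required=200.0, min_recommended=1024.0,
                 check_interval=3.0, safety_intervals=2):
        self.min_required = min_required  # static lower bound
        self.min_recommended = min_recommended  # static lower bound
        self.check_interval = check_interval  # seconds between checks
        self.safety_intervals = safety_intervals  # intervals we want to survive
        self.data_rate = 0.0  # peak observed rate
        self._last = None  # (timestamp, free_space) of the previous reading

    def update(self, free_space, now=None):
        """Register a free space reading and update the peak data rate."""
        now = time.monotonic() if now is None else now
        if self._last is not None:
            t0, free0 = self._last
            dt = now - t0
            if dt > 0:
                rate = max(0.0, (free0 - free_space) / dt)
                self.data_rate = max(self.data_rate, rate)
        self._last = (now, free_space)

    @property
    def required_disk_space(self):
        needed = self.data_rate * self.check_interval * self.safety_intervals
        return max(self.min_required, needed)

    @property
    def recommended_disk_space(self):
        return max(self.min_recommended, 5 * self.required_disk_space)


limits = DynamicLimits()
limits.update(100_000.0, now=0.0)
limits.update(97_000.0, now=3.0)  # 3000 MB written in 3 s
print(limits.data_rate, limits.required_disk_space, limits.recommended_disk_space)
```

At 1000 MB/s the static limit of 200 MB is consumed in 0.2 s, far less than the 3 second check interval, whereas the dynamic limit grows to 6000 MB. This does not help when the free space reading itself is delayed.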