Real life experiment: fd leaks and Redis memory limit
I started stress-testing the Nexus writer beyond the unit tests, and there are currently two clear issues:
- Long scans cause Redis to reach its memory limit. With 8 MCA channels this happens after ~25000 points (about 5 hours with the mockup controllers). We will be running much larger scans.
- Running many scans causes the writer process to run out of file descriptors, because sockets are not closed after listening to Redis events.
Due to these issues, a real life experiment is currently not possible.
The second issue is a matter of finding the leaks (related to #1246 (closed)). But how are we going to solve the first issue? Even when the MCA data is moved out of Redis, there is still a limit on how long a scan can run, which is unacceptable imo. Shouldn't we allow Redis to grow as large as the system can take during one scan?
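If letting Redis grow during a scan is acceptable, the first issue may be a matter of configuration rather than code. A sketch of the relevant `redis.conf` directives (the values here are placeholders for illustration, not a recommendation):

```
# maxmemory 0 lets Redis grow until the OS refuses to allocate more;
# any non-zero value is a hard cap that triggers the OOM error below.
maxmemory 0

# With noeviction, writes fail with "OOM command not allowed" at the cap
# instead of Redis silently evicting scan data.
maxmemory-policy noeviction
```

These can also be changed at runtime with `CONFIG SET maxmemory <bytes>`, though raising the cap only moves the failure point later; it does not remove the limit on scan length.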
Exception when running out of memory:
```
Traceback (most recent call last):
File "/users/opid21/dev/bliss/nexus_writer_service/scan_writers/writer_base.py", line 889, in _run
File "/users/opid21/dev/bliss/nexus_writer_service/scan_writers/writer_base.py", line 929, in _listen_scan_events
File "/users/opid21/dev/bliss/nexus_writer_service/scan_writers/writer_base.py", line 945, in _listen_scan_events_loop
File "/users/opid21/dev/bliss/bliss/data/node.py", line 332, in walk_events
File "/users/opid21/dev/bliss/bliss/data/node.py", line 386, in wait_for_event
File "/users/opid21/dev/bliss/bliss/data/node.py", line 103, in get_node
File "/users/opid21/dev/bliss/bliss/data/node.py", line 121, in get_nodes
File "/users/opid21/.pyenv/versions/miniconda3-latest/envs/bliss_env/lib/python3.7/site-packages/redis/client.py", line 3691, in execute
return execute(conn, stack, raise_on_error)
File "/users/opid21/.pyenv/versions/miniconda3-latest/envs/bliss_env/lib/python3.7/site-packages/redis/client.py", line 3589, in _execute_transaction
raise errors[0][1]
File "/users/opid21/.pyenv/versions/miniconda3-latest/envs/bliss_env/lib/python3.7/site-packages/redis/client.py", line 3576, in _execute_transaction
self.parse_response(connection, '_')
File "/users/opid21/.pyenv/versions/miniconda3-latest/envs/bliss_env/lib/python3.7/site-packages/redis/client.py", line 3650, in parse_response
self, connection, command_name, **options)
File "/users/opid21/.pyenv/versions/miniconda3-latest/envs/bliss_env/lib/python3.7/site-packages/redis/client.py", line 853, in parse_response
response = connection.read_response()
File "/users/opid21/.pyenv/versions/miniconda3-latest/envs/bliss_env/lib/python3.7/site-packages/redis/connection.py", line 718, in read_response
raise response
redis.exceptions.ResponseError: Command # 1 (HGET nexus_writer_config:data:id21:tmp:nexustest:1_amesh:axis:timer:epoch name) of pipeline caused error: OOM command not allowed when used memory > 'maxmemory'.
```
Exception when running out of file descriptors:
```
Unable to create file (unable to open file: name = '/data/id21/tmp/nexustest_many/nexus_writer_config/test_external.h5', errno = 24, error message = 'Too many open files', flags = 15, o_flags = c2)
INFO:nexus_writer_service.session_writer: [nexus_writer_config-0 (RUNNING)] [945_loopscan-168 (ON)] title = 'loopscan'
ERROR:nexus_writer_service.session_writer: [nexus_writer_config-0 (RUNNING)] [946_ascan-169 (FAULT)] Stop writer due to exception:
```
```
Traceback (most recent call last):
File "/users/opid21/.pyenv/versions/miniconda3-latest/envs/bliss_env/lib/python3.7/site-packages/h5py/_hl/files.py", line 182, in make_fid
File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
File "h5py/h5f.pyx", line 85, in h5py.h5f.open
OSError: Unable to open file (unable to open file: name = '/data/id21/tmp/nexustest_many/nexus_writer_config/test_external.h5', errno = 24, error message = 'Too many open files', flags = 1, o_flags = 2)
```
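To track down the descriptor leak, one option is to log the number of open file descriptors from within the writer process after each scan. A minimal sketch, assuming a Linux host (`/proc/self/fd` is Linux-specific, and `open_fd_count` is a hypothetical helper, not part of bliss):

```python
import os

def open_fd_count():
    """Number of file descriptors currently open in this process (Linux only)."""
    return len(os.listdir("/proc/self/fd"))

# Simulate a leak: open a file without closing it and watch the count grow.
before = open_fd_count()
leaked = open("/tmp/fd_leak_demo.txt", "w")
assert open_fd_count() == before + 1

leaked.close()  # closing returns the count to its previous value
assert open_fd_count() == before
```

If sockets from `walk_events` are never closed, this count should increase monotonically with the number of scans, which would confirm the leak before digging into where the close is missing.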