[writer] timeouts

Timeouts occur occasionally when calling a method or attribute of the Nexus writer tango device.

I can reproduce the issue with these 3 python processes:

Nexus writer server: green mode (not mixed with threads, so the current writer architecture)
client1: Bliss which is continuously performing scans: checks the state every 3 seconds during the scan + does some tango calls before and after each scan
client2: polling the scan_uris attribute (this attribute is not read by client1). The result is a DevVarStringArray which grows over time (one element per scan).

This is how a normal scan_uris attribute read from client2 looks like:

1600874217 [140508102915840] DEBUG dserver/nexuswriter/testwriters Device_5Impl::read_attributes_5 arrived for dev id00/bliss_nxwriter/nexus_writer_session, att[0] = scan_uris
1600874217 [140508102915840] DEBUG dserver/nexuswriter/testwriters SubDevDiag::get_associated_device() entering ...
1600874217 [140508102915840] DEBUG dserver/nexuswriter/testwriters SubDevDiag::get_associated_device() found : No associated device name!
1600874217 [140508102915840] DEBUG dserver/nexuswriter/testwriters SubDevDiag::set_associated_device() entering ...
1600874217 [140508102915840] DEBUG dserver/nexuswriter/testwriters Locking attribute mutex for attribute scan_uris
1600874217 [140508102915840] DEBUG dserver/nexuswriter/testwriters Giving attribute mutex to CORBA structure for attribute scan_uris
1600874217 [140508102915840] DEBUG dserver/nexuswriter/testwriters SubDevDiag::set_associated_device() entering ...
1600874217 [140508102915840] DEBUG dserver/nexuswriter/testwriters Leaving Device_3Impl::read_attributes_no_except

This is how an unsuccessful scan_uris attribute read from client2 looks like (call fails with 3 second timeout):

1600874217 [140508102915840] DEBUG dserver/nexuswriter/testwriters Device_5Impl::read_attributes_5 arrived for dev id00/bliss_nxwriter/nexus_writer_session, att[0] = scan_uris
1600874217 [140508102915840] DEBUG dserver/nexuswriter/testwriters SubDevDiag::get_associated_device() entering ...
1600874217 [140508102915840] DEBUG dserver/nexuswriter/testwriters SubDevDiag::get_associated_device() found : No associated device name!
1600874217 [140508102915840] DEBUG dserver/nexuswriter/testwriters SubDevDiag::set_associated_device() entering ...
1600874217 [140508102915840] DEBUG dserver/nexuswriter/testwriters Locking attribute mutex for attribute scan_uris
... other attribute reads from client1 are being processes succesfully ...
... "Giving attribute mutex to CORBA structure for attribute scan_uris" does not show up anymore ...

So it seems that after some time of having client1 and client2 running in parallel, a call from client2 fails (3 second timeout) while the calls from client1 are being processed normally.

I also tried running the server with a gevent monitor thread which detects if the gevent loop was blocked for more than 1 second:

export GEVENT_MONITOR_THREAD_ENABLE=true;
export GEVENT_MAX_BLOCKING_TIME=1

This doesn't get triggered and the calls from client1 are being processed normally. Both indicate that the gevent loop is not blocked.

In the real case, you don't have client2 (unless you have jive open) and a call from client1 (the only client) times out anyway (occasionally). Currently Bliss handles this by trying a few times and printing a warning if it fails.

Edited Sep 24, 2020 by Wout De Nolf