Workers: CPU usage
Introduction
Nabu processing can be done using the GPU and/or the CPU.
- GPU: for ESRF Nvidia hardware, the preferred back-end is CUDA.
- CPU: in this case, OpenCL is required.
In practice, GPU workers use only a small fraction of a node's CPU resources. Therefore, to make full use of a node's resources, we can spawn:
- 2 GPU workers (if the node has 2 GPUs)
- 1 or 2 CPU workers
The problem
dask.distributed/dask_jobqueue are not designed for such a heterogeneous configuration, at least for now. See the related discussion on dask.distributed. A "pool of heterogeneous workers" might be implemented in the future.
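For illustration, here is a minimal sketch of how a cluster of GPU workers is typically spawned with dask_jobqueue (the partition name, core/memory figures and the `--gres` directive are hypothetical and site-specific). A single `SLURMCluster` carries one worker specification, so all of its workers are identical; there is no built-in way to mix GPU workers and CPU-only workers within it.

```python
# Minimal sketch, assuming dask_jobqueue >= 0.8 (parameter names changed across
# versions). Partition name, core/memory figures and --gres are hypothetical.
from dask.distributed import Client
from dask_jobqueue import SLURMCluster

cluster = SLURMCluster(
    queue="gpu",                              # hypothetical SLURM partition
    cores=4,                                  # CPU cores reserved per worker job
    memory="64GB",
    processes=1,                              # one dask worker per SLURM job
    job_extra_directives=["--gres=gpu:1"],    # request one GPU per worker job
)
cluster.scale(jobs=2)   # e.g. two GPU workers, one per GPU of a node
client = Client(cluster)
# All workers created through this cluster share the same specification.
```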
Possible solutions
There are several options to achieve full use of a compute node:
1. Allocate an entire node, and dispatch the work to "sub-workers".
2. Allocate `N_g` GPU workers and `N_c` CPU workers; let the batch scheduler decide where they will be allocated.
3. Allocate `N_g` GPU workers. Each of these GPU workers can also spawn a sub-worker to use the CPU.
4. Once a `Cluster` object is created using only GPU workers, create new clusters for CPU workers on the same nodes (with SLURM: `-w <node>`).
Discussion
Approach (1) achieves our goal by design. However:
- it is not in the spirit of batch schedulers (resources are not shared anymore)
- the "sub-worker" idea might be cumbersome to implement (ex. one slurm scheduler, connecting to a localCluster scheduler).
Approach (2) runs the risk of using the 2 GPUs of one node and the CPU cores of another node; in the end, no single node might be fully used.
Approach (3) seems to be a good compromise. If the cluster node resources are known in advance (which is likely), then allocating 2 GPU workers and 2 CPU workers will allocate the entire node [1]. The two main drawbacks are:
- it entails using exactly as many CPU workers as GPU workers
- somehow the "sub-worker" idea is back: the client submits a job to a "GPU worker", which in turn dispatches work to its "CPU worker companion".
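As a rough sketch of approach (3) (function names are hypothetical, and the actual Nabu processing steps are not shown), each GPU worker could lazily attach a local executor acting as its "CPU worker companion":

```python
# Hypothetical sketch of approach (3): a task running on a GPU worker delegates
# CPU-only work to a local executor attached to that worker.
from concurrent.futures import ThreadPoolExecutor
from dask.distributed import get_worker

def cpu_only_step(chunk):
    # placeholder for a CPU-only processing step (e.g. an OpenCL back-end,
    # which releases the GIL, so threads are acceptable here)
    return chunk

def gpu_worker_task(chunk):
    worker = get_worker()
    # lazily create the "CPU worker companion" the first time this worker is used
    pool = getattr(worker, "_cpu_pool", None)
    if pool is None:
        pool = ThreadPoolExecutor(max_workers=2)
        worker._cpu_pool = pool
    # ... GPU processing (CUDA back-end) would happen here ...
    # Note: a ProcessPoolExecutor would require the dask worker to run with
    # distributed.worker.daemon set to False, since daemonic processes cannot
    # spawn children.
    return pool.submit(cpu_only_step, chunk).result()
```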
Approach (4) is nice in principle, but it results in a collection of clusters/clients (instead of a single one). How should the work be distributed between them? We would be back to implementing the same thing as dask.distributed.
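A possible sketch of approach (4), under the same caveats (hypothetical node names and resource figures); the CPU-only cluster is pinned to the nodes already holding GPU workers through SLURM's `--nodelist` (`-w`) directive:

```python
# Hypothetical sketch of approach (4): a second, CPU-only cluster pinned to the
# nodes that already host the GPU workers.
from dask_jobqueue import SLURMCluster

gpu_nodes = ["gpu-node-01", "gpu-node-02"]     # hypothetical: obtained from the GPU cluster

cpu_cluster = SLURMCluster(
    queue="gpu",                               # hypothetical partition
    cores=32,                                  # remaining CPU cores of each node
    memory="128GB",
    processes=2,                               # e.g. two CPU workers per job
    job_extra_directives=["--nodelist=" + ",".join(gpu_nodes)],
)
cpu_cluster.scale(jobs=len(gpu_nodes))
# Drawback discussed above: we now have two Cluster/Client pairs, and the
# distribution of work between them has to be managed by hand.
```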
It seems there is no elegant way to achieve our goal with batch schedulers, as the assumption is that all the nodes (in a given partition) have roughly the same resources. For local computations, things are easier because we can register individual workers to the scheduler.
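As an illustration of the local case, the sketch below (with illustrative resource labels) registers heterogeneous workers to a single scheduler and routes tasks to them through dask "resources". It is not a solution for the batch-scheduler case, only a reminder that the scheduler itself handles heterogeneous workers fine.

```python
# Minimal sketch of the local case: heterogeneous workers attached to a single
# scheduler, distinguished through dask "resources" labels (names are illustrative).
import asyncio
from dask.distributed import Client, Scheduler, Worker

async def main():
    async with Scheduler() as scheduler:
        # two "GPU" workers and one "CPU" worker registered to the same scheduler
        async with Worker(scheduler.address, resources={"GPU": 1}), \
                   Worker(scheduler.address, resources={"GPU": 1}), \
                   Worker(scheduler.address, resources={"CPU": 1}):
            async with Client(scheduler.address, asynchronous=True) as client:
                # tasks are routed to the right kind of worker via 'resources'
                gpu_result = await client.submit(lambda x: x + 1, 1, resources={"GPU": 1})
                cpu_result = await client.submit(lambda x: x * 2, 2, resources={"CPU": 1})
                print(gpu_result, cpu_result)

asyncio.run(main())
```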
Approach (2) is clearly the easiest to implement. Approach (3) can be implemented with some effort. See also the recent discussion on a similar issue.
Notes
[1] In SLURM, allocating N/2 cores in one job, then the other N/2 cores in another job, will make the node entirely allocated.