Workers: CPU usage
Introduction
Nabu processing can be done using the GPU and/or the CPU.
- GPU: for ESRF Nvidia hardware, the preferred back-end is CUDA.
- CPU: in this case, OpenCL is required.
In practice, GPU workers use only a small fraction of a node's CPU resources. Therefore, to make full use of a node's resources, we can spawn:
- 2 GPU workers (if the node has 2 GPUs)
- 1 or 2 CPU workers
The problem
dask.distributed/dask_jobqueue are not designed for such a heterogeneous configuration, at least for now. See the related discussion on dask.distributed. A "pool of heterogeneous workers" might be implemented in the future.
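For illustration, here is a minimal sketch of how a cluster of GPU workers is typically spawned with dask_jobqueue (the partition name, core/memory figures and the `--gres` directive are hypothetical and site-specific). A single `SLURMCluster` carries one worker specification, so all of its workers are identical; there is no built-in way to mix GPU workers and CPU-only workers within it.

```python
# Minimal sketch, assuming dask_jobqueue >= 0.8 (parameter names changed across
# versions). Partition name, core/memory figures and --gres are hypothetical.
from dask.distributed import Client
from dask_jobqueue import SLURMCluster

cluster = SLURMCluster(
    queue="gpu",                              # hypothetical SLURM partition
    cores=4,                                  # CPU cores reserved per worker job
    memory="64GB",
    processes=1,                              # one dask worker per SLURM job
    job_extra_directives=["--gres=gpu:1"],    # request one GPU per worker job
)
cluster.scale(jobs=2)   # e.g. two GPU workers, one per GPU of a node
client = Client(cluster)
# All workers created through this cluster share the same specification.
```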
Possible solutions
There are several options to achieve full use of a compute node:
1. Allocate an entire node, and dispatch the work to "sub-workers".
2. Allocate `N_g` GPU workers and `N_c` CPU workers; let the batch scheduler decide where they will be allocated.
3. Allocate `N_g` GPU workers. Each of these GPU workers can also spawn a sub-worker to use the CPU.
4. Once a `Cluster` object is created using only GPU workers, create new clusters for CPU workers on the same nodes (with SLURM: `-w <node>`).
Discussion
Approach (1) achieves our goal by design. However:
- it is not in the spirit of batch schedulers (resources are not shared anymore)
- the "sub-worker" idea might be cumbersome to implement (ex. one slurm scheduler, connecting to a localCluster scheduler).
Approach (2) runs the risk of using the 2 GPUs of one node and the CPU cores of another node; in the end, no single node might be fully used.
Approach (3) seems to be a good compromise. If the cluster node resources are known in advance (which is likely), then allocating 2 GPU workers and 2 CPU workers will allocate the entire node [1]. The two main drawbacks are:
- it entails using exactly as many CPU workers as GPU workers
- somehow the "sub-worker" idea is back: the client submits a job to a "GPU worker", which in turn dispatches work to its "CPU worker companion".
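As a rough sketch of approach (3) (function names are hypothetical, and the actual Nabu processing steps are not shown), each GPU worker could lazily attach a local executor acting as its "CPU worker companion":

```python
# Hypothetical sketch of approach (3): a task running on a GPU worker delegates
# CPU-only work to a local executor attached to that worker.
from concurrent.futures import ThreadPoolExecutor
from dask.distributed import get_worker

def cpu_only_step(chunk):
    # placeholder for a CPU-only processing step (e.g. an OpenCL back-end,
    # which releases the GIL, so threads are acceptable here)
    return chunk

def gpu_worker_task(chunk):
    worker = get_worker()
    # lazily create the "CPU worker companion" the first time this worker is used
    pool = getattr(worker, "_cpu_pool", None)
    if pool is None:
        pool = ThreadPoolExecutor(max_workers=2)
        worker._cpu_pool = pool
    # ... GPU processing (CUDA back-end) would happen here ...
    # Note: a ProcessPoolExecutor would require the dask worker to run with
    # distributed.worker.daemon set to False, since daemonic processes cannot
    # spawn children.
    return pool.submit(cpu_only_step, chunk).result()
```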
Approach (4) is nice in principle, but it results in a collection of clusters/clients (instead of a single one). How should the work be distributed between them? We would be back to implementing the same thing as dask.distributed.
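A possible sketch of approach (4), under the same caveats (hypothetical node names and resource figures); the CPU-only cluster is pinned to the nodes already holding GPU workers through SLURM's `--nodelist` (`-w`) directive:

```python
# Hypothetical sketch of approach (4): a second, CPU-only cluster pinned to the
# nodes that already host the GPU workers.
from dask_jobqueue import SLURMCluster

gpu_nodes = ["gpu-node-01", "gpu-node-02"]     # hypothetical: obtained from the GPU cluster

cpu_cluster = SLURMCluster(
    queue="gpu",                               # hypothetical partition
    cores=32,                                  # remaining CPU cores of each node
    memory="128GB",
    processes=2,                               # e.g. two CPU workers per job
    job_extra_directives=["--nodelist=" + ",".join(gpu_nodes)],
)
cpu_cluster.scale(jobs=len(gpu_nodes))
# Drawback discussed above: we now have two Cluster/Client pairs, and the
# distribution of work between them has to be managed by hand.
```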
It seems there is no elegant way to achieve our goal with batch schedulers, as the assumption is that all the nodes (in a given partition) have roughly the same resources. For local computations, things are easier because we can register individual workers to the scheduler.
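As an illustration of the local case, the sketch below (with illustrative resource labels) registers heterogeneous workers to a single scheduler and routes tasks to them through dask "resources". It is not a solution for the batch-scheduler case, only a reminder that the scheduler itself handles heterogeneous workers fine.

```python
# Minimal sketch of the local case: heterogeneous workers attached to a single
# scheduler, distinguished through dask "resources" labels (names are illustrative).
import asyncio
from dask.distributed import Client, Scheduler, Worker

async def main():
    async with Scheduler() as scheduler:
        # two "GPU" workers and one "CPU" worker registered to the same scheduler
        async with Worker(scheduler.address, resources={"GPU": 1}), \
                   Worker(scheduler.address, resources={"GPU": 1}), \
                   Worker(scheduler.address, resources={"CPU": 1}):
            async with Client(scheduler.address, asynchronous=True) as client:
                # tasks are routed to the right kind of worker via 'resources'
                gpu_result = await client.submit(lambda x: x + 1, 1, resources={"GPU": 1})
                cpu_result = await client.submit(lambda x: x * 2, 2, resources={"CPU": 1})
                print(gpu_result, cpu_result)

asyncio.run(main())
```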
Approach (2) is clearly the easiest to implement. Approach (3) can be implemented with some effort. See also the recent discussion on a similar issue.
Notes
[1] In SLURM, allocating N/2 cores in one job, then the other N/2 cores in another job, will make the node entirely allocated.