select_gpu is not always called with the gpu_name
Hi again Vincent, I have a machine like this...
In my case I use two of the four cards, which I assume are #0 and #1 (see the quick pycuda check after the nvidia-smi table below).
picca@re-grades-01:~/src/gitlab.synchrotron-soleil.fr/hermes-beamline/ptychohermesscripts$ nvidia-smi
Tue Oct 24 15:46:42 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.125.06   Driver Version: 525.125.06   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:84:00.0 Off |                  N/A |
| 30%   31C    P8    16W / 350W |      8MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla T4            On   | 00000000:85:00.0 Off |                    0 |
| N/A   30C    P8     9W /  70W |      9MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA GeForce ...  On   | 00000000:86:00.0 Off |                  N/A |
| 30%   30C    P8    20W / 350W |      8MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA GeForce ...  On   | 00000000:87:00.0 Off |                  N/A |
| 30%   28C    P8    23W / 350W |      8MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      3059      G   /usr/lib/xorg/Xorg                  4MiB |
|    1   N/A  N/A      3059      G   /usr/lib/xorg/Xorg                  4MiB |
|    2   N/A  N/A      3059      G   /usr/lib/xorg/Xorg                  4MiB |
|    3   N/A  N/A      3059      G   /usr/lib/xorg/Xorg                  4MiB |
+-----------------------------------------------------------------------------+
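As a side note, here is the quick check of how pycuda enumerates these cards (a minimal sketch, not part of my script). Note that with CUDA's default device ordering ("fastest first") the indices may not match the nvidia-smi numbering above unless CUDA_DEVICE_ORDER=PCI_BUS_ID is set:

# Minimal device-enumeration check (illustrative, not part of pynx-ptycho-hermes).
import pycuda.driver as cuda

cuda.init()
for i in range(cuda.Device.count()):
    d = cuda.Device(i)
    print(f"GPU #{i}: {d.name()}, {d.total_memory() // 1024**2} MiB, PCI {d.pci_bus_id()}")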
The output of pynx in MPI mode gives me this:
###############
Ptycho runner: preparing processing unit
Computing speed for available CUDA GPU [ranking by global memory bandwidth]:
Ptycho runner: preparing processing unit
Computing speed for available CUDA GPU [ranking by global memory bandwidth]:
NVIDIA GeForce RTX 3090: 23 Gb, 365 Gbytes/s
NVIDIA GeForce RTX 3090: 23 Gb, 372 Gbytes/s
NVIDIA GeForce RTX 3090: 23 Gb, 364 Gbytes/s
NVIDIA GeForce RTX 3090: 23 Gb, 371 Gbytes/s
NVIDIA GeForce RTX 3090: 23 Gb, 362 Gbytes/s
NVIDIA GeForce RTX 3090: 23 Gb, 368 Gbytes/s
Tesla T4: 14 Gb, 107 Gbytes/s
Tesla T4: 14 Gb, 106 Gbytes/s
select_gpu using MPI: node=re-grades-01 mpi_rank=1, using GPU #1/4 PCI: 0000:85:00.0
select_gpu using MPI: node=re-grades-01 mpi_rank=0, using GPU #0/4 PCI: 0000:84:00.0
Using CUDA GPU: NVIDIA GeForce RTX 3090
Using CUDA GPU=> setting large stack size (613) (override with stack_size=N)
Using CUDA GPU: Tesla T4
Using CUDA GPU=> setting large stack size (613) (override with stack_size=N)
So it does pick #0 and #1.
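The select_gpu lines above look like a plain rank-to-index mapping, which would explain why the ranking by memory bandwidth printed just before is not actually applied. A minimal sketch of that pattern (my guess, not PyNX's actual select_gpu code):

# Guessed selection pattern from the log ("mpi_rank=N, using GPU #N/4"):
# the device index seems to come from the MPI rank alone, ignoring the
# bandwidth ranking printed above. Illustrative only.
from mpi4py import MPI
import pycuda.driver as cuda

cuda.init()
rank = MPI.COMM_WORLD.Get_rank()
n_gpu = cuda.Device.count()
dev = cuda.Device(rank % n_gpu)  # rank 0 -> GPU #0 (3090), rank 1 -> GPU #1 (T4)
print(f"mpi_rank={rank}, using GPU #{rank % n_gpu}/{n_gpu}: {dev.name()}")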
From what I understand of the following lines,
mpi multiscan
MPI # 1 analysing scans: (2,)
###############
Processing nrj number 2
###############
MPI # 0 analysing scans: (1, 3)
###############
Processing nrj number 1
###############
it will process nrj 2 on the second card, i.e. the T4, and nrj 1 and 3 on card #0, i.e. the 3090.
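That split is consistent with a simple round-robin of the scan list over the MPI ranks; a toy reproduction (illustrative only):

# Toy reproduction of the split shown above: a round-robin over the scan
# list gives exactly (1, 3) for rank 0 and (2,) for rank 1.
scans, n_rank = [1, 2, 3], 2
for rank in range(n_rank):
    print(f"MPI # {rank} analysing scans:", tuple(scans[rank::n_rank]))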
I can indeed see the processing happening on the 3090,
but I think this error message comes from the T4:
Please give the number of GPUs to be used (if nrj_points < 12, give nrj_points, otherwise 12): 2
Traceback (most recent call last):
File "/mnt/home-re-grades-02/experiences/instrumentation/picca/src/gitlab.synchrotron-soleil.fr/hermes-beamline/ptychohermesscripts/./pynx-ptycho-hermes", line 44, in <module>
main()
File "/mnt/home-re-grades-02/experiences/instrumentation/picca/src/gitlab.synchrotron-soleil.fr/hermes-beamline/ptychohermesscripts/./pynx-ptycho-hermes", line 30, in main
w.process_scans()
File "/home/experiences/instrumentation/picca/src/gitlab.esrf.fr/picca/PyNX/pynx/ptycho/runner/runner.py", line 3041, in process_scans
self.ws.run(reuse_ptycho=reuse_ptycho)
File "/home/experiences/instrumentation/picca/src/gitlab.esrf.fr/picca/PyNX/pynx/ptycho/runner/runner.py", line 1689, in run
self.p = ScaleObjProbe(verbose=True) * self.p
~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~
File "/home/experiences/instrumentation/picca/src/gitlab.esrf.fr/picca/PyNX/pynx/operator/__init__.py", line 61, in __mul__
self.apply_ops_mul(w)
File "/home/experiences/instrumentation/picca/src/gitlab.esrf.fr/picca/PyNX/pynx/ptycho/cu_operator.py", line 812, in apply_ops_mul
return super(CUOperatorPtycho, self).apply_ops_mul(pty)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/experiences/instrumentation/picca/src/gitlab.esrf.fr/picca/PyNX/pynx/operator/__init__.py", line 177, in apply_ops_mul
o.prepare_data(w)
File "/home/experiences/instrumentation/picca/src/gitlab.esrf.fr/picca/PyNX/pynx/ptycho/cu_operator.py", line 871, in prepare_data
p._cu_psi = cua.empty(shape=(len(p._obj), len(p._probe), self.processing_unit.cu_stack_size, ny, nx),
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3/dist-packages/pycuda/gpuarray.py", line 268, in __init__
self.gpudata = self.allocator(self.size * self.dtype.itemsize)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
pycuda._driver.MemoryError: memory_pool::allocate failed: out of memory - failed to free memory for allocation
invalid command name "140499303860160delayed_destroy"
while executing
"140499303860160delayed_destroy"
("after" script)
invalid command name "140499305030912delayed_destroy"
while executing
"140499305030912delayed_destroy"
("after" script)
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[9394,1],1]
Exit code: 1
--------------------------------------------------------------------------
The memory available on the T4 is not sufficient for this allocation.
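A back-of-the-envelope check of the failing allocation (the p._cu_psi array in the traceback, shape (n_obj, n_probe, stack_size, ny, nx) in complex64); only stack_size=613 is taken from the log, the other values are assumptions for illustration:

import numpy as np

# Rough size of p._cu_psi; complex64 = 8 bytes per element.
n_obj, n_probe, stack_size = 1, 1, 613   # stack_size from the log above
ny = nx = 2048                           # assumed detector frame size
size_gib = (n_obj * n_probe * stack_size * ny * nx
            * np.dtype(np.complex64).itemsize) / 1024**3
print(f"psi stack: {size_gib:.1f} GiB")  # ~19.2 GiB: fits a 3090 (24 GiB), not a T4 (15 GiB)

If that is the cause, a possible workaround until select_gpu honours the gpu_name under MPI would be to hide the T4 with CUDA_VISIBLE_DEVICES so that both ranks land on a 3090, or to reduce the stack size with the stack_size=N override mentioned in the log.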