Add more partitions and update unit test #116

Merged
yuema137 merged 12 commits into master from add_more_partitions on Jun 25, 2024
Conversation

yuema137
Contributor

As @shenyangshi mentioned in #115, there are more possible partitions than we previously listed in utilix.batchq. The actually available ones, bigmem2 and gpu2, are added here (see Yue's response in #115).

In addition, the unit test for batchq has been updated to ensure that all of the partitions listed here are actually supported. I have tested on midway2, midway3, and dali, and the tests all passed.
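
For context, here is a minimal sketch of the kind of check such a test can run. The attribute batchq.PARTITIONS and the test name are illustrative assumptions, not necessarily utilix's actual API:

import pytest
from utilix import batchq

# Partitions added in this PR; the test assumes utilix.batchq exposes
# its list of supported partitions as batchq.PARTITIONS (hypothetical name).
NEW_PARTITIONS = ["bigmem2", "gpu2"]

@pytest.mark.parametrize("partition", NEW_PARTITIONS)
def test_new_partitions_are_supported(partition):
    # Each newly added partition should appear in the supported list.
    assert partition in batchq.PARTITIONS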

@shenyangshi

Thanks Yue, this is really helpful!

The PR generally looks good, but the GPU part can be tricky: in sbatch we need to explicitly request generic resources like GPUs with #SBATCH --gres=gpu:1 and also initialize CUDA with module load cuda in the batch script, like in env_starter; simple_slurm can handle such input as well.

I tried starting a Jupyter notebook with GPU on gpu2 using the env_starter branch: I can start a job with 28 CPUs, but no GPU access is given. Maybe we are no longer allocated midway GPUs? If that's the case, we don't need to implement my GPU gres comment and can merge the PR directly.
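
For reference, a sketch of how both pieces could be expressed with simple_slurm; the partition, time limit, and job name below are illustrative, and gres is passed through like any other sbatch option:

from simple_slurm import Slurm

# Request one GPU explicitly via --gres; values here are illustrative.
slurm = Slurm(
    job_name="gpu_test",
    partition="gpu2",
    gres="gpu:1",
    time="00:10:00",
)
slurm.add_cmd("module load cuda")  # initialize CUDA in the batch script
slurm.sbatch("nvidia-smi")         # the payload should now see the allocated GPU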

@yuema137
Contributor Author

@shenyangshi So if I understand correctly, we haven't successfully used a GPU even on the gpu2 partition?

@shenyangshi

I think we could originally access gpu2 (see the Slack message from Andrii), but now I'm not sure; I haven't successfully used it.

@yuema137
Contributor Author

@shenyangshi I just tried with sbatch directly, and the GPU works. So there is probably something missing in env_starter. I will fix it and also add the GPU header here.

@shenyangshi

Sounds good, thanks

@yuema137
Contributor Author

yuema137 commented Jun 23, 2024

@shenyangshi After diving into this, I realized that it's not trivial to set up the GPUs for utilix and env_starter. The reason is that on fried rice, the NVIDIA driver is much newer, as is the CUDA version (12.4):

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA TITAN V                 Off |   00000000:86:00.0 Off |                  N/A |
| 28%   38C    P8             26W /  250W |    1992MiB /  12288MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA TITAN V                 Off |   00000000:AF:00.0 Off |                  N/A |
| 28%   36C    P8             25W /  250W |    2152MiB /  12288MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

Therefore, the tensorflow version was upgraded to 2.15 to be compatible with the fried rice GPUs. However, on midway the NVIDIA driver is quite outdated (very likely no longer maintained) and doesn't meet the requirements of tensorflow 2.15:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 00000000:09:00.0 Off |                    0 |
| N/A   32C    P8    27W / 149W |      0MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

So, if we really want this resource, we need to either:

  • Ask the RCC people to update the NVIDIA driver and install newer versions of CUDA
  • Or create a special environment for the GPUs on midway

My feeling is that the complexity is quite high while the gain is limited, as the fried rice GPUs are adequate for now. So I think it's fine to keep the gpu2 partition as a CPU-only resource. What do you think?
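
As a quick sanity check (a sketch, not part of the PR), one can ask a given tensorflow build which GPUs it can see; on midway's CUDA 11.2 driver, a tensorflow 2.15 build, which targets CUDA 12.x, is expected to report none:

import tensorflow as tf

# Print the installed version and the GPUs this build can actually use.
# An empty list means the driver/CUDA stack is too old for this build.
print(tf.__version__)
print(tf.config.list_physical_devices("GPU"))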

@shenyangshi

Thanks Yue for the hard work and detailed check! I totally agree we can use it as a CPU-only node now.

yuema137 merged commit 2692edb into master on Jun 25, 2024
1 check passed
yuema137 deleted the add_more_partitions branch on June 25, 2024 02:06