Add more partitions and update unit test #116

Merged
yuema137 merged 12 commits into master from add_more_partitions on Jun 25, 2024
Conversation

yuema137
Contributor

As @shenyangshi mentioned in #115, there are more possible partitions than we previously listed in utilix.batchq. The actually available ones, bigmem2 and gpu2, are added here (see Yue's response in #115).

In addition, the unit test for batchq has been updated to ensure that all of the partitions listed here are actually supported. I have tested on midway2, midway3, and dali, and the tests all passed.
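
For context, here is a minimal sketch of the kind of check such a test can run. The attribute batchq.PARTITIONS and the test name are illustrative assumptions, not necessarily utilix's actual API:

import pytest
from utilix import batchq

# Partitions added in this PR; the test assumes utilix.batchq exposes
# its list of supported partitions as batchq.PARTITIONS (hypothetical name).
NEW_PARTITIONS = ["bigmem2", "gpu2"]

@pytest.mark.parametrize("partition", NEW_PARTITIONS)
def test_new_partitions_are_supported(partition):
    # Each newly added partition should appear in the supported list.
    assert partition in batchq.PARTITIONS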

@shenyangshi

Thanks Yue, this is really helpful!

The PR generally looks good, but the GPU part can be tricky: in sbatch we need to explicitly request generic resources like GPUs with #SBATCH --gres=gpu:1 and also initialize CUDA with module load cuda in the batch script, like in env_starter; simple_slurm can handle such input as well.

I tried starting a Jupyter notebook with GPU on gpu2 using the env_starter branch: I can start a job with 28 CPUs, but no GPU access is given. Maybe we are no longer allocated midway GPUs? If that's the case, we don't need to implement my GPU gres comment and can merge the PR directly.
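
For reference, a sketch of how both pieces could be expressed with simple_slurm; the partition, time limit, and job name below are illustrative, and gres is passed through like any other sbatch option:

from simple_slurm import Slurm

# Request one GPU explicitly via --gres; values here are illustrative.
slurm = Slurm(
    job_name="gpu_test",
    partition="gpu2",
    gres="gpu:1",
    time="00:10:00",
)
slurm.add_cmd("module load cuda")  # initialize CUDA in the batch script
slurm.sbatch("nvidia-smi")         # the payload should now see the allocated GPU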

@yuema137
Contributor Author

@shenyangshi So if I understand correctly, we haven't successfully used a GPU even on the gpu2 partition?

@shenyangshi

I think we could originally access gpu2 (see the Slack message from Andrii), but now I'm not sure; I haven't successfully used it.

@yuema137
Contributor Author

@shenyangshi I just tried with sbatch directly, and the GPU works. So there is probably something missing in env_starter. I will fix it and also add the GPU header here.

@shenyangshi

Sounds good, thanks

@yuema137
Contributor Author

yuema137 commented Jun 23, 2024

@shenyangshi After diving into this, I realized that it's not trivial to set up the GPUs for utilix and env_starter. The reason is that on fried rice, the NVIDIA driver is much newer, as is the CUDA version (12.4):

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA TITAN V                 Off |   00000000:86:00.0 Off |                  N/A |
| 28%   38C    P8             26W /  250W |    1992MiB /  12288MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA TITAN V                 Off |   00000000:AF:00.0 Off |                  N/A |
| 28%   36C    P8             25W /  250W |    2152MiB /  12288MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

Therefore, the tensorflow version was upgraded to 2.15 to be compatible with the fried rice GPUs. However, on midway the NVIDIA driver is quite outdated (very likely no longer maintained) and doesn't meet the requirements of tensorflow 2.15:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 00000000:09:00.0 Off |                    0 |
| N/A   32C    P8    27W / 149W |      0MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

So, if we really want this resource, we need to either:

  • Ask the RCC people to update the NVIDIA driver and install newer versions of CUDA
  • Or create a special environment for the GPUs on midway

My feeling is that the complexity is quite high while the gain is limited, as the fried rice GPUs are adequate for now. So I think it's fine to keep the gpu2 partition as a CPU-only resource. What do you think?
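
As a quick sanity check (a sketch, not part of the PR), one can ask a given tensorflow build which GPUs it can see; on midway's CUDA 11.2 driver, a tensorflow 2.15 build, which targets CUDA 12.x, is expected to report none:

import tensorflow as tf

# Print the installed version and the GPUs this build can actually use.
# An empty list means the driver/CUDA stack is too old for this build.
print(tf.__version__)
print(tf.config.list_physical_devices("GPU"))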

@shenyangshi

Thanks Yue for the hard work and detailed check! I totally agree we can use it as a CPU-only node now.

yuema137 merged commit 2692edb into master on Jun 25, 2024
1 check passed
yuema137 deleted the add_more_partitions branch on June 25, 2024 02:06