Performance issues on Nvidia GPUs #315
-
Which implementation of the model are you using? Do you have a script that we can try running on our side? I want to check whether some operators are falling back to the CPU for one reason or another. DML supports tensor cores, but they won't be used unless mixed precision is on, and support may not match CUDA's (e.g. it may depend on tensor sizes, layout and batch size). In any case, that wouldn't be what causes the slowdown in your scenario.
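(For reference, mixed precision in Keras is enabled through the global dtype policy, as in the sketch below. This is a generic example rather than code from this thread; the model choice and class count are placeholders.)

```python
# Generic sketch: enabling Keras mixed precision so that float16 matmuls and
# convolutions become eligible for tensor cores.
import tensorflow as tf

# Set the global policy before building the model; layers then compute in
# float16 while keeping float32 variables for numerical stability.
tf.keras.mixed_precision.set_global_policy("mixed_float16")

# Placeholder model/classes; under this policy Keras automatically wraps the
# optimizer in a loss-scale optimizer to guard against float16 underflow.
model = tf.keras.applications.ResNetRS50(weights=None, classes=102)
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```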
-
Sure, here's the script that I use to test:
Link to the Oxford flowers dataset, organized in the structure this script expects: https://drive.google.com/file/d/16o-MNClgTxv4TKs9ux7K21PHJxisYWSS/view?usp=sharing

As for mixed precision, I had to turn it off because when it's on, the ResNetRS50 model gives NaN loss and accuracy. I cannot test with other models that I know work with mixed precision because of the fused batch norm issue. I do see higher CPU usage while training with DML compared to CUDA, so maybe the offloading is happening.

I am running on Windows 10 Pro v19044 with Python 3.9.13. The CPU is an AMD 5800X with 32 GB RAM, and the GPU is an RTX 3080 Ti.
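(The script itself is not reproduced in the thread; below is a minimal sketch of the kind of benchmark described here, ResNetRS50 trained on the flower images from a directory at batch size 64 for 10 epochs. The directory path, image size and optimizer are assumptions.)

```python
# Minimal sketch of the benchmark described above; the actual script was shared
# as an attachment and is not reproduced here. Paths, image size and optimizer
# settings are assumptions.
import time
import tensorflow as tf

BATCH_SIZE = 64
IMG_SIZE = (224, 224)

# Oxford Flowers-102 images organized into one subdirectory per class.
train_ds = tf.keras.utils.image_dataset_from_directory(
    "flowers/train", image_size=IMG_SIZE, batch_size=BATCH_SIZE)
train_ds = train_ds.prefetch(tf.data.AUTOTUNE)

model = tf.keras.applications.ResNetRS50(
    weights=None, input_shape=IMG_SIZE + (3,), classes=102)
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

start = time.perf_counter()
model.fit(train_ds, epochs=10)
print(f"10 epochs took {time.perf_counter() - start:.1f} s")
```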
-
On the NaN issue, I just tested EfficientNetB3 with mixed precision on, and it suffers from the same problem.
Hope this helps with debugging it.
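(Sketch of the repro, assuming the same input pipeline as the earlier script; not the exact code that was run.)

```python
# Sketch of the repro: same pipeline as the benchmark above, but with
# EfficientNetB3 and the mixed_float16 policy enabled before building the model.
import tensorflow as tf

tf.keras.mixed_precision.set_global_policy("mixed_float16")
model = tf.keras.applications.EfficientNetB3(
    weights=None, input_shape=(300, 300, 3), classes=102)
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# Training with model.fit(...) here reportedly produces NaN loss under the
# DirectML plugin at the time of this comment.
```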
-
Hi @PatriceVignola, any updates on these issues?
-
We released version 0.3.0 yesterday, which adds a workaround for the NaN issue, but mixed precision performance won't be any better yet. We now have an idea of why mixed precision is so much slower than on CUDA, and it's partly due to tensor cores not being used. We don't have an ETA for this work yet.
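(Assuming the plugin is installed from PyPI as tensorflow-directml-plugin, a quick sketch for confirming the upgraded build is the one being loaded:)

```python
# Sketch: confirm the installed plugin version and that the DML device is
# visible to TensorFlow after upgrading to 0.3.0.
from importlib.metadata import version

import tensorflow as tf

print(version("tensorflow-directml-plugin"))   # expect "0.3.0" or newer
print(tf.config.list_physical_devices("GPU"))  # the DML device should be listed
```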
-
I have done a simple benchmark of ResNetRS50 on an RTX 3080 Ti, comparing the DirectML plugin 0.1.1.dev221004 against CUDA 11.8 + cuDNN 8.6.0, and found that DML is much slower than CUDA and only uses about 50% of the GPU while training, whereas CUDA constantly sits at 100%. Both tests were run with mixed precision off and a batch size of 64.
Training 10 epochs took 416 seconds on DML versus only 164 seconds on CUDA, both on TF 2.10 (the CPU build for DML) and Python 3.9.13.
This raises the big performance question: is DML optimized for Nvidia GPUs at all, in particular their Tensor Cores and the TensorFloat-32 datatype? And what could cause it to not use 100% of my GPU? I have tried increasing the batch size, but it just OOMs, so 64 is definitely a large enough batch size to fully use the GPU (as shown by the 100% usage on CUDA).
Or is this perhaps something that will be optimized in the future, just not yet?
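(A sketch of how the per-epoch and total training times can be measured with a Keras callback; this is an assumption about methodology, not the original benchmark code.)

```python
# Sketch: time each epoch with a Keras callback so per-epoch and total
# wall-clock time can be compared between the DirectML and CUDA builds.
import time
import tensorflow as tf

class EpochTimer(tf.keras.callbacks.Callback):
    def on_train_begin(self, logs=None):
        self.t0 = time.perf_counter()

    def on_epoch_begin(self, epoch, logs=None):
        self.e0 = time.perf_counter()

    def on_epoch_end(self, epoch, logs=None):
        print(f"epoch {epoch}: {time.perf_counter() - self.e0:.1f} s")

    def on_train_end(self, logs=None):
        print(f"total: {time.perf_counter() - self.t0:.1f} s")

# Usage with the training setup from earlier in the thread:
# model.fit(train_ds, epochs=10, callbacks=[EpochTimer()])
```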