Performance issues on Nvidia GPUs #315
-
Which implementation of the model are you using? Do you have a script that we can try running on our side? I want to check whether some operators are falling back to the CPU for one reason or another. DML supports tensor cores, but they won't be used unless mixed precision is on, and support may not match CUDA's (e.g. it may depend on tensor sizes, layout and batch size). In any case, that wouldn't be what causes the slowdown in your scenario.
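(For reference, mixed precision in Keras is enabled through the global dtype policy, as in the sketch below. This is a generic example rather than code from this thread; the model choice and class count are placeholders.)

```python
# Generic sketch: enabling Keras mixed precision so that float16 matmuls and
# convolutions become eligible for tensor cores.
import tensorflow as tf

# Set the global policy before building the model; layers then compute in
# float16 while keeping float32 variables for numerical stability.
tf.keras.mixed_precision.set_global_policy("mixed_float16")

# Placeholder model/classes; under this policy Keras automatically wraps the
# optimizer in a loss-scale optimizer to guard against float16 underflow.
model = tf.keras.applications.ResNetRS50(weights=None, classes=102)
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```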
-
Sure, here's the script that I use to test:
Link to the Oxford flowers dataset, organized in the structure this script expects: https://drive.google.com/file/d/16o-MNClgTxv4TKs9ux7K21PHJxisYWSS/view?usp=sharing

As for mixed precision, I had to turn it off because when it's on, the ResNetRS50 model gives NaN loss and accuracy. I cannot test with other models that I know work with mixed precision because of the fused batch norm issue. I do see higher CPU usage while training with DML compared to CUDA, so maybe the offloading is happening.

I am running on Windows 10 Pro v19044 with Python 3.9.13. The CPU is an AMD 5800X with 32 GB RAM, and the GPU is an RTX 3080 Ti.
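(The script itself is not reproduced in the thread; below is a minimal sketch of the kind of benchmark described here, ResNetRS50 trained on the flower images from a directory at batch size 64 for 10 epochs. The directory path, image size and optimizer are assumptions.)

```python
# Minimal sketch of the benchmark described above; the actual script was shared
# as an attachment and is not reproduced here. Paths, image size and optimizer
# settings are assumptions.
import time
import tensorflow as tf

BATCH_SIZE = 64
IMG_SIZE = (224, 224)

# Oxford Flowers-102 images organized into one subdirectory per class.
train_ds = tf.keras.utils.image_dataset_from_directory(
    "flowers/train", image_size=IMG_SIZE, batch_size=BATCH_SIZE)
train_ds = train_ds.prefetch(tf.data.AUTOTUNE)

model = tf.keras.applications.ResNetRS50(
    weights=None, input_shape=IMG_SIZE + (3,), classes=102)
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

start = time.perf_counter()
model.fit(train_ds, epochs=10)
print(f"10 epochs took {time.perf_counter() - start:.1f} s")
```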
-
On the NaN issue, I just tested EfficientNetB3 with mixed precision on, and it suffers from the same problem.
Hope this helps with debugging it.
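(Sketch of the repro, assuming the same input pipeline as the earlier script; not the exact code that was run.)

```python
# Sketch of the repro: same pipeline as the benchmark above, but with
# EfficientNetB3 and the mixed_float16 policy enabled before building the model.
import tensorflow as tf

tf.keras.mixed_precision.set_global_policy("mixed_float16")
model = tf.keras.applications.EfficientNetB3(
    weights=None, input_shape=(300, 300, 3), classes=102)
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# Training with model.fit(...) here reportedly produces NaN loss under the
# DirectML plugin at the time of this comment.
```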
-
Hi @PatriceVignola, any updates on these issues?
-
We released version 0.3.0 yesterday, which adds a workaround for the NaN issue, but mixed precision performance won't be any better yet. We now have an idea of why mixed precision is so much slower than on CUDA, and it's partly due to tensor cores not being used. We don't have an ETA for this work yet.
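(Assuming the plugin is installed from PyPI as tensorflow-directml-plugin, a quick sketch for confirming the upgraded build is the one being loaded:)

```python
# Sketch: confirm the installed plugin version and that the DML device is
# visible to TensorFlow after upgrading to 0.3.0.
from importlib.metadata import version

import tensorflow as tf

print(version("tensorflow-directml-plugin"))   # expect "0.3.0" or newer
print(tf.config.list_physical_devices("GPU"))  # the DML device should be listed
```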
-
I have done a simple benchmark of ResNetRS50 on an RTX 3080 Ti, comparing the DirectML plugin 0.1.1.dev221004 against CUDA 11.8 + cuDNN 8.6.0, and found that DML is much slower than CUDA and only uses about 50% of the GPU while training, whereas CUDA constantly sits at 100%. Both tests were run with mixed precision off and a batch size of 64.
Training 10 epochs took 416 seconds on DML versus only 164 seconds on CUDA, both on TF 2.10 (the CPU build for DML) and Python 3.9.13.
This raises the big performance question: is DML optimized for Nvidia GPUs at all, in particular their Tensor Cores and the TensorFloat-32 datatype? And what could cause it to not use 100% of my GPU? I have tried increasing the batch size, but it just OOMs, so 64 is definitely a large enough batch size to fully use the GPU (as shown by the 100% usage on CUDA).
Or is this perhaps something that will be optimized in the future, just not yet?
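(A sketch of how the per-epoch and total training times can be measured with a Keras callback; this is an assumption about methodology, not the original benchmark code.)

```python
# Sketch: time each epoch with a Keras callback so per-epoch and total
# wall-clock time can be compared between the DirectML and CUDA builds.
import time
import tensorflow as tf

class EpochTimer(tf.keras.callbacks.Callback):
    def on_train_begin(self, logs=None):
        self.t0 = time.perf_counter()

    def on_epoch_begin(self, epoch, logs=None):
        self.e0 = time.perf_counter()

    def on_epoch_end(self, epoch, logs=None):
        print(f"epoch {epoch}: {time.perf_counter() - self.e0:.1f} s")

    def on_train_end(self, logs=None):
        print(f"total: {time.perf_counter() - self.t0:.1f} s")

# Usage with the training setup from earlier in the thread:
# model.fit(train_ds, epochs=10, callbacks=[EpochTimer()])
```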