
LocalNormalizedCrossCorrelationLoss and TF32 numerical stability #6525

Closed
wyli opened this issue May 17, 2023 · 5 comments

Comments

@wyli
Contributor

wyli commented May 17, 2023

Describe the bug
Follow-up of Project-MONAI/tutorials#1336: depending on the cuDNN version and GPU mode, LocalNormalizedCrossCorrelationLoss may not be numerically stable when run with low-precision operations.

The current workaround is to set torch.backends.cuda.matmul.allow_tf32 = False and torch.backends.cudnn.allow_tf32 = False. It would be great to improve the stability in general.
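For reference, the workaround amounts to setting both flags before running the loss:

```python
import torch

# Workaround: force full float32 precision for matmul and cuDNN convolutions
# (trades some speed on Ampere+ GPUs for numerically stable results).
torch.backends.cuda.matmul.allow_tf32 = False
torch.backends.cudnn.allow_tf32 = False
```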

@qingpeng9802
Contributor

After some investigation, I believe this is probably not a numerical stability issue. See https://dev-discuss.pytorch.org/t/pytorch-and-tensorfloat32/504 for more similar issues. The problem is that the NGC container changes the default torch.backends.cuda.matmul.allow_tf32 = False to True somewhere (since TF32 is a trick for significantly accelerating training).
Therefore, the issue only occurs in the NGC container and not in the official PyTorch Docker image.

Maybe we should mention the TF32 issue to users somewhere in the documentation?

@wyli
Contributor Author

wyli commented Jul 19, 2023

Thanks. Before we have a workaround, perhaps add a warning message in the constructor when the flag is True? For example, near this line:

self.ndim = spatial_dims
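A rough sketch of what that could look like in a stripped-down constructor (a hypothetical skeleton, not MONAI's actual class; the warning message is illustrative):

```python
import warnings

import torch
from torch import nn


class LocalNormalizedCrossCorrelationLossSketch(nn.Module):
    """Hypothetical skeleton showing where a constructor-time TF32 warning could go."""

    def __init__(self, spatial_dims: int = 3) -> None:
        super().__init__()
        self.ndim = spatial_dims  # the existing assignment referenced above
        # Warn at construction time if TF32 matmul is globally enabled.
        if torch.backends.cuda.matmul.allow_tf32:
            warnings.warn(
                "torch.backends.cuda.matmul.allow_tf32 is True; this loss may be "
                "numerically unstable under TF32. Consider setting "
                "torch.backends.cuda.matmul.allow_tf32 = False and "
                "torch.backends.cudnn.allow_tf32 = False.",
                UserWarning,
            )
```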

@qingpeng9802
Contributor

qingpeng9802 commented Jul 20, 2023

It is okay to add a warning for the loss. However, my larger concern is that other operations in monai will also be affected by the TF32 issue (since all operations that use cuda.matmul are affected). This may lead to significant reproducibility issues.

My proposal:

1. Add something like https://github.com/Lightning-AI/lightning/pull/16037/files#diff-909e246d6c36514f952ae5023bd9fbcc3e8f2c6a0837ebf81d7dc96790b5f938R190-R210 to the related classes/functions in monai, so that monai prints warnings when the flag is True (a rough sketch follows below). I am not sure when it is best to print the warnings, maybe during import? Ideally the warnings could be suppressed when the flag is explicitly set by the user, but that seems technically challenging.
2. Add a section to the documentation to educate users on how to use TF32 properly.
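A minimal sketch of what such a module-level check might look like, loosely modelled on the linked Lightning change (the function name, the one-shot flag, and the message are assumptions, not existing monai or Lightning API):

```python
import warnings

import torch

_TF32_WARNED = False  # print the message at most once per process


def warn_tf32_once() -> None:
    """Hypothetical import-time check: warn once if TF32 is globally enabled."""
    global _TF32_WARNED
    if _TF32_WARNED:
        return
    _TF32_WARNED = True
    if torch.backends.cuda.matmul.allow_tf32 or torch.backends.cudnn.allow_tf32:
        warnings.warn(
            "TF32 is enabled; matmul/convolution results may differ from strict "
            "float32 and affect reproducibility across environments.",
            UserWarning,
        )
```

Users who enable TF32 deliberately could silence the message with `warnings.filterwarnings("ignore", message="TF32 is enabled")`, although, as noted above, distinguishing a user-set flag from a container default is the hard part.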

@qingpeng9802
Contributor

qingpeng9802 commented Jul 21, 2023

Checking https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch/layers (tag 23.06-py3), layer 08aa16a90c shows ENV TORCH_ALLOW_TF32_CUBLAS_OVERRIDE=1, which is what modifies the flag in the NGC container. Therefore, the scope of the warning can be narrowed down to users who have this environment variable set, which is more feasible.
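Following that observation, the check could be narrowed to environments where the container has overridden the default. A sketch, assuming we only look for this environment variable (the function name and message are illustrative):

```python
import os
import warnings

import torch


def warn_if_tf32_overridden_by_env() -> None:
    """Hypothetical check aimed at containers exporting TORCH_ALLOW_TF32_CUBLAS_OVERRIDE=1."""
    env_override = os.environ.get("TORCH_ALLOW_TF32_CUBLAS_OVERRIDE") == "1"
    if env_override and torch.backends.cuda.matmul.allow_tf32:
        warnings.warn(
            "TORCH_ALLOW_TF32_CUBLAS_OVERRIDE=1 detected (e.g. in NGC PyTorch "
            "containers): TF32 matmul is enabled by the environment, which may "
            "affect numerical stability and reproducibility. Set "
            "torch.backends.cuda.matmul.allow_tf32 = False for full float32.",
            UserWarning,
        )
```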

@wyli
Contributor Author

wyli commented Jul 21, 2023

Thanks @qingpeng9802, I think that's a very good point, and I've created a separate ticket to track this feature: #6754.

@vikashg vikashg closed this as completed Dec 20, 2023