
LocalNormalizedCrossCorrelationLoss and TF32 numerical stability #6525

Closed
wyli opened this issue May 17, 2023 · 5 comments

Comments

@wyli
Contributor

wyli commented May 17, 2023

Describe the bug
Follow-up of Project-MONAI/tutorials#1336: depending on the cuDNN version and GPU mode, LocalNormalizedCrossCorrelationLoss may not be numerically stable when run with low-precision operations.

The current workaround is to set torch.backends.cuda.matmul.allow_tf32 = False and torch.backends.cudnn.allow_tf32 = False. It would be great to improve the stability in general.
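For reference, the workaround amounts to setting both flags before running the loss:

```python
import torch

# Workaround: force full float32 precision for matmul and cuDNN convolutions
# (trades some speed on Ampere+ GPUs for numerically stable results).
torch.backends.cuda.matmul.allow_tf32 = False
torch.backends.cudnn.allow_tf32 = False
```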

@qingpeng9802
Contributor

After some investigation, I believe this is probably not a numerical stability issue. See https://dev-discuss.pytorch.org/t/pytorch-and-tensorfloat32/504 for more similar issues. The problem is that the NGC container changes the default torch.backends.cuda.matmul.allow_tf32 = False to True somewhere (since TF32 is a trick for significantly accelerating training).
Therefore, the issue only occurs in the NGC container and not in the official PyTorch Docker image.

Maybe we should mention the TF32 issue to users somewhere in the documentation?

@wyli
Contributor Author

wyli commented Jul 19, 2023

Thanks. Before we have a workaround, perhaps add a warning message in the constructor when the flag is True? For example, near this line:

self.ndim = spatial_dims
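A rough sketch of what that could look like in a stripped-down constructor (a hypothetical skeleton, not MONAI's actual class; the warning message is illustrative):

```python
import warnings

import torch
from torch import nn


class LocalNormalizedCrossCorrelationLossSketch(nn.Module):
    """Hypothetical skeleton showing where a constructor-time TF32 warning could go."""

    def __init__(self, spatial_dims: int = 3) -> None:
        super().__init__()
        self.ndim = spatial_dims  # the existing assignment referenced above
        # Warn at construction time if TF32 matmul is globally enabled.
        if torch.backends.cuda.matmul.allow_tf32:
            warnings.warn(
                "torch.backends.cuda.matmul.allow_tf32 is True; this loss may be "
                "numerically unstable under TF32. Consider setting "
                "torch.backends.cuda.matmul.allow_tf32 = False and "
                "torch.backends.cudnn.allow_tf32 = False.",
                UserWarning,
            )
```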

@qingpeng9802
Contributor

qingpeng9802 commented Jul 20, 2023

It is okay to add a warning for the loss. However, my larger concern is that other operations in monai will also be affected by the TF32 issue (since all operations that use cuda.matmul are affected). This may lead to significant reproducibility issues.

My proposal:

1. Add something like https://github.com/Lightning-AI/lightning/pull/16037/files#diff-909e246d6c36514f952ae5023bd9fbcc3e8f2c6a0837ebf81d7dc96790b5f938R190-R210 to the related classes/functions in monai, so that monai prints warnings when the flag is True (a rough sketch follows below). I am not sure when it is best to print the warnings, maybe during import? Ideally the warnings could be suppressed when the flag is explicitly set by the user, but that seems technically challenging.
2. Add a section to the documentation to educate users on how to use TF32 properly.
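A minimal sketch of what such a module-level check might look like, loosely modelled on the linked Lightning change (the function name, the one-shot flag, and the message are assumptions, not existing monai or Lightning API):

```python
import warnings

import torch

_TF32_WARNED = False  # print the message at most once per process


def warn_tf32_once() -> None:
    """Hypothetical import-time check: warn once if TF32 is globally enabled."""
    global _TF32_WARNED
    if _TF32_WARNED:
        return
    _TF32_WARNED = True
    if torch.backends.cuda.matmul.allow_tf32 or torch.backends.cudnn.allow_tf32:
        warnings.warn(
            "TF32 is enabled; matmul/convolution results may differ from strict "
            "float32 and affect reproducibility across environments.",
            UserWarning,
        )
```

Users who enable TF32 deliberately could silence the message with `warnings.filterwarnings("ignore", message="TF32 is enabled")`, although, as noted above, distinguishing a user-set flag from a container default is the hard part.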

@qingpeng9802
Contributor

qingpeng9802 commented Jul 21, 2023

Checking https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch/layers (tag 23.06-py3), layer 08aa16a90c shows ENV TORCH_ALLOW_TF32_CUBLAS_OVERRIDE=1, which is what modifies the flag in the NGC container. Therefore, the scope of the warning can be narrowed down to users who have this environment variable set, which is more feasible.
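Following that observation, the check could be narrowed to environments where the container has overridden the default. A sketch, assuming we only look for this environment variable (the function name and message are illustrative):

```python
import os
import warnings

import torch


def warn_if_tf32_overridden_by_env() -> None:
    """Hypothetical check aimed at containers exporting TORCH_ALLOW_TF32_CUBLAS_OVERRIDE=1."""
    env_override = os.environ.get("TORCH_ALLOW_TF32_CUBLAS_OVERRIDE") == "1"
    if env_override and torch.backends.cuda.matmul.allow_tf32:
        warnings.warn(
            "TORCH_ALLOW_TF32_CUBLAS_OVERRIDE=1 detected (e.g. in NGC PyTorch "
            "containers): TF32 matmul is enabled by the environment, which may "
            "affect numerical stability and reproducibility. Set "
            "torch.backends.cuda.matmul.allow_tf32 = False for full float32.",
            UserWarning,
        )
```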

@wyli
Contributor Author

wyli commented Jul 21, 2023

Thanks @qingpeng9802, I think that's a very good point, and I've created a separate ticket to track this feature: #6754.

@vikashg vikashg closed this as completed Dec 20, 2023