Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[NVPTX] Only run LowerUnreachable when necessary #109868

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

justinfargnoli
Copy link
Contributor

NVPTX: Lower unreachable to exit to allow ptxas to accurately reconstruct the CFG. added the LowerUnreachable pass to NVPTX. Based on the PR description:

Finally, although I expect this to fix most of
https://bugs.llvm.org/show_bug.cgi?id=27738, I do still encounter
miscompilations with Julia's unreachable-heavy code when targeting these
older GPUs using an older ptxas version (specifically, from CUDA 11.4 or
below). This is likely due to related bugs in ptxas which have been fixed
since, as I have filed several reproducers with NVIDIA over the past couple of
years. I'm not inclined to look into fixing those issues over here, and will
instead be recommending our users to upgrade CUDA to 11.5+ when using these GPUs.

this pass is only necessary when targeting Pascal or earlier via ptxas from CUDA 11.4 or earlier. This PR updates NVPTXTargetMachine.cpp to reflect that.

CC @maleadt

@llvmbot
Copy link
Collaborator

llvmbot commented Sep 24, 2024

@llvm/pr-subscribers-backend-nvptx

Author: Justin Fargnoli (justinfargnoli)

Changes

NVPTX: Lower unreachable to exit to allow ptxas to accurately reconstruct the CFG. added the LowerUnreachable pass to NVPTX. Based on the PR description:

> Finally, although I expect this to fix most of
> https://bugs.llvm.org/show_bug.cgi?id=27738, I do still encounter
> miscompilations with Julia's unreachable-heavy code when targeting these
> older GPUs using an older ptxas version (specifically, from CUDA 11.4 or
> below). This is likely due to related bugs in ptxas which have been fixed
> since, as I have filed several reproducers with NVIDIA over the past couple of
> years. I'm not inclined to look into fixing those issues over here, and will
> instead be recommending our users to upgrade CUDA to 11.5+ when using these GPUs.

this pass is only necessary when targeting Pascal or earlier via ptxas from CUDA 11.4 or earlier. This PR updates NVPTXTargetMachine.cpp to reflect that.

CC @maleadt


Full diff: https://github.com/llvm/llvm-project/pull/109868.diff

3 Files Affected:

  • (modified) llvm/lib/Target/NVPTX/NVPTXSubtarget.h (+3)
  • (modified) llvm/lib/Target/NVPTX/NVPTXTargetMachine.cpp (+7-3)
  • (modified) llvm/test/CodeGen/NVPTX/unreachable.ll (+9-6)
diff --git a/llvm/lib/Target/NVPTX/NVPTXSubtarget.h b/llvm/lib/Target/NVPTX/NVPTXSubtarget.h
index 8b9059bd60cbd4..e2ce088cacdf53 100644
--- a/llvm/lib/Target/NVPTX/NVPTXSubtarget.h
+++ b/llvm/lib/Target/NVPTX/NVPTXSubtarget.h
@@ -95,6 +95,9 @@ class NVPTXSubtarget : public NVPTXGenSubtargetInfo {
   bool hasDotInstructions() const {
     return SmVersion >= 61 && PTXVersion >= 50;
   }
+  bool hasPTXASUnreachableBug() const {
+    return SmVersion < 70 && PTXVersion <= 74;
+  }
   bool hasCvtaParam() const { return SmVersion >= 70 && PTXVersion >= 77; }
   unsigned int getFullSmVersion() const { return FullSmVersion; }
   unsigned int getSmVersion() const { return getFullSmVersion() / 10; }
diff --git a/llvm/lib/Target/NVPTX/NVPTXTargetMachine.cpp b/llvm/lib/Target/NVPTX/NVPTXTargetMachine.cpp
index 57b7fa783c14a7..b79b4ff93efe49 100644
--- a/llvm/lib/Target/NVPTX/NVPTXTargetMachine.cpp
+++ b/llvm/lib/Target/NVPTX/NVPTXTargetMachine.cpp
@@ -368,9 +368,13 @@ void NVPTXPassConfig::addIRPasses() {
     addPass(createSROAPass());
   }
 
-  const auto &Options = getNVPTXTargetMachine().Options;
-  addPass(createNVPTXLowerUnreachablePass(Options.TrapUnreachable,
-                                          Options.NoTrapAfterNoreturn));
+  if (ST.hasPTXASUnreachableBug()) {
+    // Run LowerUnreachable to WAR a ptxas bug. See the commit description of
+    // 1ee4d880e8760256c606fe55b7af85a4f70d006d for more details.
+    const auto &Options = getNVPTXTargetMachine().Options;
+    addPass(createNVPTXLowerUnreachablePass(Options.TrapUnreachable,
+                                            Options.NoTrapAfterNoreturn));
+  }
 }
 
 bool NVPTXPassConfig::addInstSelector() {
diff --git a/llvm/test/CodeGen/NVPTX/unreachable.ll b/llvm/test/CodeGen/NVPTX/unreachable.ll
index 011497c4e23401..4219f0f3b47fc9 100644
--- a/llvm/test/CodeGen/NVPTX/unreachable.ll
+++ b/llvm/test/CodeGen/NVPTX/unreachable.ll
@@ -1,11 +1,15 @@
 ; RUN: llc < %s -march=nvptx -mcpu=sm_20 -verify-machineinstrs \
-; RUN:     | FileCheck %s  --check-prefix=CHECK --check-prefix=CHECK-NOTRAP
+; RUN:     | FileCheck %s  --check-prefixes=CHECK,CHECK-NOTRAP
 ; RUN: llc < %s -march=nvptx64 -mcpu=sm_20 -verify-machineinstrs \
-; RUN:     | FileCheck %s  --check-prefix=CHECK --check-prefix=CHECK-NOTRAP
+; RUN:     | FileCheck %s  --check-prefixes=CHECK,CHECK-NOTRAP
+; RUN: llc < %s -march=nvptx64 -mcpu=sm_20 -verify-machineinstrs -mattr=+ptx75 \
+; RUN:     | FileCheck %s  --check-prefixes=CHECK-BUG-FIXED
+; RUN: llc < %s -march=nvptx64 -mcpu=sm_70 -verify-machineinstrs \
+; RUN:     | FileCheck %s  --check-prefixes=CHECK-BUG-FIXED
 ; RUN: llc < %s -march=nvptx -mcpu=sm_20 -verify-machineinstrs -trap-unreachable \
-; RUN:     | FileCheck %s --check-prefix=CHECK --check-prefix=CHECK-TRAP
+; RUN:     | FileCheck %s --check-prefixes=CHECK,CHECK-TRAP
 ; RUN: llc < %s -march=nvptx64 -mcpu=sm_20 -verify-machineinstrs -trap-unreachable \
-; RUN:     | FileCheck %s --check-prefix=CHECK --check-prefix=CHECK-TRAP
+; RUN:     | FileCheck %s --check-prefixes=CHECK,CHECK-TRAP
 ; RUN: %if ptxas && !ptxas-12.0 %{ llc < %s -march=nvptx -mcpu=sm_20 -verify-machineinstrs | %ptxas-verify %}
 ; RUN: %if ptxas %{ llc < %s -march=nvptx64 -mcpu=sm_20 -verify-machineinstrs | %ptxas-verify %}
 
@@ -21,12 +25,11 @@ define void @kernel_func() {
 ; CHECK-TRAP: trap;
 ; CHECK-NOTRAP-NOT: trap;
 ; CHECK: exit;
+; CHECK-BUG-FIXED-NOT: exit;
   unreachable
 }
 
 attributes #0 = { noreturn }
 
-
 !nvvm.annotations = !{!1}
-
 !1 = !{ptr @kernel_func, !"kernel", i32 1}

const auto &Options = getNVPTXTargetMachine().Options;
addPass(createNVPTXLowerUnreachablePass(Options.TrapUnreachable,
Options.NoTrapAfterNoreturn));
if (ST.hasPTXASUnreachableBug()) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Topic for discussion: should this be a CLI option?

I added the check for PTX Version <= 7.4 because it's the best proxy for querying whether ptxas comes from CUDA 11.4 or earlier I could find. (The highest version CUDA 11.4 ptxas supports is 7.4.)

Alternatively, we could put the onus on the user to set a flag indicating that they're using a copy of ptxas from CUDA 11.4 or prior.

CC @Artem-B for comment.

@Artem-B
Copy link
Member

Artem-B commented Sep 24, 2024

this pass is only necessary when targeting Pascal or earlier via ptxas from CUDA 11.4 or earlier. This PR updates NVPTXTargetMachine.cpp to reflect that.

I'm confused. The comment you quoted in the description only says

I do still encounter
miscompilations with Julia's unreachable-heavy code when targeting these
older GPUs using an older ptxas version (specifically, from CUDA 11.4 or
below).

The way I read it, it says that 11.4 and older ptxas with older GPUs may have more bugs that are not fixed by this patch, but I do not think it implies that the newer versions of ptxas or the newer GPUs do not need this pass.

What am I missing?

@maleadt
Copy link
Contributor

maleadt commented Sep 25, 2024

The way I read it, it says that 11.4 and older ptxas with older GPUs may have more bugs that are not fixed by this patch

Yeah, that was what I meant.

Although we only know about issues on Pascal and earlier because bar.sync is not allowed to be executed divergently there, making sure that ptxas has an accurate view of the CFG as intended by LLVM seems important on later hardware generations too. Citing from the relevant NVIDIA bug report (4078847):

If LLVM creates unstructured control flows then it makes many downstream compiler optimizations less effective.

We could avoid lowering unreachable to exit; trap; starting from the ptxas version that's known to model control flow accurately for just trap instructions, which it didn't when I proposed this change, but was told to me would get fixed in a future version of the compiler. I'm not sure if and how we can guard on that though, as the code generated by NVPTX may be handed to whatever version of back-end compiler out there (e.g. when compiling to PTX and using the driver APIs as opposed to having clang drive the ptxas invocation).

@Artem-B
Copy link
Member

Artem-B commented Sep 25, 2024

If we know which ptxas version no longer needs exit, then we may use PTX version that's only supported by that version of ptxas (or the driver), or newer. This guarantees that it will never be compileable by the older ptxas with the bug.

  bool hasPTXASUnreachableBug() const {
    return SmVersion < 70 && PTXVersion <= 74;
  }

With that in mind, this condition should be changed appropriately. Assuming that it's the latest public CUDA release that has the bug fixed, we should only mark PTX ISA 8.5 or newer as safe. https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#release-notes

@duk-37
Copy link
Contributor

duk-37 commented Sep 25, 2024

We could avoid lowering unreachable to exit; trap; starting from the ptxas version that's known to model control flow accurately for just trap instructions, which it didn't when I proposed this change, but was told to me would get fixed in a future version of the compiler. I'm not sure if and how we can guard on that though, as the code generated by NVPTX may be handed to whatever version of back-end compiler out there (e.g. when compiling to PTX and using the driver APIs as opposed to having clang drive the ptxas invocation).

That would be a change in the instruction selection backend though, not disabling this pass altogether, since that would mean there's still a need for trap instructions to avoid bogus CFG edges.

Alternatively, one could just do what WebAssembly and PS4+PS5 do and force these TargetOptions to be enabled. That would get rid of this pass completely. The question of "should TargetOptions be load-bearing" is a good one, but I feel like that bridge has been crossed already and there are many, many other places in which these things are inaccessible or user-provided command line options are just outright ignored.

@justinfargnoli
Copy link
Contributor Author

I'm confused. The comment you quoted in the description only says

Yes, this is my bad. I should have quoted places where @maleadt said things along the lines of:

we only know about issues on Pascal and earlier because bar.sync is not allowed to be executed divergently there


Citing from the relevant NVIDIA bug report (4078847):

Thank you for the bug number! I'm following up internally to confirm what release the fix was included in.


That would be a change in the instruction selection backend though, not disabling this pass altogether

I thought this pass added the exit instruction, not instruction selection. Assuming I'm missing something, where should we make this change within the instruction selection?

disabling this pass altogether ... mean(s) there's still a need for trap instructions to avoid bogus CFG edges.

This pass doesn't touch the trap instruction, so it should be okay to disable it, right?

@duk-37
Copy link
Contributor

duk-37 commented Sep 28, 2024

That would be a change in the instruction selection backend though, not disabling this pass altogether

I thought this pass added the exit instruction, not instruction selection. Assuming I'm missing something, where should we make this change within the instruction selection?

Sorry, I probably added some confusion with the way I worded things. What I was saying is that, if the reason this pass exists is to avoid bogus CFG edges (which makes sense, especially if there are downstream optimizations happening) then it either:

  1. should stay as-is, or;
  2. TrapUnreachable + NoTrapAfterNoreturrn could be set to true/false respectively in order to fix the issue in pretty much the same way, just without the extra pass.

What I meant by "change in instruction selection" was avoiding emitting exit when emitting trap instructions if you have some way of accurately determining the downstream compiler. That's what would probably involve changing some TableGen stuff to add checks for how trapinst is lowered.

The introduction of the pass happened to be due to a miscompile because of bogus CFG edges, but even if that's been fixed it's still probably a useful hint to any compilers taking this output as input like @maleadt said.

@Artem-B
Copy link
Member

Artem-B commented Sep 30, 2024

To me it looks like there are two things we need to decide here:

  • Can ptxas guarantee that it can handle unstructured CFG in principle? My recollection from the time when we initially struggled wit this issue is that it's not always possible. The nature of "unreachable" code in generated PTX is that it can result in unpredictable phantom CFG edges (as far as ptxas is concerned). Explicitly marking unreachable code with something that allows us to propagate that information to ptxas looks like something that we need to do. trap, AFAICT, is the best way to do so that we have right now.

The downside of this approach is that it is an extra instruction and, I believe, there were some reported use cases where it resulted in performance regressions in user code.

Ideally, it would be nice if ptxas would provide a NOP-like equivalent of unreachable, which would preserve correctness of its view of CFG without generating any extra instructions.

  • generation of trap; exit; is a workaround for the bug in the older ptxas which did not consider trap by itself as a CFG leaf. Whether we emit it or not would depend on whether we want/need to emit the unreachable hint instruction to PTX, and, if we do, or on the PTX (or ptxas) version.

@justinfargnoli justinfargnoli force-pushed the dev/jf/upstream-guard-lower-unreachable branch from 6bc6174 to b88976b Compare October 8, 2024 22:52
@justinfargnoli
Copy link
Contributor Author

justinfargnoli commented Oct 8, 2024

trap, AFAICT, is the best way to do so that we have right now.

I followed up internally and confirmed that trap is the best option that we have right now.

if we do, or on the PTX (or ptxas) version.

I've confirmed that this was fixed in CUDA 12.3, corresponding to PTX version 8.3.


whether we want/need to emit the unreachable hint instruction to PTX

Assuming that we do want to emit the unreachable hint (I don't think anyone has advocated against this), this PR should be ready for another round of review.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants