[NVPTX] Only run LowerUnreachable when necessary #109868

justinfargnoli · 2024-09-24T21:47:33Z

NVPTX: Lower unreachable to exit to allow ptxas to accurately reconstruct the CFG. added the LowerUnreachable pass to NVPTX. Based on the PR description:

Finally, although I expect this to fix most of
https://bugs.llvm.org/show_bug.cgi?id=27738, I do still encounter
miscompilations with Julia's unreachable-heavy code when targeting these
older GPUs using an older ptxas version (specifically, from CUDA 11.4 or
below). This is likely due to related bugs in ptxas which have been fixed
since, as I have filed several reproducers with NVIDIA over the past couple of
years. I'm not inclined to look into fixing those issues over here, and will
instead be recommending our users to upgrade CUDA to 11.5+ when using these GPUs.

this pass is only necessary when targeting Pascal or earlier via ptxas from CUDA 11.4 or earlier. This PR updates NVPTXTargetMachine.cpp to reflect that.

CC @maleadt

llvmbot · 2024-09-24T21:48:10Z

@llvm/pr-subscribers-backend-nvptx

Author: Justin Fargnoli (justinfargnoli)

Changes

NVPTX: Lower unreachable to exit to allow ptxas to accurately reconstruct the CFG. added the LowerUnreachable pass to NVPTX. Based on the PR description:

> Finally, although I expect this to fix most of
> https://bugs.llvm.org/show_bug.cgi?id=27738, I do still encounter
> miscompilations with Julia's unreachable-heavy code when targeting these
> older GPUs using an older ptxas version (specifically, from CUDA 11.4 or
> below). This is likely due to related bugs in ptxas which have been fixed
> since, as I have filed several reproducers with NVIDIA over the past couple of
> years. I'm not inclined to look into fixing those issues over here, and will
> instead be recommending our users to upgrade CUDA to 11.5+ when using these GPUs.

this pass is only necessary when targeting Pascal or earlier via ptxas from CUDA 11.4 or earlier. This PR updates NVPTXTargetMachine.cpp to reflect that.

CC @maleadt

Full diff: https://github.com/llvm/llvm-project/pull/109868.diff

3 Files Affected:

(modified) llvm/lib/Target/NVPTX/NVPTXSubtarget.h (+3)
(modified) llvm/lib/Target/NVPTX/NVPTXTargetMachine.cpp (+7-3)
(modified) llvm/test/CodeGen/NVPTX/unreachable.ll (+9-6)

diff --git a/llvm/lib/Target/NVPTX/NVPTXSubtarget.h b/llvm/lib/Target/NVPTX/NVPTXSubtarget.h
index 8b9059bd60cbd4..e2ce088cacdf53 100644
--- a/llvm/lib/Target/NVPTX/NVPTXSubtarget.h
+++ b/llvm/lib/Target/NVPTX/NVPTXSubtarget.h
@@ -95,6 +95,9 @@ class NVPTXSubtarget : public NVPTXGenSubtargetInfo {
   bool hasDotInstructions() const {
     return SmVersion >= 61 && PTXVersion >= 50;
   }
+  bool hasPTXASUnreachableBug() const {
+    return SmVersion < 70 && PTXVersion <= 74;
+  }
   bool hasCvtaParam() const { return SmVersion >= 70 && PTXVersion >= 77; }
   unsigned int getFullSmVersion() const { return FullSmVersion; }
   unsigned int getSmVersion() const { return getFullSmVersion() / 10; }
diff --git a/llvm/lib/Target/NVPTX/NVPTXTargetMachine.cpp b/llvm/lib/Target/NVPTX/NVPTXTargetMachine.cpp
index 57b7fa783c14a7..b79b4ff93efe49 100644
--- a/llvm/lib/Target/NVPTX/NVPTXTargetMachine.cpp
+++ b/llvm/lib/Target/NVPTX/NVPTXTargetMachine.cpp
@@ -368,9 +368,13 @@ void NVPTXPassConfig::addIRPasses() {
     addPass(createSROAPass());
   }
 
-  const auto &Options = getNVPTXTargetMachine().Options;
-  addPass(createNVPTXLowerUnreachablePass(Options.TrapUnreachable,
-                                          Options.NoTrapAfterNoreturn));
+  if (ST.hasPTXASUnreachableBug()) {
+    // Run LowerUnreachable to WAR a ptxas bug. See the commit description of
+    // 1ee4d880e8760256c606fe55b7af85a4f70d006d for more details.
+    const auto &Options = getNVPTXTargetMachine().Options;
+    addPass(createNVPTXLowerUnreachablePass(Options.TrapUnreachable,
+                                            Options.NoTrapAfterNoreturn));
+  }
 }
 
 bool NVPTXPassConfig::addInstSelector() {
diff --git a/llvm/test/CodeGen/NVPTX/unreachable.ll b/llvm/test/CodeGen/NVPTX/unreachable.ll
index 011497c4e23401..4219f0f3b47fc9 100644
--- a/llvm/test/CodeGen/NVPTX/unreachable.ll
+++ b/llvm/test/CodeGen/NVPTX/unreachable.ll
@@ -1,11 +1,15 @@
 ; RUN: llc < %s -march=nvptx -mcpu=sm_20 -verify-machineinstrs \
-; RUN:     | FileCheck %s  --check-prefix=CHECK --check-prefix=CHECK-NOTRAP
+; RUN:     | FileCheck %s  --check-prefixes=CHECK,CHECK-NOTRAP
 ; RUN: llc < %s -march=nvptx64 -mcpu=sm_20 -verify-machineinstrs \
-; RUN:     | FileCheck %s  --check-prefix=CHECK --check-prefix=CHECK-NOTRAP
+; RUN:     | FileCheck %s  --check-prefixes=CHECK,CHECK-NOTRAP
+; RUN: llc < %s -march=nvptx64 -mcpu=sm_20 -verify-machineinstrs -mattr=+ptx75 \
+; RUN:     | FileCheck %s  --check-prefixes=CHECK-BUG-FIXED
+; RUN: llc < %s -march=nvptx64 -mcpu=sm_70 -verify-machineinstrs \
+; RUN:     | FileCheck %s  --check-prefixes=CHECK-BUG-FIXED
 ; RUN: llc < %s -march=nvptx -mcpu=sm_20 -verify-machineinstrs -trap-unreachable \
-; RUN:     | FileCheck %s --check-prefix=CHECK --check-prefix=CHECK-TRAP
+; RUN:     | FileCheck %s --check-prefixes=CHECK,CHECK-TRAP
 ; RUN: llc < %s -march=nvptx64 -mcpu=sm_20 -verify-machineinstrs -trap-unreachable \
-; RUN:     | FileCheck %s --check-prefix=CHECK --check-prefix=CHECK-TRAP
+; RUN:     | FileCheck %s --check-prefixes=CHECK,CHECK-TRAP
 ; RUN: %if ptxas && !ptxas-12.0 %{ llc < %s -march=nvptx -mcpu=sm_20 -verify-machineinstrs | %ptxas-verify %}
 ; RUN: %if ptxas %{ llc < %s -march=nvptx64 -mcpu=sm_20 -verify-machineinstrs | %ptxas-verify %}
 
@@ -21,12 +25,11 @@ define void @kernel_func() {
 ; CHECK-TRAP: trap;
 ; CHECK-NOTRAP-NOT: trap;
 ; CHECK: exit;
+; CHECK-BUG-FIXED-NOT: exit;
   unreachable
 }
 
 attributes #0 = { noreturn }
 
-
 !nvvm.annotations = !{!1}
-
 !1 = !{ptr @kernel_func, !"kernel", i32 1}

justinfargnoli · 2024-09-24T21:55:18Z

llvm/lib/Target/NVPTX/NVPTXTargetMachine.cpp

-  const auto &Options = getNVPTXTargetMachine().Options;
-  addPass(createNVPTXLowerUnreachablePass(Options.TrapUnreachable,
-                                          Options.NoTrapAfterNoreturn));
+  if (ST.hasPTXASUnreachableBug()) {


Topic for discussion: should this be a CLI option?

I added the check for PTX Version <= 7.4 because it's the best proxy for querying whether ptxas comes from CUDA 11.4 or earlier I could find. (The highest version CUDA 11.4 ptxas supports is 7.4.)

Alternatively, we could put the onus on the user to set a flag indicating that they're using a copy of ptxas from CUDA 11.4 or prior.

CC @Artem-B for comment.

Artem-B · 2024-09-24T23:42:30Z

this pass is only necessary when targeting Pascal or earlier via ptxas from CUDA 11.4 or earlier. This PR updates NVPTXTargetMachine.cpp to reflect that.

I'm confused. The comment you quoted in the description only says

I do still encounter
miscompilations with Julia's unreachable-heavy code when targeting these
older GPUs using an older ptxas version (specifically, from CUDA 11.4 or
below).

The way I read it, it says that 11.4 and older ptxas with older GPUs may have more bugs that are not fixed by this patch, but I do not think it implies that the newer versions of ptxas or the newer GPUs do not need this pass.

What am I missing?

maleadt · 2024-09-25T08:12:59Z

The way I read it, it says that 11.4 and older ptxas with older GPUs may have more bugs that are not fixed by this patch

Yeah, that was what I meant.

Although we only know about issues on Pascal and earlier because bar.sync is not allowed to be executed divergently there, making sure that ptxas has an accurate view of the CFG as intended by LLVM seems important on later hardware generations too. Citing from the relevant NVIDIA bug report (4078847):

If LLVM creates unstructured control flows then it makes many downstream compiler optimizations less effective.

We could avoid lowering unreachable to exit; trap; starting from the ptxas version that's known to model control flow accurately for just trap instructions, which it didn't when I proposed this change, but was told to me would get fixed in a future version of the compiler. I'm not sure if and how we can guard on that though, as the code generated by NVPTX may be handed to whatever version of back-end compiler out there (e.g. when compiling to PTX and using the driver APIs as opposed to having clang drive the ptxas invocation).

Artem-B · 2024-09-25T18:12:47Z

If we know which ptxas version no longer needs exit, then we may use PTX version that's only supported by that version of ptxas (or the driver), or newer. This guarantees that it will never be compileable by the older ptxas with the bug.

  bool hasPTXASUnreachableBug() const {
    return SmVersion < 70 && PTXVersion <= 74;
  }

With that in mind, this condition should be changed appropriately. Assuming that it's the latest public CUDA release that has the bug fixed, we should only mark PTX ISA 8.5 or newer as safe. https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#release-notes

duk-37 · 2024-09-25T19:24:57Z

We could avoid lowering unreachable to exit; trap; starting from the ptxas version that's known to model control flow accurately for just trap instructions, which it didn't when I proposed this change, but was told to me would get fixed in a future version of the compiler. I'm not sure if and how we can guard on that though, as the code generated by NVPTX may be handed to whatever version of back-end compiler out there (e.g. when compiling to PTX and using the driver APIs as opposed to having clang drive the ptxas invocation).

That would be a change in the instruction selection backend though, not disabling this pass altogether, since that would mean there's still a need for trap instructions to avoid bogus CFG edges.

Alternatively, one could just do what WebAssembly and PS4+PS5 do and force these TargetOptions to be enabled. That would get rid of this pass completely. The question of "should TargetOptions be load-bearing" is a good one, but I feel like that bridge has been crossed already and there are many, many other places in which these things are inaccessible or user-provided command line options are just outright ignored.

justinfargnoli · 2024-09-28T02:46:08Z

I'm confused. The comment you quoted in the description only says

Yes, this is my bad. I should have quoted places where @maleadt said things along the lines of:

we only know about issues on Pascal and earlier because bar.sync is not allowed to be executed divergently there

Citing from the relevant NVIDIA bug report (4078847):

Thank you for the bug number! I'm following up internally to confirm what release the fix was included in.

That would be a change in the instruction selection backend though, not disabling this pass altogether

I thought this pass added the exit instruction, not instruction selection. Assuming I'm missing something, where should we make this change within the instruction selection?

disabling this pass altogether ... mean(s) there's still a need for trap instructions to avoid bogus CFG edges.

This pass doesn't touch the trap instruction, so it should be okay to disable it, right?

duk-37 · 2024-09-28T18:11:42Z

That would be a change in the instruction selection backend though, not disabling this pass altogether

I thought this pass added the exit instruction, not instruction selection. Assuming I'm missing something, where should we make this change within the instruction selection?

Sorry, I probably added some confusion with the way I worded things. What I was saying is that, if the reason this pass exists is to avoid bogus CFG edges (which makes sense, especially if there are downstream optimizations happening) then it either:

should stay as-is, or;
TrapUnreachable + NoTrapAfterNoreturrn could be set to true/false respectively in order to fix the issue in pretty much the same way, just without the extra pass.

What I meant by "change in instruction selection" was avoiding emitting exit when emitting trap instructions if you have some way of accurately determining the downstream compiler. That's what would probably involve changing some TableGen stuff to add checks for how trapinst is lowered.

The introduction of the pass happened to be due to a miscompile because of bogus CFG edges, but even if that's been fixed it's still probably a useful hint to any compilers taking this output as input like @maleadt said.

Artem-B · 2024-09-30T17:58:49Z

To me it looks like there are two things we need to decide here:

Can ptxas guarantee that it can handle unstructured CFG in principle? My recollection from the time when we initially struggled wit this issue is that it's not always possible. The nature of "unreachable" code in generated PTX is that it can result in unpredictable phantom CFG edges (as far as ptxas is concerned). Explicitly marking unreachable code with something that allows us to propagate that information to ptxas looks like something that we need to do. trap, AFAICT, is the best way to do so that we have right now.

The downside of this approach is that it is an extra instruction and, I believe, there were some reported use cases where it resulted in performance regressions in user code.

Ideally, it would be nice if ptxas would provide a NOP-like equivalent of unreachable, which would preserve correctness of its view of CFG without generating any extra instructions.

generation of trap; exit; is a workaround for the bug in the older ptxas which did not consider trap by itself as a CFG leaf. Whether we emit it or not would depend on whether we want/need to emit the unreachable hint instruction to PTX, and, if we do, or on the PTX (or ptxas) version.

justinfargnoli · 2024-10-08T23:20:27Z

trap, AFAICT, is the best way to do so that we have right now.

I followed up internally and confirmed that trap is the best option that we have right now.

if we do, or on the PTX (or ptxas) version.

I've confirmed that this was fixed in CUDA 12.3, corresponding to PTX version 8.3.

whether we want/need to emit the unreachable hint instruction to PTX

Assuming that we do want to emit the unreachable hint (I don't think anyone has advocated against this), this PR should be ready for another round of review.

justinfargnoli requested review from chsigg, AlexMaclean and duk-37 September 24, 2024 21:47

justinfargnoli self-assigned this Sep 24, 2024

llvmbot added the backend:NVPTX label Sep 24, 2024

justinfargnoli commented Sep 24, 2024

View reviewed changes

justinfargnoli requested a review from Artem-B September 24, 2024 21:55

justinfargnoli added 4 commits October 8, 2024 12:42

[NVPTX] Only run LowerUnreachable on PTX version <= 7.4

58ffde6

Also check wether we're targeting Pascal or prior

5424a9d

PTXAS only has the bug prior to CUDA 12.8 / PTX v8.2

529062c

Only emit exit for PTX v8.3+

b88976b

justinfargnoli force-pushed the dev/jf/upstream-guard-lower-unreachable branch from 6bc6174 to b88976b Compare October 8, 2024 22:52

Rename tablegen def

29acb3c

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[NVPTX] Only run LowerUnreachable when necessary #109868

[NVPTX] Only run LowerUnreachable when necessary #109868

justinfargnoli commented Sep 24, 2024

llvmbot commented Sep 24, 2024

justinfargnoli Sep 24, 2024

Artem-B commented Sep 24, 2024

maleadt commented Sep 25, 2024

Artem-B commented Sep 25, 2024

duk-37 commented Sep 25, 2024 •

edited

Loading

justinfargnoli commented Sep 28, 2024

duk-37 commented Sep 28, 2024 •

edited

Loading

Artem-B commented Sep 30, 2024

justinfargnoli commented Oct 8, 2024 •

edited

Loading

[NVPTX] Only run LowerUnreachable when necessary #109868

Are you sure you want to change the base?

[NVPTX] Only run LowerUnreachable when necessary #109868

Conversation

justinfargnoli commented Sep 24, 2024

llvmbot commented Sep 24, 2024

justinfargnoli Sep 24, 2024

Choose a reason for hiding this comment

Artem-B commented Sep 24, 2024

maleadt commented Sep 25, 2024

Artem-B commented Sep 25, 2024

duk-37 commented Sep 25, 2024 • edited Loading

justinfargnoli commented Sep 28, 2024

duk-37 commented Sep 28, 2024 • edited Loading

Artem-B commented Sep 30, 2024

justinfargnoli commented Oct 8, 2024 • edited Loading

duk-37 commented Sep 25, 2024 •

edited

Loading

duk-37 commented Sep 28, 2024 •

edited

Loading

justinfargnoli commented Oct 8, 2024 •

edited

Loading