Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incorrect branch targets in RISC-V executables built with LTO #65090

Closed
ilovepi opened this issue Aug 29, 2023 · 12 comments
Closed

Incorrect branch targets in RISC-V executables built with LTO #65090

ilovepi opened this issue Aug 29, 2023 · 12 comments
Labels
backend:RISC-V llvm:codegen LTO Link time optimization (regular/full LTO or ThinLTO)

Comments

@ilovepi
Copy link
Contributor

ilovepi commented Aug 29, 2023

We're seeing several cases where branch targets are invalid, for example in the middle of another instruction, when building Fuchsia's zircon kernel with flto=full. This only seems to occur under -flto-full, and both non-LTO and ThinLTO do not seem to have this issue.

Most branches appear to be correct, or at least have valid targets, but in some cases the target isn't the location of a valid instruction. This sounds like it could be related to linker relaxation, but https://discourse.llvm.org/t/riscv-status-of-lto-for-risc-v/58518 isn't clear on the current status of LTO for RISC-V, so it may be related.

I'm working on getting something smaller than our kernel image as a reproducer, but in the meantime I'm providing a snippet from a relevant case. The target for the bne instruction is in the middle of another instruction sequence

  1b2ac8:	1d051363          	bne	a0,a6,1b2c8e <VmCowPages::AddNewPageLocked(unsigned long, vm_page*, VmCowPages::CanOverwriteContent, VmPageOrMarker*, bool, bool)+0x1ea>
...
  1b2c80:	0d000593          	li	a1,208
  1b2c84:	000ac097          	auipc	ra,0xac
  1b2c88:	8a8080e7          	jalr	-1880(ra) # 25e52c <assert_fail>
fbl::Canary<1447904080u>::Assert() const:
./../../zircon/system/ulib/fbl/include/fbl/canary.h:64
  1b2c8c:	030aa783          	lw	a5,48(s5)
  1b2c90:	ffea5517          	auipc	a0,0xffea5

https://fxbug.dev/129493 has additional context.

CC: @topperc @MaskRay @asb

@ilovepi ilovepi added backend:RISC-V llvm:codegen LTO Link time optimization (regular/full LTO or ThinLTO) labels Aug 29, 2023
@llvmbot
Copy link

llvmbot commented Aug 29, 2023

@llvm/issue-subscribers-backend-risc-v

@kito-cheng
Copy link
Member

The guideline for those kind of bug in my mind is: could you try to disable linker relaxation at all to see if it work? I am not sure how to disable that correctly at LTO build, but you could try to add -mno-relax at CFLAGS and LDFLAGS=-Wl,--no-relax?

And if it not work...@MaskRay could help :P

@jrtc27
Copy link
Collaborator

jrtc27 commented Aug 31, 2023

Probably something like -Wl,--plugin-opt=-mattr=-relax

@MaskRay
Copy link
Member

MaskRay commented Sep 1, 2023

#59350 (comment) this may be relevant

If you use clang++ --target=riscv64-linux-gnu -flto -fuse-ld=lld a.cc -r -o a.o -Wl,-mllvm,-mattr=+c,-mllvm,-mattr=+relax, you'll get R_RISCV_RELAX. If you omit the -Wl, then there will be no R_RISCV_RELAX, and e_flags of the ELF header does not contain EF_RISCV_RVC. This is due to a target-features LTO issue which hasn't been resolved.

@nickdesaulniers
Copy link
Member

@twd2 spotted a case that smells a lot like fc018eb.

If we need to re-pass compiler flags to the linker, but only for LTO, that is a deficit in the IR IMO.

intel-lab-lkp pushed a commit to intel-lab-lkp/linux that referenced this issue Oct 17, 2023
Allow LTO to be selected for RISC-V, only when LLD >= 14, since there is
an issue [1] in prior LLD versions that prevents LLD to generate proper
machine code for RISC-V when writing `nop`s.

To avoid boot failures in QEMU [2], '-mattr=+c' and '-mattr=+relax'
need to be passed via '-mllvm' to ld.lld, as there appears to be an
issue with LLVM's target-features and LTO [3], which can result in
incorrect relocations to branch targets [4]. Once this is fixed in LLVM,
it can be made conditional on affected ld.lld versions.

Disable LTO for arch/riscv/kernel/pi, as llvm-objcopy expects an ELF
object file when manipulating the files in that subfolder, rather than
LLVM bitcode.

[1] llvm/llvm-project#50505, resolved by LLVM
    commit e63455d5e0e5 ("[MC] Use local MCSubtargetInfo in writeNops")
[2] ClangBuiltLinux#1942
[3] llvm/llvm-project#59350
[4] llvm/llvm-project#65090

Tested-by: Wende Tan <[email protected]>
Signed-off-by: Wende Tan <[email protected]>
Co-developed-by: Nathan Chancellor <[email protected]>
Signed-off-by: Nathan Chancellor <[email protected]>
@petrhosek
Copy link
Member

We have now also encountered this issue in ThinLTO builds. We no longer think it's LTO related, our current working theory is that it's a garbage-in-garbage-out scenario where the backend generates branches with invalid targets and LLD fails to detect and warn for this case. We believe it might be related to the size of module which is why it's manifesting primarily in LTO and ThinLTO builds, but we're trying to see if we could reproduce it in a regular build as well. While we're trying to narrow down the issue, I think it'd be helpful if LLD could diagnose this case and give a warning.

@twd2
Copy link
Contributor

twd2 commented Nov 6, 2023

We have now also encountered this issue in ThinLTO builds. We no longer think it's LTO related, our current working theory is that it's a garbage-in-garbage-out scenario where the backend generates branches with invalid targets and LLD fails to detect and warn for this case. We believe it might be related to the size of module which is why it's manifesting primarily in LTO and ThinLTO builds, but we're trying to see if we could reproduce it in a regular build as well. While we're trying to narrow down the issue, I think it'd be helpful if LLD could diagnose this case and give a warning.

Please check ClangBuiltLinux/linux#1942 (comment) to see if this is similar to your case :)

@ilovepi
Copy link
Contributor Author

ilovepi commented Nov 27, 2023

So, we're able to work around this by passing -Wl,-mllvm,-mattr=+c,-mllvm,-mattr=+relax in our build. I had tried this approach earlier by adding the -mllvm ... flags in the linker invocation, but it seems that I must have missed a key dependency in a previous step. You can see this Bug for details

While I'm going to close this particular issue, I'll probably file a separate issue an perhaps follow up on discourse about how to reconcile the underlying problem w/ the interaction between the linker, target-features, and LTO.

@ilovepi ilovepi closed this as completed Nov 27, 2023
@nickdesaulniers
Copy link
Member

So, we're able to work around this by passing -Wl,-mllvm,-mattr=+c,-mllvm,-mattr=+relax in our build

That's unfortunate. Every other project looking to use LTO for riscv is going to hit the same issue and have to use the same (awful) workaround. (Needing -mllvm flags for correct codegen is a code smell, IMO).

@ilovepi
Copy link
Contributor Author

ilovepi commented Nov 27, 2023

@nickdesaulniers I'm going through the issues now, and compiling a list of all of them. If I don't find a canonical one, I'll file a new issue pointing out the underlying issue w/ links to all the rest and follow up on discourse. This is pretty broken ATM, and while its critical for RISC-V, I don't think it's a RISC-V specific issue any more than #67698 (which is remarkably similar). It just hits harder on RISC-V because target-features is load-bearing in the extreme.

@preames
Copy link
Collaborator

preames commented Nov 30, 2023

The failure description on this sounds a lot like the same issue as #73977. There may be other contributors as well, but ruling that one out would likely be very worthwhile.

Edit: A bit more context on why this sounds related.

So, we're able to work around this by passing -Wl,-mllvm,-mattr=+c,-mllvm,-mattr=+relax in our build. I had tried this approach earlier by adding the -mllvm ... flags in the linker invocation, but it seems that I must have missed a key dependency in a previous step.

This was exactly the behavior I saw in the original build which triggered finding #73977. In my local example, we had a mix of object files compiled without C, and an archive that was compiled with C. Due to the link order, the linker was treating all the object files as if they had been compiled with C. This was incompatible with the code emitted by the compiler, and resulted in a case where a RISCV_ALIGN couldn't be satisfied. Adding C to all of the object files universally caused the problem to go away because the compiler left two extra bytes of padding for the RISCV_ALIGN. Note that while the RISCV_ALIGN failed asserts in assert builds, it simply computes an incorrect adjustment in release builds and thus true chaos can fall out.

@ilovepi
Copy link
Contributor Author

ilovepi commented Nov 30, 2023

Thanks for chiming in @preames. I'll try to confirm if that mitigates some of the issues once this lands.

prabhakarlad pushed a commit to prabhakarlad/linux that referenced this issue Dec 11, 2023
Allow LTO to be selected for RISC-V, only when LLD >= 14, since there is
an issue [1] in prior LLD versions that prevents LLD to generate proper
machine code for RISC-V when writing `nop`s.

To avoid boot failures in QEMU [2], '-mattr=+c' and '-mattr=+relax'
need to be passed via '-mllvm' to ld.lld, as there appears to be an
issue with LLVM's target-features and LTO [3], which can result in
incorrect relocations to branch targets [4]. Once this is fixed in LLVM,
it can be made conditional on affected ld.lld versions.

Disable LTO for arch/riscv/kernel/pi, as llvm-objcopy expects an ELF
object file when manipulating the files in that subfolder, rather than
LLVM bitcode.

[1] llvm/llvm-project#50505, resolved by LLVM
    commit e63455d5e0e5 ("[MC] Use local MCSubtargetInfo in writeNops")
[2] ClangBuiltLinux#1942
[3] llvm/llvm-project#59350
[4] llvm/llvm-project#65090

Tested-by: Wende Tan <[email protected]>
Signed-off-by: Wende Tan <[email protected]>
Co-developed-by: Nathan Chancellor <[email protected]>
Signed-off-by: Nathan Chancellor <[email protected]>
Reviewed-by: Conor Dooley <[email protected]>
gnoliyil pushed a commit to gnoliyil/fuchsia that referenced this issue Jan 27, 2024
The underlying issue seems to be that the linker does not have target
attributes set correctly in all cases, and LTO cannot diagnose this.
This is tracked in upstream LLVM, and adding these flags to the linker
invocation is the suggested work around (see
llvm/llvm-project#65090,
llvm/llvm-project#59350 (comment),
llvm/llvm-project#67698,
and
https://discourse.llvm.org/t/myterious-soft-float-output-in-lto-cache/70753
 for more details).

Bug: 129493
Change-Id: Ib3b9f387d61c5a855ba57ede191881c567414de7
Reviewed-on: https://fuchsia-review.googlesource.com/c/fuchsia/+/949677
Reviewed-by: Petr Hosek <[email protected]>
Fuchsia-Auto-Submit: Paul Kirth <[email protected]>
Commit-Queue: Auto-Submit <[email protected]>
knagaitsev pushed a commit to knagaitsev/linux that referenced this issue Oct 4, 2024
Allow LTO to be selected for RISC-V, only when LLD >= 14, since there is
an issue [1] in prior LLD versions that prevents LLD to generate proper
machine code for RISC-V when writing `nop`s.

To avoid boot failures in QEMU [2], '-mattr=+c' and '-mattr=+relax'
need to be passed via '-mllvm' to ld.lld, as there appears to be an
issue with LLVM's target-features and LTO [3], which can result in
incorrect relocations to branch targets [4]. Once this is fixed in LLVM,
it can be made conditional on affected ld.lld versions.

Disable LTO for arch/riscv/kernel/pi, as llvm-objcopy expects an ELF
object file when manipulating the files in that subfolder, rather than
LLVM bitcode.

[1] llvm/llvm-project#50505, resolved by LLVM
    commit e63455d5e0e5 ("[MC] Use local MCSubtargetInfo in writeNops")
[2] ClangBuiltLinux#1942
[3] llvm/llvm-project#59350
[4] llvm/llvm-project#65090

Tested-by: Wende Tan <[email protected]>
Signed-off-by: Wende Tan <[email protected]>
Co-developed-by: Nathan Chancellor <[email protected]>
Signed-off-by: Nathan Chancellor <[email protected]>
Reviewed-by: Conor Dooley <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Palmer Dabbelt <[email protected]>
knagaitsev pushed a commit to knagaitsev/linux that referenced this issue Oct 11, 2024
Allow LTO to be selected for RISC-V, only when LLD >= 14, since there is
an issue [1] in prior LLD versions that prevents LLD to generate proper
machine code for RISC-V when writing `nop`s.

To avoid boot failures in QEMU [2], '-mattr=+c' and '-mattr=+relax'
need to be passed via '-mllvm' to ld.lld, as there appears to be an
issue with LLVM's target-features and LTO [3], which can result in
incorrect relocations to branch targets [4]. Once this is fixed in LLVM,
it can be made conditional on affected ld.lld versions.

Disable LTO for arch/riscv/kernel/pi, as llvm-objcopy expects an ELF
object file when manipulating the files in that subfolder, rather than
LLVM bitcode.

[1] llvm/llvm-project#50505, resolved by LLVM
    commit e63455d5e0e5 ("[MC] Use local MCSubtargetInfo in writeNops")
[2] ClangBuiltLinux#1942
[3] llvm/llvm-project#59350
[4] llvm/llvm-project#65090

Tested-by: Wende Tan <[email protected]>
Signed-off-by: Wende Tan <[email protected]>
Co-developed-by: Nathan Chancellor <[email protected]>
Signed-off-by: Nathan Chancellor <[email protected]>
Reviewed-by: Conor Dooley <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Palmer Dabbelt <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backend:RISC-V llvm:codegen LTO Link time optimization (regular/full LTO or ThinLTO)
Projects
None yet
Development

No branches or pull requests

9 participants