[Bug]: CHPL_ROCM_PATH is set to the wrong ROCm installation path #25952

Closed
Guillaume-Helbecque opened this issue Sep 17, 2024 · 7 comments · Fixed by #26072

@Guillaume-Helbecque
Contributor

Guillaume-Helbecque commented Sep 17, 2024

Summary of Problem

Description:
I tried to build Chapel 2.1 on a system where ROCm 6.0.3 is the default and ROCm 5.4.6 is loaded, because Chapel 2.1 does not support ROCm 6.0.3. In this configuration, the build failed with Error: command not found: /***/***/***/rocm/llvm/bin/llvm-config. The issue is that Chapel set CHPL_ROCM_PATH to the default ROCm installation path, which is not supported, instead of the one I loaded, where llvm-config lives at /***/***/***/rocm/5.4.6/llvm/bin/llvm-config. Manually setting CHPL_ROCM_PATH=/***/***/***/rocm/5.4.6 fixes the issue.

[edit: After a quick discussion with the experts managing the system, it seems that the issue may not come from how Chapel detects the ROCm installation, but from how it tries to find llvm-config. The leading directories of the path in the error message are those of the ROCm 5.4.6 module. However, for some reason, rather than looking in rocm/5.4.6/llvm, Chapel drops the 5.4.6 component from the directory it searches. Exporting CHPL_LLVM_CONFIG=/***/***/***/rocm/5.4.6/llvm/bin/llvm-config also solves the issue.]

If I'm not mistaken, the heuristic that searches for the ROCm installation path runs which hipcc and derives the path from the result. However, which hipcc on this system returns the correct path: /***/***/***/rocm/5.4.6/bin/hipcc. This suggests that the problem arises when Chapel has to choose between two (or more) possible installation paths.

May be related to #23542.

Is this issue currently blocking your progress?
No. Setting path(s) manually makes things work.

@jabraham17
Member

I suspect the issue is mostly with how we detect the rocm sdk directory. We run which hipcc, and then walk up the directories until we find a directory with rocm. So if which hipcc reports /***/***/***/rocm/5.4.6/bin/hipcc then CHPL_ROCM_PATH is inferred to be /***/***/***/rocm.

The llvm-config for rocm is selected purely based on CHPL_ROCM_PATH.

If you manually set CHPL_LLVM_CONFIG=/***/***/***/rocm/5.4.6/llvm/bin/llvm-config and don't set CHPL_ROCM_PATH, I would expect that running printchplenv --all --internal would show an incorrect CHPL_ROCM_PATH. So while printchplenv might not have an error, I would expect compilation of GPU Chapel code to fail.

I think this is just a case where we need to improve the auto-detection, because everything should work fine if you set CHPL_ROCM_PATH.
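To make the failure mode described above concrete, here is a minimal Python sketch of a walk-up heuristic of that kind. It is not Chapel's actual script: the function names, the substring match on "rocm", and the /opt/... paths are all assumptions chosen for illustration (the real paths in this report are masked).

```python
import os


def infer_rocm_path_from_hipcc(hipcc_path):
    """Walk up from hipcc until a directory whose name contains 'rocm'.

    Hypothetical helper: the name and the exact matching rule are
    assumptions, not Chapel's real implementation.
    """
    current = os.path.dirname(hipcc_path)
    # Stop once we reach the filesystem root (dirname no longer changes).
    while current and current != os.path.dirname(current):
        if "rocm" in os.path.basename(current):
            return current
        current = os.path.dirname(current)
    return None


def llvm_config_for(rocm_path):
    """llvm-config location derived purely from the inferred ROCm root."""
    return os.path.join(rocm_path, "llvm", "bin", "llvm-config")


# Layout from this report (rocm/x.y.z): the version directory is skipped.
versioned_subdir = infer_rocm_path_from_hipcc("/opt/rocm/5.4.6/bin/hipcc")
# Layout mentioned later in the thread (rocm-x.y.z): the heuristic works.
versioned_name = infer_rocm_path_from_hipcc("/opt/rocm-5.4.6/bin/hipcc")

print(versioned_subdir, llvm_config_for(versioned_subdir))
print(versioned_name, llvm_config_for(versioned_name))
```

With the rocm/5.4.6 layout, the walk skips past the 5.4.6 directory because its name does not contain "rocm", which would explain why the error message points at rocm/llvm/bin/llvm-config.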

@e-kayrakli
Contributor

Thanks for the bug report Guillaume and the assessment Jade!

Personally, I find the difficulty of this path finding really surprising. Neither vendor seems to have a good way of asking the compiler about some crucial paths, leaving us guessing in our scripts. Most systems I am familiar with have rocm installations in .../rocm-x.y.z, .../rocm-a.b.c, but apparently some installations have rocm/x.y.z etc.

So far, what we have been doing (and seemingly will keep doing) is improving our scripts as we encounter more installation layouts and bug reports, which is unfortunate. For this particular case, I wonder whether, after doing what we do today, we could run find (or whatever equivalent Python modules provide) for the particular things we are looking for. Right now, we just run which hipcc, look for rocm in the path, and hope that everything works out. With such a check, we should at least be able to error out if the installation is not one we can detect automatically.

One little challenge is that the compiler picks out some bitcode libraries from these installations. If we want to find them in our scripts, we may want a common place that lists those libraries so that both the compiler and the scripts can find them. A comma-separated list in an environment variable could be a good alternative.
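A rough sketch of the "check after detection, then error out" idea could look like the following. The two required paths are the only ones named in this thread; the full set of files Chapel needs (including the bitcode libraries) is not listed here, so REQUIRED_RELATIVE_PATHS and both function names are placeholders.

```python
import os

# Only bin/hipcc and llvm/bin/llvm-config are mentioned in this thread.
# The complete list is an assumption left open here.
REQUIRED_RELATIVE_PATHS = (
    os.path.join("bin", "hipcc"),
    os.path.join("llvm", "bin", "llvm-config"),
)


def missing_rocm_files(candidate_root, required=REQUIRED_RELATIVE_PATHS):
    """Return the required paths that do not exist under candidate_root."""
    return [
        rel for rel in required
        if not os.path.isfile(os.path.join(candidate_root, rel))
    ]


def check_rocm_root(candidate_root):
    """Raise a clear error instead of silently using a bad root."""
    missing = missing_rocm_files(candidate_root)
    if missing:
        raise RuntimeError(
            "Could not validate ROCm installation at {}: missing {}. "
            "Set CHPL_ROCM_PATH manually.".format(
                candidate_root, ", ".join(missing)
            )
        )
    return candidate_root
```

Running check_rocm_root on the inferred path would turn the confusing llvm-config "command not found" failure into an early, explicit message pointing at CHPL_ROCM_PATH.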

@Guillaume-Helbecque
Contributor Author

Thanks @jabraham17 and @e-kayrakli for your feedback. Handling every installation layout exhaustively is obviously not trivial, given how much they vary. At the very least, I hope this report helps improve the existing path-finding strategy a bit more.

If you manually set CHPL_LLVM_CONFIG=/***/***/***/rocm/5.4.6/llvm/bin/llvm-config and don't set CHPL_ROCM_PATH, I would expect that running printchplenv --all --internal would show an incorrect CHPL_ROCM_PATH.

That's right, I just tested it. Manually setting CHPL_LLVM_CONFIG lets the Chapel build pass, but CHPL_ROCM_PATH remains wrong.

@jabraham17
Member

Since hipcc won't tell us the right path, I wonder if we can try invoking hipcc to extract the right path?

For example, this bash one-liner should report the right installation, without needing our path heuristics:

echo "int main() {return 0;}" | hipcc -v -x c++ - 2>&1 | grep 'Found HIP installation'

I tested this on 2 systems with various rocm versions and it seemed to report the right path each time.

@Guillaume-Helbecque does this report the right path on your system?

Here is a similar check for nvidia:

echo "int main() {return 0;}" | nvcc -v -x c++ - 2>&1 | grep '#$ TOP='
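For reference, the same two checks could be done from Python along these lines. This is only a sketch: the function names are made up, the exact shape of the compiler output after the grepped markers (a path followed by ", version ..." for hipcc, and a path after TOP= for nvcc) is an assumption, and the sample strings are invented rather than captured from a real run.

```python
import re
import subprocess

# Assumed line shapes; only the grepped markers come from the one-liners.
HIP_LINE = re.compile(r"Found HIP installation:\s*(?P<path>[^,\n]+)")
NVCC_TOP_LINE = re.compile(r"^#\$ TOP=(?P<path>.+)$", re.MULTILINE)


def parse_hip_installation(verbose_output):
    """Extract the path from a 'Found HIP installation' line, if any."""
    match = HIP_LINE.search(verbose_output)
    return match.group("path").strip() if match else None


def parse_nvcc_top(verbose_output):
    """Extract the value of the '#$ TOP=' line, if any."""
    match = NVCC_TOP_LINE.search(verbose_output)
    return match.group("path").strip() if match else None


def query_hipcc(hipcc="hipcc"):
    """Compile a trivial program from stdin with -v and parse the output."""
    result = subprocess.run(
        [hipcc, "-v", "-x", "c++", "-"],
        input="int main() {return 0;}\n",
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # same effect as 2>&1 in the one-liner
        text=True,
        check=False,
    )
    return parse_hip_installation(result.stdout)


# Invented sample lines, used here instead of invoking a real compiler.
hip_sample = "Found HIP installation: /opt/rocm/5.4.6, version 5.4.0\n"
nvcc_sample = "#$ _SPACE_= \n#$ TOP=/usr/local/cuda/bin/..\n"

print(parse_hip_installation(hip_sample))
print(parse_nvcc_top(nvcc_sample))
```

Keeping the parsing separate from the subprocess call makes it possible to test the pattern against saved compiler output from different ROCm and CUDA versions.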

@e-kayrakli
Contributor

The magic words you're grepping for are still a portability issue. If it helps with the current situation, I am not against it, though. We'll be exchanging one problem for a (hopefully) smaller one. Note that this could still be augmented with more checks using find and the like.

@Guillaume-Helbecque
Contributor Author

does this report the right path on your system?

Yes, it does. In my case, which hipcc is also returning the right path.

@lydia-duncan
Member

This was also resolved by @jabraham17's #26072. Thanks for fixing it, Jade, and thanks again for reporting it, @Guillaume-Helbecque!
