
HYPRE Struct - problems using GPU-aware MPI #1131

Open
ondrejchrenko opened this issue Sep 17, 2024 · 4 comments

@ondrejchrenko commented Sep 17, 2024

Dear HYPRE developers,

Following up on issue #1126, I've been able to integrate HYPRE into my code and run it on multiple GPUs. However, when I enable GPU-aware MPI in HYPRE, I get segmentation faults of the following type when running the code:
```
[acn16:283118:0:283118] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x15450e000004)
==== backtrace (tid: 283118) ====
0 0x0000000000012d20 __funlockfile() :0
1 0x00000000009a6891 hypre_FinalizeCommunication() /scratch/project/open-29-3/hypre-master_paragpu2/src/struct_mv/struct_communication.c:1216
2 0x00000000009b37de hypre_StructMatrixAssemble() /scratch/project/open-29-3/hypre-master_paragpu2/src/struct_mv/struct_matrix.c:1436
3 0x00000000009968c6 HYPRE_StructMatrixAssemble() /scratch/project/open-29-3/hypre-master_paragpu2/src/struct_mv/HYPRE_struct_matrix.c:323
```

When HYPRE is not used, my code runs with GPU-aware MPI without problems. Any ideas what could be causing these errors?
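For reference, my GPU initialization follows the usual hypre pattern, roughly as below (a minimal sketch, assuming hypre was configured with --with-cuda --enable-gpu-aware-mpi; the setter names are the ones I believe HYPRE_utilities.h currently exposes, so please correct me if the entry points differ):

```c
#include <mpi.h>
#include "HYPRE.h"
#include "HYPRE_utilities.h"

int main(int argc, char *argv[])
{
   MPI_Init(&argc, &argv);

   /* Sketch: global hypre GPU setup (names as in recent hypre releases;
      older releases use HYPRE_Init() instead of HYPRE_Initialize()). */
   HYPRE_Initialize();
   HYPRE_SetMemoryLocation(HYPRE_MEMORY_DEVICE);  /* matrix/vector data on the GPU */
   HYPRE_SetExecutionPolicy(HYPRE_EXEC_DEVICE);   /* run kernels on the GPU */
   HYPRE_SetGPUAwareMPI(1);                       /* pass device buffers to MPI */

   /* ... Struct grid/stencil/matrix setup and solve ... */

   HYPRE_Finalize();
   MPI_Finalize();
   return 0;
}
```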

Thank you,
Ondrej

@pledac commented Oct 16, 2024

Hello, I have similar issues with hypre (CG + BoomerAMG, used through PETSc) and GPU-aware MPI:

  • OpenMPI 4.x (not GPU-aware) -> OK for all my tests
  • OpenMPI 4.x, GPU-aware -> KSP_DIVERGED for some tests
  • OpenMPI 5.0.5, GPU-aware -> OK for all my tests!

The KSP_DIVERGED failures occur with hypre BoomerAMG above a certain number of GPUs and only with the CG solver; they can be bypassed by switching to the BiCGStab solver.
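In PETSc terms, the workaround is just a KSP type change (a minimal sketch with the standard PETSc calls; error checking omitted for brevity):

```c
#include <petscksp.h>

/* Sketch: build a KSP that uses BiCGStab (instead of CG) with a hypre
   BoomerAMG preconditioner. */
static KSP make_bcgs_boomeramg(MPI_Comm comm)
{
   KSP ksp;
   PC  pc;
   KSPCreate(comm, &ksp);
   KSPSetType(ksp, KSPBCGS);        /* KSPCG diverges in the failing cases */
   KSPGetPC(ksp, &pc);
   PCSetType(pc, PCHYPRE);
   PCHYPRESetType(pc, "boomeramg");
   KSPSetFromOptions(ksp);          /* or at run time: -ksp_type bcgs */
   return ksp;
}
```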

Could you check your OpenMPI version and test with 5.0.5?

Thanks

@ondrejchrenko (Author)

Hi and thanks for your feedback!

I've been using OpenMPI 4.1.6, so I'll try a 5.x version and let you know the result. In my case, the problem is not solver-dependent and occurs at the first assembly of the matrix, i.e. in the sequence sketched below.
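A minimal sketch of where it fails (standard HYPRE Struct calls; the grid, stencil, box extents, and stencil entries are set up as usual, and vals is a device buffer when hypre runs with device memory):

```c
#include "HYPRE_struct_mv.h"

/* Sketch of the failing sequence; 'vals' lives in GPU memory when
   HYPRE_MEMORY_DEVICE is in effect. */
static HYPRE_StructMatrix assemble_matrix(HYPRE_StructGrid    grid,
                                          HYPRE_StructStencil stencil,
                                          HYPRE_Int *ilower, HYPRE_Int *iupper,
                                          HYPRE_Int nentries, HYPRE_Int *entries,
                                          HYPRE_Real *vals)
{
   HYPRE_StructMatrix A;
   HYPRE_StructMatrixCreate(MPI_COMM_WORLD, grid, stencil, &A);
   HYPRE_StructMatrixInitialize(A);
   HYPRE_StructMatrixSetBoxValues(A, ilower, iupper, nentries, entries, vals);
   HYPRE_StructMatrixAssemble(A);   /* <-- segfaults here, in hypre_FinalizeCommunication */
   return A;
}
```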

Cheers,
Ondrej

@ondrejchrenko (Author)

Hi again,

I've tested with OpenMPI 5.0.5, but unfortunately I'm getting the same segfault:

```
[acn35:1588861:0:1588861] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x153b88e00004)
==== backtrace (tid: 1588861) ====
0 0x0000000000012d10 __funlockfile() :0
1 0x00000000009a6851 hypre_FinalizeCommunication() /scratch/project/open-29-3/hypre-master_paragpu5/src/struct_mv/struct_communication.c:1216
2 0x00000000009b379e hypre_StructMatrixAssemble() /scratch/project/open-29-3/hypre-master_paragpu5/src/struct_mv/struct_matrix.c:1436
3 0x0000000000996886 HYPRE_StructMatrixAssemble() /scratch/project/open-29-3/hypre-master_paragpu5/src/struct_mv/HYPRE_struct_matrix.c:323
```

Any other ideas are welcome...

Cheers,
Ondrej

@ondrejchrenko (Author) commented Oct 23, 2024

Dear HYPRE developers, I would appreciate some additional feedback.

I have been trying to adapt one of the example codes, ex3.c, to reproduce the error occurring on my cluster. The modified source code can be found here: https://github.com/ondrejchrenko/HYPRE_ex3

Could you please let me know:

  • whether the modifications I've made correctly convert the example for use on GPU clusters with CUDA-aware MPI (the key change is sketched below), and
  • whether you can reproduce the error when running the example on multiple GPUs with CUDA-aware MPI, or whether the code runs fine for you.
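In short, the modifications move the stencil value arrays into device memory before the SetBoxValues calls, along these lines (a sketch of the pattern only, not the exact code from the repo; nvalues stands for the box volume times the number of stencil entries):

```c
#include <stdlib.h>
#include <cuda_runtime.h>
#include "HYPRE_struct_mv.h"

/* Sketch: fill the stencil values on the host, copy them to the GPU, and
   hand the device pointer to hypre. 'A', the box extents, and
   'stencil_indices' are set up as in ex3.c. */
static void set_box_values_on_device(HYPRE_StructMatrix A,
                                     HYPRE_Int *ilower, HYPRE_Int *iupper,
                                     HYPRE_Int nentries, HYPRE_Int *stencil_indices,
                                     size_t nvalues)
{
   double *values_h = (double *) malloc(nvalues * sizeof(double));
   double *values_d;
   /* ... fill values_h with the stencil coefficients ... */
   cudaMalloc((void **) &values_d, nvalues * sizeof(double));
   cudaMemcpy(values_d, values_h, nvalues * sizeof(double), cudaMemcpyHostToDevice);
   HYPRE_StructMatrixSetBoxValues(A, ilower, iupper, nentries, stencil_indices, values_d);
   free(values_h);
}
```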

Cheers,
Ondrej
