MPI Derived Datatypes, Scatter with ob1 and uct produces deadlocks #12698

Closed
AnnaLena77 opened this issue Jul 22, 2024 · 28 comments · Fixed by #12823
Comments

@AnnaLena77

Background information

For research reasons, I need to use Open MPI with the ob1 PML in conjunction with InfiniBand (UCX's uct as the BTL) on our cluster. This largely works without problems.
In my particular program I am trying to split a matrix into sub-matrices (using derived datatypes) and distribute them to all processes using scatter.

What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)

Version 5.0.3

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

git clone

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

+6f81bfd163f3275d2b0630974968c82759dd4439 3rd-party/openpmix (v1.1.3-3983-g6f81bfd1)
+4f27008906d96845e22df6502d6a9a29d98dec83 3rd-party/prrte (psrvr-v2.0.0rc1-4746-g4f27008906)
dfff67569fb72dbf8d73a1dcf74d091dad93f71b config/oac (heads/main)

Please describe the system on which you are running

  • Operating system/version: NixOS
  • Computer hardware: Cluster with AMD Opteron 6272 processors
  • Network type: InfiniBand (mlx4_0) / Ethernet

Details of the problem

The scatter always ends in a deadlock if the matrix is sufficiently large (900x900 in this example; 90x90 still works) and ob1 is used in conjunction with uct.
(I am using Slurm scripts to distribute all jobs.)

mpirun -n 81 --mca pml ucx ./test                  --> works
mpirun -n 81 --mca pml ob1 ./test                  --> deadlock
mpirun -n 81 --mca pml ob1 --mca btl ^uct ./test   --> works

If I use plain MPI datatypes (e.g. MPI_DOUBLE) instead of derived datatypes, everything also works with uct. So the problem is clearly related to the derived datatypes being used.

MPI Program:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int N = 900;  // matrix dimension
    int block_size = N / (int)sqrt(size);  // assumes size is a perfect square and N is divisible by sqrt(size)

    int *matrix = NULL;

    if (rank == 0) {
        matrix = (int *)malloc(N * N * sizeof(int));
        for (int i = 0; i < N; i++) {
            for (int j = 0; j < N; j++) {
                matrix[i * N + j] = i * N + j;
            }
        }
    }
    MPI_Datatype blocktype, blocktype_resized;
    int array_of_sizes[2] = {N, N};
    int array_of_subsizes[2] = {block_size, block_size};
    int array_of_starts[2] = {0, 0};
    /* one block_size x block_size sub-block of the N x N matrix */
    MPI_Type_create_subarray(2, array_of_sizes, array_of_subsizes, array_of_starts, MPI_ORDER_C, MPI_INT, &blocktype);
    /* shrink the extent so consecutive scatter blocks start block_size ints apart */
    MPI_Type_create_resized(blocktype, 0, block_size * sizeof(int), &blocktype_resized);
    MPI_Type_commit(&blocktype_resized);

    int *recvbuf = (int *)malloc(block_size * block_size * sizeof(int));
    
    MPI_Scatter(matrix, 1, blocktype_resized, recvbuf, block_size * block_size, MPI_INT, 0, MPI_COMM_WORLD);

    free(recvbuf);
    if (rank == 0) {
        free(matrix);
    }

    MPI_Type_free(&blocktype_resized);
    MPI_Finalize();
    return 0; 
}
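
For reference, the reproducer above can be built with something like the following (the file name is just an example; -lm is needed because of sqrt()) and then launched with the mpirun commands shown above:

mpicc -o test test.c -lm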
@AnnaLena77 AnnaLena77 changed the title MPI Derived Datatypes, Scatterv with ob1 and uct produces deadlocks MPI Derived Datatypes, Scatter with ob1 and uct produces deadlocks Jul 22, 2024
@ggouaillardet
Contributor

There is a known issue in the coll/tuned component that can occur when the send and receive sides use matching type signatures built from different datatypes and counts.

Instead of receiving block_size * block_size MPI_INT, what if you create a derived datatype made of block_size * block_size contiguous MPI_INT and then receive 1 instance of this newly created derived datatype?

@AnnaLena77
Author

Oh yes, that works fine, even with the uct btl.
Sorry, I searched for ages but could not find the known issue.

@AnnaLena77
Author

AnnaLena77 commented Jul 24, 2024

I have to correct myself: unfortunately, the approach does not work. I had first tried it with the same datatype ("blocktype_resized") on the receive side. That no longer caused a deadlock, but it is not what we want and leads to problems in other situations: block_size*block_size integers must be received, not one blocktype_resized datatype.

When we reworked our code today, we noticed that a new derived datatype created with MPI_Type_contiguous still leads to a deadlock with uct:

MPI_Type_contiguous(block_size * block_size, MPI_INT, &recvtype);
MPI_Type_commit(&recvtype);
MPI_Scatter(matrix, 1, blocktype_resized, recvbuf, 1, recvtype, 0, MPI_COMM_WORLD);

--> Deadlock

@AnnaLena77 AnnaLena77 reopened this Jul 24, 2024
@ggouaillardet
Contributor

I will try to reproduce the issue.

Meanwhile, you can try the following workaround and see how it goes

mpirun --mca coll ^tuned ...

@bosilca
Member

bosilca commented Jul 24, 2024

None of the tuned scatter implementations support segmentation, so scatter is immune to the datatype-mismatch issue. I quickly scanned the scatter algorithms in tuned, and as far as I can see they all exchange the entire data in a single message (with different communication patterns, but the data goes in a single message). The UCT BTL is not up to date; if the test runs with the TCP BTL, that is a clear indication of where the problem is coming from.

@AnnaLena77
Author

AnnaLena77 commented Jul 24, 2024

I will try to reproduce the issue.

Meanwhile, you can try the following workaround and see how it goes

mpirun --mca coll ^tuned ...

Unfortunately, that doesn't work either; the deadlock remains. The only time I no longer have a deadlock is when I use the tcp btl instead of uct:

mpirun --mca pml ob1 --mca btl ^uct  ./test

However, this is not a good workaround for my use case either. Are there plans to fix the problem at some point?

@AnnaLena77
Author

I know the collective modules a little, as I have been working with the code for some time.
Tuned selects an algorithm from "base" based on decision criteria and executes it. HAN has the highest priority but falls back to tuned because the processes are not distributed evenly across the nodes.
If I lower the priority of both modules, the "base" module is used automatically; in this example that is the basic linear scatter algorithm.
So even if I switch off "tuned", the same function (from base) is called in the collective module, and the error occurs again in the uct btl.
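
One way to confirm which collective component and algorithm end up being used is to raise the coll framework's verbosity, for example:

mpirun -n 81 --mca pml ob1 --mca coll_base_verbose 10 ./test

(The selection output names the component and the chosen algorithm; the exact messages vary between versions.)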

@ggouaillardet
Contributor

@bosilca has a good point, and the issue is very unlikely to be the "known bug" I initially mentioned, but more likely something with btl/uct. Out of curiosity, why are you using pml/ob1 and btl/uct instead of pml/ucx?

In order to rule out the collective components, you can

mpirun --mca coll basic,libnbc,inter ...

and see if it helps.

@AnnaLena77
Author

AnnaLena77 commented Jul 25, 2024

I am working on an Open MPI fork that measures internal performance data (similar to Peruse) at runtime and writes it to a database.

This performance data is then visualized in a frontend at runtime. It is designed for students, to show them what is happening on the cluster.

Among other things, it is possible to visualize at runtime which algorithm is used for collective communication, how this affects the individual ranks, what they are doing at that moment, where potential late senders are, etc.

In the ob1 PML I can measure internal late-sender times, for example. I have been working with ob1 and its internals for a long time and have learned how it works. Our cluster has an InfiniBand interconnect, so I use ob1 in conjunction with uct; the runtimes with TCP are not acceptable.

UCX has a completely different architecture and I would have to start my work from scratch.

In addition, the students start their programs from the frontend via a button and of course have no knowledge of the various PMLs, BTLs and collective components, so the mpirun command executed in the background is fixed and cannot be adapted to work around the error.

Everything worked wonderfully until the students started transferring derived datatypes via scatter.

@ggouaillardet
Contributor

I see. For the time being, you might want to give Open MPI v4 and the btl/openib component a try.
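
A sketch of how that might be launched on the v4 series (untested; openib, vader and self are the relevant v4 BTL names, and the openib BTL was removed in the v5 series):

mpirun -n 81 --mca pml ob1 --mca btl openib,vader,self ./test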

@ggouaillardet
Contributor

@AnnaLena77 which UCX version are you using?

@AnnaLena77
Author

@AnnaLena77 which UCX version are you using?

1.16.0

@ggouaillardet
Contributor

@AnnaLena77 can you try to

mpirun -np 81 --mca pml ob1 --mca btl_uct_eager_limit 8178 ./test

@AnnaLena77
Author

AnnaLena77 commented Jul 25, 2024

@AnnaLena77 can you try to

mpirun -np 81 --mca pml ob1 --mca btl_uct_eager_limit 8178 ./test

No, that does not work either. I will have a look at the code and find out exactly where the deadlock comes from.

@ggouaillardet
Contributor

ggouaillardet commented Jul 25, 2024

FWIW, I do not observe a deadlock but a crash (assertion error; I configure'd with --enable-debug).

I use this trimmed reproducer based on your test case and run it on a single node with 2 MPI tasks:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int N = 900;
    int block_size = N/2;
    char *buf = malloc(N*N*sizeof(int));
    MPI_Datatype blocktype, blocktype_tmp;
    int array_of_sizes[2] = {N, N};
    int array_of_subsizes[2] = {block_size, block_size};
    int array_of_starts[2] = {0, 0};
    MPI_Type_create_subarray(2, array_of_sizes, array_of_subsizes, array_of_starts, MPI_ORDER_C, MPI_INT, &blocktype_tmp);
    MPI_Type_create_resized(blocktype_tmp, 0, block_size * sizeof(int), &blocktype);
    MPI_Type_commit(&blocktype);

    if (0 == rank) {
        MPI_Send(buf, 1, blocktype, 1, 0, MPI_COMM_WORLD);
    } else if (1 == rank) {
        MPI_Recv(buf, block_size*block_size, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    free(buf);
    MPI_Type_free(&blocktype);
    MPI_Finalize();
    return 0; 
}

and here is the stack trace

[1,0]<stderr>: sv: ../../../../../../src/ompi-v5.0.x/opal/mca/btl/uct/btl_uct_am.c:73: _mca_btl_uct_send_pack: Assertion `length == payload_size' failed.
[1,0]<stdout>:
[1,0]<stdout>: Thread 1 "sv" received signal SIGABRT, Aborted.
[1,0]<stdout>: 0x0000fffff7886274 in raise () from /lib64/libc.so.6
[1,0]<stdout>: #0  0x0000fffff7886274 in raise () from /lib64/libc.so.6
[1,0]<stdout>: #1  0x0000fffff7870a2c in abort () from /lib64/libc.so.6
[1,0]<stdout>: #2  0x0000fffff787fba0 in __assert_fail_base () from /lib64/libc.so.6
[1,0]<stdout>: #3  0x0000fffff787fc18 in __assert_fail () from /lib64/libc.so.6
[1,0]<stdout>: #4  0x0000fffff7573f6c in _mca_btl_uct_send_pack (data=0xfffff4c00020, header=0x0, header_size=0, convertor=0x73ae60, payload_size=8214) at ../../../../../../src/ompi-v5.0.x/opal/mca/btl/uct/btl_uct_am.c:73
[1,0]<stdout>: #5  0x0000fffff7574074 in mca_btl_uct_prepare_src (btl=0x5e2330, endpoint=0x67e2b0, convertor=0x73ae60, order=255 '\377', reserve=32, size=0xffffffffdbc8, flags=70) at ../../../../../../src/ompi-v5.0.x/opal/mca/btl/uct/btl_uct_am.c:96
[1,0]<stdout>: #6  0x0000fffff7d7bed0 in mca_bml_base_prepare_src (bml_btl=0x6afcf0, conv=0x73ae60, order=255 '\377', reserve=32, size=0xffffffffdbc8, flags=70, des=0xffffffffdbd0) at ../../../../../../src/ompi-v5.0.x/ompi/mca/bml/bml.h:339
[1,0]<stdout>: #7  0x0000fffff7d7f188 in mca_pml_ob1_send_request_schedule_once (sendreq=0x73ad80) at ../../../../../../src/ompi-v5.0.x/ompi/mca/pml/ob1/pml_ob1_sendreq.c:1179
[1,0]<stdout>: #8  0x0000fffff7d70f28 in mca_pml_ob1_send_request_schedule_exclusive (sendreq=0x73ad80) at ../../../../../../src/ompi-v5.0.x/ompi/mca/pml/ob1/pml_ob1_sendreq.h:327
[1,0]<stdout>: #9  0x0000fffff7d70fac in mca_pml_ob1_send_request_schedule (sendreq=0x73ad80) at ../../../../../../src/ompi-v5.0.x/ompi/mca/pml/ob1/pml_ob1_sendreq.h:351
[1,0]<stdout>: #10 0x0000fffff7d73068 in mca_pml_ob1_recv_frag_callback_ack (btl=0x5e2330, descriptor=0xffffffffdd18) at ../../../../../../src/ompi-v5.0.x/ompi/mca/pml/ob1/pml_ob1_recvfrag.c:772
[1,0]<stdout>: #11 0x0000fffff756b5fc in mca_btl_uct_am_handler (arg=0x5e4510, data=0xfffff4d90002, length=48, flags=0) at ../../../../../../src/ompi-v5.0.x/opal/mca/btl/uct/btl_uct_component.c:327
[1,0]<stdout>: #12 0x0000fffff58b1934 in uct_dc_mlx5_iface_progress_ll () from /lib64/ucx/libuct_ib.so.0
[1,0]<stdout>: #13 0x0000fffff756a2c8 in ucs_callbackq_dispatch (cbq=0x5e2140) at /usr/include/ucs/datastruct/callbackq.h:215
[1,0]<stdout>: #14 0x0000fffff756a684 in uct_worker_progress (worker=0x5e2140) at /usr/include/uct/api/uct.h:2787
[1,0]<stdout>: #15 mca_btl_uct_context_progress (context=0x5e4510) at ../../../../../../src/ompi-v5.0.x/opal/mca/btl/uct/btl_uct_device_context.h:165
[1,0]<stdout>: #16 0x0000fffff756c170 in mca_btl_uct_tl_progress (tl=0x5e2f30, starting_index=0) at ../../../../../../src/ompi-v5.0.x/opal/mca/btl/uct/btl_uct_component.c:575
[1,0]<stdout>: #17 0x0000fffff756c324 in mca_btl_uct_component_progress () at ../../../../../../src/ompi-v5.0.x/opal/mca/btl/uct/btl_uct_component.c:626
[1,0]<stdout>: #18 0x0000fffff74c6b7c in opal_progress () at ../../../src/ompi-v5.0.x/opal/runtime/opal_progress.c:224
[1,0]<stdout>: #19 0x0000fffff7d69b04 in ompi_request_wait_completion (req=0x73ad80) at ../../../../../../src/ompi-v5.0.x/ompi/request/request.h:492
[1,0]<stdout>: #20 0x0000fffff7d6c604 in mca_pml_ob1_send (buf=0xfffff4ca0010, count=1, datatype=0x62add0, dst=1, tag=0, sendmode=MCA_PML_BASE_SEND_STANDARD, comm=0x420080 <ompi_mpi_comm_world>) at ../../../../../../src/ompi-v5.0.x/ompi/mca/pml/ob1/pml_ob1_isend.c:327
[1,0]<stdout>: #21 0x0000fffff7b6804c in PMPI_Send (buf=0xfffff4ca0010, count=1, type=0x62add0, dest=1, tag=0, comm=0x420080 <ompi_mpi_comm_world>) at ../../../../../src/ompi-v5.0.x/ompi/mpi/c/send.c:93
[1,0]<stdout>: #22 0x0000000000400cb0 in main (argc=1, argv=0xffffffffe218) at sv.c:23

When the crash occurs, payload_size is 8214, which is not a multiple of 4, so opal_convertor_pack() packs only 8212 bytes and then the assertion fails.

Packing an "unaligned" number of bytes (so to speak) looks fishy to me and looks like a good place to start investigating.
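
As a tiny standalone illustration (not Open MPI code; the numbers are taken from the crash above), this is the arithmetic behind the failing check:

#include <stdio.h>

int main(void) {
    size_t payload_size = 8214;                   /* bytes requested by the BTL */
    size_t elem = sizeof(int);                    /* MPI_INT is 4 bytes here */
    size_t packed = (payload_size / elem) * elem; /* the convertor packs whole ints: 8212 */
    printf("requested %zu bytes, packed %zu bytes\n", payload_size, packed);
    /* packed != payload_size, which is exactly the assert(length == payload_size)
     * in _mca_btl_uct_send_pack() that fires */
    return 0;
}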

Good luck!

@AnnaLena77
Author

AnnaLena77 commented Jul 25, 2024

FWIW, I do not observe a deadlock but a crash (assertion error; I configure'd with --enable-debug). [...] Packing an "unaligned" number of bytes (so to speak) looks fishy to me and looks like a good place to start investigating.

Good luck!

The code is not completely correct and does not lead to the desired result. Unfortunately, I cannot post “perfect” code here, as this is part of an exam. However, the code should not lead to a deadlock.
With 2 MPI tasks, the problem never occurs for me, even if I explicitly select uct as btl here.

@AnnaLena77
Author

Strange... very very strange:
The problem is caused by the combination of uct and sm.

mpirun -n 81 --mca pml ob1 --mca btl ^uct   ./cannon_test

--> works!

mpirun -n 81 --mca pml ob1 --mca btl uct  ./cannon_test

--> works

mpirun -n 81 --mca pml ob1 --mca btl uct,sm  ./cannon_test

--> deadlock

@ggouaillardet
Contributor

I edited my reproducer and changed the allocated size of the buf buffer.

If Open MPI is configure'd with --disable-debug, the program neither hangs nor crashes.
I strongly suggest you configure with --enable-debug.
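
For example, a minimal configure sketch (the install prefix is just an example):

./configure --prefix=$HOME/ompi-debug --enable-debug
make -j && make install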

@AnnaLena77
Author

I edited my reproducer and changed the allocated size of the buf buffer.

If Open MPI is configure'd with --disable-debug, the program neither hangs nor crashes. I strongly suggest you configure with --enable-debug.

Yes, I can reproduce the problem right now. Thank you.

@ggouaillardet
Contributor

You can try this patch:

diff --git a/opal/mca/btl/uct/btl_uct_tl.c b/opal/mca/btl/uct/btl_uct_tl.c
index 2205508389..fc71b7692b 100644
--- a/opal/mca/btl/uct/btl_uct_tl.c
+++ b/opal/mca/btl/uct/btl_uct_tl.c
@@ -480,10 +480,10 @@ static void mca_btl_uct_set_tl_am(mca_btl_uct_module_t *module, mca_btl_uct_tl_t
         tl->max_device_contexts = mca_btl_uct_component.num_contexts_per_module;
     }
 
-    module->super.btl_eager_limit = MCA_BTL_UCT_TL_ATTR(tl, 0).cap.am.max_bcopy
+    module->super.btl_eager_limit = MCA_BTL_UCT_TL_ATTR(tl, 0).cap.am.max_bcopy & ~0x0F
                                     - sizeof(mca_btl_uct_am_header_t);
     if (MCA_BTL_UCT_TL_ATTR(tl, 0).cap.flags & UCT_IFACE_FLAG_AM_ZCOPY) {
-        module->super.btl_max_send_size = MCA_BTL_UCT_TL_ATTR(tl, 0).cap.am.max_zcopy
+        module->super.btl_max_send_size = MCA_BTL_UCT_TL_ATTR(tl, 0).cap.am.max_zcopy & ~0x0F
                                           - sizeof(mca_btl_uct_am_header_t);
     } else {
         module->super.btl_max_send_size = module->super.btl_eager_limit;

It works when only btl/uct is used, but it still crashes when btl/sm is used.

@bosilca at this stage my analysis is that when btl/uct is used, all BTLs (i.e. not only btl/uct) can only send full predefined datatypes. If my understanding is correct, I have no idea how to enforce this.

@bosilca
Member

bosilca commented Jul 25, 2024

@ggouaillardet I'm confused about this patch. It leads to a waste of up to 15 bytes, for data that is not required to be aligned.

What brings you to the conclusion that all BTLs can only send full predefined datatypes?

@AnnaLena77
Author

The send/recv example from @ggouaillardet works up to a size of N=91; from N=92 on I get the deadlock.

The message size would then be 16200 bytes. Larger messages are no longer sent, but smaller ones are.

@ggouaillardet
Contributor

@bosilca the patch is indeed wrong for the general case. Here are my observations:

static inline void _mca_btl_uct_send_pack(void *data, void *header, size_t header_size,
                                          opal_convertor_t *convertor, size_t payload_size)
{
    uint32_t iov_count = 1;
    struct iovec iov;
    size_t length;

    if (header_size > 0) {
        assert(NULL != header);
        memcpy(data, header, header_size);
    }

    /* pack the data into the supplied buffer */
    iov.iov_base = (IOVBASE_TYPE *) ((intptr_t) data + header_size);
    iov.iov_len = length = payload_size;

    (void) opal_convertor_pack(convertor, &iov, &iov_count, &length);

    assert(length == payload_size);
}

The assert() here basically means btl/uct will crash when trying to send an incomplete predefined datatype
(because opal_convertor_pack() will then return length < payload_size).
The patch avoids this for a ddt made of a single predefined datatype (it fails badly at handling, for example, a ddt made of a single byte followed by many ints). If several BTLs are used, there is an additional constraint: the other BTLs must not send incomplete predefined datatypes either, otherwise btl/uct would be passed an unaligned pointer and might try to send an incomplete datatype too.
To be perfectly clear, if btl/uct is not used, other BTL components are free to send incomplete predefined datatypes.

I do not think I understand the UCX API well enough to provide a correct fix, but at this stage my conclusion is that _mca_btl_uct_send_pack() should not try to send an incomplete predefined datatype.
FWIW, we cannot simply get rid of the assert(), otherwise other assert()s will fail later.

@ggouaillardet
Contributor

btl/sm and btl/uct should not be used together anyway, according to the comment in btl_uct_component.c, so that is an issue that can easily be avoided.
I tried to address the issue with the inline patch below; it might not be enough, but I hope it helps!

diff --git a/opal/mca/btl/uct/btl_uct_am.c b/opal/mca/btl/uct/btl_uct_am.c
index 312c85e83e..33a6df9faf 100644
--- a/opal/mca/btl/uct/btl_uct_am.c
+++ b/opal/mca/btl/uct/btl_uct_am.c
@@ -51,7 +51,7 @@ mca_btl_base_descriptor_t *mca_btl_uct_alloc(mca_btl_base_module_t *btl,
 }
 
 static inline void _mca_btl_uct_send_pack(void *data, void *header, size_t header_size,
-                                          opal_convertor_t *convertor, size_t payload_size)
+                                          opal_convertor_t *convertor, size_t *payload_size, int assrt)
 {
     uint32_t iov_count = 1;
     struct iovec iov;
@@ -64,11 +64,11 @@ static inline void _mca_btl_uct_send_pack(void *data, void *header, size_t heade
 
     /* pack the data into the supplied buffer */
     iov.iov_base = (IOVBASE_TYPE *) ((intptr_t) data + header_size);
-    iov.iov_len = length = payload_size;
+    iov.iov_len = length = *payload_size;
 
-    (void) opal_convertor_pack(convertor, &iov, &iov_count, &length);
+    (void) opal_convertor_pack(convertor, &iov, &iov_count, payload_size);
 
-    assert(length == payload_size);
+    if(assrt) assert(length == *payload_size);
 }
 
 struct mca_btl_base_descriptor_t *mca_btl_uct_prepare_src(mca_btl_base_module_t *btl,
@@ -92,7 +92,9 @@ struct mca_btl_base_descriptor_t *mca_btl_uct_prepare_src(mca_btl_base_module_t
         }
 
         _mca_btl_uct_send_pack((void *) ((intptr_t) frag->uct_iov.buffer + reserve), NULL, 0,
-                               convertor, *size);
+                               convertor, size, 0);
+        frag->segments[0].seg_len = reserve + *size;
+        frag->uct_iov.length = reserve + *size;
     } else {
         opal_convertor_get_current_pointer(convertor, &data_ptr);
         assert(NULL != data_ptr);
@@ -286,7 +288,7 @@ static size_t mca_btl_uct_sendi_pack(void *data, void *arg)
 
     am_header->value = args->am_header;
     _mca_btl_uct_send_pack((void *) ((intptr_t) data + 8), args->header, args->header_size,
-                           args->convertor, args->payload_size);
+                           args->convertor, &args->payload_size, 1);
     return args->header_size + args->payload_size + 8;
 }
 
@@ -329,7 +331,7 @@ int mca_btl_uct_sendi(mca_btl_base_module_t *btl, mca_btl_base_endpoint_t *endpo
     } else if (msg_size < (size_t) MCA_BTL_UCT_TL_ATTR(uct_btl->am_tl, context->context_id)
                               .cap.am.max_short) {
         int8_t *data = alloca(total_size);
-        _mca_btl_uct_send_pack(data, header, header_size, convertor, payload_size);
+        _mca_btl_uct_send_pack(data, header, header_size, convertor, &payload_size, 1);
         ucs_status = uct_ep_am_short(ep_handle, MCA_BTL_UCT_FRAG, am_header.value, data,
                                      total_size);
     } else {
diff --git a/opal/mca/btl/uct/btl_uct_component.c b/opal/mca/btl/uct/btl_uct_component.c
index 51c7152423..ba97339d7b 100644
--- a/opal/mca/btl/uct/btl_uct_component.c
+++ b/opal/mca/btl/uct/btl_uct_component.c
@@ -103,7 +103,7 @@ static int mca_btl_uct_component_register(void)
 #endif
 
     /* for now we want this component to lose to btl/ugni and btl/vader */
-    module->super.btl_exclusivity = MCA_BTL_EXCLUSIVITY_HIGH;
+    module->super.btl_exclusivity = MCA_BTL_EXCLUSIVITY_HIGH-2;
 
     return mca_btl_base_param_register(&mca_btl_uct_component.super.btl_version, &module->super);
 }

@AnnaLena77
Author

AnnaLena77 commented Jul 26, 2024

@ggouaillardet
Thank you very much for your help! I have applied the patch and no longer get segmentation faults; most of the programs I have tested are now running correctly.

In one of my students' programs, which performs a matrix multiplication with the Cannon algorithm, there must be a data inconsistency somewhere: above a certain size and number of processes, the final matrix is shifted by exactly one value. The last position of the matrix then contains the value 0.00, while the correct value for the last element ends up in the penultimate position. This only happens when ob1 and uct are used; it does not happen with ob1 and tcp or with the ucx PML, which is why it is probably still related to this problem.

Unfortunately, we can't post the code at the moment as it is part of an exam. I'll try to reproduce it elsewhere this weekend. Nevertheless, we can work with it for now. Thank you very much!

@wenduwan wenduwan added the bug label Aug 13, 2024
@hjelmn
Member

hjelmn commented Sep 19, 2024

@ggouaillardet btl/sm and btl/uct are compatible. What comment are you referring to?

@ggouaillardet
Contributor

/* for now we want this component to lose to btl/ugni and btl/vader */

@bosilca
Member

bosilca commented Sep 20, 2024

that's for performance reasons, not correctness.
