MPI Derived Datatypes, Scatter with ob1 and uct produces deadlocks #12698
There is a known issue in the […]. Instead of receiving […]
Oh yes, that works fine, also with the btl uct.
I have to correct myself. Unfortunately, the approach does not work. I had first tried the whole thing with the same datatype as the receive type ("blocktype_resized"). That no longer caused a deadlock, but it is not our goal and leads to problems in other situations: integers of size block_size*block_size must be received, not a blocktype_resized datatype. When we rebuilt our code today, we noticed this. A new derived datatype (created with MPI_Type_contiguous) still leads to a deadlock with uct:
MPI_Type_contiguous(block_size * block_size, MPI_INT, &recvtype);
MPI_Type_commit(&recvtype);
MPI_Scatter(matrix, 1, blocktype_resized, recvbuf, 1, recvtype, 0, MPI_COMM_WORLD); --> Deadlock
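For context, a self-contained sketch of the pattern described in this comment might look as follows. The names (block_size, blocktype_resized, recvtype) follow the snippet above, but the concrete sizes and the MPI_Type_vector construction of the block type are assumptions, not the original exam code:

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int block_size = 100;            /* assumed block edge length */
    const int N = block_size * size;       /* N x N int matrix on the root */

    int *matrix = NULL;
    if (rank == 0) {
        matrix = malloc((size_t)N * N * sizeof(int));
        for (int i = 0; i < N * N; i++) matrix[i] = i;
    }

    /* block_size rows of block_size ints with stride N, extent shrunk so that
     * consecutive scatter elements start block_size ints apart */
    MPI_Datatype blocktype, blocktype_resized, recvtype;
    MPI_Type_vector(block_size, block_size, N, MPI_INT, &blocktype);
    MPI_Type_create_resized(blocktype, 0, (MPI_Aint)(block_size * sizeof(int)),
                            &blocktype_resized);
    MPI_Type_commit(&blocktype_resized);

    /* receive side: one contiguous chunk of block_size*block_size ints */
    MPI_Type_contiguous(block_size * block_size, MPI_INT, &recvtype);
    MPI_Type_commit(&recvtype);

    int *recvbuf = malloc((size_t)block_size * block_size * sizeof(int));

    /* the call that reportedly deadlocks with pml/ob1 + btl/uct */
    MPI_Scatter(matrix, 1, blocktype_resized, recvbuf, 1, recvtype, 0, MPI_COMM_WORLD);

    printf("rank %d received first element %d\n", rank, recvbuf[0]);

    MPI_Type_free(&blocktype);
    MPI_Type_free(&blocktype_resized);
    MPI_Type_free(&recvtype);
    free(recvbuf);
    free(matrix);
    MPI_Finalize();
    return 0;
}
```

The type signatures on both sides match (block_size*block_size MPI_INT each), so the call itself is legal; only the memory layouts differ.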
I will try to reproduce the issue. Meanwhile, you can try the following workaround and see how it goes: […]
None of the tuned scatter implementations has support for segmentation, thus scatter is immune to the datatype mismatch issue. I quickly scanned through the scatter algorithms in tuned, and as far as I can see they all exchange the entire data in a single message (with different communication patterns, but the data goes in a single message). The UCT BTL is not up to date; if the test runs with the TCP BTL, then that is a clear indication of where the problem is coming from.
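To make concrete what "the entire data goes in a single message" means, here is a simplified linear scatter: each non-root rank receives its whole chunk in one send/recv pair, so the datatype is never segmented. This is an illustrative sketch, not Open MPI's actual coll/tuned implementation:

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

/* Simplified linear scatter: one message per peer, never segmented. */
static int linear_scatter(const void *sendbuf, int sendcount, MPI_Datatype sendtype,
                          void *recvbuf, int recvcount, MPI_Datatype recvtype,
                          int root, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    if (rank != root) {
        /* a single message carries the whole chunk for this rank */
        return MPI_Recv(recvbuf, recvcount, recvtype, root, 0, comm, MPI_STATUS_IGNORE);
    }

    MPI_Aint lb, extent;
    MPI_Type_get_extent(sendtype, &lb, &extent);

    for (int peer = 0; peer < size; peer++) {
        const char *chunk = (const char *)sendbuf + (MPI_Aint)peer * sendcount * extent;
        if (peer == root) {
            /* root's own chunk: local send/recv pair */
            MPI_Sendrecv(chunk, sendcount, sendtype, root, 0,
                         recvbuf, recvcount, recvtype, root, 0, comm, MPI_STATUS_IGNORE);
        } else {
            /* again, one message per peer */
            MPI_Send(chunk, sendcount, sendtype, peer, 0, comm);
        }
    }
    return MPI_SUCCESS;
}

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int chunk = 4;
    int *sendbuf = NULL, recvbuf[4];
    if (rank == 0) {
        sendbuf = malloc((size_t)size * chunk * sizeof(int));
        for (int i = 0; i < size * chunk; i++) sendbuf[i] = i;
    }
    linear_scatter(sendbuf, chunk, MPI_INT, recvbuf, chunk, MPI_INT, 0, MPI_COMM_WORLD);
    printf("rank %d got %d..%d\n", rank, recvbuf[0], recvbuf[chunk - 1]);

    free(sendbuf);
    MPI_Finalize();
    return 0;
}
```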
Unfortunately, that doesn't work either; the deadlock remains at this point. The only time I no longer have a deadlock is when I use the tcp btl instead of uct:
mpirun --mca pml ob1 --mca btl ^uct ./test
However, this is also not the best option for my workaround. Are there plans to fix the problem at some point?
I know the collective modules a little, as I have been working with the code for some time.
@bosilca has a good point, and the issue is very unlikely the "known bug" I initially mentioned, but more likely something with […]. In order to rule out the collective components, you can […] and see if it helps.
I am working on an Open MPI fork that measures internal performance data (similar to PERUSE) at runtime and writes it to a database. This performance data is then visualized in a frontend at runtime. It is designed for students, to show them what is happening on the cluster at runtime. Among other things, it is possible to visualize at runtime which algorithm is used for a collective communication, how this affects the individual ranks, what they are doing at that moment, where potential late senders are, etc. In the ob1 pml I can use it to measure internal late-sender times, for example. I have been working with ob1 and its functionality for a long time and have learned to understand how it works. Our cluster has an InfiniBand interconnect, so I use ob1 in conjunction with uct; the runtimes with TCP are not acceptable. UCX has a completely different architecture, and I would have to start my work from scratch. In addition, the students start their programs from the frontend via a button and of course have no knowledge of the various pmls, btls and collectives, so I have to execute an MPI command in the background that cannot be adapted because of this error. Everything had worked wonderfully until the students started transferring derived datatypes via scatter.
I see; for the time being, you might want to give a try to Open MPI […]
@AnnaLena77 which UCX version are you using?
1.16.0
@AnnaLena77 can you try to […]
No, that does not work... I will have a look at the code and find out exactly where the deadlock comes from.
FWIW, I do not observe a deadlock but a crash (assertion error; I configure'd with […]). I use this trimmed reproducer based on your test case, and run it on a single node with 2 MPI tasks: […]
and here is the stack trace: […]
When the crash occurs, […]. Packing an "unaligned" number of bytes (so to speak) looks fishy to me and looks like a good place to start investigating. Good luck!
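As a back-of-the-envelope illustration of why an "unaligned" byte count is suspicious (plain C, no Open MPI internals; the element size and requested fragment size are made-up values, not taken from the trace): if the pack routine only hands out whole elements, asking for a byte count that is not a multiple of the element size yields fewer bytes than requested, which is exactly the situation an assert(length == payload_size) cannot tolerate.

```c
#include <stdio.h>

int main(void)
{
    /* assumed values for illustration only */
    size_t element_size = 4;        /* e.g. one MPI_INT */
    size_t requested    = 16383;    /* bytes the BTL asks the convertor to pack */

    /* a packer that never splits an element stops at the last whole element */
    size_t packed = (requested / element_size) * element_size;

    printf("requested %zu bytes, packed %zu bytes, short by %zu\n",
           requested, packed, requested - packed);
    return 0;
}
```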
The code is not completely correct and does not lead to the desired result. Unfortunately, I cannot post "perfect" code here, as it is part of an exam. However, the code should not lead to a deadlock.
Strange... very, very strange:
mpirun -n 81 --mca pml ob1 --mca btl ^uct ./cannon_test --> works!
mpirun -n 81 --mca pml ob1 --mca btl uct ./cannon_test --> works
mpirun -n 81 --mca pml ob1 --mca btl uct,sm ./cannon_test --> deadlock
I edited my reproducer and changed the allocated size of the […]. If Open MPI is configure'd with […]
Yes, I can reproduce the problem right now. Thank you.
You can try this patch: […]
It works when only […]. @bosilca, at this stage my analysis is that when […]
@ggouaillardet I'm confused about this patch. It leads to a waste of up to 15 bytes, for data that is not required to be aligned. What brings you to the conclusion that all BTLs can only send full predefined datatypes?
The Send/Recv example from @ggouaillardet works up to a size of N=91; from N=92 on I get the deadlock. The message size would then be 16200 bytes. Larger messages are no longer sent, but smaller ones are.
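For reference, a sweep like the following can bracket the failing message size. This is only a sketch of that kind of test, not the Send/Recv example referred to above; the MPI_Type_vector layout and the size range are assumptions. Run it with 2 ranks; the last value printed before a hang marks the threshold:

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int n = 80; n <= 120; n++) {
        /* n blocks of n ints with stride 2*n: a non-contiguous derived type */
        MPI_Datatype vec;
        MPI_Type_vector(n, n, 2 * n, MPI_INT, &vec);
        MPI_Type_commit(&vec);

        int *buf = calloc((size_t)2 * n * n, sizeof(int));

        if (rank == 0) {
            MPI_Send(buf, 1, vec, 1, n, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(buf, 1, vec, 0, n, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("n = %d (payload %zu bytes) ok\n", n, (size_t)n * n * sizeof(int));
            fflush(stdout);
        }

        free(buf);
        MPI_Type_free(&vec);
    }

    MPI_Finalize();
    return 0;
}
```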
@bosilca the patch is indeed wrong for the general case. Here are my observations: […]
I do not think I understand enough the […]
diff --git a/opal/mca/btl/uct/btl_uct_am.c b/opal/mca/btl/uct/btl_uct_am.c
index 312c85e83e..33a6df9faf 100644
--- a/opal/mca/btl/uct/btl_uct_am.c
+++ b/opal/mca/btl/uct/btl_uct_am.c
@@ -51,7 +51,7 @@ mca_btl_base_descriptor_t *mca_btl_uct_alloc(mca_btl_base_module_t *btl,
}
static inline void _mca_btl_uct_send_pack(void *data, void *header, size_t header_size,
- opal_convertor_t *convertor, size_t payload_size)
+ opal_convertor_t *convertor, size_t *payload_size, int assrt)
{
uint32_t iov_count = 1;
struct iovec iov;
@@ -64,11 +64,11 @@ static inline void _mca_btl_uct_send_pack(void *data, void *header, size_t heade
/* pack the data into the supplied buffer */
iov.iov_base = (IOVBASE_TYPE *) ((intptr_t) data + header_size);
- iov.iov_len = length = payload_size;
+ iov.iov_len = length = *payload_size;
- (void) opal_convertor_pack(convertor, &iov, &iov_count, &length);
+ (void) opal_convertor_pack(convertor, &iov, &iov_count, payload_size);
- assert(length == payload_size);
+ if(assrt) assert(length == *payload_size);
}
struct mca_btl_base_descriptor_t *mca_btl_uct_prepare_src(mca_btl_base_module_t *btl,
@@ -92,7 +92,9 @@ struct mca_btl_base_descriptor_t *mca_btl_uct_prepare_src(mca_btl_base_module_t
}
_mca_btl_uct_send_pack((void *) ((intptr_t) frag->uct_iov.buffer + reserve), NULL, 0,
- convertor, *size);
+ convertor, size, 0);
+ frag->segments[0].seg_len = reserve + *size;
+ frag->uct_iov.length = reserve + *size;
} else {
opal_convertor_get_current_pointer(convertor, &data_ptr);
assert(NULL != data_ptr);
@@ -286,7 +288,7 @@ static size_t mca_btl_uct_sendi_pack(void *data, void *arg)
am_header->value = args->am_header;
_mca_btl_uct_send_pack((void *) ((intptr_t) data + 8), args->header, args->header_size,
- args->convertor, args->payload_size);
+ args->convertor, &args->payload_size, 1);
return args->header_size + args->payload_size + 8;
}
@@ -329,7 +331,7 @@ int mca_btl_uct_sendi(mca_btl_base_module_t *btl, mca_btl_base_endpoint_t *endpo
} else if (msg_size < (size_t) MCA_BTL_UCT_TL_ATTR(uct_btl->am_tl, context->context_id)
.cap.am.max_short) {
int8_t *data = alloca(total_size);
- _mca_btl_uct_send_pack(data, header, header_size, convertor, payload_size);
+ _mca_btl_uct_send_pack(data, header, header_size, convertor, &payload_size, 1);
ucs_status = uct_ep_am_short(ep_handle, MCA_BTL_UCT_FRAG, am_header.value, data,
total_size);
} else {
diff --git a/opal/mca/btl/uct/btl_uct_component.c b/opal/mca/btl/uct/btl_uct_component.c
index 51c7152423..ba97339d7b 100644
--- a/opal/mca/btl/uct/btl_uct_component.c
+++ b/opal/mca/btl/uct/btl_uct_component.c
@@ -103,7 +103,7 @@ static int mca_btl_uct_component_register(void)
#endif
/* for now we want this component to lose to btl/ugni and btl/vader */
- module->super.btl_exclusivity = MCA_BTL_EXCLUSIVITY_HIGH;
+ module->super.btl_exclusivity = MCA_BTL_EXCLUSIVITY_HIGH-2;
return mca_btl_base_param_register(&mca_btl_uct_component.super.btl_version, &module->super);
}
@ggouaillardet In one of my students' programs, which performs a matrix multiplication with the Cannon algorithm, there must be a data inconsistency somewhere, because above a certain size and number of processes the final matrix is shifted by exactly one value. The last position of the matrix then contains the value 0.00, while the correct value for the last element of the matrix is in the penultimate position. This only happens when ob1 and uct are used; it does not happen with ob1 and tcp or with the ucx pml, which is why it is probably still related to this problem. Unfortunately, we can't post the code at the moment as it is an exam. I'll try to reproduce it elsewhere this weekend.
@ggouaillardet btl/sm and btl/uct are compatible. What comment are you referring to?
ompi/opal/mca/btl/uct/btl_uct_component.c, line 105 at 5f00259:
/* for now we want this component to lose to btl/ugni and btl/vader */
That's for performance reasons, not correctness.
Background information
For certain research reasons, I need to use Open MPI with the pml ob1 in conjunction with Infiniband (ucx/uct as btl) on our cluster. This works largely without any problems.
In my particular program I am trying to split a matrix into sub-matrices (using Derived Datatypes) and distribute them to all processes using scatter.
What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)
Version 5.0.3
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
git clone
If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.
+6f81bfd163f3275d2b0630974968c82759dd4439 3rd-party/openpmix (v1.1.3-3983-g6f81bfd1)
+4f27008906d96845e22df6502d6a9a29d98dec83 3rd-party/prrte (psrvr-v2.0.0rc1-4746-g4f27008906)
dfff67569fb72dbf8d73a1dcf74d091dad93f71b config/oac (heads/main)
Please describe the system on which you are running
Details of the problem
The scatter always ends in a deadlock if the matrix is chosen correspondingly large (here in the example 900x900; size 90 still works) and ob1 is used in conjunction with uct.
(I am using Slurm scripts to distribute all jobs.)
If I use normal MPI datatypes (e.g. MPI_DOUBLE) instead of the derived datatypes, everything also works with uct. So the problem is definitely with the derived datatypes that are being used.
MPI Program: