Intermittent flakiness of v7.0 RC #1676

danielmarbach · 2024-09-16T10:30:14Z

Describe the bug

It is still a bit early to say what actually causes these things to happen, and it might very well be that the problem is somewhere in our implementation. But we (@bording and I) figured we'd give a heads-up in case it ends up being something in the client.

We have migrated to the latest RC of the v7 version and are seeing some intermittent test failures:

https://github.com/Particular/NServiceBus.RabbitMQ/actions/runs/10852801383?pr=1446

For some reason, the consume isn't happening, which messes up all the other tests.

The thing we might be seeing so far here would be if the BasicPublishAsync task is somehow completing before the message has actually been fully sent and confirmed. This would then let the BasicGetAsync start before the message is in the queue. At first sight, we couldn't spot anything in the transport or test code that could account for that but, like I said, we may have missed something.

It is failing on Linux, so it's not just a Windows thing: https://github.com/Particular/NServiceBus.RabbitMQ/actions/runs/10854796632?pr=1446

Reproduction steps

Still investigating

Expected behavior

Publish working ;)

Additional context

No response

danielmarbach · 2024-09-16T19:03:14Z

We have a hunch the problem might be related to confirms tracking and we are restoring our homegrown approach we had before to verify.

lukebakken · 2024-09-17T01:31:47Z

> The thing we might be seeing so far here would be if BasicPublishAsync task is somehow completing before the message has actually been fully sent and confirmed.

Can you point out the exact test that is failing? In your source code?

The new confirmation tracking code was added by @stebet (I think) so I'm pinging him here.

When you call BasicPublishAsync, yes, it will complete and you must then call one of the WaitForConfirm*Async methods to wait for the confirmation. Or, like you mention, handle the acks yourself and track outstanding messages yourself.
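
To illustrate the two-step pattern described above, here is a minimal simulation written in Python's asyncio rather than C#. It does not use RabbitMQ.Client: `SimulatedChannel`, its methods, and the 50 ms broker delay are hypothetical stand-ins, chosen only to show how a publish that completes before the confirm can let a subsequent get see an empty queue.

```python
import asyncio
from typing import Optional


class SimulatedChannel:
    """Toy stand-in for a channel with publisher confirms enabled (not RabbitMQ.Client)."""

    def __init__(self, broker_delay: float = 0.05) -> None:
        self.queue = []          # messages the simulated broker has accepted
        self._pending = []       # tasks representing outstanding confirms
        self._broker_delay = broker_delay

    async def _broker_accepts(self, body: str) -> None:
        # The simulated broker enqueues (and thereby confirms) some time after the send.
        await asyncio.sleep(self._broker_delay)
        self.queue.append(body)

    async def publish(self, body: str) -> None:
        # Completes as soon as the message is handed off; the confirm is still outstanding.
        self._pending.append(asyncio.create_task(self._broker_accepts(body)))

    async def wait_for_confirms(self) -> None:
        # Completes once every outstanding publish has been confirmed.
        pending, self._pending = self._pending, []
        await asyncio.gather(*pending)

    def get(self) -> Optional[str]:
        # Analogous to a basic.get: returns None when the queue is empty.
        return self.queue.pop(0) if self.queue else None


async def publish_then_get(wait: bool) -> Optional[str]:
    channel = SimulatedChannel()
    await channel.publish("hello")
    if wait:
        await channel.wait_for_confirms()
    result = channel.get()
    await channel.wait_for_confirms()  # let the simulated broker finish before returning
    return result
```

In this toy model, `asyncio.run(publish_then_get(wait=False))` finds the queue empty, while `asyncio.run(publish_then_get(wait=True))` reads the message back, which is consistent with the kind of failure reported above.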

danielmarbach · 2024-09-17T04:50:43Z

Ah, OK, that's probably the missing link. Given the API returns a task and the channel knows that confirms are enabled, why does one need to call two methods?

The failing tests are visible in the CI runs. Sure, I can link them here later.

danielmarbach · 2024-09-17T10:58:48Z

> Given the API returns a task and the channel knows that confirms are enabled why does one need to call two methods?

Was this design chosen because of the multiple flag? // cc @stebet

danielmarbach · 2024-09-17T12:51:19Z

OK, I think I understand now. The work followed an existing design and made it async. Because we had our own confirmation tracking in place, we never needed to use WaitForConfirms before. Looking at the async nature of the APIs as of today, though, I would have found it more intuitive for a single publish to have that functionality built in, depending on whether the channel is in confirms mode or not. Now that I better understand the functionality, I do see why the APIs have been separated. It is up to the caller of Publish to decide how many publishes that should be done until the WaitForConfirms is used to wait for all the pending publishes.

lukebakken · 2024-09-17T14:33:58Z

> It is up to the caller of Publish to decide how many publishes that should be done until the WaitForConfirms is used to wait for all the pending publishes

Yep, I think that's the main idea, plus I think it makes it a bit "easier" for an application to handle the case of a timed-out confirmation, rather than a random exception while publishing or at some other point.

Let me know if there's anything I can do to assist with your testing. Thanks a lot for giving the latest RC a spin!

danielmarbach · 2024-09-17T16:25:04Z

@bording and I discussed this further today, and we believe this is a legacy design that should be removed from the API surface of the client. In an async world, on a channel that has confirms enabled, the publish operation could simply wire up the task completion source and then await that instead of the ModelSend. The result of the operation (successful, failed or cancelled) can be propagated into the task completion source.

Doing multiple publishes and waiting for those to be confirmed is then simply a Task.WhenAll. The current way is counterintuitive and error-prone.

We will look into opening a PR
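
As a rough illustration of the proposal above, here is a minimal sketch in Python's asyncio rather than C#. It is a simulation under stated assumptions, not the client's implementation: `ConfirmingChannel`, `on_ack`, `on_nack`, and the 10 ms simulated ack delay are hypothetical. `asyncio.Future` plays the role of a task completion source and `asyncio.gather` the role of Task.WhenAll.

```python
import asyncio


class ConfirmingChannel:
    """Toy channel whose publish() completes only once the broker has confirmed.

    Hypothetical simulation; asyncio.Future stands in for a TaskCompletionSource.
    """

    def __init__(self):
        self._next_seq = 1
        self._outstanding = {}  # sequence number -> Future awaiting its confirm

    async def publish(self, body):
        loop = asyncio.get_running_loop()
        seq = self._next_seq
        self._next_seq += 1
        confirmed = loop.create_future()
        self._outstanding[seq] = confirmed
        # A real client would write the frame here; the simulation just schedules an ack.
        loop.call_later(0.01, self.on_ack, seq)
        await confirmed  # resolves on ack, raises on nack

    def on_ack(self, seq):
        # Propagate success into the pending future.
        self._outstanding.pop(seq).set_result(None)

    def on_nack(self, seq):
        # Propagate failure into the pending future.
        self._outstanding.pop(seq).set_exception(
            RuntimeError(f"message {seq} was nacked")
        )


async def publish_batch(bodies):
    channel = ConfirmingChannel()
    # Waiting for several publishes is a single gather over their awaitables.
    results = await asyncio.gather(*(channel.publish(b) for b in bodies))
    return len(results)
```

With this shape the caller awaits one thing per publish, and a batch is awaited with a single `gather`. The sketch says nothing about the ordering or throughput of a real client.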

lukebakken · 2024-09-17T16:26:14Z

Great, that makes a lot of sense.

Tornhoof · 2024-09-17T17:03:02Z

> Doing multiple publishes and waiting for those to be confirmed is then simply a Task.WhenAll.

Be careful: multiple publishes via Task.WhenAll might not keep the ordering of the initial list of publishes.

A System.Threading.Channels channel might solve that, and multiple confirms could then simply be part of an inner (synchronous) TryRead loop, i.e., if multiple publishes are enqueued synchronously, these could then be confirmed together. This would then improve the confirm delay for channels with higher publish load.
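
The idea above can be sketched roughly as follows, again in Python's asyncio rather than C#. `asyncio.Queue` stands in for a System.Threading.Channels channel and `get_nowait` for TryRead; `publisher_loop`, `send`, `wait_for_confirms`, and the `None` stop sentinel are hypothetical names introduced only for this illustration.

```python
import asyncio


async def publisher_loop(queue, send, wait_for_confirms):
    """Drain everything already queued, send it in order, then confirm the batch once."""
    batch_sizes = []
    stop = False
    while not stop:
        item = await queue.get()      # wait asynchronously for the first message
        if item is None:              # sentinel: no more messages
            break
        batch = [item]
        while True:                   # synchronous drain, like a TryRead loop
            try:
                nxt = queue.get_nowait()
            except asyncio.QueueEmpty:
                break
            if nxt is None:
                stop = True
                break
            batch.append(nxt)
        for body in batch:
            await send(body)          # single consumer, so FIFO order is preserved
        await wait_for_confirms()     # one confirm wait for the whole batch
        batch_sizes.append(len(batch))
    return batch_sizes


async def demo():
    sent = []

    async def send(body):
        sent.append(body)

    async def wait_for_confirms():
        await asyncio.sleep(0.01)     # simulated broker round trip

    queue = asyncio.Queue()
    for body in ["a", "b", "c"]:
        queue.put_nowait(body)
    queue.put_nowait(None)
    sizes = await publisher_loop(queue, send, wait_for_confirms)
    return sent, sizes
```

In `demo`, the three messages queued up front are sent in their original order and share a single confirm wait, which is the effect the comment above describes for channels under higher publish load.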

danielmarbach · 2024-09-17T17:04:14Z

Sure, we can take this into account in the PR @Tornhoof

danielmarbach · 2024-09-17T17:08:09Z

@Tornhoof FYI, we don't care that much about the order of publishes for our use cases, but I can see how it might be something the users of this library care about, which I guess right now is fulfilled by the channel-wide semaphore ensuring consistent ordering. So I started wondering how much of that is a concern of the rabbitmq client vs the user of the API surface.

Tornhoof · 2024-09-17T17:51:35Z

> So I started wondering how much of that is a concern of the rabbitmq client vs the user of the API surface.

I'd say the user, but if publish automatically confirms (and single confirm is slow), then you'd need something like PublishMany(list) with multi confirm.

From my own benchmarks while migrating to 7.0: without any confirm, the code published around 16k msg/s; with single confirm it was down to 800 msg/s; and an S.T.Channel doing multi-confirm after the synchronous TryRead loop lifted it back to ~10k msg/s.

lukebakken · 2024-09-18T19:31:23Z

@danielmarbach @bording @Tornhoof - #1682

danielmarbach · 2024-09-19T18:27:57Z

Now that things are split out into dedicated issues and PRs I will close this one. Brandon will update #1682 with some of the discussions and challenges we are having.

danielmarbach added the bug label Sep 16, 2024

danielmarbach changed the title ~~Intermittent flakiness of v7.0~~ Intermittent flakiness of v7.0 RC Sep 16, 2024

lukebakken added this to the 7.0.0 milestone Sep 17, 2024

lukebakken self-assigned this Sep 17, 2024

lukebakken mentioned this issue Sep 18, 2024: Make handling of publisher confirmations transparent to the user #1682 (Closed, 4 tasks)

danielmarbach closed this as completed Sep 19, 2024
