Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Intermittent flakiness of v7.0 RC #1676

Closed
danielmarbach opened this issue Sep 16, 2024 · 14 comments
Closed

Intermittent flakiness of v7.0 RC #1676

danielmarbach opened this issue Sep 16, 2024 · 14 comments
Assignees
Labels
Milestone

Comments

@danielmarbach
Copy link
Collaborator

danielmarbach commented Sep 16, 2024

Describe the bug

It is still a bit early to say what actually causes these things to happen and might very well be that the problem ins somewhere in our implementation. But we (@bording and I) figured to give a heads-up in case it ends up being something in the client.

We have migrated to the latest RC of the v7 version and see some intermittent test failures

https://github.com/Particular/NServiceBus.RabbitMQ/actions/runs/10852801383?pr=1446

For some reason, the consume isn't happening, which messes up all the other tests

The thing we might be seeing so far here would be if BasicPublishAsync task is somehow completing before the message has actually been fully sent and confirmed. This would then let the BasicGetAsync start before the message is in the queue. At first sight, we couldn't spot anything in the transport or test code that could account for that, but like I said we may have missed something.

It is failing on Linux, so it's not just a Windows thing: https://github.com/Particular/NServiceBus.RabbitMQ/actions/runs/10854796632?pr=1446

Reproduction steps

Still investigating

Expected behavior

Publish working ;)

Additional context

No response

@danielmarbach danielmarbach changed the title Intermittent flakiness of v7.0 Intermittent flakiness of v7.0 RC Sep 16, 2024
@danielmarbach
Copy link
Collaborator Author

We have a hunch the problem might be related to confirms tracking and we are restoring our homegrown approach we had before to verify.

@lukebakken lukebakken added this to the 7.0.0 milestone Sep 17, 2024
@lukebakken lukebakken self-assigned this Sep 17, 2024
@lukebakken
Copy link
Contributor

The thing we might be seeing so far here would be if BasicPublishAsync task is somehow completing before the message has actually been fully sent and confirmed.

Can you point out the exact test that is failing? In your source code?

The new confirmation tracking code was added by @stebet (I think) so I'm pinging him here.

When you call BasicPublishAsync, yes, it will complete and you must then call one of the WaitForConfirm*Async methods to wait for the confirmation. Or, like you mention, handle the acks yourself and track outstanding messages yourself.

@danielmarbach
Copy link
Collaborator Author

danielmarbach commented Sep 17, 2024

Ah OK that's probably the missing link. Given the API returns a task and the channel knows that confirms are enabled why does one need to call two methods?

The failing tests are visible in the CI runs. Sure I can link them later here

@danielmarbach
Copy link
Collaborator Author

Given the API returns a task and the channel knows that confirms are enabled why does one need to call two methods?

Was this design chosen because of the multiple flag? // cc @stebet

@danielmarbach
Copy link
Collaborator Author

Ok, I think I understand now. The work followed an existing design and made that async. Because we had our own confirmation tracking in place, we never needed to use WaitFormConfirms before. For me though, by looking at the async nature of the APIs as of today, I would have found it more intuitive for a single publish to have that functionality built-in depending on whether the channel is in confirms mode or not. Now that I better understand the functionality, I do see though why the APIs have been separated. It is up to the caller of Publish to decide how many publishes that should be done until the WaitForConfirms is used to wait for all the pending publishes.

@lukebakken
Copy link
Contributor

It is up to the caller of Publish to decide how many publishes that should be done until the WaitForConfirms is used to wait for all the pending publishes

Yep, I think that's the main idea, plus I think it makes it a bit "easier" for an application to handle the case of a timed-out confirmation, rather than a random exception while publishing or at some other point.

Let me know if there's anything I can do to to assist with your testing. Thanks a lot for giving the latest RC a spin!

@danielmarbach
Copy link
Collaborator Author

@bording and I discussed this further today, and we believe this is a legacy that should be removed from the API surface of the client. In an async world on a channel that has confirms enabled, the publish operation could simply wire up the task completion source and then await that instead of the ModelSend. The result of the operation (successful, failed or cancelled) can be propagated into the task completion source.

Doing multiple publishes and waiting for those to be confirmed is then simply a Task.WhenAll. The current way is counterintuitive and error-prone.

We will look into opening a PR

@lukebakken
Copy link
Contributor

Great, that makes a lot of sense.

@Tornhoof
Copy link
Contributor

Doing multiple publishes and waiting for those to be confirmed is then simply a Task.WhenAll.

Be careful that multiple publishes via Task.WhenAll might not keep the ordering.l of the initial list of publishes.

A System.Threading.Channel might solve that and multiple confirms could then simply be a part of an inner (synchronous) TryRead loop, i.e., if multiple publishes are enqueued synchronously, these could be then confirmed together. This would then improve the confirm delay for channels with higher publish load.

@danielmarbach
Copy link
Collaborator Author

Sure we can take this into account in the PR @Tornhoof

@danielmarbach
Copy link
Collaborator Author

@Tornhoof FYI we don't care about the order of publishes for our use cases that much, but I can see how that is something that might be something the users of this library care about, which I guess right now is fulfilled by having the channel wide semaphore ensuring consistent ordering. So I started wondering how much of that is a concern of the rabbitmq client vs the user of the API surface.

@Tornhoof
Copy link
Contributor

So I started wondering how much of that is a concern of the rabbitmq client vs the user of the API surface.

I'd say User, but if publish automatically confirms (and single confirm is slow), then you'd need something like PublishMany(list) with multi confirm.

From my own Benchmarks while migrating to 7.0, without any confirm, the code published around 16k msg/s, with single confirm it was down to 800 and with an S.T.Channel doing multi-confirm after the synchronous TryRead Loop lifted it back to ~10k.

@lukebakken
Copy link
Contributor

@danielmarbach
Copy link
Collaborator Author

Now that things are split out into dedicated issues and PRs I will close this one. Brandon will update #1682 with some of the discussions and challenges we are having.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants