RabbitMQ 3.10 changed behaviour of unacknowledged messages #585

Open
2 tasks
keifgwinn opened this issue Jul 21, 2022 · 2 comments
Comments

@keifgwinn (Contributor) commented Jul 21, 2022

Description

There was an issue on production, captured in https://github.com/3drepo/DevOps/issues/457, that was resolved with a configuration change on RabbitMQ: the default behaviour for unacknowledged messages changed in RabbitMQ 3.10, and the change was also backported to 3.8.15.

They've also deprecated classic mirrored queues and recommend the newer quorum queues instead: https://www.rabbitmq.com/ha.html

We are also seeing higher queue depths and more unacknowledged messages as the system gets busier; due to the high number of channel exceptions, we also encountered this 'stuck' queue.

RabbitMQ unacked messages are unacknowledged messages. In RabbitMQ, many messages are delivered to a consumer or target, but the protocol does not guarantee that delivery will always succeed, so publishers and consumers need a mechanism to confirm delivery and processing. This is where acknowledgements come in. A message is "ready" while it is waiting in the queue to be processed. Whenever a consumer connects to the queue it receives a batch of messages, up to the prefetch size set on the channel, and while the consumer is working on them they are counted as unacked. In short, unacked messages are messages that have been delivered but not yet acknowledged.

If a consumer fails to acknowledge messages, RabbitMQ keeps delivering new ones until the number of unacked messages on the channel reaches its prefetch value, at which point deliveries to that channel stop. If the consumer's channel or connection closes (for example because the broker's delivery acknowledgement timeout is hit), the "stuck" messages are made available again and the client process can reconnect and reprocess them. Unacked messages have been read by the consumer, but the consumer has never sent an ack back to the RabbitMQ broker to say it has finished processing them.
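As a rough illustration of what this looks like from the consumer side (a sketch only, assuming the worker consumes with amqplib and manual acks; the queue name and processJob are placeholder names, not our actual code): messages sit in the unacked count while processJob runs, and the broker never pushes more than the prefetch limit to this channel.

const amqp = require('amqplib');

async function startConsumer() {
  const conn = await amqp.connect(process.env.AMQP_URL);
  const ch = await conn.createChannel();

  // The broker keeps at most this many unacked messages in flight on this channel.
  await ch.prefetch(5);

  await ch.assertQueue('jobq', { durable: true });
  await ch.consume('jobq', async (msg) => {
    if (msg === null) return;          // consumer was cancelled by the broker
    try {
      await processJob(msg.content);   // message stays "unacked" while this runs
      ch.ack(msg);                     // confirm successful processing
    } catch (err) {
      ch.nack(msg, false, true);       // processing failed: return it to the queue
    }
  }, { noAck: false });
}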

Goals

  • Review our use of queues to more closely match the patterns expected by the RabbitMQ developers.

Tasks

  • TBD

Related Resources

RabbitMQ ChangeLog

Current queue configuration (screenshot attached)

@carmenfan (Member)

As discussed in Teams, we could ack the message as soon as we receive it, but we will then need to requeue the message ourselves should the processing fail in an unexpected way. Currently there are 2 ways this may happen:

  1. Licensing failure - this is easily resolved by a code change, as we are in control at the point of failure
  2. The bouncer_worker is terminated unexpectedly - this mostly happens when AWS recalls the machine and the nodeJS application gets killed. The question is, is there any way the nodeJS process can get a signal before this happens so we can requeue the message? Do we get a signal that we can catch (like SIGTERM)? See the sketch after this list.
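A minimal sketch of this ack-on-receipt idea (assuming amqplib; processJob, the queue name and the shutdown wiring are hypothetical placeholders rather than the existing bouncer_worker code): the message is acked immediately, and on a processing failure or a caught SIGTERM we republish the payload ourselves.

const amqp = require('amqplib');

async function startWorker() {
  const conn = await amqp.connect(process.env.AMQP_URL);
  const ch = await conn.createChannel();
  await ch.assertQueue('jobq', { durable: true });

  let currentJob = null; // payload of the job currently being processed, if any

  await ch.consume('jobq', async (msg) => {
    if (msg === null) return;
    ch.ack(msg);                // ack on receipt, as proposed
    currentJob = msg.content;
    try {
      await processJob(currentJob);
    } catch (err) {
      // e.g. a licensing failure: we still "own" the job, so put it back ourselves
      ch.sendToQueue('jobq', currentJob, { persistent: true });
    } finally {
      currentJob = null;
    }
  });

  process.on('SIGTERM', async () => {
    // the machine is being recalled: requeue whatever we were working on
    if (currentJob) ch.sendToQueue('jobq', currentJob, { persistent: true });
    await ch.close();
    await conn.close();
    process.exit(0);
  });
}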

@keifgwinn (Contributor, Author) commented Jul 21, 2022

If the node is going away due to AWS-initiated activity, we have the aws-node-termination-handler installed and currently configured in metadata monitoring mode, so we should get notified that the hardware is going away.


It is currently installed like this:


#AWS packages, termination handler, cloudwatch metrics  
helm upgrade --install \
 --force aws-node-termination-handler \
 --namespace kube-system \
 --set enableSpotInterruptionDraining="true" \
 --set enableRebalanceMonitoring="true" \
 --set enableScheduledEventDraining="true" \
 --set emitKubernetesEvents="true" \
 --set taintNode="true" \
 eks/aws-node-termination-handler --version 0.13.3

In that event, the nodes get 'tainted', i.e. marked as about to be offlined, and Kubernetes should signal the pods that they are getting terminated so they can shut down gracefully. Under our current configuration, TERM is sent to the node process:

The kubelet triggers the container runtime to send a TERM signal to process 1 inside each container.

So we should be able to catch it.
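For completeness, if we keep the current ack-after-processing model, simply catching that TERM and closing the channel cleanly should be enough, because the broker requeues anything still unacked on the channel (sketch only; conn, ch and consumerTag are assumed to come from the usual amqplib consumer setup, and the names are illustrative):

// `conn`, `ch` and `consumerTag` come from the consumer setup, e.g.
// ({ consumerTag } = await ch.consume(...)) -- illustrative names only.
process.on('SIGTERM', async () => {
  await ch.cancel(consumerTag); // stop receiving new deliveries
  await ch.close();             // anything still unacked is requeued by the broker
  await conn.close();
  process.exit(0);
});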

However, occasionally there may be a hardware error underneath that will not follow this process.
