-
Notifications
You must be signed in to change notification settings - Fork 470
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[[JAMES-3696] step 1 : fix broken tests and isolate flaky tests #2416
base: master
Are you sure you want to change the base?
Conversation
This doesn't fix the test but it eliminates one possible variation
The test started breaking when the deadletter policy was introduced on the pulsar mailqueue. It wasn't detected because test suites that are tagged unstable are never executed on CI (:scream:) The core issue is that the way the test was setup the same mail could end up being retried several times while others were being acknowledge in success immediately. As long as no dead letter policy was applied, it was fine and the email would be retried several times before being dequeued. With the dead letter policy this was no longer possible and the test never completed because at least one of the mail ended up in dead letter. I changed the test to track the state of each email and apply the proper action (retry or success) ensuring each mail is retried once and then acknowledged.
bd22335
to
80a45dc
Compare
server/queue/queue-pulsar/src/test/java/org/apache/james/queue/pulsar/PulsarMailQueueTest.java
Outdated
Show resolved
Hide resolved
server/queue/queue-pulsar/src/test/java/org/apache/james/queue/pulsar/PulsarMailQueueTest.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Read it . Looks good to me so far.
Noticed things sometimes as well breaking in tmail because it's not been really tested on James, there is clearly something better we could do here :)
The previous implementation using `Iterators.consumingIterator` doesn't actually consume anything. It only wraps the iterator it receives in an interator that removes the items from the underlying iterator when next is called. We implemented a very simple : consumeIterator method to actually consume and therefore trigger the potential concurrency issues the tests were trying to protect against. Doing so raised a failing test in ActiveMQMailQueueBlobTest.
80a45dc
to
4e7a32f
Compare
Read it. |
MailQueue.MailQueueItem mailQueueItem = deque.takeLast(); | ||
dequeuedMails.add(mailQueueItem.getMail()); | ||
mailQueueItem.done(MailQueue.MailQueueItem.CompletionStatus.SUCCESS); | ||
var name = mailQueueItem.getMail().getName(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For this test what we want is to :
enqueue->retry->success each mail
in the previous implementation since there was no tracking of the state of a specific mail it could result in a mail getting retried twice once another got success right away.
Since the pulsar mailqueue has a dead letter policy that puts a message in dead letter after 1 retry, retrying a message more than once results in that message never being dequeable again. This in turn means the test woud block and timeout.
4e7a32f
to
347249e
Compare
The pulsar implementation uses a distributed and therefore asynchronous distribution of the removal filters. This change ensures that the tests allow for such an implementation by waiting for the system under test to converge to the desired state. It also ensures that the executor waits as little as possible to get into a consistent state, ensuring fast local execution while allowing a longer wait on the CI. If the default timeout (10s if I understand correctly) is not enough for the CI, it can be safely increased without slowing down local test execution like Thread.sleep() did.
assertThatCode( | ||
() -> assertThat(consumeIterator(items)).isEqualTo(98) // and 998 here so 999 in total | ||
).doesNotThrowAnyException(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
interestingly, this test fails except for the pulsar implementation.
Most implementations will return the removed item if the browse has been started before the item is removed. The pulsar implementation is able to filter it out even from an operating iterator.
I am considering relaxing the constraint here with this code
assertThatCode(
// the remove may or may not be applied to an already started iterator
// as long as it doesn't make the iterator crash
() -> assertThat(consumeIterator(items)).isGreaterThanOrEqualTo(98).isLessThan(100)
).doesNotThrowAnyException();
347249e
to
ee43c0d
Compare
When I tried to start working on the flaky test I ran the
PulsarMailqueueTest
suite and one of the tests failed. Usinggit bisect
I was able to trace it back to the introduction of the deadletter policy a while ago and yet all the pipelines were ok on the CI.I looked into the
unstable-test
stage and in develocity but couldn't find any trace of thePulsarMailqueueTest
suite being executed.Doing local tests I realized that when the whole class is tagged, for some reason, the suite will not be executed, neither in the normal test stage or in the
unstable-test
stage. This does not only affectPulsarMailQueueTest
. See my attempt at fixing the underlying problem in #2417This PR focuses on resolving the flakyness of the pulsar mailqueue tests by :
Doing so led Matthieu and I to discover another issue in the
ManageableMailQueueContract
: several tests assertions were verifying exactly nothing. We fixed that in the 3rd commit of this PR (a754182 as I write this). Doing so raised a failing test case for theActiveMQMailQueueBlobTest