[[JAMES-3696] step 1 : fix broken tests and isolate flaky tests #2416

jeantil · 2024-09-19T16:25:40Z

When I tried to start working on the flaky test I ran the PulsarMailqueueTest suite and one of the tests failed. Using git bisect I was able to trace it back to the introduction of the deadletter policy a while ago and yet all the pipelines were ok on the CI.
I looked into the unstable-test stage and in develocity but couldn't find any trace of the PulsarMailqueueTest suite being executed.

Doing local tests I realized that when the whole class is tagged, for some reason, the suite will not be executed, neither in the normal test stage or in the unstable-test stage. This does not only affect PulsarMailQueueTest. See my attempt at fixing the underlying problem in #2417

This PR focuses on resolving the flakyness of the pulsar mailqueue tests by :

Fixing the currently broken test
Changing the way we assert to allow for an asychronous distribution of removal filters.

Doing so led Matthieu and I to discover another issue in the ManageableMailQueueContract : several tests assertions were verifying exactly nothing. We fixed that in the 3rd commit of this PR (a754182 as I write this). Doing so raised a failing test case for the ActiveMQMailQueueBlobTest

This doesn't fix the test but it eliminates one possible variation

The test started breaking when the deadletter policy was introduced on the pulsar mailqueue. It wasn't detected because test suites that are tagged unstable are never executed on CI (:scream:) The core issue is that the way the test was setup the same mail could end up being retried several times while others were being acknowledge in success immediately. As long as no dead letter policy was applied, it was fine and the email would be retried several times before being dequeued. With the dead letter policy this was no longer possible and the test never completed because at least one of the mail ended up in dead letter. I changed the test to track the state of each email and apply the proper action (retry or success) ensuring each mail is retried once and then acknowledged.

server/queue/queue-pulsar/src/test/java/org/apache/james/queue/pulsar/PulsarMailQueueTest.java

Arsnael

Read it . Looks good to me so far.

Noticed things sometimes as well breaking in tmail because it's not been really tested on James, there is clearly something better we could do here :)

The previous implementation using `Iterators.consumingIterator` doesn't actually consume anything. It only wraps the iterator it receives in an interator that removes the items from the underlying iterator when next is called. We implemented a very simple : consumeIterator method to actually consume and therefore trigger the potential concurrency issues the tests were trying to protect against. Doing so raised a failing test in ActiveMQMailQueueBlobTest.

chibenwa · 2024-09-20T07:35:21Z

Read it.
To be honnest I do not grasp fully changes here but it seems legitimate.

jeantil · 2024-09-20T11:22:49Z

server/queue/queue-api/src/test/java/org/apache/james/queue/api/MailQueueContract.java

                    MailQueue.MailQueueItem mailQueueItem = deque.takeLast();
-                    dequeuedMails.add(mailQueueItem.getMail());
-                    mailQueueItem.done(MailQueue.MailQueueItem.CompletionStatus.SUCCESS);
+                    var name = mailQueueItem.getMail().getName();


For this test what we want is to :
enqueue->retry->success each mail
in the previous implementation since there was no tracking of the state of a specific mail it could result in a mail getting retried twice once another got success right away.
Since the pulsar mailqueue has a dead letter policy that puts a message in dead letter after 1 retry, retrying a message more than once results in that message never being dequeable again. This in turn means the test woud block and timeout.

The pulsar implementation uses a distributed and therefore asynchronous distribution of the removal filters. This change ensures that the tests allow for such an implementation by waiting for the system under test to converge to the desired state. It also ensures that the executor waits as little as possible to get into a consistent state, ensuring fast local execution while allowing a longer wait on the CI. If the default timeout (10s if I understand correctly) is not enough for the CI, it can be safely increased without slowing down local test execution like Thread.sleep() did.

jeantil · 2024-09-20T17:11:29Z

...er/queue/queue-api/src/test/java/org/apache/james/queue/api/ManageableMailQueueContract.java

+        assertThatCode(
+                () -> assertThat(consumeIterator(items)).isEqualTo(98) // and 998 here so 999 in total
+        ).doesNotThrowAnyException();


interestingly, this test fails except for the pulsar implementation.
Most implementations will return the removed item if the browse has been started before the item is removed. The pulsar implementation is able to filter it out even from an operating iterator.

I am considering relaxing the constraint here with this code

assertThatCode( // the remove may or may not be applied to an already started iterator // as long as it doesn't make the iterator crash () -> assertThat(consumeIterator(items)).isGreaterThanOrEqualTo(98).isLessThan(100) ).doesNotThrowAnyException();

…ract

jeantil requested review from mbaechler and chibenwa and removed request for mbaechler September 19, 2024 16:25

jeantil added 2 commits September 19, 2024 21:00

[devscout] upgrades to pulsar 3.3.1

9464c68

This doesn't fix the test but it eliminates one possible variation

jeantil force-pushed the pulsar/fix-flaky-tests-1 branch from bd22335 to 80a45dc Compare September 19, 2024 19:01

jeantil changed the title ~~[JAMES-3696] step 1 : fix broken tests and isolate flaky tests~~ [DRAFT][JAMES-3696] step 1 : fix broken tests and isolate flaky tests Sep 19, 2024

jeantil commented Sep 19, 2024

View reviewed changes

server/queue/queue-pulsar/src/test/java/org/apache/james/queue/pulsar/PulsarMailQueueTest.java Outdated Show resolved Hide resolved

jeantil commented Sep 19, 2024

View reviewed changes

server/queue/queue-pulsar/src/test/java/org/apache/james/queue/pulsar/PulsarMailQueueTest.java Outdated Show resolved Hide resolved

Arsnael reviewed Sep 20, 2024

View reviewed changes

jeantil force-pushed the pulsar/fix-flaky-tests-1 branch from 80a45dc to 4e7a32f Compare September 20, 2024 07:05

jeantil changed the title ~~[DRAFT][JAMES-3696] step 1 : fix broken tests and isolate flaky tests~~ [[JAMES-3696] step 1 : fix broken tests and isolate flaky tests Sep 20, 2024

jeantil requested a review from Arsnael September 20, 2024 07:05

jeantil commented Sep 20, 2024

View reviewed changes

jeantil force-pushed the pulsar/fix-flaky-tests-1 branch from 4e7a32f to 347249e Compare September 20, 2024 11:24

jeantil commented Sep 20, 2024

View reviewed changes

fixup! [JAMES-3696] improves flaky tests from ManageableMailQueueCont…

ee43c0d

…ract

jeantil force-pushed the pulsar/fix-flaky-tests-1 branch from 347249e to ee43c0d Compare September 20, 2024 17:13

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[[JAMES-3696] step 1 : fix broken tests and isolate flaky tests #2416

[[JAMES-3696] step 1 : fix broken tests and isolate flaky tests #2416

jeantil commented Sep 19, 2024 •

edited

Loading

Arsnael left a comment

chibenwa commented Sep 20, 2024

jeantil Sep 20, 2024

jeantil Sep 20, 2024 •

edited

Loading

[[JAMES-3696] step 1 : fix broken tests and isolate flaky tests #2416

Are you sure you want to change the base?

[[JAMES-3696] step 1 : fix broken tests and isolate flaky tests #2416

Conversation

jeantil commented Sep 19, 2024 • edited Loading

Arsnael left a comment

Choose a reason for hiding this comment

chibenwa commented Sep 20, 2024

jeantil Sep 20, 2024

Choose a reason for hiding this comment

jeantil Sep 20, 2024 • edited Loading

Choose a reason for hiding this comment

jeantil commented Sep 19, 2024 •

edited

Loading

jeantil Sep 20, 2024 •

edited

Loading