Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[[JAMES-3696] step 1 : fix broken tests and isolate flaky tests #2416

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

jeantil
Copy link
Contributor

@jeantil jeantil commented Sep 19, 2024

When I tried to start working on the flaky test I ran the PulsarMailqueueTest suite and one of the tests failed. Using git bisect I was able to trace it back to the introduction of the deadletter policy a while ago and yet all the pipelines were ok on the CI.
I looked into the unstable-test stage and in develocity but couldn't find any trace of the PulsarMailqueueTest suite being executed.

Doing local tests I realized that when the whole class is tagged, for some reason, the suite will not be executed, neither in the normal test stage or in the unstable-test stage. This does not only affect PulsarMailQueueTest. See my attempt at fixing the underlying problem in #2417

This PR focuses on resolving the flakyness of the pulsar mailqueue tests by :

  • Fixing the currently broken test
  • Changing the way we assert to allow for an asychronous distribution of removal filters.

Doing so led Matthieu and I to discover another issue in the ManageableMailQueueContract : several tests assertions were verifying exactly nothing. We fixed that in the 3rd commit of this PR (a754182 as I write this). Doing so raised a failing test case for the ActiveMQMailQueueBlobTest

@jeantil jeantil requested review from mbaechler and chibenwa and removed request for mbaechler September 19, 2024 16:25
This doesn't fix the test but it eliminates one possible variation
The test started breaking when the deadletter policy was introduced on the pulsar mailqueue.
It wasn't detected because test suites that are tagged unstable are never executed on CI (:scream:)

The core issue is that the way the test was setup the same mail could end up being retried several times while others were being acknowledge in success immediately.
As long as no dead letter policy was applied, it was fine and the email would be retried several times before being dequeued. With the dead letter policy this was no longer possible and the test never completed because at least one of the mail ended up in dead letter.

I changed the test to track the state of each email and apply the proper action (retry or success) ensuring each mail is retried once and then acknowledged.
@jeantil jeantil changed the title [JAMES-3696] step 1 : fix broken tests and isolate flaky tests [DRAFT][JAMES-3696] step 1 : fix broken tests and isolate flaky tests Sep 19, 2024
Copy link
Contributor

@Arsnael Arsnael left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Read it . Looks good to me so far.

Noticed things sometimes as well breaking in tmail because it's not been really tested on James, there is clearly something better we could do here :)

The previous implementation using `Iterators.consumingIterator`
doesn't actually consume anything. It only wraps the iterator it receives in an interator that removes the items from the underlying iterator when next is called.

We implemented a very simple : consumeIterator method to actually consume and therefore trigger the potential concurrency issues the tests were trying to protect against.

Doing so raised a failing test in ActiveMQMailQueueBlobTest.
@jeantil jeantil changed the title [DRAFT][JAMES-3696] step 1 : fix broken tests and isolate flaky tests [[JAMES-3696] step 1 : fix broken tests and isolate flaky tests Sep 20, 2024
@chibenwa
Copy link
Contributor

Read it.
To be honnest I do not grasp fully changes here but it seems legitimate.

MailQueue.MailQueueItem mailQueueItem = deque.takeLast();
dequeuedMails.add(mailQueueItem.getMail());
mailQueueItem.done(MailQueue.MailQueueItem.CompletionStatus.SUCCESS);
var name = mailQueueItem.getMail().getName();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For this test what we want is to :
enqueue->retry->success each mail
in the previous implementation since there was no tracking of the state of a specific mail it could result in a mail getting retried twice once another got success right away.
Since the pulsar mailqueue has a dead letter policy that puts a message in dead letter after 1 retry, retrying a message more than once results in that message never being dequeable again. This in turn means the test woud block and timeout.

The pulsar implementation uses a distributed and therefore asynchronous distribution of the removal filters.

This change ensures that the tests allow for such an implementation by waiting for the system under test to converge to the desired state.

It also ensures that the executor waits as little as possible to get into a consistent state, ensuring fast local execution while allowing a longer wait on the CI. If the default timeout (10s if I understand correctly) is not enough for the CI, it can be safely increased without slowing down local test execution like Thread.sleep() did.
Comment on lines 473 to 477
assertThatCode(
() -> assertThat(consumeIterator(items)).isEqualTo(98) // and 998 here so 999 in total
).doesNotThrowAnyException();
Copy link
Contributor Author

@jeantil jeantil Sep 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

interestingly, this test fails except for the pulsar implementation.
Most implementations will return the removed item if the browse has been started before the item is removed. The pulsar implementation is able to filter it out even from an operating iterator.

I am considering relaxing the constraint here with this code

        assertThatCode(
                // the remove may or may not be applied to an already started iterator
                // as long as it doesn't make the iterator crash
                () -> assertThat(consumeIterator(items)).isGreaterThanOrEqualTo(98).isLessThan(100)
        ).doesNotThrowAnyException();

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants