Chaos testing: break down Redis #1013
Comments
cc @chibenwa |
Local test result:
Rate limiting and Rspamd mailet
It should be noted that we have already declared
Rate limiting
The
I attempted to modify the source code to override the timeout via Reactor (e.g. set to 10 seconds). In this case, the mailet threw an exception, but it was ignored, and the next mailet in the pipeline continued execution. The recipient successfully received the email from the sender.
// A warning flag has been set for this feature because the timeout for this mailet should be configurable.
Rspamd
The
SSO via Apisix
It is not possible to log in or log out via SSO. The error occurred at:
JMAP / Redis event bus
It is not possible to receive a response from JMAP methods:
Another related exception:
Last: we cannot start (or restart) TMail while Redis is stopped. |
Just to be sure, did you rely on a Redis cluster for those tests? Or did you work on a single container? When Redis is used as a pub-sub component for Apache James, some level of reliance on Redis is IMO acceptable: it would be OK to tolerate failures of, and a dependency on, Redis for that pub-sub component.
+1 for the timeout in the RateLimiting mailets configuration, none by default. I do not understand why the Lettuce driver does not handle the timeout itself. They document a default timeout of 60 seconds. We need to understand why that is not the case IMO. We could also configure this mailet to ignore exceptions by default...
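The two ideas in this comment (a configurable timeout, and optionally ignoring failures) can be sketched as follows. This is a minimal illustration in Python, not the actual mailet code: `check_rate_limit`, `redis_call`, `timeout_s` and `fail_open` are hypothetical names, and the two fake Redis coroutines only simulate a hung and a healthy backend.

```python
import asyncio


async def check_rate_limit(redis_call, timeout_s=5.0, fail_open=True):
    """Return True if the mail may proceed, False if it must be rejected.

    redis_call is a hypothetical coroutine function that asks Redis whether
    the sender is still under quota (True) or not (False).
    """
    try:
        # Bound the Redis round trip so a dead backend cannot block the pipeline.
        return await asyncio.wait_for(redis_call(), timeout=timeout_s)
    except (asyncio.TimeoutError, ConnectionError):
        # Redis is slow or unreachable: let the mail through (fail open)
        # or reject it (fail closed), depending on configuration.
        return fail_open


async def hung_redis():
    await asyncio.sleep(60)  # simulates a Redis node that never answers
    return True


async def healthy_redis():
    return False  # simulates "sender is over quota"


if __name__ == "__main__":
    print(asyncio.run(check_rate_limit(hung_redis, timeout_s=0.05)))
    print(asyncio.run(check_rate_limit(healthy_redis, timeout_s=0.05)))
```

With `fail_open=True` a Redis outage degrades rate limiting instead of blocking delivery; with `fail_open=False` the outage is surfaced as a rejection. Which default is right is exactly the open question in this thread.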
Tested with a redis cluster? We might want to add a parameter
If operating on top of a lone Redis, then failure at the JMAP level seems OK to me at first glance. However, failing to time out cleanly IS an issue. If Redis is KO, those JMAP requests should fail fast, in the 5s range IMO.
That's indeed a problem: we should be able to reboot TMail (when not using Redis for pub-sub). If using Redis for pub-sub, then failing to start James would be acceptable if Redis is down... Thoughts? |
I used a single Redis container for the test. Is the Redis cluster (master-slave) on staging k8s enough for what we want? |
I checked the staging, topology is 1 master + 2 replicas. |
A bit more explanation on that. It actually does not sound good. Alternatives I can think of:
|
Now that I think about it again, wasn't there an issue using the Redis cluster with one of our components? Maybe Apisix? |
No, ideally redis-cluster should be used for testing. IMO the Redis topology shall be...
|
How many masters are there in the redis-cluster? |
3 node cluster |
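For context on why the number of masters matters: Redis Cluster spreads 16384 hash slots across its masters, and with the default `cluster-require-full-coverage yes` it stops serving queries as soon as some slots have no reachable owner. The sketch below models that rule only. The node names and the even slot split are assumptions for illustration, and it ignores the separate requirement that a majority of masters stay reachable.

```python
TOTAL_SLOTS = 16384  # fixed number of hash slots in Redis Cluster


def assign_slots(masters):
    """Split the slot range evenly across masters (assumed layout)."""
    per_node = TOTAL_SLOTS // len(masters)
    ranges = {}
    start = 0
    for i, name in enumerate(masters):
        # The last master takes whatever remains after integer division.
        end = TOTAL_SLOTS - 1 if i == len(masters) - 1 else start + per_node - 1
        ranges[name] = range(start, end + 1)
        start = end + 1
    return ranges


def uncovered_slots(ranges, alive):
    """Count slots whose owning master is not in the set of alive masters."""
    return sum(len(r) for name, r in ranges.items() if name not in alive)


def cluster_ok(ranges, alive, require_full_coverage=True):
    """Model of cluster-require-full-coverage: any hole marks the cluster down."""
    return uncovered_slots(ranges, alive) == 0 or not require_full_coverage


if __name__ == "__main__":
    ranges = assign_slots(["m1", "m2", "m3"])  # hypothetical node names
    print(cluster_ok(ranges, {"m1", "m2", "m3"}))
    print(cluster_ok(ranges, {"m1", "m2"}))
```

With three masters and no replicas, losing any one master leaves roughly a third of the slots without an owner, which is consistent with the `CLUSTERDOWN` error reported later in this thread.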
Redis-cluster lab (local)
Docker-compose lab:
Testing with redis-cluster can lead to various scenarios, so before describing them, I'll make a few remarks:
1. Redis cluster: 3 master nodes, 0 replica nodes (a requirement for building a cluster is to have a minimum of 3 master nodes):
When
2. Redis cluster: 3 master nodes, 3 replica nodes
Scenario sample:
During this time, monitoring the Redis logs, there will be logs like:
TMail backend and Redis cluster
Rspamd
Rate limiting, JMAP / Redis event bus
{
  "sessionState": "2c9f1b12-b35a-43e6-9af2-0106fb53a943",
  "methodResponses": [
    [
      "Email/send",
      {
        "accountId": "b0d9e55c1a2682586469bc2a23dbb2c671e138ee61e0362972fd7c3d265ea9b2",
        "newState": "2c9f1b12-b35a-43e6-9af2-0106fb53a943",
        "notCreated": {
          "K87": {
            "type": "serverFail",
            "description": "CLUSTERDOWN The cluster is down"
          }
        }
      },
      "c1"
    ]
  ]
}
Related error log regarding the Lettuce library:
Warning log when starting TMail:
Another note:
|
Interesting experiment.
So can TMail recover and reconnect to the Redis Cluster once the cluster is back to normal?
I can not understand this. The
Don't forget to fire a fix for it ^^ |
So what I understand is that we can't use Redis Cluster with Rspamd, correct? Same for Sentinel, I would guess, if you can only point to one endpoint? Or maybe the headless endpoint with k8s that redirects to all Redis pod addresses would do the trick? |
Rspamd does support Redis Sentinel: I am unsure about Redis Cluster, as I do not see it mentioned by Rspamd. |
I remember it being unsupported, as it lacked some Redis commands. |
Some summary:
|
Some questions:
|
Open
my opinion: lower is better
+1 The key dispatch by Redis for the notification feature is not critical, |
What is the impact of a false positive, i.e. you fall back when nothing is actually wrong? |
Even when the master node is down, or not |
That was not the question. Upon a master-slave failover... do we lose unreplicated data? ... How long does the failover take? ... Are there other side effects? Based on these answers we might want to use a low value, or a defensive value to prevent too-frequent switches... |
Upon a master slave failover...
yes, Example cases:
// the Redis documentation states that it does not guarantee strong consistency
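To make the "do we lose unreplicated data" answer concrete, here is a toy model of asynchronous replication, which is how Redis replicates by default: the master acknowledges a write before the replica has it, so a failover in that window drops the write. The class and method names are hypothetical and this is not how Redis is implemented. It only illustrates the ordering problem.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.data = {}


class MasterReplicaPair:
    """Toy model: writes are acknowledged before they reach the replica."""

    def __init__(self):
        self.master = Node("master")
        self.replica = Node("replica")
        self.pending = []  # acknowledged writes not yet replicated

    def write(self, key, value):
        self.master.data[key] = value
        self.pending.append((key, value))
        return "OK"  # the client sees success immediately

    def replicate(self):
        """Ship all pending writes to the replica."""
        for key, value in self.pending:
            self.replica.data[key] = value
        self.pending.clear()

    def failover(self):
        """Master dies; the replica is promoted with whatever it has."""
        lost = [key for key, _ in self.pending]
        self.master, self.replica = self.replica, Node("new-replica")
        self.pending.clear()
        return lost


if __name__ == "__main__":
    pair = MasterReplicaPair()
    pair.write("a", 1)
    pair.replicate()
    pair.write("b", 2)      # acknowledged, but never replicated
    print(pair.failover())  # keys whose latest write was lost
    print(pair.master.data)
```

The size of that window, and how long the promotion itself takes, are what the failover timeout discussed above trades off against.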
config: Redis document:
Related to
Updated answer: |
Why
Expectation: TMail core service should not be disrupted more or less by Redis outage.
How
Experiment on preprod with what happens to TMail deployment if Redis is down.
Some related Redis features:
Identify issues and propose enhancements to help TMail deployment be more fault tolerant and resilient.