S3 reliability: write across 2 AZ #1166
Comments
My personal view: this does not look like good practice. I recently received a complaint from the Ops team regarding an S3 bucket. It looks like the S3 replication feature covers this: https://aws.amazon.com/s3/features/replication/#:~:text=Amazon%20S3%20CRR%20automatically%20replicates,access%20in%20different%20geographic%20regions.
Agreed, but it is not supported by OVH: we do not have a choice here.
I have just seen that too. It looks like we can now use it publicly.
Just tested, it seems to work OK. I'm following this tutorial: https://help.ovhcloud.com/csm/asia-public-cloud-storage-s3-asynchronous-replication-buckets?id=kb_article_view&sysparm_article=KB0062424#using-the-cli Note: it only works on objects uploaded after applying the replication rule. See https://help.ovhcloud.com/csm/asia-public-cloud-storage-s3-asynchronous-replication-buckets?id=kb_article_view&sysparm_article=KB0062424#what-is-replicated-and-what-is-not
Let's first validate whether S3 asynchronous replication is acceptable to the customer.
Edit: @PatrickPereiraLinagora will further check with the customer whether async replication is acceptable to them.

TL;DR: following the March incident, our margin for maneuver is not great. We might be forced to eat our hats, but at least we will try!

ALSO, it turns out I badly understood the ticket: we would also want to maintain automatic write availability. Namely:

Nominal case: the write succeeds on both blobStoreA and blobStoreB.

Partial failure: the write succeeds on one blob store but fails on the other. This means we need to set up a RabbitMQ queue to retry failed writes. The listener of the queue would then asynchronously read blobStoreA to complete the write on blobStoreB.

Total failure: the write fails on both blob stores, and the whole operation fails.
Read path: read operations are performed on blobStoreA, with a fallback to blobStoreB in case of error, **or if the object is not found in A**.
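The write and read semantics described above can be sketched as follows. This is a minimal illustration, not the actual TMail code: in-memory maps stand in for the S3 blob stores in each AZ, and a plain queue stands in for the RabbitMQ retry queue; all names are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// Sketch of dual-AZ write/read semantics (nominal / partial / total failure).
public class DualAzSketch {
    final Map<String, byte[]> blobStoreA = new HashMap<>();
    final Map<String, byte[]> blobStoreB = new HashMap<>();
    final Queue<String> retryQueue = new ArrayDeque<>(); // stands in for RabbitMQ

    // Simulated outage flags, for the example only.
    boolean aDown = false;
    boolean bDown = false;

    // Write path: nominal = both succeed; partial failure = one store fails
    // -> acknowledge the write and enqueue a retry; total failure = both
    // stores fail -> the operation fails.
    void save(String blobId, byte[] data) {
        boolean aOk = !aDown;
        boolean bOk = !bDown;
        if (!aOk && !bOk) {
            throw new RuntimeException("Total failure: no AZ accepted the write");
        }
        if (aOk) blobStoreA.put(blobId, data);
        if (bOk) blobStoreB.put(blobId, data);
        if (aOk != bOk) {
            // Partial failure: remember the blob so the asynchronous
            // listener can complete the write on the lagging store later.
            retryQueue.add(blobId);
        }
    }

    // Read path: read from A, fall back to B on error or missing object.
    byte[] read(String blobId) {
        if (!aDown && blobStoreA.containsKey(blobId)) {
            return blobStoreA.get(blobId);
        }
        return blobStoreB.get(blobId);
    }

    // What the queue listener would do: read the store that has the blob
    // and complete the write on the other one.
    void drainRetries() {
        while (!retryQueue.isEmpty()) {
            String blobId = retryQueue.poll();
            byte[] data = blobStoreA.containsKey(blobId)
                ? blobStoreA.get(blobId)
                : blobStoreB.get(blobId);
            blobStoreA.putIfAbsent(blobId, data);
            blobStoreB.putIfAbsent(blobId, data);
        }
    }
}
```

Note that this sketch acknowledges a write as soon as one store accepted it; whether an A-side failure should instead be a total failure is exactly the point debated later in this thread.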
TODO: write tickets
Remark: we can plug it into the blob module chooser so that we encrypt with AES only once for both S3 blob stores.
Should we have a cron job to ensure the consistency of the 2 AZs?
A cron job to trigger what? // Ah, a cron job to trigger WebAdmin to rerun tasks from the dead-letter queue.
Maybe checking for any mismatch between the 2 AZs, or processing events from the dead-letter queue in case retries fail.
As I understand it, we always trust
"When the write to blobStoreA fails -> total fail (even if the save to blobStoreB succeeds)" -> isn't this a partial failure?
No. The reverse is true too, read again #1166 (comment) |
I tried to read it again, but I do not see what is wrong. Benoit gives 3 examples; they do not contain the case where A fails and B succeeds, but I think it is closest to that case. Where did I go wrong?
From Benoit: "Partial failure: this means we need to set up a RabbitMQ queue to retry failed writes. The listener of the queue would then asynchronously read blobStoreA to complete the write on blobStoreB." I think the step
The reverse meaning: |
I see "(or the reverse)". I propose: WHEN the write on blobStoreA fails and the write on blobStoreB succeeds -> TOTAL FAIL. WDYT?
Isn't it the job of the RabbitMQ queue to retry the failed item on one of the blob stores? Read blobStoreA to get the blob. Missing? Then read it from blobStoreB and write it to A. Not sure about your concern here.
@chibenwa thoughts on @vttranlina concern above? |
We are dealing with immutable data. Not a concern as long as we rely on RabbitMQ for resiliency. We would only get residual data on failure, the way we already get them with our current architecture. Not a concern. Though I will be short on time to provide you with a formal demonstration.
This needs to be deployed at least on CNB preprod for me to consider this done!
Sorry my bad |
Description
A customer really wishes that data loss never happens again and is paranoid about it.
We wish to offer them a Twake Mail feature to write across 2 availability zones, synchronously.
(disclaimer: I personally advocate against this feature...)
Thus, in case of failure:
Configuration changes
In blob.properties
Plugged to a Tmail backend module chooser.
Code & location
Maven module:
tmail-backend/blob/secondary-blob-store
Write a SecondaryBlobStoreDAO class that takes 2 blob store DAOs. Plug this into the TMail blob module chooser.
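One possible shape for such a class is sketched below. This is only an illustration: the real TMail BlobStoreDAO contract is reactive and differs from this simplified interface, and every name here other than SecondaryBlobStoreDAO is made up for the example. Primary write failure propagates (total fail, matching the proposal discussed in the comments), a secondary write failure is handed to a retry publisher (RabbitMQ in the real design), and reads fall back to the secondary store.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Consumer;

// Illustrative stand-in for the blob store DAO contract (not the real API).
interface SimpleBlobStoreDAO {
    void save(String bucket, String blobId, byte[] data);
    byte[] read(String bucket, String blobId); // returns null when absent
}

// Tiny in-memory implementation, used for demonstration only.
class InMemoryBlobStoreDAO implements SimpleBlobStoreDAO {
    private final Map<String, byte[]> store = new HashMap<>();
    public void save(String bucket, String blobId, byte[] data) { store.put(bucket + "/" + blobId, data); }
    public byte[] read(String bucket, String blobId) { return store.get(bucket + "/" + blobId); }
}

// Sketch of SecondaryBlobStoreDAO: wraps a primary and a secondary DAO.
class SecondaryBlobStoreDAO implements SimpleBlobStoreDAO {
    private final SimpleBlobStoreDAO primary;
    private final SimpleBlobStoreDAO secondary;
    private final Consumer<String> retryPublisher; // RabbitMQ in the real design

    SecondaryBlobStoreDAO(SimpleBlobStoreDAO primary, SimpleBlobStoreDAO secondary,
                          Consumer<String> retryPublisher) {
        this.primary = primary;
        this.secondary = secondary;
        this.retryPublisher = retryPublisher;
    }

    @Override
    public void save(String bucket, String blobId, byte[] data) {
        primary.save(bucket, blobId, data); // a failure here propagates: total fail
        try {
            secondary.save(bucket, blobId, data);
        } catch (RuntimeException e) {
            // Partial failure: schedule asynchronous completion on the secondary.
            retryPublisher.accept(bucket + "/" + blobId);
        }
    }

    @Override
    public byte[] read(String bucket, String blobId) {
        try {
            byte[] data = primary.read(bucket, blobId);
            if (data != null) {
                return data;
            }
        } catch (RuntimeException ignored) {
            // fall through to the secondary store
        }
        return secondary.read(bucket, blobId);
    }
}
```

Wiring this wrapper into the blob module chooser (as the remark above suggests) would also let AES encryption happen once, before the write fans out to both S3 blob stores.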
Definition of done:
SecondaryBlobStoreDAO