
Support (S3, GCP, Azure) storage classes #737

Conversation

mohammad-aburadeh
Contributor

Medusa does not support specifying a storage class name when uploading backups to S3/GCP/Azure. This is very important for many customers, as it can help reduce storage costs.

Closes #568
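
For illustration, a minimal sketch of what the new setting could look like in medusa.ini; the `storage_class` key comes from this PR's discussion, while the surrounding keys and values are placeholders, so the exact syntax should be checked against medusa-example.ini:

```ini
[storage]
storage_provider = s3
bucket_name = my-medusa-bucket   ; placeholder bucket name
; Hypothetical per-provider storage class, e.g. STANDARD or STANDARD_IA on S3.
storage_class = STANDARD_IA
```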


sonarcloud bot commented Apr 7, 2024

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
No data about Coverage
0.0% Duplication on New Code

See analysis details on SonarCloud

@rzvoncek rzvoncek (Contributor) left a comment

Hello @mohammad-aburadeh, and sorry to keep you waiting for so long. I've finally managed to have a proper look at this and tried to test it out. Sadly, I have bad news.

In the context of S3, we need to limit the storage classes we support. It doesn't seem to be possible to read from Glacier and below, so we should not let Medusa write to those classes, because it won't be able to read its own data back.

In the context of GCS, the explicit storage classes don't seem to work at all; the header seems to be simply ignored. I didn't find a way to do this aside from setting a default storage class on the entire bucket.

In the context of Azure, we need to pass enums, not strings, because that's what the client we use expects.

Because of this, I'm sorry to return the PR to you and kindly ask you to do one more iteration to fix/improve things.
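
To make the requested changes concrete, here is a minimal Python sketch under the review's assumptions; the helper names and the exact S3 allow-list are illustrative (not Medusa's actual code), while `StandardBlobTier` is the real enum from the azure-storage-blob client:

```python
from azure.storage.blob import StandardBlobTier

# Only storage classes Medusa can read back directly; Glacier tiers are
# excluded because objects there need a restore before they are readable.
ALLOWED_S3_STORAGE_CLASSES = {
    "STANDARD",
    "STANDARD_IA",
    "ONEZONE_IA",
    "INTELLIGENT_TIERING",
}


def validate_s3_storage_class(storage_class: str) -> str:
    """Reject S3 storage classes that Medusa could not read after writing."""
    sc = storage_class.upper()
    if sc not in ALLOWED_S3_STORAGE_CLASSES:
        raise ValueError(f"Unsupported S3 storage class: {storage_class}")
    return sc


def to_azure_blob_tier(storage_class: str) -> StandardBlobTier:
    """Map a configured string to the enum the Azure client expects."""
    # StandardBlobTier members include Hot, Cool and Archive.
    return StandardBlobTier(storage_class.capitalize())
```

The Azure client then takes the enum rather than a raw string, e.g. via the `standard_blob_tier` keyword of `BlobClient.upload_blob`.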

Resolved review threads on: medusa-example.ini, docs/Configuration.md, tests/resources/config/medusa-azure_blobs.ini, medusa/storage/google_storage.py

```python
resp = await self.gcs_storage.upload(
    bucket=self.bucket_name,
    object_name=object_key,
    file_data=data,
    force_resumable_upload=True,
    timeout=-1,
    headers=ex_header,  # extra headers carrying the storage class
)
```
A Contributor commented on this block:

I could not get this to work with GCS. All the uploads I did ended up with the Standard storage class. Switching the default bucket storage class mode to Managed with Autoclass did not help either.

It's as if GCS ignores the HTTP header: even if I set it in the request, the response comes back with 'storageClass': 'STANDARD'.
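
As a workaround sketch, the default class can be set on the whole bucket with the synchronous google-cloud-storage client (Medusa itself uses an async client, and the bucket name here is a placeholder):

```python
from google.cloud import storage

# Per-object storage-class headers appeared to be ignored in these tests,
# so set a bucket-wide default instead. New uploads inherit this default;
# existing objects keep their current class.
client = storage.Client()
bucket = client.get_bucket("my-medusa-bucket")  # placeholder name
bucket.storage_class = "NEARLINE"  # or COLDLINE / ARCHIVE / STANDARD
bucket.patch()
```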

Resolved review threads on: medusa-example.ini, medusa/storage/azure_storage.py
@federicobaldo

Hi everyone, is any help needed here? This feature is very interesting and I would like to try it ASAP.

@kantipudipythian

Hi,
We tried the following steps to add the storage_class. We are using an AWS S3 bucket for storing the backup files:

  1. Upgraded the Medusa version from 0.17.2 to 0.21.0.
  2. Added the storage_class parameter to the medusa.ini file.
  3. Updated the config.py, abstract_storage.py, and s3_base_storage.py files accordingly.
  4. Ran a differential backup.

The backup was successful, but it took 1 hour to complete, whereas previous backups would finish within 2-5 minutes. We observed that most of that time was spent on the manifest.json file. Can you please let us know what the issue might be?
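
For reference, this is roughly how a storage class is passed on S3 uploads with boto3; it shows the underlying client mechanism a modified s3_base_storage.py would rely on, not Medusa's actual code, and the bucket/key names are placeholders:

```python
import boto3

s3 = boto3.client("s3")

# boto3's managed transfer accepts StorageClass through ExtraArgs, so the
# uploaded object (including multipart parts) is written with that class.
s3.upload_file(
    "manifest.json",
    "my-medusa-bucket",             # placeholder bucket
    "backups/node1/manifest.json",  # placeholder key
    ExtraArgs={"StorageClass": "STANDARD_IA"},
)
```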

@rzvoncek
Contributor

I've implemented the suggested changes and added integration tests over at https://github.com/thelastpickle/cassandra-medusa/pull/777/checks

@rzvoncek rzvoncek closed this Jun 12, 2024
@kantipudipythian

Hi,

Below are the tests done on the cluster:

Old Medusa version: 0.17.2
New Medusa version: 0.21.0
storage_class: STANDARD_IA

Test 1 (new Medusa version):
New bucket, storage_class parameter in the medusa.ini file.
Started backup.
The 2nd backup took the same time as the 1st backup, around 50 minutes.

Test 2 (old Medusa version):
New bucket, storage_class parameter in the medusa.ini file.
Started backup.
The backup was successful; the 1st backup took 50 minutes to complete. The 2nd backup was done within 1 minute.

Test 3 (old Medusa version):
New bucket, storage_class parameter in the medusa.ini file.
Modified config.py, abstract_storage.py, s3_base_storage.py.
Started backup.
The backup was successful; the 1st backup took 6 minutes to complete. The 2nd backup was done within 1 minute. (There were a few backups already in the bucket while taking this backup.)

Test 4 (old Medusa version):
New bucket, storage_class parameter in the medusa.ini file.
Modified config.py, abstract_storage.py, s3_base_storage.py.
Started backup with the empty new bucket.
The backup was successful; the 1st backup took 50 minutes to complete. The 2nd backup was done within 1 minute.

Can you please let us know why the new Medusa version with STANDARD_IA is taking more time?

Thanks,
Kanthi Rekha.

@mohammad-aburadeh mohammad-aburadeh deleted the support_storage_class branch July 23, 2024 13:31