Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix various issues that existed in the rotate_db_snapshots DAG #2158

Merged
merged 6 commits into from
May 25, 2023

Conversation

sarayourfriend
Copy link
Collaborator

@sarayourfriend sarayourfriend commented May 22, 2023

Fixes

Originally this started as a PR to fix hook_params not being templated in the AWS operators, but it turned into much more.

Original issue being fixed: If you have access to production Airflow, here are the logs:

https://airflow.openverse.engineering/dags/rotate_db_snapshots/grid?search=rotate_db_snapshots&dag_run_id=manual__2023-05-22T03%3A56%3A33.707669%2B00%3A00&task_id=create_snapshot&tab=logs

Log output copied:

[2023-05-22, 03:56:38 UTC] {taskinstance.py:1847} ERROR - Task failed with exception
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/providers/amazon/aws/operators/rds.py", line 103, in execute
    create_instance_snap = self.hook.conn.create_db_snapshot(
  File "/usr/local/lib/python3.10/functools.py", line 981, in __get__
    val = self.func(instance)
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/providers/amazon/aws/hooks/base_aws.py", line 649, in conn
    return self.get_client_type(region_name=self.region_name)
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/providers/amazon/aws/hooks/base_aws.py", line 614, in get_client_type
    return session.client(
  File "/home/airflow/.local/lib/python3.10/site-packages/boto3/session.py", line 299, in client
    return self._session.create_client(
  File "/home/airflow/.local/lib/python3.10/site-packages/botocore/session.py", line 918, in create_client
    region_name = self._resolve_region_name(region_name, config)
  File "/home/airflow/.local/lib/python3.10/site-packages/botocore/session.py", line 1002, in _resolve_region_name
    validate_region_name(region_name)
  File "/home/airflow/.local/lib/python3.10/site-packages/botocore/utils.py", line 1272, in validate_region_name
    raise InvalidRegionError(region_name=region_name)
botocore.exceptions.InvalidRegionError: Provided region_name '{{ var.value.AIRFLOW_RDS_REGION }}' doesn't match a supported format.
[2023-05-22, 03:56:38 UTC] {taskinstance.py:1368} INFO - Marking task as FAILED. dag_id=rotate_db_snapshots, task_id=create_snapshot, execution_date=20230522T035633, start_date=20230522T035637, end_date=20230522T035638

Description

Summary of fixes:

  • Pull region variable from Variable.get to work around hook_params not being a templated field on the AWS operators
  • Use the RdsHook in the Python operator to delete previous snapshots. This is mostly to make it easier to use the AWS_RDS_CONN_ID everywhere in this DAG.
  • As above, use AWS_RDS_CONN_ID everywhere (to enable local testing)
  • Remove unnecessary datetime cast (boto3 already returns datetime objects)
  • Be more selective about which snapshots get deleted: do not try to delete RDS automated snapshots (RDS does not allow them to be manually managed) and use a more specific prefix when creating snapshots in the DAG so we can select only snapshots created by the DAG to delete.

Testing Instructions

Run the DAG locally against the staging API database by setting CATALOG_RDS_DB_IDENTIFIER to dev-openverse-db. Create a new AWS connection via the Airflow UI named aws_rds_conn_id. Set CATALOG_RDS_SNAPSHOTS_TO_RETAIN to 2 and check against the RDS snapshots dashboard here: https://us-east-1.console.aws.amazon.com/rds/home?region=us-east-1#database:id=dev-openverse-db;is-cluster=false;tab=maintenance-and-backups

You should see a new snapshot get created. While the DAG waits for the snapshot to become available, you'll see an additional snapshot in the list with the prefix from the DAG. After the snapshot becomes available, you should see the oldest of the previously existing snapshots get deleted.

Checklist

  • My pull request has a descriptive title (not a vague title likeUpdate index.md).
  • My pull request targets the default branch of the repository (main) or a parent feature branch.
  • My commit messages follow best practices.
  • My code follows the established code style of the repository.
  • I added or updated tests for the changes I made (if applicable).
  • I added or updated documentation (if applicable).
  • I tried running the project locally and verified that there are no visible errors.
  • I ran the DAG documentation generator (if applicable).

Developer Certificate of Origin

Developer Certificate of Origin
Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

@sarayourfriend sarayourfriend added 🟧 priority: high Stalls work on the project or its dependents 🛠 goal: fix Bug fix 💻 aspect: code Concerns the software code in the repository 🧱 stack: catalog Related to the catalog and Airflow DAGs labels May 22, 2023
@sarayourfriend
Copy link
Collaborator Author

It doesn't seem like there's actually any way to configure the API versions used by Airflow. It totally ignores additional settings for anything other than S3, as far as I can tell, when configuring the STS client. If @AetherUnbound and @stacimc can confirm there isn't anything obvious I'm doing wrong that is causing this to occur locally, I'll open an issue with Airflow to ask for help/see if there's a bug here or if there's a way to explicitly configure the API version for STS (if indeed that's really what's going on here). The "missing parameter" message confuses me because the STS error doesn't appear to be related to that. AWS claims it only emits missing parameter errors when a parameter is missing (seems obvious) but there are no missing parameters, at least not at the level of our code.

@AetherUnbound
Copy link
Collaborator

If you're trying to run this locally, it's likely attempting to use the local dev AWS account (which interacts with moto as a mock S3) in order to run the actual DB operations. I recall receiving this error too when testing for #2099 - I ended up adding a separate conn ID (AWS_RDS_CONN_ID) which could be used to set up different AWS creds: https://github.com/WordPress/openverse/pull/2099/files#diff-8937af2ab4478aea5b53bb3cce8177e8cfb4eb6b0ff064a1a2d51e5e15782304R17

The reason we wouldn't want to change the base AWS connection info locally is because that would mean that all logging files and any ingestion that gets run locally would also be sent to production. We haven't had to do a whole lot of direct-to-AWS connection work yet beyond S3 so we don't have the best dev setup flow for this. It seems like setting up the AWS_RDS_CONN_ID piece on its own might be helpful for this kind of work. I plan on addressing the feedback and merging #2099 today, perhaps you could try testing this after a rebase and using that conn ID to see if that addresses the error?

More sensible Airflow variable names to reduce confusion of what is being referenced

Use the RdsHook to standardise the usage of AWS_RDS_CONN_ID for all RDS operations and simplify connection method

Update unit tests

Only delete snapshots managed by the Airflow DAG (ignore automatic snapshots and manually created snapshots that do not have the DAG-specific prefix)
@sarayourfriend sarayourfriend changed the title Workaround lack of hook_params templating Fix various issues that existed in the rotate_db_snapshots DAG May 24, 2023
@sarayourfriend sarayourfriend force-pushed the fix/region-name-template-rendering branch from c07749c to ded9ff1 Compare May 24, 2023 02:42
@sarayourfriend sarayourfriend marked this pull request as ready for review May 24, 2023 02:42
@sarayourfriend sarayourfriend requested a review from a team as a code owner May 24, 2023 02:42
@sarayourfriend sarayourfriend requested review from krysal and stacimc May 24, 2023 02:42
Comment on lines 36 to 38
# This cannot be pulled in the DAG itself because ``hook_params``
# on the operators provided by the amazon provider is not templated
RDS_REGION = Variable.get("CATALOG_RDS_REGION", "us-east-1")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder - I didn't need to supply a region in #2099 when I specify it in the Airflow connection parameter: AIRFLOW_CONN_AWS_RDS=aws://<id>:<secret>@?region_name=us-east-1

We really want to try and avoid Variable.gets in DAG files since they get run (unnecessarily) every time the DAGs are parsed. Are we able to remove this if we define the region using the conn ID now that we're using a separate RDS connection?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would assume so.

Copy link
Collaborator

@AetherUnbound AetherUnbound left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes look great, and I was able to test this locally as well and confirm the new snapshot was created and the old one deleted! I have one note for improving the sensor, otherwise this is good to go 🚀

@@ -87,16 +107,15 @@ def rotate_db_snapshots():
db_snapshot_identifier=snapshot_id,
# This is the default for ``target_statuses`` but making it explicit is clearer
target_statuses=["available"],
hook_params=hook_params,
aws_conn_id=AWS_RDS_CONN_ID,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apologies for missing this in previous reviews - this should have the mode=reschedule parameter set as well so it doesn't take up a worker slot while waiting for the snapshot. (in general this reduces overhead for Airflow). Unfortunately Airflow doesn't have a way to set this as the default globally.

Copy link
Collaborator Author

@sarayourfriend sarayourfriend May 24, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really wish there was an easy thing like the eslint "no-restricted-syntax" rule for Python to encode these sorts of things. It's so tedious to try to remember all the dozens of little rules and best practices and for reviewers to have to try to catch them. Not to mention when the reviewer isn't someone who is so familiar with Airflow best practices!

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking the same as I was typing this out 😞 Way way easier to remember that way. We might be able to implement custom ruff linting checks perhaps, I'll try to look into it a bit!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apparently not yet: astral-sh/ruff#283

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would allow us to do this: https://github.com/hchasestevens/bellybutton

I remember now that I opened an issue for it: #1317

Copy link
Collaborator

@stacimc stacimc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! Tested with staging and it worked perfectly.

Comment on lines +49 to +57
# Other manual snapshots may exist; we only want to automatically manage
# ones this DAG created
snapshots = [
snapshot
for snapshot in snapshots
if snapshot["DBSnapshotIdentifier"].startswith(
AIRFLOW_MANAGED_SNAPSHOT_ID_PREFIX
)
]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice detail!

@sarayourfriend sarayourfriend merged commit 4d8e8f5 into main May 25, 2023
@sarayourfriend sarayourfriend deleted the fix/region-name-template-rendering branch May 25, 2023 21:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
💻 aspect: code Concerns the software code in the repository 🛠 goal: fix Bug fix 🟧 priority: high Stalls work on the project or its dependents 🧱 stack: catalog Related to the catalog and Airflow DAGs
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

4 participants