Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add variable to disable removing sql source files during ingestion #4216

Conversation

madewithkode
Copy link
Collaborator

Fixes

Fixes #3847 by @AetherUnbound

Description

Added an Airflow Variable AIRFLOW_VAR_SQL_RM_SOURCE_DATA_AFTER_INGESTION which would be used to control the removal or retention of source files used for ingestion.

Testing Instructions

Checklist

  • My pull request has a descriptive title (not a vague title likeUpdate index.md).
  • My pull request targets the default branch of the repository (main) or a parent feature branch.
  • My commit messages follow best practices.
  • My code follows the established code style of the repository.
  • I added or updated tests for the changes I made (if applicable).
  • I added or updated documentation (if applicable).
  • I tried running the project locally and verified that there are no visible errors.
  • I ran the DAG documentation generator (if applicable).

Developer Certificate of Origin

Developer Certificate of Origin
Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

@openverse-bot openverse-bot added 🟩 priority: low Low priority and doesn't need to be rushed ✨ goal: improvement Improvement to an existing user-facing feature 🤖 aspect: dx Concerns developers' experience with the codebase 🚦 status: awaiting triage Has not been triaged & therefore, not ready for work labels Apr 27, 2024
@obulat obulat added 🧱 stack: catalog Related to the catalog and Airflow DAGs and removed 🚦 status: awaiting triage Has not been triaged & therefore, not ready for work labels Apr 29, 2024
@AetherUnbound AetherUnbound changed the title Add variable to disable removing sql source files for ingestion workfl… Add variable to disable removing sql source files during ingestion Apr 29, 2024
Copy link
Collaborator

@AetherUnbound AetherUnbound left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a great start @madewithkode! You asked about testing, you should actually be able to test this by enabling the inaturalist_workflow locally and letting it run! It will have to download a large resource as part of the execution, so just noting that it may take some time. If everything is set up appropriately (e.g. the Variable is correctly set to false), then the last two steps of the postingestion_tasks workflow should skip and subsequent runs of the DAG locally should not need to redownload the dataset which was downloaded on the first run!

This is also a note for maintainers: there does not seem to be a way to default a Variable retrieval from within a template, and so the task will fail if the Variable is not defined. We will need to add SQL_RM_SOURCE_DATA_AFTER_INGESTION to the production Variables with the default of true, I'll go ahead and do that now since it should have no affect on anything else.

python_callable=(lambda x: x),
op_args=[
"{{ params.sql_rm_source_data_after_ingesting }}",
"{{ var.json.AIRFLOW_VAR_SQL_RM_SOURCE_DATA_AFTER_INGESTION}}",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One spacing nit, along with a fix to the variable name (AIRFLOW_VAR is only prepended for the environment variable, it gets removed when used within Airflow).

Suggested change
"{{ var.json.AIRFLOW_VAR_SQL_RM_SOURCE_DATA_AFTER_INGESTION}}",
"{{ var.json.SQL_RM_SOURCE_DATA_AFTER_INGESTION }}",

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you should actually be able to test this by enabling the inaturalist_workflow locally and letting it run!

Hi @AetherUnbound, Is it possible to outline actual steps to do this? Or maybe point to me to a section of the doc(if any). In the meantime, I have updated the PR with the suggested changes.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh, apologies! You should be able to take the following steps:

  1. Run just c to start the catalog stack
  2. Navigate to http://localhost:9090
  3. Use airflow/airflow as the username and password respectively to log in
  4. Scroll down to the inaturalist_workflow and click on it
  5. Click the "unpause DAG" button in the top right, a run will get kicked off
    image

For more info about how Airflow works in general, check out their documentation on the UI.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you so much, this makes more sense.

"{{ params.sql_rm_source_data_after_ingesting }}",
"{{ var.json.AIRFLOW_VAR_SQL_RM_SOURCE_DATA_AFTER_INGESTION}}",
],
python_callable=(lambda x, y: False if not x or not y else True),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this could be simplified:

Suggested change
python_callable=(lambda x, y: False if not x or not y else True),
python_callable=(lambda *x: any(x)),

@AetherUnbound
Copy link
Collaborator

AetherUnbound commented Apr 30, 2024

Oh, it looks like we'll also want to add that condition here too:

"remove_api_files": "{{params.sql_rm_source_data_after_ingesting}}"

Apologies for missing that! Using the Variable there as well will prevent the dataset from being redownloaded every time.

@madewithkode madewithkode force-pushed the 3847_add_variable_to_disable_removing_sql_source_files_for_ingestion_workflows branch from 97a5855 to 477cdc6 Compare April 30, 2024 11:05
Comment on lines 331 to 334
"remove_api_files": any(
"{{ params.sql_rm_source_data_after_ingesting }}",
"{{ var.json.SQL_RM_SOURCE_DATA_AFTER_INGESTION}}",
)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we can call any here because (for complicated Airflow reasons) the any gets evaluated when the DAG is parsed, but not when it's run. When the DAG is parsed, the values passed into any will be the strings we see here. They're only converted into the actual values when the DAG is actually run. I think this should work:

Suggested change
"remove_api_files": any(
"{{ params.sql_rm_source_data_after_ingesting }}",
"{{ var.json.SQL_RM_SOURCE_DATA_AFTER_INGESTION}}",
)
"remove_api_files": "{{ params.sql_rm_source_data_after_ingesting and var.json.SQL_RM_SOURCE_DATA_AFTER_INGESTION }}",

@madewithkode madewithkode force-pushed the 3847_add_variable_to_disable_removing_sql_source_files_for_ingestion_workflows branch from 477cdc6 to cef7e6b Compare May 2, 2024 07:45
@madewithkode
Copy link
Collaborator Author

@AetherUnbound all suggested code changes have been made.
P.S: I am unable to verify the intended behavior on my end due to repeated failures while trying to download the inaturalist_workflow source files(they seem quite large and I have a not-so-reliable internet). I'm hoping there's a way for you to verify and okay this from your end.

@AetherUnbound
Copy link
Collaborator

@madewithkode thanks for making those changes! I was able to test this locally, and opted to make two other changes:

  • Set the default of the parameter to False, since we now have the Variable which can act as a default/fallback
  • Change the condition for removing the files to or, so only one of them has to be specified as true in order to ensure deletion

This has the effect making it so that either the parameter or the Variable being set will remove the files, but by default locally, they won't be removed. In production, since we have the Variable set to true, it will always be removed.

I was able to verify this behavior by:

  • Run the DAG once (so the file was downloaded) and check that the cleanup tasks were skipped
  • Mark the DAG as failed after it is completed (so a follow-up run doesn't skip during the "check previous success" step)
  • Run the DAG again and verify that the load_catalog_of_life_names tasks doesn't have to redownload the COL files.

All that to say, this looks good to me and can be taken out of draft when you're ready!

@madewithkode
Copy link
Collaborator Author

@AetherUnbound All these makes sense. Proceeding to unmark the PR as draft now. Thank you for the extra effort!

@madewithkode madewithkode marked this pull request as ready for review May 3, 2024 11:39
@madewithkode madewithkode requested review from a team as code owners May 3, 2024 11:39
Copy link
Collaborator

@stacimc stacimc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! I ran into some issues locally with the iNaturalist pull and load TaskGroups, but I confirmed that I see this on main as well so it's probably something in my local. I was able to just manually pass those steps and only test the changes made here. Works great for me.

Copy link
Collaborator

@AetherUnbound AetherUnbound left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hope it's alright to approve this one too 😁 Thanks again @madewithkode!

@AetherUnbound AetherUnbound merged commit 12e7c87 into WordPress:main May 6, 2024
45 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🤖 aspect: dx Concerns developers' experience with the codebase ✨ goal: improvement Improvement to an existing user-facing feature 🟩 priority: low Low priority and doesn't need to be rushed 🧱 stack: catalog Related to the catalog and Airflow DAGs
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

Add variable to disable removing SQL source files for ingestion workflows
5 participants