Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add variable to disable removing SQL source files for ingestion workflows #3847

Closed
AetherUnbound opened this issue Feb 29, 2024 · 7 comments · Fixed by #4216
Closed

Add variable to disable removing SQL source files for ingestion workflows #3847

AetherUnbound opened this issue Feb 29, 2024 · 7 comments · Fixed by #4216
Assignees
Labels
🤖 aspect: dx Concerns developers' experience with the codebase ✨ goal: improvement Improvement to an existing user-facing feature good first issue New-contributor friendly help wanted Open to participation from the community 🟩 priority: low Low priority and doesn't need to be rushed 🧱 stack: catalog Related to the catalog and Airflow DAGs 🔧 tech: airflow Involves Apache Airflow 🐍 tech: python Involves Python

Comments

@AetherUnbound
Copy link
Collaborator

Description

The iNaturalist DAG uses the ingestion workflow's sql_rm_source_data_after_ingesting parameter to determine whether it should remove or retain the source files used for ingestion:

"sql_rm_source_data_after_ingesting": Param(
default=True,
type="boolean",
description=(
"Whether to delete source data from airflow and DB once ingestion"
" is complete. This is used to support data quality testing in"
" SQL-only DAGs like iNaturalist."
),
),

While this is useful for specific runs, the iNaturalist DAG is scheduled, which means that the default run that gets kicked off locally when the DAG is enabled will remove the source files. Since these can be quite large, it's tedious and time consuming to have to manage triggering each run with the sql_rm_source_data_after_ingesting box unchecked.

We should also have an Airflow Variable which will also determine whether the files should be removed or not. The value could potentially be SQL_RM_SOURCE_DATA_AFTER_INGESTION, meaning the name of the variable added to our env.template file would be AIRFLOW_VAR_SQL_RM_SOURCE_DATA_AFTER_INGESTION. This should be True by default in the code, but False as defined in the env.template file so by default, local runs will save source files.

We will also need to update the short-circuit task for skipping this to include checking this variable as well:

check_drop_parameter = ShortCircuitOperator(
task_id="check_drop_parameter",
doc_md="Skip post-ingestion if NOT sql_rm_source_data_after_ingesting.",
op_args=["{{ params.sql_rm_source_data_after_ingesting }}"],
python_callable=(lambda x: x),
trigger_rule=TriggerRule.NONE_SKIPPED,
# just skip the drop steps, not the final reporting step in the dag
ignore_downstream_trigger_rules=False,
)

The check should be such that if either the param or the Airflow Variable are set to False, the files are retained. We should be able to use the {{ var.json.<variable_name> }} syntax for templating this into the op_args similar to the param.

Additional context

See #3846 for the impetus

@AetherUnbound AetherUnbound added good first issue New-contributor friendly help wanted Open to participation from the community ✨ goal: improvement Improvement to an existing user-facing feature 🐍 tech: python Involves Python 🔧 tech: airflow Involves Apache Airflow 🟩 priority: low Low priority and doesn't need to be rushed 🤖 aspect: dx Concerns developers' experience with the codebase 🧱 stack: catalog Related to the catalog and Airflow DAGs labels Feb 29, 2024
@openverse-bot openverse-bot moved this to 📋 Backlog in Openverse Backlog Feb 29, 2024
@Pqformeln
Copy link

Hi,

I would like to work on this issue!

@AetherUnbound
Copy link
Collaborator Author

Fantastic, welcome @Pqformeln! I'll assign the issue to you 😄 Please check out our welcome and quickstart documentation pages, and if you have any questions about this issue feel free to leave them here!

@madewithkode
Copy link
Collaborator

I'd love to take on this @AetherUnbound @obulat

@AetherUnbound
Copy link
Collaborator Author

I'll assign you @madewithkode! Let us know if you have any questions about it 😄

@madewithkode
Copy link
Collaborator

Thanks @AetherUnbound. I think I have basically done what the task requires(not exactly sure how to test though). Can I go ahead and sort of open a draft MR so someone can take a look and see if my understanding of the requirements is correct?

@AetherUnbound
Copy link
Collaborator Author

Absolutely!

@madewithkode
Copy link
Collaborator

Hi @AetherUnbound, I have now made a draft PR here so we can confirm if I'm in the right direction and also provide ways to test.

@openverse-bot openverse-bot moved this from 🏗 In Progress to ✅ Done in Openverse Backlog May 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🤖 aspect: dx Concerns developers' experience with the codebase ✨ goal: improvement Improvement to an existing user-facing feature good first issue New-contributor friendly help wanted Open to participation from the community 🟩 priority: low Low priority and doesn't need to be rushed 🧱 stack: catalog Related to the catalog and Airflow DAGs 🔧 tech: airflow Involves Apache Airflow 🐍 tech: python Involves Python
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

3 participants