Add variable to disable removing SQL source files for ingestion workflows #3847
Labels
🤖 aspect: dx
Concerns developers' experience with the codebase
✨ goal: improvement
Improvement to an existing user-facing feature
good first issue
New-contributor friendly
help wanted
Open to participation from the community
🟩 priority: low
Low priority and doesn't need to be rushed
🧱 stack: catalog
Related to the catalog and Airflow DAGs
🔧 tech: airflow
Involves Apache Airflow
🐍 tech: python
Involves Python
Description
The iNaturalist DAG uses the ingestion workflow's `sql_rm_source_data_after_ingesting` parameter to determine whether it should remove or retain the source files used for ingestion:

openverse/catalog/dags/providers/provider_dag_factory.py
Lines 422 to 430 in 2cffcb9
While this is useful for specific runs, the iNaturalist DAG is scheduled, which means that the default run that gets kicked off locally when the DAG is enabled will remove the source files. Since these can be quite large, it's tedious and time-consuming to have to trigger each run manually with the `sql_rm_source_data_after_ingesting` box unchecked.

We should also have an Airflow Variable that determines whether the files should be removed. The variable could be named `SQL_RM_SOURCE_DATA_AFTER_INGESTION`, meaning the name added to our `env.template` file would be `AIRFLOW_VAR_SQL_RM_SOURCE_DATA_AFTER_INGESTION`. This should be `True` by default in the code, but `False` as defined in the `env.template` file, so that local runs save source files by default.

We will also need to update the short-circuit task for skipping this to include checking this variable as well:
openverse/catalog/dags/providers/provider_api_scripts/inaturalist.py
Lines 347 to 355 in dca0110
The check should be such that if either the param or the Airflow Variable is set to `False`, the files are retained. We should be able to use the `{{ var.json.<variable_name> }}` syntax for templating this into the `op_args`, similar to the param.

Additional context
See #3846 for the impetus
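
For anyone picking this up, here is a rough sketch of the behaviour described above, written in plain Python so it runs without Airflow. It is not the code in `provider_dag_factory.py` or `inaturalist.py`. The helper names (`read_rm_variable`, `as_bool`, `should_remove_source_data`) are hypothetical, and the environment lookup only stands in for Airflow's own Variable resolution. It assumes the variable is stored as a JSON boolean (as `var.json` implies) and that templated `op_args` may arrive as strings such as `"True"` or `"False"`.

```python
import json
import os

# Variable name proposed in this issue; Airflow exposes Variables set through
# the environment under the AIRFLOW_VAR_ prefix.
VARIABLE_NAME = "SQL_RM_SOURCE_DATA_AFTER_INGESTION"
ENV_NAME = f"AIRFLOW_VAR_{VARIABLE_NAME}"


def read_rm_variable(environ=None) -> bool:
    """Hypothetical stand-in for the Airflow Variable lookup.

    Defaults to True (remove files) when the variable is unset, so only
    environments that opt out, such as local ones, retain the files.
    """
    environ = os.environ if environ is None else environ
    raw = environ.get(ENV_NAME)
    if raw is None:
        return True
    # Lowercase first so that both "False" and the JSON literal "false" parse.
    return bool(json.loads(raw.lower()))


def as_bool(value) -> bool:
    """Coerce a rendered template value ("True", "false", True, ...) to a bool."""
    if isinstance(value, str):
        return value.strip().lower() == "true"
    return bool(value)


def should_remove_source_data(param_value, variable_value) -> bool:
    """Remove source files only when both the param and the Variable allow it.

    If either one is False, the files are retained.
    """
    return as_bool(param_value) and as_bool(variable_value)
```

With this shape, a local `env.template` entry that sets the variable to `false` makes `should_remove_source_data` return `False` regardless of the DAG param, while leaving the variable unset preserves the current behaviour.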