AWS Athena integration reengineering #1698

Open
svdimchenko opened this issue Sep 9, 2024 · 2 comments

Comments

@svdimchenko
Contributor

svdimchenko commented Sep 9, 2024

Is your feature request related to a problem? Please describe.
Currently I'm using AWS Athena as the query engine for my dbt transformations.
The problems with integrating elementary are the following:

  • If I use the Iceberg table format, it does not support parallel inserts, so execution time increases when running parallel dbt models with Airflow.
  • If I use the Hive table format, the parallel ingestion problem is resolved, but the file count grows enormously, so s3:GetObject and s3:ListBucket costs increase as well.

Describe the solution you'd like
There are several possible solutions I can suggest:

  1. Implement partitioning for the elementary tables and use the partition fields in the monitoring models. Unfortunately, the created_at field cannot be used for partitioning with the Hive table format, so we would need to add a created_at_date field and use it for partition pruning (see the sketch after this list).

  2. Implement the possibility of loading dbt artifacts into a separate backend, for instance AWS RDS. Currently, elementary loads data from the dbt context and there is no way to work with dbt's JSON files (run_results.json, manifest.json, etc.). DataHub is an example of how such JSON files can be ingested into an external database.
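
A minimal sketch of what solution 1 could look like, assuming dbt-athena's table_type and partitioned_by model configs and a hypothetical created_at_date column (the elementary models themselves would also need to emit that column):

    # dbt_project.yml of the consuming project (illustrative only)
    models:
      elementary:
        +table_type: hive
        +partitioned_by: ['created_at_date']

The monitoring models would then filter on created_at_date (instead of, or in addition to, created_at) so Athena can prune partitions.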

Describe alternatives you've considered
As a quick workaround, I can keep the elementary tables in Hive format and set up an S3 bucket lifecycle policy to remove outdated elementary data, but such an approach requires careful S3 bucket tuning for every specific elementary table, which can be tricky.

Would you be willing to contribute this feature?
Once we clarify the most appropriate approach for the Athena integration, I can of course contribute.

@ofek1weiss
Contributor

Hey @svdimchenko
A workaround that might work for you is to separate the artifact uploading into a different job. This can be done as follows:

  • Add the following var to your dbt_project.yml:

      vars:
        disable_dbt_artifacts_autoupload: true

  • Run the command dbt run --select edr.dbt_artifacts in a separate job

This way the metadata is not uploaded after every job (avoiding the parallel uploading), but it is still kept up to date.
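
For example, assuming Airflow is the orchestrator (as mentioned above), the separate job could look roughly like this minimal sketch (the DAG and task names are illustrative):

    # Illustrative Airflow DAG that runs the artifacts upload as its own task
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(
        dag_id="elementary_dbt_artifacts_upload",  # hypothetical name
        start_date=datetime(2024, 9, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        upload_artifacts = BashOperator(
            task_id="upload_dbt_artifacts",
            bash_command="dbt run --select edr.dbt_artifacts",
        )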

Let me know if this helps 🙏

@ofek1weiss ofek1weiss self-assigned this Sep 23, 2024
@svdimchenko
Contributor Author

Hey @ofek1weiss! Thanks for your feedback. I'm already using disable_dbt_artifacts_autoupload: true, however this does not help with exporting run results.
What I'm thinking of trying is to get the dbt run results as a Python object via dbt's programmatic invocation, then run dbt once again with another profile and pass the run results from the previous run as an argument to some macro, which would store the data in whichever backend suits better (sketched below).
This approach would not need elementary's on-run-end hook.
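
A minimal sketch of that idea, assuming dbt's dbtRunner programmatic invocation; the store_run_results macro and the rds target are hypothetical:

    # Illustrative only: capture run results in Python, then hand a summary
    # to a second dbt invocation that writes them to a different backend.
    import json

    from dbt.cli.main import dbtRunner, dbtRunnerResult

    runner = dbtRunner()

    # First invocation: run the models with the regular (Athena) target.
    run: dbtRunnerResult = runner.invoke(["run"])

    # Collect a minimal summary from the RunResult objects.
    results = [
        {
            "unique_id": r.node.unique_id,
            "status": str(r.status),
            "execution_time": r.execution_time,
        }
        for r in run.result
    ]

    # Second invocation: call a (hypothetical) macro against another target,
    # e.g. an RDS connection, that persists the results there.
    runner.invoke([
        "run-operation", "store_run_results",
        "--args", json.dumps({"results": results}),
        "--target", "rds",
    ])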

Otherwise, it would be nice if edr had some way to parse the required dbt artifacts and expose them to different backends. I'm still thinking about the final implementation, so I'd be glad to discuss other options.
