docs: Add docs on dbt Cloud integration #1763

Open · 2 commits to be merged into base: master

@danthelion (Contributor) commented Nov 7, 2024

Description:

(Describe the high level scope of new or changed features)

Workflow steps:

(How does one use this feature, and how has it changed)

Documentation links affected:

(list any documentation links that you created, or existing ones that you've identified as needing updates, along with a brief description)

Notes for reviewers:

(anything that might help someone review this PR)


@danthelion danthelion requested a review from mdibaiee November 7, 2024 19:03
github-actions bot commented Nov 7, 2024

PR Preview Action v1.4.8
🚀 Deployed preview to https://estuary.github.io/flow/pr-preview/pr-1763/
on branch gh-pages at 2024-11-08 15:24 UTC


- Job ID: The unique identifier for the dbt job you wish to trigger.
- Account ID: Your dbt account identifier.
- API Key: The dbt API key associated with your account. This allows Estuary Flow to authenticate with dbt Cloud and
Review comment from a Member:

They also need an Access URL, which is mandatory. I know the connector marks it as non-required, but that's only because we previously had Account Prefix, and to stay backward-compatible we had to keep the new field marked as non-required. We still validate that one of the two exists every time.
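
A minimal sketch of the "one of the two must exist" check described above. The function and parameter names (`validate_endpoint_config`, `access_url`, `account_prefix`) are hypothetical and chosen for illustration; this is not the connector's actual code.

```python
from typing import Optional


def validate_endpoint_config(
    access_url: Optional[str], account_prefix: Optional[str]
) -> None:
    """Reject a config where neither endpoint field is set.

    Both fields are optional in the schema (for backward compatibility),
    so the check has to happen at validation time instead.
    """
    # Empty strings are treated the same as missing values.
    if not (access_url or account_prefix):
        raise ValueError("either access_url or account_prefix must be set")
```

Calling it with only `access_url` or only `account_prefix` set passes; calling it with neither raises `ValueError`.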


### Optional Parameters

- Access URL: The dbt access URL can be found in your dbt Account Settings. Use this URL if your dbt account requires a
Review comment from a Member:

I think this is worth adding here, since a few customers have had this issue: if they can't find their Access URL in their dashboard, it is because they are older customers who have not yet migrated to the new API. In that case, their Access URL is https://cloud.getdbt.com/
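
To show how the Job ID, Account ID, API Key, and Access URL fit together, here is a hedged sketch that only builds (and does not send) a job-trigger request. It assumes the dbt Cloud API v2 "trigger job run" route and the `Token` authorization scheme as I understand them from the public dbt Cloud documentation; verify both before relying on this. `build_trigger_request` and `LEGACY_ACCESS_URL` are illustrative names, and this is not necessarily how the connector issues the call.

```python
from typing import Dict, Tuple

# Access URL for accounts that have not migrated to the new API
# (taken from the review comment above).
LEGACY_ACCESS_URL = "https://cloud.getdbt.com/"


def build_trigger_request(
    account_id: str,
    job_id: str,
    api_key: str,
    access_url: str = LEGACY_ACCESS_URL,
    cause: str = "Triggered after materialization",
) -> Tuple[str, Dict[str, str], Dict[str, str]]:
    """Return (url, headers, json_body) for a POST that triggers a dbt job."""
    # Normalize the trailing slash so the path is joined exactly once.
    base = access_url.rstrip("/")
    url = f"{base}/api/v2/accounts/{account_id}/jobs/{job_id}/run/"
    headers = {
        "Authorization": f"Token {api_key}",  # assumed auth scheme
        "Content-Type": "application/json",
    }
    body = {"cause": cause}
    return url, headers, body
```

The returned tuple can be handed to any HTTP client as a POST request; keeping the builder free of network calls makes it easy to check the URL before sending anything.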


### Job Management

If you want to avoid triggering multiple overlapping dbt jobs, set Job Trigger Mode to skip. This way, if a job is
Review comment from a Member:

I think it's worth mentioning this is the default behavior


### Regular Data Transformation on New Data

Suppose you have a data pipeline that ingests data into a warehouse every 1 hour (configured via a Sync Frequency),
Review comment from a Member:

The dbt Cloud trigger starts its timer as soon as the first data arrives at the connector, and every subsequent timer is likewise started when data arrives.

If a connector has a delay of 1 hour, this is what it would look like:

1. The connector starts up and runs a first dbt job trigger (this ensures consistency when the connector restarts).
2. It materializes one small chunk.
3. It starts the timer to trigger the dbt job in N minutes.
4. It materializes the rest of the chunks.
5. It starts the connector's 1-hour delay if it is not backfilling.
6. It triggers the dbt job once N minutes have passed since the timer started (this also applies during backfills).

So in that sense, it is best that their dbt job trigger interval is not very long. The default is 30 minutes, which means 30 minutes after the first bulk of data is committed. It is deliberately not very short, to avoid triggering many jobs during backfills, but it means that during non-backfill periods we wait 30 minutes after the first commit before triggering a job. How much latency this creates between the final data point being materialized and the dbt job triggering depends on how long it takes for their data to be materialized to the destination.

This is the current compromise we have to be able to set a minimum interval between dbt job triggers, support cases where connectors don't use Sync Interval, and support use cases where data arrival is very sparse (once a day, for example).
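
The timing described above can be sketched as a small simulation. This is an illustrative model under stated assumptions, not the connector's implementation: time is counted in whole minutes, `TriggerTimer` and `simulate` are hypothetical names, and commits that arrive while a timer is already pending are assumed not to move its deadline.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TriggerTimer:
    interval_min: int = 30  # default interval mentioned in the comment above
    deadline: Optional[int] = None  # minute at which the pending trigger fires
    triggered_at: List[int] = field(default_factory=list)

    def on_startup(self, now: int) -> None:
        # A job is triggered once at connector startup, for consistency
        # across restarts.
        self.triggered_at.append(now)

    def on_commit(self, now: int) -> None:
        # Only the first commit after an idle period arms the timer.
        if self.deadline is None:
            self.deadline = now + self.interval_min

    def tick(self, now: int) -> None:
        # Fire once the interval has elapsed, then wait for new data.
        if self.deadline is not None and now >= self.deadline:
            self.triggered_at.append(now)
            self.deadline = None


def simulate(commits: List[int], end: int, interval_min: int = 30) -> List[int]:
    """Return the minutes at which a dbt job would be triggered."""
    timer = TriggerTimer(interval_min=interval_min)
    timer.on_startup(0)
    commit_minutes = set(commits)
    for minute in range(end + 1):
        if minute in commit_minutes:
            timer.on_commit(minute)
        timer.tick(minute)
    return timer.triggered_at


# Commits at minutes 1, 5, 10, then again at 61 and 65 after the 1-hour delay.
print(simulate([1, 5, 10, 61, 65], end=120))  # [0, 31, 91]
```

In this run the last chunk of the first batch lands at minute 10 and the job fires at minute 31, so the latency between the final data point and the dbt job is 21 minutes; a longer materialization shortens that gap, while a longer trigger interval widens it.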

Reply from the Contributor (Author):

Thanks for the detailed writeup. I tried to incorporate this as best I could.

@danthelion danthelion requested a review from mdibaiee November 8, 2024 15:22

2 participants