[dbt] utility class for managing artifacts #19923

Merged

alangenfeld
Member

@alangenfeld alangenfeld commented Feb 20, 2024

The goal of this class is to centralize the necessary complexity that comes from seeking to provide both a nice developer experience and a performant deployed experience.

Currently, that complexity surfaces in several places, resulting in a rough experience:

  1. Surfacing this code as part of the scaffold and docs
import os
from pathlib import Path

from dagster_dbt import DbtCliResource

dbt_project_dir = Path(__file__).joinpath("..", "..", "..").resolve()
dbt = DbtCliResource(project_dir=os.fspath(dbt_project_dir))

# If DAGSTER_DBT_PARSE_PROJECT_ON_LOAD is set, a manifest will be created at run time.
# Otherwise, we expect a manifest to be present in the project's target directory.
if os.getenv("DAGSTER_DBT_PARSE_PROJECT_ON_LOAD"):
    dbt_manifest_path = (
        dbt.cli(
            ["--quiet", "parse"],
            target_path=Path("target"),
        )
        .wait()
        .target_path.joinpath("manifest.json")
    )
else:
    dbt_manifest_path = dbt_project_dir.joinpath("target", "manifest.json")

and instructing users to set an env var any time they run dagster dev, via DAGSTER_DBT_PARSE_PROJECT_ON_LOAD=1 dagster dev

  2. Handling the use of a package data directory, notably for pex packaging used by Dagster Cloud Serverless. Currently, the scaffold we present in that condition fails to load locally due to mishandling the assumptions made in different places.

  3. The CI/CD step that has to do the preparation that the code assumes, such as https://github.com/dagster-io/dagster-cloud-action/blob/7c2f65f979a21da1e621f768af6db5265586ee91/github/serverless/dbt/deploy.yml#L47-L56

A before and after of what this would look like for users can be seen in the upstack PR https://github.com/dagster-io/dagster/pull/19925/files
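As a rough illustration of the centralization described above, the sketch below folds the if/else boilerplate into one small helper. It is not the class added in this PR: `ManifestLocator` and its `prepare` callable are hypothetical stand-ins, and a real preparation step would run `dbt parse` through `DbtCliResource` instead of an injected function.

```python
import os
from pathlib import Path
from typing import Callable, Mapping, Optional


class ManifestLocator:
    """Hypothetical helper; illustrates the idea, not the dagster-dbt API."""

    def __init__(
        self,
        project_dir: Path,
        prepare: Callable[[Path], None],
        environ: Optional[Mapping[str, str]] = None,
    ) -> None:
        self.project_dir = Path(project_dir)
        # `prepare` stands in for running `dbt parse` against the project.
        self._prepare = prepare
        self._environ = os.environ if environ is None else environ

    def manifest_path(self) -> Path:
        # Regenerate the manifest only when the opt-in env var is set;
        # otherwise assume a build step already produced it.
        if self._environ.get("DAGSTER_DBT_PARSE_PROJECT_ON_LOAD"):
            self._prepare(self.project_dir)
        return self.project_dir.joinpath("target", "manifest.json")
```

Injecting the environment and the prepare step keeps the decision logic testable without a dbt installation.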

How I Tested These Changes

Added some unit tests; additional coverage comes from existing tests in the upstack PR that converts the scaffold to use this class.

@alangenfeld
Member Author

alangenfeld commented Feb 20, 2024

This stack of pull requests is managed by Graphite. Learn more about stacking.

@alangenfeld alangenfeld force-pushed the al/02-12-dagster_dev_env_var branch from 6483219 to aa1d0b0 Compare February 20, 2024 22:41
@alangenfeld alangenfeld force-pushed the al/02-20-_dbt_utility_class_for_managing_artifacts branch from 4ddbdfb to 3c00ba0 Compare February 20, 2024 22:41
@alangenfeld alangenfeld force-pushed the al/02-12-dagster_dev_env_var branch from aa1d0b0 to 7a59c65 Compare February 20, 2024 23:14
@alangenfeld alangenfeld force-pushed the al/02-20-_dbt_utility_class_for_managing_artifacts branch from 3c00ba0 to 38aa692 Compare February 20, 2024 23:14
@alangenfeld alangenfeld force-pushed the al/02-12-dagster_dev_env_var branch from 7a59c65 to 323c756 Compare February 21, 2024 17:51
@alangenfeld alangenfeld force-pushed the al/02-20-_dbt_utility_class_for_managing_artifacts branch from 38aa692 to 7d6ba17 Compare February 21, 2024 17:51

Deploy preview for dagit-core-storybook ready!

✅ Preview
https://dagit-core-storybook-pq317c25t-elementl.vercel.app
https://al-02-20--dbt-utility-class-for-managing-artifacts.core-storybook.dagster-docs.io

Built with commit 7d6ba17.
This pull request is being automatically deployed with vercel-action

@alangenfeld alangenfeld marked this pull request as ready for review February 21, 2024 18:14
@alangenfeld
Member Author

still needs polish but i think this is ready for feedback, https://github.com/dagster-io/dagster/pull/19925/files is a good place to check how this is expected to play out

@alangenfeld alangenfeld force-pushed the al/02-20-_dbt_utility_class_for_managing_artifacts branch from 7d6ba17 to 487b638 Compare February 21, 2024 18:34
Contributor

@rexledesma rexledesma left a comment

Exciting

def _base_dir(self) -> Path:
    return self._package_data_dir if self._package_data_dir else self.project_dir

def _prepare_manifest(self) -> Path:
Contributor

I can imagine that this is an abstractmethod for overriding this class in the future

Member Author

Thinking was that prepare_command would avoid a good chunk of cases. Do you think _ prefix should get dropped if people might override?

@alangenfeld alangenfeld force-pushed the al/02-12-dagster_dev_env_var branch from 323c756 to 208a7a4 Compare February 21, 2024 19:59
@alangenfeld alangenfeld force-pushed the al/02-20-_dbt_utility_class_for_managing_artifacts branch from 487b638 to 6b4505e Compare February 21, 2024 19:59
Base automatically changed from al/02-12-dagster_dev_env_var to master February 21, 2024 20:59
Member

@schrockn schrockn left a comment

Have some questions and suggestions

Comment on lines 101 to 104
# if launched via `dagster dev` cli
bool(os.getenv("DAGSTER_IS_DEV_CLI"))
or
Member

what if someone uses dagster dev against pre-compiled manifest.json?

Member Author

The behavior from the existing recommended setup is that if the env var is set we will still generate a new one, so that's what I went with.

I think if you are actively iterating locally, the previous manifest.json will be there, so it's hard to distinguish pre-compiled from development iteration.

I think if you want to develop against a precompiled manifest, the way out would be to not use this class at all.

I could be convinced that an opt-out env var makes sense though.
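A minimal sketch of how such an opt-out could sit alongside the dev-CLI check. The opt-out variable name `EXAMPLE_DBT_SKIP_PREPARE` is made up for illustration, and the second operand of the `or` is an assumption, since the quoted snippet is cut off after `DAGSTER_IS_DEV_CLI`.

```python
import os
from typing import Mapping, Optional


def should_prepare_at_runtime(environ: Optional[Mapping[str, str]] = None) -> bool:
    env = os.environ if environ is None else environ
    # Hypothetical opt-out; not an existing Dagster environment variable.
    if env.get("EXAMPLE_DBT_SKIP_PREPARE"):
        return False
    return (
        # Set when launched via the `dagster dev` CLI.
        bool(env.get("DAGSTER_IS_DEV_CLI"))
        # Assumed second condition: the pre-existing explicit opt-in.
        or bool(env.get("DAGSTER_DBT_PARSE_PROJECT_ON_LOAD"))
    )
```

With this shape, someone running `dagster dev` against a pre-compiled manifest.json would set the opt-out and keep the existing file.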

@schrockn
Member

If this is meant to be used directly by users, an example usage in the PR description with before/after would be useful.

@alangenfeld alangenfeld force-pushed the al/02-20-_dbt_utility_class_for_managing_artifacts branch 2 times, most recently from f37eebf to 046650b Compare February 22, 2024 19:48
@alangenfeld alangenfeld dismissed schrockn’s stale review February 22, 2024 19:49

handled inlines, summary update with link to PR for before & after

Contributor

@OwenKephart OwenKephart left a comment

Might just be my brain breaking, but I found the internal naming conventions here pretty difficult to follow, and so suggested some updates / small refactors

Let me know if I'm off base on my understanding of things here

High level though, I'm excited about this behavior and think this is a good interface for people to interact with.

Disable logging
Default: False
"""
logger = _NULL_LOGGER if quiet else logging.getLogger("dagster-dbt-artifacts")
Contributor

I think there was more discussion elsewhere, but any particular reason for not logging at all instead of just logging at DEBUG level vs. INFO level? I feel like that's a more common pattern in our codebase.

No strong feelings there just trying to understand.
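For comparison, the DEBUG-versus-INFO pattern mentioned here could look roughly like the following. The helper names `log_level` and `report_prepared` are illustrative only and are not taken from the PR.

```python
import logging

logger = logging.getLogger("dagster-dbt-artifacts")


def log_level(quiet: bool) -> int:
    # Quiet mode demotes messages to DEBUG instead of discarding them,
    # so they remain available when debug logging is enabled.
    return logging.DEBUG if quiet else logging.INFO


def report_prepared(quiet: bool = False) -> None:
    logger.log(log_level(quiet), "Preparation complete.")
```

The trade-off is that a null logger guarantees silence, while a level switch leaves the decision to the logging configuration.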

uses project_dir when in a development context.
"""
if self._should_prepare_at_runtime():
    return DbtCliResource(project_dir=os.fspath(self.project_dir))
Contributor

should this get **kwargs passed in as well?

Contributor

Related to my larger comment at the top but I'm also finding this confusing as to why _should_prepare_at_runtime is related to what path this should be pointing at. Functionally, I get that the method checks if you're currently running from a dagster dev invocation, but that seems like a separate concern from if DAGSTER_DBT_PARSE_PROJECT_ON_LOAD is set or not. I get that it'd be fairly unlikely to set that outside of a development context, but if we want to make that assertion we should just call this method "_is_local_environment" or something.

That'd make other code paths clearer as well (took me a bit to wrap my head around _should_prepare_at_runtime = this code will only run when in a dev environment)

*,
target_folder: Union[Path, str] = "target",
prepare_command: List[str] = ["parse", "--quiet"],
package_data_dir: Optional[Union[Path, str]] = None,
Contributor

For some reason this path shuffling stuff was extremely hard to wrap my head around, and I feel like the weirdness here is potentially stemming from the naming here.

Is it correct that you'd only want to set a package_data_dir in the case that your project_dir is something that is in a different directory than your dagster code? My feeling is that this should be called deployed_project_dir or something of the sort, and the current-day project_dir parameter can be called local_project_dir (at least internally, the user-facing parameter can be the same)

Then, there could be a single property current_project_dir which returns either self._local_project_dir or self._deployed_project_dir depending on the context (i.e. if DAGSTER_IS_DEV_CLI is not set and self._deployed_project_dir is non-None, return deployed dir, otherwise return current dir)

This is almost, but not quite, what self._base_dir is, but I feel like that property muddies the waters a bit and could be removed entirely, as it's only used in get_cli_resource (which could become trivial with that property) and setting the manifest path (which could also use the current_project_dir, as far as I can tell)
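The proposal in this comment can be sketched as below. This is only an illustration of the suggested naming: `ProjectDirs` is a hypothetical class, and the injected `environ` mapping exists purely to make the behavior easy to exercise.

```python
import os
from pathlib import Path
from typing import Mapping, Optional, Union


class ProjectDirs:
    """Hypothetical illustration of the local/deployed naming suggested above."""

    def __init__(
        self,
        local_project_dir: Union[Path, str],
        deployed_project_dir: Optional[Union[Path, str]] = None,
        environ: Optional[Mapping[str, str]] = None,
    ) -> None:
        self._local_project_dir = Path(local_project_dir)
        self._deployed_project_dir = (
            Path(deployed_project_dir) if deployed_project_dir is not None else None
        )
        self._environ = os.environ if environ is None else environ

    @property
    def current_project_dir(self) -> Path:
        # Use the deployed copy only outside `dagster dev` and only if one is set.
        in_dev = bool(self._environ.get("DAGSTER_IS_DEV_CLI"))
        if not in_dev and self._deployed_project_dir is not None:
            return self._deployed_project_dir
        return self._local_project_dir

    @property
    def manifest_path(self) -> Path:
        return self.current_project_dir.joinpath("target", "manifest.json")
```

Both the manifest path and any CLI resource construction can then derive from the single `current_project_dir` property.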

@alangenfeld alangenfeld force-pushed the al/02-20-_dbt_utility_class_for_managing_artifacts branch 2 times, most recently from 7164590 to 0342706 Compare February 26, 2024 17:31
@alangenfeld
Member Author

thanks for the review @OwenKephart , did a rev incorporating your feedback. Let me know what you think.

Contributor

@OwenKephart OwenKephart left a comment

Awesome 🙌

Only remaining thought is on adding functionality to this class -- calling it DbtArtifacts makes me think that I can do something like:

my_artifacts = DbtArtifacts(...)

@dbt_assets(manifest=my_artifacts.manifest)
def foo():
    ...

rather than manually reading from my_artifacts.manifest_path.

@alangenfeld
Member Author

rather than manually reading from my_artifacts.manifest_path

I don't think there is any reason we couldn't add something like that, especially if we make sure to use the same cached manifest load as the rest of dagster-dbt

@alangenfeld alangenfeld force-pushed the al/02-20-_dbt_utility_class_for_managing_artifacts branch from 0342706 to 50bc26b Compare March 4, 2024 19:28
Contributor

@rexledesma rexledesma left a comment

Approving, but I think we should table get_cli_resource to a different PR until we see usage somewhere downstream.

logger.log(level, "Preparation complete.")

@public
def get_cli_resource(self, **kwargs) -> DbtCliResource:
Contributor

Would like to see how this is used in an example before introducing it.

Are we expecting the following? Feels weird to force the user to instantiate DbtCliResource outside of its class.

defs = Definitions(
    ...,
    resources={
        "dbt": my_artifacts.get_cli_resource(...)
    },
)

Would prefer something else like the following, so DbtCliResource is explicit, and we continue to have typing on DbtCliResource instantiation.

defs = Definitions(
    ...,
    resources={
        "dbt": DbtCliResource(artifacts=my_artifacts, **kwargs)
    },
)

Member Author

Would like to see how this is used in an example before introducing it.

https://github.com/dagster-io/dagster/pull/19925/files is what I've been working with

Agree with your take, will pull for now and put up a PR for artifacts on DbtCliResource

@alangenfeld alangenfeld force-pushed the al/02-20-_dbt_utility_class_for_managing_artifacts branch from 50bc26b to 6b81144 Compare March 4, 2024 20:02
@alangenfeld alangenfeld merged commit 0ac6235 into master Mar 4, 2024
1 check passed
@alangenfeld alangenfeld deleted the al/02-20-_dbt_utility_class_for_managing_artifacts branch March 4, 2024 20:23
@slopp
Contributor

slopp commented Mar 11, 2024

@alangenfeld do we have WIP docs for this?

@alangenfeld
Member Author

just the API docs / doc strings at this moment

@sryza
Contributor

sryza commented Mar 11, 2024

Generally very supportive of this direction of removing this boilerplate for users, but a concern I have about this PR is that it introduces the term "prepare", which I don't think users will have many associations with in this context, and I worry they might thus find it confusing and opaque. Reading the PR, it took me a while to wrap my head around what it meant. Did you consider something more explicit like generate_manifest_file?

Sorry for coming in late on this.

@sryza
Contributor

sryza commented Mar 11, 2024

One more thought on this:

  • In dbt, the term "artifacts" seems to have this specific meaning: https://docs.getdbt.com/reference/artifacts/dbt-artifacts
  • If I saw a class named DbtArtifacts, I would expect that it basically corresponds exactly to that, e.g. directly wraps an artifacts directory.
  • Not sure about this, but might a more accurate name be DbtProject?

@alangenfeld
Member Author

Introduces this term "prepare", which I don't think users will have many associations with in this context, and I worry they might thus find it confusing and opaque. Reading the PR, it took me a while to wrap my head around what it meant. Did you consider something more explicit like generate_manifest_file?

In addition to generating the manifest, this method optionally takes care of

  • copying over the project into a package data directory (useful for dagster cloud serverless PEX users)
  • [future PR] taking care of uploading a manifest.json during prod deployment and downloading it in other (branch) deployments to facilitate --defer

This led me to use a more encompassing term like prepare. I am not attached to that name and can try iterating on it in subsequent changes.

DbtArtifacts

Agree that overlapping with, but not being aligned with, dbt's definition is not great. I could see DbtProject working. Will also factor this into coming changes.

PedramNavid pushed a commit that referenced this pull request Mar 28, 2024
6 participants