-
Notifications
You must be signed in to change notification settings - Fork 1.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[dbt] utility class for managing artifacts #19923
[dbt] utility class for managing artifacts #19923
Conversation
This stack of pull requests is managed by Graphite. Learn more about stacking. Join @alangenfeld and the rest of your teammates on Graphite |
6483219
to
aa1d0b0
Compare
4ddbdfb
to
3c00ba0
Compare
aa1d0b0
to
7a59c65
Compare
3c00ba0
to
38aa692
Compare
7a59c65
to
323c756
Compare
38aa692
to
7d6ba17
Compare
Deploy preview for dagster-docs ready! Preview available at https://dagster-docs-qdxd0wrv7-elementl.vercel.app Direct link to changed pages: |
Deploy preview for dagit-core-storybook ready! ✅ Preview Built with commit 7d6ba17. |
still needs polish but i think this is ready for feedback, https://github.com/dagster-io/dagster/pull/19925/files is a good place to check how this is expected to play out |
7d6ba17
to
487b638
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Exciting
python_modules/libraries/dagster-dbt/dagster_dbt/dbt_artifacts.py
Outdated
Show resolved
Hide resolved
python_modules/libraries/dagster-dbt/dagster_dbt/dbt_artifacts.py
Outdated
Show resolved
Hide resolved
python_modules/libraries/dagster-dbt/dagster_dbt/dbt_artifacts.py
Outdated
Show resolved
Hide resolved
python_modules/libraries/dagster-dbt/dagster_dbt/dbt_artifacts.py
Outdated
Show resolved
Hide resolved
python_modules/libraries/dagster-dbt/dagster_dbt/dbt_artifacts.py
Outdated
Show resolved
Hide resolved
python_modules/libraries/dagster-dbt/dagster_dbt/dbt_artifacts.py
Outdated
Show resolved
Hide resolved
def _base_dir(self) -> Path: | ||
return self._package_data_dir if self._package_data_dir else self.project_dir | ||
|
||
def _prepare_manifest(self) -> Path: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can imagine that this is an abstractmethod for overriding this class in the future
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking was that prepare_command
would avoid a good chunk of cases. Do you think _
prefix should get dropped if people might override?
python_modules/libraries/dagster-dbt/dagster_dbt/dbt_artifacts.py
Outdated
Show resolved
Hide resolved
python_modules/libraries/dagster-dbt/dagster_dbt_tests/core/test_artifacts.py
Outdated
Show resolved
Hide resolved
323c756
to
208a7a4
Compare
487b638
to
6b4505e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have some questions and suggestions
python_modules/libraries/dagster-dbt/dagster_dbt/dbt_artifacts.py
Outdated
Show resolved
Hide resolved
# if launched via `dagster dev` cli | ||
bool(os.getenv("DAGSTER_IS_DEV_CLI")) | ||
or |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what if someone uses dagster dev
against pre-compiled manifest.json?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The behavior from the existing recommended setup is that if the env var is set we will still generate a new one, so thats what i went with.
I think if you are actively iterating locally, the previous manifest.json
will be there, so its hard to distinguish pre-compiled from development iteration.
I think if you want to develop against a precompiled the output would be to not use this class at all.
I could be convinced that an opt-out env var makes sense though.
If this is meant to be used directly by users, an example usage in the PR description with before/after would be useful. |
f37eebf
to
046650b
Compare
handled inlines, summary update with link to PR for before & after
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might just be my brain breaking, but I found the internal naming conventions here pretty difficult to follow, and so suggested some updates / small refactors
Let me know if I'm off base on my understanding of things here
High level though, I'm excited about this behavior and think this is a good interface for people to interact with.
Disable logging | ||
Default: False | ||
""" | ||
logger = _NULL_LOGGER if quiet else logging.getLogger("dagster-dbt-artifacts") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there was more discussion elsewhere, but any particular reason for not logging at all instead of just logging at DEBUG level vs. INFO level? I feel like that's a more common pattern in our codebase.
No strong feelings there just trying to understand.
uses project_dir when in a development context. | ||
""" | ||
if self._should_prepare_at_runtime(): | ||
return DbtCliResource(project_dir=os.fspath(self.project_dir)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this get **kwargs passed in as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Related to my larger comment at the top but I'm also finding this confusing as to why _should_prepare_at_runtime is related to what path this should be pointing at. Functionally, I get that the method checks if you're currently running from a dagster dev
invocation, but that seems like a separate concern from if DAGSTER_DBT_PARSE_PROJECT_ON_LOAD
is set or not. I get that it'd be fairly unlikely to set that outside of a development context, but if we want to make that assertion we should just call this method "_is_local_environment" or something.
That'd make other code paths clearer as well (took me a bit to wrap my head around _should_prepare_at_runtime = this code will only run when in a dev environment)
*, | ||
target_folder: Union[Path, str] = "target", | ||
prepare_command: List[str] = ["parse", "--quiet"], | ||
package_data_dir: Optional[Union[Path, str]] = None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For some reason this path shuffling stuff was extremely hard to wrap my head around, and I feel like the weirdness here is potentially stemming from the naming here.
Is it correct that you'd only want to set a package_data_dir in the case that your project_dir is something that is in a different directory than your dagster code? My feeling is that this should be called deployed_project_dir
or something of the sort, and the current-day project_dir
parameter can be called local_project_dir
(at least internally, the user-facing parameter can be the same)
Then, there could be a single property current_project_dir
which returns either self._local_project_dir
or self._deployed_project_dir
depending on the context (i.e. if DAGSTER_IS_DEV_CLI
is not set and self._deployed_project_dir is non-None, return deployed dir, otherwise return current dir)
This is almost, but not quite, what self._base_dir is, but I feel like that property muddies the waters a bit and could be removed entirely, as it's only used in get_cli_resource
(which could become trivial with that property) and setting the manifest path (which could also use the current_project_dir, as far as I can tell)
7164590
to
0342706
Compare
thanks for the review @OwenKephart , did a rev incorporating your feedback. Let me know what you think. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome 🙌
Only remaining thought is on adding functionality to this class -- calling it DbtArtifacts
makes me think that I can do something like:
my_artifacts = DbtArtifacts(...)
@dbt_assets(manifest=my_artifacts.manifest)
def foo():
...
rather than manually reading from my_artifacts.manifest_path
.
I don't think there is any reason we couldn't add something like that, especially if we make sure to use the same cached manifest load as the rest of dagster-dbt |
0342706
to
50bc26b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approving, but I think we should table get_cli_resource
to a different PR until we see usage somewhere downstream.
python_modules/libraries/dagster-dbt/dagster_dbt/dbt_artifacts.py
Outdated
Show resolved
Hide resolved
logger.log(level, "Preparation complete.") | ||
|
||
@public | ||
def get_cli_resource(self, **kwargs) -> DbtCliResource: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would like to see how this is used in an example before introducing it.
Are we expecting the following? Feels weird to force the user to instantiate DbtCliResource
outside of its class.
defs = Definitions(
...,
resources={
"dbt": my_artifacts.get_cli_resource(...)
},
)
Would prefer something else like the following, so DbtCliResource
is explicit, and we continue to have typing on DbtCliResource
instantiation.
defs = Definitions(
...,
resources={
"dbt": DbtCliResource(artifacts=my_artifacts, **kwargs)
},
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would like to see how this is used in an example before introducing it.
https://github.com/dagster-io/dagster/pull/19925/files is what ive been working with
Agree with your take, will pull for now and put up a PR for artifacts
on DbtCliResource
python_modules/libraries/dagster-dbt/dagster_dbt/dbt_artifacts.py
Outdated
Show resolved
Hide resolved
50bc26b
to
6b81144
Compare
@alangenfeld do we have WIP docs for this? |
just the API docs / doc strings at this moment |
Generally very supportive of this direction of removing this boilerplate for users, but a concern I have about this PR is that introduces this term "prepare" which I don't think users will have many associations with in this context, and I worry they thus might find kind of confusing and opaque? Reading the PR, it took me a while to wrap my head around what it meant. Did you consider something more explicit like Sorry for coming in late on this. |
One more thought on this:
|
In addition to generating the manifest, this method optionally takes care of
This led me to use a more encompassing term like
Agree overlapping but not being aligned with dbt's definition is not great. I could see |
The goal of this class is to centralize the necessary complexity that comes from seeking to provide both a nice developer experience and a performant deployed experience. Currently, that complexity surfaces in different places resulting in a rough experience 1. Surfacing this code as part of the scaffold and docs ``` import os from pathlib import Path from dagster_dbt import DbtCliResource dbt_project_dir = Path(__file__).joinpath("..", "..", "..").resolve() dbt = DbtCliResource(project_dir=os.fspath(dbt_project_dir)) # If DAGSTER_DBT_PARSE_PROJECT_ON_LOAD is set, a manifest will be created at run time. # Otherwise, we expect a manifest to be present in the project's target directory. if os.getenv("DAGSTER_DBT_PARSE_PROJECT_ON_LOAD"): dbt_manifest_path = ( dbt.cli( ["--quiet", "parse"], target_path=Path("target"), ) .wait() .target_path.joinpath("manifest.json") ) else: dbt_manifest_path = dbt_project_dir.joinpath("target", "manifest.json") ``` and instructing users set an env var any time they run dagster dev via `DAGSTER_DBT_PARSE_PROJECT_ON_LOAD=1 dagster dev` 2. Handling the use of a package data directory, notably for pex packaging used by Dagster Cloud Serverless. Currently, the scaffold we present in that condition fails to load locally due to mishandling the assumptions made in different places. 3. The CI/CD step that has to does the preparation that the code assumes, such as https://github.com/dagster-io/dagster-cloud-action/blob/7c2f65f979a21da1e621f768af6db5265586ee91/github/serverless/dbt/deploy.yml#L47-L56 A before and after of what this would look like for users can be seen in the in the upstack PR https://github.com/dagster-io/dagster/pull/19925/files ## How I Tested These Changes added some unit tests, additional coverage from existing tests in upstack PR that converts scaffold to use this class
The goal of this class is to centralize the necessary complexity that comes from seeking to provide both a nice developer experience and a performant deployed experience.
Currently, that complexity surfaces in different places resulting in a rough experience
and instructing users set an env var any time they run dagster dev via
DAGSTER_DBT_PARSE_PROJECT_ON_LOAD=1 dagster dev
Handling the use of a package data directory, notably for pex packaging used by Dagster Cloud Serverless. Currently, the scaffold we present in that condition fails to load locally due to mishandling the assumptions made in different places.
The CI/CD step that has to does the preparation that the code assumes, such as https://github.com/dagster-io/dagster-cloud-action/blob/7c2f65f979a21da1e621f768af6db5265586ee91/github/serverless/dbt/deploy.yml#L47-L56
A before and after of what this would look like for users can be seen in the in the upstack PR https://github.com/dagster-io/dagster/pull/19925/files
How I Tested These Changes
added some unit tests, additional coverage from existing tests in upstack PR that converts scaffold to use this class