
cache schema for selected models #4860

Closed

Conversation

karunpoudel
Contributor

@karunpoudel karunpoudel commented Mar 13, 2022

resolves #4688

Description

Cache schemas for selected models only, instead of caching all schemas in a project, when the --cache_selected_only option is provided on the CLI or in config. This helps improve startup time when a project has many databases and schemas but only a few models are selected.

This behavior is similar to how schemas are created currently: dbt only creates schemas (if they don't exist) for selected models, rather than all schemas defined in a project.
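The core idea can be sketched roughly as follows. This is an illustrative simplification, not dbt-core's actual API: `Node`, `schemas_for_selected`, and the attribute names are hypothetical stand-ins for the manifest types.

```python
from collections import namedtuple

# Hypothetical stand-in for a manifest node; real dbt nodes carry
# database/schema attributes alongside much more metadata.
Node = namedtuple("Node", ["database", "schema"])

def schemas_for_selected(manifest_nodes, selected_uids):
    """Return only the (database, schema) pairs used by selected nodes,
    instead of every schema referenced anywhere in the project."""
    return {
        (node.database, node.schema)
        for uid, node in manifest_nodes.items()
        if uid in selected_uids
    }
```

With selection narrowed this way, the adapter only has to list relations in the schemas the current invocation actually touches.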

Checklist

  • I have signed the CLA
  • I have run this code in development and it appears to resolve the stated issue
  • This PR includes tests, or tests are not required/relevant for this PR
  • I have added information about my change to be included in the CHANGELOG.

@karunpoudel karunpoudel requested review from a team as code owners March 13, 2022 00:20
@cla-bot cla-bot bot added the cla:yes label Mar 13, 2022
Contributor

@ChenyuLInx ChenyuLInx left a comment


Thanks for submitting the PR!! Added some questions and suggestions (some of them are more about making the codebase cleaner while we're here).
One other thing we should add is some tests for the expected behavior. I understand this might be tricky in terms of what use cases we want to test and how to properly test them, and I am not sure if we have an existing test for this kind of thing. But I would love to hear your thoughts and work with you on it.

```diff
-def populate_adapter_cache(self, adapter):
-    adapter.set_relations_cache(self.manifest)
+def populate_adapter_cache(self, adapter, required_schemas: Set[BaseRelation] = None):
+    if flags.SELECTED_SCHEMA_CACHE is True:
```
Contributor

Feels like we can just remove this function and call adapter.set_relations_cache in places that call populate_adapter_cache

Contributor Author


populate_adapter_cache here in GraphRunnableTask is inherited by RunTask, where it is mostly used. Moving this logic into before_run would duplicate it in the before_run methods of both GraphRunnableTask and RunTask.
If there is room to reorganize things, that might make more sense to do by someone more experienced with this code than me. This is the first time I am touching dbt-core :-)

```diff
@@ -337,11 +337,12 @@ def _get_catalog_schemas(self, manifest: Manifest) -> SchemaSearchMap:
         # databases
         return info_schema_name_map

-    def _relations_cache_for_schemas(self, manifest: Manifest) -> None:
+    def _relations_cache_for_schemas(self, manifest: Manifest, cache_schemas: Set[BaseRelation] = None) -> None:
```
Contributor

Since we are here, let's just remove this function and move all of the logic into set_relations_cache

Contributor

Because these methods are defined in the adapters module, these are potentially breaking changes to the adapter interface. E.g. the Postgres plugin reimplements this method (currently with two arguments), and would require updating:

```python
def _relations_cache_for_schemas(self, manifest):
    super()._relations_cache_for_schemas(manifest)
    self._link_cached_relations(manifest)
```
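A plugin override along these lines would then need to accept and forward the new optional argument to stay compatible. A minimal sketch, where `BaseAdapterSketch` and `PostgresAdapterSketch` are hypothetical stand-ins for the real adapter classes and the signature is inferred from the diff above:

```python
# Hypothetical stand-ins; the real classes live in dbt's adapters module
# and the Postgres plugin, with far more behavior than shown here.
class BaseAdapterSketch:
    def _relations_cache_for_schemas(self, manifest, cache_schemas=None):
        # record what the base implementation would cache
        self.cached = cache_schemas if cache_schemas is not None else "all"

class PostgresAdapterSketch(BaseAdapterSketch):
    def _relations_cache_for_schemas(self, manifest, cache_schemas=None):
        # forward the new optional argument so the override keeps working
        super()._relations_cache_for_schemas(manifest, cache_schemas)
        self.linked = True  # stand-in for self._link_cached_relations(manifest)
```

The key point is that any plugin reimplementing the method must add the extra parameter, or calls passing `cache_schemas` would raise a TypeError.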

```diff
@@ -436,8 +436,9 @@ def defer_to_manifest(self, adapter, selected_uids: AbstractSet[str]):

     def before_run(self, adapter, selected_uids: AbstractSet[str]):
         with adapter.connection_named("master"):
             self.create_schemas(adapter, selected_uids)
-            self.populate_adapter_cache(adapter)
+            required_schemas = self.get_model_schemas(adapter, selected_uids)
```
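The wiring under discussion can be sketched as below. This is a simplified illustration only: the real methods take a Manifest, live on GraphRunnableTask, and read the flag from dbt's flags module rather than an instance attribute.

```python
# Hypothetical simplification of the task/cache interaction in this diff.
class TaskSketch:
    def __init__(self, cache_selected_only):
        self.cache_selected_only = cache_selected_only

    def populate_adapter_cache(self, adapter, required_schemas=None):
        # When the flag is on and schemas were derived from the selected
        # nodes, cache only those; otherwise cache everything (None here
        # stands in for "no restriction").
        if self.cache_selected_only and required_schemas is not None:
            adapter.set_relations_cache(required_schemas)
        else:
            adapter.set_relations_cache(None)
```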
Contributor

@jtcohen6 get_model_schemas would remove all of the nodes that are not in selected_uids; is there a case where we need the adapter cache to contain more nodes than the selected ones?

Contributor

@ChenyuLInx I'm not sure! That's the critical question animating this issue/PR. I will say, having this PR makes it a lot easier to test.

  • Will dbt return incorrect results, if it fails to cache an unselected relation? I don't think so, as long as caching happens consistently at the schema level. (That's an open question for us right now on Spark: [CT-302] [Spike] Benchmark perf for show tables, show views dbt-spark#296)
  • How will this work for defer? When --defer is enabled, it's necessary to call adapter.get_relation for all unselected parents of selected models, to figure out whether to use a "dev" version or defer to the "prod" version. In my local testing, this still works! dbt will just record a cache miss, and run the query independently:
```
08:35:23.500589 [debug] [MainThread]: On "master": cache miss for schema "jerco.dbt_jcohen_schema_a", this is inefficient
```

It's likely that this will be less efficient. So, I think it's important that the behavior remains configurable. I'm not opposed to moving forward with this, in order to test the waters. We may someday wish to turn it on by default.

```diff
@@ -1084,6 +1084,27 @@ def parse_args(args, cls=DBTArgumentParser):
     """,
 )

+    schema_cache_flag = p.add_mutually_exclusive_group()
+    schema_cache_flag.add_argument(
```
Contributor

I am fairly new to all of the command arguments; I would like to hear why you chose to add two arguments that change the same dest, versus making the default behavior not use this feature and adding just one flag to enable it.

Contributor Author

One could set selected_schema_cache: true in profiles.yml as a general config for their project, but may choose to override that with --no-selected-schema-cache on the CLI for a specific scenario.
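The two-flag pattern being described can be sketched with argparse. The flag names below use the hyphenated form adopted later in the thread, and defaulting the shared dest to None is what lets a profile-level value fill in when neither flag is passed:

```python
import argparse

p = argparse.ArgumentParser()
group = p.add_mutually_exclusive_group()
# Enabling and disabling flags write to the same dest; default=None means
# "not specified on the CLI", so lower-precedence config can take over.
group.add_argument("--cache-selected-only", dest="cache_selected_only",
                   action="store_true", default=None)
group.add_argument("--no-cache-selected-only", dest="cache_selected_only",
                   action="store_false", default=None)
```

Because the two arguments are in a mutually exclusive group, passing both on one command line is rejected by argparse.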

Contributor

Right on! The inheritance order for "global" configs is CLI flag > env var (DBT_{config}) > profile config

I would vote for naming this config in a way that's slightly more generic (not every database calls it "schema," not every database caches at the schema level). Perhaps CACHE_SELECTED_ONLY?
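That precedence order can be sketched as a small resolver. This is an illustrative helper, not dbt-core's actual implementation; the function and env-var names are hypothetical:

```python
import os

def resolve_global_config(cli_value, env_name, profile_value, default=False):
    """Resolve a boolean 'global' config:
    CLI flag > env var (DBT_{config}) > profile config > default."""
    if cli_value is not None:          # explicitly set on the command line
        return cli_value
    env_value = os.environ.get(env_name)
    if env_value is not None:          # fall back to the environment
        return env_value.strip().lower() in ("1", "true", "yes")
    if profile_value is not None:      # then the profile-level setting
        return profile_value
    return default
```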

Contributor

@jtcohen6 jtcohen6 left a comment


@karunpoudel Very cool! Thanks so much for taking the idea and running with it. I left a few comments


Contributor

@ChenyuLInx ChenyuLInx left a comment


Looks good to me! Thanks @karunpoudel for making it happen! We added some pre-commit hooks to format the code. I think if you follow the instructions here and push the changes, then the failed GHA will pass.

@karunpoudel
Contributor Author

@jtcohen6, @ChenyuLInx, I am thinking of adding a test that checks the cached relations after supplying this new flag. Do you have an existing integration test that verifies caching works properly for a project with multiple databases/schemas? I could add a new test there.

@ChenyuLInx
Contributor

> verify that caching is working properly for project with multiple database/schema
@karunpoudel I don't think we do. Thanks for adding those! Let me know if there's anything I can help with! We haven't made the announcement yet, but we are switching to a new testing framework; you can find examples here, and here are some initial docs. This is not fully done, so if you feel more comfortable using the original test framework, that also works for me. But I highly recommend checking out the new ones.

@jtcohen6
Contributor

I'd be happy to have this change as is, as an optional/experimental flag, included in v1.1.0-rc1. We already have a follow-on issue open to add more rigorous testing and tracking (#4961), ahead of this being something we'd want to turn on by default.

```diff
@@ -1088,6 +1088,27 @@ def parse_args(args, cls=DBTArgumentParser):
     """,
 )

+    schema_cache_flag = p.add_mutually_exclusive_group()
+    schema_cache_flag.add_argument(
+        "--cache_selected_only",
```
Contributor

One thing I just caught: for consistency with other flags, this should probably be --cache-selected-only (hyphens not underscores)

@jtcohen6
Contributor

Rebased + merged in #5036

@jtcohen6 jtcohen6 closed this Apr 12, 2022

Successfully merging this pull request may close these issues.

[CT-168] Cache objects for selected resources only?
3 participants