[CT-302] [Spike] Benchmark perf for show tables, show views #296

Closed
leahwicz opened this issue Feb 28, 2022 · 3 comments

Comments

@leahwicz
Contributor

More context here: #228

We want to benchmark these two calls to see whether improvements here would meaningfully help overall performance.

github-actions bot changed the title [Spike] Benchmark perf for show tables, show views → [CT-302] [Spike] Benchmark perf for show tables, show views on Feb 28, 2022
@jtcohen6
Contributor

jtcohen6 commented Mar 2, 2022

In particular, we want to benchmark / bake off these two competing approaches to achieving the same outcome (both query shapes are sketched below):

  1. show table extended in <database_name> like '<object_one>|<object_two>|...'
  2. show tables (should be very fast) + show views (may be substantially slower)
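A minimal sketch of the two query shapes, assuming a PySpark session; the schema and relation names here are placeholders, not from the issue:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder names for illustration only.
relations = ["model_a", "model_b"]

# Approach 1: a single `show table extended` call, filtered to the selected
# relations via a pipe-delimited LIKE pattern. Returns rich per-relation
# metadata, including the file format.
pattern = "|".join(relations)
extended = spark.sql(f"show table extended in my_schema like '{pattern}'").collect()

# Approach 2: two lightweight listing calls; fast, but neither one returns
# file-format metadata.
tables = spark.sql("show tables in my_schema").collect()
views = spark.sql("show views in my_schema").collect()
```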

I can share more context when we grab some time synchronously!

@jtcohen6
Contributor

jtcohen6 commented Mar 9, 2022

Experimental code to play around with during our pairing time tomorrow:
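A minimal sketch of what such a timing harness could look like, assuming a PySpark session (schema and relation names are placeholders; this is illustrative, not the original snippet):

```python
import time

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

SCHEMA = "my_schema"                           # placeholder
RELATIONS = ["model_a", "model_b", "model_c"]  # placeholder

def timed(sql: str) -> float:
    """Run a query, force the full result to materialize, return elapsed seconds."""
    start = time.perf_counter()
    spark.sql(sql).collect()
    return time.perf_counter() - start

pattern = "|".join(RELATIONS)
option_1 = timed(f"show table extended in {SCHEMA} like '{pattern}'")
option_2 = timed(f"show tables in {SCHEMA}") + timed(f"show views in {SCHEMA}")
print(f"option 1: {option_1:.3f}s / option 2: {option_2:.3f}s")
```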

After some limited local testing, I'm finding option 1 to be faster, but it would be worth testing against more project/database shapes.

Update:

@jtcohen6
Contributor

After some initial local testing (not yet scientific benchmarking): Approach 2 (show tables + show views) is a touch slower with very small datasets (since it requires running two queries per schema), but significantly faster with larger datasets. I wasn't expecting that, so it's worth verifying with some additional project shapes.

Principal risk of approach 1: dbt groups its cached relations by schema. If we run adapter.get_relation on a relation in a schema that hasn't been cached at all, it's a cache miss, which is inefficient but ultimately correct (dbt falls back to looking up the specific relation). If we run it on a relation whose schema has been cached, but the relation itself wasn't selected for caching, dbt will confidently, and wrongly, report that the relation doesn't exist. By caching only a subset of a schema, we increase the risk of these uncaught cache misses.
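A toy model of that failure mode (illustrative Python only, not dbt's actual cache code):

```python
# DATABASE stands in for the warehouse; CACHE for dbt's relation cache,
# which here was populated with only a subset of the schema.
DATABASE = {"analytics": {"model_a", "model_b"}}
CACHE = {"analytics": {"model_a"}}

def get_relation(schema: str, identifier: str) -> str | None:
    if schema not in CACHE:
        # Whole-schema cache miss: inefficient but correct, because we
        # fall back to looking the relation up in the database.
        return identifier if identifier in DATABASE.get(schema, set()) else None
    # The schema is cached, so the cache is trusted as complete: model_b
    # exists in the database but is confidently reported as missing.
    return identifier if identifier in CACHE[schema] else None

assert get_relation("analytics", "model_b") is None  # the uncaught cache miss
```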

Principal risk of approach 2: we get less info at cache time than before. Crucially, we no longer get the file format of each table, which we need at runtime for conditional logic in materializations (e.g. is_delta) that tells dbt whether to use create or replace table, and whether merge is allowed during incremental models / snapshots. We'd need to remove these checks (which have been cache lookups) and instead either run an additional query for each of those models, or trust that if a user has configured a table with file_format: delta, and it already exists in the database, it really is a Delta table.
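The "additional query per model" fallback could look something like this sketch, assuming PySpark and that the table's file format is recoverable from the Provider row of describe table extended (names here are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def is_delta(schema: str, identifier: str) -> bool:
    """One extra metadata query per model to recover the file format at runtime."""
    rows = spark.sql(f"describe table extended {schema}.{identifier}").collect()
    # `describe table extended` returns (col_name, data_type, comment) rows;
    # the detailed-table-information section includes a "Provider" row.
    details = {row.col_name: row.data_type for row in rows}
    return details.get("Provider", "").lower() == "delta"
```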

There are still good reasons to want show table extended like 'specific_relation_1|specific_relation_2' for use in catalog generation (docs generate). In that case, there's no real downside to "missing" a relation. Catalog generation also includes source tables (caching doesn't), and I think it's much more likely for source tables to live in schemas with many hundreds/thousands of unrelated tables that aren't relevant to dbt.

I don't feel this is enough, in its own right, to confidently move ahead with approach 2, but it's certainly the direction in which I'm now leaning. The performance seems better at scale, and the risks (really limitations) are better understood.
