Dramatically faster caching #433

jtcohen6 · 2022-08-19T20:00:51Z

Rebase of #342

Description

Resolves [CT-202] Workaround for some limitations due to list_relations_without_caching method #228 by using show tables + show views instead of show table extended ... like '*' (very slow).
Resolves [CT-1051] Not correctly running with schema change #431 by always invalidating the cache during get_columns_in_relation to avoid inconsistencies.
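
A minimal sketch of the first point, assuming that `show tables` lists views alongside tables while `show views` lists only views. The function name and the sample relation names are hypothetical; this is not the adapter's actual code.

```python
from typing import Dict, Iterable


def classify_relations(
    table_names: Iterable[str], view_names: Iterable[str]
) -> Dict[str, str]:
    """Map each relation name to "table" or "view".

    Assumption: `table_names` comes from `show tables` (which may also list
    views) and `view_names` comes from `show views`.
    """
    tables = list(table_names)
    views = set(view_names)
    # Preserve the listing order, and keep any view that only `show views` returned.
    all_names = dict.fromkeys([*tables, *sorted(views)])
    return {name: ("view" if name in views else "table") for name in all_names}


relation_types = classify_relations(
    table_names=["orders", "customers", "orders_summary"],
    view_names=["orders_summary"],
)
print(relation_types)
```

Two cheap listing queries give the name and type of every relation; anything more detailed (file format, columns) is left for a later, per-relation lookup.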

Eventually, we may want to investigate the feasibility of column-level cache (in)validation. For now, let's just stick to the core behavior. Columns will still be cached, but get_columns_in_relation will skip over the cached values and run a describe query instead.

Checklist

I have signed the CLA
I have run this code in development and it appears to resolve the stated issue
This PR includes tests, or tests are not required/relevant for this PR
I have run changie new to create a changelog entry

jtcohen6 · 2022-08-19T20:01:52Z

dbt/adapters/spark/column.py

 from hologram import JsonDict

 Self = TypeVar("Self", bound="SparkColumn")


 @dataclass
-class SparkColumn(dbtClassMixin, Column):
+class SparkColumn(FakeAPIObject, Column):


This enables SparkColumn to be validated as a field on SparkRelation

jtcohen6 · 2022-08-19T20:03:38Z

dbt/adapters/spark/impl.py

+        return f"updated a nonexistent relationship: {str(self.relation)}"
+
+
+class SparkRelationsCache(RelationsCache):


The events above, and cache methods below, could absolutely move into dbt-core. The big idea here is:

At the start of the run, we populate a sparser cache: which relations exist, what their names are, what types they are

For certain operations, we need to look up more detailed information (e.g. Is this a Delta table?). In that case, we look up the info and update the cached relation, so that the next lookup will be free.

jtcohen6 · 2022-08-19T20:04:20Z

dbt/adapters/spark/impl.py

+            rows: List[agate.Row] = self.execute_macro(
+                GET_COLUMNS_IN_RELATION_RAW_MACRO_NAME, kwargs={"relation": relation}
+            )


This is the describe table extended query that Spark uses to return detailed information, including the columns in the table
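
A rough sketch of how the rows returned by such a query could be split into column rows and table-level details. It assumes each row exposes `col_name` and `data_type`, and that the column rows end at the first row whose `col_name` is empty or starts with `#`. The helper name and the sample rows are hypothetical.

```python
from typing import Dict, List, Tuple

Row = Dict[str, str]


def split_describe_extended(rows: List[Row]) -> Tuple[List[Row], Dict[str, str]]:
    """Split describe-extended rows into (column rows, table details)."""
    pos = len(rows)
    for i, row in enumerate(rows):
        name = row["col_name"]
        # Assumed separator: a blank row or a "# ..." section header.
        if not name or name.startswith("#"):
            pos = i
            break
    columns = rows[:pos]
    # Treat every remaining non-blank, non-header row as key/value metadata.
    details = {
        row["col_name"]: row["data_type"]
        for row in rows[pos:]
        if row["col_name"] and not row["col_name"].startswith("#")
    }
    return columns, details


sample_rows = [
    {"col_name": "id", "data_type": "bigint"},
    {"col_name": "name", "data_type": "string"},
    {"col_name": "", "data_type": ""},
    {"col_name": "# Detailed Table Information", "data_type": ""},
    {"col_name": "Provider", "data_type": "delta"},
    {"col_name": "Type", "data_type": "MANAGED"},
]
columns, details = split_describe_extended(sample_rows)
```

In this simplified version, partition sections (if present) would end up in `details` as well; a real parser would need to treat them separately.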

jtcohen6 · 2022-08-19T20:07:19Z

dbt/adapters/spark/impl.py

+        # We shouldn't access columns from the cache, until we've implemented
+        # proper cache update or invalidation at the column level
+        # https://github.com/dbt-labs/dbt-spark/issues/431


Most relation attributes are unchanging, unless the relation is dropped and recreated. (dbt-core already has a mechanism to record that drop in its cache.)

However, on a few occasions, dbt does alter the columns within a relation: namely, to handle schema evolution in snapshots and incremental models (on_schema_change). We don't have a core mechanism to invalidate or update the cache when this happens. So, even though we have been and are still recording columns in the cache, we need to make get_columns_in_relation skip the cache and run a query every time.
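
To illustrate the behavior described above, here is a toy sketch (not the adapter's code) in which columns are still recorded in a cache but every lookup runs a fresh query. The `ColumnLookup` class, its `describe` callback, and the `warehouse` dict are hypothetical stand-ins.

```python
from typing import Callable, Dict, List


class ColumnLookup:
    """Records columns in a cache but never serves reads from it."""

    def __init__(self, describe: Callable[[str], List[str]]) -> None:
        self._describe = describe  # hypothetical: runs the describe query
        self.cached_columns: Dict[str, List[str]] = {}
        self.queries_run = 0

    def get_columns_in_relation(self, relation: str) -> List[str]:
        # Deliberately ignore self.cached_columns: it may be stale if the
        # relation's columns were altered since the last lookup.
        columns = self._describe(relation)
        self.queries_run += 1
        self.cached_columns[relation] = columns  # still recorded, as before
        return columns


warehouse = {"events": ["id", "ts"]}
lookup = ColumnLookup(lambda rel: list(warehouse[rel]))

before = lookup.get_columns_in_relation("events")
warehouse["events"].append("user_id")  # simulates a schema change adding a column
after = lookup.get_columns_in_relation("events")
```

Had the second call read from `cached_columns`, it would have missed `user_id`; the cost of correctness here is one query per call.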

jtcohen6 · 2022-08-19T20:23:39Z

dbt/adapters/spark/impl.py

+    def get_relation(
+        self, database: Optional[str], schema: str, identifier: str, needs_information=False
+    ) -> Optional[BaseRelation]:


This adds a fourth argument to the get_relation signature, which only has 3 in dbt-core's "base" adapter implementation.

The idea: If we need "sparse" information, a sparse cache lookup will suffice. If we need "bonus" information (e.g. file format), then we need to first check to see if that additional information is available in the cache from a previous lookup. If not, we'll run a query to look it up, and update the cache accordingly.
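
The lookup flow can be sketched roughly as follows. This is a simplified illustration, not the PR's implementation: `CachedRelation`, `SparseCache`, and the `fetch_information` callback are hypothetical names, and only the `needs_information` flag comes from the signature above.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class CachedRelation:
    name: str
    type: str
    information: Optional[Dict[str, str]] = None  # "bonus" details, filled lazily


@dataclass
class SparseCache:
    fetch_information: Callable[[str], Dict[str, str]]  # hypothetical describe query
    relations: Dict[str, CachedRelation] = field(default_factory=dict)
    queries_run: int = 0

    def get_relation(
        self, name: str, needs_information: bool = False
    ) -> Optional[CachedRelation]:
        relation = self.relations.get(name)
        if relation is None:
            return None
        # Only query when details are requested and not already cached.
        if needs_information and relation.information is None:
            relation.information = self.fetch_information(name)
            self.queries_run += 1
        return relation


cache = SparseCache(fetch_information=lambda name: {"Provider": "delta"})
cache.relations["events"] = CachedRelation(name="events", type="table")

cache.get_relation("events")  # sparse lookup, no query
queries_after_sparse = cache.queries_run
detailed = cache.get_relation("events", needs_information=True)  # runs one query
cache.get_relation("events", needs_information=True)  # served from the cache
queries_after_detailed = cache.queries_run
```

The first detailed lookup pays for one query and updates the cached relation in place, so later lookups for the same relation are free.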

ueshin

We need to use needs_information=True here too.

dbt-spark/dbt/include/spark/macros/materializations/incremental/incremental.sql

Line 17 in 88d917d

{%- set existing_relation = load_relation(this) -%}

### Description

Avoids show table extended command. This is based on dbt-labs/dbt-spark#433.

1. Create a table/view list with `show tables in {{ relation }}` and `show views in {{ relation }}` commands, or `get_tables` API when `catalog` is provided.
2. Retrieve additional information by `describe extended {{ relation }}` command.
   1. `get_relation` with `needs_information=True`
   2. `get_columns_in_relation`

jtcohen6 and others added 6 commits August 19, 2022 21:01

option 2: show tables + show views

bf93ee5

Squashed commits from #342

2eafd5b

Fixups to get working

f9a86c3

Enable test that should work with Spark3

e32bc8d

Fixup test, code checks

ce35692

Add changelog entry

3218960

cla-bot bot added the cla:yes label Aug 19, 2022

jtcohen6 changed the title ~~Jerco/pr 342 run tests~~ Dramatically faster caching Aug 19, 2022

More code checks

1d1d715

jtcohen6 commented Aug 19, 2022

jtcohen6 added 3 commits August 19, 2022 22:30

Look up additional info for snapshots

8ca8cc7

Fix behavior with temp views

0a1e864

Fix mypy: no more flags.USE_CACHE

88d917d

jtcohen6 mentioned this pull request Aug 31, 2022

[CT-1114] Prevent cache inconsistencies during on_schema_change #447

Closed

ueshin reviewed Sep 13, 2022

spenaustin mentioned this pull request Nov 1, 2022

[CT-202] Workaround for some limitations due to list_relations_without_caching method #228

Open

ueshin mentioned this pull request Nov 29, 2022

Avoid show table extended command. databricks/dbt-databricks#231

Merged

jtcohen6 closed this Jan 17, 2023

mikealfare deleted the jerco/pr-342-run-tests branch March 1, 2023 00:48
