-
Notifications
You must be signed in to change notification settings - Fork 228
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Dramatically faster caching #433
Conversation
from hologram import JsonDict | ||
|
||
Self = TypeVar("Self", bound="SparkColumn") | ||
|
||
|
||
@dataclass | ||
class SparkColumn(dbtClassMixin, Column): | ||
class SparkColumn(FakeAPIObject, Column): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This enables SparkColumn
to be validated as a field on SparkRelation
return f"updated a nonexistent relationship: {str(self.relation)}" | ||
|
||
|
||
class SparkRelationsCache(RelationsCache): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The events above, and cache methods below, could absolutely move into dbt-core
. The big idea here is:
- At the start of the run, we populate a sparser cache: which relations exist, what their names are, what types they are
- For certain operations, we need to look up more detailed information (e.g. Is this a Delta table?). In that case, we look up the info and update the cached relation, so that the next lookup will be free.
rows: List[agate.Row] = self.execute_macro( | ||
GET_COLUMNS_IN_RELATION_RAW_MACRO_NAME, kwargs={"relation": relation} | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the describe table extended
query that Spark uses to return detailed information, including the columns in the table
# We shouldn't access columns from the cache, until we've implemented | ||
# proper cache update or invalidation at the column level | ||
# https://github.com/dbt-labs/dbt-spark/issues/431 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Most relation attributes are unchanging, unless the relation is dropped and recreated. (dbt-core already has a mechanism to record that drop in its cache.)
However, on a few occasions, dbt does alter the columns within a relation: namely, to handle schema evolution in snapshots and incremental models (on_schema_change
). We don't have a core mechanism to invalidate or update the cache when this happens. So, even though we have been and are still recording columns in the cache, we need to make get_columns_in_relation
skip the cache and run a query every time.
def get_relation( | ||
self, database: Optional[str], schema: str, identifier: str, needs_information=False | ||
) -> Optional[BaseRelation]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This adds a fourth argument to the get_relation
signature, which only has 3 in dbt-core
's "base" adapter implementation.
The idea: If we need "sparse" information, a sparse cache lookup will suffice. If we need "bonus" information (e.g. file format), then we need to first check to see if that additional information is available in the cache from a previous lookup. If not, we'll run a query to look it up, and update the cache accordingly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to use needs_information=True
here too.
{%- set existing_relation = load_relation(this) -%} |
### Description Avoids show table extended command. This is based on dbt-labs/dbt-spark#433. 1. Create a table/view list with `show tables in {{ relation }}` and `show views in {{ relation }}` commands, or `get_tables` API when `catalog` is provided. 2. Retrieve additional information by `describe extended {{ relation }}` command. 1. `get_relation` with `needs_information=True` 2. `get_columns_in_relation`
### Description Avoids show table extended command. This is based on dbt-labs/dbt-spark#433. 1. Create a table/view list with `show tables in {{ relation }}` and `show views in {{ relation }}` commands, or `get_tables` API when `catalog` is provided. 2. Retrieve additional information by `describe extended {{ relation }}` command. 1. `get_relation` with `needs_information=True` 2. `get_columns_in_relation`
Rebase of #342
Description
list_relations_without_caching
method #228 by usingshow tables
+show views
instead ofshow table extended ... like '*'
(very slow)get_columns_in_relation
to avoid inconsistencies.Eventually, we may want to investigate the feasibility of column-level cache (in)validation. For now, let's just stick to the core behavior. Columns will still be cached, but
get_columns_in_relation
will skip over the cached values and run adescribe
query instead.Checklist
changie new
to create a changelog entry