From f6b9d5a11fdcbf434fca7168ed242e1db30f0787 Mon Sep 17 00:00:00 2001 From: Matt Shaver <60105315+matthewshaver@users.noreply.github.com> Date: Thu, 22 Aug 2024 11:56:38 -0400 Subject: [PATCH 01/29] Addint athena ref page --- .../resource-configs/athena-configs.md | 604 ++++++++++++++++++ website/sidebars.js | 15 +- 2 files changed, 612 insertions(+), 7 deletions(-) create mode 100644 website/docs/reference/resource-configs/athena-configs.md diff --git a/website/docs/reference/resource-configs/athena-configs.md b/website/docs/reference/resource-configs/athena-configs.md new file mode 100644 index 00000000000..7791350c999 --- /dev/null +++ b/website/docs/reference/resource-configs/athena-configs.md @@ -0,0 +1,604 @@ +--- +title: "Amazon Athena configurations" +id: "athena-configs" +--- + +## Models + +### Table configuration + +| Parameter | Default | Description | +|-----------|---------|-------------| +| `external_location` | None | The full S3 path to where the table will be saved. Only works with incremental models. Doesn't work with Hive table with `ha` set to `true`. | +| `partitioned_by` | None | An array list of columns by which the table will be partitioned. Currently limited to 100 partitions. | +| `bucketed_by` | None | An array list of the columns to bucket data. Ignored if using Iceberg | +| `bucket_count` | None | The number of buckets for bucketing your data. Ignored if using Iceberg | +| `table_type` | Hive | The type of table. Supports `hive` or `iceberg` | +| `ha` | False | Build the table using the high-availability method. Only available for Hive tables. | +| `format` | Parquet | The data format for the table. Supports `ORC`, `PARQUET`, `AVRO`, `JSON`, and `TEXTFILE` | +| `write_compression` | None | The compression type for any storage format that allows compressions. See [CREATE TABLE AS][#create-table-as] for available options | +| `field_delimeter` | None | Custome field delimiter for when the format is set to `TEXTFIRE` | +| `table_properties` | N/A | The tabe properties to add to the table. For Iceberg only. | +| `native_drop` | N/A | Relation drop operations will be performed with SQL, not direct Glue API calls. No S3 calls will be made to manage data in S3. Data in S3 will only be cleared up for Iceberg tables. See the [AWS docs](https://docs.aws.amazon.com/athena/latest/ug/querying-iceberg-managing-tables.html) for more info. Iceberg DROP TABLE operations may timeout if they take longer than 60 seconds.| +| `seed_by_insert` | False | Creates seeds using an SQL insert statement. Large seed files can't exceed the Athena 262144 bytes limit. | +| `force_batch` | False | Run the table creation directly in batch insert mode. Useful when the standard table creation fails due to partition limitation. | +| `unique_tmp_table_suffix` | False | Replace the "__dbt_tmp table" suffix with a unique UUID for incremental models using insert overwrite on Hive tables. | +| `temp_schema` | None | Defines a schema to hold temporary create statements used in incremental model runs. Scheme will be created in the models target database if it does not exist. | +| `lf_tags_config` | None | [AWS Lake Formation](#aws-lake-formation-integration) tags to associate with the table and columns. Existing tags will be removed.
* `enabled` (`default=False`) whether LF tags management is enabled for a model
* `tags` dictionary with tags and their values to assign to the model<br>
* `tags_columns` dictionary with a tag key, value and list of columns they must be assigned to | +| `lf_inherited_tags` | None | List of the Lake Formation tag keys that are to be inherited from the database level and shouldn't be removed during the assignment of those defined in `ls_tags_config`. | +| `lf_grants` | None | Lake Formation grants config for `data_cell` filters. | + +#### Configuration examples + + + +```sql +{{ + config( + materialized='incremental', + incremental_strategy='append', + on_schema_change='append_new_columns', + table_type='iceberg', + schema='test_schema', + lf_tags_config={ + 'enabled': true, + 'tags': { + 'tag1': 'value1', + 'tag2': 'value2' + }, + 'tags_columns': { + 'tag1': { + 'value1': ['column1', 'column2'], + 'value2': ['column3', 'column4'] + } + }, + 'inherited_tags': ['tag1', 'tag2'] + } + ) +}} +``` + + + + +```yaml + +lf_tags_config: + enabled: true + tags: + tag1: value1 + tag2: value2 + tags_columns: + tag1: + value1: [ column1, column2 ] + inherited_tags: [ tag1, tag2 ] +``` + + + +Lake Formation grants: + +```python +lf_grants={ + 'data_cell_filters': { + 'enabled': True | False, + 'filters': { + 'filter_name': { + 'row_filter': '', + 'principals': ['principal_arn1', 'principal_arn2'] + } + } + } + } +``` + + +- `lf_tags` and `lf_tags_columns` configs support only attaching lf tags to corresponding resources. +- We recommend managing LF Tags permissions somewhere outside dbt. For example, [terraform](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/lakeformation_permissions) or [aws cdk](https://docs.aws.amazon.com/cdk/api/v1/docs/aws-lakeformation-readme.html). +- `data_cell_filters` management can't be automated outside dbt because the filter can't be attached to the table which doesn't exist. Once you `enable` this config, dbt will set all filters and their permissions during every dbt run. Such approach keeps the actual state of row level security configuration actual after every dbt run and apply changes if they occur: drop, create, update filters and their permissions. +- Any tags listed in `lf_inherited_tags` should be strictly inherited from the database level and never overridden at the table and column level +- Currently `dbt-athena` does not differentiate between an inherited tag association and an override of same it made previously +> - For example, If an inherited tag is overridden by an `lf_tags_config` value in one DBT run, and that override is removed prior to a subsequent run, the prior override will linger and no longer be encoded anywhere (in e.g. Terraform where the inherited value is configured nor in the DBT project where the override previously existed but now is gone) + +### Table location + +The saved location a table is determined by the following conditions: + +1. If `external_location` is defined, that value is used. +2. If `s3_data_dir` is defined, the path is determined by that and `s3_data_naming`. +3. If `s3_data_dir` is not defined, data is stored under `s3_staging_dir/tables/`. + +The following options are available for `s3_data_naming`: + +- `unique`: `{s3_data_dir}/{uuid4()}/` +- `table`: `{s3_data_dir}/{table}/` +- `table_unique`: `{s3_data_dir}/{table}/{uuid4()}/` +- `schema_table`: `{s3_data_dir}/{schema}/{table}/` +- `s3_data_naming=schema_table_unique`: `{s3_data_dir}/{schema}/{table}/{uuid4()}/` + +It's possible to set the `s3_data_naming` globally in the target profile, or overwrite the value in the table config or setting up the value for groups of model in dbt_project.yml. 
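+
+For example, here is a minimal sketch of setting `s3_data_naming` for a group of models in `dbt_project.yml` (the `my_project` and `staging` names below are hypothetical placeholders):
+
+```yaml
+# dbt_project.yml -- project and folder names are placeholders
+models:
+  my_project:
+    staging:
+      +s3_data_naming: schema_table_unique
+```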
+ +Note: when using a workgroup with a default output location configured, `s3_data_naming` and any configured buckets are ignored and the location configured in the workgroup is used. + +### Incremental models + +The following [incremental models](https://docs.getdbt.com/docs/build/incremental-models) strategies are supported: + +- `insert_overwrite` (default): The insert overwrite strategy deletes the overlapping partitions from the destination table, and then inserts the new records from the source. This strategy depends on the `partitioned_by` keyword! If no partitions are defined, dbt will fall back to the `append` strategy. +- `append`: Insert new records without updating, deleting or overwriting any existing data. There might be duplicate data (e.g. great for log or historical data). +- `merge`: Conditionally updates, deletes, or inserts rows into an Iceberg table. Used in combination with `unique_key`.Only available when using Iceberg. + +### On schema change + +`on_schema_change` is an option to reflect changes of schema in incremental models. +The following options are supported: + +- `ignore` (default) +- `fail` +- `append_new_columns` +- `sync_all_columns` + +For details, please refer to the [incremental models](https://docs.getdbt.com/docs/build/incremental-models#what-if-the-columns-of-my-incremental-model-change) article. + +### Iceberg + +The adapter supports table materialization for Iceberg. + +Take the following model as an example: + +```sql +{{ config( + materialized='table', + table_type='iceberg', + format='parquet', + partitioned_by=['bucket(user_id, 5)'], + table_properties={ + 'optimize_rewrite_delete_file_threshold': '2' + } +) }} + +select 'A' as user_id, + 'pi' as name, + 'active' as status, + 17.89 as cost, + 1 as quantity, + 100000000 as quantity_big, + current_date as my_date +``` + +Iceberg supports bucketing as hidden partitions. Use the `partitioned_by` config to add specific bucketing +conditions. + +Iceberg supports several table formats for data : `PARQUET`, `AVRO` and `ORC`. + +It is possible to use Iceberg in an incremental fashion, specifically two strategies are supported: + +- `append`: New records are appended to the table (this can lead to duplicates). +- `merge`: Perform an update and insert (and optional delete), where new records are added and existing records are updated. Only available with Athena engine version 3. + - `unique_key`(required): Columns that define a unique record in the source and target tables. + - `incremental_predicates` (optional): SQL conditions that enable custom join clauses in the merge statement. This can + be useful for improving performance via predicate pushdown on the target table. + - `delete_condition` (optional): SQL condition used to identify records that should be deleted. + - `update_condition` (optional): SQL condition used to identify records that should be updated. + - `insert_condition` (optional): SQL condition used to identify records that should be inserted. + - `incremental_predicates`, `delete_condition`, `update_condition` and `insert_condition` can include any column of the incremental table (`src`) or the final table (`target`). Column names must be prefixed by either `src` or `target` to prevent a `Column is ambiguous` error. 
+ +Example of `delete_condition`: + +```sql +{{ config( + materialized='incremental', + table_type='iceberg', + incremental_strategy='merge', + unique_key='user_id', + incremental_predicates=["src.quantity > 1", "target.my_date >= now() - interval '4' year"], + delete_condition="src.status != 'active' and target.my_date < now() - interval '2' year", + format='parquet' +) }} + +select 'A' as user_id, + 'pi' as name, + 'active' as status, + 17.89 as cost, + 1 as quantity, + 100000000 as quantity_big, + current_date as my_date +``` + +`update_condition` example: + +```sql +{{ config( + materialized='incremental', + incremental_strategy='merge', + unique_key=['id'], + update_condition='target.id > 1', + schema='sandbox' + ) +}} + +{% if is_incremental() %} + +select * from ( + values + (1, 'v1-updated') + , (2, 'v2-updated') +) as t (id, value) + +{% else %} + +select * from ( + values + (-1, 'v-1') + , (0, 'v0') + , (1, 'v1') + , (2, 'v2') +) as t (id, value) + +{% endif %} +``` + +Example of `insert_condition`: + +```sql +{{ config( + materialized='incremental', + incremental_strategy='merge', + unique_key=['id'], + insert_condition='target.status != 0', + schema='sandbox' + ) +}} + +select * from ( + values + (1, 0) + , (2, 1) +) as t (id, status) + +``` + +### High availablity table (HA) + +The current implementation of table materialization can lead to downtime, as the target table is dropped and re-created. To have less destructive behavior, it's possible to use the `ha` config on your `table` materialized models. It leverages the table versions feature of the glue catalog, creating a temp table and swapping the target table to the location of the temp table. This materialization is only available for `table_type=hive` and requires using unique locations. For Iceberg, high availability is the default. + + +```sql +{{ config( + materialized='table', + ha=true, + format='parquet', + table_type='hive', + partitioned_by=['status'], + s3_data_naming='table_unique' +) }} + +select 'a' as user_id, + 'pi' as user_name, + 'active' as status +union all +select 'b' as user_id, + 'sh' as user_name, + 'disabled' as status +``` + +By default, the materialization keeps the last 4 table versions,but you can change it by setting `versions_to_keep`. + +#### HA known issues + +- When swapping from a table with partitions to a table without (and the other way around), there could be a little + downtime. If high performances is needed consider bucketing instead of partitions. +- By default, Glue "duplicates" the versions internally, so the last two versions of a table point to the same location. +- It's recommended to set `versions_to_keep` >= 4, as this will avoid having the older location removed. + +### Update glue data catalog + +Persist resource descriptions as column and relation comments to the glue data catalog, and meta as [glue table properties](https://docs.aws.amazon.com/glue/latest/dg/tables-described.html#table-properties) and [column parameters](https://docs.aws.amazon.com/glue/latest/webapi/API_Column.html). By default, documentation persistence is disabled, but it can be enabled for specific resources or groups of resources as needed. + +For example: + +```yaml +models: + - name: test_deduplicate + description: another value + config: + persist_docs: + relation: true + columns: true + meta: + test: value + columns: + - name: id + meta: + primary_key: true +``` + +See [persist docs](https://docs.getdbt.com/reference/resource-configs/persist_docs) for more details. 
+ +## Snapshots + +The adapter supports snapshot materialization. It supports both timestamp and check strategy. To create a snapshot +create a snapshot file in the snapshots directory. If the directory does not exist create one. + +### Timestamp strategy + +To use the timestamp strategy refer to +the [dbt docs](https://docs.getdbt.com/docs/build/snapshots#timestamp-strategy-recommended) + +### Check strategy + +To use the check strategy refer to the [dbt docs](https://docs.getdbt.com/docs/build/snapshots#check-strategy) + +### Hard-deletes + +The materialization also supports invalidating hard deletes. Check +the [docs](https://docs.getdbt.com/docs/build/snapshots#hard-deletes-opt-in) to understand usage. + +### Working example + +seed file - employent_indicators_november_2022_csv_tables.csv + +```csv +Series_reference,Period,Data_value,Suppressed +MEIM.S1WA,1999.04,80267, +MEIM.S1WA,1999.05,70803, +MEIM.S1WA,1999.06,65792, +MEIM.S1WA,1999.07,66194, +MEIM.S1WA,1999.08,67259, +MEIM.S1WA,1999.09,69691, +MEIM.S1WA,1999.1,72475, +MEIM.S1WA,1999.11,79263, +MEIM.S1WA,1999.12,86540, +MEIM.S1WA,2000.01,82552, +MEIM.S1WA,2000.02,81709, +MEIM.S1WA,2000.03,84126, +MEIM.S1WA,2000.04,77089, +MEIM.S1WA,2000.05,73811, +MEIM.S1WA,2000.06,70070, +MEIM.S1WA,2000.07,69873, +MEIM.S1WA,2000.08,71468, +MEIM.S1WA,2000.09,72462, +MEIM.S1WA,2000.1,74897, +``` + +model.sql + +```sql +{{ config( + materialized='table' +) }} + +select row_number() over() as id + , * + , cast(from_unixtime(to_unixtime(now())) as timestamp(6)) as refresh_timestamp +from {{ ref('employment_indicators_november_2022_csv_tables') }} +``` + +timestamp strategy - model_snapshot_1 + +```sql +{% snapshot model_snapshot_1 %} + +{{ + config( + strategy='timestamp', + updated_at='refresh_timestamp', + unique_key='id' + ) +}} + +select * +from {{ ref('model') }} {% endsnapshot %} +``` + +invalidate hard deletes - model_snapshot_2 + +```sql +{% snapshot model_snapshot_2 %} + +{{ + config + ( + unique_key='id', + strategy='timestamp', + updated_at='refresh_timestamp', + invalidate_hard_deletes=True, + ) +}} +select * +from {{ ref('model') }} {% endsnapshot %} +``` + +check strategy - model_snapshot_3 + +```sql +{% snapshot model_snapshot_3 %} + +{{ + config + ( + unique_key='id', + strategy='check', + check_cols=['series_reference','data_value'] + ) +}} +select * +from {{ ref('model') }} {% endsnapshot %} +``` + +### Snapshots known issues + +- Incremental Iceberg models - Sync all columns on schema change can't remove columns used for partitioning. The only way, from a dbt perspective, is to do a full-refresh of the incremental model. +- Tables, schemas and database names should only be lowercase +- In order to avoid potential conflicts, make sure [`dbt-athena-adapter`](https://github.com/Tomme/dbt-athena) is not installed in the target environment. +- Snapshot does not support dropping columns from the source table. If you drop a column make sure to drop the column from the snapshot as well. 
Another workaround is to NULL the column in the snapshot definition to preserve history + +## AWS Lake Formation integration + +The adapter implements AWS Lake Formation tag management in the following way: + +- You can enable or disable lf-tags management via [config](#table-configuration) (disabled by default) +- Once you enable the feature, lf-tags will be updated on every dbt run +- First, all lf-tags for columns are removed to avoid inheritance issues +- Then, all redundant lf-tags are removed from tables and actual tags from table configs are applied +- Finally, lf-tags for columns are applied + +It's important to understand the following points: + +- dbt does not manage `lf-tags` for databases +- dbt does not manage Lake Formation permissions + +That's why you should handle this by yourself manually or using an automation tool like terraform, AWS CDK, etc. You may find the following links useful to manage that: + +* [terraform aws_lakeformation_permissions](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/lakeformation_permissions) +* [terraform aws_lakeformation_resource_lf_tags](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/lakeformation_resource_lf_tags) + +## Python models + +The adapter supports Python models using [`spark`](https://docs.aws.amazon.com/athena/latest/ug/notebooks-spark.html). + +### Setup + +- A Spark-enabled workgroup created in Athena +- Spark execution role granted access to Athena, Glue and S3 +- The Spark workgroup is added to the `~/.dbt/profiles.yml` file and the profile to be used + is referenced in `dbt_project.yml` + +### Spark-specific table configuration + +- `timeout` (`default=43200`) + - Time out in seconds for each Python model execution. Defaults to 12 hours/43200 seconds. +- `spark_encryption` (`default=false`) + - If this flag is set to true, encrypts data in transit between Spark nodes and also encrypts data at rest stored locally by Spark. +- `spark_cross_account_catalog` (`default=false`) + - When using the Spark Athena workgroup, queries can only be made against catalogs located on the same AWS account by default. However, sometimes you want to query another catalog located on an external AWS account. Setting this additional Spark properties parameter to true will enable querying external catalogs. You can use the syntax `external_catalog_id/database.table` to access the external table on the external catalog (For example, `999999999999/mydatabase.cloudfront_logs` where 999999999999 is the external catalog ID) +- `spark_requester_pays` (`default=false`) + - When an Amazon S3 bucket is configured as requester pays, the account of the user running the query is charged for data access and data transfer fees associated with the query. + - If this flag is set to true, requester pays S3 buckets are enabled in Athena for Spark. + +### Spark notes + +- A session is created for each unique engine configuration defined in the models that are part of the invocation. +- A session's idle timeout is set to 10 minutes. Within the timeout period, if there is a new calculation (Spark Python model) ready for execution and the engine configuration matches, the process will reuse the same session. +- The number of Python models running at a time depends on the `threads`. The number of sessions created for the entire run depends on the number of unique engine configurations and the availability of sessions to maintain thread concurrency. 
+- For Iceberg tables, it is recommended to use `table_properties` configuration to set the `format_version` to 2. This is to maintain compatibility between Iceberg tables created by Trino with those created by Spark. + +### Example models + +#### Simple pandas model + +```python +import pandas as pd + + +def model(dbt, session): + dbt.config(materialized="table") + + model_df = pd.DataFrame({"A": [1, 2, 3, 4]}) + + return model_df +``` + +#### Simple spark + +```python +def model(dbt, spark_session): + dbt.config(materialized="table") + + data = [(1,), (2,), (3,), (4,)] + + df = spark_session.createDataFrame(data, ["A"]) + + return df +``` + +#### Spark incremental + +```python +def model(dbt, spark_session): + dbt.config(materialized="incremental") + df = dbt.ref("model") + + if dbt.is_incremental: + max_from_this = ( + f"select max(run_date) from {dbt.this.schema}.{dbt.this.identifier}" + ) + df = df.filter(df.run_date >= spark_session.sql(max_from_this).collect()[0][0]) + + return df +``` + +#### Config spark model + +```python +def model(dbt, spark_session): + dbt.config( + materialized="table", + engine_config={ + "CoordinatorDpuSize": 1, + "MaxConcurrentDpus": 3, + "DefaultExecutorDpuSize": 1 + }, + spark_encryption=True, + spark_cross_account_catalog=True, + spark_requester_pays=True + polling_interval=15, + timeout=120, + ) + + data = [(1,), (2,), (3,), (4,)] + + df = spark_session.createDataFrame(data, ["A"]) + + return df +``` + +#### Create pySpark udf using imported external python files + +```python +def model(dbt, spark_session): + dbt.config( + materialized="incremental", + incremental_strategy="merge", + unique_key="num", + ) + sc = spark_session.sparkContext + sc.addPyFile("s3://athena-dbt/test/file1.py") + sc.addPyFile("s3://athena-dbt/test/file2.py") + + def func(iterator): + from file2 import transform + + return [transform(i) for i in iterator] + + from pyspark.sql.functions import udf + from pyspark.sql.functions import col + + udf_with_import = udf(func) + + data = [(1, "a"), (2, "b"), (3, "c")] + cols = ["num", "alpha"] + df = spark_session.createDataFrame(data, cols) + + return df.withColumn("udf_test_col", udf_with_import(col("alpha"))) +``` + +### Known issues in Python models + +- Python models cannot [reference Athena SQL views](https://docs.aws.amazon.com/athena/latest/ug/notebooks-spark.html). +- Third-party Python libraries can be used, but they must be [included in the pre-installed list][pre-installed list] or [imported manually][imported manually]. +- Python models can only reference or write to tables with names meeting the regular expression: `^[0-9a-zA-Z_]+$`. Dashes and special characters are not supported by Spark, even though Athena supports them. +- Incremental models do not fully utilize Spark capabilities. They depend partially on existing SQL-based logic which runs on Trino. +- Snapshot materializations are not supported. +- Spark can only reference tables within the same catalog. +- For tables created outside of the dbt tool, be sure to populate the location field or dbt will throw an error when trying to create the table. + +[pre-installed list]: https://docs.aws.amazon.com/athena/latest/ug/notebooks-spark-preinstalled-python-libraries.html +[imported manually]: https://docs.aws.amazon.com/athena/latest/ug/notebooks-import-files-libraries.html + +## Contracts + +The adapter partly supports contract definitions: + +- `data_type` is supported but needs to be adjusted for complex types. 
Types must be specified entirely (for instance `array`) even though they won't be checked. Indeed, as dbt recommends, we only compare the broader type (array, map, int, varchar). The complete definition is used in order to check that the data types defined in Athena are ok (pre-flight check). +- The adapter does not support the constraints since there is no constraint concept in Athena. + diff --git a/website/sidebars.js b/website/sidebars.js index a3b0cd2d8a4..1454e210617 100644 --- a/website/sidebars.js +++ b/website/sidebars.js @@ -835,26 +835,27 @@ const sidebarSettings = { type: "category", label: "Platform-specific configs", items: [ + "reference/resource-configs/athena-configs", + "reference/resource-configs/impala-configs", "reference/resource-configs/spark-configs", "reference/resource-configs/bigquery-configs", - "reference/resource-configs/databricks-configs", - "reference/resource-configs/fabric-configs", - "reference/resource-configs/postgres-configs", - "reference/resource-configs/redshift-configs", - "reference/resource-configs/snowflake-configs", - "reference/resource-configs/trino-configs", - "reference/resource-configs/impala-configs", "reference/resource-configs/clickhouse-configs", + "reference/resource-configs/databricks-configs", "reference/resource-configs/doris-configs", "reference/resource-configs/firebolt-configs", "reference/resource-configs/greenplum-configs", "reference/resource-configs/infer-configs", "reference/resource-configs/materialize-configs", "reference/resource-configs/azuresynapse-configs", + "reference/resource-configs/fabric-configs", "reference/resource-configs/mssql-configs", "reference/resource-configs/mindsdb-configs", "reference/resource-configs/oracle-configs", + "reference/resource-configs/postgres-configs", + "reference/resource-configs/redshift-configs", "reference/resource-configs/singlestore-configs", + "reference/resource-configs/snowflake-configs", + "reference/resource-configs/trino-configs", "reference/resource-configs/starrocks-configs", "reference/resource-configs/teradata-configs", "reference/resource-configs/upsolver-configs", From b579b232057d640a90b3e65b78c199444e9794a9 Mon Sep 17 00:00:00 2001 From: Matt Shaver <60105315+matthewshaver@users.noreply.github.com> Date: Thu, 22 Aug 2024 13:28:59 -0400 Subject: [PATCH 02/29] Update website/docs/reference/resource-configs/athena-configs.md --- website/docs/reference/resource-configs/athena-configs.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/website/docs/reference/resource-configs/athena-configs.md b/website/docs/reference/resource-configs/athena-configs.md index 7791350c999..2d10a0420a9 100644 --- a/website/docs/reference/resource-configs/athena-configs.md +++ b/website/docs/reference/resource-configs/athena-configs.md @@ -434,7 +434,7 @@ from {{ ref('model') }} {% endsnapshot %} ## AWS Lake Formation integration -The adapter implements AWS Lake Formation tag management in the following way: +The following is how the adapter implements AWS Lake Formation tag management: - You can enable or disable lf-tags management via [config](#table-configuration) (disabled by default) - Once you enable the feature, lf-tags will be updated on every dbt run From 9ede0bfe742a458d2570c90e5f23fb2e374ddb88 Mon Sep 17 00:00:00 2001 From: Matt Shaver <60105315+matthewshaver@users.noreply.github.com> Date: Thu, 29 Aug 2024 13:46:23 -0400 Subject: [PATCH 03/29] Apply suggestions from code review Co-authored-by: Ly Nguyen <107218380+nghi-ly@users.noreply.github.com> --- 
.../reference/resource-configs/athena-configs.md | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/website/docs/reference/resource-configs/athena-configs.md b/website/docs/reference/resource-configs/athena-configs.md index 2d10a0420a9..dda4def08c0 100644 --- a/website/docs/reference/resource-configs/athena-configs.md +++ b/website/docs/reference/resource-configs/athena-configs.md @@ -9,16 +9,16 @@ id: "athena-configs" | Parameter | Default | Description | |-----------|---------|-------------| -| `external_location` | None | The full S3 path to where the table will be saved. Only works with incremental models. Doesn't work with Hive table with `ha` set to `true`. | +| `external_location` | None | The full S3 path to where the table is saved. It only works with incremental models. It doesn't work with Hive tables with `ha` set to `true`. | | `partitioned_by` | None | An array list of columns by which the table will be partitioned. Currently limited to 100 partitions. | | `bucketed_by` | None | An array list of the columns to bucket data. Ignored if using Iceberg | -| `bucket_count` | None | The number of buckets for bucketing your data. Ignored if using Iceberg | -| `table_type` | Hive | The type of table. Supports `hive` or `iceberg` | +| `bucket_count` | None | The number of buckets for bucketing your data. This parameter is ignored if using Iceberg. | +| `table_type` | Hive | The type of table. Supports `hive` or `iceberg`. | | `ha` | False | Build the table using the high-availability method. Only available for Hive tables. | -| `format` | Parquet | The data format for the table. Supports `ORC`, `PARQUET`, `AVRO`, `JSON`, and `TEXTFILE` | +| `format` | Parquet | The data format for the table. Supports `ORC`, `PARQUET`, `AVRO`, `JSON`, and `TEXTFILE`. | | `write_compression` | None | The compression type for any storage format that allows compressions. See [CREATE TABLE AS][#create-table-as] for available options | -| `field_delimeter` | None | Custome field delimiter for when the format is set to `TEXTFIRE` | -| `table_properties` | N/A | The tabe properties to add to the table. For Iceberg only. | +| `field_delimeter` | None | Specify the custom field delimiter to use when the format is set to `TEXTFIRE`. | +| `table_properties` | N/A | The table properties to add to the table. This is only for Iceberg. | | `native_drop` | N/A | Relation drop operations will be performed with SQL, not direct Glue API calls. No S3 calls will be made to manage data in S3. Data in S3 will only be cleared up for Iceberg tables. See the [AWS docs](https://docs.aws.amazon.com/athena/latest/ug/querying-iceberg-managing-tables.html) for more info. Iceberg DROP TABLE operations may timeout if they take longer than 60 seconds.| | `seed_by_insert` | False | Creates seeds using an SQL insert statement. Large seed files can't exceed the Athena 262144 bytes limit. | | `force_batch` | False | Run the table creation directly in batch insert mode. Useful when the standard table creation fails due to partition limitation. 
| From 0d0f0bdcec71f72d6c1d4a8138401c3c0876feb0 Mon Sep 17 00:00:00 2001 From: Matt Shaver <60105315+matthewshaver@users.noreply.github.com> Date: Thu, 29 Aug 2024 13:48:20 -0400 Subject: [PATCH 04/29] Update athena-configs.md Adding description --- website/docs/reference/resource-configs/athena-configs.md | 1 + 1 file changed, 1 insertion(+) diff --git a/website/docs/reference/resource-configs/athena-configs.md b/website/docs/reference/resource-configs/athena-configs.md index dda4def08c0..fb4854b47f5 100644 --- a/website/docs/reference/resource-configs/athena-configs.md +++ b/website/docs/reference/resource-configs/athena-configs.md @@ -1,5 +1,6 @@ --- title: "Amazon Athena configurations" +description: "Reference guide for the Amazon Athena adapter for dbt Core and dbt Cloud." id: "athena-configs" --- From 36ade39807c9be8b93825b9a36235b802974356b Mon Sep 17 00:00:00 2001 From: Matt Shaver <60105315+matthewshaver@users.noreply.github.com> Date: Thu, 29 Aug 2024 13:49:07 -0400 Subject: [PATCH 05/29] Update website/docs/reference/resource-configs/athena-configs.md --- website/docs/reference/resource-configs/athena-configs.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/website/docs/reference/resource-configs/athena-configs.md b/website/docs/reference/resource-configs/athena-configs.md index fb4854b47f5..28eac0984f4 100644 --- a/website/docs/reference/resource-configs/athena-configs.md +++ b/website/docs/reference/resource-configs/athena-configs.md @@ -12,7 +12,7 @@ id: "athena-configs" |-----------|---------|-------------| | `external_location` | None | The full S3 path to where the table is saved. It only works with incremental models. It doesn't work with Hive tables with `ha` set to `true`. | | `partitioned_by` | None | An array list of columns by which the table will be partitioned. Currently limited to 100 partitions. | -| `bucketed_by` | None | An array list of the columns to bucket data. Ignored if using Iceberg | +| `bucketed_by` | None | An array list of the columns to bucket data. Ignored if using Iceberg. | | `bucket_count` | None | The number of buckets for bucketing your data. This parameter is ignored if using Iceberg. | | `table_type` | Hive | The type of table. Supports `hive` or `iceberg`. | | `ha` | False | Build the table using the high-availability method. Only available for Hive tables. | From 823f70a501d447dbc0310f5697515583592d2219 Mon Sep 17 00:00:00 2001 From: Matt Shaver <60105315+matthewshaver@users.noreply.github.com> Date: Thu, 29 Aug 2024 13:56:18 -0400 Subject: [PATCH 06/29] Editorial changes --- .../reference/resource-configs/athena-configs.md | 14 ++++++++++---- 1 file changed, 10 insertions(+), 4 deletions(-) diff --git a/website/docs/reference/resource-configs/athena-configs.md b/website/docs/reference/resource-configs/athena-configs.md index 28eac0984f4..f2d8bc086ee 100644 --- a/website/docs/reference/resource-configs/athena-configs.md +++ b/website/docs/reference/resource-configs/athena-configs.md @@ -31,6 +31,8 @@ id: "athena-configs" #### Configuration examples +Example of the models `schema.yml` file: + ```sql @@ -60,6 +62,8 @@ id: "athena-configs" ``` +Example of the `dbt_project.yml` + ```yaml @@ -92,13 +96,15 @@ lf_grants={ } ``` +There are some limitations and recommendations that should be considered: - `lf_tags` and `lf_tags_columns` configs support only attaching lf tags to corresponding resources. - We recommend managing LF Tags permissions somewhere outside dbt. 
For example, [terraform](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/lakeformation_permissions) or [aws cdk](https://docs.aws.amazon.com/cdk/api/v1/docs/aws-lakeformation-readme.html). -- `data_cell_filters` management can't be automated outside dbt because the filter can't be attached to the table which doesn't exist. Once you `enable` this config, dbt will set all filters and their permissions during every dbt run. Such approach keeps the actual state of row level security configuration actual after every dbt run and apply changes if they occur: drop, create, update filters and their permissions. -- Any tags listed in `lf_inherited_tags` should be strictly inherited from the database level and never overridden at the table and column level -- Currently `dbt-athena` does not differentiate between an inherited tag association and an override of same it made previously -> - For example, If an inherited tag is overridden by an `lf_tags_config` value in one DBT run, and that override is removed prior to a subsequent run, the prior override will linger and no longer be encoded anywhere (in e.g. Terraform where the inherited value is configured nor in the DBT project where the override previously existed but now is gone) +- `data_cell_filters` management can't be automated outside dbt because the filter can't be attached to the table, which doesn't exist. Once you `enable` this config, dbt will set all filters and their permissions during every dbt run. Such an approach keeps the actual state of row-level security configuration after every dbt run and applies changes if they occur: drop, create, and update filters and their permissions. +- Any tags listed in `lf_inherited_tags` should be strictly inherited from the database level and never overridden at the table and column level. +- Currently, `dbt-athena` does not differentiate between an inherited tag association and an override it made previously. + - For example, If a `lf_tags_config` value overrides an inherited tag in one run, and that override is removed before a subsequent run, the prior override will linger and no longer be encoded anywhere (for example, Terraform where the inherited value is configured nor in the DBT project where the override previously existed but now is gone). + ### Table location From a9511e95d6cf810eb600b8ebd00bb93c987d149c Mon Sep 17 00:00:00 2001 From: Matt Shaver <60105315+matthewshaver@users.noreply.github.com> Date: Thu, 29 Aug 2024 13:56:55 -0400 Subject: [PATCH 07/29] Update website/docs/reference/resource-configs/athena-configs.md Co-authored-by: Ly Nguyen <107218380+nghi-ly@users.noreply.github.com> --- website/docs/reference/resource-configs/athena-configs.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/website/docs/reference/resource-configs/athena-configs.md b/website/docs/reference/resource-configs/athena-configs.md index f2d8bc086ee..03eefcc53b3 100644 --- a/website/docs/reference/resource-configs/athena-configs.md +++ b/website/docs/reference/resource-configs/athena-configs.md @@ -108,7 +108,7 @@ There are some limitations and recommendations that should be considered: ### Table location -The saved location a table is determined by the following conditions: +The saved location of a table is determined by the following conditions: 1. If `external_location` is defined, that value is used. 2. If `s3_data_dir` is defined, the path is determined by that and `s3_data_naming`. 
From 3dbb57ffaf6dfea014c6196fb2b405c51375abbe Mon Sep 17 00:00:00 2001 From: Matt Shaver <60105315+matthewshaver@users.noreply.github.com> Date: Thu, 29 Aug 2024 13:58:20 -0400 Subject: [PATCH 08/29] Apply suggestions from code review --- website/docs/reference/resource-configs/athena-configs.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/website/docs/reference/resource-configs/athena-configs.md b/website/docs/reference/resource-configs/athena-configs.md index 03eefcc53b3..8ef5fd368f0 100644 --- a/website/docs/reference/resource-configs/athena-configs.md +++ b/website/docs/reference/resource-configs/athena-configs.md @@ -122,7 +122,7 @@ The following options are available for `s3_data_naming`: - `schema_table`: `{s3_data_dir}/{schema}/{table}/` - `s3_data_naming=schema_table_unique`: `{s3_data_dir}/{schema}/{table}/{uuid4()}/` -It's possible to set the `s3_data_naming` globally in the target profile, or overwrite the value in the table config or setting up the value for groups of model in dbt_project.yml. +It's possible to set the `s3_data_naming` globally in the target profile, overwrite the value in the table config, or set up the value for groups of the models in dbt_project.yml. Note: when using a workgroup with a default output location configured, `s3_data_naming` and any configured buckets are ignored and the location configured in the workgroup is used. From f73aca8308b977e2ae3d254761e009964113918c Mon Sep 17 00:00:00 2001 From: Matt Shaver <60105315+matthewshaver@users.noreply.github.com> Date: Thu, 29 Aug 2024 13:58:35 -0400 Subject: [PATCH 09/29] Update website/docs/reference/resource-configs/athena-configs.md --- website/docs/reference/resource-configs/athena-configs.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/website/docs/reference/resource-configs/athena-configs.md b/website/docs/reference/resource-configs/athena-configs.md index 8ef5fd368f0..b314a918749 100644 --- a/website/docs/reference/resource-configs/athena-configs.md +++ b/website/docs/reference/resource-configs/athena-configs.md @@ -108,7 +108,7 @@ There are some limitations and recommendations that should be considered: ### Table location -The saved location of a table is determined by the following conditions: +The saved location of a table is determined in precedence by the following conditions: 1. If `external_location` is defined, that value is used. 2. If `s3_data_dir` is defined, the path is determined by that and `s3_data_naming`. From 7eeb22b39087549c22571c9dd7f482b9f13a81fc Mon Sep 17 00:00:00 2001 From: Matt Shaver <60105315+matthewshaver@users.noreply.github.com> Date: Thu, 29 Aug 2024 14:00:03 -0400 Subject: [PATCH 10/29] Update website/docs/reference/resource-configs/athena-configs.md --- website/docs/reference/resource-configs/athena-configs.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/website/docs/reference/resource-configs/athena-configs.md b/website/docs/reference/resource-configs/athena-configs.md index b314a918749..1006de09203 100644 --- a/website/docs/reference/resource-configs/athena-configs.md +++ b/website/docs/reference/resource-configs/athena-configs.md @@ -124,7 +124,7 @@ The following options are available for `s3_data_naming`: It's possible to set the `s3_data_naming` globally in the target profile, overwrite the value in the table config, or set up the value for groups of the models in dbt_project.yml. 
-Note: when using a workgroup with a default output location configured, `s3_data_naming` and any configured buckets are ignored and the location configured in the workgroup is used. +Note: If you're using a workgroup with a default output location configured, `s3_data_naming` ignores any configured buckets and uses the location configured in the workgroup. ### Incremental models From 6b4964ce756c8dc98688f3ee8206a21278208837 Mon Sep 17 00:00:00 2001 From: Matt Shaver <60105315+matthewshaver@users.noreply.github.com> Date: Thu, 29 Aug 2024 14:08:44 -0400 Subject: [PATCH 11/29] editorial changes --- .../resource-configs/athena-configs.md | 27 +++++++++++++------ 1 file changed, 19 insertions(+), 8 deletions(-) diff --git a/website/docs/reference/resource-configs/athena-configs.md b/website/docs/reference/resource-configs/athena-configs.md index 1006de09203..770907e506a 100644 --- a/website/docs/reference/resource-configs/athena-configs.md +++ b/website/docs/reference/resource-configs/athena-configs.md @@ -17,7 +17,7 @@ id: "athena-configs" | `table_type` | Hive | The type of table. Supports `hive` or `iceberg`. | | `ha` | False | Build the table using the high-availability method. Only available for Hive tables. | | `format` | Parquet | The data format for the table. Supports `ORC`, `PARQUET`, `AVRO`, `JSON`, and `TEXTFILE`. | -| `write_compression` | None | The compression type for any storage format that allows compressions. See [CREATE TABLE AS][#create-table-as] for available options | +| `write_compression` | None | The compression type for any storage format that allows compressions. | | `field_delimeter` | None | Specify the custom field delimiter to use when the format is set to `TEXTFIRE`. | | `table_properties` | N/A | The table properties to add to the table. This is only for Iceberg. | | `native_drop` | N/A | Relation drop operations will be performed with SQL, not direct Glue API calls. No S3 calls will be made to manage data in S3. Data in S3 will only be cleared up for Iceberg tables. See the [AWS docs](https://docs.aws.amazon.com/athena/latest/ug/querying-iceberg-managing-tables.html) for more info. Iceberg DROP TABLE operations may timeout if they take longer than 60 seconds.| @@ -31,7 +31,9 @@ id: "athena-configs" #### Configuration examples -Example of the models `schema.yml` file: + + + @@ -62,7 +64,9 @@ Example of the models `schema.yml` file: ``` -Example of the `dbt_project.yml` + + + @@ -80,7 +84,9 @@ Example of the `dbt_project.yml` -Lake Formation grants: + + + ```python lf_grants={ @@ -96,6 +102,10 @@ lf_grants={ } ``` + + + + There are some limitations and recommendations that should be considered: - `lf_tags` and `lf_tags_columns` configs support only attaching lf tags to corresponding resources. @@ -122,7 +132,7 @@ The following options are available for `s3_data_naming`: - `schema_table`: `{s3_data_dir}/{schema}/{table}/` - `s3_data_naming=schema_table_unique`: `{s3_data_dir}/{schema}/{table}/{uuid4()}/` -It's possible to set the `s3_data_naming` globally in the target profile, overwrite the value in the table config, or set up the value for groups of the models in dbt_project.yml. +To set the `s3_data_naming` globally in the target profile, overwrite the value in the table config, or set up the value for groups of the models in dbt_project.yml. Note: If you're using a workgroup with a default output location configured, `s3_data_naming` ignores any configured buckets and uses the location configured in the workgroup. 
@@ -130,9 +140,10 @@ Note: If you're using a workgroup with a default output location configured, `s3 The following [incremental models](https://docs.getdbt.com/docs/build/incremental-models) strategies are supported: -- `insert_overwrite` (default): The insert overwrite strategy deletes the overlapping partitions from the destination table, and then inserts the new records from the source. This strategy depends on the `partitioned_by` keyword! If no partitions are defined, dbt will fall back to the `append` strategy. -- `append`: Insert new records without updating, deleting or overwriting any existing data. There might be duplicate data (e.g. great for log or historical data). -- `merge`: Conditionally updates, deletes, or inserts rows into an Iceberg table. Used in combination with `unique_key`.Only available when using Iceberg. +- `insert_overwrite` (default): The insert-overwrite strategy deletes the overlapping partitions from the destination table and then inserts the new records from the source. This strategy depends on the `partitioned_by` keyword! dbt will fall back to the `append` strategy if no partitions are defined. +- `append`: Insert new records without updating, deleting or overwriting any existing data. There might be duplicate data (great for log or historical data). +- `merge`: Conditionally updates, deletes, or inserts rows into an Iceberg table. Used in combination with `unique_key`.It is only available when using Iceberg. + ### On schema change From 7e89a5cb008d3679a925054a57f4ac87966b3eca Mon Sep 17 00:00:00 2001 From: Matt Shaver <60105315+matthewshaver@users.noreply.github.com> Date: Thu, 29 Aug 2024 14:10:14 -0400 Subject: [PATCH 12/29] Apply suggestions from code review Co-authored-by: Ly Nguyen <107218380+nghi-ly@users.noreply.github.com> --- .../docs/reference/resource-configs/athena-configs.md | 9 ++++----- 1 file changed, 4 insertions(+), 5 deletions(-) diff --git a/website/docs/reference/resource-configs/athena-configs.md b/website/docs/reference/resource-configs/athena-configs.md index 770907e506a..bc866226981 100644 --- a/website/docs/reference/resource-configs/athena-configs.md +++ b/website/docs/reference/resource-configs/athena-configs.md @@ -147,21 +147,20 @@ The following [incremental models](https://docs.getdbt.com/docs/build/incrementa ### On schema change -`on_schema_change` is an option to reflect changes of schema in incremental models. -The following options are supported: +The `on_schema_change` option reflects changes of the schema in incremental models. The values you can set this to are: - `ignore` (default) - `fail` - `append_new_columns` - `sync_all_columns` -For details, please refer to the [incremental models](https://docs.getdbt.com/docs/build/incremental-models#what-if-the-columns-of-my-incremental-model-change) article. +To learn more, refer to [What if the columns of my incremental model change](/docs/build/incremental-models#what-if-the-columns-of-my-incremental-model-change). ### Iceberg The adapter supports table materialization for Iceberg. -Take the following model as an example: +For example: ```sql {{ config( @@ -186,7 +185,7 @@ select 'A' as user_id, Iceberg supports bucketing as hidden partitions. Use the `partitioned_by` config to add specific bucketing conditions. -Iceberg supports several table formats for data : `PARQUET`, `AVRO` and `ORC`. +Iceberg supports these table formats for data : `PARQUET`, `AVRO` and `ORC`. 
It is possible to use Iceberg in an incremental fashion, specifically two strategies are supported: From c4126e1014247a54323ada82cfdd971f061a9113 Mon Sep 17 00:00:00 2001 From: Matt Shaver <60105315+matthewshaver@users.noreply.github.com> Date: Thu, 29 Aug 2024 14:11:21 -0400 Subject: [PATCH 13/29] Update website/docs/reference/resource-configs/athena-configs.md --- website/docs/reference/resource-configs/athena-configs.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/website/docs/reference/resource-configs/athena-configs.md b/website/docs/reference/resource-configs/athena-configs.md index bc866226981..01deeaaba05 100644 --- a/website/docs/reference/resource-configs/athena-configs.md +++ b/website/docs/reference/resource-configs/athena-configs.md @@ -187,7 +187,7 @@ conditions. Iceberg supports these table formats for data : `PARQUET`, `AVRO` and `ORC`. -It is possible to use Iceberg in an incremental fashion, specifically two strategies are supported: +To use Iceberg incrementally, use one of the following supported strategies: - `append`: New records are appended to the table (this can lead to duplicates). - `merge`: Perform an update and insert (and optional delete), where new records are added and existing records are updated. Only available with Athena engine version 3. From 424fabfd379fddf41b0877cc90f12c601445dbe8 Mon Sep 17 00:00:00 2001 From: Matt Shaver <60105315+matthewshaver@users.noreply.github.com> Date: Thu, 29 Aug 2024 14:14:33 -0400 Subject: [PATCH 14/29] EDitorial changes --- website/docs/reference/resource-configs/athena-configs.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/website/docs/reference/resource-configs/athena-configs.md b/website/docs/reference/resource-configs/athena-configs.md index 770907e506a..b486f6cd39e 100644 --- a/website/docs/reference/resource-configs/athena-configs.md +++ b/website/docs/reference/resource-configs/athena-configs.md @@ -191,15 +191,15 @@ Iceberg supports several table formats for data : `PARQUET`, `AVRO` and `ORC`. It is possible to use Iceberg in an incremental fashion, specifically two strategies are supported: - `append`: New records are appended to the table (this can lead to duplicates). -- `merge`: Perform an update and insert (and optional delete), where new records are added and existing records are updated. Only available with Athena engine version 3. - - `unique_key`(required): Columns that define a unique record in the source and target tables. +- `merge`: Perform an update and insert (and optional delete), where new and existing records are added. It is only available with Athena engine version 3. + - `unique_key`(required): Columns defining a unique source and target table record. - `incremental_predicates` (optional): SQL conditions that enable custom join clauses in the merge statement. This can - be useful for improving performance via predicate pushdown on the target table. + help improve performance via predicate pushdown on the target table. - `delete_condition` (optional): SQL condition used to identify records that should be deleted. - `update_condition` (optional): SQL condition used to identify records that should be updated. - `insert_condition` (optional): SQL condition used to identify records that should be inserted. - `incremental_predicates`, `delete_condition`, `update_condition` and `insert_condition` can include any column of the incremental table (`src`) or the final table (`target`). 
Column names must be prefixed by either `src` or `target` to prevent a `Column is ambiguous` error. - + Example of `delete_condition`: ```sql From 380d1491e0d1d0fc76a33fdcd0a2622b49dbad82 Mon Sep 17 00:00:00 2001 From: Matt Shaver <60105315+matthewshaver@users.noreply.github.com> Date: Thu, 29 Aug 2024 14:15:36 -0400 Subject: [PATCH 15/29] Update website/docs/reference/resource-configs/athena-configs.md --- website/docs/reference/resource-configs/athena-configs.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/website/docs/reference/resource-configs/athena-configs.md b/website/docs/reference/resource-configs/athena-configs.md index 2de09a3e7b2..34a25c682c5 100644 --- a/website/docs/reference/resource-configs/athena-configs.md +++ b/website/docs/reference/resource-configs/athena-configs.md @@ -197,7 +197,7 @@ To use Iceberg incrementally, use one of the following supported strategies: - `delete_condition` (optional): SQL condition used to identify records that should be deleted. - `update_condition` (optional): SQL condition used to identify records that should be updated. - `insert_condition` (optional): SQL condition used to identify records that should be inserted. - - `incremental_predicates`, `delete_condition`, `update_condition` and `insert_condition` can include any column of the incremental table (`src`) or the final table (`target`). Column names must be prefixed by either `src` or `target` to prevent a `Column is ambiguous` error. + - `incremental_predicates`, `delete_condition`, `update_condition` and `insert_condition` can include any column of the incremental table (`src`) or the final table (`target`). Column names must be prefixed by either `src` or `target` to prevent a `Column is ambiguous` error. Example of `delete_condition`: From 916bc764cb5aa343db920581f69a263364098429 Mon Sep 17 00:00:00 2001 From: Matt Shaver <60105315+matthewshaver@users.noreply.github.com> Date: Thu, 29 Aug 2024 14:16:35 -0400 Subject: [PATCH 16/29] Update website/docs/reference/resource-configs/athena-configs.md --- website/docs/reference/resource-configs/athena-configs.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/website/docs/reference/resource-configs/athena-configs.md b/website/docs/reference/resource-configs/athena-configs.md index 34a25c682c5..5ab73dfb531 100644 --- a/website/docs/reference/resource-configs/athena-configs.md +++ b/website/docs/reference/resource-configs/athena-configs.md @@ -197,7 +197,7 @@ To use Iceberg incrementally, use one of the following supported strategies: - `delete_condition` (optional): SQL condition used to identify records that should be deleted. - `update_condition` (optional): SQL condition used to identify records that should be updated. - `insert_condition` (optional): SQL condition used to identify records that should be inserted. - - `incremental_predicates`, `delete_condition`, `update_condition` and `insert_condition` can include any column of the incremental table (`src`) or the final table (`target`). Column names must be prefixed by either `src` or `target` to prevent a `Column is ambiguous` error. + - `incremental_predicates`, `delete_condition`, `update_condition` and `insert_condition` can include any column of the incremental table (`src`) or the final table (`target`). Column names must be prefixed by either `src` or `target` to prevent a `Column is ambiguous` error. 
Example of `delete_condition`: From 088f36bfafd16e49471cb59873d14a7c2e1af8ee Mon Sep 17 00:00:00 2001 From: Matt Shaver <60105315+matthewshaver@users.noreply.github.com> Date: Thu, 29 Aug 2024 14:17:03 -0400 Subject: [PATCH 17/29] Apply suggestions from code review Co-authored-by: Ly Nguyen <107218380+nghi-ly@users.noreply.github.com> Co-authored-by: Amy Chen <46451573+amychen1776@users.noreply.github.com> --- website/docs/reference/resource-configs/athena-configs.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/website/docs/reference/resource-configs/athena-configs.md b/website/docs/reference/resource-configs/athena-configs.md index 5ab73dfb531..fd8a0b81c0c 100644 --- a/website/docs/reference/resource-configs/athena-configs.md +++ b/website/docs/reference/resource-configs/athena-configs.md @@ -221,7 +221,7 @@ select 'A' as user_id, current_date as my_date ``` -`update_condition` example: +Example of `update_condition`: ```sql {{ config( @@ -274,9 +274,9 @@ select * from ( ``` -### High availablity table (HA) +### High availability (HA) table -The current implementation of table materialization can lead to downtime, as the target table is dropped and re-created. To have less destructive behavior, it's possible to use the `ha` config on your `table` materialized models. It leverages the table versions feature of the glue catalog, creating a temp table and swapping the target table to the location of the temp table. This materialization is only available for `table_type=hive` and requires using unique locations. For Iceberg, high availability is the default. +The current implementation of table materialization can lead to downtime, as the target table is dropped and re-created. For less destructive behavior, you can use the `ha` config on your `table` materialized models. It leverages the table versions feature of the glue catalog, which creates a temporary table and swaps the target table to the location of the temporary table. This materialization is only available for `table_type=hive` and requires using unique locations. For Iceberg, high availability is the default. ```sql @@ -289,7 +289,7 @@ The current implementation of table materialization can lead to downtime, as the s3_data_naming='table_unique' ) }} -select 'a' as user_id, +select 'a' as user_id, 'pi' as user_name, 'active' as status union all From abd8012149106232cb1663415a5af038d64cea56 Mon Sep 17 00:00:00 2001 From: Matt Shaver <60105315+matthewshaver@users.noreply.github.com> Date: Thu, 29 Aug 2024 14:25:49 -0400 Subject: [PATCH 18/29] Adding tabs --- .../resource-configs/athena-configs.md | 21 ++++++++++++++----- 1 file changed, 16 insertions(+), 5 deletions(-) diff --git a/website/docs/reference/resource-configs/athena-configs.md b/website/docs/reference/resource-configs/athena-configs.md index 2de09a3e7b2..383b9c2a574 100644 --- a/website/docs/reference/resource-configs/athena-configs.md +++ b/website/docs/reference/resource-configs/athena-configs.md @@ -197,9 +197,12 @@ To use Iceberg incrementally, use one of the following supported strategies: - `delete_condition` (optional): SQL condition used to identify records that should be deleted. - `update_condition` (optional): SQL condition used to identify records that should be updated. - `insert_condition` (optional): SQL condition used to identify records that should be inserted. 
- - `incremental_predicates`, `delete_condition`, `update_condition` and `insert_condition` can include any column of the incremental table (`src`) or the final table (`target`). Column names must be prefixed by either `src` or `target` to prevent a `Column is ambiguous` error. - -Example of `delete_condition`: + +`incremental_predicates`, `delete_condition`, `update_condition` and `insert_condition` can include any column of the incremental table (`src`) or the final table (`target`). Column names must be prefixed by either `src` or `target` to prevent a `Column is ambiguous` error. + + + + ```sql {{ config( @@ -221,7 +224,9 @@ select 'A' as user_id, current_date as my_date ``` -`update_condition` example: + + + ```sql {{ config( @@ -254,7 +259,9 @@ select * from ( {% endif %} ``` -Example of `insert_condition`: + + + ```sql {{ config( @@ -274,6 +281,10 @@ select * from ( ``` + + + + ### High availablity table (HA) The current implementation of table materialization can lead to downtime, as the target table is dropped and re-created. To have less destructive behavior, it's possible to use the `ha` config on your `table` materialized models. It leverages the table versions feature of the glue catalog, creating a temp table and swapping the target table to the location of the temp table. This materialization is only available for `table_type=hive` and requires using unique locations. For Iceberg, high availability is the default. From c8c522315411ce7e81b1fe20e25e7f2d1ff9b1fb Mon Sep 17 00:00:00 2001 From: Matt Shaver <60105315+matthewshaver@users.noreply.github.com> Date: Thu, 29 Aug 2024 14:35:38 -0400 Subject: [PATCH 19/29] editorial changes --- website/docs/reference/resource-configs/athena-configs.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/website/docs/reference/resource-configs/athena-configs.md b/website/docs/reference/resource-configs/athena-configs.md index 1726982b4d9..cabf9172d67 100644 --- a/website/docs/reference/resource-configs/athena-configs.md +++ b/website/docs/reference/resource-configs/athena-configs.md @@ -289,6 +289,7 @@ select * from ( The current implementation of table materialization can lead to downtime, as the target table is dropped and re-created. For less destructive behavior, you can use the `ha` config on your `table` materialized models. It leverages the table versions feature of the glue catalog, which creates a temporary table and swaps the target table to the location of the temporary table. This materialization is only available for `table_type=hive` and requires using unique locations. For Iceberg, high availability is the default. +By default, the materialization keeps the last 4 table versions,but you can change it by setting `versions_to_keep`. ```sql {{ config( @@ -309,12 +310,10 @@ select 'b' as user_id, 'disabled' as status ``` -By default, the materialization keeps the last 4 table versions,but you can change it by setting `versions_to_keep`. #### HA known issues -- When swapping from a table with partitions to a table without (and the other way around), there could be a little - downtime. If high performances is needed consider bucketing instead of partitions. +- There could be a little downtime when swapping from a table with partitions to a table without (and the other way around). If higher performance is needed, consider bucketing instead of partitions. - By default, Glue "duplicates" the versions internally, so the last two versions of a table point to the same location. 
- It's recommended to set `versions_to_keep` >= 4, as this will avoid having the older location removed. @@ -322,6 +321,7 @@ By default, the materialization keeps the last 4 table versions,but you can chan Persist resource descriptions as column and relation comments to the glue data catalog, and meta as [glue table properties](https://docs.aws.amazon.com/glue/latest/dg/tables-described.html#table-properties) and [column parameters](https://docs.aws.amazon.com/glue/latest/webapi/API_Column.html). By default, documentation persistence is disabled, but it can be enabled for specific resources or groups of resources as needed. + For example: ```yaml From e29b28588162cc6d177cefc639bfd8dbdd5709f5 Mon Sep 17 00:00:00 2001 From: Matt Shaver <60105315+matthewshaver@users.noreply.github.com> Date: Thu, 29 Aug 2024 14:37:06 -0400 Subject: [PATCH 20/29] Apply suggestions from code review Co-authored-by: Ly Nguyen <107218380+nghi-ly@users.noreply.github.com> --- .../resource-configs/athena-configs.md | 41 +++++++++---------- 1 file changed, 20 insertions(+), 21 deletions(-) diff --git a/website/docs/reference/resource-configs/athena-configs.md b/website/docs/reference/resource-configs/athena-configs.md index cabf9172d67..6879667b820 100644 --- a/website/docs/reference/resource-configs/athena-configs.md +++ b/website/docs/reference/resource-configs/athena-configs.md @@ -340,26 +340,25 @@ models: primary_key: true ``` -See [persist docs](https://docs.getdbt.com/reference/resource-configs/persist_docs) for more details. +Refer to [persist_docs](https://docs.getdbt.com/reference/resource-configs/persist_docs) for more details. ## Snapshots -The adapter supports snapshot materialization. It supports both timestamp and check strategy. To create a snapshot -create a snapshot file in the snapshots directory. If the directory does not exist create one. +The adapter supports snapshot materialization. It supports both the timestamp and check strategies. To create a snapshot, create a snapshot file in the `snapshots` directory. You'll need to create this directory if it doesn't already exist. ### Timestamp strategy -To use the timestamp strategy refer to -the [dbt docs](https://docs.getdbt.com/docs/build/snapshots#timestamp-strategy-recommended) + +Refer to [Timestamp strategy](/docs/build/snapshots#timestamp-strategy-recommended) for details on how to use it. + ### Check strategy -To use the check strategy refer to the [dbt docs](https://docs.getdbt.com/docs/build/snapshots#check-strategy) +Refer to [Check strategy](/docs/build/snapshots#check-strategy) for details on how to use it. -### Hard-deletes +### Hard deletes -The materialization also supports invalidating hard deletes. Check -the [docs](https://docs.getdbt.com/docs/build/snapshots#hard-deletes-opt-in) to understand usage. +The materialization also supports invalidating hard deletes. For usage details, refer to [Hard deletes](/docs/build/snapshots#hard-deletes-opt-in). 
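As a brief sketch of the timestamp strategy with hard-delete invalidation (the snapshot, column, and model names are illustrative only):

```sql
{% snapshot orders_snapshot %}

{{
  config(
    unique_key='id',
    strategy='timestamp',
    updated_at='refresh_timestamp',
    invalidate_hard_deletes=True
  )
}}

-- illustrative source; replace with the model being snapshotted
select * from {{ ref('orders') }}

{% endsnapshot %}
```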
### Working example @@ -462,20 +461,20 @@ from {{ ref('model') }} {% endsnapshot %} ## AWS Lake Formation integration -The following is how the adapter implements AWS Lake Formation tag management: +The following describes how the adapter implements the AWS Lake Formation tag management: -- You can enable or disable lf-tags management via [config](#table-configuration) (disabled by default) -- Once you enable the feature, lf-tags will be updated on every dbt run -- First, all lf-tags for columns are removed to avoid inheritance issues +- [Enable](#table-configuration) LF tags management with the `lf_tags_config` parameter. By default, it's disabled. +- Once enabled, LF tags are updated on every dbt run. +- First, all lf-tags for columns are removed to avoid inheritance issues. - Then, all redundant lf-tags are removed from tables and actual tags from table configs are applied - Finally, lf-tags for columns are applied It's important to understand the following points: -- dbt does not manage `lf-tags` for databases -- dbt does not manage Lake Formation permissions +- dbt doesn't manage `lf-tags` for databases +- dbt doesn't manage Lake Formation permissions -That's why you should handle this by yourself manually or using an automation tool like terraform, AWS CDK, etc. You may find the following links useful to manage that: +That's why it's important to take care of this yourself or use an automation tool such as terraform and AWS CDK. For more details, refer to: * [terraform aws_lakeformation_permissions](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/lakeformation_permissions) * [terraform aws_lakeformation_resource_lf_tags](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/lakeformation_resource_lf_tags) @@ -508,7 +507,7 @@ The adapter supports Python models using [`spark`](https://docs.aws.amazon.com/a - A session is created for each unique engine configuration defined in the models that are part of the invocation. - A session's idle timeout is set to 10 minutes. Within the timeout period, if there is a new calculation (Spark Python model) ready for execution and the engine configuration matches, the process will reuse the same session. - The number of Python models running at a time depends on the `threads`. The number of sessions created for the entire run depends on the number of unique engine configurations and the availability of sessions to maintain thread concurrency. -- For Iceberg tables, it is recommended to use `table_properties` configuration to set the `format_version` to 2. This is to maintain compatibility between Iceberg tables created by Trino with those created by Spark. +- For Iceberg tables, it's recommended to use the `table_properties` configuration to set the `format_version` to `2`. This helps maintain compatibility between Iceberg tables created by Trino with those created by Spark. ### Example models @@ -612,10 +611,10 @@ def model(dbt, spark_session): ### Known issues in Python models -- Python models cannot [reference Athena SQL views](https://docs.aws.amazon.com/athena/latest/ug/notebooks-spark.html). -- Third-party Python libraries can be used, but they must be [included in the pre-installed list][pre-installed list] or [imported manually][imported manually]. -- Python models can only reference or write to tables with names meeting the regular expression: `^[0-9a-zA-Z_]+$`. Dashes and special characters are not supported by Spark, even though Athena supports them. 
-- Incremental models do not fully utilize Spark capabilities. They depend partially on existing SQL-based logic which runs on Trino. +- Python models can't [reference Athena SQL views](https://docs.aws.amazon.com/athena/latest/ug/notebooks-spark.html). +- You can use third-party Python libraries, however, they must be [included in the pre-installed list][pre-installed list] or [imported manually][imported manually]. +- Python models can only reference or write to tables with names matching the regular expression: `^[0-9a-zA-Z_]+$`. Spark doesn't support dashes or special characters, even though Athena supports them. +- Incremental models don't fully utilize Spark capabilities. They depend partially on existing SQL-based logic that runs on Trino. - Snapshot materializations are not supported. - Spark can only reference tables within the same catalog. - For tables created outside of the dbt tool, be sure to populate the location field or dbt will throw an error when trying to create the table. From 22087b908960adfd52af5743415f940abacc282a Mon Sep 17 00:00:00 2001 From: Matt Shaver <60105315+matthewshaver@users.noreply.github.com> Date: Thu, 29 Aug 2024 14:52:55 -0400 Subject: [PATCH 21/29] Adding tabs --- .../resource-configs/athena-configs.md | 143 ++++-------------- 1 file changed, 33 insertions(+), 110 deletions(-) diff --git a/website/docs/reference/resource-configs/athena-configs.md b/website/docs/reference/resource-configs/athena-configs.md index 6879667b820..785b6bdf0eb 100644 --- a/website/docs/reference/resource-configs/athena-configs.md +++ b/website/docs/reference/resource-configs/athena-configs.md @@ -360,104 +360,12 @@ Refer to [Check strategy](/docs/build/snapshots#check-strategy) for details on h The materialization also supports invalidating hard deletes. For usage details, refer to [Hard deletes](/docs/build/snapshots#hard-deletes-opt-in). 
-### Working example - -seed file - employent_indicators_november_2022_csv_tables.csv - -```csv -Series_reference,Period,Data_value,Suppressed -MEIM.S1WA,1999.04,80267, -MEIM.S1WA,1999.05,70803, -MEIM.S1WA,1999.06,65792, -MEIM.S1WA,1999.07,66194, -MEIM.S1WA,1999.08,67259, -MEIM.S1WA,1999.09,69691, -MEIM.S1WA,1999.1,72475, -MEIM.S1WA,1999.11,79263, -MEIM.S1WA,1999.12,86540, -MEIM.S1WA,2000.01,82552, -MEIM.S1WA,2000.02,81709, -MEIM.S1WA,2000.03,84126, -MEIM.S1WA,2000.04,77089, -MEIM.S1WA,2000.05,73811, -MEIM.S1WA,2000.06,70070, -MEIM.S1WA,2000.07,69873, -MEIM.S1WA,2000.08,71468, -MEIM.S1WA,2000.09,72462, -MEIM.S1WA,2000.1,74897, -``` - -model.sql - -```sql -{{ config( - materialized='table' -) }} - -select row_number() over() as id - , * - , cast(from_unixtime(to_unixtime(now())) as timestamp(6)) as refresh_timestamp -from {{ ref('employment_indicators_november_2022_csv_tables') }} -``` - -timestamp strategy - model_snapshot_1 - -```sql -{% snapshot model_snapshot_1 %} - -{{ - config( - strategy='timestamp', - updated_at='refresh_timestamp', - unique_key='id' - ) -}} - -select * -from {{ ref('model') }} {% endsnapshot %} -``` - -invalidate hard deletes - model_snapshot_2 - -```sql -{% snapshot model_snapshot_2 %} - -{{ - config - ( - unique_key='id', - strategy='timestamp', - updated_at='refresh_timestamp', - invalidate_hard_deletes=True, - ) -}} -select * -from {{ ref('model') }} {% endsnapshot %} -``` - -check strategy - model_snapshot_3 - -```sql -{% snapshot model_snapshot_3 %} - -{{ - config - ( - unique_key='id', - strategy='check', - check_cols=['series_reference','data_value'] - ) -}} -select * -from {{ ref('model') }} {% endsnapshot %} -``` - ### Snapshots known issues -- Incremental Iceberg models - Sync all columns on schema change can't remove columns used for partitioning. The only way, from a dbt perspective, is to do a full-refresh of the incremental model. +- Incremental Iceberg models - Sync all columns on schema change. Columns used for partitioning can't be removed. From a dbt perspective, the only way is to fully refresh the incremental model. - Tables, schemas and database names should only be lowercase -- In order to avoid potential conflicts, make sure [`dbt-athena-adapter`](https://github.com/Tomme/dbt-athena) is not installed in the target environment. -- Snapshot does not support dropping columns from the source table. If you drop a column make sure to drop the column from the snapshot as well. Another workaround is to NULL the column in the snapshot definition to preserve history +- To avoid potential conflicts, make sure [`dbt-athena-adapter`](https://github.com/Tomme/dbt-athena) is not installed in the target environment. +- Snapshot does not support dropping columns from the source table. If you drop a column, make sure to drop the column from the snapshot as well. Another workaround is to NULL the column in the snapshot definition to preserve the history. ## AWS Lake Formation integration @@ -466,8 +374,8 @@ The following describes how the adapter implements the AWS Lake Formation tag ma - [Enable](#table-configuration) LF tags management with the `lf_tags_config` parameter. By default, it's disabled. - Once enabled, LF tags are updated on every dbt run. - First, all lf-tags for columns are removed to avoid inheritance issues. 
-- Then, all redundant lf-tags are removed from tables and actual tags from table configs are applied -- Finally, lf-tags for columns are applied +- Then, all redundant lf-tags are removed from tables and actual tags from table configs are applied. +- Finally, lf-tags for columns are applied. It's important to understand the following points: @@ -485,10 +393,10 @@ The adapter supports Python models using [`spark`](https://docs.aws.amazon.com/a ### Setup -- A Spark-enabled workgroup created in Athena -- Spark execution role granted access to Athena, Glue and S3 +- A Spark-enabled workgroup created in Athena. +- Spark execution role granted access to Athena, Glue and S3. - The Spark workgroup is added to the `~/.dbt/profiles.yml` file and the profile to be used - is referenced in `dbt_project.yml` + is referenced in `dbt_project.yml`. ### Spark-specific table configuration @@ -505,13 +413,15 @@ The adapter supports Python models using [`spark`](https://docs.aws.amazon.com/a ### Spark notes - A session is created for each unique engine configuration defined in the models that are part of the invocation. -- A session's idle timeout is set to 10 minutes. Within the timeout period, if there is a new calculation (Spark Python model) ready for execution and the engine configuration matches, the process will reuse the same session. -- The number of Python models running at a time depends on the `threads`. The number of sessions created for the entire run depends on the number of unique engine configurations and the availability of sessions to maintain thread concurrency. -- For Iceberg tables, it's recommended to use the `table_properties` configuration to set the `format_version` to `2`. This helps maintain compatibility between Iceberg tables created by Trino with those created by Spark. +A session's idle timeout is set to 10 minutes. Within the timeout period, if a new calculation (Spark Python model) is ready for execution and the engine configuration matches, the process will reuse the same session. +- The number of Python models running simultaneously depends on the `threads`. The number of sessions created for the entire run depends on the number of unique engine configurations and the availability of sessions to maintain thread concurrency. +- For Iceberg tables, it's recommended to use the `table_properties` configuration to set the `format_version` to `2`. This helps maintain compatibility between the Iceberg tables Trino created and those Spark created. ### Example models -#### Simple pandas model + + + ```python import pandas as pd @@ -525,7 +435,9 @@ def model(dbt, session): return model_df ``` -#### Simple spark + + + ```python def model(dbt, spark_session): @@ -537,8 +449,9 @@ def model(dbt, spark_session): return df ``` + -#### Spark incremental + ```python def model(dbt, spark_session): @@ -554,7 +467,9 @@ def model(dbt, spark_session): return df ``` -#### Config spark model + + + ```python def model(dbt, spark_session): @@ -579,7 +494,11 @@ def model(dbt, spark_session): return df ``` -#### Create pySpark udf using imported external python files + + + + +Using imported external python files: ```python def model(dbt, spark_session): @@ -609,6 +528,10 @@ def model(dbt, spark_session): return df.withColumn("udf_test_col", udf_with_import(col("alpha"))) ``` + + + + ### Known issues in Python models - Python models can't [reference Athena SQL views](https://docs.aws.amazon.com/athena/latest/ug/notebooks-spark.html). 
@@ -626,6 +549,6 @@ def model(dbt, spark_session): The adapter partly supports contract definitions: -- `data_type` is supported but needs to be adjusted for complex types. Types must be specified entirely (for instance `array`) even though they won't be checked. Indeed, as dbt recommends, we only compare the broader type (array, map, int, varchar). The complete definition is used in order to check that the data types defined in Athena are ok (pre-flight check). -- The adapter does not support the constraints since there is no constraint concept in Athena. +- `data_type` is supported but needs to be adjusted for complex types. Types must be specified entirely (for example, `array`) even though they won't be checked. Indeed, as dbt recommends, we only compare the broader type (array, map, int, varchar). The complete definition is used to check that the data types defined in Athena are ok (pre-flight check). +- The adapter does not support the constraints since Athena has no constraint concept. From 445fd7a3549f06a7b2389593aed96a1c4585b72b Mon Sep 17 00:00:00 2001 From: Matt Shaver <60105315+matthewshaver@users.noreply.github.com> Date: Thu, 29 Aug 2024 14:55:20 -0400 Subject: [PATCH 22/29] Editorial changes --- website/docs/reference/resource-configs/athena-configs.md | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/website/docs/reference/resource-configs/athena-configs.md b/website/docs/reference/resource-configs/athena-configs.md index 785b6bdf0eb..a5f8a7ec876 100644 --- a/website/docs/reference/resource-configs/athena-configs.md +++ b/website/docs/reference/resource-configs/athena-configs.md @@ -535,12 +535,13 @@ def model(dbt, spark_session): ### Known issues in Python models - Python models can't [reference Athena SQL views](https://docs.aws.amazon.com/athena/latest/ug/notebooks-spark.html). -- You can use third-party Python libraries, however, they must be [included in the pre-installed list][pre-installed list] or [imported manually][imported manually]. +- You can use third-party Python libraries; however, they must be [included in the pre-installed list][pre-installed list] or [imported manually][imported manually]. - Python models can only reference or write to tables with names matching the regular expression: `^[0-9a-zA-Z_]+$`. Spark doesn't support dashes or special characters, even though Athena supports them. - Incremental models don't fully utilize Spark capabilities. They depend partially on existing SQL-based logic that runs on Trino. - Snapshot materializations are not supported. - Spark can only reference tables within the same catalog. -- For tables created outside of the dbt tool, be sure to populate the location field or dbt will throw an error when trying to create the table. +- For tables created outside of the dbt tool, be sure to populate the location field, or dbt will throw an error when creating the table. 
+ [pre-installed list]: https://docs.aws.amazon.com/athena/latest/ug/notebooks-spark-preinstalled-python-libraries.html [imported manually]: https://docs.aws.amazon.com/athena/latest/ug/notebooks-import-files-libraries.html From b2bb9637dbcdf711bbfb6206571a85e9988c0c03 Mon Sep 17 00:00:00 2001 From: Matt Shaver <60105315+matthewshaver@users.noreply.github.com> Date: Thu, 29 Aug 2024 14:56:10 -0400 Subject: [PATCH 23/29] Update website/docs/reference/resource-configs/athena-configs.md --- website/docs/reference/resource-configs/athena-configs.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/website/docs/reference/resource-configs/athena-configs.md b/website/docs/reference/resource-configs/athena-configs.md index a5f8a7ec876..4d62204c7bb 100644 --- a/website/docs/reference/resource-configs/athena-configs.md +++ b/website/docs/reference/resource-configs/athena-configs.md @@ -1,6 +1,6 @@ --- title: "Amazon Athena configurations" -description: "Reference guide for the Amazon Athena adapter for dbt Core and dbt Cloud." +description: "Reference article for the Amazon Athena adapter for dbt Core and dbt Cloud." id: "athena-configs" --- From 1823e60bcf1924c41da790de423913faf376083d Mon Sep 17 00:00:00 2001 From: Matt Shaver <60105315+matthewshaver@users.noreply.github.com> Date: Thu, 29 Aug 2024 14:57:57 -0400 Subject: [PATCH 24/29] Update website/docs/reference/resource-configs/athena-configs.md --- website/docs/reference/resource-configs/athena-configs.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/website/docs/reference/resource-configs/athena-configs.md b/website/docs/reference/resource-configs/athena-configs.md index 4d62204c7bb..c183dfbcf36 100644 --- a/website/docs/reference/resource-configs/athena-configs.md +++ b/website/docs/reference/resource-configs/athena-configs.md @@ -391,7 +391,7 @@ That's why it's important to take care of this yourself or use an automation too The adapter supports Python models using [`spark`](https://docs.aws.amazon.com/athena/latest/ug/notebooks-spark.html). -### Setup +### Prerequisites - A Spark-enabled workgroup created in Athena. - Spark execution role granted access to Athena, Glue and S3. From af258c79e0ad0fc92af1fbf1be889fb012923f88 Mon Sep 17 00:00:00 2001 From: Matt Shaver <60105315+matthewshaver@users.noreply.github.com> Date: Thu, 29 Aug 2024 15:20:50 -0400 Subject: [PATCH 25/29] Adding a table --- .../reference/resource-configs/athena-configs.md | 16 +++++++--------- 1 file changed, 7 insertions(+), 9 deletions(-) diff --git a/website/docs/reference/resource-configs/athena-configs.md b/website/docs/reference/resource-configs/athena-configs.md index a5f8a7ec876..8702c0d7464 100644 --- a/website/docs/reference/resource-configs/athena-configs.md +++ b/website/docs/reference/resource-configs/athena-configs.md @@ -400,15 +400,13 @@ The adapter supports Python models using [`spark`](https://docs.aws.amazon.com/a ### Spark-specific table configuration -- `timeout` (`default=43200`) - - Time out in seconds for each Python model execution. Defaults to 12 hours/43200 seconds. -- `spark_encryption` (`default=false`) - - If this flag is set to true, encrypts data in transit between Spark nodes and also encrypts data at rest stored locally by Spark. -- `spark_cross_account_catalog` (`default=false`) - - When using the Spark Athena workgroup, queries can only be made against catalogs located on the same AWS account by default. However, sometimes you want to query another catalog located on an external AWS account. 
Setting this additional Spark properties parameter to true will enable querying external catalogs. You can use the syntax `external_catalog_id/database.table` to access the external table on the external catalog (For example, `999999999999/mydatabase.cloudfront_logs` where 999999999999 is the external catalog ID) -- `spark_requester_pays` (`default=false`) - - When an Amazon S3 bucket is configured as requester pays, the account of the user running the query is charged for data access and data transfer fees associated with the query. - - If this flag is set to true, requester pays S3 buckets are enabled in Athena for Spark. +| Configuration | Default | Description | +|---------------|---------|--------------| +| `timeout` | 43200 | Time out in seconds for each Python model execution. Defaults to 12 hours/43200 seconds. | +| `spark_encryption` | False | When set to `true,` it encrypts data stored locally by Spark and in transit between Spark nodes. | +| `spark_cross_account_catalog` | False | When using the Spark Athena workgroup, queries can only be made against catalogs on the same AWS account by default. Setting this parameter to true will enable querying external catalogs if you want to query another catalog on an external AWS account.

Use the syntax `external_catalog_id/database.table` to access the external table on the external catalog (For example, `999999999999/mydatabase.cloudfront_logs` where 999999999999 is the external catalog ID).| +| `spark_requester_pays` | False | When set to true, if an Amazon S3 bucket is configured as `requester pays`, the user account running the query is charged for data access and data transfer fees associated with the query. | + ### Spark notes From eb774c1e22b4ff15d29791ac8eb4dac3d82f3eea Mon Sep 17 00:00:00 2001 From: Matt Shaver <60105315+matthewshaver@users.noreply.github.com> Date: Thu, 29 Aug 2024 16:21:42 -0400 Subject: [PATCH 26/29] Apply suggestions from code review --- website/docs/reference/resource-configs/athena-configs.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/website/docs/reference/resource-configs/athena-configs.md b/website/docs/reference/resource-configs/athena-configs.md index dd309a93f1d..bc716700577 100644 --- a/website/docs/reference/resource-configs/athena-configs.md +++ b/website/docs/reference/resource-configs/athena-configs.md @@ -185,9 +185,9 @@ select 'A' as user_id, Iceberg supports bucketing as hidden partitions. Use the `partitioned_by` config to add specific bucketing conditions. -Iceberg supports these table formats for data : `PARQUET`, `AVRO` and `ORC`. +Iceberg supports the `PARQUET`, `AVRO` and `ORC` table formats for data . -To use Iceberg incrementally, use one of the following supported strategies: +The following are the supported strategies for using Iceberg incrementally: - `append`: New records are appended to the table (this can lead to duplicates). - `merge`: Perform an update and insert (and optional delete), where new and existing records are added. It is only available with Athena engine version 3. From d2c0d9438f9a81239420fa14f41e94ba8b727274 Mon Sep 17 00:00:00 2001 From: Matt Shaver <60105315+matthewshaver@users.noreply.github.com> Date: Thu, 29 Aug 2024 16:26:52 -0400 Subject: [PATCH 27/29] Apply suggestions from code review --- .../reference/resource-configs/athena-configs.md | 13 ++++++------- 1 file changed, 6 insertions(+), 7 deletions(-) diff --git a/website/docs/reference/resource-configs/athena-configs.md b/website/docs/reference/resource-configs/athena-configs.md index bc716700577..dd841eccfb3 100644 --- a/website/docs/reference/resource-configs/athena-configs.md +++ b/website/docs/reference/resource-configs/athena-configs.md @@ -190,13 +190,12 @@ Iceberg supports the `PARQUET`, `AVRO` and `ORC` table formats for data . The following are the supported strategies for using Iceberg incrementally: - `append`: New records are appended to the table (this can lead to duplicates). -- `merge`: Perform an update and insert (and optional delete), where new and existing records are added. It is only available with Athena engine version 3. - - `unique_key`(required): Columns defining a unique source and target table record. - - `incremental_predicates` (optional): SQL conditions that enable custom join clauses in the merge statement. This can - help improve performance via predicate pushdown on the target table. - - `delete_condition` (optional): SQL condition used to identify records that should be deleted. - - `update_condition` (optional): SQL condition used to identify records that should be updated. - - `insert_condition` (optional): SQL condition used to identify records that should be inserted. +- `merge`: Perform an update and insert (and optional delete) where new and existing records are added. 
This is only available with Athena engine version 3. + - `unique_key`(required): Columns that define a unique source and target table record. + - `incremental_predicates` (optional): The SQL conditions that enable custom join clauses in the merge statement. This helps improve performance via predicate pushdown on target tables. + - `delete_condition` (optional): SQL condition that identifies records that should be deleted. + - `update_condition` (optional): SQL condition that identifies records that should be updated. + - `insert_condition` (optional): SQL condition that identifies records that should be inserted. `incremental_predicates`, `delete_condition`, `update_condition` and `insert_condition` can include any column of the incremental table (`src`) or the final table (`target`). Column names must be prefixed by either `src` or `target` to prevent a `Column is ambiguous` error. From c19452d7f6dff78991491ea3a6507ff84e956d91 Mon Sep 17 00:00:00 2001 From: Matt Shaver <60105315+matthewshaver@users.noreply.github.com> Date: Thu, 29 Aug 2024 16:41:42 -0400 Subject: [PATCH 28/29] Update website/docs/reference/resource-configs/athena-configs.md --- website/docs/reference/resource-configs/athena-configs.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/website/docs/reference/resource-configs/athena-configs.md b/website/docs/reference/resource-configs/athena-configs.md index dd841eccfb3..3e16ec6fc70 100644 --- a/website/docs/reference/resource-configs/athena-configs.md +++ b/website/docs/reference/resource-configs/athena-configs.md @@ -318,7 +318,7 @@ select 'b' as user_id, ### Update glue data catalog -Persist resource descriptions as column and relation comments to the glue data catalog, and meta as [glue table properties](https://docs.aws.amazon.com/glue/latest/dg/tables-described.html#table-properties) and [column parameters](https://docs.aws.amazon.com/glue/latest/webapi/API_Column.html). By default, documentation persistence is disabled, but it can be enabled for specific resources or groups of resources as needed. +You can persist your column and model level descriptions to the Glue Data Catalog as [glue table properties](https://docs.aws.amazon.com/glue/latest/dg/tables-described.html#table-properties) and [column parameters](https://docs.aws.amazon.com/glue/latest/webapi/API_Column.html). To enable this, set the configuration to `true` as shown in the following examples. By default, documentation persistence is disabled, but it can be enabled for specific resources or groups of resources as needed. 
For example: From f5c5d6cbf81b35c2d8c29cd5477b72644b671c98 Mon Sep 17 00:00:00 2001 From: Matt Shaver <60105315+matthewshaver@users.noreply.github.com> Date: Thu, 29 Aug 2024 16:42:25 -0400 Subject: [PATCH 29/29] Update website/docs/reference/resource-configs/athena-configs.md --- website/docs/reference/resource-configs/athena-configs.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/website/docs/reference/resource-configs/athena-configs.md b/website/docs/reference/resource-configs/athena-configs.md index 3e16ec6fc70..f871ede9fab 100644 --- a/website/docs/reference/resource-configs/athena-configs.md +++ b/website/docs/reference/resource-configs/athena-configs.md @@ -318,7 +318,7 @@ select 'b' as user_id, ### Update glue data catalog -You can persist your column and model level descriptions to the Glue Data Catalog as [glue table properties](https://docs.aws.amazon.com/glue/latest/dg/tables-described.html#table-properties) and [column parameters](https://docs.aws.amazon.com/glue/latest/webapi/API_Column.html). To enable this, set the configuration to `true` as shown in the following examples. By default, documentation persistence is disabled, but it can be enabled for specific resources or groups of resources as needed. +You can persist your column and model level descriptions to the Glue Data Catalog as [glue table properties](https://docs.aws.amazon.com/glue/latest/dg/tables-described.html#table-properties) and [column parameters](https://docs.aws.amazon.com/glue/latest/webapi/API_Column.html). To enable this, set the configuration to `true` as shown in the following example. By default, documentation persistence is disabled, but it can be enabled for specific resources or groups of resources as needed. For example:
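A minimal `dbt_project.yml` sketch of that setting (the project key is illustrative only):

```yaml
models:
  my_dbt_project:  # illustrative project name
    +persist_docs:
      relation: true  # persist model descriptions as table comments
      columns: true   # persist column descriptions as column comments
```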