feat: Implement support bucket function for more than 100 partitions #549

sanromeo · 2023-12-28T11:51:08Z

Description

Resolves: #529

Fix the error `FUNCTION_NOT_FOUND: line 2:21: Function 'bucket' not registered` by adding a regex that extracts the bucket column from the partitioned_by config
Implement a murmur3_hash method that calculates the bucket number of a bucket column value the same way Athena does for Iceberg tables
Implement support for more than 100 partitions when the bucket function is used in the partitioned_by config for Iceberg tables
Improve the get_partition_batches macro so that bucket_column values are assigned to their bucket number and then added to the final WHERE condition for each batch
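
The first two points can be sketched in Python as follows. This is a minimal illustration, not the adapter's actual code: the names `BUCKET_RE`, `parse_bucket`, `murmur3_32` and `bucket_number` are hypothetical, and the hash is written in pure Python only to keep the example self-contained. It assumes the Iceberg bucket transform for strings, i.e. the 32-bit Murmur3 hash (x86 variant, seed 0) of the UTF-8 bytes, masked to a non-negative value and taken modulo the number of buckets.

```python
import re
from typing import Optional, Tuple

# Hypothetical pattern: matches entries such as 'bucket(random_str, 5)'.
BUCKET_RE = re.compile(r"^\s*bucket\(\s*(.+?)\s*,\s*(\d+)\s*\)\s*$", re.IGNORECASE)


def parse_bucket(expr: str) -> Optional[Tuple[str, int]]:
    """Return (column, num_buckets) for a bucket() entry, else None."""
    m = BUCKET_RE.match(expr)
    return (m.group(1), int(m.group(2))) if m else None


def murmur3_32(data: bytes, seed: int = 0) -> int:
    """32-bit Murmur3 (x86 variant), returned as an unsigned integer."""
    c1, c2, mask = 0xCC9E2D51, 0x1B873593, 0xFFFFFFFF
    h = seed & mask
    n_blocks = len(data) // 4
    for i in range(n_blocks):
        # Mix each full 4-byte little-endian block into the state.
        k = int.from_bytes(data[4 * i:4 * i + 4], "little")
        k = (k * c1) & mask
        k = ((k << 15) | (k >> 17)) & mask
        k = (k * c2) & mask
        h ^= k
        h = ((h << 13) | (h >> 19)) & mask
        h = (h * 5 + 0xE6546B64) & mask
    tail = data[4 * n_blocks:]
    if tail:
        # Mix the remaining 1 to 3 bytes.
        k = int.from_bytes(tail, "little")
        k = (k * c1) & mask
        k = ((k << 15) | (k >> 17)) & mask
        k = (k * c2) & mask
        h ^= k
    # Finalization (avalanche) step.
    h ^= len(data)
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & mask
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & mask
    h ^= h >> 16
    return h


def bucket_number(value: str, num_buckets: int) -> int:
    """Bucket of a string value: (hash & Integer.MAX_VALUE) % N."""
    return (murmur3_32(value.encode("utf-8")) & 0x7FFFFFFF) % num_buckets


print(parse_bucket("bucket(random_str, 5)"))  # ('random_str', 5)
print(parse_bucket("DAY(date_column)"))       # None
```

Only string values are handled here. Other column types would need their own byte encoding before hashing, which is consistent with the unsupported types listed in this description.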

Tested with the following column types for the bucket function:

Not yet supported for the following types (additional implementation is needed in the dbt-athena-community adapter):

Bytes
UUID
Decimal

Model used to test

{{
  config(
    schema='sandbox',
    materialized='table',
    partitioned_by=['DAY(date_column)', 'doy', 'bucket(random_str, 5)'],
    table_type='iceberg'
  )
}}

WITH random_strings AS (
    SELECT
        CHR(CAST(65 + FLOOR(RANDOM() * 26) AS BIGINT)) ||
        CHR(CAST(65 + FLOOR(RANDOM() * 26) AS BIGINT)) ||
        CHR(CAST(65 + FLOOR(RANDOM() * 26) AS BIGINT)) AS random_str
    FROM
        (SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3) AS temp_table
)
SELECT
    CAST(date_column AS DATE) as date_column,
    doy(date_column) as doy,
    rnd.random_str
FROM (
    VALUES (
        SEQUENCE(FROM_ISO8601_DATE('2023-01-01'), FROM_ISO8601_DATE('2023-07-24'), INTERVAL '1' DAY)
    )
) AS t1(date_array)
CROSS JOIN UNNEST(date_array) AS t2(date_column)
JOIN random_strings rnd ON true

A part of log output:

10:19:45  BATCH PROCESSING: 5 OF 5
10:19:45  Using athena connection "model.test"
10:19:45  On model.test: 
    insert into "awsdatacatalog"."sandbox"."test__ha" ("date_column", "doy", "random_str")
                select "date_column", "doy", "random_str"
                from "awsdatacatalog"."sandbox"."test__ha__tmp_not_partitioned"
                where (date_trunc('day', date_column)=DATE'2023-07-20' and doy=201 and random_str IN ('ASK', 'WLI')) or (date_trunc('day', date_column)=DATE'2023-07-20' and doy=201 and random_str IN ('JIC')) or (date_trunc('day', date_column)=DATE'2023-07-21' and doy=202 and random_str IN ('ASK', 'WLI')) or (date_trunc('day', date_column)=DATE'2023-07-21' and doy=202 and random_str IN ('JIC')) or (date_trunc('day', date_column)=DATE'2023-07-22' and doy=203 and random_str IN ('ASK', 'WLI')) or (date_trunc('day', date_column)=DATE'2023-07-22' and doy=203 and random_str IN ('JIC')) or (date_trunc('day', date_column)=DATE'2023-07-23' and doy=204 and random_str IN ('ASK', 'WLI')) or (date_trunc('day', date_column)=DATE'2023-07-23' and doy=204 and random_str IN ('JIC')) or (date_trunc('day', date_column)=DATE'2023-07-24' and doy=205 and random_str IN ('ASK', 'WLI')) or (date_trunc('day', date_column)=DATE'2023-07-24' and doy=205 and random_str IN ('JIC'))
10:19:45  Athena adapter: Athena query ID 75e66acd-b28d-44c1-aa69-95164b903b9d
10:19:52  SQL status: OK 15 in 7.0 seconds

For comparison, this is how the values above map to the generated WHERE clause:

`JIC` has bucket number 3
`ASK` and `WLI` have bucket number 4
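
The grouping and batching visible in the log can be sketched as below. This is an illustrative reconstruction rather than the macro itself: `bucket_conditions` and `chunk` are hypothetical names, the value-to-bucket mapping is copied from the numbers reported above, and the batch size of 100 is an assumption based on the Athena limit of 100 partitions per query.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def bucket_conditions(column: str, value_to_bucket: Dict[str, int]) -> List[str]:
    """Build one IN (...) condition per bucket number."""
    groups: Dict[int, List[str]] = defaultdict(list)
    for value, bucket in value_to_bucket.items():
        groups[bucket].append(value)
    conditions = []
    for bucket in sorted(groups):
        values = ", ".join(f"'{v}'" for v in sorted(groups[bucket]))
        conditions.append(f"{column} IN ({values})")
    return conditions


def chunk(items: Sequence, size: int = 100) -> List[Sequence]:
    """Split partition conditions into batches of at most `size` entries."""
    return [items[i:i + size] for i in range(0, len(items), size)]


# Bucket numbers as reported in this PR description.
conds = bucket_conditions("random_str", {"JIC": 3, "ASK": 4, "WLI": 4})
print(conds)  # ["random_str IN ('JIC')", "random_str IN ('ASK', 'WLI')"]

# 205 days x 2 bucket groups = 410 partition conditions.
batches = chunk(list(range(410)))
print(len(batches), len(batches[-1]))  # 5 10
```

The last line matches the log: five batches, with the fifth holding the ten remaining conditions (five days times two bucket groups). The order of the conditions in the sketch is by bucket number and may differ from the order the macro emits.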

Checklist

You followed contributing section
You kept your Pull Request small and focused on a single feature or bug fix.
You added unit testing when necessary
You added functional testing when necessary

sanromeo · 2023-12-28T11:53:27Z

Now I will also write and add the unit and functional tests 👌

svdimchenko · 2023-12-28T23:12:39Z

@sanromeo awesome job done 💪
@mrshu could you please test if this PR resolves your issue with bucketing ?

svdimchenko · 2024-01-04T10:32:10Z

@nicor88 @Jrmyy do you want to test it as well ?

nicor88

@svdimchenko looks good, and the functional tests are enough for me.

Thanks for the implementation @sanromeo 💯

Gatsby-Lee · 2024-12-18T00:33:37Z

encountered this issue.
Thank you for the fix @sanromeo

Implement support bucket function for more than 100 partitions

3f28a80

sanromeo requested review from jessedobbelaere, Jrmyy, mattiamatrix, nicor88 and svdimchenko as code owners December 28, 2023 11:51

sanromeo changed the title ~~Implement support bucket function for more than 100 partitions~~ feat: Implement support bucket function for more than 100 partitions Dec 28, 2023

svdimchenko reviewed Dec 28, 2023

dev-requirements.txt (outdated, resolved)

sanromeo added 4 commits December 28, 2023 17:20

Add unit-tests

7f92e24

remove mmh3 from dev-requirements.txt

2d05a5f

fix unit test

6f555d6

Add functional test for bucket partitioning

8fe887a

svdimchenko reviewed Dec 28, 2023

dbt/adapters/athena/impl.py (outdated, resolved)

svdimchenko added the enable-functional-tests label Dec 28, 2023

svdimchenko reviewed Dec 28, 2023

tests/unit/test_adapter.py (outdated, resolved)

svdimchenko reviewed Dec 28, 2023

dbt/adapters/athena/impl.py (outdated, resolved)

Change typing, use parametrize in tests, add link to adopted method

4bb41fc

nicor88 reviewed Jan 2, 2024

tests/functional/adapter/test_partitions.py (resolved)

nicor88 reviewed Jan 2, 2024

tests/unit/test_adapter.py (outdated, resolved)

nicor88 reviewed Jan 2, 2024

dbt/include/athena/macros/materializations/models/helpers/get_partition_batches.sql (outdated, resolved)

sanromeo and others added 8 commits January 3, 2024 10:20

Merge branch 'dbt-athena:main' into feat/implementation-athena-partit…

c71bef3

…ion-limit-for-bucketing

Add functional test for only bucket partitioning, Add unit test for d…

25b4f9f

…ate and datetime column, Separate logic for bucket and non-bucket columns in different macros

Fix loop compilation error

095dfcd

Fix enumerate compilation error

4173e93

Fix counter

156c903

Fix counter

5eb3621

Merge branch 'dbt-athena:main' into feat/implementation-athena-partit…

ec20fd1

…ion-limit-for-bucketing

fix regex in get_partition_batches marcos

0902cdd

sanromeo added 2 commits January 3, 2024 16:22

fix regex in get_partition_batches marcos

fb7d986

fix regex in process_bucket_column macros

f590f2f

svdimchenko approved these changes Jan 4, 2024

nicor88 approved these changes Jan 4, 2024

svdimchenko merged commit 7e53aca into dbt-labs:main Jan 4, 2024
10 checks passed
