
Make MAXIMUM_SEED_SIZE_MIB configurable #11177

Open: jtcohen6 wants to merge 2 commits into base: main
Conversation

jtcohen6 (Contributor)

resolves #7117
resolves #7124

Reapply changes from #7125. (This proved easier than rebasing the commits directly.)

Problem

We apply an arbitrary limit of 1 MiB to seeds (CSVs), for the specific purpose of hashing contents and comparing those hashed contents during state:modified comparison. (That's a mebibyte, not a megabyte, for those who care to distinguish between the two).

That is, dbt doesn't raise an error if it detects a seed larger than MAXIMUM_SEED_SIZE_MIB, but certain features become unavailable and dbt does not make any guarantees about acceptable performance.

We could adjust this for inflation (1 MiB in 2020 is worth ~1.2 MiB today), but this PR takes the more forward-looking approach of making the value configurable by end users. Users can instruct dbt to compare the contents of larger seeds, so long as they're willing to "pay the price" of hashing the contents of large seeds.

We will need to update docs: https://docs.getdbt.com/reference/node-selection/state-comparison-caveats

Solution

From #7125:

Increasing the maximum size of seed files whose contents are hashed for state comparison enables greater use of deferred runs when seed file contents have been updated. Making this configurable via an environment variable lets users override the default 1 MiB limit.

Furthermore, to support reading larger files in memory-constrained environments, a new method is added that reads and hashes file contents incrementally. Because it operates on raw bytes for better performance, small files continue to use the previous UTF-8 content method, so existing state is not invalidated for seed files that are not stored as UTF-8.
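A minimal sketch of how an environment-variable override like this can work. The variable name `DBT_MAXIMUM_SEED_SIZE_MIB` appears in the test transcript later in this thread; the helper name and default-resolution logic here are assumptions, not the actual dbt-core implementation:

```python
import os

DEFAULT_MAXIMUM_SEED_SIZE_MIB = 1  # dbt's historical hard-coded limit


def maximum_seed_size_bytes() -> int:
    """Resolve the seed-size limit in bytes, letting an env var override the default."""
    mib = int(os.environ.get("DBT_MAXIMUM_SEED_SIZE_MIB", DEFAULT_MAXIMUM_SEED_SIZE_MIB))
    # MiB, not MB: 1 MiB = 1,048,576 bytes
    return mib * 1024 * 1024
```

With the default in place this yields 1,048,576 bytes; setting `DBT_MAXIMUM_SEED_SIZE_MIB=10` raises the limit to 10 MiB.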

Checklist

  • I have read the contributing guide and understand what's expected of me.
  • I have run this code in development, and it appears to resolve the stated issue.
  • This PR includes tests, or tests are not required or relevant for this PR.
  • This PR has no interface changes (e.g., macros, CLI, logs, JSON artifacts, config files, adapter interface, etc.) or this PR has already received feedback and approval from Product or DX.
  • This PR includes type annotations for new and modified functions.

Co-authored-by: Noah Holm <[email protected]>
Co-authored-by: Jeremy Cohen <[email protected]>
@jtcohen6 jtcohen6 requested a review from a team as a code owner December 24, 2024 16:26
@cla-bot cla-bot bot added the cla:yes label Dec 24, 2024

codecov bot commented Dec 24, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 88.88%. Comparing base (459d156) to head (20be925).

Additional details and impacted files
@@            Coverage Diff             @@
##             main   #11177      +/-   ##
==========================================
- Coverage   88.93%   88.88%   -0.05%     
==========================================
  Files         186      186              
  Lines       24054    24075      +21     
==========================================
+ Hits        21392    21400       +8     
- Misses       2662     2675      +13     
| Flag | Coverage Δ |
| --- | --- |
| integration | 86.19% <93.33%> (-0.14%) ⬇️ |
| unit | 62.03% <46.66%> (-0.02%) ⬇️ |

Flags with carried forward coverage won't be shown.

| Component | Coverage Δ |
| --- | --- |
| Unit Tests | 62.03% <46.66%> (-0.02%) ⬇️ |
| Integration Tests | 86.19% <93.33%> (-0.14%) ⬇️ |

jtcohen6 (Contributor, Author) left a comment

🎩 Given a my_small_seed.csv (<1 MB) and my_large_seed.csv (~3 MB)

Using dbt-core@main:

% dbt parse && mv target/manifest.json state && dbt ls -s state:modified --state state    
16:29:42  Running with dbt=1.10.0-a1
16:29:42  Registered adapter: duckdb=1.9.1
16:29:42  Performance info: /Users/jerco/dev/scratch/testy/target/perf_info.json
16:29:43  Running with dbt=1.10.0-a1
16:29:43  Registered adapter: duckdb=1.9.1
16:29:43  Found 1 operation, 1 model, 2 seeds, 424 macros
16:29:43  Found a seed (testy.my_large_seed) >1MB in size at the same path, dbt cannot tell if it has changed: assuming they are the same
16:29:43  The selection criterion 'state:modified' does not match any enabled nodes
16:29:43  No nodes selected!

Switching to dbt-core@jerco/redo-pr-7125 (without recreating state/manifest.json):

% dbt ls -s state:modified --state state    
16:30:03  Running with dbt=1.10.0-a1
16:30:03  Registered adapter: duckdb=1.9.1
16:30:03  Found 1 operation, 1 model, 2 seeds, 424 macros
16:30:03  Found a seed (testy.my_large_seed) >1MiB in size at the same path, dbt cannot tell if it has changed: assuming they are the same
16:30:03  The selection criterion 'state:modified' does not match any enabled nodes
16:30:03  No nodes selected!

This confirms that the checksum of my_small_seed is unchanged between the two branches, while the large seed is still skipped at the default 1 MiB limit.

Now, let's set the config and redo:

% export DBT_MAXIMUM_SEED_SIZE_MIB=10
% dbt parse && mv target/manifest.json state && dbt ls -s state:modified --state state
16:30:45  Running with dbt=1.10.0-a1
16:30:45  Registered adapter: duckdb=1.9.1
16:30:45  Performance info: /Users/jerco/dev/scratch/testy/target/perf_info.json
16:30:46  Running with dbt=1.10.0-a1
16:30:46  Registered adapter: duckdb=1.9.1
16:30:46  Found 1 operation, 1 model, 2 seeds, 424 macros
16:30:46  The selection criterion 'state:modified' does not match any enabled nodes
16:30:46  No nodes selected!

Manually edit one row in the large seed:

% dbt ls -s state:modified --state state
16:36:58  Running with dbt=1.10.0-a1
16:36:58  Registered adapter: duckdb=1.9.1
16:36:58  Found 1 operation, 1 model, 2 seeds, 424 macros
testy.my_large_seed

```diff
                # We don't want to calculate a hash of this file. Use the path.
                source_file = SourceFile.big_seed(match)
            else:
                file_contents = load_file_contents(match.absolute_path, strip=True)
-               checksum = FileHash.from_contents(file_contents)
+               checksum = FileHash.from_path(match.absolute_path)
```
jtcohen6 (Contributor, Author) commented:

I have confirmed that this is not a "breaking" change, insofar as the same seed produces the same checksum before and after this change.

```diff
@@ -60,6 +61,27 @@ def from_contents(cls, contents: str, name="sha256") -> "FileHash":
         checksum = hashlib.new(name, data).hexdigest()
         return cls(name=name, checksum=checksum)

+    @classmethod
+    def from_path(cls, path: str, name="sha256") -> "FileHash":
```
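The incremental read that a method like `from_path` performs can be sketched as follows. This is a standalone approximation under stated assumptions (the function name, chunk size, and return type are illustrative; it is not the actual dbt-core implementation), showing why streaming chunks through `hashlib` avoids loading a multi-MiB seed into memory at once:

```python
import hashlib


def hash_file_incrementally(path: str, name: str = "sha256",
                            chunk_size: int = 1024 * 1024) -> str:
    """Hash a file's raw bytes without reading the whole file into memory."""
    h = hashlib.new(name)
    with open(path, "rb") as f:
        # Read fixed-size chunks until EOF; iter() stops at the b"" sentinel.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```

For any file, this produces the same digest as hashing the full byte string in one call, which is what makes a chunked reader a drop-in replacement for whole-file hashing.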
jtcohen6 (Contributor, Author) commented:

Non-breaking change in the dbt/artifacts/ directory, so I will add the artifact_minor_upgrade label to this PR

@jtcohen6 jtcohen6 added the artifact_minor_upgrade label (to bypass the CI check by confirming that the change is not breaking) Dec 24, 2024
Labels: artifact_minor_upgrade, cla:yes