Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make MAXIMUM_SEED_SIZE_MIB configurable #7125

Open
wants to merge 10 commits into
base: main
Choose a base branch
from

Conversation

noppaz
Copy link

@noppaz noppaz commented Mar 5, 2023

resolves #7117
resolves #7124

Description

Increasing the maximum size of seed files where the content is hashed for state comparison will enable a greater use of deferred runs with updated seed file contents. By making this a configuration with an environment variable users are able to override the default 1 MiB limit.

Furthermore, to support reading larger files in memory constrained environments, a new method is added to read and compute file contents incrementally. As this is performed on bytes for better performance we continue using the previous UTF-8 content method for small files to not mess up current states where seed files are not stored with UTF-8.

Checklist

@noppaz noppaz requested review from a team as code owners March 5, 2023 09:53
@noppaz noppaz requested review from peterallenwebb and aranke March 5, 2023 09:53
@cla-bot
Copy link

cla-bot bot commented Mar 5, 2023

Thanks for your pull request, and welcome to our community! We require contributors to sign our Contributor License Agreement and we don't seem to have your signature on file. Check out this article for more information on why we have a CLA.

In order for us to review and merge your code, please submit the Individual Contributor License Agreement form attached above above. If you have questions about the CLA, or if you believe you've received this message in error, please reach out through a comment on this PR.

CLA has not been signed by users: @noppaz

@noppaz noppaz force-pushed the incremental-seed-hash branch from da6c118 to 880b035 Compare March 6, 2023 08:33
@cla-bot cla-bot bot added the cla:yes label Mar 6, 2023
@acurtis-evi
Copy link

acurtis-evi commented Mar 6, 2023

I added a PR against your PR for changes that I'd recommend. Added change for compatibility and merging our changes.

acurtis-evi and others added 2 commits March 7, 2023 14:44
* Make MAXIMUM_SEED_SIZE configurable

* Updated MAXIMUM_SEED_SIZE_NAME

* Added changie

* Update core/dbt/constants.py

Co-authored-by: Doug Beatty <[email protected]>

* Adding suggested change

* Fixed comparison

* Added comment

* Added constants change and removed changie

---------

Co-authored-by: Doug Beatty <[email protected]>
@noppaz noppaz changed the title compute seed file hash incrementally Make MAXIMUM_SEED_SIZE configurable Mar 7, 2023
@acurtis-evi
Copy link

This looks good to me!

@dbeatty10 dbeatty10 added the ready_for_review Externally contributed PR has functional approval, ready for code review from Core engineering label Mar 7, 2023
Copy link
Contributor

@jtcohen6 jtcohen6 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR @noppaz! A few comments after a first read through the code.

MAXIMUM_SEED_SIZE_NAME = "1MB"

def get_max_seed_size():
mx = os.getenv("DBT_MAXIMUM_SEED_SIZE", "1")
Copy link
Contributor

@jtcohen6 jtcohen6 Mar 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer that this config's name include the unit: DBT_MAXIMUM_SEED_SIZE_MIB

Rather than accessing an env var directly here, this should really be implemented as a "global config" (available in our Flags module). I don't think it would make very much sense to pass as a CLI flag (--maximum-seed-size), but I could very much see someone wanting to set this in "user config":

# profiles.yml -- for now
config:
  maximum_seed_size_mb: 5

That means:

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have addressed this in fa836bb. It required some additional moving of the usage of the global flag as it is not a constant anymore and the constants.py file is referenced much earlier in the import cycle.

Comment on lines 13 to 14
DEFAULT_MAXIMUM_SEED_SIZE = 1 * 1024 * 1024
MAXIMUM_SEED_SIZE = get_max_seed_size() * DEFAULT_MAXIMUM_SEED_SIZE
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need both of these constants defined. MAXIMUM_SEED_SIZE should just be get_max_seed_size() * 1024 * 1024. The get_max_seed_size() function will return the default value of 1 if the user does not set/override.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for checking @jtcohen6!

One reason for keeping DEFAULT_MAXIMUM_SEED_SIZE would be to be able to keep the logic of load_seed_source_file() in read_files.py. Since we do not convert the file from bytes in the new incremental logic we opted to continue computing the checksum with the "old" method for files smaller than 1 MiB so that we'd ensure backward compatibility with current manifests. This is since the checksum of a file encoded as anything other than UTF-8 will be different depending on if it gets UTF-8 encoded or is read as raw bytes.

The reason for not converting from bytes to UTF-8 string is performance and somewhat cleaner code but if we want to remove DEFAULT_MAXIMUM_SEED_SIZE maybe we should remove the elif and only compute the seed file hash incrementally instead. I see two options then:

  1. Read the file as string and ensure UTF-8 encoding before hashing
  2. Continue reading the raw bytes and accept that some projects with seed files that are not utf-8 might have a diff in their manifest deferral when upgrading to this version of dbt-core.

What's your thoughts on this?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've implemented this suggestion with removal of DEFAULT_MAXIMUM_SEED_SIZE and went with alternative 1 above. The changes can be seen in 53021d9.

@noppaz noppaz requested a review from a team April 10, 2023 14:05
@noppaz
Copy link
Author

noppaz commented May 3, 2023

Hey @jtcohen6, I was hoping we'd manage to find some more of your time after 1.5 release (which is looking extremely good btw). Would appreciate some more looks at this. I can't manage to tag @dbt-labs/oss-maintainers mentioned here, therefore pinging you directly.

@dbeatty10
Copy link
Contributor

@noppaz sorry for our silence here!

@jtcohen6 has limited availability currently, so I'm going to give a read through the discussion so far and try out this branch.

@dbeatty10
Copy link
Contributor

Will close and re-open to try to trigger the CI checks.

@dbeatty10 dbeatty10 closed this May 17, 2023
@dbeatty10 dbeatty10 reopened this May 17, 2023
@dbeatty10 dbeatty10 requested a review from a team as a code owner May 17, 2023 16:31
@dbeatty10 dbeatty10 requested review from emmyoop and removed request for a team May 17, 2023 16:31
@noppaz noppaz changed the title Make MAXIMUM_SEED_SIZE configurable Make MAXIMUM_SEED_SIZE_MIB configurable May 19, 2023
@noppaz
Copy link
Author

noppaz commented May 19, 2023

@dbeatty10 thanks for the tests, I've made adjustments and I'm able to get them passing locally now so I think another round of tests and eyes on the code would be the next step.

@noppaz
Copy link
Author

noppaz commented May 23, 2023

I see that the tests are failing on Windows only now. I think the issue is related to this comment.

However, this shines some light on what is kind of weird to me; the expected checksum is different on Windows and Linux/MacOS as can be seen in the tests from c2e2958. Should we not generate the same checksum so manifests can be correctly compared across platforms?

@emmyoop emmyoop removed their request for review October 5, 2023 13:25
@JavierLopezT
Copy link

JavierLopezT commented Mar 12, 2024

@noppaz @dbeatty10 Currently our CI/CD doesn't work because we have one seed bigger than 1MiB, so we would need this.

Is there any intention of finishing it up? Thank you very much!

@noppaz
Copy link
Author

noppaz commented Mar 14, 2024

@noppaz @dbeatty10 Currently our CI/CD doesn't work because we have one seed bigger than 1MiB, so we would need this.

Is there any intention of finishing it up? Thank you very much!

Hey @JavierLopezT, with no intention of sounding bitter, I gave up on this work due to low interest from dbt Labs. Would be happy to finish it up if they'd be up for it. The problem remains for myself as well.

@dbeatty10 dbeatty10 added the community This PR is from a community member label Apr 3, 2024
jtcohen6 added a commit that referenced this pull request Dec 24, 2024
Co-authored by: Noah Holm <[email protected]>
Co-authored by: Jeremy Cohen <[email protected]>
jtcohen6 pushed a commit that referenced this pull request Dec 24, 2024
Co-authored by: Noah Holm <[email protected]>
Co-authored by: Jeremy Cohen <[email protected]>
@jtcohen6
Copy link
Contributor

@noppaz We've heard some renewed interest in making seed size configurable (for the purposes of state:modified comparison), so I ported the changes from this PR into a new one:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla:yes community This PR is from a community member ready_for_review Externally contributed PR has functional approval, ready for code review from Core engineering
Projects
None yet
5 participants