Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speed up bundle uploads #193

Merged
merged 4 commits into from
Feb 21, 2024
Merged

Speed up bundle uploads #193

merged 4 commits into from
Feb 21, 2024

Conversation

dtheyer
Copy link
Contributor

@dtheyer dtheyer commented Feb 12, 2024

Currently Bundle uploads can take a significant amount of time (on the order of 10s of seconds) for larger repos. A significant amount of time is spent checking if the files being uploaded to the bundle match a glob pattern in a chefignores file.

The time complexity of checking chefignores ends up being roughly O(2n*m) where n is the number of files in the repo and m is the number of glob patterns in the chef ignore file. This is only roughly correct since chefignore can be overridden in subdirectories which allows m to be variable.

The previous implementation serially performed a DFS of the repo, checked the chefignore file, and wrote the files to the tgz. This implementation still serially performs a DFS, but it parallelizes the chef ignore check and initial archive generation. It generates intermediate tars in tempfiles, then combines them into a tgz. The user can specify the number of processes used and the compression level via the "bundle_generation_processes" and "bundle_compression_level" config keys respectively.

One other notable change is that rather than using just the chefignore file in the base repo, this implementation checks for chefignore files per file which allows it to correctly pick up chefignore files in subdirectories. This has a significant performance penalty, but this is regression if offset by the parallelization changes.

For testing I started with a repo consisting of the cookbooks in https://github.com/facebook/chef-cookbooks
To increase the size of the repo I made fake files using for i in {1..n}; do base64 /dev/urandom | head -c 10000 > file$i.txt; done

Repo Size Implementation time
small (674 files) old real 0m1.041s -- user 0m0.908s -- sys 0m0.127s
small (674 files) 32 processes real 0m0.995s -- user 0m1.833s -- sys 0m0.810s
small (674 files) 4 processes real 0m1.096s -- user 0m1.224s -- sys 0m0.287s
small (674 files) 1 process real 0m1.208s -- user 0m1.019s -- sys 0m0.182s
medium (6293 files) old real 0m2.711s -- user 0m2.438s -- sys 0m0.264s
medium (6293 files) 32 processes real 0m1.609s -- user 0m4.510s -- sys 0m1.261s
medium (6293 files) 4 processes real 0m2.007s -- user 0m3.440s -- sys 0m0.704s
medium (6293 files) 1 process real 0m3.917s -- user 0m3.300s -- sys 0m0.608s
Large (16293) old real 0m18.197s -- user 0m15.976s -- sys 0m2.167s
Large (16293) 32 processes real 0m12.031s -- user 0m19.636s -- sys 0m3.735s
Large (16293) 4 processes real 0m15.260s -- user 0m18.572s -- sys 0m2.977s
Large (16293) 1 process real 0m23.627s -- user 0m18.591s -- sys 0m4.984s

@joshuamiller01
Copy link
Contributor

Certainly like the idea. Are the CI failures related?

Also, in your testing, can you confirm the bundle contents before and after are equivalent (or if not equivalent, only different in expected ways)?

@dtheyer
Copy link
Contributor Author

dtheyer commented Feb 15, 2024

The CI failures look related to my use of an endless range which looks to have been added for ruby 2.6, but is not handled correctly by rubo cop? rubocop/rubocop#7159

I can refactor that line to remove the endless range operator and make the CI happy.

I can confirm in my testing the bundles were equivalent with the exception of the new logic using subdirectory chefignore files. I validated this by getting a file list using tar to get a file list before and after and diffing them:

$ tar -ztvf old_tt.tgz | sort > /tmp/output_old.txt
$ tar -ztvf old_tt.tgz | sort > /tmp/output_new.txt

I tested this with both the synthetic repo I described earlier as well as with a repo that is used in production.

@dafyddcrosby
Copy link
Contributor

The CI failures look related to my use of an endless range which looks to have been added for ruby 2.6, but is not handled correctly by rubo cop? rubocop/rubocop#7159

I can refactor that line to remove the endless range operator and make the CI happy.

I think it's that we still target Ruby 2.5 syntax https://github.com/facebook/chef-cookbooks/blob/main/.rubocop.yml#L3. I should be bumping that to 2.7 in ~next month, but might be easier to refactor in the meantime

@joshuamiller01 joshuamiller01 merged commit 214307e into facebook:main Feb 21, 2024
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants