Speed up bundle uploads #193

dtheyer · 2024-02-12T19:54:22Z

Currently Bundle uploads can take a significant amount of time (on the order of 10s of seconds) for larger repos. A significant amount of time is spent checking if the files being uploaded to the bundle match a glob pattern in a chefignores file.

The time complexity of checking chefignores ends up being roughly O(2n*m) where n is the number of files in the repo and m is the number of glob patterns in the chef ignore file. This is only roughly correct since chefignore can be overridden in subdirectories which allows m to be variable.

The previous implementation serially performed a DFS of the repo, checked the chefignore file, and wrote the files to the tgz. This implementation still serially performs a DFS, but it parallelizes the chef ignore check and initial archive generation. It generates intermediate tars in tempfiles, then combines them into a tgz. The user can specify the number of processes used and the compression level via the "bundle_generation_processes" and "bundle_compression_level" config keys respectively.

One other notable change is that rather than using just the chefignore file in the base repo, this implementation checks for chefignore files per file which allows it to correctly pick up chefignore files in subdirectories. This has a significant performance penalty, but this is regression if offset by the parallelization changes.

For testing I started with a repo consisting of the cookbooks in https://github.com/facebook/chef-cookbooks
To increase the size of the repo I made fake files using for i in {1..n}; do base64 /dev/urandom | head -c 10000 > file$i.txt; done

Repo Size	Implementation	time
small (674 files)	old	real 0m1.041s -- user 0m0.908s -- sys 0m0.127s
small (674 files)	32 processes	real 0m0.995s -- user 0m1.833s -- sys 0m0.810s
small (674 files)	4 processes	real 0m1.096s -- user 0m1.224s -- sys 0m0.287s
small (674 files)	1 process	real 0m1.208s -- user 0m1.019s -- sys 0m0.182s
medium (6293 files)	old	real 0m2.711s -- user 0m2.438s -- sys 0m0.264s
medium (6293 files)	32 processes	real 0m1.609s -- user 0m4.510s -- sys 0m1.261s
medium (6293 files)	4 processes	real 0m2.007s -- user 0m3.440s -- sys 0m0.704s
medium (6293 files)	1 process	real 0m3.917s -- user 0m3.300s -- sys 0m0.608s
Large (16293)	old	real 0m18.197s -- user 0m15.976s -- sys 0m2.167s
Large (16293)	32 processes	real 0m12.031s -- user 0m19.636s -- sys 0m3.735s
Large (16293)	4 processes	real 0m15.260s -- user 0m18.572s -- sys 0m2.977s
Large (16293)	1 process	real 0m23.627s -- user 0m18.591s -- sys 0m4.984s

joshuamiller01 · 2024-02-15T18:49:36Z

Certainly like the idea. Are the CI failures related?

Also, in your testing, can you confirm the bundle contents before and after are equivalent (or if not equivalent, only different in expected ways)?

dtheyer · 2024-02-15T20:02:03Z

The CI failures look related to my use of an endless range which looks to have been added for ruby 2.6, but is not handled correctly by rubo cop? rubocop/rubocop#7159

I can refactor that line to remove the endless range operator and make the CI happy.

I can confirm in my testing the bundles were equivalent with the exception of the new logic using subdirectory chefignore files. I validated this by getting a file list using tar to get a file list before and after and diffing them:

$ tar -ztvf old_tt.tgz | sort > /tmp/output_old.txt
$ tar -ztvf old_tt.tgz | sort > /tmp/output_new.txt

I tested this with both the synthetic repo I described earlier as well as with a repo that is used in production.

dafyddcrosby · 2024-02-16T18:18:27Z

The CI failures look related to my use of an endless range which looks to have been added for ruby 2.6, but is not handled correctly by rubo cop? rubocop/rubocop#7159

I can refactor that line to remove the endless range operator and make the CI happy.

I think it's that we still target Ruby 2.5 syntax https://github.com/facebook/chef-cookbooks/blob/main/.rubocop.yml#L3. I should be bumping that to 2.7 in ~next month, but might be easier to refactor in the meantime

Speed up bundle uploads

82c4a02

facebook-github-bot added the CLA Signed label Feb 12, 2024

David Heyer added 3 commits February 20, 2024 11:56

Ensure ruby 2.5 syntax and that temp files are cleaned up

ebbed24

Fix second infinite range

4f12ce2

Handle lints

dfa6ebb

joshuamiller01 merged commit 214307e into facebook:main Feb 21, 2024
5 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Speed up bundle uploads #193

Speed up bundle uploads #193

dtheyer commented Feb 12, 2024

joshuamiller01 commented Feb 15, 2024

dtheyer commented Feb 15, 2024

dafyddcrosby commented Feb 16, 2024

Speed up bundle uploads #193

Speed up bundle uploads #193

Conversation

dtheyer commented Feb 12, 2024

joshuamiller01 commented Feb 15, 2024

dtheyer commented Feb 15, 2024

dafyddcrosby commented Feb 16, 2024