Cloning this repo consumes 1.4GB on disk #771

Open
kfranqueiro opened this issue Jul 23, 2024 · 3 comments
@kfranqueiro
Contributor

Upon cloning this repo for the first time today, I was somewhat unnerved to find that git needed to download 1.2GB of data, ultimately occupying 1.4GB of disk space.

Further digging leads to the conclusion that the gh-pages branch is the main culprit:

$ git rev-list --disk-usage=human --objects HEAD..gh-pages
1.20 GiB

Moreover, most of this isn't due to files currently present on the branch, but rather files in its history:

$ git checkout gh-pages
Switched to branch 'gh-pages'

$ du -sh
1.6G    .

$ du -sh .git
1.3G    .git

(i.e., roughly 300 MB is actually checked out on the branch, while the other 1.3 GB is in the repository's commit history)

The content-assets and content-images folders account for most of the presently occupied space on the branch, containing a handful of MP4 videos and many PNG images.

Possible solutions

Squash gh-pages history

Fully remedying the current state would likely require rewriting the git history of the gh-pages branch, squashing the bulk of the 1.3 GB of history that has no bearing on the present contents. The silver lining is that, I suspect, mostly automated systems push to gh-pages, so this would be far less disruptive than rewriting the main branch.
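
For illustration, here is a minimal sketch of what such a squash could look like. It runs against a throwaway repository, not this one. It assumes the goal is a single parentless commit that carries the current gh-pages tree. The demo file name, commit messages and identity are made up, and the force-push that a real clone would need is left as a comment.

```shell
#!/bin/sh
set -eu

# Throwaway repository so nothing real is touched (all names are illustrative).
demo=$(mktemp -d)
trap 'rm -rf "$demo"' EXIT
cd "$demo"
git init -q .
git symbolic-ref HEAD refs/heads/gh-pages

# Identity used for the demo commits only.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Simulate a few automated deployments.
for n in 1 2 3; do
  echo "build $n" > index.html
  git add index.html
  git commit -q -m "Deploy build $n"
done

old_tip=$(git rev-parse gh-pages)

# Create one parentless commit that reuses the tip's tree,
# then move the branch to it (only if the branch still points at old_tip).
new_tip=$(git commit-tree -m "Squash gh-pages history" "gh-pages^{tree}")
git update-ref refs/heads/gh-pages "$new_tip" "$old_tip"

commits_after=$(git rev-list --count gh-pages)
echo "commits on gh-pages after squash: $commits_after"

# On a real clone this would be followed by something like:
#   git push --force-with-lease origin gh-pages
```

Because the new commit reuses the existing tree, the published files are unchanged. Only the chain of parent commits is dropped.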

Git LFS for binary files

Using git for long-term storage of binary assets inevitably leads to this result. Maybe we should look into using Git LFS with this repo?

Potential workarounds

If you want to clone the repo in its current state without pulling down over a gigabyte of data, you can clone only the default branch using git clone --single-branch repo-url. (This costs roughly 160MB of space rather than 1.4GB.)

If you need to check out a particular existing branch after doing the above, you can explicitly add only the branch you want:

git remote set-branches --add origin branch-name
git fetch
git checkout branch-name
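
The workaround can be tried end to end against a local stand-in for the remote. This is only a sketch: the upstream repository, its two branches and the file contents are invented for the demo, and a file:// URL is used so that the clone goes through the normal transport instead of copying the object store wholesale.

```shell
#!/bin/sh
set -eu

work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Stand-in "remote" with a default branch and one extra branch.
git init -q "$work/upstream"
cd "$work/upstream"
git symbolic-ref HEAD refs/heads/main
echo "source" > README
git add README
git commit -q -m "Initial commit"
git checkout -q -b gh-pages
echo "site" > index.html
git add index.html
git commit -q -m "Deploy site"
git checkout -q main

# Clone only the default branch.
git clone -q --single-branch "file://$work/upstream" "$work/clone"
cd "$work/clone"
branches_before=$(git branch -r | grep -c 'origin/gh-pages' || true)

# Later, opt in to exactly one more branch.
git remote set-branches --add origin gh-pages
git fetch -q
git checkout -q gh-pages
branches_after=$(git branch -r | grep -c 'origin/gh-pages' || true)

echo "origin/gh-pages tracked before: $branches_before, after: $branches_after"
```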
@remibetin
Member

remibetin commented Sep 5, 2024

Thanks for bringing this up @kfranqueiro.

Please find below my additional investigation and comments.

Size of the repository: 1.3 GB

remote: Enumerating objects: 1383802, done.
remote: Counting objects: 100% (58719/58719), done.
remote: Compressing objects: 100% (7211/7211), done.
remote: Total 1383802 (delta 28328), reused 58463 (delta 28107), pack-reused 1325083 (from 1)
Receiving objects: 100% (1383802/1383802), 1.26 GiB | 15.26 MiB/s, done.
Resolving deltas: 100% (673094/673094), done.

git count-objects -v:

count: 0
size: 0
in-pack: 1383802
packs: 1
size-pack: 1362244
prune-packable: 0
garbage: 0
size-garbage: 0

du -hs .git/objects:

1,3G	.git/objects

Finding large objects

Results of script from Atlassian documentation

All sizes are in kB's. The pack column is the size of the object, compressed, inside the pack file.
size   pack   SHA                                       location
34676  34410  5a29f69be4147a629cd972c1a4ba5f7a17eae96a  content-assets/wcag-act-rules/test-assets/perspective-video/perspective-video-silent.mp4
33756  33488  dba211b9a3a4408abfc81b7ee8ace6ea611009aa  content-assets/wcag-act-rules/test-assets/perspective-video/perspective-video-with-captions-silent.mp4
8250   2403   92d930e643b17700a8e66e05a5b469e9e543b295  assets/search/tipuesearch_content.js
7701   1683   2f9d83622d722325d45bc68197f6500012b931e7  assets/search/tipuesearch_content.js
4267   933    dfe836746e8278f536211f646b3020f6d2b330ec  assets/search/tipuesearch_content.js
3824   1092   711c123deda693e7d2d9bb7b1f1a6825955868ff  assets/search/tipuesearch_content.js
3824   1092   2d7e357fb60b450c07ff1cd0cd33e39ba5a1cf58  assets/search/tipuesearch_content.js
3636   1039   14e0c9532fb8f4c80ac9b7573ecb5dbee386d391  assets/search/tipuesearch_content.js
3636   1039   32f073ebf647f17659c53bcc63153fe1f5d97dd0  assets/search/tipuesearch_content.js
3636   1041   a2f25aeab2479f06fe29ebede293ca5514fa2c81  assets/search/tipuesearch_content.js
3636   1041   24946c284a03808bbefbe989204a6565126b6755  assets/search/tipuesearch_content.js
3625   1027   e81a93cbc196e0e98a4af1cab306ac88b9c2acff  assets/search/tipuesearch_content.js
3517   1013   290fc0043a4d567cecf655fe0ddc1217b58aada1  assets/search/tipuesearch_content.js
3397   970    077987522086033507e66babbd853e2a29715d71  assets/search/tipuesearch_content.js
3389   965    9bd8dfaaff2ff1f69ff43cb70b698151aa80f517  assets/search/tipuesearch_content.js
3383   966    46a28b858f2bbd16b844d1bf66cbe5620f6a62c9  assets/search/tipuesearch_content.js
3377   962    5045c405e79ec34af9349bd72484478d4ea56c3c  assets/search/tipuesearch_content.js
3185   921    d8c10a3df10b9c7a934230a0e4ffa96fd41c1e9e  assets/search/tipuesearch_content.js
3185   921    4b28ada3b45214295aaf1d2d9d21e803972574bb  assets/search/tipuesearch_content.js
2962   2877   f3af2a44f71c3fc9e31cdee373106ef09f0e8a5d  content-assets/wcag-act-rules/test-assets/rabbit-video/video-with-voiceover.webm

Main large files:

  • Videos used in ACT Rules
  • tipuesearch_content.js: files generated for the search feature.
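
The listing above came from the script in the Atlassian documentation, which is not reproduced here. A roughly comparable listing can be sketched with plain git plumbing. The helper name largest_blobs and the demo repository below are hypothetical, and the sizes it reports are uncompressed bytes, not the packed sizes shown in the pack column.

```shell
#!/bin/sh
set -eu

# largest_blobs N: print the N largest blobs reachable from any ref
# as "<size in bytes> <sha> <path>", biggest first.
largest_blobs() {
  git rev-list --objects --all |
    git cat-file --batch-check='%(objecttype) %(objectsize) %(objectname) %(rest)' |
    sed -n 's/^blob //p' |
    sort -k1,1nr |
    head -n "$1"
}

# Demo on a throwaway repository (file names and sizes are made up).
demo=$(mktemp -d)
trap 'rm -rf "$demo"' EXIT
cd "$demo"
git init -q .
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

printf 'tiny' > small.txt
dd if=/dev/zero of=big.bin bs=1000 count=5 2>/dev/null
git add small.txt big.bin
git commit -q -m "Add files"

# Deleting the file does not remove it from history.
git rm -q big.bin
git commit -q -m "Remove big file"

top=$(largest_blobs 1)
echo "$top"
```

The deleted file still shows up at the top of the list. This is the same effect described in this issue: files that are no longer on a branch keep occupying space through its history.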

Possible solutions

Host videos elsewhere

Videos used in ACT rules could be hosted on https://media.w3.org/ and removed from the Git history (see Atlassian documentation for removing large files).

I do not think we need to use Git LFS (at least for now): we do not use many large files (such as videos) in this project, and these files are not frequently updated, so they do not need to be tracked with Git. Hosting them directly somewhere else appears good enough to me.

Warning

From Atlassian documentation about reducing repository size:
"Once large files have been removed, it is a best practice for everyone using the repository to make a new clone; otherwise, if someone does a force push, they will push the large files again and you’ll be back to where you started."

Optimize images

As you noted, there are many PNG files in the content-images folder.

Many could be replaced with JPG files or smaller versions (in particular the images used in the How People with Disabilities Use the Web subpages). The PNG/larger versions could then be removed from the Git repository.

Squash gh-pages history (from @kfranqueiro)

After the above cleaning, I agree with the solution of regularly squashing the gh-pages history: indeed, only automated systems push to gh-pages, and we want to track the project history in the main branch.

I would exclude a more radical approach of using this branch as an orphan branch: we want multiple commits in gh-pages so that, in case of emergency, we can quickly roll back the published site to a previous version.

Test squash with my fork

After squashing the gh-pages branch and re-cloning my forked repo, I get the following results:

remote: Enumerating objects: 17295, done.
remote: Counting objects: 100% (4226/4226), done.
remote: Compressing objects: 100% (1335/1335), done.
remote: Total 17295 (delta 1964), reused 3819 (delta 1889), pack-reused 13069 (from 1)
Receiving objects: 100% (17295/17295), 183.07 MiB | 8.10 MiB/s, done.
Resolving deltas: 100% (8953/8953), done.

git count-objects -v:

count: 0
size: 0
in-pack: 17295
packs: 1
size-pack: 187935
prune-packable: 0
garbage: 0
size-garbage: 0

du -hs .git/objects:

193M	.git/objects

@remibetin remibetin self-assigned this Sep 5, 2024
@shawna-slh
Collaborator

"Videos used in ACT rules could be hosted on https://media.w3.org/ and removed from Git history"

Maybe y'all have already communicated this: Ken is pursuing adding some videos to a new subdirectory of https://media.w3.org/wai/
@iadawn @kfranqueiro Any reason not to add these in that request? Perhaps best to have separate directories — I leave that for y'all to figure out.

@remibetin
Member

remibetin commented Sep 12, 2024

@kfranqueiro @shawna-slh @iadawn

I have squashed the history of the gh-pages branch. This went well for the live website 🎉

I now get the following results:

remote: Enumerating objects: 17343, done.
remote: Counting objects: 100% (4274/4274), done.
remote: Compressing objects: 100% (1397/1397), done.
remote: Total 17343 (delta 2055), reused 3698 (delta 1875), pack-reused 13069 (from 1)
Receiving objects: 100% (17343/17343), 182.43 MiB | 17.55 MiB/s, done.
Resolving deltas: 100% (9044/9044), done.

git count-objects -v:

count: 0
size: 0
in-pack: 17343
packs: 1
size-pack: 187286
prune-packable: 0
garbage: 0
size-garbage: 0

du -hs .git/objects:

192M	.git/objects
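
For a sense of scale, the two size-pack values reported in this thread (1362244 before the squash, 187286 after) can be compared with a small helper. This is only a sketch: the function names are made up, and it relies on git count-objects reporting size-pack in KiB.

```shell
#!/bin/sh
set -eu

# pack_reduction BEFORE AFTER: percentage saved between two size-pack values (KiB).
pack_reduction() {
  awk -v before="$1" -v after="$2" \
    'BEGIN { printf "%.2f\n", (1 - after / before) * 100 }'
}

# kib_to_mib KIB: convert a KiB figure to MiB with one decimal.
kib_to_mib() {
  awk -v kib="$1" 'BEGIN { printf "%.1f\n", kib / 1024 }'
}

before=1362244   # size-pack before the squash, from the Sep 5 comment
after=187286     # size-pack after the squash, from the Sep 12 comment

echo "before: $(kib_to_mib "$before") MiB"
echo "after:  $(kib_to_mib "$after") MiB"
echo "saved:  $(pack_reduction "$before" "$after")%"
```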
