reduce repo size #88

Closed
ebgoldstein opened this issue Feb 11, 2020 · 5 comments
Labels
enhancement New feature or request

Comments

@ebgoldstein
Contributor

generally we might just reduce the repo size (i.e., remove catalogs, since the module can build catalogs)...

Chris even mentioned it in his review, which i have turned into an issue: #86

I know the JOSS bot had trouble building the paper because the repo was big (see the review issue). Originally I wanted the catalog to be in the repo for search features, but that enhancement is a bit downstream.

@Matmorcat , are you ok if the catalogs are removed?

@ebgoldstein ebgoldstein added the enhancement New feature or request label Feb 11, 2020
@Matmorcat
Contributor

Matmorcat commented Feb 13, 2020

@ebgoldstein Yes, I was thinking that eventually, for neatness' sake, the archives would be available for the program to download from a separate source.

It is a bit unorthodox to have the generated catalogs distributed with the code, but having the catalogs available (which they are on the figshare site) is crucial for people who don't want to have to download all the archives to see data that has already been discovered before!

@ebgoldstein
Contributor Author

ebgoldstein commented Feb 13, 2020

@Matmorcat, ok, so i can move the archives to a separate place (likely another github repository). It's a good point that it's helpful for people who may want to know what is available w/o downloading all the files (saying this as i have a work computer that has spent 2 full days downloading just the florence archives 😆).

also btw, the catalogs are not in the figshare repository (only the tags).

I can remove them tomorrow and then close this issue.

@chrisleaman
Contributor

Also, you should look at BFG Repo-Cleaner to help remove the large files from the git history. If you don't, even if you delete them, they will still be in the history and the repo will still be large.
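
To see which files are actually inflating the history, a rough sketch like the following may help. It assumes a POSIX-style shell with git, awk, sort and head available. The function name `largest_blobs` is made up for illustration and is not part of git or BFG.

```shell
# Hypothetical helper: list the N largest blobs reachable from any ref,
# including files that were deleted in later commits.
# Output format per line: blob <object id> <size in bytes> <path>
largest_blobs() {
    git rev-list --objects --all |
        git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
        awk '$1 == "blob"' |
        sort -k3,3nr |
        head -n "${1:-10}"
}

# Possible usage from inside the repository:
#   largest_blobs 20
```

If a file still shows up here after it has been deleted from the working tree, its contents are still stored in the history and still count toward the clone size.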

@ebgoldstein
Contributor Author

i removed the catalogs. I am a bit scared of dealing with the git history.. if someone else wants to be brave, feel free... otherwise i will leave it for now out of fear.

@chrisleaman
Contributor

I had a go at cleaning the history, but couldn't push it back to the repo due to permissions. Someone else can run these commands and clean the history. You'll need a bit of a workaround since you can't specify the full path with BFG though. More details on the bfg page. Good luck if you decide to try this!

In the repo, use git rev-list, git cat-file and grep to get the IDs of the blobs for files in the data/ folder.

ADUNSW+z5189959@leaman /c/Users/z5189959/Desktop/psi-collect (master)
λ git rev-list --all --objects -- data/ | git cat-file --batch-check='%(objectname) %(objecttype) %(rest)' | grep -Pe '^\w+ blob'
cec3538275cca7bea6c998590302f45904c9f5f0 blob data/catalogs/v1/Barry.csv
1eac7b4385131655a690e55361044cab575dfa75 blob data/catalogs/v1/Dorian.csv
74642417717fd491749e59c6cd8de492b3501b68 blob data/catalogs/v1/Florence.csv
21ada943f02b382f4fb1e6c7532baa3021793ac9 blob data/catalogs/v1/Gordon.csv
e6c6c2ef86129f7900ed4b780758ab83541fe6dc blob data/catalogs/v1/Michael.csv
2be642a6fe5cd81905e99dfc638c112c44783b28 blob data/catalogs/v1/catalog.csv
36b352d9fe1462bf8eec21e7947b90949a119134 blob data/catalogs/v2/Barry.csv
7bf96b962096aad61eafd4f0e023d22eb5574784 blob data/catalogs/v2/Dorian.csv
8c9d90f0fb43e4d4358291150fd8462218801a93 blob data/catalogs/v2/Florence.csv
42ea780c631145d7f3af24e13324db07be242185 blob data/catalogs/v2/Gordon.csv
bfbfebbc5a530a36a90f8bc9a09b02d64d6e6eae blob data/catalogs/v2/Michael.csv
2d5e14790ef2568033a9923e34de942f0a041c5a blob data/catalogs/v2/global.csv
660c8411912549a9207ee07aba1d8c3bab8bd4b8 blob data/archive_cache/.gitignore
2c8aba9f47c3ac3ea1c5896913ba2daaa50b13f3 blob data/input/.gitignore

Save these blobs to ./to-delete.txt:

ADUNSW+z5189959@leaman /c/Users/z5189959/Desktop/psi-collect (master)
λ git rev-list --all --objects -- data/ | git cat-file --batch-check='%(objectname) %(objecttype) %(rest)' | grep -Pe '^\w+ blob' | cut -d' ' -f1 > ./to-delete.txt
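
Note that the listing above also includes the two `.gitignore` blobs under `data/archive_cache/` and `data/input/`, so they end up in `./to-delete.txt` as well. A possible variant that keeps only blobs under a given path prefix, and avoids `grep -P` (which not every grep supports), might look like this. The function name `blob_ids_under` is an illustrative assumption, not part of the original workflow.

```shell
# Hypothetical helper: read "<id> <type> <path>" lines on stdin and print the
# ids of blobs whose path starts with the given prefix.
# Limitation: awk splits on whitespace, so the prefix itself must not contain
# spaces (spaces later in the path are fine).
blob_ids_under() {
    awk -v prefix="$1" '$2 == "blob" && index($3, prefix) == 1 { print $1 }'
}

# Possible usage inside the repo (same pipeline as above, different filter):
#   git rev-list --all --objects -- data/ |
#     git cat-file --batch-check='%(objectname) %(objecttype) %(rest)' |
#     blob_ids_under data/catalogs/ > ./to-delete.txt
```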

Run BFG to remove the blobs specified in ./to-delete.txt:

ADUNSW+z5189959@leaman /c/Users/z5189959/Desktop/psi-collect (master)
λ java -jar bfg.jar --no-blob-protection --strip-blobs-with-ids ./to-delete.txt

Using repo : C:\Users\z5189959\Desktop\psi-collect\.git

Found 0 objects to protect
Found 2 tag-pointing refs : refs/tags/v1.0.1, refs/tags/v1.0.2
Found 13 commit-pointing refs : HEAD, refs/heads/master, refs/remotes/origin/HEAD, ...

Protected commits
-----------------

You're not protecting any commits, which means the BFG will modify the contents of even *current* commits.

This isn't recommended - ideally, if your current commits are dirty, you should fix up your working copy and commit that, check that your build still works, and only then run the BFG to clean up your history.

Cleaning
--------

Found 1843 commits
Cleaning commits:       100% (1843/1843)
Cleaning commits completed in 6,814 ms.

Updating 13 Refs
----------------

        Ref                                                             Before     After
        -----------------------------------------------------------------------------------
        refs/heads/master                                             | e4e74ef0 | 5ec3d826
        refs/remotes/origin/dependabot/pip/markdown-3.2.1             | 2a9d6289 | f9b5c8e2
        refs/remotes/origin/dependabot/pip/mkdocs-minify-plugin-0.2.3 | 8305e907 | a2758b00
        refs/remotes/origin/master                                    | e4e74ef0 | 5ec3d826
        refs/tags/v0.3.0                                              | 10a41fb4 | 7faddc8b
        refs/tags/v0.4.0                                              | a8fd0a51 | c9d9fc4b
        refs/tags/v0.4.1                                              | aab4bc14 | 82626417
        refs/tags/v0.5.0                                              | ca8b468e | ea390ab3
        refs/tags/v0.5.4                                              | 86c6d2ca | 5e1a568d
        refs/tags/v0.6                                                | e238bcff | 68284948
        refs/tags/v1.0.0                                              | 9f977c6b | 6398ef7a
        refs/tags/v1.0.1                                              | b3d261b0 | 4f1b2073
        refs/tags/v1.0.2                                              | 0fe90084 | 0415bbc6

Updating references:    100% (13/13)
...Ref update completed in 59 ms.

Commit Tree-Dirt History
------------------------

        Earliest                                              Latest
        |                                                          |
        DDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD

        D = dirty commits (file tree fixed)
        m = modified commits (commit message or parents changed)
        . = clean commits (no changes to file tree)

                                Before     After
        -------------------------------------------
        First modified commit | 4f2ec0c4 | 24dfd2cc
        Last dirty commit     | 2a9d6289 | f9b5c8e2

Deleted files
-------------

        Filename                               Git id
        -------------------------------------------------------------------------------
        .gitignore                           | 2c8aba9f (48 B), 660c8411 (49 B)
        Barry.csv                            | 36b352d9 (2.2 MB), cec35382 (1.7 MB)
        Dorian.csv                           | 7bf96b96 (3.2 MB), 1eac7b43 (2.2 MB)
        Florence.csv                         | 8c9d90f0 (7.7 MB), 74642417 (5.5 MB)
        Gordon.csv                           | 21ada943 (387.7 KB), 42ea780c (523.7 KB)
        Michael.csv                          | bfbfebbc (2.4 MB), e6c6c2ef (1.8 MB)
        catalog for image pixel and size.csv | 2be642a6 (12.2 MB)
        catalog.csv                          | 2be642a6 (12.2 MB)
        global.csv                           | 2d5e1479 (16.4 MB)


In total, 3694 object ids were changed. Full details are logged here:

        C:\Users\z5189959\Desktop\psi-collect.bfg-report\2020-02-14\07-52-35

BFG run is complete! When ready, run: git reflog expire --expire=now --all && git gc --prune=now --aggressive


--
You can rewrite history in Git - don't let Trump do it for real!
Trump's administration has lied consistently, to make people give up on ever
being told the truth. Don't give up: https://www.aclu.org/
--

Prune the history:

ADUNSW+z5189959@leaman /c/Users/z5189959/Desktop/psi-collect (master)
λ git reflog expire --expire=now --all && git gc --prune=now --aggressive
Enumerating objects: 24526, done.
Counting objects: 100% (24526/24526), done.
Delta compression using up to 8 threads
Compressing objects: 100% (23323/23323), done.
Writing objects: 100% (24526/24526), done.
Total 24526 (delta 10727), reused 11729 (delta 0)
Removing duplicate objects: 100% (256/256), done.
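
To check that the cleanup worked, one option is to test whether any ID from `./to-delete.txt` still exists in the object database. This is only a sketch: `remaining_blobs` is a made-up name, and it relies on `git cat-file -e`, which exits non-zero when an object is missing.

```shell
# Hypothetical helper: print every id from the given file that still exists
# in the current repository's object database.
remaining_blobs() {
    while read -r id; do
        [ -n "$id" ] || continue          # skip empty lines
        if git cat-file -e "$id" 2>/dev/null; then
            echo "$id"
        fi
    done < "$1"
}

# Possible usage after the gc step; no output means all listed blobs are gone:
#   remaining_blobs ./to-delete.txt
#   git count-objects -vH   # "size-pack" reports the packed size of the repo
```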

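Push the rewritten history back to the repo. Because the commit IDs have changed, a plain git push will be rejected as non-fast-forward, so the branches and tags need a force push:

git push --force --all
git push --force --tags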
