MNT: reduce repo size #727
Comments
Thanks for keeping the issue alive. I am still interested in applying Git Large File Storage to this repo, but before diving in I decided to do the "homework" and spent the day looking into your documentation and trying different install procedures. You can assign me the task if you wish! Regards |
Nice job! I'm also trying to read more about git filter-branch and git bfg. I have used git LFS in this repo; maybe there's something in that repo that could help us. |
EDIT: migrating to git-lfs on GitHub is better described by GitHub here. Currently the 'data' folder has the following distribution of files:
# All Files
RocketPy on master is 📦 v1.6.1 via 🐍 v3.12.3
❯ du -h --max-depth=0 data
162M data
# .csv files
RocketPy on master is 📦 v1.6.1 via 🐍 v3.12.3
❯ find -type f -name "*.csv" -exec du -ch {} + | grep 'total' | awk '{print $1}'
20M
# .nc files
RocketPy on master is 📦 v1.6.1 via 🐍 v3.12.3
❯ find -type f -name "*.nc" -exec du -ch {} + | grep 'total' | awk '{print $1}'
142M
There is also about 1 MB of … The official Git LFS repo has a tutorial on how to migrate a repository. Problems I envision in implementing this:
Alternative:
Those were the investigations for today. As soon as I implement git-lfs on my repo, I will bring you the numbers and more discussion, if needed. |
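For anyone repeating the per-extension measurement above, here is a minimal sketch of a helper that sums file sizes in bytes. The function name `size_by_ext` is made up for illustration. It avoids the `grep 'total'` step, since `find ... -exec du -ch {} +` may invoke `du` in several batches on large trees and then print more than one total.

```shell
# Sum the apparent size, in bytes, of every file with a given extension.
# Usage: size_by_ext <directory> <extension-without-dot>
# Note: this reads the files through cat, which is simple and portable
# but slower than asking the filesystem for sizes on very large trees.
size_by_ext() {
    find "$1" -type f -name "*.$2" -exec cat {} + | wc -c | tr -d ' '
}

# Hypothetical usage from the repository root:
#   size_by_ext data nc
#   size_by_ext data csv
```

The result is in bytes rather than the human-readable units `du -h` prints, which makes it easier to compare before and after a cleanup.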
I only just read your comment. I will look into the tools you mentioned; the paper repo seems to be a different case (did you migrate?). I think the main problem is doing the migration and coordinating with everyone else to use git-lfs. |
@aureliobarbosa I have to say I liked the idea of trying the alternative solution first, and then we can try the git LFS. |
Hey @Gui-FernandesBR, I agree with you about trying to clean the git history. In this direction, I evaluated tools for cleaning the git history and found that git-filter-repo seems to be a better solution. It has an option to analyze the sizes of previously deleted files, of folders, and of files by extension (inside the git history). Since I suppose you are going to keep versions of the data files inside the repository, I opted to investigate the sizes by file, sorted in descending order. Below is a snapshot of the CSV file I generated (I will send it to the team via Discord).
Contrary to initial expectation, only a few big CSV files are stored in the git history, and the villains include .nc files, as expected, and notebooks (which store data inside them, of course...). The first big …
Considering this, my recommendation would be to delete only those 49 files, since this seems to be the simplest action. After installing the tool, this can be done with:
git filter-repo --invert-paths --paths-from-file files-i-dont-want-anymore.txt
Note that it DOES rewrite the git history, and all developers would need to clone the repository again. I recommend doing this 'surgery' once you have finished all the PRs you expect to merge before the next minor version.
mycode/projects/rocketpy-dev
❯ more path-deleted-sizes.csv
1, 46978562, 16949319, 2019-02-07, 'docs/sampleDispersionDataReader.ipynb'
2, 46978562, 16949319, 2019-02-07, 'disp/sampleDispersionDataReader.ipynb'
3, 46902996, 40188650, 2023-01-01, 'data/weather/Alcantara_2016_ERA-5.nc'
4, 46774852, 40263054, 2023-01-01, 'data/weather/Alcantara_2017_ERA-5.nc'
5, 46774852, 40126497, 2023-01-01, 'data/weather/Alcantara_2015_ERA-5.nc'
6, 46774852, 40116024, 2023-01-01, 'data/weather/Alcantara_2018_ERA-5.nc'
7, 43664596, 36926254, 2022-09-24, 'data/weather/EuroC_single_level_reanalysis_2000_2021.nc'
8, 21323940, 18974499, 2023-01-01, 'data/weather/CLBI_2016_ERA-5.nc'
9, 21323936, 19420900, 2023-01-01, 'data/weather/SpaceportAmerica_2016_ERA-5.nc'
10, 21265684, 18923920, 2023-01-01, 'data/weather/CLBI_2018_ERA-5.nc'
11, 21265684, 18904131, 2023-01-01, 'data/weather/CLBI_2017_ERA-5.nc'
12, 21265680, 19476628, 2023-01-01, 'data/weather/SpaceportAmerica_2017_ERA-5.nc'
13, 21265680, 19373762, 2023-01-01, 'data/weather/SpaceportAmerica_2015_ERA-5.nc'
14, 21265680, 18918037, 2023-01-01, 'data/weather/CLBI_2015_ERA-5.nc'
15, 13753819, 4050298, 2024-09-21, 'docs/notebooks/fins_roll.csv'
16, 12908689, 2940044, 2024-09-21, 'docs/notebooks/coeff_testing.ipynb'
17, 12894004, 3692363, 2020-03-22, 'nbks/Dispersion Sample.disp_input'
18, 12355867, 3862899, 2021-04-07, 'docs/notebooks/valetudo_dispersion/valetudo_dispersion.ipynb'
19, 11866021, 4080483, 2024-08-04, 'docs/notebooks/airbrakes_example.ipynb'
20, 10830351, 3809773, 2020-03-22, 'nbks/Getting Started - Examples.ipynb'
21, 9860124, 3218580, 2021-04-07, 'docs/notebooks/dispersion_analysis.ipynb'
22, 9802609, 123010, 2020-03-22, 'nbks/rocketpyAlpha.py'
23, 9005216, 8966518, 2022-09-24, 'data/weather/EuroC_pressure_levels_reanalysis_2002-2021.nc'
24, 8589588, 4696361, 2024-08-04, 'docs/notebooks/air_brakes_example.ipynb'
25, 8232142, 2572448, 2023-08-10, 'docs/notebooks/example_hybrid.ipynb'
26, 7849765, 2155194, 2021-04-07, 'docs/notebooks/valetudo_dispersion/Monte_carlo_valetudo.valetudo_disp_out.txt'
27, 6574758, 4701393, 2024-08-03, 'docs/notebooks/environment/environment_class_usage.ipynb'
28, 6313322, 3514941, 2023-06-28, 'docs/notebooks/example_solid.ipynb'
29, 6054970, 1531514, 2020-03-22, 'nbks/Dispersion Sample.disp_output'
30, 5635005, 3817378, 2020-03-22, 'nbks/Environment - Examples.ipynb'
31, 5080208, 4834400, 2022-04-09, 'data/weather/spaceport_america_pressure_level_reanalysis_2015_2021.nc'
32, 4976750, 2880500, 2022-06-07, 'docs/notebooks/SolidMotor_class_usage.ipynb'
33, 4929275, 3217437, 2020-03-22, 'nbks/Dispersion Analysis - Monte Carlo - Example.ipynb'
34, 4782068, 2288881, 2023-08-10, 'docs/notebooks/tank_class_usage.ipynb'
35, 4712149, 3155118, 2019-02-07, 'nbks/Environment Examples.ipynb'
36, 4299000, 215, 2024-09-21, 'docs/notebooks/tail_cL.csv'
37, 4298998, 1038596, 2024-09-21, 'docs/notebooks/tail_cQ.csv'
38, 4273149, 215, 2024-09-21, 'docs/notebooks/nose_cL.csv'
39, 4273147, 1036268, 2024-09-21, 'docs/notebooks/nose_cQ.csv'
40, 4223009, 213, 2024-09-21, 'docs/notebooks/fins_cL.csv'
41, 4223007, 1030363, 2024-09-21, 'docs/notebooks/fins_cQ.csv'
42, 4082870, 4044416, 2022-09-22, 'data/weather/EuroC_pressure_levels_reanalysis_2002_2010.nc'
43, 3418375, 1328400, 2023-08-10, 'docs/notebooks/example_liquid.ipynb'
44, 3274830, 1609382, 2020-03-22, 'nbks/Euporia.ipynb'
45, 3211198, 2261130, 2022-10-10, 'getting_started Dispersion.ipynb'
46, 2695115, 2695950, 2023-09-25, 'docs/static/trajectory-earth.png'
47, 2174231, 911552, 2022-05-19, 'getting_started.ipynb'
48, 2138702, 771662, 2023-01-01, 'data/calisto/CD Test.CSV'
49, 2109965, 836873, 2023-01-01, 'data/euporia/euporiaIDrag.csv'
50, 1933436, 965847, 2018-12-11, 'nbks/Calisto.ipynb'
51, 1525754, 51968, 2024-07-03, 'tests/test_rocket.py' |
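A listing like the one above can be turned into the paths file that `--paths-from-file` expects. The sketch below assumes the columns are rank, unpacked size, packed size, date and quoted path, which matches the layout of the size reports that `git filter-repo --analyze` writes under `.git/filter-repo/analysis/`. The thread does not confirm the column meanings, and the function name `paths_over` and the threshold are hypothetical.

```shell
# Print the paths from a "rank, unpacked, packed, date, 'path'" listing
# whose unpacked size (second column, assumed to be bytes) is at least
# the given threshold. One path per line, surrounding quotes removed.
# Usage: paths_over <min-bytes> <listing-file>
paths_over() {
    awk -F", " -v min="$1" -v q="'" '
        $2 + 0 >= min {
            path = $0
            # Drop the first four comma-separated fields, keeping the path
            # intact even when it contains spaces.
            sub(/^[^,]*, [^,]*, [^,]*, [^,]*, /, "", path)
            # Strip the leading quote and the trailing quote plus spaces.
            gsub("^" q "|" q " *$", "", path)
            print path
        }' "$2"
}

# Hypothetical usage: keep only entries of 2 MB or more, then review the
# file by hand and remove anything still in use before rewriting history.
#   paths_over 2000000 path-deleted-sizes.csv > files-i-dont-want-anymore.txt
```

Reviewing the generated file by hand matters here, because anything listed in it is removed from every commit once `git filter-repo --invert-paths` runs.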
Amazing work, @aureliobarbosa! I was not imagining that .ipynb files would also be part of the "villains list", but it makes total sense! When we save notebooks with images, each image is stored as base64-encoded text inside the .ipynb file (which is just a fancy .json), and this may consume a lot of disk space. Found another reason to migrate .ipynb to .rst files @MateusStano @phmbressan @Lucas-Prates! @RocketPy-Team/code-owners, can you read this thread and let us know whether you agree with such an operation? The only concern is that a few files are still being used and therefore cannot be deleted:
Something we should definitely try is to compress the .nc files! Based on my experience, there are some free tools that compress these files, usually reducing the file size by 30%. With all that being said, I guess a good summary of next steps would be:
As of now, I think your contribution is already quite beneficial for us, @aureliobarbosa ! |
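To check the expected saving from compressing a given .nc file, a small helper can compare the original with a compressed copy. This is only a sketch: the comment above does not name a tool, so using the netCDF `nccopy` utility (with `-d` for the deflate level and `-s` for shuffling) is an assumption, and the function name `percent_saved` and the output path are made up for illustration.

```shell
# Report how much smaller NEW is than ORIG, as a whole-number percentage.
# Usage: percent_saved <original-file> <compressed-file>
# The original file must not be empty, otherwise the division fails.
percent_saved() {
    orig=$(wc -c < "$1")
    new=$(wc -c < "$2")
    echo $(( (orig - new) * 100 / orig ))
}

# Hypothetical usage, assuming nccopy from the netCDF tools is installed:
#   nccopy -d 5 -s data/weather/CLBI_2016_ERA-5.nc /tmp/CLBI_2016_ERA-5.nc
#   percent_saved data/weather/CLBI_2016_ERA-5.nc /tmp/CLBI_2016_ERA-5.nc
```

Whether the compressed files still load correctly in RocketPy would need to be checked separately before replacing anything in the repository.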
Is your feature request related to a problem? Please describe.
As discussed here by @aureliobarbosa, cloning the RocketPy repo currently consumes more than 1GB.
This is probably due to large files being stored
Describe the solution you'd like
There are a few options that we would like to explore in order to tackle this issue. For instance:
.nc and other binary files that were initially committed to this repo but at some point got deleted.
data folder.
Additional context
I don't have much experience with this, but I will try to list a few links that may help us.