Ability to remove directories #12
@Fryguy It would be possible to have a switch based on directory name rather than directory path, if that is useful? For instance:
Can you fill me in with some more context about your use case? Are you removing sensitive/private data, or just want to remove large files to reduce repo size? |
@rtyley Given a structure below, and I only want to remove the root-level aaa directory, would it remove both since it's by name?
I guess my use case falls into the "reduce repo size" category. My use case is I'm trying to split a massive repo with a long history into separate repos while keeping history. Most of the split is based on top-level directories. For example, a, b, and c will go to one repo; d to a second repo; and e and f to a third repo. git-filter-branch works ok for a single directory using --subdirectory-filter (when that directory doesn't have much activity in the history), but to do multiple directories I have to use --index-filter 'git rm -rf ', which takes forever. Considering I have to split it into about 5 repos, this approach will take forever. I found bfg-repo-cleaner and ran it to clean up some large files and was amazed by the performance, hence why I would love to be able to use it. |
The implementation I was thinking of in my head would, yes, which makes it a fairly blunt instrument. Part of the special-sauce that makes the BFG so fast is that it is path-independent, which makes me cautious about adding path-dependent features - unless they can be added without affecting performance. Let me have a think about it, I can see it's a valid feature.
That makes a good quote - would you mind if I added it to http://rtyley.github.com/bfg-repo-cleaner/#feedback, attributing it to you as 'Jason Frey, Software Engineer at Red Hat'? |
Sure, go for it.
Thanks! I was trying to go through the code to find where the file name checking was done, to see if I could help out with a patch or something, but I couldn't find it. I've also never done Scala before, so it's also a new thing to look at. |
Thanks - I've added your quote, much appreciated.
These are the lines currently associated with stripping out files that match a particular text-pattern.... ...but the code really is not set up to deal with paths, it's a non-trivial change to get it there :) Out of interest, would you want/expect 'empty' commits (i.e. commits that appear to change nothing, because they relate to stuff from deleted directories) to be completely removed by the BFG, or for them to stay in place, with the commit message intact?
If you've got several weeks to invest (!) and you'd like to learn Scala I can really recommend this free course: https://www.coursera.org/course/progfun A lot of us in the office took it a few months ago and it really brought us up to speed. |
I would expect them to be removed for my use case, but others might(?) want to keep them. git-filter-branch has the --prune-empty option, which I was using. |
I know very little about git internals, so I'm not sure this helps, but I know if I do Would that help? |
Git tools apparently don't like empty trees much (C-Git CLI certainly makes it difficult to create empty trees, though it doesn't seem to mind them if it encounters them?). Also, if we're using multiple-blob-id removal to nuke directories (as in issue #12) then leaving the empty husks of directories around will just be ugly. #12 (comment)
Helps a little with issue #12. People can get a list of blob-ids using "git rev-list --all --objects", then grep to list all files in directories they want to nuke, and pass that to the BFG, as noted by @Fryguy: #12 (comment)
This is not really the full implementation of issue #12 ("Ability to remove directories"), because that issue actually requires path-dependent pruning. This is just simple filtering based on folder-name, but it should help this guy: http://stackoverflow.com/q/16821649/438886 ..and also Francis with https://github.com/janua/premiumparking To create the nasty folder-named-'.git' repo, this is the code I used: https://gist.github.com/rtyley/5673862
"Fairly blunt instrument" to say the least. So if someone committed a lib directory in the parent, I can't use this tool without removing all lib directories anywhere in the path? |
Yep, that's true of the Often though, it's reasonable to ask exactly what you're trying to achieve by removing the |
i can't use strip blobs now because i work at a company and i'm not sure about some of the files yet. i am sure about the directory in question. honestly, in terms of design i don't see how a generic delete directory by name could be very useful. it certainly isn't in my case :P it seems like a loaded gun waiting to destroy current directories... but you said not the latest commit. does that mean currently existing directories called 'lib' won't be touched? |
Sure- here's some further documentation: |
+1 for specifying an absolute path for removal. My case is that unfortunately the team has committed a lot of libs in the repo (since SVN times...) and, until we refactor the project and put them into Maven or something external like that, we gotta keep the libs in the repo. But we did identify several libs that are useless by now, and removing them would already save a good amount of space in the repo... So suppose I have:
In this example, we're still using version 3.4. So we couldn't delete "certain-lib" folder. And also we couldn't delete "certain-lib.jar" otherwise the 3.4 version would also go. As for the performance, I think it's fine: for this full path case, you could just warn the users (manual, readme, before running the command...). I wouldn't mind at all. Nonetheless, congrats and thanks a lot for this great tool!!!!! |
@lfilho in your example, you should be fine to do:
...this will delete all jars that are not in your latest commit - because, by default, the BFG protects the contents of your latest commit. So This command is short and sweet, and should definitely be used unless there's a good reason not to. Although I appreciate that for some use-cases path-dependent action is necessary, for the large majority of cases, it's not. For some of the cases where path-dependent action is necessary, there may actually already be a decent alternative tool (perhaps git-subtree, which is decently performant) that can perform the task. I'm always very happy to hear explanations of why users do need path-dependent action, and if people explain the need here on this issue, that'll help me prioritise this feature. So far, of the two people who've discussed their requirements, FryGuy had a legitimate use-case, whereas yeago, I believe, would have been served perfectly well by the BFG's protected-commit behaviour.
The cost of implementing path-dependent action:
Given the cost of implementing the feature, vs the benefit it provides to a limited percentage of users, what would you do!? Personally, I would like to try implement it, but that will have to be in a world where I have considerably more time.
Thank you, I appreciate your thanks :-) |
First off, a great tool, and very much appreciated. Our company went through a "split repos" stage under SVN where one very large repo (200,000 files?), let's call it repo "a", which contained 5 top level directories: a1, a2, a3, a4, and a5, got turned into 5 separate SVN repos: a1, a2, a3, a4, and a5. I wasn't around when this happened, but apparently they must have copied "a" 5 times, then did SVN deletes to trim each one, then pulled the subdirs up to the top of each repo (e.g. for the "a1" repository: rm -f -r a2 a3 a4 a5 ; mv a1/* . ; rmdir a1). So now the transition to 5 Git repositories (preserving at least the SVN source code change history, using git-svn) creates 5 rather bloated Git repositories. So some kind of simple "delete any a1 repo file with a prefix pattern of "a[2-5]/" in its full path" would be nice. For the most part, it's the top level deleted SVN directories, or a simple prefix on the full path. If I delete all a1, a2, a3, a4, and a5 directories, that might work, I'll try it, but when you are dealing with old SVN repositories and 100s of engineers with no proper repository rules, who knows what will happen. :( Of course the biggest bloat comes from jar and zip files people shoved into the SVN repos over the years, but BFG does a great job on that. |
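The prefix-pattern selection asked for above can be approximated today by filtering an object listing before handing IDs to the BFG. What follows is only a sketch under stated assumptions: `select_blob_ids` is a hypothetical helper name, and it expects `<id> <type> <path>` lines on stdin, the shape produced by the `git cat-file --batch-check` format used in the workaround later in this thread. As far as I know, `git rev-list --objects` reports each object once, under a single path, so a blob that also lives outside the matched prefix can still end up selected.

```shell
# Hypothetical helper, not part of the BFG or of git.
# Reads "<id> <type> <path>" lines on stdin and prints the IDs of blobs whose
# full path matches the extended regular expression passed as $1.
select_blob_ids() {
    awk -v re="$1" '
        $2 == "blob" {
            path = $0
            sub(/^[^ ]+ [^ ]+ /, "", path)   # drop "<id> <type> ", keep the whole path
            if (path ~ re) print $1
        }'
}

# Example (run inside the repository; output file name is an assumption):
#   git rev-list --all --objects |
#     git cat-file --batch-check='%(objectname) %(objecttype) %(rest)' |
#     select_blob_ids '^a[2-5]/' > ./to-delete.txt
```

Anchoring the pattern with `^` keeps the match on the top-level prefix, so a nested `docs/a2/` directory is left alone.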
I would also really love to see bfg support removing of specific subdirectories. This would make it useful in my situation. |
+1 for this feature. Sometimes it is prudent to prune entire paths from history. I imagine that this is a fairly common need. This can be achieved now, but at the moment it requires a lot of pre-BFG scripting to generate a list of objects that are in those delete-target trees but not in HEADs, then feed that to delete-by-objectId using
...
@rtyley Did you ever draw any performance-conclusions on this? Is it just a coding-exercise without fundamental overwhelming space/time concerns? It looks like this wants |
Here's a workaround to remove a given directory by path with BFG:
git rev-list --all --objects -- path/to/the/directory/to/delete | git cat-file --batch-check='%(objectname) %(objecttype) %(rest)' | grep -Pe '^\w+ blob' | cut -d' ' -f1 > ./to-delete.txt
java -jar bfg.jar --no-blob-protection --strip-blobs-with-ids ./to-delete.txt
The principle is simple: create a list of object IDs to strip, and input that to BFG. This means that if an object is referenced through a different path it will be nuked nonetheless.
Needless to say, it's much faster than git filter-branch 😄 |
You have to be careful with this approach. If a file has been copied from another location, this approach will also delete it in the other location as git uses the same hash for different locations. It has happened to me on trial runs a number of times. The best way to deal with this is to delete the directory in git first, commit and then run your script but without blob-protection. This seemed to have worked for me. |
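One way to reduce the risk described above is to subtract every blob still present in the tip commit from the ID list before running the BFG. This is a minimal sketch, assuming the tip is `HEAD` and that `./to-delete.txt` holds one ID per line; `ids_in_head`, `subtract_ids`, `keep.txt` and `to-delete-safe.txt` are hypothetical names. It only protects content that is still in `HEAD`: a copy that exists solely in older commits under another path would still be stripped.

```shell
# Hypothetical helpers, not part of the BFG or of git.

# List the blob IDs in the tip commit (assumed to be HEAD).
# `git ls-tree -r` prints "<mode> <type> <id><TAB><path>", so field 3 is the ID.
ids_in_head() {
    git ls-tree -r HEAD | awk '$2 == "blob" { print $3 }'
}

# Print every ID from the candidates file that is absent from the keep file.
# Usage: subtract_ids <candidates-file> <keep-file>
subtract_ids() {
    # The keep file is read first; comparing FILENAME with ARGV[1] (rather than
    # NR == FNR) keeps the logic correct even when the keep file is empty.
    awk 'FILENAME == ARGV[1]  { keep[$1] = 1; next }
         NF && !($1 in keep)  { print $1 }' "$2" "$1"
}

# Example (run inside the repository, after generating ./to-delete.txt):
#   ids_in_head > ./keep.txt
#   subtract_ids ./to-delete.txt ./keep.txt > ./to-delete-safe.txt
```

The filtered file can then be passed to `--strip-blobs-with-ids` in place of the original list.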
@bschindler yes that's dangerous, that's what I said in bold in my comment. |
I do, I just get through them real slowly. |
@rtyley oh sorry I just noticed you started working on the project again recently |
I erroneously included a folder containing release builds in my git repo. Can I use BFG to undo this, removing that folder and its contents from my git history? It's just been sitting there taking up space for no reason. |
@TharosTheDragon If the directory name is consistent throughout history, and doesn't conflict by name with other directories in the tree, then you could use |
What's the difference between --delete-dirs and --delete-folders? |
I'm sorry...I copied the wrong string, not paying attention...should be --delete-folders <glob> |
Thanks! |
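For the release-builds question above, a `--delete-folders <glob>` run might be wrapped as follows. This is a sketch, not the maintainer's recommended procedure: `delete_folders_by_name` and the `BFG_JAR` variable are hypothetical, and `my-repo.git` and `release` are placeholder names. The option matches on folder name rather than on path within the repository, so the wrapper rejects anything containing a slash before invoking the BFG.

```shell
# Hypothetical wrapper; BFG_JAR is an assumed variable pointing at the BFG jar.
# Usage: delete_folders_by_name <repo> <folder-name-or-glob>
delete_folders_by_name() {
    repo=$1
    name=$2
    case $name in
        ''|*/*)
            # --delete-folders matches folder names, not paths, so a value
            # such as "build/release" would never match what the caller means.
            echo "expected a folder name or glob, not a path: '$name'" >&2
            return 2
            ;;
    esac
    java -jar "${BFG_JAR:-bfg.jar}" --delete-folders "$name" "$repo"
}

# Example:
#   delete_folders_by_name my-repo.git release
```

Every folder with that name anywhere in history is affected, apart from content in the latest commit, which the BFG protects by default. The BFG's documentation suggests following a run with `git reflog expire --expire=now --all && git gc --prune=now --aggressive` to actually reclaim the space.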
Thank @ltrzesniewski for his awesome answer.
|
I was wondering if bfg-repo-cleaner had the ability to remove entire directories of files. I noticed the -D option, but it specifically says it doesn't work with paths, so I don't think it will work. Would it be possible to add an option to give one or more directories?