Ability to remove directories #12

Open
Fryguy opened this issue Mar 7, 2013 · 27 comments

Comments

@Fryguy

Fryguy commented Mar 7, 2013

I was wondering if bfg-repo-cleaner had the ability to remove entire directories of files. I noticed the -D option, but it specifically says it doesn't work with paths, so I don't think it will work. Would it be possible to add an option to give one or more directories?

@rtyley
Owner

rtyley commented Mar 7, 2013

@Fryguy It would be possible to have a switch based on directory name rather than directory path, if that is useful? For instance:

--delete-dirs <glob> - delete directories with the specified names

Can you fill me in with some more context about your use case? Are you removing sensitive/private data, or just want to remove large files to reduce repo size?

@Fryguy
Author

Fryguy commented Mar 7, 2013

@rtyley Given a structure below, and I only want to remove the root-level aaa directory, would it remove both since it's by name?

  • aaa/
  • bbb/
    • aaa/

I guess my use case falls into the "reduce repo size" category. My use case is I'm trying to split a massive repo with a long history into separate repos while keeping history. Most of the split is based on top-level directories. For example, a, b, and c will go to one repo; d to a second repo; and e and f to a third repo. git-filter-branch works ok for a single directory using --subdirectory-filter (when that directory doesn't have much activity in the history), but to do multiple directories I have to use --index-filter 'git rm -rf ', which takes forever. Considering I have to split it into about 5 repos, this approach will take forever.

I found bfg-repo-cleaner and ran it to clean up some large files and was amazed by the performance, hence why I would love to be able to use it.

@rtyley
Owner

rtyley commented Mar 7, 2013

Given a structure below, and I only want to remove the root-level aaa directory, would it remove both since it's by name?

The implementation I was thinking of in my head would, yes, which makes it a fairly blunt instrument. Part of the special-sauce that makes the BFG so fast is that it is path-independent, which makes me cautious about adding path-dependent features - unless they can be added without affecting performance. Let me have a think about it, I can see it's a valid feature.

I found bfg-repo-cleaner and ran it to clean up some large files and was amazed by the performance

That makes a good quote - would you mind if I added it to http://rtyley.github.com/bfg-repo-cleaner/#feedback, attributing it to you as 'Jason Frey, Software Engineer at Red Hat'?

@Fryguy
Author

Fryguy commented Mar 7, 2013

That makes a good quote - would you mind if I added it to http://rtyley.github.com/bfg-repo-cleaner/#feedback, attributing it to you as 'Jason Frey, Software Engineer at Red Hat'?

Sure, go for it.

Let me have a think about it, I can see it's a valid feature.

Thanks! I was trying to go through the code to find where the file name checking was done, to see if I could help out with a patch or something, but I couldn't find it. I've also never done Scala before, so it's also a new thing to look at.

@rtyley
Owner

rtyley commented Mar 7, 2013

Sure, go for it.

Thanks - I've added your quote, much appreciated.

Thanks! I was trying to go through the code to find where the file name checking was done, to see if I could help out with a patch or something, but I couldn't find it.

These are the lines currently associated with stripping out files that match a particular text-pattern....

https://github.com/rtyley/bfg-repo-cleaner/blob/v1.0.2/src/main/scala/com/madgag/git/bfg/cli/CLIConfig.scala#L118-L124

...but the code really is not set up to deal with paths, it's a non-trivial change to get it there :)

Out of interest, would you want/expect 'empty' commits (ie commits that appear to change nothing, because they relate to stuff from deleted directories) to be completely removed by the BFG, or for them to stay in place, with the commit message intact?

I've also never done Scala before, so it's also a new thing to look at.

If you've got several weeks to invest (!) and you'd like to learn Scala I can really recommend this free course:

https://www.coursera.org/course/progfun

A lot of us in the office took it a few months ago and it really brought us up to speed.

@Fryguy
Author

Fryguy commented Mar 7, 2013

Out of interest, would you want/expect 'empty' commits (ie commits that appear to change nothing, because they relate to stuff from deleted directories) to be completely removed by the BFG, or for them to stay in place, with the commit message intact?

I would expect them to be removed for my use case, but others might(?) want to keep them. git-filter-branch has the --prune-empty option, which I was using.

@Fryguy
Author

Fryguy commented Mar 9, 2013

I know very little about git internals, so I'm not sure this helps, but I know if I do git rev-list --all --objects, I get a list of every object with full path. With my repo of ~365,000 objects it takes only 24 seconds to run (with an unprimed file cache). This can then be easily grepped to get a list of object SHAs, which I assume can be run through the "delete a file from history" method.

Would that help?
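
A minimal sketch of the idea above, assuming the `git rev-list --all --objects` output has one `<sha> <path>` entry per line. The function name `ids_under_root_dir` is hypothetical and not part of git or the BFG. It matches the path prefix literally from the repository root, so `aaa/` is selected but `bbb/aaa/` is not.

```shell
#!/bin/sh
# Hypothetical helper (not part of git or the BFG): read
# `git rev-list --all --objects` output on stdin ("<sha> <path>" per line)
# and print the ids of objects whose path lies under a root-level directory.
ids_under_root_dir() {
    dir=${1%/}   # tolerate a trailing slash in the argument
    awk -v prefix="$dir/" '
        {
            # Everything after the first space is the path (it may contain spaces).
            sp = index($0, " ")
            if (sp == 0) next               # commits are listed without a path
            path = substr($0, sp + 1)
            if (index(path, prefix) == 1)   # literal prefix match at the repo root
                print substr($0, 1, sp - 1)
        }'
}

# Example usage (run inside the repository):
#   git rev-list --all --objects | ids_under_root_dir aaa > ids.txt
```

Note that `git rev-list --objects` reports each object once, under the path where it was first reached, so an object that also exists outside `aaa/` may be listed under the other path. The result can also include tree ids, which would still need filtering by object type.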

rtyley added a commit that referenced this issue Mar 11, 2013
Git tools apparently don't like empty trees much (C-Git CLI certainly
makes it difficult to create empty trees, though it doesn't seem to mind
them if it encounters them?).

Also, if we're using multiple-blob-id removal to nuke directories (as
in issue #12) then leaving the empty husks of directories around will
just be ugly.

#12 (comment)
rtyley added a commit that referenced this issue Mar 13, 2013
Git tools apparently don't like empty trees much (C-Git CLI certainly
makes it difficult to create empty trees, though it doesn't seem to mind
them if it encounters them?).

Also, if we're using multiple-blob-id removal to nuke directories (as
in issue #12) then leaving the empty husks of directories around will
just be ugly.

#12 (comment)
rtyley added a commit that referenced this issue Mar 20, 2013
Git tools apparently don't like empty trees much (C-Git CLI certainly
makes it difficult to create empty trees, though it doesn't seem to mind
them if it encounters them?).

Also, if we're using multiple-blob-id removal to nuke directories (as
in issue #12) then leaving the empty husks of directories around will
just be ugly.

#12 (comment)
rtyley added a commit that referenced this issue Mar 20, 2013
Helps a little with issue #12. People can get a list of blob-ids using
"git rev-list --all --objects", then grep to list all files in
directories they want to nuke, and pass that to the BFG, as noted by
@Fryguy:

#12 (comment)
rtyley added a commit that referenced this issue May 29, 2013
This is not really the full implementation of issue #12 ("Ability to
remove directories"), because that issue actually requires
path-dependent pruning. This is just simple filtering based on
folder-name, but it should help this guy:

http://stackoverflow.com/q/16821649/438886

..and also Francis with https://github.com/janua/premiumparking

To create the nasty folder-named-'.git' repo, this is the code I used:

https://gist.github.com/rtyley/5673862
@yeago

yeago commented May 19, 2014

"Fairly blunt instrument" to say the least. So if someone committed a lib directory in the parent, I can't use this tool without removing all lib directories anywhere in the path?

@rtyley
Owner

rtyley commented May 19, 2014

So if someone committed a lib directory in the parent, I can't use this tool without removing all lib directories anywhere in the path?

Yep, that's true of the --delete-folders [foo] option. Folders named [foo] are removed from anywhere within history (apart from the latest commit, which is 'protected'), and yep, this is a blunt instrument.

Often though, it's reasonable to ask exactly what you're trying to achieve by removing the lib folder. I would guess you're just trying to make your repo smaller. In which case, you can get pretty close to that aim by just using --strip-blobs-bigger-than 10M (or whatever size is appropriate to your repository).

@yeago

yeago commented May 19, 2014

i can't use strip blobs now because i work at a company and i'm not sure about some of the files yet. i am sure about the directory in question.

honestly, in terms of design i don't see how a generic delete directory by name could be very useful. it certainly isn't in my case :P it seems like a loaded gun waiting to destroy current directories... but you said not the latest commit. does that mean currently existing directories called 'lib' won't be touched?

@rtyley
Owner

rtyley commented May 19, 2014

but you said not the latest commit. does that mean currently existing directories called 'lib' won't be touched?

Sure- here's some further documentation:

http://rtyley.github.io/bfg-repo-cleaner/#protected-commits

@lfilho

lfilho commented Sep 13, 2014

+1 for specifying an absolute path for removal.

My case is that unfortunately the team has committed a lot of libs in the repo (since SVN times...) and, until we refactor the project and put them into Maven or something external like that, we gotta keep the libs in the repo. But we did identify several libs that are useless by now, which could be removed to save a good amount of space in the repo...

So suppose I have:

/libs/certain-lib/2.1/certain-lib.jar
/libs/certain-lib/3.4/certain-lib.jar

In this example, we're still using version 3.4. So we couldn't delete "certain-lib" folder. And also we couldn't delete "certain-lib.jar" otherwise the 3.4 version would also go.

As for the performance, I think it's fine: for this full path case, you could just warn the users (manual, readme, before running the command...). I wouldn't mind at all.

Nonetheless, congrats and thanks a lot for this great tool!!!!!

@rtyley
Owner

rtyley commented Sep 13, 2014

In this example, we're still using version 3.4. So we couldn't delete "certain-lib" folder. And also we couldn't delete "certain-lib.jar" otherwise the 3.4 version would also go.

@lfilho in your example, you should be fine to do:

$ bfg --delete-files '*.jar'

...this will delete all jars that are not in your latest commit - because, by default, the BFG protects the contents of your latest commit. So /libs/certain-lib/2.1/certain-lib.jar will be deleted from your repo (because it's not present in your latest commit) - but /libs/certain-lib/3.4/certain-lib.jar won't be deleted (because it is present in your latest commit).

This command is short and sweet, and should definitely be used unless there's a good reason not to. Although I appreciate that for some use-cases path-dependent action is necessary, for the large majority of cases, it's not. For some of the cases where path-dependent action is necessary, there may actually already be a decent alternative tool (perhaps git-subtree, which is decently performant) that can perform the task.

I'm always very happy to hear explanations of why users do need path-dependent action, and if people explain the need here on this issue, that'll help me prioritise this feature. Of the two people who've discussed their requirements so far, FryGuy had a legitimate use-case, whereas yeago, I believe, would have been served perfectly well by the BFG's protected-commit behaviour.

As for the performance, I think it's fine: for this full path case, you could just warn the users (manual, readme, before running the command...). I wouldn't mind at all.

The cost of implementing path-dependent action:

  • Possibly Performance
  • Definitely a big chunk of dev time - almost certainly my time, unpaid, when I could be creating something more useful to more people.
  • Definitely complexity in the implementation of the BFG :-) The BFG implementation is relatively simple because it does not care about the path; implementing this feature would not be trivial.

Given the cost of implementing the feature, vs the benefit it provides to a limited percentage of users, what would you do!? Personally, I would like to try to implement it, but that will have to be in a world where I have considerably more time.

Nonetheless, congrats and thanks a lot for this great tool!!!!!

Thank you, I appreciate your thanks :-)

@kellyohair

First off, a great tool, and very much appreciated.

Our company went through a "split repos" stage under SVN, where one very large repo (200,000 files?), let's call it repo "a", containing 5 top-level directories (a1, a2, a3, a4, and a5), got turned into 5 separate SVN repos: a1, a2, a3, a4, and a5. I wasn't around when this happened, but apparently they copied "a" 5 times, then did SVN deletes to trim each one, then pulled the subdirs up to the top of each repo (e.g. for the "a1" repository: rm -f -r a2 a3 a4 a5 ; mv a1/* . ; rmdir a1).

So now the transition to 5 Git repositories (preserving at least the SVN source code change history, using git-svn) creates 5 rather bloated Git repositories. Some kind of simple "delete any file in the a1 repo whose full path starts with a prefix matching a[2-5]/" would be nice. For the most part, it's the top-level deleted SVN directories, or a simple prefix on the full path.

If I delete all a1, a2, a3, a4, and a5 directories, that might work, I'll try it, but when you are dealing with old SVN repositories and 100's of engineers with no proper repository rules, who knows what will happen. :(

Of course the biggest bloat comes from jar and zip files people shoved into the SVN repos over the years, but BFG does a great job on that.

@xanderdunn

I would also really love to see bfg support removing of specific subdirectories. This would make it useful in my situation.

@javabrett
Contributor

+1 for this feature. Sometimes it is prudent to prune entire paths from history. I imagine that this is a fairly common need. This can be achieved now, but at the moment it requires a lot of pre-BFG scripting to generate a list of objects that are in those delete-target trees but not in HEADs, then feed that to delete-by-objectId using -bi. It works, but it's pretty cumbersome.

Part of the special-sauce that makes the BFG so fast is that it is path-independent, which makes me cautious about adding path-dependent features - unless they can be added without affecting performance. Let me have a think about it, I can see it's a valid feature.

...

...but the code really is not set up to deal with paths, it's a non-trivial change to get it there :)

@rtyley Did you ever draw any performance-conclusions on this? Is it just a coding-exercise without fundamental overwhelming space/time concerns?

It looks like this wants scala-git Tree to be able to maintain maps on both the blob short/relative filename, and the full path, which would require generating the two maps and the storage required for that, but doesn't otherwise seem like a big burden.
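
The pre-BFG scripting described above boils down to a set difference: the candidate ids minus the ids that should survive. A minimal sketch, assuming both lists are plain files with one object id per line. The function name `ids_not_in_keep_list` and the file names in the usage comment are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the ids listed in the candidate file ($1) that
# do not appear in the keep file ($2). Both files hold one id per line.
# Input order is preserved and duplicate candidates are printed only once.
ids_not_in_keep_list() {
    awk -v keepfile="$2" '
        BEGIN {
            # Load the keep list up front so an empty keep file is handled too.
            while ((getline line < keepfile) > 0) keep[line] = 1
        }
        !($0 in keep) && !seen[$0]++' "$1"
}

# Example usage (file names are placeholders):
#   candidates.txt : blob ids found under the directory to delete
#   keep.txt       : blob ids in the current tree, for example from
#                    git ls-tree -r HEAD | awk '{ print $3 }'
#   ids_not_in_keep_list candidates.txt keep.txt > to-delete.txt
#   java -jar bfg.jar --strip-blobs-with-ids to-delete.txt
```

The keep list here covers only `HEAD`; a repository with several branch heads would need the ids from each of them.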

@ltrzesniewski

ltrzesniewski commented Jul 19, 2016

Here's a workaround to remove a given directory by path with BFG:

git rev-list --all --objects -- path/to/the/directory/to/delete | git cat-file --batch-check='%(objectname) %(objecttype) %(rest)' | grep -Pe '^\w+ blob' | cut -d' ' -f1 > ./to-delete.txt
java -jar bfg.jar --no-blob-protection --strip-blobs-with-ids ./to-delete.txt

The principle is simple: create a list of object IDs to strip, and input that to BFG. This means that if an object is referenced through a different path it will be nuked nonetheless.

  • git rev-list --all --objects -- path/to/the/directory/to/delete
    This will list all objects in the subdirectory referenced in all commits which modify the given path. The format is objectid filepath.

    You should run this command to check its output matches what you'd expect.

  • git cat-file --batch-check='%(objectname) %(objecttype) %(rest)'
    This will qualify the object with its type. It will turn the previous format objectid filepath into objectid type filepath.

  • grep -Pe '^\w+ blob'
    This will filter out non-blob objects.

  • cut -d' ' -f1 > ./to-delete.txt
    This will extract the object ID and redirect the output into the to-delete.txt file.

  • java -jar bfg.jar --no-blob-protection --strip-blobs-with-ids ./to-delete.txt
    This runs BFG, giving it the list of objects to remove.

Needless to say, it's much faster than git filter-branch 😄
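
`grep -P` is a GNU extension and is not available in the BSD grep shipped with macOS. A minimal sketch of an equivalent filter using awk, assuming the `objectid type filepath` format produced by the `--batch-check` step above. The function name `blob_ids_only` is hypothetical.

```shell
#!/bin/sh
# Hypothetical replacement for: grep -Pe '^\w+ blob' | cut -d' ' -f1
# Reads "<objectid> <type> <path>" lines on stdin and prints the ids of
# blob objects only.
blob_ids_only() {
    awk '$2 == "blob" { print $1 }'
}

# Example usage:
#   git rev-list --all --objects -- path/to/the/directory/to/delete \
#     | git cat-file --batch-check='%(objectname) %(objecttype) %(rest)' \
#     | blob_ids_only > ./to-delete.txt
```

Comparing the second field exactly avoids depending on regex dialects, and the path is never inspected, so file names containing spaces do not affect the result.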

@bschindler

Here's a workaround to remove a given directory by path with BFG:

You have to be careful with this approach. If a file has been copied from another location, this approach will also delete it in the other location, because git stores identical content under the same hash regardless of path. It has happened to me on trial runs a number of times.

The best way to deal with this is to delete the directory in git first, commit and then run your script but without blob-protection. This seemed to have worked for me.
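
The danger described above can be checked for before running the BFG. A minimal sketch, assuming `git ls-tree -r` output (`<mode> <type> <id><TAB><path>`) for one or more commits on stdin. The function name `ids_shared_outside` is hypothetical. It prints the blob ids that occur both under the directory to delete and at some path outside it, which are the blobs an id-based strip would also remove elsewhere.

```shell
#!/bin/sh
# Hypothetical helper: read `git ls-tree -r` output for any number of commits
# on stdin and print blob ids seen both inside and outside the given directory.
# Limitation: paths that git prints in quoted form (unusual characters) will
# not match the prefix and are counted as outside.
ids_shared_outside() {
    dir=${1%/}
    awk -v prefix="$dir/" '
        {
            tab = index($0, "\t")
            if (tab == 0) next
            split(substr($0, 1, tab - 1), meta, " ")   # mode, type, id
            if (meta[2] != "blob") next                # skip submodule entries
            path = substr($0, tab + 1)
            if (index(path, prefix) == 1) inside[meta[3]] = 1
            else outside[meta[3]] = 1
        }
        END {
            for (id in inside)
                if (id in outside) print id
        }' | sort
}

# Example usage (slow on large histories, since it walks every commit):
#   git rev-list --all | while read -r c; do git ls-tree -r "$c"; done \
#     | ids_shared_outside path/to/the/directory/to/delete
```

`git ls-tree -r` is used rather than `git rev-list --objects` because the latter reports each object only once, so it cannot show that the same blob lives at two paths.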

@ltrzesniewski

@bschindler yes that's dangerous, that's what I said in bold in my comment.
I made a PR which enables a safe method, see #166 - unfortunately the maintainer doesn't seem to care about PRs.

@rtyley
Owner

rtyley commented Aug 23, 2016

I do, I just get through them real slowly

@ltrzesniewski

@rtyley oh sorry I just noticed you started working on the project again recently

@ghost

ghost commented May 1, 2017

I erroneously included a folder containing release builds in my git repo. Can I use BFG to undo this, removing that folder and its contents from my git history? It's just been sitting there taking up space for no reason.

@Fryguy
Author

Fryguy commented May 1, 2017

@TharosTheDragon If the directory name is consistent throughout history, and doesn't conflict by name with other directories in the tree, then you could use --delete-dirs <glob> - delete directories with the specified names. Note that command is based on name, not path, so if you have multiple directories with the same name, even at different depths, they will both be removed.

@ghost

ghost commented May 1, 2017

What's the difference between --delete-dirs and --delete-folders?

@Fryguy
Author

Fryguy commented May 1, 2017

I'm sorry...I copied the wrong string, not paying attention...should be --delete-folders <glob>

@ghost

ghost commented May 1, 2017 via email

@gqy117

gqy117 commented Dec 20, 2017

Thanks to @ltrzesniewski for his awesome answer.
In my case, I need to delete 2 files with full path provided.
So I tweaked @ltrzesniewski 's answer:

git rev-list --all --objects | grep -P '^\w+ Path/to/your/file1.txt' | cut -d" " -f1 >> ./to-delete.txt
git rev-list --all --objects | grep -P '^\w+ Path/to/your/file2.txt' | cut -d" " -f1 >> ./to-delete.txt

java -jar bfg.jar --no-blob-protection --strip-blobs-with-ids ./to-delete.txt
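
In the `grep -P` patterns above, the unescaped `.` matches any character and the pattern is not anchored at the end, so `file1.txt` would also match a path such as `file1.txt.bak`. A minimal sketch that compares paths literally instead, assuming `git rev-list --all --objects` output on stdin. The function name `ids_for_exact_paths` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read `git rev-list --all --objects` output on stdin
# ("<sha> <path>" per line) and print the ids whose path is exactly equal
# to one of the paths given as arguments.
ids_for_exact_paths() {
    awk '
        BEGIN {
            # Treat the arguments as wanted paths, then blank them so awk
            # reads stdin instead of trying to open them as input files.
            for (i = 1; i < ARGC; i++) { want[ARGV[i]] = 1; ARGV[i] = "" }
        }
        {
            sp = index($0, " ")
            if (sp == 0) next                 # commits are listed without a path
            path = substr($0, sp + 1)
            if (path in want) print substr($0, 1, sp - 1)
        }' "$@"
}

# Example usage:
#   git rev-list --all --objects \
#     | ids_for_exact_paths Path/to/your/file1.txt Path/to/your/file2.txt \
#     > ./to-delete.txt
```

A single pass also avoids running `git rev-list` once per file.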

abaga129 pushed a commit to abaga129/glean that referenced this issue Jul 18, 2019
Git tools apparently don't like empty trees much (C-Git CLI certainly
makes it difficult to create empty trees, though it doesn't seem to mind
them if it encounters them?).

Also, if we're using multiple-blob-id removal to nuke directories (as
in issue #12) then leaving the empty husks of directories around will
just be ugly.

rtyley/bfg-repo-cleaner#12 (comment)
abaga129 pushed a commit to abaga129/glean that referenced this issue Jul 18, 2019
Helps a little with issue #12. People can get a list of blob-ids using
"git rev-list --all --objects", then grep to list all files in
directories they want to nuke, and pass that to the BFG, as noted by
@Fryguy:

rtyley/bfg-repo-cleaner#12 (comment)