Feature: Detect hard links #27
Very true. I've actually been bothered by this issue but didn't get around to dealing with it. Thanks for bringing it up!

IMO the best solution would be something where we don't have to assume whether the hard links come from within the scanned folder or from outside it (I feel that might make things complicated and create some edge cases we haven't considered). What do you think of giving a UI indication that the file/folder is a hard link (or has a hard link to it), and "doing the right thing" when the user deletes it? (Delete all occurrences of it in the UI and only indicate the actual amount of disk space freed by it.)

As for detecting hard links: is it a very performance-heavy process? I haven't tested this, but since we're getting the file metadata anyway, can't we keep some state of "which files point to this inode" in order to know this? Or am I missing something?
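For reference, the link count (`st_nlink`) comes back with the same metadata call used for sizes, so the bookkeeping could be as simple as the following sketch (Python purely for illustration; diskonaut itself is Rust, and every name below is made up):

```python
import os
from collections import defaultdict

def paths_by_inode(root):
    """Map (device, inode) -> all paths under `root` that share that inode."""
    links = defaultdict(list)
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)        # same metadata call we already make for sizes
            if st.st_nlink > 1:        # the link count comes along for free
                links[(st.st_dev, st.st_ino)].append(path)
    return links

# Entries with more than one path are hard links to each other; entries with a
# single path have their other links somewhere outside the scanned folder.
```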
If I recall correctly, the link count is part of the metadata we already get back from `stat`, so noticing that a file has more than one link is essentially free; the slower part is working out where the other links actually are.
Here I believe the worst case is if we have 2 large directories, say D1 and D2, and suppose that every file in one is a hard link to a file in the other.

Then as we start computing the size of D1, nothing unusual happens: no hard links have appeared yet, so D1 simply gets charged the full size of every file.

Moving on to D2, every file we look at points to an inode we have already seen. Handling that is the potentially expensive part, but I don't know off the top of my head how expensive this is.

This leads to the counter-intuitive result that two directories with identical contents can report completely different sizes depending on which one happened to be scanned first. If we take the other perspective (the space we are using if we don't break the links in a copy) we get the opposite effect.

For our purposes here, I think the asymptotically optimal data structure would be something like a map from inode to the file's size together with the number of its links we haven't seen yet.
Then (with Python-esque dictionary literals) the process would look something like the following.
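A rough sketch of one way that pass could look; the `(device, inode)` key and the `[size, links_not_yet_seen]` value layout are illustrative assumptions:

```python
import os

def freeable_size(root):
    total = 0                # bytes that deleting `root` would actually free
    pending = {}             # (device, inode) -> [size, links_not_yet_seen]
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            st = os.lstat(os.path.join(dirpath, name))
            if st.st_nlink == 1:
                total += st.st_size                  # ordinary file: count it once
                continue
            key = (st.st_dev, st.st_ino)
            entry = pending.setdefault(key, [st.st_size, st.st_nlink])
            entry[1] -= 1
            if entry[1] == 0:                        # every link is inside `root`
                total += entry[0]
                del pending[key]                     # prune fully-accounted-for inodes
    return total, pending                            # `pending` still has links outside `root`
```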
Some points about the above algorithm:

- The worst-case scenario, once fully-accounted-for inodes are pruned out, would likely be a collection of hard-linked files many generations apart from their linked counterparts.
Hey @dbramucci - thanks again for the super detailed explanation. I think I understand your concerns better now.

If an inode is a unique global identifier of a hard link (as in: "I am a hard link if the inode of the file I point to has more than 1 hard link pointed at it"), why can't we use a global hashmap whose keys are inodes? We update it every time we add a file, and silently ignore any file with an inode already in this table. We couple this with a UI change, swapping the displayed size for the space that would actually be freed.

Then we are focused on "space that would be freed upon deletion", and the only edge case we don't deal with very nicely is a hard link that leads outside our scanned folder (I think we can live with that). I think this is essentially the same user-facing behaviour we have right now with soft-links (seeing as they take up no space, so we don't show them).

Please do tell me if I'm not taking something into consideration. What do you think?
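Spelled out as a sketch (illustrative Python only; the `seen` set and the `(device, inode)` key are assumptions), this is essentially the first-occurrence-wins accounting that `du` also uses to avoid double-counting hard links:

```python
import os

seen = set()                      # (device, inode) pairs already charged somewhere in the scan

def displayed_size(path):
    """Full size the first time an inode shows up, 0 for every later occurrence."""
    st = os.lstat(path)
    key = (st.st_dev, st.st_ino)
    if key in seen:
        return 0                  # silently ignore files whose inode is already in the table
    seen.add(key)
    return st.st_size
```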
Just to make sure that we aren't talking past each other over a misconception: to my understanding, when you make a hard link from one name to another, you don't get a "pointer" file and a "real" file; you get two directory entries of equal standing for the same inode, and the data is only freed once every name referring to that inode is gone.

This means that in my original example with the two directories, whichever one we scan first would get charged all of the space, even though the two folders are identical in every way except for file names.

It would make more sense to use a ChainMap (Python example linked because I know it off the top of my head) and say that all files are links if their parents have a hard link to that same file, because it would improve consistency, but I feel like it wouldn't match up to any mental model or goal a user would have.

If we treat all hard links like symlinks are treated today (i.e. we ignore them from all counts), then by the initial discussion here both copies would simply disappear from the totals. Consider the Sherlock example. With today's diskonaut, we would incorrectly say that they could save 1.2M of space by deleting that directory, because we currently treat the 2 files separately. With the "all hard links are like symlinks and won't get counted" diskonaut, we would say that there are 0B of space to free here. If we use the version you propose, it would work until we zoomed in to a folder holding only one of the linked copies.

As to edge cases for hard links, I also think that hard links may frequently lead outside the user's folder (i.e. a bunch of users might have a hard link in their home folder to one global virtual machine image) and the behavior could be weird for each user (the details depend on the implementation).
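To see the "two equal names for one inode" point concretely with the files from the issue (assuming `sherlock.txt` and `sherlock-link.txt` still exist side by side):

```python
import os

a = os.stat("sherlock.txt")
b = os.stat("sherlock-link.txt")
print(a.st_dev == b.st_dev and a.st_ino == b.st_ino)  # True: one inode, two directory entries
print(a.st_nlink, b.st_nlink)                         # 2 2: the data lives until both names are gone
```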
Minor detail, but (even ignoring metadata) soft-links generally require some space (on my system it appears to require 4K of space):

```
~/tmp> touch test.txt
~/tmp> ln -s test.txt link
~/tmp> ls
link@  test.txt
~/tmp> ls -hs
total 4.0K
4.0K link@     0 test.txt
```

and yes, it takes more space than an empty text file.
Okay, that is actually a misconception I've had. Thank you for clearing this up; I think I now understand where you're coming from.

Before talking about the algorithm and its implementation, how do you think it would be best to present this to the user in a simple and immediately comprehensible way? Like with the D1+D2 example, what would you do? (I think you mentioned the lower/higher bound approach above, but I didn't understand how we would see this on screen.) The reason I'm asking is that I feel if we don't find a really simple way, we'd be hurting a lot of users for the sake of an edge case.
Given that the current behavior of diskonaut is to start by assuming a file is empty and to grow it until it reaches its full size, and that the most likely use-case for diskonaut is to free space on a drive, I believe the most natural extension is to start by assuming all hard links would free no space (i.e. they are effectively 0 bytes) and just track how many we see. As we loop through the directory, we sum up all of the non-linked file sizes; when we get to the end we can see whether deleting the directory would also free the linked data (i.e. all hard links to an inode are contained within that directory), and if so we add that space to the space for that directory (a sketch of this check appears at the end of this comment).

Users would not perceive any difference in behavior compared to today, except for smaller (and more accurate) file sizes being reported. The most confusing aspect for your average user would be that diskonaut would disagree with the sizes that other utilities (like `du`) report. This would be confusing to anyone who doesn't understand or recognize the relevance of hard links here. (I had a couple of really rough sketches of how the listing could indicate this.)
As for the more complicated interface I suggested at the beginning (upper and lower bounds on the freeable space before the slower hard-link resolution has run), something like a size range could be shown for the affected entries. But if you don't call for that slower pass, the bounds are all you would ever see.
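Regarding the "all hard links to an inode are contained within the directory" check mentioned above, a small illustrative sketch (assuming we already collected per-inode path lists during the scan): the size of a multi-link inode only becomes freeable at the deepest directory that contains all of its paths.

```python
import os

def freeable_credit(paths_by_inode_sizes):
    """paths_by_inode_sizes: (device, inode) -> (size, [paths]), only for inodes whose
    every link was found during the scan. Returns: directory -> bytes that deleting
    that directory (or any ancestor of it) would additionally free."""
    credit = {}
    for (dev, ino), (size, paths) in paths_by_inode_sizes.items():
        deepest = os.path.commonpath([os.path.dirname(p) for p in paths])
        credit[deepest] = credit.get(deepest, 0) + size
    return credit
```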
As an aside, it would be nice if there were some good way to show that two on-screen entries are really the same underlying file.
I'm going to suggest some ideas for the UI first, because I feel that once we know what we'd like to show the user, it will be easier to think about how to do it, and thus to know what we should implement and what we won't be needing. At least, that's how it's easiest for me to work. :)

So regarding UI: I'd add a marker for hard links to the legend, and then have the legend only appear if we actually have hard links on screen. Then, in folders that include hard links, we'll have what we have now, plus an additional line under it (assuming there is space for it) indicating how much of the folder's space is shared through hard links.
Something to be aware of: any attempt to graphically represent hard links has to deal with the fact that the standard mental model of a filesystem is a tree, while hard links turn the filesystem into a kind of non-tree directed acyclic graph. On the flip side, most filesystems use hard links sparingly, so a tree remains a close approximation, and trees offer cleaner, less complicated and less cluttered visualizations than generalized DAGs (directed acyclic graphs) do.

At a first glance, your proposal demonstrates the following

Pros

Cons

Of course, we shouldn't let the perfect be the enemy of the good.
A handy capability would be to highlight a hard link and press some key to pop up a list of the other paths that link to the same file. That way you could highlight `sherlock.txt` and immediately see where its other names live before deciding what to delete.

Likewise, there are analogous popups that could be designed for folders containing hard links.
I'm trying to find the right terminology to distinguish between the different link situations we've been discussing. Once I have figured out some terminology, I can describe what we are looking at in terms of graph theory, along with some relevant terms that can be googled to find similar situations, algorithms and visualizations.
Running the following commands (downloading a ~600 KiB copy of a Sherlock Holmes text and hard-linking it with `ln`) will produce a 600 KiB file `sherlock.txt` and a hard link `sherlock-link.txt`. `diskonaut` currently detects these 2 as different files, but deleting either file will not free any disk space (beyond a little metadata).

This is similar to the problem of issue #26, where the expectation of "xG of deleted data = xG of more usable space" proves incorrect. This problem is typical of hard links (see `ln`'s behavior), but it would be cool if `diskonaut` could deal with them nicely.

UI idea

Put upper and lower bounds on disk-space before running the slower process of detecting the details of hard-linked files. By checking for files with a link-count of 1, you can identify non-hard-linked files. If you assume that all hard-linked files are also used outside the current folder (this assumption may perhaps be modified for the root directory), then you get a lower bound on the space freed upon deletion. If you assume that all hard links are within the directory, you get an upper bound on the space freed upon deletion (this is the current behavior of `diskonaut`).
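A small illustrative sketch of those bounds (Python; `freed_space_bounds` and its internals are assumptions, not diskonaut code): files with a link count of 1 are certain either way, while hard-linked files count toward the upper bound only.

```python
import os

def freed_space_bounds(root):
    """Lower/upper bound on bytes freed by deleting `root`, before resolving hard links."""
    lower = upper = 0
    counted = set()                          # (device, inode) already added to the upper bound
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            st = os.lstat(os.path.join(dirpath, name))
            if st.st_nlink == 1:             # no other names anywhere: definitely freed
                lower += st.st_size
                upper += st.st_size
            else:
                key = (st.st_dev, st.st_ino)
                if key not in counted:       # optimistic case: all other links are inside `root`
                    counted.add(key)
                    upper += st.st_size
                # pessimistic case: some link lives outside `root`, so nothing is freed
    return lower, upper
```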