Added fuzzy image matching with pybktree and HEIF/HEIC support #56
base: master
bolshevik commented on Jan 19, 2019 (edited)
- Adds fuzzy matching of images based on Hamming distance
- Updates all dependencies
- Adds HEIF/HEIC support on macOS and Linux
…y size for duplicates.
17de864 to b666c2d
@bolshevik what is the way to determine the threshold? I've tried
@phanirithvij I believe you have many duplicates, and you should indeed see 250 groups of duplicates, but each group should contain more images unless there is a bug. Let me explain the idea: the images are hashed and represented as binary strings. With fuzzy matching it is possible to find not only fully equal strings but also those that differ by a number of flipped bits (10 in your case). At some point this also leads to false positives (by human judgment, though not by the math), but only if the threshold allows too many bits. I have seen thresholds of up to 100 perform well, though this depends on your image set. Take a look at this snippet:
When the distance is 0, it behaves exactly like the original implementation but is much slower, so I recommend cleaning up exact matches with the original flow first and then continuing with fuzzy matching:
But if you increase the distance to 2:
You can see that the output grows to the right a lot. I then eliminate duplicated groups by ensuring that each value appears only once:
There might still be some cross-references; note the "1" and "11" in each list. This happens because 0 and 111 are < 2 distant to 1 and 11. I hope this explains the idea; if you think there is a bug somewhere, please let me know.
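A minimal sketch of the behaviour described in this comment, not the PR's actual code. It uses a tiny hand-written BK-tree as a stand-in for pybktree, treats the threshold as inclusive, and reads "all values appearing only once" as "skip any hash that already landed in an earlier group". The names `hamming`, `BKTree`, and `fuzzy_groups`, and the four sample hashes (0, 1, 11, 111 in binary), are assumptions chosen for illustration.

```python
def hamming(a: int, b: int) -> int:
    """Number of differing bits between two integer hashes."""
    return bin(a ^ b).count("1")


class BKTree:
    """Tiny BK-tree for illustration only (hypothetical stand-in for pybktree)."""

    def __init__(self, distance):
        self.distance = distance
        self.root = None  # a node is a tuple: (item, {distance_to_child: child_node})

    def add(self, item):
        if self.root is None:
            self.root = (item, {})
            return
        node = self.root
        while True:
            d = self.distance(item, node[0])
            child = node[1].get(d)
            if child is None:
                node[1][d] = (item, {})
                return
            node = child

    def find(self, item, n):
        """Return sorted (distance, item) pairs with distance <= n."""
        found = []
        stack = [] if self.root is None else [self.root]
        while stack:
            value, children = stack.pop()
            d = self.distance(item, value)
            if d <= n:
                found.append((d, value))
            # Triangle inequality: only subtrees keyed in [d - n, d + n] can match.
            stack.extend(c for k, c in children.items() if d - n <= k <= d + n)
        return sorted(found)


def fuzzy_groups(hashes, n):
    """Group hashes within Hamming distance n, skipping hashes already grouped."""
    tree = BKTree(hamming)
    for h in hashes:
        tree.add(h)
    seen, groups = set(), []
    for h in hashes:
        if h in seen:
            continue  # already part of an earlier group
        group = [item for _, item in tree.find(h, n)]
        seen.update(group)
        if len(group) > 1:  # a lone hash is not a duplicate group
            groups.append(group)
    return groups


hashes = [0b0, 0b1, 0b11, 0b111]
for n in (0, 2):
    print(n, [[format(h, "b") for h in g] for g in fuzzy_groups(hashes, n)])
# 0 []
# 2 [['0', '1', '11'], ['111', '11', '1']]
```

With distance 0 only exact duplicates would be grouped, and there are none in this sample. With distance 2 the two groups still share "1" and "11", which is the cross-reference effect mentioned above.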
Thanks for your reply. There were indeed many images in the groups. I tested your pybktree implementation in my fork, https://github.com/phanirithvij/duplicate-images, because I couldn't run your fork directly on Windows (the magic module doesn't build on Windows), so I had to make some additional changes. Your pybktree implementation, however, is copied as is. Now there is a bug: using a zero threshold yields 0 duplicates.
This is the original output.
PyBKTree takes roughly the same time but outputs nothing. Output for multiple thresholds
@phanirithvij I think I have found the problem; could you please check bolshevik#1? This should fix the wrong behavior when there are many equal images.
…ations.
* Update all dependencies
* Add support of different hashes at the same time
* Redesign HTML page to vertically slice duplicates.
b988d39 to 52e1bdf
- Add no_duplicates page from philipbl#71
- Improve fuzzy search for the new logic of multiple hashes
- Add table layout of duplicates
- Extend documentation
- Fix pylint and pycodestyle issues, add Ubuntu 22.04 test scripts
- Adapt tests