Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added fuzzy image matching with pybktree and HEIF/HEIC support #56

Open
wants to merge 11 commits into
base: master
Choose a base branch
from

Conversation

bolshevik
Copy link

@bolshevik bolshevik commented Jan 19, 2019

  • This adds a fuzzy matching of images based on Hamming distance
  • All dependencies updated
  • HEIF/HEIC support on MacOS and Linux

@bolshevik bolshevik force-pushed the master branch 4 times, most recently from 17de864 to b666c2d Compare January 17, 2021 17:52
@phanirithvij
Copy link

@bolshevik what is the way to determine the threshold? I've tried 10 with 10 thousand images and only 250 duplicates came out. Whereas with the original version (no pybktree) there were 1200 duplicates found.

@bolshevik
Copy link
Author

bolshevik commented Jan 22, 2021

@phanirithvij I believe that you are having many duplicates and actually you should see 250 groups of duplicates, but each group should contain more images unless there is a bug. Let me explain the idea: the images are hashed and represented by binary strings, with this fuzzy matching it is possible to find not only fully equal strings but also those having a number of bits flipped (10 in your case). At some point this will also lead to false positives (according to us humans, but not the math) but only if you take too many bits as a threshold. I was seeing up to 100 performing well (but again depends on the image set you have).

Take a look at this snippet:

import pybktree

distance = 0
input = [0, 0, 1, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 14]

tree = pybktree.BKTree(pybktree.hamming_distance, input)

for val in input:
    matches = tree.find(val, distance)
    print("{0:b}".format(val) + ": " + str(
        ["{1:b}".format(distance, matched) for distance, matched in matches])
    )

When the distance is 0, it behaves exactly the same as the original implementation, but much slower, therefore I recommend to clean up exact matches using the original flow first, and then continue with fuzzy:

0: ['0', '0']
0: ['0', '0']
1: ['1', '1']
1: ['1', '1']
10: ['10']
11: ['11']
100: ['100']
101: ['101']
110: ['110']
111: ['111']
1000: ['1000']
1001: ['1001']
1010: ['1010']
1011: ['1011']
1100: ['1100']
1101: ['1101']
1110: ['1110', '1110']
1111: ['1111']
1110: ['1110', '1110']

But if you increase the distance to 2:

0: ['0', '0', '1', '1', '10', '100', '1000', '11', '101', '1100', '110', '1010', '1001']
0: ['0', '0', '1', '1', '10', '100', '1000', '11', '101', '1100', '110', '1010', '1001']
1: ['1', '1', '0', '0', '11', '101', '1001', '111', '10', '1011', '100', '1101', '1000']
1: ['1', '1', '0', '0', '11', '101', '1001', '111', '10', '1011', '100', '1101', '1000']
10: ['10', '0', '0', '11', '110', '1010', '1', '111', '1', '1011', '100', '1000', '1110', '1110']
11: ['11', '1', '111', '1', '10', '1011', '0', '0', '1111', '101', '110', '1010', '1001']
100: ['100', '0', '0', '101', '1100', '110', '1', '111', '1', '10', '1101', '1000', '1110', '1110']
101: ['101', '1', '111', '1', '100', '1101', '0', '0', '11', '1111', '1100', '110', '1001']
110: ['110', '111', '10', '100', '1110', '1110', '0', '0', '11', '1111', '101', '1100', '1010']
111: ['111', '11', '1111', '101', '110', '1', '1', '10', '1011', '100', '1101', '1110', '1110']
1000: ['1000', '0', '0', '1100', '1010', '1001', '1', '1', '10', '1011', '100', '1101', '1110', '1110']
1001: ['1001', '1', '1', '1011', '1101', '1000', '0', '0', '11', '1111', '101', '1100', '1010']
1010: ['1010', '10', '1011', '1000', '1110', '1110', '0', '0', '11', '1111', '1100', '110', '1001']
1011: ['1011', '11', '1111', '1010', '1001', '1', '111', '1', '10', '1101', '1000', '1110', '1110']
1100: ['1100', '100', '1101', '1000', '1110', '1110', '0', '0', '1111', '101', '110', '1010', '1001']
1101: ['1101', '1111', '101', '1100', '1001', '1', '111', '1', '1011', '100', '1000', '1110', '1110']
1110: ['1110', '1110', '1111', '1100', '110', '1010', '111', '10', '1011', '100', '1101', '1000']
1111: ['1111', '111', '1011', '1101', '1110', '1110', '11', '101', '1100', '110', '1010', '1001']
1110: ['1110', '1110', '1111', '1100', '110', '1010', '111', '10', '1011', '100', '1101', '1000']

You can see that it grows to the right a lot, then I am actually eliminating duplicated groups by ensuring that all values are appearing only once:

0: {'1001', '1010', '0', '10', '11', '101', '1000', '100', '110', '1100', '1'}
111: {'1111', '1101', '1110', '10', '11', '111', '101', '100', '110', '1011', '1'}

There still might be some cross-references, mind "1" or "11" in each list. This is happening because 0 and 111 are < 2 distant to 1 and 11

I hope it explains the idea, if you think there is a bug somewhere, please let me know.

@phanirithvij
Copy link

phanirithvij commented Jan 22, 2021

Thanks for your reply. There were many images in the groups indeed.

I tried to test your pybktree implementation in my fork https://github.com/phanirithvij/duplicate-images because I couldn't directly run your fork on windows (magic module doesn't build on windows) so I had to make some additional changes.

But your pybktree implementation is, as it is copied.

Now there is a bug using zero threshold yields 0 duplicates

$ time py duplicate_finder.py find --threshold=0 --print
Started database...
Finding fuzzy duplicates, it might take a while... 0
100%[]
Number of duplicates: 0
Stopped database...

real    0m7.602s
user    0m0.000s
sys     0m0.016s

this is the original output.

$ time py duplicate_finder.py find --print
Number of duplicates: 1434
Stopped database...

real    0m7.625s
user    0m0.015s
sys     0m0.000s

PyBKTree takes roughly the same time but outputs nothing.

output for multiple thresholds for i in {0..15}; do time py duplicate_finder.py find --threshold=$i --print; done

expand
Rithvij@RithvijL  /d/Projects/fun/pythone/duplicate-images (master)
$ for i in {0..15}; do time py duplicate_finder.py find --threshold=$i --print; done
Started database...
Finding fuzzy duplicates, it might take a while... 0
100%Number of duplicates: 0
Stopped database...

real    0m8.096s
user    0m0.000s
sys     0m0.015s
Started database...
Finding fuzzy duplicates, it might take a while... 1
100%Number of duplicates: 0
Stopped database...

real    0m8.055s
user    0m0.000s
sys     0m0.015s
Started database...
Finding fuzzy duplicates, it might take a while... 2
100%Number of duplicates: 124
Stopped database...

real    0m14.636s
user    0m0.000s
sys     0m0.000s
Started database...
Finding fuzzy duplicates, it might take a while... 3
100%Number of duplicates: 124
Stopped database...

real    0m14.597s
user    0m0.000s
sys     0m0.015s
Started database...
Finding fuzzy duplicates, it might take a while... 4
100%Number of duplicates: 209
Stopped database...

real    0m28.277s
user    0m0.000s
sys     0m0.015s
Started database...
Finding fuzzy duplicates, it might take a while... 5
100%Number of duplicates: 209
Stopped database...

real    0m28.244s
user    0m0.000s
sys     0m0.015s
Started database...
Finding fuzzy duplicates, it might take a while... 6
100%Number of duplicates: 241
Stopped database...

real    0m52.954s
user    0m0.000s
sys     0m0.015s
Started database...
Finding fuzzy duplicates, it might take a while... 7
100%Number of duplicates: 241
Stopped database...

real    0m52.360s
user    0m0.015s
sys     0m0.000s
Started database...
Finding fuzzy duplicates, it might take a while... 8
100%Number of duplicates: 280
Stopped database...

real    2m33.524s
user    0m0.015s
sys     0m0.046s
Started database...
Finding fuzzy duplicates, it might take a while... 9
100%Number of duplicates: 280
Stopped database...

real    2m47.336s
user    0m0.031s
sys     0m0.016s
Started database...
Finding fuzzy duplicates, it might take a while... 10
100%Number of duplicates: 305
Stopped database...

real    4m7.267s
user    0m0.016s
sys     0m0.031s
Started database...
Finding fuzzy duplicates, it might take a while... 11
100%Number of duplicates: 305
Stopped database...

real    4m7.897s
user    0m0.000s
sys     0m0.061s
Started database...
Finding fuzzy duplicates, it might take a while... 12
100%Number of duplicates: 324
Stopped database...

real    5m53.380s
user    0m0.031s
sys     0m0.047s
Started database...
Finding fuzzy duplicates, it might take a while... 13
100%Number of duplicates: 324
Stopped database...

real    5m49.925s
user    0m0.031s
sys     0m0.016s
Started database...
Finding fuzzy duplicates, it might take a while... 14
100%Number of duplicates: 341
Stopped database...

real    7m47.925s
user    0m0.000s
sys     0m0.015s
Started database...
Finding fuzzy duplicates, it might take a while... 15
100%Number of duplicates: 341
Stopped database...

real    7m47.621s
user    0m0.015s
sys     0m0.031s

Rithvij@RithvijL  /d/Projects/fun/pythone/duplicate-images (master)
$ time py duplicate_finder.py find --print
Started database...
Number of duplicates: 1434
Stopped database...

real    0m5.680s
user    0m0.000s
sys     0m0.046s

@bolshevik
Copy link
Author

@phanirithvij I think I have found the problem, could you please check bolshevik#1 . This should solve the wrong behavior when having many equal images.

@bolshevik bolshevik changed the title Added fuzzy image matching with pybktree Added fuzzy image matching with pybktree and HEIF/HEIC support Jan 24, 2021
…ations.

* Update all dependencies
* Add support of different hashes at the same time
* Redesign HTML page to vertically slice duplicates.
@bolshevik bolshevik force-pushed the master branch 9 times, most recently from b988d39 to 52e1bdf Compare December 22, 2023 22:56
- Add no_duplicates page from philipbl#71
- Improve fuzzy search for the new logic of multiple hashes
- Add table layout of dupliacates
- Extend documentation
- pylint and pycodestyle fixed, add ubuntu 22.04 test scripts
- Adapt tests
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants