Update video duplicate finder and more #1425

Open: wants to merge 11 commits into base: master

@qarmin (Owner) commented Dec 29, 2024

  • Updated vid_dup_finder_lib to the latest version
  • Used the jxl -> image-rs converter provided by the jxl library
  • Used a bigger buffer size to speed up reading files on both HDD and SSD (the biggest gains are on HDD)
  • Added an option to enable fast-image-resize

Performance comparison of an array and a vector with specific buffer sizes, used for reading files from disk and calculating their hashes.
This is a fairly realistic scenario; it also uses rayon, which sometimes makes the results less predictable.
My computer has a fairly good CPU but a cheap SATA SSD, so the results mostly reflect disk performance.

(Time to read files and calculate hashes in parallel, smaller is better)

| Name | 250000 files ~50 KB (SSD) | 170 files 5MB-150MB (SSD) | 1 file 0.9 GB (SSD) | 6200 files 50KB - 50MB (HDD) | 1 file 671 MB (HDD) |
|---|---|---|---|---|---|
| Array 16KB | Base | Base | Base | Base | Base |
| Vector 16KB | 0% | 0% | 0% | 0% | 0% |
| Vector 1MB | -7% | -4% | -16% | -45% | 0% |
| Thread local Vector 1MB | -12% | -4% | -16% | -45% | 0% |
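
For readers who want to see what the "Array 16KB" and "Thread local Vector 1MB" variants could look like, here is a minimal sketch. It is not the code from this PR: `READ_BUFFER`, `hash_reader`, `hash_file_array` and `hash_file_thread_local` are hypothetical names, FNV-1a is only a stand-in for the real hash, and only the standard library is used. In the measured scenario the per-file calls would run on rayon worker threads, so each worker would get its own buffer.

```rust
use std::cell::RefCell;
use std::fs::{self, File};
use std::io::{self, Read};
use std::path::Path;

// Buffer sizes compared in the table above.
const SMALL_BUF: usize = 16 * 1024;
const LARGE_BUF: usize = 1024 * 1024;

// FNV-1a constants; FNV-1a is only a stand-in hash for this sketch.
const FNV_OFFSET: u64 = 0xcbf2_9ce4_8422_2325;
const FNV_PRIME: u64 = 0x0000_0100_0000_01b3;

thread_local! {
    // One 1 MB buffer per thread, allocated once and reused for every file.
    static READ_BUFFER: RefCell<Vec<u8>> = RefCell::new(vec![0u8; LARGE_BUF]);
}

/// Reads `reader` to the end through `buf` and returns an FNV-1a hash of the bytes.
fn hash_reader<R: Read>(reader: &mut R, buf: &mut [u8]) -> io::Result<u64> {
    let mut state = FNV_OFFSET;
    loop {
        let n = match reader.read(buf) {
            Ok(0) => break,
            Ok(n) => n,
            Err(e) if e.kind() == io::ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        };
        for &byte in &buf[..n] {
            state = (state ^ u64::from(byte)).wrapping_mul(FNV_PRIME);
        }
    }
    Ok(state)
}

/// "Array 16KB": a fresh stack array for every call.
fn hash_file_array(path: &Path) -> io::Result<u64> {
    let mut buf = [0u8; SMALL_BUF];
    hash_reader(&mut File::open(path)?, &mut buf)
}

/// "Thread local Vector 1MB": reuse the calling thread's buffer.
fn hash_file_thread_local(path: &Path) -> io::Result<u64> {
    READ_BUFFER.with(|cell| {
        let mut buf = cell.borrow_mut();
        hash_reader(&mut File::open(path)?, buf.as_mut_slice())
    })
}

fn main() -> io::Result<()> {
    // Write a throwaway file slightly larger than 3 MB and hash it both ways.
    let path = std::env::temp_dir().join(format!("buffer_sketch_{}.bin", std::process::id()));
    fs::write(&path, vec![0xABu8; 3 * LARGE_BUF + 123])?;
    let a = hash_file_array(&path)?;
    let b = hash_file_thread_local(&path)?;
    fs::remove_file(&path)?;
    assert_eq!(a, b); // the buffer size must not change the resulting hash
    println!("hash: {a:016x}");
    Ok(())
}
```

The buffer size only changes how many `read` calls are issued per file, not the result, which is why both variants must produce the same hash.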

I tried using locks so that at most one file is read from the HDD at a time, but there were no performance gains.
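
One way such a lock could look is sketched below. This is an assumption about the general shape of the experiment, not the code that was actually tried: `HDD_READ_LOCK` and `read_one_at_a_time` are hypothetical names, and plain `std::thread` is used instead of rayon to keep the example self-contained.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};
use std::sync::Mutex;
use std::thread;

// Hypothetical global lock: whoever holds it is the only thread reading from disk.
static HDD_READ_LOCK: Mutex<()> = Mutex::new(());

/// Reads the whole file while holding the lock, so reads never overlap.
/// Hashing the returned bytes can then happen in parallel, outside the lock.
fn read_one_at_a_time(path: &Path) -> io::Result<Vec<u8>> {
    // A poisoned lock only means another reader panicked; the guard is still usable.
    let _guard = HDD_READ_LOCK
        .lock()
        .unwrap_or_else(|poisoned| poisoned.into_inner());
    fs::read(path)
}

fn main() -> io::Result<()> {
    // Create a few throwaway files to read from several threads.
    let paths: Vec<PathBuf> = (0..4)
        .map(|i| std::env::temp_dir().join(format!("lock_sketch_{}_{i}.bin", std::process::id())))
        .collect();
    for (i, path) in paths.iter().enumerate() {
        fs::write(path, vec![i as u8; 64 * 1024])?;
    }

    // Spawn one thread per file; the lock serializes the actual disk reads.
    let sizes = thread::scope(|scope| {
        let handles: Vec<_> = paths
            .iter()
            .map(|path| scope.spawn(move || read_one_at_a_time(path).map(|data| data.len())))
            .collect();
        handles
            .into_iter()
            .map(|handle| handle.join().expect("worker panicked"))
            .collect::<io::Result<Vec<usize>>>()
    })?;

    for path in &paths {
        fs::remove_file(path)?;
    }
    println!("read {} files, {} bytes in total", sizes.len(), sizes.iter().sum::<usize>());
    Ok(())
}
```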

There is also an option to speed up image resizing, which gives a significant speedup when checking for similar images.
On my system, hashing 91 files (~3/4 MB each, 3000x4000 JPGs), I sometimes got even a 3x speedup. In real-world use the speedup should be smaller and depends on the file size (bigger files should see bigger gains) and on the algorithm: Nearest and BlockHash show almost no speedup, because the former is very simple to resize with and the latter does not resize the image before hashing.

| Size | Algorithm | Filter | NORMAL [ms] | FAST IMAGE RESIZE [ms] |
|---|---|---|---|---|
| 16 | Mean | Gaussian | 8845 | 2738 |
| 16 | Gradient | Nearest | 2127 | 2216 |
| 16 | Gradient | Lanczos3 | 8802 | 2809 |
| 8 | Gradient | Lanczos3 | 8819 | 2771 |
| 64 | Gradient | Lanczos3 | 5347 | 2486 |
| 16 | BlockHash | Lanczos3 | 6162 | 5934 |
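
To read the timings above as speedup factors, divide the NORMAL time by the FAST IMAGE RESIZE time. The small sketch below does that for every row; the `Row` struct and `speedup` function are hypothetical helpers written for this illustration, and the numbers are copied from the table.

```rust
/// One row of the timing table (times in milliseconds).
struct Row {
    size: u32,
    algorithm: &'static str,
    filter: &'static str,
    normal_ms: u32,
    fast_ms: u32,
}

// Values copied from the table above.
const ROWS: [Row; 6] = [
    Row { size: 16, algorithm: "Mean", filter: "Gaussian", normal_ms: 8845, fast_ms: 2738 },
    Row { size: 16, algorithm: "Gradient", filter: "Nearest", normal_ms: 2127, fast_ms: 2216 },
    Row { size: 16, algorithm: "Gradient", filter: "Lanczos3", normal_ms: 8802, fast_ms: 2809 },
    Row { size: 8, algorithm: "Gradient", filter: "Lanczos3", normal_ms: 8819, fast_ms: 2771 },
    Row { size: 64, algorithm: "Gradient", filter: "Lanczos3", normal_ms: 5347, fast_ms: 2486 },
    Row { size: 16, algorithm: "BlockHash", filter: "Lanczos3", normal_ms: 6162, fast_ms: 5934 },
];

/// Speedup factor: values above 1.0 mean the fast path took less time.
fn speedup(normal_ms: u32, fast_ms: u32) -> f64 {
    f64::from(normal_ms) / f64::from(fast_ms)
}

fn main() {
    for row in &ROWS {
        println!(
            "{:>2} {:<9} {:<8} {:.2}x",
            row.size,
            row.algorithm,
            row.filter,
            speedup(row.normal_ms, row.fast_ms)
        );
    }
}
```

The Nearest and BlockHash rows come out close to 1x, matching the note that those two cases gain almost nothing.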
