Skip to content

Identifying and removing near-duplicate images using perceptual hashing.

Notifications You must be signed in to change notification settings

knjcode/imgdupes

Repository files navigation

imgdupes

imgdupes is a command line tool for checking and deleting near-duplicate images based on perceptual hash from the target directory.

video_capture Images by Caltech 101 dataset that semi-deduped for demonstration.

It is better to pre-deduplicate identical images with fdupes or jdupes in advance.
Then, you can check and delete near-duplicate images using imgdupes with an operation similar to the fdupes command.

For large dataset

It is possible to speed up dedupe process by approximate nearest neighbor search of hamming distance using NGT or hnsw. See Against large dataset section for details.

Install

To install, simply use pip:

$ pip install imgdupes

Usage

The following example is sample command to find sets of near-duplicate images with Hamming distance of phash less than 4 from the target directory.
To search images recursively from the target directory, add -r or --recursive option.

$ imgdupes --recursive target_dir phash 4
target_dir/airplane_0583.jpg
target_dir/airplane_0800.jpg

target_dir/watch_0122.jpg
target_dir/watch_0121.jpg

By default, imgdupes displays a list of duplicate images list and exits.
To display preserve or delete images prompt, use the -d or --delete option.

If you are using iTerm 2, you can display a set of images on the terminal with the -c or --imgcat option.

$ imgdupes --recursive --delete --imgcat 101_ObjectCategories phash 4

The set of images are sorted in ascending order of file size and displayed together with the pixel size of the image, you can choose which image to preserve.

With -N or --noprompt option, you can preserve the first file in each set of duplicates and delete the rest without prompting.

$ imgdupes -rdN 101_ObjectCategories phash 0

To take input from a list of files

Use --files-from or -T option to take input from a list of files.

$ imgdupes -T image_list.txt phash 0

For example, create image_list.txt as below.

101_ObjectCategories/Faces/image_0345.jpg
101_ObjectCategories/Motorbikes/image_0269.jpg
101_ObjectCategories/Motorbikes/image_0735.jpg
101_ObjectCategories/brain/image_0047.jpg
101_ObjectCategories/headphone/image_0034.jpg
101_ObjectCategories/dollar_bill/image_0038.jpg
101_ObjectCategories/ferry/image_0020.jpg
101_ObjectCategories/tick/image_0049.jpg
101_ObjectCategories/Faces_easy/image_0283.jpg
101_ObjectCategories/watch/image_0171.jpg

Find near-duplicated images from an image you specified

Use --query option to specify a query image file.

$ imgdupes --recursive target_dir --query target_dir/airplane_0583.jpg phash 4
Query: sample_airplane.png

target_dir/airplane_0583.jpg
target_dir/airplane_0800.jpg

Against large dataset

imgdupes supports approximate nearest neighbor search of hamming distance using NGT or hnsw.

To dedupe images using NGT, run with --ngt option after installing NGT and python binding.

$ imgdupes -rdc --ngt 101_ObjectCategories phash 4

Notice: --ngt option is enabled by default from version 0.1.0.

For instructions on installing NGT and python binding, see NGT and python NGT.

To dedupe images using hnsw, run with --hnsw option after installing hnsw python binding.

$ imgdupes -rdc --hnsw 101_ObjectCategories phash 4

Fast exact searching

imgdupes supports exact nearest neighbor search of hamming distance using faiss (IndexFlatL2).

To dedupe images using faiss, run with --faiss-flat option after installing faiss python binding.

$ imgdupes -rdc --faiss-flat 101_ObjectCategories phash 4

Using imgdupes without installing it with docker

You can use imgdupes without installing it using a pre-build docker container image.
NGT, hnsw and faiss are already installed in this image.

Place the target directory in the current directory and execute the following command.

$ docker run -it -v $PWD:/app knjcode/imgdupes -rdc target_dir phash 0

When docker run, current directory is mounted inside the container and referenced from imgdupes.

By aliasing the command, you can use imgdupes as installed.

$ alias imgdupes="docker run -it -v $PWD:/app knjcode/imgdupes"
$ imgdupes -rdc target_dir phash 0

To upgrade imgdupes docker image, you can pull the docker image as below.

$ docker pull knjcode/imgdupes

Available hash algorithm

imgdupes uses the ImageHash to calculate perceptual hash (except for phash_org algorithm).

  • ahash: average hashing

  • phash: perception hashing (using only the 8x8 DCT low-frequency values including the first term)

  • dhash: difference hashing

  • whash: wavelet hashing

  • phash_org: perception hashing (fix algorithm from ImageHash implementation)

    using only the 8x8 DCT low-frequency values and excluding the first term since the DC coefficient can be significantly different from the other values and will throw off the average.

Options

-r --recursive

search images recursively from the target directory (default=False)

-d --delete

prompt user for files to preserve and delete (default=False)

-c --imgcat

display duplicate images for iTerm2 (default=False)

-m --summarize

summarize dupe information

-N --noprompt

together with --delete, preserve the first file in each set of duplicates and delete the rest without prompting the user

--query <image filename>

find image files that are duplicated or similar to the specified image file from the target directory

--hash-bits 64

bits of perceptual hash (default=64)

The number of bits specifies the value that is the square of n.
For example, you can specify 64(8^2), 144(12^2), 256(16^2), etc.

--sort <sort_type>

how to sort duplicate image files (default=filesize)

You can specify following types:

  • filesize: sort by filesize in descending order
  • filepath: sort by filepath in ascending order
  • imagesize: sort by pixel width and height in descenging order
  • width: sort by pixel width in descending order
  • height: sort by pixel height in descending order
  • none: do not sort

--reverse

reverse sort order

--num-proc 4

number of hash calculation and ngt processes (default=cpu_count-1)

--log

output logs of duplicate and delete files (default=False)

--no-cache

not create or use image hash cache (default=False)

--no-subdir-warning

stop warnings that appear when similar images are in different subdirectories

--sameline

list each set of matches on a single line

--dry-run

dry run (do not delete any files)

--faiss-flat

use faiss exact search (IndexFlatL2) for calculating Hamming distance between hash of images (default=False)

--faiss-flat-k 20

number of searched objects when using faiss-flat (default=20)

use with imgcat (-c, --imgcat) options

--size 256x256

resize image (default=256x256)

--space 0

space between images (default=0)

--space-color black

space color between images (default=black)

--tile-num 4

horizontal tile number (default=4)

--interpolation INTER_LINEAR

interpolation methods (default=INTER_LINEAR)

You can specify OpenCV interpolation methods: INTER_NEAREST, INTER_LINEAR, INTER_AREA, INTER_CUBIC, INTER_LANCZOS4, etc.

--no-keep-aspect

do not keep aspect when displaying images

ngt options

--ngt

use NGT for calculating Hamming distance between hash of images (default=True)

--ngt-k 20

number of searched objects when using NGT. Increasing this value, improves accuracy and increases computation time. (default=20)

--ngt-epsilon 0.1

search range when using NGT. Increasing this value, improves accuracy and increases computation time. (default=0.1)

--ngt-edges 10

number of initial edges of each node at graph generation time. (default=10)

--ngt-edges-for-search 40

number of edges at search time. (default=40)

hnsw options

--hnsw

use hnsw for calculating Hamming distance between hash of images (default=False)

--hnsw-k 20

number of searched objects when using hnsw. Increasing this value, improves accuracy and increases computation time. (default=20)

--hnsw-ef-construction 100

controls index search speed/build speed tradeoff (default=100)

--hnsw-m 16

m is tightly connected with internal dimensionality of the data stronlgy affects the memory consumption (default=16)

--hnsw-ef 50

controls recall. higher ef leads to better accuracy, but slower search (default=50)

faiss options

--faiss-cuda

uses CUDA enabled device for faster searching (requires faiss-gpu, Nvidia GPU, and CUDA toolkit)
Install: https://github.com/facebookresearch/faiss/blob/master/INSTALL.md
General: https://github.com/facebookresearch/faiss/wiki/Faiss-on-the-GPU

CUDA options

--cuda-device

uses the specific CUDA device passed for CUDA accelerated searches (default=device with lowest load)
NOTE: if the device passed is not found on the system the CUDA device with the lowest load will be used

License

MIT