Couckoo is a duplicate image detection tool designed for image deduplication at scale using locality-sensitive hashing. To deduplicate images at scale, an approximate search algorithm such as LSH trades some accuracy for substantial gains in speed and efficiency, and the balance between the two can be adjusted through parameter tuning.
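To make the accuracy and speed trade-off concrete, here is a minimal sketch of the standard LSH banding estimate. It is not taken from this program: the function name `candidate_probability` is hypothetical, and the formula assumes every bit of a signature matches independently with probability equal to the similarity, which is an idealization.

```python
def candidate_probability(similarity: float, rows: int, bands: int) -> float:
    """Estimated chance that two signatures share at least one identical band.

    Assumes each of the `rows` bits in a band matches independently with
    probability `similarity` (an idealization, not a property of real images).
    """
    return 1.0 - (1.0 - similarity ** rows) ** bands


if __name__ == "__main__":
    # Hypothetical settings: a 64-bit signature split into 8, 16 or 32 bands.
    for bands in (8, 16, 32):
        rows = 64 // bands
        estimates = [
            round(candidate_probability(s, rows, bands), 3) for s in (0.7, 0.9, 0.98)
        ]
        print(f"bands={bands} rows={rows} estimates={estimates}")
```

Under this estimate, more bands (fewer bits per band) catch more true near-duplicates but also send more unrelated pairs to the slower pairwise comparison.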
This program employs Locality Sensitive Hashing (LSH) to detect near-duplicate images within a directory. Detecting duplicates for the purpose of deduplication at scale is crucial to maintaining good-quality image datasets. The program is structured into the following components:
- ImageProcessor:
  - Handles image preprocessing tasks such as converting images to grayscale, resizing, and flipping to normalize the position of the brightest quarter.
  - Computes a perceptual hash (dhash) of the image to generate a signature for similarity comparison.
- LSHProcessor:
  - Implements LSH for efficient similarity detection by dividing image signatures into bands of bytes.
  - Stores image signatures in buckets and provides methods to find potentially similar images.
- Utility Functions:
  - get_image_files: Retrieves a list of image files from a specified directory based on recognized file extensions.
  - process_images: Processes each image in the directory using the ImageProcessor and populates the LSHProcessor with image signatures.
  - find_near_duplicates: Coordinates the entire process by initializing the components and finding duplicate images using LSH.
- Main program:
  - Reads input directory path, similarity threshold, hash size, and number of bands.
  - Uses find_near_duplicates to identify near-duplicate images based on the provided threshold.
  - Outputs a CSV file (results.csv) containing filenames and their corresponding similarity labels.
- ImageProcessor: Preprocesses images and computes their signatures.
- LSHProcessor: Implements LSH for efficient similarity detection.
- Utility Functions: Handle file operations and coordinate image processing tasks.
- Inputs:
  - input_dir: Directory path containing images to be analyzed.
  - threshold: Minimum similarity threshold (between 0 and 1) for considering images as near-duplicates.
  - hash_size and bands: Parameters for LSH configuration, affecting granularity and efficiency of similarity detection.
- Output:
  - Generates a CSV file (results.csv) containing image paths and labels, with duplicates sharing the same label.
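The input and output handling described above can be sketched as follows. This is an illustration under assumptions, not the program's actual code: the extension set, the non-recursive directory scan, the helper name `write_results`, and the CSV column names are all hypothetical; only `get_image_files` and `results.csv` come from the description.

```python
import csv
from pathlib import Path

# Assumed set of recognized extensions; the real tool may accept a different list.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".bmp", ".gif"}


def get_image_files(input_dir):
    """Return sorted paths of files directly inside input_dir that look like images."""
    return sorted(
        str(path)
        for path in Path(input_dir).iterdir()
        if path.is_file() and path.suffix.lower() in IMAGE_EXTENSIONS
    )


def write_results(labels, output_path="results.csv"):
    """Write one row per image: its path and its label (duplicates share a label)."""
    with open(output_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["filename", "label"])  # hypothetical column names
        for path, label in sorted(labels.items()):
            writer.writerow([path, label])
```

Here `labels` is assumed to be a dict mapping each image path to its label, so two near-duplicate images appear in the CSV with the same label value.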
To detect similarity between two images A and B at a threshold X:
- The ImageProcessor class is used to calculate the image signature/hash with the calculate_signature method.
  - The image is converted to grayscale and resized to (hash_size+1, hash_size).
  - The image is then flipped so that the brightest quarter is always at the top left, to deal with image rotations.
  - A difference hash is then calculated using hash_size and collapsed to a 1-dimensional array.
  - This 1-dimensional array is returned as the signature of the image.
- The LSHProcessor class is employed to:
  - Add each image path and signature to the bucket list, hash_buckets_list, using the add_signature method. The band size and rows are used to iteratively compute the signature bytes for each band, which are stored in hash_buckets_list. If a previous image has produced the same bytes, the current image path is appended to that entry's list of image paths in hash_buckets_list. This indicates that the current rows of the image match the corresponding rows of a different image.
    - NB: hash_bucket_list contains dicts with signature bytes as keys and lists of image paths as values.
  - Assign labels. For each list of similar image paths in hash_bucket_list, the images are compared to each other in pairs, and a similarity score is calculated using the calculate_similarity method, which uses the Hamming distance between the image signatures. If the similarity score exceeds the threshold, the same label is assigned to both images. Images that are not assigned a label in this step are then assigned new labels.
- For images A and B, if their similarity score exceeds the threshold X, the same label is assigned.
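The steps above can be sketched end to end in plain Python. This is a simplified illustration under stated assumptions, not the program's actual implementation: the image is assumed to be already grayscale and resized, passed as a nested list with hash_size rows and hash_size + 1 columns; each band holds hash_size² // bands bits; bucket keys are built with `bytes()` over the bits; and the `signatures` attribute, the `assign_labels` helper, and the label-merging rule are hypothetical.

```python
from itertools import combinations


def calculate_signature(pixels, hash_size):
    """dhash bits for a grayscale grid of hash_size rows by hash_size + 1 columns."""
    half_r, half_c = hash_size // 2, (hash_size + 1) // 2
    top, bottom = range(half_r), range(hash_size - half_r, hash_size)
    left, right = range(half_c), range(hash_size + 1 - half_c, hash_size + 1)

    def brightness(rows, cols):
        return sum(pixels[r][c] for r in rows for c in cols)

    # Key = (flip vertically?, flip horizontally?) needed to move that quarter to the top left.
    quarters = {
        (False, False): brightness(top, left),
        (False, True): brightness(top, right),
        (True, False): brightness(bottom, left),
        (True, True): brightness(bottom, right),
    }
    flip_vertical, flip_horizontal = max(quarters, key=quarters.get)
    if flip_vertical:
        pixels = pixels[::-1]
    if flip_horizontal:
        pixels = [row[::-1] for row in pixels]
    # One bit per adjacent pixel pair: 1 when the left pixel is brighter.
    return [int(row[c] > row[c + 1]) for row in pixels for c in range(hash_size)]


class LSHProcessor:
    def __init__(self, hash_size, bands):
        self.rows = hash_size ** 2 // bands  # assumed number of bits per band
        self.hash_buckets_list = [dict() for _ in range(bands)]
        self.signatures = {}  # hypothetical: path -> signature, kept for comparisons

    def add_signature(self, path, signature):
        self.signatures[path] = signature
        for band, buckets in enumerate(self.hash_buckets_list):
            chunk = signature[band * self.rows:(band + 1) * self.rows]
            # Images whose band bytes are identical land in the same bucket.
            buckets.setdefault(bytes(chunk), []).append(path)


def calculate_similarity(sig_a, sig_b):
    """Fraction of matching bits, i.e. 1 minus the normalized Hamming distance."""
    distance = sum(a != b for a, b in zip(sig_a, sig_b))
    return 1 - distance / len(sig_a)


def assign_labels(lsh, threshold):
    labels, next_label = {}, 0
    for buckets in lsh.hash_buckets_list:
        for paths in buckets.values():
            for a, b in combinations(paths, 2):
                if calculate_similarity(lsh.signatures[a], lsh.signatures[b]) <= threshold:
                    continue
                label = labels.get(a, labels.get(b))
                if label is None:
                    label, next_label = next_label, next_label + 1
                # Merge any other label already held by a or b into the chosen one.
                stale = {labels.get(a), labels.get(b)} - {None, label}
                for path, current in list(labels.items()):
                    if current in stale:
                        labels[path] = label
                labels[a] = labels[b] = label
    for path in lsh.signatures:  # images with no match get a fresh label
        if path not in labels:
            labels[path], next_label = next_label, next_label + 1
    return labels


if __name__ == "__main__":
    grid = [[90, 80, 70, 60, 50], [80, 70, 60, 50, 40],
            [70, 60, 50, 40, 30], [60, 50, 40, 30, 20]]
    mirrored = [row[::-1] for row in grid]
    lsh = LSHProcessor(hash_size=4, bands=4)
    lsh.add_signature("a.png", calculate_signature(grid, 4))
    lsh.add_signature("b.png", calculate_signature(mirrored, 4))
    print(assign_labels(lsh, threshold=0.9))
```

In the demo, the mirrored grid is flipped back during preprocessing, so both files produce the same signature and fall into the same buckets. Only images that share at least one bucket are compared in pairs, which is what keeps the number of comparisons manageable at scale.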