Full refactoring to hash videos by image.
- Add no_duplicates page from #71
- Improve fuzzy search for the new logic of multiple hashes
- Add table layout of duplicates
- Extend documentation
- Fix pylint and pycodestyle issues, add Ubuntu 22.04 test scripts
- Adapt tests
bolshevik committed Dec 22, 2023
1 parent e6cc347 commit b988d39
Showing 22 changed files with 412 additions and 172 deletions.
41 changes: 41 additions & 0 deletions .github/workflows/python-package.yml
@@ -0,0 +1,41 @@
# This workflow will install Python dependencies, run tests and lint with a variety of Python versions
# For more information see: https://docs.github.com/en/actions/automating-builds-and-tests/building-and-testing-python

name: Python package

on:
  push:
    branches: [ "master" ]
  pull_request:
    branches: [ "master" ]

jobs:
  build:

    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      matrix:
        python-version: ["3.12"]

    steps:
    - uses: actions/checkout@v3
    - name: Set up Python ${{ matrix.python-version }}
      uses: actions/setup-python@v3
      with:
        python-version: ${{ matrix.python-version }}
    - name: Install dependencies
      run: |
        sudo apt-get install -y python3 python3-pip python3-setuptools gnupg curl file
        curl -fsSL https://pgp.mongodb.com/server-7.0.asc | sudo gpg -o /usr/share/keyrings/mongodb-server-7.0.gpg --dearmor
        echo "deb [ arch=amd64,arm64 signed-by=/usr/share/keyrings/mongodb-server-7.0.gpg ] https://repo.mongodb.org/apt/ubuntu jammy/mongodb-org/7.0 multiverse" | sudo tee /etc/apt/sources.list.d/mongodb-org-7.0.list
        sudo apt-get update && sudo apt-get install -y mongodb-org && sudo mkdir -p /data/db && mongod &
        pip install --upgrade setuptools
        pip install --only-binary=numpy,scipy -r requirements.txt
        pip install -r requirements-test.txt
    - name: pep8 styles
      run: |
        pycodestyle *.py hashers tests
    - name: Test with pytest
      run: |
        pytest tests/test.py -v
3 changes: 3 additions & 0 deletions .pylintrc
@@ -0,0 +1,3 @@
[MASTER]

extension-pkg-allow-list=av
12 changes: 0 additions & 12 deletions .travis.yml

This file was deleted.

78 changes: 42 additions & 36 deletions README.md
@@ -2,26 +2,26 @@

![](https://api.travis-ci.org/philipbl/duplicate-images.svg)

This Python script finds duplicate images using a [perceptual hash (pHash)](http://www.phash.org) to compare images. pHash ignores the image size and file size and instead creates a hash based on the pixels of the image. This allows you to find duplicate pictures that have been rotated, have changed metadata, or been slightly edited.
This Python script finds duplicate files:
- any file, by an exact match of its blake2b hash
- images, using a [perceptual hash (pHash)](http://www.phash.org) to compare them. pHash ignores the image size and file size and instead creates a hash based on the pixels of the image. This allows you to find duplicate pictures that have been rotated, have changed metadata, or been slightly edited.
- videos, by extracting N frames at fixed points in time and hashing them with the perceptual hash (see above; a sketch of all three strategies follows this list)
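For illustration, here is a minimal sketch of the three strategies — not the script's exact implementation; the `imagehash`, `Pillow`, and `av` (PyAV) dependencies, the helper names, and the even frame-sampling are assumptions:

```python
import hashlib

import av                 # PyAV, for decoding video frames
import imagehash          # perceptual (pHash) hashing
from PIL import Image


def exact_hash(path):
    """Exact-match hash of the raw bytes (blake2b)."""
    digest = hashlib.blake2b()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(1 << 20), b''):
            digest.update(chunk)
    return digest.hexdigest()


def image_phash(path):
    """Perceptual hash of the decoded pixels, independent of file or image size."""
    return str(imagehash.phash(Image.open(path)))


def video_phashes(path, n_frames=5):
    """Perceptual hashes of n_frames sampled across the video (sampling is illustrative)."""
    hashes = []
    with av.open(path) as container:
        stream = container.streams.video[0]
        duration = container.duration  # in av.time_base units; may be None
        for i in range(n_frames):
            container.seek(int(duration * i / n_frames))
            frame = next(container.decode(stream))
            hashes.append(str(imagehash.phash(frame.to_image())))
    return hashes
```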

This script hashes images added to it, storing the hash into a database (MongoDB). To find duplicate images, hashes are compared. If the hash is the same between two images, then they are marked as duplicates. A web interface is provided to delete duplicate images easily. If you are feeling lucky, there is an option to automatically delete duplicate files.
This script hashes files added to it, storing the hashes into a database (MongoDB). To find duplicate files, hashes are compared. If the hash is the same between two files, then they are marked as duplicates. A web interface is provided to delete duplicate files easily. If you are feeling lucky, there is an option to automatically delete duplicate files.
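The store-and-compare idea looks roughly like this with `pymongo` (the database, collection, and field names here are hypothetical, not the script's actual schema):

```python
from pymongo import MongoClient

client = MongoClient('mongodb://localhost:27017/')
files = client['duplicates']['files']  # hypothetical database/collection names


def add_file(path, file_hash):
    # upsert so re-adding a path simply refreshes its stored hash
    files.update_one({'_id': path}, {'$set': {'hash': file_hash}}, upsert=True)


def find_duplicates():
    # group paths by hash and keep only groups with more than one file
    return files.aggregate([
        {'$group': {'_id': '$hash', 'paths': {'$addToSet': '$_id'}}},
        {'$match': {'paths.1': {'$exists': True}}},
    ])
```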

As a word of caution, pHash is not perfect. I have found that duplicate pictures sometimes have different hashes and similar (but not the same) pictures have the same hash. This script is a great starting point for cleaning your photo library of duplicate pictures, but make sure you look at the pictures before you delete them. You have been warned! I hold no responsibility for any family memories that might be lost because of this script.

This script has only been tested with Python 3 and is still pretty rough around the edges. Use at your own risk.
As a word of caution, pHash is not perfect. I have found that duplicate pictures sometimes have different hashes and similar (but not the same) pictures have the same hash. This script is a great starting point for cleaning your photo or video library of duplicates, but make sure you look at the files before you delete them. You have been warned! I hold no responsibility for any family memories that might be lost because of this script.

## Requirements

This script requires MongoDB, Python 3.4 or higher, and a few Python modules, as found in `requirements.txt`.

This script requires MongoDB, Python 3.12 or higher, and a few Python modules, as found in `requirements.txt`.

## Quick Start

I suggest you read the usage, but here are the steps to get started right away. These steps assume that MongoDB is already installed on the system.

First, install this script. This can be done by either cloning the repository or [downloading the script](https://github.com/philipbl/duplicate-images/archive/master.zip).
First, install this script. This can be done by either cloning the repository or [downloading the script](https://github.com/bolshevik/duplicate-images/archive/master.zip).
```bash
git clone https://github.com/philipbl/duplicate-images.git
git clone https://github.com/bolshevik/duplicate-images.git
```

Next, download all required modules. This script has only been tested with Python 3. I would suggest that you make a virtual environment, setting Python 3 as the default python executable (`mkvirtualenv --python=/usr/local/bin/python3 <name>`)
@@ -34,26 +34,6 @@ Last, run script:
python duplicate_finder.py
```

## On Ubuntu 18.04

```bash
# Install Mongo and pip
sudo apt -y install mongodb-server python3-pip
# Disable Mongo service autostart
sudo systemctl disable mongodb.service
# Stop Mongo service
sudo service mongodb stop
```

Python 2 is the default version of Python, so we have to call `python3` explicitly:

```bash
# Install dependencies with Python 3
pip3 install -r requirements.txt
# “python duplicate_finder.py” will fail, so we have to use Python 3 for every call:
python3 duplicate_finder.py …
```

## Example

```bash
@@ -76,6 +56,7 @@ Usage:
duplicate_finder.py remove <path> ... [--db=<db_path>]
duplicate_finder.py clear [--db=<db_path>]
duplicate_finder.py show [--db=<db_path>]
duplicate_finder.py cleanup [--db=<db_path>]
duplicate_finder.py find [--print] [--delete] [--match-time] [--trash=<trash_path>] [--db=<db_path>] [--threshold=<num>]
duplicate_finder.py -h | --help

@@ -84,13 +65,13 @@ Options:

--db=<db_path> The location of the database or a MongoDB URI. (default: ./db)

--parallel=<num_processes> The number of parallel processes to run to hash the image
files (default: number of CPUs).
--parallel=<num_processes> The number of parallel processes to run to hash the files
(default: number of CPUs).

find:
--threshold=<num> Image matching threshold. Number of different bits in Hamming distance. False positives are possible.
--threshold=<num> Hash matching threshold. Number of different bits in Hamming distance. False positives are possible.
--print Only print duplicate files rather than displaying HTML file
--delete Move all found duplicate pictures to the trash. This option takes priority over --print.
--delete Move all found duplicate files to the trash. This option takes priority over --print.
--match-time Adds the extra constraint that duplicate images must have the
same capture times in order to be considered.
--trash=<trash_path> Where files will be put when they are deleted (default: ./Trash)
@@ -101,14 +82,14 @@ Options:
python duplicate_finder.py add /path/to/images
```

When a path is added, image files are recursively searched for. In particular, `JPEG`, `PNG`, `GIF`, and `TIFF` images are searched for. Any image files found will be hashed. Adding a path uses 8 processes (by default) to hash images in parallel so the CPU usage is very high.
When a path is added, files are recursively searched for. A binary content hash is applied to all files; for image files such as `JPEG`, `PNG`, `GIF`, and `TIFF`, a perceptual hash is also computed, and a video hash is applied to video files. Adding a path uses 8 processes (by default) to hash files in parallel, so CPU usage is very high.
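The parallel hashing could look roughly like this — a sketch using the standard-library `multiprocessing` module; `hash_file` stands in for the per-type hashing described above:

```python
import hashlib
import os
from multiprocessing import Pool


def walk_files(root):
    # recursively yield every file path under root
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            yield os.path.join(dirpath, name)


def hash_file(path):
    # every file gets an exact blake2b hash; image and video files would
    # additionally get perceptual hashes (see the sketch earlier in this README)
    digest = hashlib.blake2b()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(1 << 20), b''):
            digest.update(chunk)
    return path, digest.hexdigest()


def hash_tree(root, processes=None):
    # processes=None defaults to os.cpu_count()
    with Pool(processes=processes) as pool:
        return pool.map(hash_file, list(walk_files(root)))
```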

### Remove
```bash
python duplicate_finder.py remove /path/to/images
```

A path can be removed from the database. Any image inside that path will be removed from the database.
A path can be removed from the database. Any file inside that path will be removed from the database.

### Clear
```bash
@@ -117,6 +98,13 @@ python duplicate_finder.py clear

Removes all hashes from the database.

### Cleanup
```bash
python duplicate_finder.py cleanup
```

Removes database entries for files that no longer exist on disk.

### Show
```bash
python duplicate_finder.py show
@@ -129,7 +117,25 @@ Prints the contents of the database.
duplicate_finder.py find [--print] [--delete] [--match-time] [--trash=<trash_path>] [--threshold=<num>]
```

Finds duplicate pictures that have been hashed. This will find images that have the same hash stored in the database. There are a few options associated with `find`. By default, when this command is run, a webpage is displayed showing duplicate pictures and a server is started that allows for the pictures to be deleted (images are not actually deleted, but moved to a trash folder -- I really don't want you to make a mistake). The first option, **`--print`**, prints all duplicate pictures and does not display a webpage or start the server. **`--delete`** automatically moves all duplicate images found to the trash. Be careful with this one. **`--match-time`** adds the extra constraint that images must have the same EXIF time stamp to be considered duplicate pictures. Last, `--trash=<trash_path>` lets you select a path to where you want files to be put when they are deleted. The default trash location is `./Trash`.
Finds duplicate files that have been hashed. This will find files that have the same hash stored in the database. There are a few options associated with `find`. By default, when this command is run, a webpage is displayed showing duplicate files and a server is started that allows the files to be deleted (files are not actually deleted, but moved to a trash folder -- I really don't want you to make a mistake). The first option, **`--print`**, prints all duplicate files and does not display a webpage or start the server. **`--delete`** automatically moves all duplicate files found to the trash. Be careful with this one. **`--match-time`** adds the extra constraint that images must have the same EXIF time stamp to be considered duplicates. `--trash=<trash_path>` lets you select a path where files will be put when they are deleted. The default trash location is `./Trash`. Last, `--threshold=<num>` specifies the maximum number of differing bits (Hamming distance) allowed when fuzzy-matching hashes.
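The fuzzy matching behind `--threshold` reduces to a Hamming distance between hash bits; a minimal sketch, assuming hashes are stored as equal-length hex strings:

```python
def hamming_distance(hash_a: str, hash_b: str) -> int:
    """Number of differing bits between two equal-length hex hash strings."""
    return bin(int(hash_a, 16) ^ int(hash_b, 16)).count('1')


def is_fuzzy_match(hash_a: str, hash_b: str, threshold: int) -> bool:
    # a threshold of 0 demands identical hashes; larger values admit
    # near-duplicates at the cost of possible false positives
    return hamming_distance(hash_a, hash_b) <= threshold
```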

## Testing

### Ubuntu 22.04
```bash
sudo apt-get install python3 python3-pip python3-setuptools gnupg curl file
curl -fsSL https://pgp.mongodb.com/server-7.0.asc | \
sudo gpg -o /usr/share/keyrings/mongodb-server-7.0.gpg \
--dearmor
echo "deb [ arch=amd64,arm64 signed-by=/usr/share/keyrings/mongodb-server-7.0.gpg ] https://repo.mongodb.org/apt/ubuntu jammy/mongodb-org/7.0 multiverse" | sudo tee /etc/apt/sources.list.d/mongodb-org-7.0.list
sudo apt-get update
sudo apt-get install -y mongodb-org
sudo mkdir -p /data/db
sudo mongod
pip install --only-binary=numpy,scipy -r requirements.txt
pip install -r requirements-test.txt
```
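With the dependencies in place, the same checks that the CI workflow runs can be reproduced locally:

```bash
pycodestyle *.py hashers tests
pytest tests/test.py -v
```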

## Disclaimer

