This script helps you find and manage duplicate files across multiple directories. It lets you scan directories into a database, identify duplicates by content, and remove unwanted copies while preserving preferred versions of files based on your criteria.
- Scan directories to build a database of files with their hashes and metadata.
- Identify duplicate files based on content (file hashes).
- Prefer certain directories when keeping originals.
- Remove duplicates while preserving the preferred original files.
- Process files using multiple threads for improved performance.
- Log detailed output to a file for debugging and record-keeping.
- Clean up the database by removing entries for files that no longer exist.
- Rescan duplicates to update file information.
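The detection approach boils down to hashing each file's contents and treating files that share a hash as duplicates. The sketch below only illustrates that idea using the xxhash package listed in the requirements; it is not the script's actual implementation, and the function names and chunk size are hypothetical.

```python
import os
import xxhash
from collections import defaultdict

def hash_file(path, chunk_size=1 << 20):
    # Hash file contents in 1 MiB chunks with the fast, non-cryptographic xxhash.
    digest = xxhash.xxh64()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def find_duplicate_groups(directory):
    # Group every file under `directory` by content hash; any group holding
    # two or more paths is a set of duplicates.
    groups = defaultdict(list)
    for root, _dirs, files in os.walk(directory):
        for name in files:
            path = os.path.join(root, name)
            groups[hash_file(path)].append(path)
    return {h: paths for h, paths in groups.items() if len(paths) > 1}
```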
- Python 3.6 or higher
- The following Python packages:
  - xxhash
  - tqdm
Install the required packages using:
pip install xxhash tqdm
Clone the repository or download the script finddupes.py to your local machine.
The script provides several commands to process directories, find duplicates, list them, and delete unwanted copies.
- process: Process files in a directory to build or update the database.
- list-duplicates: List duplicates excluding the original files.
- delete-duplicates: Delete duplicate files based on specified criteria.
- rescan-duplicates: Rescan duplicate files to update their information.
- clean-db: Remove entries from the database for files that no longer exist on disk.
- list-duplicates-csv: List duplicates and originals in CSV format (see the note after this list).
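list-duplicates-csv does not appear in the workflow below. Assuming it behaves like list-duplicates and prints to the console when no output file is given, a minimal invocation would be:
python finddupes.py list-duplicates-csv > duplicates.csv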
The typical workflow involves the following steps:
Start by cleaning the database to remove any entries for files that no longer exist on disk.
python finddupes.py clean-db
This ensures that the database reflects the current state of your files.
Choose a main directory that you want to prefer when keeping original files. Process this directory first.
python finddupes.py process /path/to/main_directory --threads 4 --log-file finddupes.log
- --threads 4 specifies using 4 threads for processing. By default, it uses the number of CPU cores.
- --log-file finddupes.log logs detailed output to the specified file.
Remove duplicates within the main directory based on folder depth or other criteria, ensuring that the best version of files is kept.
First, list duplicates within the main directory:
python finddupes.py list-duplicates --within-directory /path/to/main_directory > duplicates.txt
Review duplicates.txt to see which files are considered duplicates.
Then, delete duplicates within the main directory:
python finddupes.py delete-duplicates --within-directory /path/to/main_directory --simulate-delete
- Use --simulate-delete to perform a dry run without actually deleting files.
- Remove --simulate-delete to perform the actual deletion, as shown below.
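Once you are happy with the dry run, the same command without --simulate-delete deletes the duplicates within the main directory:
python finddupes.py delete-duplicates --within-directory /path/to/main_directory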
Process additional directories where duplicates might exist.
python finddupes.py process /path/to/other_directory --skip-existing --threads 4 --log-file finddupes.log
- --skip-existing ensures that only new files are processed, avoiding re-hashing files already in the database.
Now, remove duplicates across directories, preferring the original files in the main directory.
python finddupes.py delete-duplicates --prefer-directory /path/to/main_directory --simulate-delete
- --prefer-directory /path/to/main_directory sets the main directory as the preferred source for originals.
- Optionally, you can specify multiple preferred directories, ordered by preference:
python finddupes.py delete-duplicates --prefer-directory /path/to/main_directory,/path/to/another_directory
- Review the output to ensure that the correct files will be deleted.
- Remove --simulate-delete to delete the duplicates, as shown below.
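Once the simulated run looks correct, the same command without --simulate-delete performs the deletion:
python finddupes.py delete-duplicates --prefer-directory /path/to/main_directory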
After deleting duplicates, you can rescan the files to update the database.
python finddupes.py rescan-duplicates
Or, process the directories again with --skip-existing to pick up any changes.
Process a directory:
python finddupes.py process /path/to/directory --threads 4 --log-file finddupes.log
List duplicates excluding originals:
python finddupes.py list-duplicates --prefer-directory /path/to/main_directory --output duplicates.txt
Delete duplicates, preferring the main directory:
python finddupes.py delete-duplicates --prefer-directory /path/to/main_directory --simulate-delete
Clean the database:
python finddupes.py clean-db
python finddupes.py process <directory> [options]
Options:
- <directory>: The directory to process.
- --skip-existing: Skip processing files that are already in the database.
- --threads N: Number of threads for hashing (default: number of CPU cores).
- --log-file FILE: Path to log file for detailed output.
python finddupes.py list-duplicates [options]
Options:
- -o, --output FILE: Output file to write the list to. If not specified, prints to the console.
- --prefer-directory DIRS: Preferred source directories for selecting original files (comma-separated if multiple). The first directory has the highest preference.
- --within-directory DIR: Only examine duplicates within this directory.
python finddupes.py delete-duplicates [options]
Options:
- --prefer-directory DIRS: Preferred source directories for selecting original files (comma-separated if multiple).
- -o, --output FILE: Output CSV file to log the deleted files.
- --overwrite: Overwrite the output file if it exists.
- --append: Append to the output file if it exists.
- --simulate-delete: Simulate deletion without actually deleting files.
- --within-directory DIR: Only delete duplicates within this directory.
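For example, to delete duplicates while preferring the main directory and logging the removed files to a CSV, overwriting any previous log (deleted_files.csv is just a placeholder name), an invocation might look like:
python finddupes.py delete-duplicates --prefer-directory /path/to/main_directory -o deleted_files.csv --overwrite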
python finddupes.py rescan-duplicates
Rescans duplicates to update their hashes and metadata in the database.
python finddupes.py clean-db
Removes entries from the database for files that no longer exist on disk.
- Clean the Database Regularly: Use clean-db to ensure the database reflects the current state of your files, especially after moving or deleting files outside of this script.
- Process Directories in Order: Start with your main directory, then process additional directories, preferring the main directory when removing duplicates.
- Use --skip-existing: When adding new directories, use --skip-existing to avoid reprocessing files already in the database.
- Simulate Deletions First: Always use --simulate-delete initially to ensure that the correct files will be deleted before performing actual deletions.
- Review Outputs: Check the outputs of list-duplicates and deletion simulations to verify that the script is identifying duplicates as expected.
- Back Up Important Data: Before deleting files, consider backing up important data to prevent accidental loss.
- Log Outputs: Use the --log-file option to keep detailed logs for debugging and record-keeping.
This script was written primarily by AI, and therefore has no license or copyright.
Note: Replace /path/to/main_directory and other placeholder paths with the actual paths on your system.
This README provides instructions and examples to help you use the Duplicate File Finder script effectively. By following the typical workflow and using the available options, you can manage duplicate files across directories while preserving preferred versions and ensuring data integrity.