Find identical files according to their size and hashing algorithm.
Two files are considered identical if they have the same size and the same hash.
"A hash function is a mathematical algorithm that takes an input (in this case, a file) and produces a fixed-size string of characters, known as a hash value or checksum. The hash value acts as a summary representation of the original input. This hash value is unique (disregarding unlikely collisions) to the input data, meaning even a slight change in the input will result in a completely different hash value."
To find identical files, three procedures are performed:
Procedure 1. Group files by size.
Procedure 2. Group files by hash(first_bytes) with the ahash algorithm.
Procedure 3. Group files by hash(entire_file) with the chosen algorithm.
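The three procedures above can be sketched in Python as follows. This is only an illustration of the idea, not the tool's actual implementation: the prefix length `FIRST_BYTES`, the helper names `hash_file`, `regroup` and `find_identical`, and the use of SHA-256 for both hashing stages (instead of ahash and the chosen algorithm) are assumptions made for this example.

```python
import hashlib
from collections import defaultdict
from pathlib import Path

# Assumed prefix length for Procedure 2; the tool's real value is not stated here.
FIRST_BYTES = 1024


def hash_file(path, limit=None):
    """Hash the first `limit` bytes of a file, or the whole file if `limit` is None."""
    hasher = hashlib.sha256()  # stand-in for ahash / the chosen algorithm
    with open(path, "rb") as f:  # read-only: file contents are never modified
        if limit is not None:
            hasher.update(f.read(limit))
        else:
            for chunk in iter(lambda: f.read(65536), b""):
                hasher.update(chunk)
    return hasher.hexdigest()


def regroup(groups, key_fn):
    """Split every group by `key_fn` and keep only subgroups with two or more files."""
    result = []
    for group in groups:
        buckets = defaultdict(list)
        for path in group:
            buckets[key_fn(path)].append(path)
        result.extend(b for b in buckets.values() if len(b) > 1)
    return result


def find_identical(root):
    """Return lists of paths under `root` that share the same size and hash."""
    files = [p for p in Path(root).rglob("*") if p.is_file()]
    groups = regroup([files], lambda p: p.stat().st_size)          # Procedure 1
    groups = regroup(groups, lambda p: hash_file(p, FIRST_BYTES))  # Procedure 2
    groups = regroup(groups, hash_file)                            # Procedure 3
    return groups
```

Each stage is cheaper than the next, so most files are discarded before the whole file has to be read: a file with a unique size is never hashed, and a file with a unique prefix hash is never read in full.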
Hash algorithm options are: ahash, blake3 (default), fxhash, sha256, sha512.
find-identical-files just reads the files and never changes their contents. See the open_file function to verify.
find-identical-files
The number of identical files is the number of times the same file is found (number of repetitions or frequency).
By default, only files whose frequency is two or more (that is, duplicates) are selected.
find-identical-files -f N
such that N is an integer greater than or equal to 1 (N >= 1).
With the -f (or --min_frequency) option, set the minimum frequency (number of identical files).
With the -F (or --max_frequency) option, set the maximum frequency (number of identical files).
- To report all files:
Useful for getting hash information for all files in the current directory.
find-identical-files -f 1
- Look for duplicate or higher frequency files (default):
find-identical-files
or
find-identical-files -f 2
- Look for files whose frequency is exactly 4:
find-identical-files -f 4 -F 4
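The effect of the minimum and maximum frequency can be pictured with a small Python sketch. The function `filter_by_frequency` and the sample groups are hypothetical and only mirror the behaviour described above; they are not taken from the tool's source.

```python
def filter_by_frequency(groups, min_frequency=2, max_frequency=None):
    """Keep groups whose size lies in [min_frequency, max_frequency].

    `max_frequency=None` means there is no upper bound.
    """
    if min_frequency < 1:
        raise ValueError("min_frequency must be >= 1")
    return [
        group
        for group in groups
        if len(group) >= min_frequency
        and (max_frequency is None or len(group) <= max_frequency)
    ]


# Hypothetical groups of identical files (one inner list per distinct content).
groups = [["a"], ["b1", "b2"], ["c1", "c2", "c3", "c4"]]

all_files = filter_by_frequency(groups, min_frequency=1)    # like -f 1
duplicates = filter_by_frequency(groups)                    # default, like -f 2
exactly_four = filter_by_frequency(groups, 4, 4)            # like -f 4 -F 4
```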
find-identical-files -b N
such that N is an integer (N >= 0).
With the -b (or --min_size) option, set the minimum size (in bytes).
With the -B (or --max_size) option, set the maximum size (in bytes).
- To find identical files whose size is greater than or equal to 8 bytes:
find-identical-files -b 8
- To find identical files whose size is less than or equal to 1024 bytes:
find-identical-files -B 1024
- To find identical files whose size is between 8 and 1024 bytes:
find-identical-files -b 8 -B 1024
- To find identical files whose size is exactly 1024 bytes:
find-identical-files -b 1024 -B 1024
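- To use the fxhash algorithm, print the result in yaml format, show the total execution time and wipe the terminal first:
find-identical-files -twa fxhash -r yaml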
- The CSV file will be saved in the current directory:
find-identical-files -c .
- The CSV file will be saved in the /tmp directory:
find-identical-files -c /tmp
or
find-identical-files --csv_dir=/tmp
- The XLSX file will be saved in the ~/Downloads directory:
find-identical-files -x ~/Downloads
- The XLSX file will be saved in the /tmp directory:
find-identical-files -x /tmp
or
find-identical-files --xlsx_dir=/tmp
7. To find identical files in the Downloads directory with the ahash algorithm, redirect the output to a json file (/tmp/fif.json) and export the result to an XLSX file (/tmp/fif.xlsx) for further analysis:
find-identical-files -tvi ~/Downloads -a ahash -r json > /tmp/fif.json -x /tmp
8. Get information using jq:
- Print all hashes:
find-identical-files -r json | jq -sr '.[:-1].[].["File information"].hash'
- Get information from the first identical file:
find-identical-files -r json | jq -s '.[0]'
- Get information from the 15th identical file (if it exists):
find-identical-files -r json | jq -s '.[14]'
- Get information from the range [a,b) with Start (a) inclusive and End (b) exclusive.
For a = 2 and b = 5:
find-identical-files -r json | jq -s '.[2:5]'
- Get summary information:
find-identical-files -r json | jq -s '.[-1]'
Another option is to redirect the result to a temporary file and read specific information:
find-identical-files -vr json > /tmp/fif
jq -sr '.[:-1].[].["File information"].hash' /tmp/fif
jq -s '.[0]' /tmp/fif
jq -s '.[-2]' /tmp/fif
jq -s '.[-1]' /tmp/fif
jq -s '.[-1]["Total number of identical files"]' /tmp/fif
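The same information can also be read without jq. The Python sketch below mimics `jq -s` by parsing a sequence of concatenated JSON values into a list. The helper `read_json_stream` is hypothetical, and the sample text is invented for illustration: it contains only the keys used in the jq examples above ("File information", hash, "Total number of identical files"), while the real output presumably carries more fields.

```python
import json


def read_json_stream(text):
    """Parse concatenated JSON values into a list, similar to `jq -s`."""
    decoder = json.JSONDecoder()
    values = []
    pos = 0
    length = len(text)
    while pos < length:
        # raw_decode does not skip leading whitespace, so do it here.
        while pos < length and text[pos].isspace():
            pos += 1
        if pos >= length:
            break
        value, pos = decoder.raw_decode(text, pos)
        values.append(value)
    return values


# Invented sample; in practice, read the text from a file such as /tmp/fif.
sample = """
{"File information": {"hash": "aaa111"}}
{"File information": {"hash": "bbb222"}}
{"Total number of identical files": 4}
"""

records = read_json_stream(sample)
hashes = [r["File information"]["hash"] for r in records[:-1]]  # like .[:-1]
summary = records[-1]                                           # like .[-1]

print(hashes)
print(summary["Total number of identical files"])
```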
Type find-identical-files -h in the terminal to see the help messages and all available options:
find identical files according to their size and hashing algorithm
Usage: find-identical-files [OPTIONS]
Options:
  -a, --algorithm <ALGORITHM>
          Choose the hash algorithm [default: blake3] [possible values: ahash, blake3, fxhash, sha256, sha512]
  -b, --min_size <MIN_SIZE>
          Set a minimum file size (in bytes) to search for identical files [default: 0]
  -B, --max_size <MAX_SIZE>
          Set a maximum file size (in bytes) to search for identical files
  -c, --csv_dir <CSV_DIR>
          Set the output directory for the CSV file (fif.csv)
  -d, --min_depth <MIN_DEPTH>
          Set the minimum depth to search for identical files [default: 0]
  -D, --max_depth <MAX_DEPTH>
          Set the maximum depth to search for identical files
  -e, --extended_path
          Prints extended path of identical files, otherwise relative path
  -f, --min_frequency <MIN_FREQUENCY>
          Minimum frequency (number of identical files) to be filtered [default: 2]
  -F, --max_frequency <MAX_FREQUENCY>
          Maximum frequency (number of identical files) to be filtered
  -g, --generate <GENERATOR>
          If provided, outputs the completion file for given shell [possible values: bash, elvish, fish, powershell, zsh]
  -i, --input_dir <INPUT_DIR>
          Set the input directory where to search for identical files [default: current directory]
  -o, --omit_hidden
          Omit hidden files (starts with '.'), otherwise search all files
  -r, --result_format <RESULT_FORMAT>
          Print the result in the chosen format [default: personal] [possible values: json, yaml, personal]
  -s, --sort
          Sort result by number of identical files, otherwise sort by file size
  -t, --time
          Show total execution time
  -v, --verbose
          Show intermediate runtime messages
  -w, --wipe_terminal
          Wipe (Clear) the terminal screen before listing the identical files
  -x, --xlsx_dir <XLSX_DIR>
          Set the output directory for the XLSX file (fif.xlsx)
  -h, --help
          Print help (see more with '--help')
  -V, --version
          Print version
To build and install from source, run the following command:
cargo install find-identical-files
Another option is to install from GitHub:
cargo install --git https://github.com/claudiofsr/find-identical-files.git
In general, jwalk (default) is faster than walkdir.
If you prefer to use walkdir, install with the walkdir feature:
cargo install --features walkdir find-identical-files