Culling Duplicate Photos with rmlint
I have twenty-five years' worth of old personal hard drives that I wish to scour for photos and videos that should be preserved prior to destroying the drives. In searching for tools to assist me in this effort, I ran across rmlint [1], fim [2], and fdupes [3].
In the end, I chose rmlint, as it was the best match for my needs. This note documents a few of the subtleties of using rmlint. The note does not go into all of the differences between the tools; I may create another note later on that topic. Having said that, the primary decision points for me were:
- fim calculates checksums for all files it touches, which is expensive time-wise. rmlint only calculates checksums when absolutely necessary (i.e., when it is necessary to determine the similarity of files).
- fim always saves the checksums it calculates; rmlint can optionally save the checksums it calculates. Calculating the checksum for a file only once is key to scaling to multiple invocations of the command across hundreds of thousands of files.
- When using fim to delete duplicate files, one has two choices: a) answer yes/no for each file to be deleted, or b) simply delete all duplicates without any sort of review mechanism. Both of these options were unacceptable for my use case; I wanted to be able to inspect the candidates for retention/deletion with scripts and other tools prior to performing an irrevocable action. rmlint, on the other hand, can generate bash or python scripts for duplicate deletion. The scripts can be reviewed prior to execution to get an understanding of what is going to be removed. Of course, the scripts can also be modified.
- rmlint can also emit duplicate file information in csv, json, and other formats (see the sketch after this list).
- rmlint allows extensive customization of how duplicate files are chosen.
- Finally, rmlint has an undocumented feature that allows one to generate a list of unique files contained on one or more drives/directories.
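To give a flavor of the output flexibility mentioned above, rmlint accepts multiple formatters in a single run. A minimal sketch (the directory is hypothetical):
# Write a deletion script plus csv and json reports in one pass.
$ rmlint -o sh:rmlint.sh -o csv:dupes.csv -o json:dupes.json /some/photos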
Notes
Storing Hashes
My ideal process centered around the creation of a Master archive, which would contain unique photos/videos. I would process disks one-by-one. As unique photos/videos were encountered, they would be added to the Master archive.
This meant that hashes for content in Master would be required many times. Because hashing is the most time-consuming part of duplicate detection, it seemed mandatory for the tool to store calculated hashes for later reuse.
I played with rmlint for a long time (focusing on the --replay option) before I realized rmlint could store hashes. The capability is described on the man page [4], but it is downplayed and easy to miss.
Specifically, invoking rmlint with --xattr-write and --write-unfinished causes both full and partial hashes to be written to extended file attributes. The --xattr-read flag causes full/partial hashes to be read from extended file attributes. These flags can make a huge difference in run times.
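In practice this means one pass pays the hashing cost and later passes reuse the cached values. A minimal sketch, with illustrative paths:
# First pass: hash files and cache full/partial hashes
# in extended file attributes.
$ rmlint --xattr-write --write-unfinished /Volumes/master

# Later passes: read the cached hashes instead of re-hashing file contents.
$ rmlint --xattr-read /Volumes/master // /Volumes/backup27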
Identifying Unique Files on a Backup Disk
As already mentioned, part of my ideal process required me to copy unique files from a backup disk and place them in the Master archive. There is an undocumented rmlint feature that supports this. I was shocked when I first read about it in this github issue. It is not mentioned in any of the rmlint documentation that I can find; however, it does work as described.
Example: Archiving Unique Files
For this example, assume the following:

- /Volumes/master contains the master archive of unique photos/videos.
- /Volumes/backup27 is a disk that may contain unarchived photos/videos.
The following will identify any photos/videos present on backup27 but not present on master (i.e., identify any unarchived photos/videos) and subsequently archive them to master:
# Emit the paths of files on backup27 that have no duplicate on master.
$ rmlint -o uniques \
      --keep-all-tagged \
      /Volumes/master \
      // \
      /Volumes/backup27 \
      > files-to-archive.txt

# Copy those files into the archive, preserving relative paths (-R)
# and extended attributes (-X).
$ rsync -aRX --progress \
      --files-from=files-to-archive.txt \
      / /Volumes/master/backup27
A few comments about this approach:

- Carefully note that backup27 is the tagged directory in the rmlint invocation. For some reason, I found this counterintuitive. This is probably because if one were using rmlint in the traditional manner (i.e., to identify the duplicate files on backup27 to delete), master would be the tagged directory and their positions would be swapped in the rmlint invocation.
- In practice, I would also invoke rmlint with --xattr-write --write-unfinished --xattr-read, but those options aren't relevant to the focus of the example.
- The -g/--progress option to rmlint will interfere with the output from -o uniques; they can't be used here.
- This invocation directs rmlint to process all files, regardless of what they contain. See Example: Processing Only Certain File Types for an example of how to process only certain file types.
- The rsync command, as written, places all archived files under /Volumes/master/backup27. I find archiving unique files to separate subdirectories indicating their source to be a useful technique (see the sketch after this list). Obviously, archiving to /Volumes/master directly would be fine as well, but by combining multiple archives in this manner, one loses the ability to determine where a file came from. Note that the Master archive organization used here bears no relation to the ultimate archive organization that will be created once all originals have been amassed.
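For instance, after processing a hypothetical second disk, backup31, each source would land in its own subdirectory of the archive (the files-from lists are the per-disk outputs of the rmlint step above):
# Each source disk gets its own subdirectory under the Master archive.
$ rsync -aRX --files-from=backup27-files.txt / /Volumes/master/backup27
$ rsync -aRX --files-from=backup31-files.txt / /Volumes/master/backup31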
Example: Delete Duplicates, Archive Remaining
An alternative to the previous example would be to use rmlint to identify files present on both master and backup27 and to delete the duplicates from backup27. The remaining photos/videos on backup27 would then be archived to master. The following would accomplish that:
# This creates 'rmlint.sh' and 'rmlint.json'.
# 'rmlint.sh' is standalone; it does not read 'rmlint.json'.
# 'rmlint.json' is for later optional use with the rmlint '--replay' option.
$ rmlint --progress \
      --keep-all-tagged \
      /Volumes/backup27 \
      // \
      /Volumes/master

$ ./rmlint.sh -n   # dry run: view duplicate files to be deleted
$ ./rmlint.sh      # actually delete duplicate files on backup27
Note that the order of master and backup27 is reversed relative to the previous example and that master is now the tagged directory. This is crucially important.
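To make the contrast explicit, here are the two invocations side by side, stripped to their essentials:
# Archive workflow: find files unique to backup27, so tag backup27.
$ rmlint -o uniques --keep-all-tagged /Volumes/master // /Volumes/backup27

# Deletion workflow: keep everything on master, so tag master.
$ rmlint --keep-all-tagged /Volumes/backup27 // /Volumes/master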
Rather than using the shell script created by the previous command, I usually have rmlint generate a python script that does the same thing. This option also creates a JSON file with the particulars of each file that was processed. I find this to be a better fit for my usual workflow. The python version of the previous command would look like:
# This creates 'process.py' and '.rmlint.json'.
$ rmlint --progress \
      -O py:process.py \
      --keep-all-tagged \
      /Volumes/backup27 \
      // \
      /Volumes/master

# process.py reads '.rmlint.json'.
$ ./process.py --dry-run   # dry run: view duplicate files to be deleted
$ ./process.py             # actually delete duplicate files on backup27

# Archive whatever survived on backup27. (Note: no -R here; with -R,
# rsync would recreate the full /Volumes/backup27 path under the target.)
$ rsync -aX --progress \
      /Volumes/backup27/ \
      /Volumes/master/backup27
It is trivial to modify either process.py or rmlint.sh to do other things, such as compute statistics on the files to be deleted.
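For example, something along these lines totals the space occupied by the duplicates. This is a sketch: it assumes jq is available and that entries in the JSON output carry type, is_original, and size fields, which is worth verifying against the '.rmlint.json' your version of rmlint emits:
# Sum the sizes of all entries that are duplicates (not the kept original).
$ jq '[ .[]
        | select(.type == "duplicate_file" and (.is_original | not))
        | .size ] | add' .rmlint.json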
Testing: Identification of Unique Files
I am a bit paranoid about file loss and had to convince myself that the uniques formatter offered by rmlint worked correctly. To that end, I created a simple test script:
#!/usr/bin/env bash
# Build a small fixture: mast has unique files a, b, c;
# bkup has unique files f, g, h; d and e exist in both.
rm -rf mast bkup
mkdir mast bkup

cd mast
# 16 random bytes per file, so accidental collisions are effectively impossible.
for i in a b c d e; do dd if=/dev/urandom bs=1 count=16 of="$i"; done
cd ..

cp mast/{d,e} bkup

cd bkup
for i in f g h; do dd if=/dev/urandom bs=1 count=16 of="$i"; done
cd ..

# -k is shorthand for --keep-all-tagged; bkup is the tagged directory.
rmlint -o uniques -k mast // bkup > uniques.txt
This creates two directories: mast and bkup. mast contains three unique files: a, b, and c. bkup contains three unique files: f, g, and h. The files d and e are the duplicates; mast and bkup each contain a copy of d and e.
A Venn diagram (see the appendix for how it was generated) depicts the setup just described: the mast-only region holds a, b, and c; the bkup-only region holds f, g, and h; and the intersection holds d and e.
After the script is run, uniques.txt contains the full paths to the files f, g, and h. This trivial test gives me confidence that the uniques formatter does, in fact, do what I expected it to do.
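Concretely, running the script (saved here under the hypothetical name test-uniques.sh) from a scratch directory should leave output along these lines; the absolute path prefix will differ on your machine:
$ ./test-uniques.sh
$ cat uniques.txt
/home/user/scratch/bkup/f
/home/user/scratch/bkup/g
/home/user/scratch/bkup/h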
Example: Processing Only Certain File Types
In the previous examples, rmlint
processes all of the files on the backup drive. What if the drive contains all sorts of files, but we are only interested in identifying unique image files that aren’t present on the master? To keep things simple, let’s assume that “image files” really means files with the extensions .jpg
or .jpeg
.
$ ( find -E /Volumes/backup27 -type f -iregex '.*\.(jpg|jpeg)$' -print0 \
      | xargs \
            -0 \
            -J % \
            rmlint \
            -S ma \
            -o uniques \
            --keep-all-tagged \
            --xattr-write \
            --write-unfinished \
            --xattr-read \
            /Volumes/master \
            // \
            %
  ) > uniques.txt
This uses the fact that rmlint can accept individual files on the command line. Instead of simply providing /Volumes/backup27 as an argument, we wish to provide paths to all the image files on backup27 as arguments and tag them. If there are more than a few image files, a simple invocation of rmlint with all the file paths on the command line will fail, as the maximum command line length will be exceeded.
To work around this problem, xargs is used. find provides a stream of paths to image files, with each path separated by a null character (i.e., '\0'). xargs repeatedly invokes rmlint with as many paths as possible until the paths are exhausted; the -J % option tells (BSD) xargs where in the command line to substitute the paths. Because the output of the uniques formatter goes to stdout, it's easy to collect the output of the repeated rmlint invocations in a single file, uniques.txt.
And there we have it… a file containing paths to all of the image files that exist on backup27 and do not exist on master.
Appendix
About the Venn Diagram
The Venn diagram was created with pyvenn and matplotlib:
import matplotlib
matplotlib.use('SVG')  # select the backend before pyplot is loaded
import venn            # pyvenn

# Region labels: '10' = only in mast, '01' = only in bkup, '11' = in both.
labels = {'10': 'a,b,c', '01': 'f,g,h', '11': 'd,e'}
fig, ax = venn.venn2(labels, names=['mast', 'bkup'])
fig.savefig('venn.svg')
Finding Media Files
The following is a very crude script I use to get the paths of all "media files" located in one or more directories. It could be used to replace the find command in the last example (e.g., mfind -0 /Volumes/backup27 | xargs ...).
#!/usr/bin/env bash
# Print on stdout paths of all media files located
# in one or more directories.
#
# Usage:
# mfind [-0] PATH [PATH...]
#
# If invoked with '-0' option, separate output paths
# with null character. Otherwise, output paths appear
# one per line.
# Most extensions taken from:
# https://en.wikipedia.org/wiki/ExifTool
IMG="cr2|crw|ciff|exif"
IMG="${IMG}|gif|jp2|jpm|jpx|jpg|jpeg"
IMG="${IMG}|pict|pct|png|jng|mng|ppm|pbm|pgm"
IMG="${IMG}|raw|tif|tiff"
VID="3g2|3gp2|3gp|3gpp"
VID="${VID}|f4a|f4b|f4p|f4v"
VID="${VID}|m4a|m4b|m4p|m4v|mov|qt|mp4"
VID="${VID}|mpeg|mpg|m2v|vob"
VID="${VID}|webm|wma|wmv"
VID="${VID}|divx|dv|flv"
VID="${VID}|avi"
AUD="mp3|flac|aif"
REGEX=".*\.(${IMG}|${VID}|${AUD})$"
PRINT="-print"
if [ "$1" == "-0" ]
then
    PRINT="${PRINT}0"
    shift
fi

find -E "$@" -iregex "${REGEX}" -type f "${PRINT}"