
Culling Duplicate Photos with rmlint

I have twenty-five years' worth of old personal hard drives that I wish to scour for photos and videos that should be preserved prior to destroying the drives. In searching for tools to assist me in this effort, I ran across rmlint, fim, and fdupes.

In the end, I chose to use rmlint, as it was the best match for my needs. This note documents a few of the subtleties of using rmlint. The note does not go into all of the differences between the tools; I may create another note later on this topic. Having said that, the primary decision points for me were:

  • fim calculates checksums for all files it touches, which is expensive in time. rmlint only calculates checksums when absolutely necessary (i.e., when it must determine whether files are actually identical). fim always saves the checksums it calculates; rmlint can optionally save them. Calculating the checksum for a file only once is key to scaling to multiple invocations of the command across hundreds of thousands of files.

  • When using fim to delete duplicate files, one has two choices: a) answer yes/no to each file to be deleted, or b) simply delete all duplicates without any sort of review mechanism. Both of these options were unacceptable for my use case; I wanted to be able to inspect the candidates for retention/deletion with scripts and other tools, prior to performing an irrevocable action.

  • rmlint, on the other hand, can generate bash or python scripts for duplicate deletion. The scripts can be reviewed prior to execution to get an understanding for what is going to be removed. Of course, the scripts can also be modified. rmlint can also emit duplicate file information in csv, json, and other formats.

  • rmlint allows extensive customization of how duplicate files are chosen.

  • Finally, rmlint has an undocumented feature that allows one to generate a list of unique files contained on one or more drives/directories.



Storing Hashes

My ideal process centered around the creation of a Master archive, which would contain unique photos/videos. I would process disks one-by-one. As unique photos/videos were encountered, they would be added to the Master archive.

This meant that hashes for content in Master would be required many times. Because hashing is the most time consuming part of duplicate detection, it seemed mandatory for the tool to store calculated hashes for later reuse.

I played with rmlint for a long time (focusing on the --replay option) before I realized rmlint could store hashes. The capability is described on the man page, but it is downplayed and easy to miss.

Specifically, invoking rmlint with --xattr-write and --write-unfinished causes both full and partial hashes to be written to extended file attributes. The --xattr-read flag causes full/partial hashes to be read from extended file attributes. These flags can make a huge difference in run times.
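A sketch of how the flags combine across runs (the paths are illustrative, and the target filesystem must support extended attributes):

```shell
# First pass over the archive: hash files and cache the full/partial
# checksums in extended file attributes.
rmlint --xattr-write --write-unfinished /Volumes/master

# Later runs read the cached checksums instead of rehashing master.
rmlint --xattr-read /Volumes/backup27 // /Volumes/master
```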

Identifying Unique Files on a Backup Disk

As already mentioned, part of my ideal process required me to copy unique files from a backup disk and place them in the Master archive. There is an undocumented rmlint feature that supports this. I was shocked when I first read about it in this GitHub issue. It is not mentioned in any of the rmlint documentation that I can find; however, it does work as described.

Example: Archiving Unique Files

For an example, assume the following:

  • /Volumes/master contains the master archive of unique photos/videos.
  • /Volumes/backup27 is a disk that may contain unarchived photos/videos.

The following will identify any photos/videos present on backup27 but not present on master (i.e., identify any unarchived photos/videos) and subsequently archive them to master:
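A sketch of the invocation (the exact semantics of the uniques formatter follow my reading of the GitHub issue; the sed/rsync plumbing is one way to do the copy, not the only one):

```shell
# backup27 is the tagged path (it follows the // separator).  The
# undocumented "uniques" formatter writes one path per line for each
# tagged file that has no duplicate on master.
rmlint -o uniques /Volumes/master // /Volumes/backup27 > uniques.txt

# Archive the unique files, preserving their layout relative to the
# backup disk (sed strips the mount-point prefix so rsync --files-from
# sees paths relative to the source directory).
sed 's|^/Volumes/backup27/||' uniques.txt |
  rsync -av --files-from=- /Volumes/backup27/ /Volumes/master/backup27/
```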

A few comments about this approach:

  • Carefully note that backup27 is the tagged directory in this rmlint invocation. I found this counterintuitive, probably because if one were using rmlint in the traditional manner (i.e., to identify the duplicate files on backup27 to delete), master would be the tagged directory and the two locations would be swapped in the rmlint invocation.

  • In practice, I would also invoke rmlint with --xattr-write --write-unfinished --xattr-read, but those options aren’t relevant to the focus of the example.

  • The -g / --progress option to rmlint will interfere with the output from -o uniques; it can’t be used here.

  • This invocation actually directs rmlint to process all files, regardless of what they contain. See Example: Processing Only Certain File Types for an example of how to process only certain file types.

  • The rsync command, as written, places all archived files under /Volumes/master/backup27. I find archiving unique files to separate subdirectories indicating their source to be a useful technique. Obviously, archiving to /Volumes/master would be fine as well, but by combining multiple archives in this manner, one loses the ability to determine where a file came from. Note that the Master archive organization used here bears no relation to the ultimate archive organization that will be created once all originals have been amassed.

Example: Delete Duplicates, Archive Remaining

An alternative to the previous example would be to use rmlint to identify files present on both master and backup27 and to delete the duplicates from backup27. The remaining photos/videos on backup27 would then be archived to master. The following would accomplish that:
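A sketch of that sequence, assuming rmlint's default output script name of rmlint.sh:

```shell
# master is tagged here (after //), so its copies are treated as the
# originals; the duplicates slated for deletion all live on backup27.
rmlint /Volumes/backup27 // /Volumes/master

# Review rmlint.sh, then execute it to delete the duplicates.
sh rmlint.sh

# Archive whatever remains on the backup disk.
rsync -av /Volumes/backup27/ /Volumes/master/backup27/
```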

Note that the order of master and backup27 was reversed and that master is the tagged directory in this invocation. This is crucially important.

Rather than the shell script created by the previous command, I usually get rmlint to generate a python script which does the same thing. This option also creates a JSON file with the particulars of each file that was processed. I find this to be a better fit for my usual workflow. The python version of the previous command would look like:
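A sketch of the python-script variant (same tagging as before; the output file names are rmlint's defaults as I recall them):

```shell
# -o py writes rmlint.py plus a companion JSON file describing every
# file that was processed.
rmlint -o py /Volumes/backup27 // /Volumes/master

# Review rmlint.py (and the JSON), then run it to delete duplicates.
python3 rmlint.py
```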

It is trivial to modify either script to do other things with the files to be deleted, such as computing statistics.

Testing: Identification of Unique Files

I am a bit paranoid about file loss and had to convince myself that the uniques formatter offered by rmlint worked correctly. To that end, I created a simple test script:
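The script below is my reconstruction of that test. It builds the fixture described next and runs the uniques formatter against it; the rmlint call is guarded so the fixture still builds on a machine without rmlint installed:

```shell
#!/bin/sh
# Build a small fixture: mast holds a b c d e, bkup holds d e f g h.
# d and e are byte-identical in both trees; everything else is unique.
rm -rf mast bkup uniques.txt
mkdir mast bkup
for f in a b c d e; do printf 'content-%s\n' "$f" > "mast/$f"; done
for f in d e f g h; do printf 'content-%s\n' "$f" > "bkup/$f"; done

# bkup is tagged (after //); uniques should list only bkup's f, g, h.
if command -v rmlint >/dev/null 2>&1; then
  rmlint -o uniques mast // bkup > uniques.txt
fi
```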

This creates two directories: mast and bkup. mast contains three unique files: a, b, and c. bkup contains three unique files: f, g, and h. The files d and e are duplicated: mast and bkup each contain a copy of d and e.

The following diagram depicts the setup just described:

After the script is run, uniques.txt contains the full path to the files f, g, and h. This trivial test gives me confidence that the uniques formatter does, in fact, do what I expected it to do.

Example: Processing Only Certain File Types

In the previous examples, rmlint processes all of the files on the backup drive. What if the drive contains all sorts of files, but we are only interested in identifying unique image files that aren’t present on the master? To keep things simple, let’s assume that “image files” really means files with the extensions .jpg or .jpeg.

This uses the fact that rmlint can accept individual files on the command line. Instead of simply providing /Volumes/backup27 as an argument, we wish to provide paths to all the image files on backup27 as arguments and tag them. If there are more than a few image files, a simple invocation of rmlint with all the file paths on the command line will fail, as the maximum command line length will be exceeded.

To work around this problem, xargs is used. find provides a stream of paths to image files, with each path separated by a null character (i.e., '\0'). xargs repeatedly invokes rmlint with as many paths as possible until the paths are exhausted. Because the output of the uniques formatter goes to stdout, it’s easy to collect the output of the repeated rmlint invocations in a single file, uniques.txt.
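A sketch of the pipeline (paths as in the earlier examples; note that the batched file paths land after the // separator, which is what tags them):

```shell
# find emits NUL-separated paths to image files on the backup disk;
# xargs batches them onto rmlint command lines.  All invocations share
# the single redirected stdout, so uniques.txt accumulates everything.
find /Volumes/backup27 -type f \( -iname '*.jpg' -o -iname '*.jpeg' \) -print0 |
  xargs -0 rmlint -o uniques /Volumes/master // > uniques.txt
```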

And there we have it… a file containing paths to all of the image files that exist in backup27 and do not exist in master.


About the Venn Diagram

The Venn diagram was created with pyvenn and matplotlib:
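The original pyvenn code is not reproduced here; the sketch below draws a similar two-set diagram with the matplotlib-venn package instead (a substitution on my part), using the counts from the test fixture:

```python
# Substitute for the original pyvenn code: matplotlib-venn's venn2.
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt
from matplotlib_venn import venn2

# mast holds a, b, c plus shared d, e; bkup holds f, g, h plus d, e:
# 3 files only in mast, 3 only in bkup, 2 in the intersection.
venn2(subsets=(3, 3, 2), set_labels=("mast", "bkup"))
plt.savefig("venn.png")
```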

Finding Media Files

The following is a very crude script I use to get the paths of all “media files” located in one or more directories. It could be used to replace the find command in the last example (e.g., mfind -0 /Volumes/backup27 | xargs ...).
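The original script is not reproduced here; the sketch below is my approximation, written as a shell function (the extension list is an assumption and easy to extend):

```shell
# mfind: print paths of "media files" under the given directories.
# Usage: mfind [-0] dir [dir ...]
#   -0 emits NUL-separated paths, for piping into xargs -0.
# The extension list below is illustrative, not exhaustive.
mfind() {
  action="-print"
  if [ "$1" = "-0" ]; then action="-print0"; shift; fi
  find "$@" -type f \( \
      -iname '*.jpg'  -o -iname '*.jpeg' -o -iname '*.png' -o \
      -iname '*.gif'  -o -iname '*.heic' -o -iname '*.tif' -o \
      -iname '*.mov'  -o -iname '*.mp4'  -o -iname '*.avi' \
    \) "$action"
}
```

Save it as a function in your shell startup file, or drop the mfind() wrapper and install the body as a standalone script on your PATH.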