Using FIM for Duplicate Photo Detection

I recently discovered FIM (File Integrity Manager), a tool for managing changes to photos. Its primary purpose is to detect corruption in photos stored on disk, but it can also detect duplicate photos, which is what attracted me to it. These are my notes on using it as a duplicate detector.

One of FIM's shortcomings is its very limited ability to specify which photo in a set of duplicates should be considered the master. These notes primarily focus on a workaround for that aspect of FIM.

In my situation, I find that, generally, among a set of duplicate photos, the one with the earliest creation date is the 'original'. If at all possible, I try to maintain the correct creation date. My approach does not solve the problem entirely, but it makes it manageable for many situations.
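
As a rough illustration of the "earliest one wins" rule, here is a minimal sketch that picks the oldest file from a list. It is not part of FIM: `oldest_of` is a hypothetical helper, and it compares modification times as a stand-in for creation dates, since creation (birth) time is not portably available from the shell.

```shell
# oldest_of FILE...: print the argument with the oldest modification time.
# Hypothetical helper, not a FIM command. Assumes mtime approximates the
# creation date of each photo.
oldest_of() {
    oldest=$1
    shift
    for f in "$@"; do
        # -ot is true when the left file is older than the right one
        if [ "$f" -ot "$oldest" ]; then
            oldest=$f
        fi
    done
    printf '%s\n' "$oldest"
}
```

For example, `oldest_of a/IMG_1.jpg b/IMG_1.jpg` prints whichever of the two copies was modified first.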

Note that FIM has a very strict definition of 'duplicate'. Files are duplicates of each other if all of the following are identical between them: name, size, creation time, modified time, hash, and permissions. One can direct FIM to ignore permissions and times in the comparisons via the -i command line option.
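
To make the distinction concrete, the sketch below checks only whether two files have identical bytes, ignoring name, times, and permissions. This is not how FIM compares files; `same_content` is a hypothetical helper built on the standard `cmp` utility, shown only to contrast a content-only comparison with the strict definition described above.

```shell
# same_content A B: succeed if the two files are byte-for-byte identical.
# Hypothetical helper for illustration; names, timestamps, and permissions
# are deliberately ignored.
same_content() {
    cmp -s "$1" "$2"
}

# Usage:
#   if same_content x/IMG_1.jpg y/copy.jpg; then echo "same bytes"; fi
```

Two copies of a photo with different names or modification times pass this check, even though the strict definition would not treat them as duplicates.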

Example

Assume that I have three groups of photos: X, Y, and Z. There may be some duplicated photos within and among them. Z is the oldest group and X is the newest. We will process the groups in order of oldest to newest. This is based on the assumption that we want the earliest instance of a photo, should the same photo appear in multiple groups.

Initially, the groups of photos will be stored in $I. We will conduct our work in $F.

Move or copy (maintaining file dates/times) $I/Z to $F/Z, then initialize the FIM repository:

cd $F
fim init -m 'initialize using group Z'
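
For the copy step before initialization, one way to keep file dates intact is `cp -Rp`. The sketch below wraps it in a hypothetical helper, `copy_group`; the paths in the usage comment follow the $I and $F convention used in these notes.

```shell
# copy_group SRC_DIR DEST_PARENT: copy a group into the work area.
# Hypothetical helper. -R recurses into the directory; -p preserves
# modification times and permissions where the filesystem allows it.
copy_group() {
    mkdir -p "$2" && cp -Rp "$1" "$2/"
}

# Usage:
#   copy_group "$I/Z" "$F"
```

A plain `cp -R` without `-p` would stamp every copy with the current time, which defeats any later comparison based on dates.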

This will create a FIM 'repository' in $F/.fim, which contains hashes of each photo. Now we'll use FIM to search for duplicates within Z:

cd $F
fim fdup

This will print a list of all duplicate files, if any. If the output is truncated, rerun with -o nnnn, where nnnn is the maximum number of output lines you wish to see.

If there are duplicates and you wish to delete them:

cd $F
fim rdup

Unfortunately, FIM forces one to specify each photo that should be kept as an original -- there is no way to set an order of precedence for directories, etc., and apply that precedence in bulk. Note that this actually removes the duplicates from the filesystem, so be sure that you have backups.
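
Since the removal is permanent, a quick archive of the work area before running rdup is cheap insurance. This is a minimal sketch using `tar`; `backup_tree` and the archive path in the usage comment are hypothetical, and the archive will also include the .fim directory.

```shell
# backup_tree DIR ARCHIVE: write DIR (including hidden files) to a tar archive.
# Hypothetical helper; -C changes into DIR's parent so the archive holds
# relative paths rather than absolute ones.
backup_tree() {
    tar -cf "$2" -C "$(dirname "$1")" "$(basename "$1")"
}

# Usage:
#   backup_tree "$F" "$HOME/photos-before-rdup.tar"
```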

Once the duplicate photos have been removed, commit the current state to FIM:

cd $F
fim ci -m 'Removed duplicates from group Z'

We are now ready to process group Y. There are two ways to do this:

  1. The first method is to designate Z as the master group. This method guarantees that if duplicates exist between Z and Y, the duplicate(s) in Y will be removed and the photo in Z will be preserved. Note that this does not check for duplicates that exist solely within Y.

    cd $I/Y
    fim rdup -M $F   # specifies that $F/.fim contains the Masters
    

    The very nice feature of defining a Master (i.e., using rdup -M) is that one of the command prompt options is 'A', which means to apply the choice to all subsequent duplicates. By choosing 'A', any duplicates between Y and Z that are present in Y will be deleted. If one has thousands of duplicates, this is a blessing. The 'A' option is not present when using rdup without -M.

    At this point, there are no duplicates between Y and Z, though duplicates may still exist within Y. Move $I/Y to $F/Y and remove any duplicates solely within Y:

    cd $F/Y
    fim fdup
    fim rdup
    fim ci -m 'Add group Y, less duplicates.'
    

    Note that any duplicates removed from Y by any of the above have not been logged because Y is not yet controlled -- this is a significant drawback if you care about traceability.

    The following should show no duplicates:

    cd $F
    fim fdup
    

    Finally, process X in an identical manner to Y. After this is complete, FIM should be managing all the photos in X, Y, and Z and no duplicates should be present.

  2. The second method is to bring X and Y into the FIM repository at the beginning of the process. The benefit of this is that all deletions of duplicates are logged, so actions are traceable. The downside is that it is not possible to make a single choice that applies to multiple instances of duplicates (i.e., the 'A' response of the previous method is not available). In other words, one has no alternative to responding to each duplicate with a keypress to specify which is the master.

    Move or copy $I/Y to $F/Y, process the group, and do the same thing with group X. One can also move/copy both X and Y and process them at once.

    rsync -av "$I/Y/" "$F/Y"
    cd $F
    fim ci -m 'add group Y'
    fim fdup  # view duplicates
    fim rdup  # remove duplicates
    fim ci -m 'removed duplicates from group Y'
    
    rsync -av "$I/X/" "$F/X"
    cd $F
    fim ci -m 'add group X'
    fim fdup  # view duplicates
    fim rdup  # remove duplicates
    fim ci -m 'removed duplicates from group X'
    

    At this point, there should be no duplicate photos in $F.
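
The per-group steps of the first method can be collected into a small wrapper. This is only a sketch: `add_group` is a hypothetical function, it assumes $I and $F are set and that $F already holds an initialized FIM repository, and it uses the fim commands exactly as they appear in these notes. The rdup steps still prompt interactively.

```shell
# add_group NAME: apply the first method to one incoming group.
# Hypothetical wrapper; assumes $I (incoming) and $F (work area with .fim)
# are set and that fim behaves as described in these notes.
add_group() {
    # 1. Remove from the incoming group anything already held in $F.
    (cd "$I/$1" && fim rdup -M "$F") || return 1
    # 2. Move the surviving photos into the work area.
    mv "$I/$1" "$F/$1" || return 1
    # 3. Remove duplicates within the group itself, then commit.
    (cd "$F/$1" && fim fdup && fim rdup &&
        fim ci -m "Add group $1, less duplicates.") || return 1
}

# Process oldest to newest (Z is assumed to be in $F already):
#   add_group Y && add_group X
```

Stopping at the first failing step keeps a half-processed group in $I rather than committing it by accident.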