
These aren't blog posts, though on the surface they may appear to be. The notes on this site have not been subjected to the topic selection, structure, proofreading, and attention to detail necessary for a quality blog post. Rather, these are entries in a lab notebook: contemporaneous, messy, and unstructured. They reflect whatever I happen to be working on at the time that I want to document and make available for easy access. They are here primarily for me; if they happen to be useful to you, so much the better.

Notes on exiftool

Introduction

These are my notes on using exiftool to organize a jumbled mess of digital images. This is written more as a personal memory jogger, unlikely to be of value to anyone else.

General

Before getting to exiftool, here are a couple of handy one-liners.

Print all of the file extensions present in $DIR, one per line:

find "$DIR" -type f | sed -e 's/.*\.//' | sed -e 's/.*\///' \
     | sort -u

Print all of the file extensions present in $DIR, one per line, along with a count of the number of files with that extension:

find "$DIR" -type f | sed -e 's/.*\.//' | sed -e 's/.*\///' \
     | sort | uniq -c | sort -rn

Organizing Photos Based on Date

If a photo's ultimate location will be based in some way on a date, the date to be used in determining the location will have to come from somewhere. The most obvious approach is to use one of the dates that should be present in the photo's EXIF data. I use DateTimeOriginal, which is the date on which an image was taken -- either by a traditional film camera or a digital camera.

CreateDate is subtly different -- it is the date that the digital image was created. This is either the date that a digital photo was taken or the date that a traditional photo was scanned.

Unfortunately, not all images will have a DateTimeOriginal or even a CreateDate. Therefore, before we can reorganize, our first step is to ensure that each image has a valid DateTimeOriginal.
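
To see which of these tags a given image actually carries, exiftool can dump all time-related tags, along with their group names and command-line spellings (photo.jpg is a placeholder):

exiftool -a -G1 -s -time:all photo.jpg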

A Cautionary Word

If duplicate detection is a part of your workflow, give ample thought to choosing the best time to organize by date. My experience is that this is best as a last or nearly last step. This is because the process I describe here changes EXIF data. A photo with changed EXIF data will no longer be detected as a duplicate by most detection tools (e.g., rmlint, fim, fdupes, etc.). On the other hand, if your detection tool looks at image similarity, perhaps using a perceptual hash, this may not be a concern.
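
To see why, compare checksums before and after a metadata edit (foo.jpg is a placeholder):

shasum foo.jpg                                   # checksum before
exiftool -DateTimeOriginal='2000:01:01 00:00:00' foo.jpg
shasum foo.jpg                                   # checksum now differs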

Getting a Date to Use

If DateTimeOriginal does not exist, where will a date come from? Sources of possible dates, in increasing order of desirability (for my purposes), are: MDItemFSCreationDate, GPSDateTime, and CreateDate.

Note that MDItemFSCreationDate is macOS-specific. Be aware that the drive containing the images must be indexed by Spotlight for any of the MDItem tags to be available. Otherwise, exiftool will return file not found errors. FileCreateDate is apparently the Windows equivalent. Unix does not maintain file creation time.
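
On macOS, a quick way to confirm that Spotlight has metadata for a particular file is the mdls utility (not part of exiftool; photo.jpg is a placeholder):

mdls -name kMDItemFSCreationDate photo.jpg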

With exiftool, the most recent valid tag assignment is the final value of the tag. My scheme for determining DateTimeOriginal follows. DateTimeOriginal is initially set to the time the image file was created, as we know this exists. Then, if a GPS time is available, it is used. Finally, if CreateDate is available, it is used.

exiftool -r -if '(not $DateTimeOriginal)'             \
            '-DateTimeOriginal<MDItemFSCreationDate'  \
            '-DateTimeOriginal<GPSDateTime'           \
            '-DateTimeOriginal<CreateDate'            \
            "$PHOTODIR"

Next, a quick sanity check on dates. I have encountered image files that, for unknown reasons, have a DateTimeOriginal newer than the FileModifyDate (e.g., DateTimeOriginal of 7/4/2000 and FileModifyDate of 1/1/2000). The following corrects any such illogical dates:

exiftool -r -if '($FileModifyDate lt $DateTimeOriginal)' \
            '-DateTimeOriginal<FileModifyDate'           \
            "$PHOTODIR"

If either of the previous steps modifies a file, the modified file will keep the name of the original file and the unmodified original will have _original appended to its name. For example, if foo.jpg is modified, exiftool will leave in its wake two files: foo.jpg and foo.jpg_original.

It is possible to tell exiftool to remove any _original files, but I prefer to do this as a separate step, after I have a chance to examine the results.

find "$PHOTODIR" -name \*_original -delete

Harmonizing File Extensions

Often, a single image type may be present with different file name extensions. E.g., one may have photos ending in both .jpg and .jpeg. I chose to enforce consistent extensions.

The following command will convert .jpeg to .jpg. We'll ignore case conversion for the moment, because my macOS filesystem is case-insensitive.

exiftool -r -filename=%f.jpg -ext jpeg "$PHOTODIR"

If you have a case-sensitive filesystem, the following would take care of both changing jpeg to jpg and lowercasing any uppercase extensions:

exiftool -r -filename=%f.jpg -ext jpg -ext jpeg \
            -if '$filename!~/\.jpg$$/' "$PHOTODIR"

exiftool can be used to do this, but there are other tools (e.g., find, the Perl rename module) that will likely be much faster if extension renaming is your only goal; see the sketch below.
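
For example, a find-only version of the .jpeg to .jpg rename might look like this (a sketch; unlike exiftool, it never examines the file contents):

find "$PHOTODIR" -type f -name '*.jpeg' \
     -exec sh -c 'mv "$1" "${1%.jpeg}.jpg"' _ {} \;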

Reorganize

These examples will reorganize photos by placing copies of the originals into a new directory structure with the following overall format: new/ext/yyyy/mm/dd/hh/file.ext. It's not difficult to move rather than copy. Note that DateTimeOriginal, which we set earlier, is used to determine the directory structure for each image.

Notice that some of the parameters in the expression supplied to '-d' require two '%' signs, one of which escapes the other. Also note the use of %le, which will convert a file extension to lowercase, if necessary. See <http://owl.phy.queensu.ca/~phil/exiftool/filename.html>.

If you'd like to try a dry run before doing anything, the following invocation only prints what it would do to stdout -- it does not actually copy, move, rename, or otherwise modify the filesystem.

exiftool -r '-testname<DateTimeOriginal' \
         -d 'new/%%le/%Y/%m/%d/%H/%%f%%-c.%%le' "$PHOTODIR"

This command will actually reorganize things:

exiftool -r '-filename<DateTimeOriginal' \
         -d 'new/%%le/%Y/%m/%d/%H/%%f%%-8c.%%le' "$PHOTODIR"

A new structure containing copies of the original photos will be created in ./new.

Be sure to check in both the new and original directories for any files that were left behind due to exiftool errors.

Other Invocations

  1. Make a copy of all images in the directory ./SRC and place them in the directory ./DEST, maintaining the subdirectory structure.

    exiftool -r -o . -directory=DEST/%d SRC
    

    The -o . option sets the directory to ./ and also forces copies of the images to be made, rather than changing them in place. This directory setting is subsequently overridden by the -directory= option, which sets the output directory to ./DEST with the relative path of the image under ./SRC appended.

    E.g., the original image ./SRC/foo/bar/baz.jpg is copied to ./DEST/SRC/foo/bar/baz.jpg. I'm sure there's a way to remove the unnecessary SRC subdirectory in the output, but I haven't attempted to determine how to do this.

    To better understand how this works, consider another example.

    exiftool -r -o . -directory=DEST/%d `pwd`/SRC
    

    This has a different result than the previous example. The fully qualified path is used for the subdirectory structure under ./DEST. If $PWD is /Users/khe/example, then the original image ./SRC/foo/bar/baz.jpg will be copied to ./DEST/Users/khe/example/SRC/foo/bar/baz.jpg.

  2. Find all duplicate files, where duplicate is defined as a file name containing '-nnnnnnnn' (a dash followed by eight digits) preceding the three character extension.

    exiftool -r -FileName -if '$FileName =~ /-[0-9]{8}\..../' \
             dupes/src-new/jpg/1970
    
  3. Get image sizes info.

    # dimensions of every image
    exiftool -r -T -progress: -imagesize -directory \
             -filename  photos > sizes.all
    
    # counts of image sizes, ordered by frequency
    sed 's/x/ /' < sizes.all                             \
        | awk '{printf "%12d %6d %6d\n", $1*$2, $1, $2}' \
        | sort -n | uniq -c | sort -n > sizes.frequency
    
    # counts of image sizes, ordered by image size
    sort -n -k 2 sizes.frequency > sizes.size
    
  4. exiftool processing based on image size/dimensions.

    # images less than 256x512 in size
    exiftool -if '$imagesize and ($imagewidth<256 and $imageheight<512)' \
             -filename -r -T src-new
    

Using FIM for Duplicate Photo Detection

I recently discovered FIM (File Integrity Manager), which is intended as a tool for managing changes to photos. It is primarily intended to detect corruption in photos stored on disk, but it also offers the ability to detect duplicate photos, which is what attracted me to it. These are my notes on using it as a duplicate detector.

One of FIM's shortcomings is its very limited ability to specify which photo in a set of duplicates should be considered the master. These notes primarily focus on a workaround for that aspect of FIM.

In my situation, I find that, generally, among a set of duplicate photos, the one with the earliest creation date is the 'original'. If at all possible, I try to maintain the correct creation date. My approach does not solve the problem entirely, but it makes it manageable for many situations.

Note that FIM has a very strict definition of 'duplicate'. Files are duplicates of each other if all of the following are identical between them: name, size, creation time, modified time, hash, and permissions. One can direct FIM to ignore permissions and times in the comparisons via the -i command line option.

Example

Assume that I have three groups of photos: X, Y, and Z. There may be some duplicated photos within and among them. Z is the oldest group and X is the newest. We will process the groups in order of oldest to newest. This is based on the assumption that we want the earliest instance of a photo, should the same photo appear in multiple groups.

Initially, the groups of photos will be stored in $I. We will conduct our work in $F.

Move or copy (maintaining file dates/times) $I/Z to $F/Z.
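
rsync's archive mode is one way to make a copy that preserves the dates and times (a sketch; adjust to taste):

rsync -a "$I/Z/" "$F/Z"

Then initialize the FIM repository: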

cd $F
fim init -m 'initialize using group Z'

This will create a FIM 'repository' in $F/.fim, which contains hashes of each photo. Now we'll use FIM to search for duplicates within Z:

cd $F
fim fdup

This will print a list of all duplicate files, if present. If the output is truncated, rerun with -o nnnn, where nnnn is the maximum number of output lines you wish to see.

If there are duplicates and you wish to delete them:

cd $F
fim rdup

Unfortunately, FIM forces one to specify each photo that should be kept as an original -- there is no way to set an order of precedence for directories, etc., and apply that precedence in bulk. Note that this actually removes the duplicates from the filesystem, so be sure that you have backups.

Once the duplicate photos have been removed, commit the current state to FIM:

cd $F
fim ci -m 'Removed duplicates from group Z'

We are now ready to process group Y. There are two ways to do this:

  1. The first method is to designate Z as the master group. This method guarantees that if duplicates exist between Z and Y, the duplicate(s) in Y will be removed and the photo in Z will be preserved. Note that this does not check for duplicates that exist solely within Y.

    cd $I/Y
    fim rdup -M $F   # specifies that $F/.fim contains the Masters
    

    The very nice feature of defining a Master (i.e., using rdup -M) is that one of the command prompt options is 'A', which means to apply the choice to all subsequent duplicates. By choosing 'A', any duplicates between Y and Z that are present in Y will be deleted. If one has thousands of duplicates, this is a blessing. The A option is not present when using rdup without -M.

    At this point, there are no duplicates between Y and Z, though duplicates may still exist within Y. Move $I/Y to $F/Y and remove any duplicates solely within Y:

    cd $F/Y
    fim fdup
    fim rdup
    fim ci -m 'Add group Y, less duplicates.'
    

    Note that any duplicates removed from Y by any of the above have not been logged because Y is not yet controlled -- this is a significant drawback if you care about traceability.

    The following should show no duplicates:

    cd $F
    fim fdup
    

    Finally, process X in an identical manner to Y. After this is complete, FIM should be managing all the photos in X, Y, and Z and no duplicates should be present.

  2. The second method is to bring X and Y into the FIM repository at the beginning of the process. The benefit of this is that all deletions of duplicates are logged, so actions are traceable. The downside is that it is not possible to make a single choice that applies to multiple instances of duplicates (i.e., the 'A' response of the previous method is not available). In other words, one has no alternative to responding to each duplicate with a keypress to specify which is the master.

    Move or copy $I/Y to $F/Y, process the group, and do the same thing with group X. One can also move/copy both X and Y and process them at once.

    rsync -av "$I/$Y/" "$F/$Y"
    cd $F
    fim ci -m 'add group Y'
    fim fdup  # view duplicates
    fim rdup  # remove duplicates
    fim ci -m 'removed duplicates from group Y'
    
    rsync -av "$I/$X/" "$F/$X"
    cd $F
    fim ci -m 'add group X'
    fim fdup  # view duplicates
    fim rdup  # remove duplicates
    fim ci -m 'removed duplicates from group X'
    

    At this point, there should be no duplicate photos in $F.

The SparkFun ESP8266 Thing and espeasy

I had an old SparkFun ESP8266 Thing lying around and decided that I wanted to install espeasy on it. Why was I interested in espeasy? It's new to me, but it seemed to provide a lot of leverage to someone wanting to monitor and/or control remote sensors/devices. Specifically:

  • OTA software updates!
  • Builtin support for many common sensors, e.g., DHT11, DHT22, RFID readers, analog input, IR, switches, displays, etc., etc.
  • Web-based device configuration after initial software install.
  • Multiple sensors/controls on a single device.
  • Works with Sonoff devices (look for 'switches'). These are fantastic!
  • MQTT support
  • Integrations with Domoticz, OpenHAB

espeasy doesn't explicitly say that it supports Home Assistant, but since both support MQTT, I assume that this is possible. I'm new to Home Assistant as well and am just exploring options at this point.

These are my installation notes.

Read more…

Dotfiles Reloaded

About a year and a half ago, I converted to using GNU stow and git for managing my dotfiles. The most recent version of that effort is available here.

The scheme worked well but the OCD part of me grew tired of seeing symbolic links to files instead of the actual files themselves. The symlinks weren't unexpected; after all, the whole point of stow is easily managing lots of links. Then I ran across a post on Hacker News about using a bare git repo to manage dotfiles in place. I very much liked the idea and settled on using vcsh and myrepos to manage my dotfiles going forward.

I should mention that I've known about vcsh for a good while. My impression had been that it was more complicated than necessary for my purposes. I've since changed my mind completely; it's as simple as what I ended up doing with stow, if not simpler. Minor tweaks to dotfiles are a pleasure when using myrepos.
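
For a flavor of the workflow, basic vcsh usage runs along these lines (a sketch; the repository name is arbitrary and the remote URL is a placeholder):

vcsh init dotfiles
vcsh dotfiles add ~/.bashrc ~/.vimrc
vcsh dotfiles commit -m 'initial dotfiles'
vcsh dotfiles remote add origin git@example.com:dotfiles.git
vcsh dotfiles push -u origin master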

These are my brief notes about the process.

Read more…

A Content-Aware Untabify Command

I have an application whose behavior is controlled by a YAML file. Recently, I modified the behavior and shipped off the new YAML file to the application, only to have the application die because the YAML file contained tabs instead of spaces. One remedy would be to modify the application to handle tabs properly. Another would be to ensure that tabs are never present in the YAML file to begin with. In this instance, the latter was the path of least resistance; this is a brief note about creating a shell script that will use GNU Emacs to perform the tab to space conversion.
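
The heart of such a script can be a batch-mode Emacs invocation (a minimal sketch; $FILE is a placeholder):

emacs --batch "$FILE" \
      --eval '(progn (untabify (point-min) (point-max)) (save-buffer))'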

Read more…

Using an Amazon Dash Button for Emergency Notifications

I've been interested in the Amazon Dash button as a generic IoT device since they were introduced. The original branded buttons are a great deal at $5, but using them for other than their intended purpose is a chore. With Amazon's introduction of the generic IoT Button, it is now quite simple to create custom behavior associated with the button. These notes describe how I used one to create an 'emergency' button for my 103-year-old father.

Read more…