Using Pandoc with Problematic UTF-8 Files

I recently used pandoc to convert some HTML files to reST. Initially, there were numerous errors about invalid characters. Here is the brute-force solution I arrived at to get the job done while converting the offending characters into something reasonable.

It started with a clue I found here.

With the clue in mind, I wrote a script similar to this:

for file in *html; do
    # Re-encode to UTF-8, then hand the cleaned stream to pandoc on stdin
    iconv -t utf-8//TRANSLIT//IGNORE "${file}" \
          | pandoc -f html -t rst -o "${file}.rst"
done

The magic here is the use of iconv. The -t option says that we want to convert the input to UTF-8, substituting a close equivalent for a problem character where possible (via TRANSLIT) and, where not possible, simply dropping the problem character(s) (via IGNORE). Note that iconv does not detect the input encoding: with no -f option, it assumes the input is in the current locale's encoding.
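As a rough sketch of the same pipeline with the source encoding stated through -f instead of left to the locale, the helpers below wrap the iconv call in a function. The names to_utf8 and convert_all are hypothetical, the UTF-8 default for the source encoding is an assumption, and the //TRANSLIT//IGNORE suffixes are taken from the script above and may behave differently outside GNU iconv. Unlike the original loop, this version writes foo.rst rather than foo.html.rst.

```shell
#!/bin/sh
# to_utf8 and convert_all are hypothetical helper names,
# not part of iconv or pandoc.

# Re-encode one file to UTF-8 on stdout.
#   $1 = input file
#   $2 = assumed source encoding (defaults to UTF-8, an assumption)
to_utf8() {
    iconv -f "${2:-UTF-8}" -t 'UTF-8//TRANSLIT//IGNORE' "$1"
}

# Convert every .html file in the current directory to reST.
#   $1 = assumed source encoding, passed through to to_utf8
convert_all() {
    for file in *.html; do
        # Skip the unexpanded pattern when no .html files exist.
        [ -e "$file" ] || continue
        # Strip the .html suffix so foo.html becomes foo.rst.
        to_utf8 "$file" "$1" | pandoc -f html -t rst -o "${file%.html}.rst"
    done
}
```

Running convert_all ISO-8859-1 in a directory of Latin-1 HTML files would then convert each one; calling it with no argument falls back to treating the input as UTF-8 and dropping whatever does not decode.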

It was more than adequate for the need at hand.

The pandoc documentation does mention the use of iconv; however, it doesn't give you the sledgehammer that the earlier link did, which allowed me to beat the offending files into submission.