A client has a substantial body of written work that began life as MS Word documents. They also created many derivative documents in various formats (including HTML, PDF, Dreamweaver, MS FrontPage, etc.) for different use cases. Our recommendation was that, going forward, they should create content once, in a canonical format, from which they could generate documents in whatever format they need (HTML, PDF, etc.). That subject is beyond the scope of this note. This note focuses on a quick-and-dirty method to get at their assets in MS Word format and convert them into something more widely useful.
Word to reST
While LaTeX is our tool of choice for in-house content, it is really only suited to hardcore users, such as programmers or academics. We decided to convert our client's Word assets into reST, knowing that we could later convert to HTML, PDF, etc., as needed. The key to making all this work is the remarkable 'pandoc' package, which handles the conversion to reST.
If you need to convert files from one markup format into another, pandoc is your Swiss Army knife. Pandoc can convert documents in markdown, reStructuredText, textile, HTML, DocBook, or LaTeX to:
- HTML formats: XHTML, HTML5, and HTML slide shows using Slidy, Slideous, S5, or DZSlides.
- Word processor formats: Microsoft Word docx, OpenOffice/LibreOffice ODT, OpenDocument XML
- Ebooks: EPUB
- Documentation formats: DocBook, GNU Texinfo, Groff man pages
- TeX formats: LaTeX, ConTeXt, LaTeX Beamer slides
- PDF via LaTeX
- Lightweight markup formats: Markdown, reStructuredText, AsciiDoc, MediaWiki markup, Emacs Org-Mode, Textile
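To illustrate the "convert later, as needed" idea, here is a minimal Python sketch that builds pandoc commands for turning one reST file into several of the output formats listed above. The function names and the choice of targets are our own illustration, not part of pandoc; it assumes pandoc is on the PATH, and PDF output additionally needs a LaTeX installation.

```python
import subprocess
from pathlib import Path

# Hypothetical choice of targets; pandoc infers the writer from the extension.
TARGET_EXTENSIONS = (".html", ".pdf", ".epub")


def build_export_commands(rst_path: Path) -> list[list[str]]:
    """Return one pandoc argument list per target format."""
    commands = []
    for extension in TARGET_EXTENSIONS:
        out_path = rst_path.with_suffix(extension)
        # -s (--standalone) asks for a complete document, not a fragment.
        commands.append(
            ["pandoc", "-f", "rst", "-s", str(rst_path), "-o", str(out_path)]
        )
    return commands


def export_all(rst_path: Path) -> None:
    """Run every export command; raises CalledProcessError if pandoc fails."""
    for command in build_export_commands(rst_path):
        subprocess.run(command, check=True)
```

Calling export_all(Path("newfile.rst")) would then write newfile.html, newfile.pdf, and newfile.epub next to the source file.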
Here are the steps:
- Open document in Word, and 'Save document as Web Page...'
- Choose 'Save entire file as HTML'
- In 'Web Options', set Encoding to 'UTF-8'
- Save the file
- Run pandoc on the saved file: pandoc word_output.html -o newfile.rst
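With many documents, the final pandoc step is worth scripting. The following is a rough sketch rather than the exact script we used: it assumes the HTML files saved from Word sit in one folder with a .html extension (adjust the pattern if Word wrote .htm), and that pandoc is on the PATH. The function names are our own.

```python
import subprocess
from pathlib import Path


def build_pandoc_command(html_path: Path) -> list[str]:
    """Return the pandoc argument list for one Word-exported HTML file."""
    rst_path = html_path.with_suffix(".rst")
    # -f/-t state the formats explicitly instead of relying on pandoc's
    # guess from the file extensions.
    return ["pandoc", "-f", "html", "-t", "rst", str(html_path), "-o", str(rst_path)]


def convert_directory(directory: Path) -> list[Path]:
    """Convert every .html file in `directory`; return the .rst paths written."""
    outputs = []
    for html_path in sorted(directory.glob("*.html")):
        command = build_pandoc_command(html_path)
        subprocess.run(command, check=True)  # raises if pandoc reports an error
        outputs.append(html_path.with_suffix(".rst"))
    return outputs
```

For example, convert_directory(Path("word_exports")) would convert everything in a (hypothetical) word_exports folder, leaving each .rst file beside its .html source.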
What Didn't Work (For Us)
- We tried opening the DOCX file in OpenOffice and saving it as an ODT file (pandoc can deal with ODT as a file type). However, pandoc encountered encoding errors which we were not keen to pursue. Most likely it takes only a simple tweak to make it work, once you've identified the problem. We did try using iconv, as mentioned in the pandoc documentation, without much success.
- We tried printing from Word to PDF, then converting the PDF with pandoc, but we ran into encoding errors here as well.
- Word's HTML output is excruciatingly ugly and suffers from layout problems. Even so, of all the avenues we pursued, this approach gave the best conversion.
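Pandoc expects UTF-8 input, which is why the pandoc documentation suggests iconv. For plain-text inputs, such as HTML saved from Word without the UTF-8 setting, the same re-encoding can be sketched in Python. This is an illustration under assumptions, not a fix we verified: the source encoding (cp1252 here) is a guess you would need to confirm for your own files, and it does not apply to binary containers such as ODT or PDF.

```python
from pathlib import Path


def reencode_to_utf8(src: Path, dest: Path, source_encoding: str = "cp1252") -> None:
    """Rewrite the text file `src` as UTF-8 in `dest`.

    Roughly what `iconv -f <source_encoding> -t utf-8` does. The default
    source encoding is an assumption (common for Windows-authored files).
    """
    # Strict decoding raises UnicodeDecodeError on bytes that are not valid
    # in the assumed encoding, which is a useful hint that the guess is wrong.
    text = src.read_bytes().decode(source_encoding)
    dest.write_bytes(text.encode("utf-8"))
```

The re-encoded file can then be passed to pandoc in place of the original.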