Pandoc

Convert Markdown to other formats via Pandoc

2013-03-06

Note: you are no longer recommended to use the pandoc() function in knitr. Please try the rmarkdown package instead: http://rmarkdown.rstudio.com


The function pandoc() in knitr (since version 1.2) was designed to convert Markdown documents to other formats such as LaTeX/PDF, HTML and Word (odt/docx). The main idea is to minimize the command-line call by wrapping commands into a configuration file or embedded configurations. Normally we call Pandoc via command line like this:

pandoc -s --mathjax --number-sections --bibliography=foo.bib \
  -o output.html input.md

It is tedious to type the same command again and again. What pandoc() does is to execute the command like above in R via system(), but read the Pandoc arguments from a config file, so that we can write all arguments in the file once, and simply call pandoc('input.md') afterwards.

Please follow the instructions on the Pandoc website to install it.

Absolute beginners

If you have no experience using the command line, you can try this function without any configurations. Write a Markdown file, say, foo.md, and throw it into pandoc():

library(knitr)
pandoc('foo.md', format='html')  # HTML
pandoc('foo.md', format='latex') # LaTeX/PDF
pandoc('foo.md', format='docx')  # MS Word
pandoc('foo.md', format='odt')   # OpenDocument

But you often need some custom options like what we showed in the beginning. Now we explain how to pass such options to Pandoc.

A simple example

Suppose you want to convert Markdown to HTML with arguments in the first command line example, you can write a config file like this:

t: html
s: 
mathjax: 
number-sections: 
bibliography: foo.bib
o: output.html

You can save it as foo.txt and run

library(knitr)
pandoc('input.md', format='html', config='foo.txt')

Then knitr will parse this config file and turn it into pandoc arguments. The empty options such as s and mathjax are turned to -s and --mathjax respectively, and those non-empty options like bibliography and o are converted to --bibliography=foo.bib and -o output.html respectively.

The config file

The config file is essentially a Debian Control File. Here are some rules:

  1. the option name and value are separated by :
  2. an option can have a value of multiple lines but all the following lines have to be indented by white spaces
  3. blank lines are used to separate records (paragraphs)

The first rule is simple. For the second rule, consider bibliography: when there are multiple bibliography databases to be passed to Pandoc, we can write the config file as

bibliography: paper1.bib
  paper2.bib
  paper3.bib

In this case, it is converted to --bibliography=paper1.bib --bibliography=paper2.bib --bibliography=paper3.bib and passed to Pandoc.

For the third rule, it is useful when we define multiple output formats in the config file; below is an example of two records for html and latex output:

t: html
s: 
mathjax: 
number-sections: 
bibliography: foo.bib
o: output.html

t: latex
latex-engine: xelatex
s: 
number-sections: 
output: test.pdf

With this config file, we can call pandoc('input.md', format='latex', config='foo.txt') and we will get a PDF file test.pdf.

The name of the config file is obtained from getOption('config.pandoc') by default, which means you can set options(config.pandoc = 'path/to/your/config.file') as a global option. If this option is not set, the pandoc() function will look for a file foo.pandoc where foo is the base name of the input file, e.g. it looks for test.pandoc if the input file is test.md. In other words, the config file has the same name as the Markdown file except that it has a different extension.

Common options

Sometimes we want to share some a few common options across different output formats. For instance, --number-sections can be used for both PDF and HTML output. The record that does not contain the t tag is treated as common options for all formats. Now we can rewrite the above config file as:

s: 
number-sections: 

t: html
mathjax: 
bibliography: foo.bib
o: output.html

t: latex
latex-engine: xelatex
output: test.pdf

Note the s and number-sections are extracted to a separate record without a t tag.

Embedded configurations

We may want to make the Markdown file self-contained in the sense that the configurations are embedded in it, so we do not need to rely on an external config file. In this case, we can use a special comment <!--pandoc --> in the Markdown file.

<!--pandoc
t: html
s:
mathjax:
number-sections:
bibliography: foo.bib
o: output.html
-->

Now we can pass a single file to other people and they will be able to call pandoc() to convert it to the expected format.

If both the config file and embedded configurations are found, they will be combined as if they were from a single file.

Complete examples

See the example 084 (using an external config file) and 088 (using embedded configurations).