4.20 Manipulate Markdown via Pandoc Lua filters (*)

Technically, this section may be a little advanced, but once you learn how your Markdown content is translated into the Pandoc abstract syntax tree (AST), you will have the power of manipulating any Markdown elements with the programming language called Lua.

Basically, when Pandoc reads a Markdown file, the content will be parsed into an AST. Pandoc allows you to modify this AST with Lua scripts. We use the following simple Markdown file (named ast.md) to show what the AST means:

## Section One

Hello world!

This file contains a header and a paragraph. After Pandoc parses this content, it may be easier for R users to understand the resulting AST if we convert the file to the JSON format:

pandoc -f markdown -t json -o ast.json ast.md

Then read the JSON file into R, and print out the data structure.

When you do this, you will see that the Markdown content is represented in a recursive list. Its structure is printed below. The label t stands for “type,” and c stands for “content.” Take the header for example. Its type is “Header”, and its content has three sub-elements: the header level (2), the attributes (e.g., the ID is section-one), and the text content.

xfun:::tree(
  jsonlite::fromJSON('ast.json', simplifyVector = FALSE)
)
List of 3
 |-pandoc-api-version:List of 3
 |  |-: int 1
 |  |-: int 22
 |  |-: int 1
 |-meta              : Named list()
 |-blocks            :List of 2
    |-:List of 2
    |  |-t: chr "Header"
    |  |-c:List of 3
    |     |-: int 2
    |     |-:List of 3
    |     |  |-: chr "section-one"
    |     |  |-: list()
    |     |  |-: list()
    |     |-:List of 3
    |        |-:List of 2
    |        |  |-t: chr "Str"
    |        |  |-c: chr "Section"
    |        |-:List of 1
    |        |  |-t: chr "Space"
    |        |-:List of 2
    |           |-t: chr "Str"
    |           |-c: chr "One"
    |-:List of 2
       |-t: chr "Para"
       |-c:List of 3
          |-:List of 2
          |  |-t: chr "Str"
          |  |-c: chr "Hello"
          |-:List of 1
          |  |-t: chr "Space"
          |-:List of 2
             |-t: chr "Str"
             |-c: chr "world!"

After you are aware of the AST, you can modify it with the Lua programming language. Pandoc has a built-in Lua interpreter, so you do not need to install additional tools. The Lua scripts are called “Lua filters” for Pandoc. Next we give a quick example of raising the levels of headers by one, e.g., convert level 3 headers to level 2 headers. This may be useful when the top-level headers of your document are level 2 headers, but you want to start with level 1 headers instead.

First, we create a Lua script file named raise-header.lua, which contains a function named Header, indicating that we want to modify elements of the type “Header” (in general, you can use the type name as the function name to process elements of a certain type):

function Header(el)
  -- The header level can be accessed via the attribute 'level'
  -- of the element. See the Pandoc documentation later.
  if (el.level <= 1) then
    error("I don't know how to raise the level of h1")
  end
  el.level = el.level - 1
  return el
end

Then we can pass this script to Pandoc via the argument --lua-filter, e.g.,

pandoc -t markdown --markdown-headings=atx \
  --lua-filter=raise-header.lua ast.md
# Section One

Hello world!

You can see that we have successfully converted ## Section One to # Section One. You may feel this example is trivial, and wonder why not simply replace ## with # with a regular expression like:

gsub("^##", "#", readLines("ast.md"))

Usually it is not robust to manipulate a structured document with regular expressions, because there are almost always exceptions, e.g., what if ## means a comment in R code? The AST gives you the structured data, so you know for sure that you are modifying the expected elements.

Pandoc has extensive documentation on Lua filters at https://pandoc.org/lua-filters.html, where you can find a large number of examples. You can also find some filters written by the community in the GitHub repository at https://github.com/pandoc/lua-filters.

In the R Markdown world, below is an incomplete list of packages that have made use of Lua filters (usually they are in the inst/ directory):

  • The rmarkdown package (https://github.com/rstudio/rmarkdown) contains filters that insert page breaks (see Section 4.1) and generate custom blocks (see Section 9.6).

  • The pagedown package (Xie et al. 2022) contains filters that help implement footnotes and the list of figures on HTML pages.

  • The govdown package (Garmonsway 2021) contains filters to convert Pandoc’s fenced Divs to appropriate HTML tags.

You can also find an example in Section 5.1.2 in this book, which shows you how to change the text color with a Lua filter.

For R Markdown users who do not want to create R packages to ship the Lua filters (like the above packages), you may store these Lua scripts anywhere on your computer, and apply them through the pandoc_args option of an R Markdown output format, e.g.,

---
output:
  html_document:
    pandoc_args:
      - --lua-filter=raise-header.lua
---

References

Garmonsway, Duncan. 2021. Govdown: GOV.UK Style Templates for r Markdown. https://ukgovdatascience.github.io/govdown/.
Xie, Yihui, Romain Lesur, Brent Thorne, and Xianying Tan. 2022. Pagedown: Paginate the HTML Output of r Markdown with CSS for Print. https://github.com/rstudio/pagedown.