RMarkdown Advanced
This section does not have a companion video.
We have been using RMarkdown files to combine the analysis and discussion into one nice document that contains all the analysis steps so that your research is reproducible.
There are many resources on the web about Markdown and the variant that RStudio uses (called RMarkdown), but the easiest reference is the RStudio Help tab. I particularly like Help -> Cheatsheets -> RMarkdown Reference Guide because it gives me the standard Markdown information along with a lot of information about the options I can use to customize the behavior of individual R code chunks.
Most of what is presented here isn’t primarily about how to use R, but rather how to work with tools in RMarkdown so that the final product is neat and tidy. While you could print out your RMarkdown file and then clean it up in MS Word, sometimes there is good reason to want as nice a starting point as possible.
Chunk Options
Within an RMarkdown file, we usually have some R chunks, and there are many things we can do to tweak how the results are displayed. Below is an R chunk with a few figure options enabled.
```{r, echo=FALSE, fig.height=4, fig.width=6}
plot(cars)
```
In this example, I’ve shown what a code chunk might look like when I include different chunk options. Here I’ve set the figure output height to 4 inches and the width to 6 inches. Setting echo=FALSE specifies that the code is run and the output is shown, but the R code that produces the output is not displayed.
The comprehensive list of R chunk options is maintained by the knitr package author, Yihui Xie, on the knitr website. There are many chunk options available; below are only the ones I use most commonly.
Option | Default | Description |
---|---|---|
echo | TRUE | If FALSE, knitr will not display the code in the code chunk above its results in the final document. |
results | 'markup' | If 'hide', knitr will not display the code’s results in the final document. If 'hold', knitr will delay displaying all output pieces until the end of the chunk. If 'asis', knitr will pass through results without reformatting them (useful if results return raw HTML, etc.). |
error | TRUE | If TRUE, error messages are shown in the document and knitting continues. If FALSE, knitting stops at the first error. (TRUE is the knitr default; documents rendered through rmarkdown default to FALSE.) |
message | TRUE | If FALSE, knitr will not display any messages generated by the code. |
warning | TRUE | If FALSE, knitr will not display any warning messages generated by the code. |
fig.height | 7 | The height to use in R for plots created by the chunk (in inches). |
fig.width | 7 | The width to use in R for plots created by the chunk (in inches). |
fig.dim | 7,7 | The width and height, in that order, to use in R for plots created by the chunk (in inches); for example, fig.dim = c(6, 4) is shorthand for fig.width = 6 and fig.height = 4. |
To set the chunk options for ALL chunks, which can be overwritten on a case by case basis, we can use the global options.
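As a minimal sketch of what setting global options can look like, the following R code is assumed to sit in a setup chunk near the top of the RMarkdown file. The particular option values are only illustrative choices, not recommendations from this section.

```r
# Assumed to live in a setup chunk near the top of the .Rmd file.
library(knitr)

# Set defaults for every chunk that follows; any single chunk can
# still override these in its own chunk header.
opts_chunk$set(
  echo       = FALSE,  # hide the R code by default
  message    = FALSE,  # suppress messages (e.g. from loading packages)
  warning    = FALSE,  # suppress warnings
  fig.height = 4,      # default figure height in inches
  fig.width  = 6       # default figure width in inches
)

# Query the current defaults to confirm they took effect.
opts_chunk$get("echo")
opts_chunk$get("fig.width")
```

A later chunk whose header includes echo=TRUE would show its code even though the global default hides it.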
Verbatim & List Environments
This section is best reviewed using the source code of the book. Some parts may not show all the intended detail on the html output.
The way that Markdown starts a verbatim environment is to indent your text with 4 spaces. If you have the following code in your Rmarkdown file:
    This is text that will be printed verbatim.
Then you’ll see the following output:
This is text that will be printed verbatim.
Notice that the Markdown verbatim environment is how your R code chunks get displayed exactly as you wrote them. This is a necessary and handy trick for producing really nice knitted output.
Markdown unfortunately ALSO uses four spaces to denote an indented list environment.
1. Problem definition. This problem definition spans several lines. On
    the second line, I'll indent 4 spaces to keep us in the list environment.
    a) Part a. This might be very long. To keep ourselves in this
        list element, we indent 8 spaces. (4 for problem 1, and four for part a).
    b) Part b.
Produces the following output:
- Problem definition. This problem definition spans several lines. On
the second line, I’ll indent 4 spaces to keep us in the list
environment.
- Part a. This might be very long. To keep ourselves in this list element, we indent 8 spaces. (4 for problem 1, and four for part a).
- Part b.
But notice what happens if I insert an R code chunk between part a) and b) and, critically, the R chunk is not indented by four spaces.
1. Problem definition.
    a) Part a.
```{r}
2+3
```
    b) Part b.
Without the four spaces ahead of the code chunk between parts (a) and (b), we fall out of the nested list environment and begin a verbatim environment.
- Problem definition.
- Part a.
## [1] 5
b) Part b.
So to keep ourselves in the nested list environment, we need to indent the R chunk 4 (or 8) spaces. If we indent it 4 spaces, the R code and output will be aligned with the a); if we use 8 spaces, it will be indented from the a).
1. Problem definition.
    a) Part a.
        ```{r}
        2+3
        ```
    b) Part b.
- Problem definition.
Part a.
## [1] 5
Part b.
I really like the code indented from the a) header, but then the code editor doesn’t do syntax highlighting because, at first blush, the chunk looks like a verbatim environment and RStudio isn’t smart enough to realize that it isn’t. So my solution is to get the R code working first and then indent it the 8 spaces.
Finally, I often leave a blank line separating my response in part (a) to the problem definition for part (b). Again the RStudio editor isn’t smart enough to realize that we are writing an R chunk. Unfortunately I don’t have a clever hack to keep the editor from thinking that you are in the verbatim environment. Fortunately, when we knit, it will all be fine.
Mathematical Expressions
The primary way to insert a mathematical expression is to use a markup language called LaTeX. This is a very powerful system and it is what most mathematicians use to write their documents. The downside is that there is a lot to learn. However, you can get most of what you need pretty easily. For RMarkdown to recognize that you are writing math using LaTeX there are two common options. 1) We can use a single dollar sign ($) within a sentence; this is often referred to as in-line LaTeX. 2) We can use two dollar signs ($$) to create a LaTeX environment, where we can include additional options such as align or cases that allow for more flexible writing of mathematical expressions. In general, if you are trying to add a simple LaTeX expression, use one $, and if you need to write out a longer, more complex mathematical argument, open a LaTeX environment with two $.
Some examples of common LaTeX patterns are given below:
Goal | LaTeX | Output | LaTeX | Output |
---|---|---|---|---|
Power | $x^2$ | \(x^2\) | $y^{0.95}$ | \(y^{0.95}\) |
Subscript | $x_i$ | \(x_i\) | $t_{24}$ | \(t_{24}\) |
Greek | $\alpha$ $\beta$ | \(\alpha\) \(\beta\) | $\theta$ $\Theta$ | \(\theta\) \(\Theta\) |
Bar | $\bar{x}$ | \(\bar{x}\) | $\bar{\mu}_i$ | \(\bar{\mu}_i\) |
Hat | $\hat{\mu}$ | \(\hat{\mu}\) | $\hat{y}_i$ | \(\hat{y}_i\) |
Star | $y^*$ | \(y^*\) | $\hat{\mu}^*_i$ | \(\hat{\mu}^*_i\) |
Centered Dot | $\cdot$ | \(\cdot\) | $\bar{y}_{i\cdot}$ | \(\bar{y}_{i\cdot}\) |
Sum | $\sum x_i$ | \(\sum x_i\) | $\sum_{i=0}^N x_i$ | \(\sum_{i=0}^N x_i\) |
Square Root | $\sqrt{a}$ | \(\sqrt{a}\) | $\sqrt{a^2 + b^2}$ | \(\sqrt{a^2 + b^2}\) |
Fractions | $\frac{a}{b}$ | \(\frac{a}{b}\) | $\frac{x_i - \bar{x}}{s/\sqrt{n}}$ | \(\frac{x_i - \bar{x}}{s/\sqrt{n}}\) |
Within your RMarkdown document, you can include LaTeX code by enclosing it with dollar signs. So you might write $\alpha=0.05$ in your text, but after it is knitted to PDF, HTML, or Word, you’ll see \(\alpha=0.05\).
If you want your mathematical expression to be on its own line, all by itself, enclose it with double dollar signs. So
$$z_i = \frac{x_i-\bar{x}}{\sigma / \sqrt{n}}$$
would be displayed as
\[ z_{i}=\frac{x_{i}-\bar{x}}{\sigma/\sqrt{n}} \]
Unfortunately RMarkdown is a little picky about spaces near the $ and $$ signs, and you can’t have any spaces between them and the LaTeX command. For more information about all the different symbols you can use, Google ‘LaTeX math symbols’. In general, if you open a LaTeX environment using $$, be sure to start writing on the line directly after them, and do not include extra spaces!
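To make the two-dollar-sign environments described above concrete, here is a small sketch of two display equations, one using cases and one using a multi-line alignment. The formulas themselves are only illustrations. I have used aligned rather than align here on the assumption that the document may be knitted to PDF, where align generally cannot be nested inside $$; for HTML output either form usually works.

```latex
$$
f(x) = \begin{cases}
  x^2 & \text{if } x \ge 0 \\
  -x  & \text{if } x < 0
\end{cases}
$$

$$
\begin{aligned}
\bar{x} &= \frac{1}{n} \sum_{i=1}^n x_i \\
s^2     &= \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2
\end{aligned}
$$
```

In both blocks the & marks the alignment point and \\ ends a line.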
Tables
For the following descriptions of the simple, grid, and pipe tables, I’ve shamelessly stolen from the Pandoc documentation.
One way to print a table is to just print it in R and have the table presented in the code chunk. For example, suppose I want to print out the first 4 rows of the trees data set.
##   Girth Height Volume
## 1   8.3     70   10.3
## 2   8.6     65   10.3
## 3   8.8     63   10.2
## 4  10.5     72   16.4
Usually this is sufficient, but suppose you want something a bit nicer because you are generating tables regularly and you don’t want to have to clean them up by hand. Tables in RMarkdown follow the table conventions from the Markdown class with a few minor exceptions. Markdown provides 4 ways to define a table, and RMarkdown supports 3 of them.
Simple Tables
Simple tables look like this:
  Right     Left     Center     Default
-------     ------ ----------   -------
     12     12        hmmm          12
    123     123       123          123
      1     1          1             1

and would be rendered like this:
Right | Left | Center | Default |
---|---|---|---|
12 | 12 | hmmm | 12 |
123 | 123 | 123 | 123 |
1 | 1 | 1 | 1 |
Notice I don’t wrap them in dollar signs or anything, just a blank line above and below the table. The headers and table rows must each fit on one line. Column alignments are determined by the position of the header text relative to the dashed line below it.
If the dashed line is flush with the header text on the right side but extends beyond it on the left, the column is right-aligned. If the dashed line is flush with the header text on the left side but extends beyond it on the right, the column is left-aligned. If the dashed line extends beyond the header text on both sides, the column is centered. If the dashed line is flush with the header text on both sides, the default alignment is used (in most cases, this will be left). The table must end with a blank line, or a line of dashes followed by a blank line.
Grid Tables
Grid tables are a little more flexible, and each cell can contain arbitrary Markdown block elements (such as lists).
+---------------+---------------+--------------------+
| Fruit | Price | Advantages |
+===============+===============+====================+
| Bananas | $1.34 | - built-in wrapper |
| | | - bright color |
+---------------+---------------+--------------------+
| Oranges | $2.10 | - cures scurvy |
| | | - tasty |
+---------------+---------------+--------------------+
which is rendered as the following:
Fruit | Price | Advantages |
---|---|---|
Bananas | $1.34 | built-in wrapper; bright color |
Oranges | $2.10 | cures scurvy; tasty |
Grid tables don’t support Left/Center/Right alignment. Both Simple tables and Grid tables require you to format the blocks nicely inside the RMarkdown file, and that can be a bit annoying if something changes and you have to fix the spacing in the rest of the table. Neither Simple nor Grid tables require column headers.
Pipe Tables
Pipe tables look quite similar to grid tables, but Markdown isn’t as picky about the pipes lining up. However, a pipe table does require a header row (whose cells you can leave blank).
| Right | Left | Default | Center |
|------:|:-----|---------|:------:|
| 12 | 12 | 12 | 12 |
| 123 | 123 | 123 | 123 |
| 1 | 1 | 1 | 1 |
which will render as the following:
Right | Left | Default | Center |
---|---|---|---|
12 | 12 | 12 | 12 |
123 | 123 | 123 | 123 |
1 | 1 | 1 | 1 |
In general I prefer to use pipe tables because they seem a little less picky about getting everything correct. However, it is still pretty annoying to get the table laid out correctly. In all of these tables, you can use the regular RMarkdown formatting tricks for italicizing and bolding. So I could have a table such as the following:
| Source | df | Sum of Sq | Mean Sq | F | $Pr(>F_{1,29})$ |
|:------------|-----:|--------------:|--------------:|-------:|--------------------:|
| Girth | *1* | 7581.8 | 7581.8 | 419.26 | **< 2.2e-16** |
| Residual | 29 | 524.3 | 18.1 | | |
and have it look like this:
Source | df | Sum of Sq | Mean Sq | F | \(Pr(>F_{1,29})\) |
---|---|---|---|---|---|
Girth | *1* | 7581.8 | 7581.8 | 419.26 | **< 2.2e-16** |
Residual | 29 | 524.3 | 18.1 | | |
The problem with all of this is that I don’t want to create these by hand. Instead I would like functions that take a data frame or matrix and spit out the RMarkdown code for the table.
R functions to produce table code
There are a couple of different packages that convert a data frame to simple/grid/pipe table. We will explore a couple of these, starting with the most basic and moving to the more complicated. The general idea is that we’ll produce the appropriate simple/grid/pipe table syntax in R, and when it gets knitted, then RMarkdown will turn our simple/grid/pipe table into something pretty.
knitr::kable package
The knitr package includes a function, kable, that produces simple tables. It doesn’t have much customization, but it gets the job done. One nice aspect of kable compared to pander is that we don’t need to set any additional chunk options.
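The chunk that produced the table shown next is not visible in this output, so the following is only a sketch of a call that would give a table like it, assuming the first four rows of the built-in trees data. The format = "pipe" argument is my addition so that the generated Markdown is visible when the code is run at the console; inside a knitted document it can be left out.

```r
library(knitr)

# First four rows of the built-in trees data set.
first_rows <- head(trees, 4)

# kable() returns the table as Markdown text; printing it inside a
# chunk lets the knitted document render it as a formatted table.
tbl <- kable(first_rows, format = "pipe")
tbl
```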
Girth | Height | Volume |
---|---|---|
8.3 | 70 | 10.3 |
8.6 | 65 | 10.3 |
8.8 | 63 | 10.2 |
10.5 | 72 | 16.4 |
To further customize kable tables, one can look into the kableExtra package, which allows for a variety of options to improve table output, beyond the scope of what this section covers.
pander package
The package pander seems to be a nice compromise between customization and not having to learn too much. It is relatively powerful in that it will take summary() and anova() output and produce tables for them. By default pander produces Pandoc multiline tables, but you can ask for Simple, Grid, or Pipe tables.
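As a rough sketch of how the table style can be chosen, the calls below pass a style argument to pander. This assumes the pander package is installed and again uses the first four rows of trees. The style names "grid" and "rmarkdown" are the ones I understand pander to use for grid and pipe tables; check the package documentation if they do not behave as described.

```r
library(pander)

# First four rows of the built-in trees data set.
first_rows <- head(trees, 4)

pander(first_rows)                       # default table style
pander(first_rows, style = "grid")       # grid table
pander(first_rows, style = "rmarkdown")  # pipe table
```

Each call prints Markdown table syntax, which is then turned into a formatted table when the document is knitted.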
Girth | Height | Volume |
---|---|---|
8.3 | 70 | 10.3 |
8.6 | 65 | 10.3 |
8.8 | 63 | 10.2 |
10.5 | 72 | 16.4 |
The pander package deals with summary and anova tables from a variety of different analyses. So you can simply ask for a nice looking version using the following:
library(magrittr)       # provides the %>% pipe used below

# a simple regression model
model <- lm( Volume ~ Girth, data=trees )
model %>%
  summary() %>%         # the usual summary table
  pander::pander()      # make the table print *pretty*
 | Estimate | Std. Error | t value | Pr(>|t|) |
---|---|---|---|---|
(Intercept) | -36.94 | 3.365 | -10.98 | 7.621e-12 |
Girth | 5.066 | 0.2474 | 20.48 | 8.644e-19 |
Observations | Residual Std. Error | \(R^2\) | Adjusted \(R^2\) |
---|---|---|---|
31 | 4.252 | 0.9353 | 0.9331 |
model %>%
  anova() %>%                   # The usual anova table
  pander::pander(missing='')    # Make the table print *pretty*
 | Df | Sum Sq | Mean Sq | F value | Pr(>F) |
---|---|---|---|---|---|
Girth | 1 | 7582 | 7582 | 419.4 | 8.644e-19 |
Residuals | 29 | 524.3 | 18.08 | | |
The missing='' argument causes pander to print a blank instead of NA for any missing values in the table.
Code Appendix
Some people prefer to not be distracted by having all of the R code embedded within a document and just want to see the resulting output tables and graphs. This can easily be done by including echo=FALSE in the header of each code chunk, or in the initial global chunk options via knitr::opts_chunk$set(echo=FALSE). This will cause the code to be executed and the results shown, but the R code used to produce those results will not be shown.
Another preference is to not show the code at each step, but to show the R code only at the very end. For example, perhaps we want a Code Appendix that gives all of the code. The naive approach would be to just copy all the code and create duplicate code chunks, but not evaluate them. However, this violates the rule of reproducible research that the code used to produce the result must be the code that is advertised as having created it.
Instead, we’ll set the default behavior for all code chunks to not show the R code, using the global chunk options described above. Then at the end of the document, we create a code chunk that is echoed but not evaluated, and that copies the code from all of the previous code chunks in the document. The documentation=1 option adds the code chunk headers (if specified) as comments. To insert such a section, one would use the following R chunk:
```{r, ref.label=knitr::all_labels(), echo=TRUE, eval=FALSE, documentation=1}
```