Chapter 2 Basics
R Markdown provides an authoring framework for data science. You can use a single R Markdown file to both save and execute code, and generate high-quality reports that can be shared with an audience.
R Markdown was designed for easier reproducibility, since both the computing code and narratives are in the same document, and results are automatically generated from the source code. R Markdown supports dozens of static and dynamic/interactive output formats.
If you prefer a video introduction to R Markdown, we recommend that you check out the website https://rmarkdown.rstudio.com, and watch the videos in the “Get Started” section, which cover the basics of R Markdown.
Below is a minimal R Markdown document, which should be a plain-text file, with the conventional extension `.Rmd`:
---
title: "Hello R Markdown"
author: "Awesome Me"
date: "2018-02-14"
output: html_document
---
This is a paragraph in an R Markdown document.
Below is a code chunk:
```{r}
fit = lm(dist ~ speed, data = cars)
b = coef(fit)
plot(cars)
abline(fit)
```
The slope of the regression is `r b[1]`.
You can create such a text file with any editor (including but not limited to RStudio). If you use RStudio, you can create a new Rmd file from the menu File -> New File -> R Markdown.
There are three basic components of an R Markdown document: the metadata, text, and code. The metadata is written between a pair of three-dash lines (`---`). The syntax for the metadata is YAML (YAML Ain’t Markup Language, https://en.wikipedia.org/wiki/YAML), so it is sometimes also called the YAML metadata or the YAML frontmatter. Before it bites you hard, we want to warn you in advance that indentation matters in YAML, so do not forget to indent the sub-fields of a top-level field properly. See Appendix B.2 of Xie (2016) for a few simple examples that show the YAML syntax.
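As a minimal sketch of why indentation matters, the R snippet below parses two small YAML strings with the yaml package (assumed to be installed; it is a dependency of rmarkdown). The `toc` option of `html_document` is used here only to illustrate a sub-field.

```r
library(yaml)

# Sub-fields indented under their parent: toc belongs to html_document,
# which in turn belongs to output.
good = yaml.load("output:\n  html_document:\n    toc: true")
str(good)

# Same text without indentation: the three keys become unrelated
# top-level fields, so output no longer contains html_document.
bad = yaml.load("output:\nhtml_document:\ntoc: true")
str(bad)
```

In the second case the parser does not complain, which is why a missing indent in the metadata can go unnoticed until an option silently has no effect.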
The body of a document follows the metadata. The syntax for text (also known as prose or narratives) is Markdown, which is introduced in Section 2.5. There are two types of computer code, which are explained in detail in Section 2.6:
- A code chunk starts with three backticks like `` ```{r} ``, where `r` indicates the language name, and ends with three backticks. You can write chunk options in the curly braces (e.g., set the figure height to 5 inches: `` ```{r, fig.height=5} ``).
- An inline R code expression starts with `` `r `` and ends with a backtick `` ` ``.
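To see both kinds of code at work outside RStudio, one option is to pass a few lines of R Markdown source to `knitr::knit()` through its `text` argument. This is only a sketch, assuming the knitr package is installed; the variable `x` and the chunk option `echo=FALSE` are chosen purely for illustration.

```r
library(knitr)

# A tiny R Markdown body: one code chunk and one inline R expression.
src = c(
  "```{r, echo=FALSE}",
  "x = 1 + 1",
  "```",
  "",
  "The value of x is `r x`."
)

# knit() runs the chunk, then substitutes the result of the inline expression.
out = knit(text = src, quiet = TRUE)
cat(out)
```

The chunk is evaluated first, so the inline expression in the last line can refer to the object it created.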
Figure 2.1 shows the above example in the RStudio IDE. You can click the `Knit` button to compile the document (to an HTML page). Figure 2.2 shows the output in the RStudio Viewer.
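Outside RStudio, the same compilation can be triggered from the R console with `rmarkdown::render()`. The following is a minimal sketch, assuming the rmarkdown package and Pandoc are available; the temporary file and its contents are made up for illustration.

```r
library(rmarkdown)

# Write a tiny R Markdown document to a temporary file (illustrative content).
rmd = tempfile(fileext = ".Rmd")
writeLines(c(
  "---",
  "title: \"Hello R Markdown\"",
  "output: html_document",
  "---",
  "",
  "This is a paragraph in an R Markdown document."
), rmd)

# render() needs Pandoc; skip the compilation if it cannot be found.
out = if (pandoc_available()) render(rmd, quiet = TRUE) else NULL
out  # path to the generated .html file, or NULL if Pandoc is missing
```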
Now please take a closer look at the example. Did you notice a problem? The object `b` is the vector of coefficients of length 2 from the linear regression; `b[1]` is actually the intercept, and `b[2]` is the slope! This minimal example shows you why R Markdown is great for reproducible research: it includes the source code right inside the document, which makes it easy to discover and fix problems, as well as update the output document. All you have to do is change `b[1]` to `b[2]`, and click the `Knit` button again. Had you copied a number `-17.579` computed elsewhere into this document, it would be very difficult to realize the problem. In fact, I had used this example a few times in my own presentations before I discovered this problem during one of my talks, but I discovered it anyway.
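The mix-up is easy to check in the R console. The sketch below refits the model from the example and inspects the names of the coefficient vector; indexing by name is offered here as one possible safeguard, not something the original example does.

```r
fit = lm(dist ~ speed, data = cars)
b = coef(fit)

names(b)          # "(Intercept)" "speed"
round(b[1], 3)    # the intercept, -17.579
round(b[2], 3)    # the slope, 3.932

# Indexing by name avoids relying on the position of the coefficient.
slope = b[["speed"]]
```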
Although the above is a toy example, it could become a horror story if it happens in scientific research that was not done in a reproducible way (e.g., cut-and-paste). Here are two of my personal favorite videos on this topic:
“A reproducible workflow” by Ignasi Bartomeus and Francisco Rodríguez-Sánchez (https://youtu.be/s3JldKoA0zw). It is a 2-min video that looks artistic but also shows very common and practical problems in data analysis.
“The Importance of Reproducible Research in High-Throughput Biology” by Keith Baggerly (https://youtu.be/7gYIs7uYbMo). You will be impressed by both the content and the style of this lecture. Keith Baggerly and Kevin Coombes were the two notable heroes in revealing the Duke/Potti scandal, which was described as “one of the biggest medical research frauds ever” by the television program “60 Minutes.”
It is fine for humans to err (in computing), as long as the source code is readily available.