Chapter 12 Reports

If you have ever written a report, you are probably familiar with the process of preparing your figures in some software, say R, and then copy-pasting into your text editor, say MS Word. While very popular, this process is both tedious, and plain painful if your data has changed and you need to update the report. Wouldn’t it be nice if you could produce figures and numbers from within the text of the report, and everything else would be automated? It turns out it is possible. There are actually several systems in R that allow this. We start with a brief review.

Sweave: LaTeX is a markup language that compiles to Tex programs that compile, in turn, to documents (typically PS or PDFs). If you never heard of it, it may be because you were born the the MS Windows+MS Word era. You should know, however, that LaTeX was there much earlier, when computers were mainframes with text-only graphic devices. You should also know that LaTeX is still very popular (in some communities) due to its very rich markup syntax, and beautiful output. Sweave (Leisch 2002) is a compiler for LaTeX that allows you do insert R commands in the LaTeX source file, and get the result as part of the outputted PDF. It’s name suggests just that: it allows to weave S²² output into the document, thus, Sweave.
knitr: Markdown is a text editing syntax that, unlike LaTeX, is aimed to be human-readable, but also compilable by a machine. If you ever tried to read HTML or LaTeX source files, you may understand why human-readability is a desirable property. There are many markdown compilers. One of the most popular is Pandoc, written by the Berkeley philosopher(!) Jon MacFarlane. The availability of Pandoc gave Yihui Xie, a name to remember, the idea that it is time for Sweave to evolve. Yihui thus wrote knitr (Xie 2015), which allows to write human readable text in Rmarkdown, a superset of markdown, compile it with R and the compile it with Pandoc. Because Pandoc can compile to PDF, but also to HTML, and DOCX, among others, this means that you can write in Rmarkdown, and get output in almost all text formats out there.
bookdown: Bookdown (Xie 2016) is an evolution of knitr, also written by Yihui Xie, now working for RStudio. The text you are now reading was actually written in bookdown. It deals with the particular needs of writing large documents, and cross referencing in particular (which is very challenging if you want the text to be human readable).
Shiny: Shiny is essentially a framework for quick web-development. It includes (i) an abstraction layer that specifies the layout of a web-site which is our report, (ii) the command to start a web server to deliver the site. For more on Shiny see Chang et al. (2017).

12.1 knitr

12.1.1 Installation

To run knitr you will need to install the package.

install.packages('knitr')

It is also recommended that you use it within RStudio (version>0.96), where you can easily create a new .Rmd file.

12.1.2 Pandoc Markdown

Because knitr builds upon Pandoc markdown, here is a simple example of markdown text, to be used in a .Rmd file, which can be created using the File-> New File -> R Markdown menu of RStudio.

Underscores or asterisks for _italics1_ and *italics2* return italics1 and italics2. Double underscores or asterisks for __bold1__ and **bold2** return bold1 and bold2. Subscripts are enclosed in tildes, like~this~ (like_this), and superscripts are enclosed in carets like^this^ (like^this).

For links use [text](link), like [my site](www.john-ros.com). An image is the same as a link, starting with an exclamation, like this ![image caption](image path).

An itemized list simply starts with hyphens preceeded by a blank line (don’t forget that!):

- bullet
- bullet
    - second level bullet
    - second level bullet

Compiles into:

bullet
bullet
- second level bullet
- second level bullet

An enumerated list starts with an arbitrary number:

1. number
1. number
    1. second level number
    1. second level number

Compiles into:

number
number
1. second level number
2. second level number

For more on markdown see https://bookdown.org/yihui/bookdown/markdown-syntax.html.

12.1.3 Rmarkdown

Rmarkdown, is an extension of markdown due to RStudio, that allows to incorporate R expressions in the text, that will be evaluated at the time of compilation, and the output automatically inserted in the outputted text. The output can be a .PDF, .DOCX, .HTML or others, thanks to the power of Pandoc.

The start of a code chunk is indicated by three backticks and the end of a code chunk is indicated by three backticks. Here is an example.

```{r  eval=FALSE}
rnorm(10)
```

This chunk will compile to the following output (after setting eval=FALSE to eval=TRUE):

rnorm(10)

##  [1] -1.191659233 -0.008735042 -0.251447395  0.294509886  1.545818175
##  [6]  0.076847920 -0.279008173  1.056446506  0.103559259  1.823610224

Things to note:

The evaluated expression is added in a chunk of highlighted text, before the R output.
The output is prefixed with ##.
The eval= argument is not required, since it is set to eval=TRUE by default. It does demonstrate how to set the options of the code chunk.

In the same way, we may add a plot:

```{r  eval=FALSE}
plot(rnorm(10))
```

which compiles into

plot(rnorm(10))

Some useful code chunk options include:

eval=FALSE: to return code only, without output.
echo=FALSE: to return output, without code.
cache=: to save results so that future compilations are faster.
results='hide': to plot figures, without text output.
collapse=TRUE: if you want the whole output after the whole code, and not interleaved.
warning=FALSE: to supress watning. The same for message=FALSE, and error=FALSE.

You can also call r expressions inline. This is done with a single tick and the r argument. For instance:

`r rnorm(1)` is a random Gaussian will output

-1.6490174 is a random Gaussian.

12.1.4 BibTex

BibTex is both a file format and a compiler. The bibtex compiler links documents to a reference database stored in the .bib file format.

Bibtex is typically associated with Tex and LaTex typesetting, but it also operates within the markdown pipeline.

Just store your references in a .bib file, add a bibliography: yourFile.bib in the YML preamble of your Rmarkdown file, and call your references from the Rmarkdown text using @referencekey. Rmarkdow will take care of creating the bibliography, and linking to it from the text.

12.1.5 Compiling

Once you have your .Rmd file written in RMarkdown, knitr will take care of the compilation for you. You can call the knitr::knitr function directly from some .R file, or more conveniently, use the RStudio (0.96) Knit button above the text editing window. The location of the output file will be presented in the console.

12.2 bookdown

As previously stated, bookdown is an extension of knitr intended for documents more complicated than simple reports– such as books. Just like knitr, the writing is done in RMarkdown. Being an extension of knitr, bookdown does allow some markdowns that are not supported by other compilers. In particular, it has a more powerful cross referencing system.

12.3 Shiny

Shiny (Chang et al. 2017) is different than the previous systems, because it sets up an interactive web-site, and not a static file. The power of Shiny is that the layout of the web-site, and the settings of the web-server, is made with several simple R commands, with no need for web-programming. Once you have your app up and running, you can setup your own Shiny server on the web, or publish it via Shinyapps.io. The freemium versions of the service can deal with a small amount of traffic. If you expect a lot of traffic, you will probably need the paid versions.

12.3.1 Installation

To setup your first Shiny app, you will need the shiny package. You will probably want RStudio, which facilitates the process.

install.packages('shiny')

Once installed, you can run an example app to get the feel of it.

library(shiny)
runExample("01_hello")

Remember to press the Stop button in RStudio to stop the web-server, and get back to RStudio.

12.3.2 The Basics of Shiny

Every Shiny app has two main building blocks.

A user interface, specified via the ui.R file in the app’s directory.
A server side, specified via the server.R file, in the app’s directory.

You can run the app via the RunApp button in the RStudio interface, of by calling the app’s directory with the shinyApp or runApp functions– the former designed for single-app projects, and the latter, for multiple app projects.

shiny::runApp("my_app") # my_app is the app's directory.

The site’s layout, is specified in the ui.R file using one of the layout functions. For instance, the function sidebarLayout, as the name suggest, will create a sidebar. More layouts are detailed in the layout guide.

The active elements in the UI, that control your report, are known as widgets. Each widget will have a unique inputId so that it’s values can be sent from the UI to the server. More about widgets, in the widget gallery.

The inputId on the UI are mapped to input arguments on the server side. The value of the mytext inputId can be queried by the server using input$mytext. These are called reactive values. The way the server “listens” to the UI, is governed by a set of functions that must wrap the input object. These are the observe, reactive, and reactive* class of functions.

With observe the server will get triggered when any of the reactive values change. With observeEvent the server will only be triggered by specified reactive values. Using observe is easier, and observeEvent is more prudent programming.

A reactive function is a function that gets triggered when a reactive element changes. It is defined on the server side, and reside within an observe function.

We now analyze the 1_Hello app using these ideas. Here is the ui.R file.

library(shiny)
shinyUI(fluidPage(
  titlePanel("Hello Shiny!"),
  sidebarLayout(
    sidebarPanel(
      sliderInput(inputId = "bins",
                  label = "Number of bins:", 
                  min = 1,
                  max = 50,
                  value = 30)
    ),
    mainPanel(
      plotOutput(outputId = "distPlot")
    )
  )
))

Here is the server.R file:

library(shiny)
shinyServer(function(input, output) {
  output$distPlot <- renderPlot({
    x    <- faithful[, 2]  # Old Faithful Geyser data
    bins <- seq(min(x), max(x), length.out = input$bins + 1)
    hist(x, breaks = bins, col = 'darkgray', border = 'white')
  })
})

Things to note:

ShinyUI is a (deprecated) wrapper for the UI.
fluidPage ensures that the proportions of the elements adapt to the window side, thus, are fluid.
The building blocks of the layout are a title, and the body. The title is governed by titlePanel, and the body is governed by sidebarLayout. The sidebarLayout includes the sidebarPanel to control the sidebar, and the mainPanel for the main panel.
sliderInput calls a widget with a slider. Its inputId is bins, which is later used by the server within the renderPlot reactive function.
plotOutput specifies that the content of the mainPanel is a plot (textOutput for text). This expectation is satisfied on the server side with the renderPlot function (renderText).
shinyServer is a (deprecated) wrapper function for the server.
The server runs a function with an input and an output. The elements of input are the inputIds from the UI. The elements of the output will be called by the UI using their outputId.

This is the output.

Here is another example, taken from the RStudio Shiny examples.

ui.R:

library(shiny)
fluidPage(
    
  titlePanel("Tabsets"),
  
  sidebarLayout(
    sidebarPanel(
      radioButtons(inputId = "dist", 
                   label = "Distribution type:",
                   c("Normal" = "norm",
                     "Uniform" = "unif",
                     "Log-normal" = "lnorm",
                     "Exponential" = "exp")),
      br(), # add a break in the HTML page.
      
      sliderInput(inputId = "n", 
                  label = "Number of observations:", 
                   value = 500,
                   min = 1, 
                   max = 1000)
    ),
    
    mainPanel(
      tabsetPanel(type = "tabs", 
        tabPanel(title = "Plot", plotOutput(outputId = "plot")), 
        tabPanel(title = "Summary", verbatimTextOutput(outputId = "summary")), 
        tabPanel(title = "Table", tableOutput(outputId = "table"))
      )
    )
  )
)

server.R:

library(shiny)
# Define server logic for random distribution application
function(input, output) {
  
  data <- reactive({
    dist <- switch(input$dist,
                   norm = rnorm,
                   unif = runif,
                   lnorm = rlnorm,
                   exp = rexp,
                   rnorm)
    
    dist(input$n)
  })
  
  output$plot <- renderPlot({
    dist <- input$dist
    n <- input$n
    
    hist(data(), main=paste('r', dist, '(', n, ')', sep=''))
  })
  
  output$summary <- renderPrint({
    summary(data())
  })
  
  output$table <- renderTable({
    data.frame(x=data())
  })
  
}

Things to note:

We reused the sidebarLayout.
As the name suggests, radioButtons is a widget that produces radio buttons, above the sliderInput widget. Note the different inputIds.
Different widgets are separated in sidebarPanel by commas.
br() produces extra vertical spacing (break).
tabsetPanel produces tabs in the main output panel. tabPanel governs the content of each panel. Notice the use of various output functions (plotOutput,verbatimTextOutput, tableOutput) with corresponding outputIds.
In server.R we see the usual function(input,output).
The reactive function tells the server the trigger the function whenever input changes.
The output object is constructed outside the reactive function. See how the elements of output correspond to the outputIds in the UI.

This is the output:

12.3.3 Beyond the Basics

Now that we have seen the basics, we may consider extensions to the basic report.

12.3.3.1 Widgets

actionButton Action Button.
checkboxGroupInput A group of check boxes.
checkboxInput A single check box.
dateInput A calendar to aid date selection.
dateRangeInput A pair of calendars for selecting a date range.
fileInput A file upload control wizard.
helpText Help text that can be added to an input form.
numericInput A field to enter numbers.
radioButtons A set of radio buttons.
selectInput A box with choices to select from.
sliderInput A slider bar.
submitButton A submit button.
textInput A field to enter text.

See examples here.

12.3.3.2 Output Elements

The ui.R output types.

htmlOutput raw HTML.
imageOutput image.
plotOutput plot.
tableOutput table.
textOutput text.
uiOutput raw HTML.
verbatimTextOutput text.

The corresponding server.R renderers.

renderImage images (saved as a link to a source file).
renderPlot plots.
renderPrint any printed output.
renderTable data frame, matrix, other table like structures.
renderText character strings.
renderUI a Shiny tag object or HTML.

Your Shiny app can use any R object. The things to remember:

The working directory of the app is the location of server.R.
The code before shinyServer is run only once.
The code inside `shinyServer is run whenever a reactive is triggered, and may thus slow things. To keep learning, see the RStudio’s tutorial, and the Biblipgraphic notes herein. ### shinydashboard A template for Shiny to give it s modern look. ## flexdashboard If you want to quickly write an interactive dashboard, which is simple enough to be a static HTML file and does not need an HTML server, then Shiny may be an overkill. With flexdashboard you can write your dashboard a single .Rmd file, which will generate an interactive dashboard as a static HTML file. See [http://rmarkdown.rstudio.com/flexdashboard/] for more info. ## Bibliographic Notes For RMarkdown see here. For everything on knitr see Yihui’s blog, or the book Xie (2015). For a bookdown manual, see Xie (2016). For a Shiny manual, see Chang et al. (2017), the RStudio tutorial, or Hadley’s Book. For compunding Plotly’s interactive graphics, with Shiny sites, see here. Video tutorials are available here. ## Practice Yourself

Generate a report using knitr with your name as title, and a scatter plot of two random variables in the body. Save it as PDF, DOCX, and HTML.
Recall that this book is written in bookdown, which is a superset of knitr. Go to the source .Rmd file of the first chapter, and parse it in your head: (https://raw.githubusercontent.com/johnros/Rcourse/master/02-r-basics.Rmd)

Allard, Denis. 2013. “J.-P. Chilès, P. Delfiner: Geostatistics: Modeling Spatial Uncertainty.” Springer.

Arlot, Sylvain, Alain Celisse, and others. 2010. “A Survey of Cross-Validation Procedures for Model Selection.” Statistics Surveys 4: 40–79.

Bates, Douglas, Martin Mächler, Ben Bolker, and Steve Walker. 2015. “Fitting Linear Mixed-Effects Models Using lme4.” Journal of Statistical Software 67 (1): 1–48. https://doi.org/10.18637/jss.v067.i01.

Chang, Winston, Joe Cheng, JJ Allaire, Yihui Xie, and Jonathan McPherson. 2017. Shiny: Web Application Framework for R. https://CRAN.R-project.org/package=shiny.

Christakos, George. 2000. Modern Spatiotemporal Geostatistics. Vol. 6. Oxford University Press.

Conway, Drew, and John White. 2012. Machine Learning for Hackers. " O’Reilly Media, Inc.".

Cressie, Noel, and Christopher K Wikle. 2015. Statistics for Spatio-Temporal Data. John Wiley & Sons.

Diggle, Peter J, JA Tawn, and RA Moyeed. 1998. “Model-Based Geostatistics.” Journal of the Royal Statistical Society: Series C (Applied Statistics) 47 (3): 299–350.

Foster, Dean P, and Robert A Stine. 2004. “Variable Selection in Data Mining: Building a Predictive Model for Bankruptcy.” Journal of the American Statistical Association 99 (466): 303–13.

Friedman, Jerome, Trevor Hastie, and Robert Tibshirani. 2001. The Elements of Statistical Learning. Vol. 1. Springer series in statistics Springer, Berlin.

Graham, RL. 1988. “Isometric Embeddings of Graphs.” Selected Topics in Graph Theory 3: 133–50.

Greene, William H. 2003. Econometric Analysis. Pearson Education India.

Hotelling, Harold. 1933. “Analysis of a Complex of Statistical Variables into Principal Components.” Journal of Educational Psychology 24 (6): 417.

Izenman, Alan Julian. 2008. “Modern Multivariate Statistical Techniques.” Regression, Classification and Manifold Learning.

James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. An Introduction to Statistical Learning. Vol. 6. Springer.

Kuhn, Max, and others. 2008. “Building Predictive Models in R Using the Caret Package.” Journal of Statistical Software 28 (5): 1–26.

Lantz, Brett. 2013. Machine Learning with R. Packt Publishing Ltd.

Leisch, Friedrich. 2002. “Sweave: Dynamic Generation of Statistical Reports Using Literate Data Analysis.” In Compstat, 575–80. Springer.

Leskovec, Jure, Anand Rajaraman, and Jeffrey David Ullman. 2014. Mining of Massive Datasets. Cambridge University Press.

McCullagh, Peter. 1984. “Generalized Linear Models.” European Journal of Operational Research 16 (3): 285–92.

Mohri, Mehryar, Afshin Rostamizadeh, and Ameet Talwalkar. 2012. Foundations of Machine Learning. MIT press.

Pearson, Karl. 1901. “LIII. On Lines and Planes of Closest Fit to Systems of Points in Space.” The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science 2 (11): 559–72.

Pinero, Jose, and Douglas Bates. 2000. “Mixed-Effects Models in S and S-Plus (Statistics and Computing).” Springer, New York.

Rabinowicz, Assaf, and Saharon Rosset. 2018. “Assessing Prediction Error at Interpolation and Extrapolation Points.” arXiv Preprint arXiv:1802.00996.

R Core Team. 2016. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.

Ripley, Brian D. 2007. Pattern Recognition and Neural Networks. Cambridge university press.

Robinson, George K. 1991. “That Blup Is a Good Thing: The Estimation of Random Effects.” Statistical Science, 15–32.

Rosset, Saharon, and Ryan J Tibshirani. 2018. “From Fixed-X to Random-X Regression: Bias-Variance Decompositions, Covariance Penalties, and Prediction Error Estimation.” Journal of the American Statistical Association, nos. just-accepted.

Sammut, Claude, and Geoffrey I Webb. 2011. Encyclopedia of Machine Learning. Springer Science & Business Media.

Sarkar, Deepayan. 2008. Lattice: Multivariate Data Visualization with R. New York: Springer. http://lmdvr.r-forge.r-project.org.

Searle, Shayle R, George Casella, and Charles E McCulloch. 2009. Variance Components. Vol. 391. John Wiley & Sons.

Shalev-Shwartz, Shai, and Shai Ben-David. 2014. Understanding Machine Learning: From Theory to Algorithms. Cambridge university press.

Shawe-Taylor, John, and Nello Cristianini. 2004. Kernel Methods for Pattern Analysis. Cambridge university press.

Small, Christopher G. 1990. “A Survey of Multidimensional Medians.” International Statistical Review/Revue Internationale de Statistique, 263–77.

Tukey, John W. 1977. Exploratory Data Analysis. Reading, Mass.

Vapnik, Vladimir. 2013. The Nature of Statistical Learning Theory. Springer science & business media.

Venables, William N, and Brian D Ripley. 2013. Modern Applied Statistics with S-Plus. Springer Science & Business Media.

Venables, William N, David M Smith, R Development Core Team, and others. 2004. “An Introduction to R.” Network Theory Limited.

Weiss, Robert E. 2005. Modeling Longitudinal Data. Springer Science & Business Media.

Wickham, Hadley. 2009. Ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. http://ggplot2.org.

———. 2014. Advanced R. CRC Press.

Wilcox, Rand R. 2011. Introduction to Robust Estimation and Hypothesis Testing. Academic Press.

Wilkinson, GN, and CE Rogers. 1973. “Symbolic Description of Factorial Models for Analysis of Variance.” Applied Statistics, 392–99.

Wilkinson, Leland. 2006. The Grammar of Graphics. Springer Science & Business Media.

Xie, Yihui. 2015. Dynamic Documents with R and Knitr. Vol. 29. CRC Press.

———. 2016. Bookdown: Authoring Books and Technical Documents with R Markdown. CRC Press.

S and S-Plus used to save objects on disk. Working from RAM has advantages and disadvantages. More on this in Chapter ??.↩︎
Taken from http://cran.r-project.org/web/packages/magrittr/vignettes/magrittr.html ↩︎
R uses a three valued logic where a missing value (NA) is neither TRUE, nor FALSE.↩︎
This is a classical functional programming paradigm. If you want an object oriented flavor of R programming, see Hadley’s Advanced R book.↩︎
More formally, this is called Lexical Scoping.↩︎
The “response” is also know as the “dependent” variable in the statistical literature, or the “labels” in the machine learning literature.↩︎
The “factors” are also known as the “independent variable”, or “the design”, in the statistical literature, and the “features”, or “attributes” in the machine learning literature.↩︎
The “error term” is also known as the “noise”, or the “common causes of variability”.↩︎
You may philosophize if the measurement error is a mere instance of unmodeled factors or not, but this has no real implication for our purposes.↩︎
By “computed” we mean what statisticians call “fitted”, or “estimated”, and computer scientists call “learned”.↩︎
Sometimes known as the Root Mean Squared Error (RMSE).↩︎
The example is taken from http://rtutorialseries.blogspot.co.il/2011/02/r-tutorial-series-two-way-anova-with.html ↩︎
Do not confuse generalized linear models with non-linear regression, or generalized least squares. These are different things, that we do not discuss.↩︎
Taken from http://www.theanalysisfactor.com/generalized-linear-models-in-r-part-6-poisson-regression-count-variables/↩︎
Think: why bother treating the Batch effect as noise? Should we now just subtract Batch effects? This is not a trick question.↩︎
It is even a subset of the Hilbert space, itself a subset of the space of all functions.↩︎
Example taken from https://lagunita.stanford.edu/c4x/HumanitiesScience/StatLearning/asset/ch6.html ↩︎
http://en.wikipedia.org/wiki/Principal_component_analysis ↩︎
You are probably used to thinking of the dimension of linear spaces. We will not rigorously define what is the dimension of a manifold, but you may think of it as the number of free coordinates needed to navigate along the manifold.↩︎
https://en.wikipedia.org/wiki/G_factor_(psychometrics)↩︎
Then again, it is possible that the true distances are the white matter fibers connecting going within the cortex, in which case, Euclidean distances are more appropriate than geodesic distances. We put that aside for now.↩︎
Recall, S was the original software from which R evolved.↩︎

References

Chang, Winston, Joe Cheng, JJ Allaire, Yihui Xie, and Jonathan McPherson. 2017. Shiny: Web Application Framework for R. https://CRAN.R-project.org/package=shiny.

Leisch, Friedrich. 2002. “Sweave: Dynamic Generation of Statistical Reports Using Literate Data Analysis.” In Compstat, 575–80. Springer.

Xie, Yihui. 2015. Dynamic Documents with R and Knitr. Vol. 29. CRC Press.

Xie, Yihui. 2016. Bookdown: Authoring Books and Technical Documents with R Markdown. CRC Press.