3.3 Literate programming and RMarkdown

The term “literate programming” was coined by Donald Knuth (1984), based on the idea that a computer program should be documented in such a way that it is readable by humans. This idea has subsequently gained a good deal of traction, not least because it is powerful and deceptively simple.

Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to human beings what we want a computer to do.
(Knuth 1984)

His ideas are encapsulated in three principles (although of course there is a lot more detail in his paper):

  1. move away from writing programs to ‘please’ the computer
  2. instead, focus on communication and understanding
  3. create a single document to integrate data analysis (executable code) with textual documentation, linking data, code, and explanation

See Wickham (2012) for some arguments as to why you should care about writing readable R code. A great vehicle for accomplishing this is RMarkdown and RMarkdown notebooks (see footnote 1), where the code and its explanation are embedded in a single document.
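To make the idea of a single integrated document concrete, here is a minimal sketch of what an RMarkdown source file might look like. The title, the chunk label triple-example, and the wording of the prose are made up for illustration; only the general layout (a YAML header, ordinary text, and an R code chunk) is the point.

````markdown
---
title: "Tripling a number"
output: html_document
---

This paragraph explains, in plain language, what the code below does:
it defines a function that multiplies its input by three.

```{r triple-example}
triple <- function(num) {
  return(num * 3)
}
triple(10)
```

When the document is knitted, the output of the chunk appears
directly beneath the code.
````

The explanation and the executable code live side by side, so a reader sees the reasoning and the result together.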

3.3.1 Structuring your code

Tips for writing readable code (Wickham 2012) include attention to:

  • names
  • comments
  • layout e.g., indentation and spacing
  • prettyprinting
  • user-defined functions
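As a small illustration of how names, comments, and layout affect readability, the following sketch computes the same quantity twice. The variable names and data values are invented for this example and do not come from the style guide.

```r
# Hard to read: cryptic names, no spacing, two statements on one line
x<-c(12.1,15.3,9.8);m<-sum(x)/length(x)

# Easier to read: descriptive names, consistent spacing, and a comment
# stating the intent. The values are made up for illustration.
reactionTimes <- c(12.1, 15.3, 9.8)

# Average reaction time across the three trials
meanReactionTime <- mean(reactionTimes)
```

Both versions give the same answer, but only the second tells the reader what the numbers mean.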

3.3.2 User-defined functions

One of the best ways to improve your reach as a data scientist is to write functions. Functions allow you to automate common tasks in a more powerful and general way than copy-and-pasting.
(Grolemund and Wickham 2018)

User-defined functions offer many advantages, including:

  • abstraction, i.e., the ability to hide details to help understandability, e.g., checkInput()
  • placing related code in one place
  • reuse, and avoiding lots of copy-and-paste type coding
  • maintainability, i.e., the need to make changes in only one place in the event of an error or an update in required functionality
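A rough sketch of abstraction and reuse is given below. The text names checkInput() but does not define it, so the body shown here is an assumption, and doubleValue() and halveValue() are hypothetical functions added purely for illustration.

```r
# Hypothetical helper: all input validation lives in one place
checkInput <- function(num) {
  if (!is.numeric(num)) {
    stop("num must be numeric")
  }
  return(invisible(TRUE))
}

# Two functions that reuse the same check instead of copying it
doubleValue <- function(num) {
  checkInput(num)
  return(num * 2)
}

halveValue <- function(num) {
  checkInput(num)
  return(num / 2)
}

doubleValue(4)
## [1] 8
halveValue(4)
## [1] 2
```

If the validation rule changes later, only checkInput() needs to be edited, which is the maintainability benefit described above.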

# An example of a user-defined function named myFunction

myFunction <- function(num) {
  num <- num * 3
}

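In R, unless return() is called explicitly, a function returns the value of the last expression it evaluates. Here the last expression is the assignment num <- num * 3, so its value (the tripled number) becomes the return value of the function. Because R returns the value of an assignment invisibly, calling myFunction(10) at the console prints nothing, although the result can still be stored or used in a larger expression.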
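The behaviour of the implicit return value can be checked with a short sketch. It repeats the definition of myFunction from above so that it runs on its own; the variable name result is introduced only for this illustration.

```r
# Same definition as in the text: the last expression is an assignment
myFunction <- function(num) {
  num <- num * 3
}

# The value is returned and can be stored as usual
result <- myFunction(10)
result
## [1] 30

# Called on its own, nothing is printed, because R returns the
# value of an assignment invisibly
myFunction(10)

# Wrapping the call in parentheses makes the value visible
(myFunction(10))
## [1] 30
```

This silent behaviour is easy to overlook, which is one more reason to prefer an explicit return().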

Note that there is a return() function that can be used to return a value from a function explicitly. If in doubt, we recommend using return(), since it makes the intentions of the function developer absolutely clear. While we are at it, triple is a far more descriptive name than myFunction.

# Rewriting myFunction as triple with an explicit return value
# This makes for clearer code and so is preferable

triple <- function(num) {
  num <- num * 3
  return(num)
}

When calling or using the above function, the user must specify the value of the argument num.

triple(10)
## [1] 30

If the caller does not supply an argument that has no default value, R throws an error when the function tries to use it. On occasion this can be burdensome, especially if there are many arguments with common or sensible default values. In such cases we can specify a default value (used if no argument is explicitly provided) in the function definition. So, for example, for triple() we could make zero the default if the function caller gives no value for num.

triple <- function(num = 0) {
  # This function triples the input argument, which defaults to zero
  num <- num * 3
  return(num)
}

# Call triple without an argument
triple()
## [1] 0
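The following sketch contrasts the two situations described above: a missing argument with no default, and a default that the caller may override. The name tripleNoDefault and the variable errorMessage are hypothetical and are used here only to keep both versions side by side.

```r
# Version without a default: omitting num is an error
tripleNoDefault <- function(num) {
  return(num * 3)
}

# Capture the error message rather than stopping the script
errorMessage <- tryCatch(
  tripleNoDefault(),
  error = function(e) conditionMessage(e)
)
errorMessage
## [1] "argument \"num\" is missing, with no default"

# Version with a default of zero
triple <- function(num = 0) {
  return(num * 3)
}

# The default is used only when the caller gives no value
triple()
## [1] 0

# A supplied value, by position or by name, overrides the default
triple(4)
## [1] 12
triple(num = 4)
## [1] 12
```

Naming the argument in the call, as in triple(num = 4), is optional here but tends to help readability once a function has several arguments.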

References

Grolemund, Garrett, and Hadley Wickham. 2018. “R for Data Science.”
Knuth, Donald Ervin. 1984. “Literate Programming.” The Computer Journal 27 (2): 97–111.
Wickham, Hadley. 2012. “Style Guide.” http://adv-r.had.co.nz/Style.html.

  1. Recently, a closely related technology, Quarto, has been introduced. It has the advantage of greater generality, since it supports many programming languages and not just R.