Chapter 4 R Coding Techniques

All the data visualisation techniques we will learn in this subject will be carried out within the R software environment. R is ever-increasing in popularity, and is used for statistical computing and data visualisation by millions of people.

We will be using RStudio rather than base R for all our R coding. RStudio is an integrated development environment (IDE) for R, and offers several helpful features and user-interface options missing from base R.

Much of the R coding you will learn in this subject will be covered in the core module. However, there are some techniques which we will use in the Data Science module that may not be covered at all in the core module. Remember, we are just beginning to work with R, and there is always more to learn.

4.1 Piping
(Computer Labs 3B+)

In the first couple of data science computer labs, you may have noticed that we occasionally use the symbols %>% in our code. These are not just some random symbols: %>% is known as the pipe operator. It can be thought of as being (somewhat) similar to the assignment operator <-, in that it passes something from one side of the operator to the other. Using the pipe operator in your code is called piping.

The pipe operator is part of the magrittr package (Bache et al. 2020), which might sound unfamiliar. Don’t worry: while we haven’t actively installed this package, it should have been installed as a dependency when we first installed the plotly package (Sievert 2020).

In brief, the pipe operator can be used to chain together a sequence of operations in R, in an intuitive manner which is typically easier to read than alternative methods (see e.g. Wickham and Grolemund (2017)). Piping can also be used to add further details to existing objects, without the need to define new objects.
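As a minimal sketch of what chaining means: the pipe takes the value on its left and passes it as the first argument to the function on its right, so x %>% round(1) carries out the same computation as round(x, 1). The vector x below is an illustrative assumption, not data from the labs.

```r
library(magrittr)

x <- c(1.234, 5.678, 9.1011)  # illustrative values (an assumption, not lab data)

# The value on the left of %>% becomes the first argument of the
# function on the right, so these two lines do the same thing
nested <- round(x, 1)
piped  <- x %>% round(1)

identical(nested, piped)  # compares the two results
```

The final line should confirm that both approaches give the same values.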

Let’s take a look at a simple example.

4.1.1

Suppose that we would like to carry out a sequence of operations, and we don’t know about the existence of the pipe operator.

Specifically, imagine that we have a set of 100 simulated observations in the object sim_data, and (for whatever reason) would like to

  1. sample 30 of these randomly
  2. take the square root of these sampled values
  3. compute the log of these resultant values
  4. round these resultant values to 2 decimal places of accuracy
  5. and finally plot a histogram of the results

Our code could look like this:

sim_data <- rnorm(100, mean = 5, sd = 2)
hist(round(log(sqrt(sample(sim_data, 30))), 2))

As you can see, if we try to carry out too many operations on our base object using nested code like this, it can get quite difficult to see what’s going on at a glance. It’s also easy to misread a section or misplace a bracket, which could produce an error or an incorrect result. If you are used to coding in R, the above example might not seem too bad, but if you haven’t had much coding experience, it can look a bit confusing.

Alternatively, if we were aware of the pipe operator, we could conduct this sequence of operations in the following manner.

library(magrittr)
sim_data <- rnorm(100, mean = 5, sd = 2) # simulate data

sim_data %>% sample(30) %>% # take a sample of 30
             sqrt() %>% # and then take the square root
             log() %>% # and then take the log
             round(2) %>% # and then round values to 2 decimal places
             hist() # and finally, plot results as a histogram

It’s a lot clearer what’s happening here. Note that we’ve also included some comments in the code, which summarise what we would typically be thinking as we read through it.

If you read the code out like an English sentence, it might help to read each pipe operator %>% as “and then we…”.
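One way to convince yourself that the nested and piped versions carry out the same steps is to compare their numeric output. The sketch below leaves out the histogram step so that the values can be compared directly, and rounds to 2 decimal places as in the list of steps. The seeds 123 and 456 are arbitrary assumptions, used only so that both versions draw the same random sample.

```r
library(magrittr)

set.seed(123)  # arbitrary seed (an assumption), so the data are reproducible
sim_data <- rnorm(100, mean = 5, sd = 2)

# Nested version (without the histogram step)
set.seed(456)  # reset the seed so both versions draw the same sample
nested_result <- round(log(sqrt(sample(sim_data, 30))), 2)

# Piped version of the same steps
set.seed(456)
piped_result <- sim_data %>%
  sample(30) %>%
  sqrt() %>%
  log() %>%
  round(2)

identical(nested_result, piped_result)  # compares the two results
```

Because the seed is reset before each version, both draw the same 30 observations, so any difference in the output would have to come from the code itself.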

N.B. Just remember, when you are using piping, you will need to include the pipe at the end of each line of code, if you intend to conduct further operations on the subsequent line.

E.g.

sim_data %>% sample(30) %>% 
             sqrt() 

will work, but

sim_data %>% sample(30) 
             %>% sqrt()

will not work as intended. R treats sim_data %>% sample(30) as a complete expression, since nothing on that line signals that more code follows, and a line which begins with %>% is a syntax error.

4.1.2

While piping is very useful, it is not a one-size-fits-all tool, and there are instances in which piping might not be the best tool to use.

When we use piping, we are not modifying the original object, but rather are carrying out operations on/with it. Therefore any changes we implement are not saved to the original object.
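A short sketch of this behaviour, using a small illustrative vector (the name values and its contents are assumptions for this example):

```r
library(magrittr)

values <- c(4, 9, 16)  # illustrative values (an assumption)

values %>% sqrt()      # prints the square roots, but nothing is saved
values                 # the original object still holds 4, 9 and 16

# To keep the result, assign the output of the pipeline to a name
root_values <- values %>% sqrt()
root_values
```

If we do want to keep the output of a sequence of pipes, we combine piping with the assignment operator <-, as in the last two lines.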

If you are carrying out a large number of sequential operations on an object, or if you intend to make significant changes to an object, then it is often a good idea to assign intermediate results to new objects, rather than using one long sequence of piping.

There is no ironclad rule as to the limit of pipes to use in a single chunk of code, but generally speaking, if you are using more than 10 pipes at once it might be worth splitting your code into more manageable portions.
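As a rough sketch of splitting a sequence of pipes into more manageable portions, the earlier simulated-data example could be broken up using intermediate objects. The object names sampled and transformed, and the seed, are assumptions made for illustration.

```r
library(magrittr)

set.seed(123)  # arbitrary seed (an assumption) for reproducibility
sim_data <- rnorm(100, mean = 5, sd = 2)

# First portion: draw the sample and save it, so it can be inspected or reused
sampled <- sim_data %>% sample(30)

# Second portion: transform the saved sample
transformed <- sampled %>%
  sqrt() %>%
  log() %>%
  round(2)

# Final step: plot the transformed values
transformed %>% hist()
```

Each intermediate object can be checked on its own (for example with summary(sampled)) before moving on to the next portion, which is harder to do in the middle of one long pipeline.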

References

Bache, Stefan Milton, Hadley Wickham, Lionel Henry, and RStudio. 2020. magrittr: A Forward-Pipe Operator for R. https://magrittr.tidyverse.org.
Sievert, Carson. 2020. Interactive Web-Based Data Visualization with R, Plotly, and Shiny. Chapman and Hall/CRC. https://plotly-r.com.
Wickham, Hadley, and Garrett Grolemund. 2017. R for Data Science. 1st ed. USA: O’Reilly Media.