Chapter 4 R Coding Techniques
(Computer Labs 3B+)
All the data visualisation techniques we will learn in this subject will be carried out within the R software environment. R is ever-increasing in popularity, and is used for statistical computing and data visualisation by millions of people.
We will be using RStudio rather than base R for all our R coding. RStudio is an integrated development environment (IDE) for R, and offers several helpful features and user-interface options missing from base R.
Much of the R coding you will learn in this subject will be covered in the core content. However, there are some techniques which may be specific to the Data Science stream content. Remember, we are just beginning to work with R, and there is always more to learn.
4.1 Piping
In the first couple of data science computer labs, you may have noticed that occasionally we use the set of symbols %>% in our code. These are not just some random symbols. %>% is actually known as the pipe operator, and it can be thought of as being (somewhat) similar to the assignment operator <-. The act of using the pipe operator in your code is called piping.
The pipe operator is part of the magrittr package (Bache et al. 2020), which might sound unfamiliar. Don’t worry, while we haven’t actively installed this package, it should have been installed as a dependency when we first installed the plotly (Sievert 2020) package.
In brief, the pipe operator can be used to chain together a sequence of operations in R, in an intuitive manner which is typically easier to read than alternative methods (see e.g. Wickham and Grolemund (2017)). Piping can be used to add additional details to existing objects, without the need to define new objects.
Let’s take a look at a simple example.
4.1.1
Suppose that we would like to carry out a sequence of operations, and we don’t know about the existence of the pipe operator.
Specifically, imagine that we have a set of 100 simulated observations in the object sim_data, and (for whatever reason) would like to:
- Sample 30 of these randomly
- Take the square root of these sampled values
- Compute the log of these resultant values
- Round these resultant values to 2 decimal places of accuracy
- Plot a histogram of the results
Our code could look like this:
sim_data <- rnorm(100, mean = 5, sd = 2)
hist(round(log(sqrt(sample(sim_data, 30))), 2))
As you can see, if we try to carry out too many operations on our base object, with all this nested code it can get quite difficult to see what’s going on at a glance - it’s also easy to misread a section, or forget a bracket, which could have serious consequences.
If you are used to coding in R, the above example might not seem too bad, but if you haven’t had much coding experience, it can look a bit confusing.
If we were aware of the pipe operator, we could alternatively conduct the same sequence of operations in the following manner:
library(magrittr)
sim_data <- rnorm(100, mean = 5, sd = 2) # simulate data
sim_data %>% sample(30) %>% # take a sample of 30
  sqrt() %>% # and then take the square root
  log() %>% # and then take the log
  round(2) %>% # and then round values to 2 decimal places
  hist() # and finally, plot results as a histogram
It’s a lot clearer what’s happening here. We have also included some comments in the code, which summarise what we would typically be thinking as we read through it.
Note: If you consider reading out the code like an English sentence, each time you see the pipe operator %>%, it might help to read this as, “and then we…”.
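To make the “and then” reading concrete, the short sketch below compares piped code with the equivalent nested calls. The vector x and the object names are made up purely for illustration; the point is that the value on the left of the pipe is passed as the first argument of the function on the right.

```r
library(magrittr)

# A small made-up vector, used only for this illustration
x <- c(1, 4, 9, 16)

# x %>% sqrt() is another way of writing sqrt(x)
piped_roots <- x %>% sqrt()
nested_roots <- sqrt(x)
identical(piped_roots, nested_roots) # TRUE

# x %>% round(2) is another way of writing round(x, 2):
# the left-hand side becomes the first argument
piped_chain <- x %>% sqrt() %>% log() %>% round(2)
nested_chain <- round(log(sqrt(x)), 2)
identical(piped_chain, nested_chain) # TRUE
```

Reading the piped version aloud gives: take x, and then take the square root, and then take the log, and then round to 2 decimal places.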
When using piping, if your code covers multiple lines, remember to end each line with the pipe whenever the sequence of operations continues on the next line.
For example, the following code will work as intended:
sim_data %>% sample(30) %>%
  sqrt()
However, the code shown below will not fully execute:
sim_data %>% sample(30)
  %>% sqrt()
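Here, R treats sim_data %>% sample(30) as a complete expression, since nothing at the end of that line signals that more code follows. The next line then begins with %>%, which has nothing on its left-hand side, so R reports a syntax error rather than taking the square root.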
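One way to check this behaviour without running the operations themselves is to ask R to parse each version of the code. The sketch below uses the base R functions parse() and try(); the object names ok and bad are made up for illustration.

```r
# Pipe at the END of the first line: R reads both lines as one expression
ok <- parse(text = "sim_data %>% sample(30) %>%\n  sqrt()")
length(ok) # 1

# Pipe at the START of the second line: R cannot parse the second line
bad <- try(
  parse(text = "sim_data %>% sample(30)\n  %>% sqrt()"),
  silent = TRUE
)
inherits(bad, "try-error") # TRUE, because parsing failed
```

The first version parses as a single expression, while the second fails before any of the code is run.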
4.1.2
While piping is very useful, it is not a one-size-fits-all tool, and there are instances in which piping might not be the best tool to use.
When we use piping, we are not modifying the original object, but rather are carrying out operations on/with it. Therefore any changes we implement are not saved to the original object.
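As a minimal sketch of this point, the code below pipes sim_data through round() and then checks that sim_data itself is unchanged. The seed and the object names original and rounded_data are assumptions made for this illustration only. If we want to keep the result, we need to assign it to an object.

```r
library(magrittr)

set.seed(42) # arbitrary seed, used only so the example is reproducible
sim_data <- rnorm(100, mean = 5, sd = 2)
original <- sim_data # keep a copy to compare against

# The rounded values are printed, but not stored anywhere
sim_data %>% round(2) %>% head()

# sim_data still holds the unrounded values
identical(sim_data, original) # TRUE

# To keep the result, assign the output of the pipe to an object
rounded_data <- sim_data %>% round(2)
```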
If you are carrying out a large number of sequential operations to an object, or if you intend to make significant changes to an object, then it is often a good idea to assign intermediate results to new objects, rather than using one long sequence of piping.
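For instance, the earlier sequence of operations could be split into named intermediate steps, as in the sketch below. The seed and the object names sampled, transformed and rounded are hypothetical choices for this illustration; where to split a sequence is a matter of judgement.

```r
library(magrittr)

set.seed(123) # arbitrary seed, used only so the example is reproducible
sim_data <- rnorm(100, mean = 5, sd = 2)

# Step 1: take the sample and store it, so it can be inspected or reused
sampled <- sim_data %>% sample(30)

# Step 2: transform the stored sample
transformed <- sampled %>% sqrt() %>% log()

# Step 3: round the transformed values and summarise them
rounded <- transformed %>% round(2)
summary(rounded)
```

Each intermediate object can be checked on its own (for example with head(sampled)), which is harder to do in the middle of one long sequence of pipes.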
There is no ironclad rule as to the limit of pipes to use in a single chunk of code, but generally speaking, if you are using more than 10 pipes at once it might be worth splitting your code into more manageable portions.