Chapter 4 R Coding Techniques
All the data visualisation techniques we will learn in this subject will be carried out within the R software environment. R is ever-increasing in popularity, and is widely used for statistical computing and data visualisation.
Most of the R coding skills you will learn in this subject will be covered in the core module. However, there are some techniques which we will use in the Data Science module that either won’t be covered just yet, or perhaps won’t be covered at all in the core module. Remember, we are just beginning to work with R, and there is always more to learn.
4.1 Piping
(Computer Labs 3B+)
In the first couple of data science computer labs, you may have noticed that we occasionally use the set of symbols %>% in our code. These are not just some random symbols. %>% is known as the pipe operator, and it can be thought of as being (somewhat) similar to the assignment operator <-: where <- stores a value under a name, %>% passes the value on its left to the function on its right, as that function's first argument. The act of using the pipe operator in your code is called piping.
The pipe operator is part of the magrittr package (Bache et al. 2020), which might sound unfamiliar. Don’t worry: while we haven’t actively installed this package, it should have been installed as a dependency when we first installed the plotly package (Sievert 2020).
In brief, the pipe operator can be used to chain together a sequence of operations in R, in an intuitive manner which is typically easier to read than alternative methods (see e.g. H. Wickham and Grolemund 2017). Piping can be used to add further details to existing objects, without the need to define new objects.
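As a minimal sketch of what the pipe does, the code below compares a piped call with the equivalent ordinary function call. The object names (values, piped, nested, rounded) and the numbers are made up for illustration, and the magrittr package is assumed to be installed.

```r
library(magrittr)

values <- c(1, 4, 9)  # hypothetical example values

# Piping passes the left-hand side in as the first argument...
piped <- values %>% sqrt()
# ...so it gives the same result as the ordinary function call.
nested <- sqrt(values)
identical(piped, nested)

# Any further arguments go inside the brackets as usual:
# this is the same as round(3.14159, 2).
rounded <- 3.14159 %>% round(2)
rounded
```

In other words, x %>% f(y) is simply another way of writing f(x, y).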
Let’s take a look at a simple example.
4.1.1
Suppose that we would like to carry out a sequence of operations, and we don’t know about the existence of the pipe operator.
Specifically, imagine that we have a set of 100 simulated observations in the object sim_data, and (for whatever reason) would like to:
- sample 30 of these randomly
- take the square root of these sampled values
- compute the log of these resultant values
- round these resultant values to 2 decimal places of accuracy
- and finally plot a histogram of the results
Our code could look like this:
sim_data <- rnorm(100, mean = 5, sd = 2)
hist(round(log(sqrt(sample(sim_data, 30))), 2))
As you can see, if we try to carry out too many operations on our base object, with all this nested code it can get quite difficult to see what’s going on at a glance - it’s also easy to misread a section, or forget a bracket, which could have serious consequences. If you are used to coding in R, the above example might not seem too bad, but if you haven’t had much coding experience, it can look a bit confusing.
Alternatively, if we were aware of the pipe operator, we could conduct this sequence of operations in the following manner.
library(magrittr)
sim_data <- rnorm(100, mean = 5, sd = 2) # simulate data
sim_data %>% sample(30) %>% # take a sample of 30
  sqrt() %>% # and then take the square root
  log() %>% # and then take the log
  round(2) %>% # and then round values to 2 decimal places
  hist() # and finally, plot results as a histogram
It’s a lot clearer what’s happening here - note that we’ve included some comments in the code here too, which summarise what we would typically be thinking as we read through this code.
If you consider reading out the code like an English sentence, each time you see the pipe operator %>%, it might help to read this as, “and then we…”.
N.B. Just remember, when you are using piping, you will need to include the pipe at the end of each line of code, if you intend to conduct further operations on the subsequent line.
E.g.
sim_data %>% sample(30) %>%
  sqrt()
will work, but
sim_data %>% sample(30)
  %>% sqrt()
will not fully execute: R treats the first line as a complete expression once sample(30) is read, since there’s no more code on that line, and the second line, which now begins with %>%, produces a syntax error.
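If you would like to check this for yourself without running the pipeline, the rough sketch below uses base R's parse() function to test whether each version is valid R syntax. The names good_code, bad_code, good_parses and bad_parses are made up for this illustration.

```r
# Pipe at the end of the line: R knows that more code is coming.
good_code <- "sim_data %>% sample(30) %>%\n  sqrt()"
# Pipe at the start of the next line: the first line is already complete.
bad_code <- "sim_data %>% sample(30)\n  %>% sqrt()"

# parse() only checks the syntax; it does not run the code.
good_parses <- !inherits(try(parse(text = good_code), silent = TRUE), "try-error")
bad_parses <- !inherits(try(parse(text = bad_code), silent = TRUE), "try-error")

good_parses  # TRUE: the first version is valid R code
bad_parses   # FALSE: the second version is a syntax error
```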
4.1.2
While piping is very useful, it is not a one-size-fits-all tool, and there are instances in which piping might not be the best tool to use.
When we use piping, we are not modifying the original object, but rather are carrying out operations on/with it. Therefore any changes we implement are not saved to the original object.
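The short sketch below illustrates this point. The object names (values, roots) and the numbers are hypothetical, and the magrittr package is assumed to be installed.

```r
library(magrittr)

values <- c(1, 4, 9)  # hypothetical example values

# Piping computes a result, but leaves `values` untouched.
values %>% sqrt()
values

# To keep the result, assign the whole pipeline to a name.
roots <- values %>% sqrt()
roots
```

If we want to keep the output of a pipeline, we therefore need to assign it to an object ourselves.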
If you are carrying out a large number of sequential operations to an object, or if you intend to make significant changes to an object, then it is often a good idea to assign intermediate results to new objects, rather than using one long sequence of piping.
There is no ironclad rule as to the limit of pipes to use in a single chunk of code, but generally speaking, if you are using more than 10 pipes at once it might be worth splitting your code into more manageable portions.
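As one possible way of splitting up a chain, the sketch below breaks the earlier histogram example into three smaller steps. The intermediate names (sampled, transformed, rounded) and the seed are assumptions made for this illustration only, and other ways of dividing the steps would work equally well.

```r
library(magrittr)

set.seed(1)  # assumed seed, only so that the same sample is drawn each time
sim_data <- rnorm(100, mean = 5, sd = 2)

# Step 1: draw the sample and keep it, so it can be inspected or reused.
sampled <- sim_data %>% sample(30)

# Step 2: transform the sampled values.
transformed <- sampled %>%
  sqrt() %>%
  log()

# Step 3: round the values and plot them.
rounded <- transformed %>% round(2)
hist(rounded)
```

Each intermediate object can now be checked on its own (for example with summary(sampled)) before moving on to the next step.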
4.2 Writing R functions
(Computer Lab 4B)
By now, you will have gained some experience using a variety of functions in R, ranging from inbuilt ‘base R’ functions, to package-specific functions, such as those contained in the plotly package.
In Computer Lab 4B, it will be helpful to have some understanding of the composition of R functions. Later on in the semester, we will discuss R functions in more detail.
4.2.1 R Function Composition
Let’s take a look at the basic composition of an R function, using the example function below:
my_simple_function <- function(argument1,
                               argument2){
  output <- argument1 + argument2
}
The code above consists of four main parts:
- The function has a name, my_simple_function.
- We use the function keyword to (unsurprisingly) create a function.
- Inside the round brackets () following function, we specify argument1 and argument2 to be the arguments, i.e. inputs, of our function.
- To begin the ‘main body’ of our function, we use a left curly brace, {, directly after the closing round bracket ). In the ‘main body’ of the function, we specify the calculations the function should perform. For my_simple_function, these calculations are simply adding the argument1 and argument2 values provided. This sum is stored in the object output, and when the function is computed, the output value is provided as output. To wrap up our function, we conclude with a right curly brace, }.
You might be wondering what argument1 and argument2 are, and from where they came. In fact, these are arbitrary names, and don’t correspond to any specific numbers. We can provide whatever numbers or vectors of numbers we would like to our function, and within the my_simple_function environment these values will be used in place of argument1 and argument2. For example, suppose we would like to add 39 and 3. We can do this using our function as follows:
res <- my_simple_function(39, 3)
res
## [1] 42
Copy the code in the two code chunks above, paste it into an R script, and run the code. You should obtain the result 42 shown above.
Of course, we didn’t really need to use a function for this - we could have simply computed 39 + 3 in R. In general, functions will be a little more complicated than my_simple_function.
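The function is not limited to single numbers. As a small sketch, the code below repeats the definition of my_simple_function and supplies two vectors instead; the name vec_res and the input values are made up for this illustration.

```r
my_simple_function <- function(argument1, argument2) {
  output <- argument1 + argument2
}

# Vectors of the same length are added element by element.
vec_res <- my_simple_function(c(1, 2, 3), c(10, 20, 30))
vec_res
## [1] 11 22 33
```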
4.2.2 Important Notes
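Strictly speaking, we did not need to include the output <- code within my_simple_function. R will output the last evaluated expression in the function, so we could simply have used argument1 + argument2 in the main body of the function. If, however, we wanted to output more than one result, then specifying our desired output can be informative and help avoid errors. Note that because the last expression in my_simple_function is an assignment, its value is returned invisibly: running my_simple_function(39, 3) on its own prints nothing, which is why we stored the result in res and then printed res.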
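To make these options concrete, here is a rough sketch of three variants. The function names (add_implicit, add_explicit, add_and_multiply) are hypothetical, and returning a named list is just one common way of outputting more than one result.

```r
# Hypothetical variant 1: rely on the last evaluated expression.
add_implicit <- function(argument1, argument2) {
  argument1 + argument2
}

# Hypothetical variant 2: name the result and return it explicitly.
add_explicit <- function(argument1, argument2) {
  output <- argument1 + argument2
  return(output)
}

# Hypothetical variant 3: return more than one result in a named list.
add_and_multiply <- function(argument1, argument2) {
  list(sum = argument1 + argument2,
       product = argument1 * argument2)
}

add_implicit(39, 3)
add_explicit(39, 3)
both <- add_and_multiply(39, 3)
both$sum
both$product
```

The first two variants give the same answer; the explicit version is slightly longer, but makes it obvious at a glance what the function hands back.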
This additional output <- code was primarily included here to demonstrate a key feature of R functions: any object created within a function is part of that function’s local environment, meaning that it is stored only inside that function, and can’t be called from the global environment outside the function.
To clarify the distinction, take a look at the code below:
test <- 39 + 3
test
## [1] 42
Here, we assign the sum 39 + 3 to the object test. This is not done within a function, so the object test is stored in the global environment. This means we can then call the test object at a later date, and see our result of 42.
Similarly, we could compute this result using my_simple_function, as shown above in 4.2.1.
However, try to now call the object output (which is defined within this function). You will receive an error, something along the lines of Error in eval(expr, envir, enclos) : object 'output' not found.
This is something worth keeping in mind. If you intend to use an object outside of a function, it might be worth defining it separately. It is also a good idea to use different names for all your objects (inside and outside functions) to avoid confusion.
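If you would rather check this without triggering an error, the sketch below uses base R's exists() function. It assumes you are working in a fresh R session, with no object called output already defined; the names res_exists and output_exists are made up for this illustration.

```r
my_simple_function <- function(argument1, argument2) {
  output <- argument1 + argument2
}

res <- my_simple_function(39, 3)

# `res` was created outside the function, so it can be found here.
res_exists <- exists("res", inherits = FALSE)
# `output` was only ever created inside the function, so it cannot.
output_exists <- exists("output", inherits = FALSE)

res_exists
output_exists
```

Here res_exists should be TRUE and output_exists should be FALSE, since output was discarded as soon as the function finished running.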
4.2.3 Practice
Using the code provided in 4.2.1 above, try to create your own function, with the following specifications:
- Your function should take three numeric inputs.
- It should multiply the first input by the second input, and then subtract the third input from this result.
- Finally, it should square the resultant value.
See how you go!