1.3 R basics
Note: if you are already a confident programmer you should be able to skim this section at speed!
As a prerequisite you should have completed the previous chapter on Getting Ready; in particular, you should have installed R and RStudio and run some simple R test examples.
Let’s start by covering some very basic programming ideas: defining variables to store information, the type of a variable, and manipulating and displaying variables. We will move on to interacting with data that is stored more permanently as files in the chapter on Rich Data. We will also examine the different ways that we store code in R and cover enough basic concepts to get started.
1.3.1 Variables, assignment and type
Variables are a key concept for any programming language.
A variable in R is a named container for data and this data can be set to some value, referenced or modified. The variable has an associated type, e.g., a number (stored by default as a floating-point value), a character string or a logical (TRUE or FALSE). Note that R has no separate type for a single character; it is simply a string containing one character. A variable can also store multiple data elements, e.g., a vector, list or data frame (see Chapter 2 on Rich Data).
R tries to guess the appropriate type depending on how we use the variable.
The easiest way to create a variable is by using the assignment operator, which is <- [17]. Below are three examples.
# In this R code we create three different variables
numericVariable <- 10
stringVariable <- "Hello world!"
logicVariable <- TRUE
So now we have three new variables. In each case, R chooses the data type from the assignment. R deduces we want numericVariable to be numeric, or as Computer Scientists would say, to have a numeric data type, since we are assigning the value 10. Knowing what type a variable has is important because it determines what sensible operations we can do to the variable. For example, adding 2 to a numeric variable is meaningful, but adding 2 to a string is not. This is not to say R won’t sometimes try, so be careful with data types!
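As a small sketch of why types matter, the following lines add 2 to a numeric, a logical and a character variable. The use of tryCatch() to capture the error message is my own addition so that the script keeps running; it is not essential to the point.

```r
# Adding 2 to a numeric variable works as expected
numericVariable <- 10
numericResult <- numericVariable + 2
print(numericResult)   # 12

# R will "try" with a logical: TRUE is treated as 1
logicResult <- TRUE + 2
print(logicResult)     # 3

# Adding 2 to a character string is an error.
# tryCatch() captures the error message instead of stopping the script.
stringVariable <- "Hello world!"
stringResult <- tryCatch(
  stringVariable + 2,
  error = function(e) conditionMessage(e)
)
print(stringResult)    # the text of the error message
```

The logical case is an example of R quietly converting a type for us, which is convenient but can hide mistakes.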
A data type is an attribute of a variable which tells the R interpreter how we intend to use the data. It defines the operations that can be done on the data, the meaning of the data, and the limits to values that can be stored, e.g., if the type is logical then the only values that can be stored are TRUE and FALSE. In R the data type is usually defined implicitly from how we first use a variable. However, as a technical aside, you can coerce a variable to a different type. If you look at the Environment Pane in RStudio it will tell you what type a variable has. Here is a simple example of coercion.
# Create a variable representing age. The type is inferred as
# a character string because we assign a string literal (because it's in quotes).
age <- "25"
# Check the class of the age variable
class(age)
## [1] "character"
# Coerce the age variable to a numeric type using as.numeric()
age <- as.numeric(age)
# Check the class of the age variable after coercion
class(age)
## [1] "numeric"
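Coercion functions such as as.numeric() and as.character() convert a value from one type to another, and a conversion can fail. The sketch below uses variable names of my own choosing (ageText, ageNumber, notANumber) to show both a successful and an unsuccessful conversion.

```r
# A number stored as text
ageText <- "25"

# Convert the text to a number so that arithmetic is possible
ageNumber <- as.numeric(ageText)
class(ageNumber)          # "numeric"
nextYear <- ageNumber + 1
print(nextYear)           # 26

# Convert the number back to text
backToText <- as.character(ageNumber)
class(backToText)         # "character"

# Text that does not look like a number cannot be converted:
# R returns NA (missing) and issues a warning, suppressed here
notANumber <- suppressWarnings(as.numeric("twenty-five"))
is.na(notANumber)         # TRUE
```

Checking for NA after a coercion is a simple way to spot values that could not be converted.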
Also note that although the above R code fragment assigned values to the three variables, it hasn’t displayed anything. You can see the current value of any variable in the Environment Pane (and this can be useful to help you debug your code if it isn’t doing what you expect). However, if we want our R code to display a value then one way to do this is just to type the name of the variable; alternatively, we can use the print() function.
## [1] 10
## [1] TRUE
## [1] "Hello world!"
NB there are many alternatives to print() but it’s a simple, generic means of producing output.
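To make the options concrete, here is a minimal sketch of three ways to display a value. The use of cat() is my own choice of an alternative to print(); the text does not say which alternatives the author has in mind.

```r
numericVariable <- 10

# 1. Typing the name on its own displays the value at the Console
numericVariable

# 2. print() does the same thing explicitly
print(numericVariable)

# 3. cat() is one alternative: it joins its arguments with spaces
#    and does not show the [1] index prefix
cat("The value is", numericVariable, "\n")

# capture.output() stores what print() would display, which is
# handy for checking the exact text
printedText <- capture.output(print(numericVariable))
```

Typing the bare name only displays a value at the top level of a script or at the Console, so print() is the safer habit inside functions and loops.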
If you aren’t sure about a variable type there’s a useful R function called class() that allows you to inspect a variable, e.g.,
class(logicVariable)
## [1] "logical"
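As a brief sketch, class() can be applied to each of the three variables created earlier. The is.numeric() and is.character() checks are an addition of mine; they answer a yes/no question about the type rather than naming it.

```r
numericVariable <- 10
stringVariable <- "Hello world!"
logicVariable <- TRUE

# class() reports the type of each variable
class(numericVariable)   # "numeric"
class(stringVariable)    # "character"
class(logicVariable)     # "logical"

# The is.*() family tests for one particular type
is.numeric(numericVariable)     # TRUE
is.character(numericVariable)   # FALSE
```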
You can choose your own variable names, which should be meaningful and unique. A valid name consists of letters, numbers and the dot or underscore characters. The name must start with a letter, or with a dot provided the dot isn’t followed by a number. Avoid names of built-in functions such as mean and summary as this will lead to confusion and potentially strange side effects.
- X1, y.1 and .y1 are valid names
- 1X, .1y and X-Y are invalid names
Variable naming: Remember that R is case sensitive so that var1 and VAR1 are not the same.
Don’t use a hyphen in a variable name, e.g., thisIs-BAD, because R will consider this to be a minus operator. However, RStudio has an auto-complete function that prompts you to choose an existing variable to minimise the chances of this kind of error.
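The naming rules can be checked directly in R. The sketch below is illustrative only: it uses the base function make.names(), which returns a name unchanged if it is already valid, as a way of testing the example names given above.

```r
# R is case sensitive: these are two different variables
var1 <- 1
VAR1 <- 2
var1 == VAR1          # FALSE

# A hyphen is read as the minus operator, so thisIs-BAD means
# "thisIs minus BAD" rather than a single variable name
thisIs <- 10
BAD <- 3
difference <- thisIs-BAD
print(difference)     # 7

# make.names() leaves valid names alone and alters invalid ones,
# so comparing before and after tells us which names are valid
candidates <- c("X1", "y.1", ".y1", "1X", ".1y", "X-Y")
isValid <- candidates == make.names(candidates)
print(isValid)        # TRUE TRUE TRUE FALSE FALSE FALSE
```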
1.3.1.1 Literals
Sometimes we just want to use the value but don’t need a variable. An example is the code statement we have previously encountered, stringVariable <- "Hello world!", where we use the string literal "Hello world!". This is because there’s no need to refer to this string by name. Another example is numericVariable <- 10, where 10 is a numeric literal (R stores a plain number such as 10 as a floating-point value; an integer literal is written with an L suffix, e.g., 10L).
When using literals be aware that if you use quotation marks then the data type is a character string, so that "10" is not the same thing as the number 10.
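The difference between the literals "10" and 10 can be inspected with class() and identical(). One caveat, shown in the sketch below, is that the == operator converts the number to text before comparing, so it is not a reliable way to tell the two apart.

```r
# The same digits, but different types
class("10")    # "character"
class(10)      # "numeric"
class(10L)     # "integer" (the L suffix requests an integer)

# identical() compares type as well as value
sameThing <- identical("10", 10)
print(sameThing)    # FALSE

# == coerces 10 to the string "10" before comparing, so this is TRUE
looseEqual <- "10" == 10
print(looseEqual)   # TRUE
```

Using identical(), or checking class() first, avoids being caught out by this automatic conversion.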
A simple variable assignment example
Suppose I’m trying to calculate how much money I’m going to make from this book. There are several different things I will need to store: how many copies I’ll sell as sales, the price per copy as price [18] and then the income generated as income. Let’s assume I sell one copy per student in my class, given by numberStudents, and finally let’s assume some proportion of you are so impressed that you recommend the book to your friends and family, so that recommend comprises the proportion. Let’s turn this into R [19].
# Compute sales income for Martin's book
price <- 10 # assume price is 10 UKP per book copy
numberStudents <- 200 # number of students in the class
recommend <- 0.5 # proportion of students who recommend the book
sales <- numberStudents * (1 + recommend)
income <- sales * price
outputMsg <- paste("Martin will earn: £", income, sep = "")
# format a readable string with the paste function
# sep = "" says no spaces between the string and income
print(outputMsg, quote = FALSE) # output the concatenated string without quotes
## [1] Martin will earn: £3000
R provides the standard arithmetic operators, including * for multiply. These are used to compute sales and income. Although we could just print() income, I have added a string to make the output of the code easier to interpret. Since print() outputs a single string, I use paste() to combine a character string literal, denoted by quotation marks, and the income.
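The effect of the sep argument is easiest to see side by side. In this sketch the income value is set directly to 3000 for brevity, and paste0() is mentioned as a base R shorthand that the text itself does not use.

```r
income <- 3000

# By default paste() puts a single space between its arguments
withSpace <- paste("Martin will earn:", income)
print(withSpace)    # "Martin will earn: 3000"

# sep = "" removes the separator, so the pound sign sits next to the number
noSpace <- paste("Martin will earn: £", income, sep = "")
print(noSpace)      # "Martin will earn: £3000"

# paste0() is shorthand for paste(..., sep = "")
shortHand <- paste0("Martin will earn: £", income)
```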
Other arithmetic operators include / for division, %% for modulus (the remainder after division), %/% for integer division and ^ or ** for exponentiation.
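A quick sketch of these operators, using 7 and 2 as arbitrary example values:

```r
quotient  <- 7 / 2     # ordinary division: 3.5
remainder <- 7 %% 2    # modulus, the remainder after division: 1
wholePart <- 7 %/% 2   # integer division: 3
power     <- 2 ^ 3     # exponentiation: 8
samePower <- 2 ** 3    # ** is an alternative spelling of ^: 8

print(c(quotient, remainder, wholePart, power, samePower))
```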
It’s very important to get into the habit of writing easy-to-follow R code. For this reason I have added comments and tried to select meaningful variable names.
1.3.2 Functions, packages and libraries
Functions are a very useful way to wrap a lot of functionality under a single name so that we don’t need to worry about how it works. Very often these are provided for us.
# An example of the use of a function - the median
# Create a vector v of 10 integers
v <- c(3,2,5,-1,11,4,3,5,2,4)
# Call the built-in median function
median(v)
## [1] 3.5
Although it wouldn’t be particularly difficult to implement a median function ourselves, it’s much easier to use one that has already been written, and we can have high confidence that it works correctly.
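For comparison, here is a rough sketch of computing the median by hand for the same vector. It is only one possible approach, with variable names of my own choosing, and it ignores complications such as missing values that the built-in function handles for us.

```r
v <- c(3, 2, 5, -1, 11, 4, 3, 5, 2, 4)

# Sort the values and count them
sorted <- sort(v)
n <- length(sorted)

if (n %% 2 == 0) {
  # Even number of values: average the two middle values
  manualMedian <- (sorted[n / 2] + sorted[n / 2 + 1]) / 2
} else {
  # Odd number of values: take the single middle value
  manualMedian <- sorted[(n + 1) / 2]
}

print(manualMedian)          # 3.5
manualMedian == median(v)    # TRUE
```

Even this short version needs a branch for odd and even lengths, which illustrates why a tested built-in function is preferable.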
So we have discussed functions in an informal sense. It’s now worth thinking a little more precisely. There are four different parts to a function:
- Function Name: this is how we call the function, e.g., head(). The function name is always followed by parentheses optionally containing arguments.
- Argument(s): an argument is a placeholder. There may be zero or more arguments. When a function is called, we pass values to match the arguments. These go in the parentheses, e.g., head(x) where x is the value given to the function when we call it, so it knows which variable we want to apply the function to. Sometimes arguments can have default values so no explicit value is needed, e.g., head(x) has a default to return the first 6 items of x, so if we want a different number we would need to give a second argument, e.g., head(x, 3) if we only want the first 3 items.
- Function Body: the function body contains a collection of statements that defines what the function does. For in-built functions these R statements are hidden from us since usually we don’t need to know how a function is actually implemented.
- Return Value: the return value of a function is the result it “gives back” after being called, e.g., if we call the function min using x <- min(z) the return value (which will be the smallest value in z) is placed in x.
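The parts of a function call described in the list above can be sketched with head() and min(). The vector z is just example data of my own.

```r
z <- c(3, 2, 5, -1, 11, 4, 3, 5, 2, 4)

# One argument: the default returns the first 6 items
firstSix <- head(z)
print(firstSix)      # 3 2 5 -1 11 4

# A second argument overrides the default
firstThree <- head(z, 3)
print(firstThree)    # 3 2 5

# Arguments can also be given by name, which is often clearer
alsoThree <- head(z, n = 3)

# The return value of min() is stored in x
x <- min(z)
print(x)             # -1
```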
As we have discussed, R has many in-built functions which can be directly called in the program without defining them first. However, we can also create and use our own functions, referred to as user-defined functions. We will cover these in Week 3.
A function is a self-contained set of R statements grouped together to perform a specific task that should be described by the function name. When a function is called (by its name along with zero or more arguments) it performs its task, returns control to the caller and passes back any result. For instance, we might call a function named doubleIt (the name double is best avoided because Base R already has a built-in function called double) that takes a single argument, the number to be doubled, and returns a result which is hopefully twice the argument, e.g., x <- doubleIt(5) will result in x containing 10.
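As a preview only, a doubling function could be sketched as follows. The name doubleIt is hypothetical and chosen to avoid the built-in double(), which already exists in Base R and creates a vector of zeros rather than doubling a number.

```r
# A user-defined function with one argument.
# The value of the last expression in the body is the return value.
doubleIt <- function(number) {
  number * 2
}

x <- doubleIt(5)
print(x)    # 10

# For contrast, the built-in double() does something quite different:
# it creates a numeric vector of the requested length, filled with zeros
zeros <- double(5)
print(zeros)    # 0 0 0 0 0
```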
If you are unsure how an unfamiliar function works or what arguments it takes, you can look up its documentation. For example, for the function paste(), type ?paste into the Console Pane and R will show the relevant documentation. Alternatively, you could type the function name directly into the search box above the Help Pane in RStudio.
Plain vanilla R, usually referred to as Base R, comes with a wide range of functions and facilities; however, there are many occasions when we wish to extend beyond this. For more specialist analysis or richer functionality, we frequently make use of packages. An example you will shortly encounter is the ggplot2 package which provides a very powerful graphics capability and is a widely used workhorse for data visualisation (Kabacoff 2019).
Additional functions are included in external packages. You can use install.packages() to add extra functions. To use the functions of a particular package, i.e., to load the package into your current R session, you use the library() function. A package need only be installed once; however, you need to load it with library() every time you initialise your local runtime environment (usually each time you launch RStudio). Then you can call its functions by name.
Usually, we download packages from CRAN. This happens automatically if you use the install.packages() route. This is safest as any package added to CRAN goes through a number of quality and security checks. Sometimes, for various reasons, we may download a package from GitHub [20], in which case use a little more caution.
It’s easy to forget to install a package. If you see something like Error in library(ggplot2) : there is no package called ‘ggplot2’ this suggests that you need to run install.packages("ggplot2") before running library(ggplot2). NB install.packages() requires the package name in quotation marks whilst library() does not. You also need to be accurate about upper and lower case.
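One way to avoid this error is to check whether a package is installed before trying to load it. The sketch below is only a suggestion: the helper name isInstalled is hypothetical, and it relies on the base function requireNamespace(), which returns TRUE or FALSE rather than stopping with an error.

```r
# Hypothetical helper: TRUE if the named package is installed
isInstalled <- function(packageName) {
  requireNamespace(packageName, quietly = TRUE)
}

packageName <- "ggplot2"

if (isInstalled(packageName)) {
  # character.only = TRUE tells library() that the name is held in a variable
  library(packageName, character.only = TRUE)
  message("Loaded package: ", packageName)
} else {
  message("Package '", packageName, "' is not installed. ",
          "Run install.packages(\"", packageName, "\") first.")
}
```

The sketch deliberately only reports a missing package rather than installing it, so that nothing is downloaded without you choosing to do so.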
For a very thorough coverage see Thomas Neitmann’s blog at https://thomasadventure.blog/posts/install-r-packages/.
There are presently in excess of 20,000 packages on CRAN, so it is well worth investigating whether there already exists a suitable set of functions for whatever it is that your application requires. This can save a lot of unnecessary coding!
References
17. There are actually three ways to assign a value to a variable in R. You can also use = or even a rightwards operator -> but <- is simplest, clearest and most widely used, so unless you are feeling wilfully perverse let’s just use <-!
18. I know the book is free but indulge me!
19. The Week 1 Seminar Worksheet contains an extended version of this R code.
20. To do this you need to install a package called remotes and then run remotes::install_github("author/package").