1.3 R basics

As a prerequisite, you should have completed the previous chapter on Getting Ready; in particular, you should have installed R and RStudio and run some simple R test examples.

Let’s start by covering some very basic programming ideas: defining variables to store information, the type associated with each variable, and manipulating and displaying variables. We will move on to interacting with data that is stored more permanently as files in the chapter on Rich Data. We will also examine the different ways that we store code in R, covering enough basic concepts to get started.

1.3.1 Variables, assignment and type

Variables are a key concept for any programming language.

A variable in R is a named container for data, and this data can be set, modified or referenced. The variable has an associated type, e.g., a number (R calls floating-point numbers numeric and whole numbers integer), a character string (a single character is simply a string of length one), or a logical (Boolean) value (TRUE or FALSE). A variable can also store multiple data elements of the same type as a vector. We can build more complex collections of variables and give them names as lists or data frames (see Chapter 2 on Rich Data).

R tries to guess the appropriate type depending on how we use the variable.

The easiest way to create a variable is by using the assignment operator, which is <- [1]. Below are three examples.

# In this R code we create three different variables
numericVariable <- 10
stringVariable <- "Hello world!"
logicVariable <- TRUE

So now we have three new variables. In each case, R chooses the data type from the assignment. R deduces we want numericVariable to be numeric, or as Computer Scientists would say, to have a numeric data type, since we are assigning the value 10. Knowing what type a variable has is important because it determines which operations we can sensibly apply to it. For example, adding 2 to a numeric variable is meaningful, but adding 2 to a string is not. This is not to say R won’t sometimes try, so be careful with data types!

A data type is an attribute of a variable which tells the R interpreter how we intend to use the data. It defines the operations that can be done on the data, the meaning of the data, and the limits to the values that can be stored, e.g., if the type is logical then the only values that can be stored are TRUE and FALSE. In R the data type is usually defined implicitly from how we first use a variable. However, as a technical aside, you can coerce a variable to a different type. If you look at the Environment Pane in RStudio it will tell you what type a variable has.
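
As a rough illustration of coercion, the sketch below uses the base R functions as.character() and as.numeric(). The variable names asText, backToNumber and notANumber are made up for this example.

```r
# Illustrative only: asText, backToNumber and notANumber are assumed names
numericVariable <- 10
class(numericVariable)                   # "numeric"

# Coerce the number to a character string
asText <- as.character(numericVariable)
class(asText)                            # "character"

# Coerce a string that looks like a number to numeric, then do arithmetic
backToNumber <- as.numeric("10") + 2
backToNumber                             # 12

# A string that does not look like a number becomes NA (R also warns about this)
notANumber <- suppressWarnings(as.numeric("ten"))
is.na(notANumber)                        # TRUE
```

Coercion changes how R treats the value, so it is worth checking the result with class() afterwards.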

Also note that although the above R code fragment assigned values to the three variables, it hasn’t displayed anything. You can see the current value of any variable in the Environment Pane (and this can be useful to help you debug your code if it isn’t doing what you expect). However, if we want our R code to display a value, one way to do so is simply to type the name of the variable; alternatively, we can use the print() function.

# To programmatically display the contents of your variables
numericVariable
## [1] 10
logicVariable
## [1] TRUE
# Alternatively, you can display using the print() function
print(stringVariable)
## [1] "Hello world!"

NB there are many alternatives to print() but it’s a simple, generic means of producing output.

If you aren’t sure about a variable’s type, there’s a useful R function called class() that allows you to inspect a variable, e.g.,

class(logicVariable)
## [1] "logical"
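
To see how class() responds to other kinds of value, here is a small sketch. The variable names wholeNumber and someNumbers are assumptions introduced only for this illustration.

```r
# Illustrative only: wholeNumber and someNumbers are assumed names
class(10)                     # "numeric"
class("Hello world!")         # "character"
class(TRUE)                   # "logical"

# The L suffix asks R to store a whole number as an integer
wholeNumber <- 10L
class(wholeNumber)            # "integer"

# A vector of numbers still has the class of its elements
someNumbers <- c(1.5, 2, 3)
class(someNumbers)            # "numeric"
```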

You are free to choose your own variable names, which should be meaningful and unique. A valid name consists of letters, numbers and the dot or underscore characters. It must start with a letter, or with a dot that is not followed by a number. Avoid the names of built-in functions such as mean and summary.

X1, y.1 and .y1 are valid names
1X, .1y and X-Y are invalid names
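
One possible way to check names programmatically is the base R function make.names(), which converts a string into a syntactically valid name. A string that comes back unchanged was already valid. The helper isValidName below is a hypothetical name used only for this sketch.

```r
# Hypothetical helper: a name is valid if make.names() leaves it unchanged
isValidName <- function(candidate) {
  make.names(candidate) == candidate
}

candidates <- c("X1", "y.1", ".y1", "1X", ".1y", "X-Y")
validity <- isValidName(candidates)
names(validity) <- candidates   # label each result with the name that was tested
print(validity)                 # TRUE for the first three, FALSE for the last three
```

Note that this check also rejects R's reserved words such as if, since make.names() alters those too.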

Variable naming: Remember that R is case sensitive so that var1 and VAR1 are not the same.

Don’t use a hyphen in a variable name, e.g., thisIs-BAD, because R will interpret the hyphen as the minus operator.

RStudio has an auto-complete feature that prompts you to choose an existing variable, which minimises the chance of mistyping a name.

A simple variable assignment example

Suppose I’m trying to calculate how much money I’m going to make from this book. There are several different things I will need to store: how many copies I’ll sell (sales), the price per copy (price) [2] and the income generated (income). Let’s assume I sell one copy per student in my class (numberStudents), and finally let’s assume some proportion of you are so impressed that you recommend the book to your friends and family, so that recommend holds this proportion. Let’s turn this into R. [3]

# Compute sales income for Martin's book

price <- 10                                     # assume price is 10 UKP per book copy
numberStudents <- 100                           # number of students in the class
recommend <- 0.5                                # proportion of students who recommend the book

sales <- numberStudents * (1 + recommend)
income <- sales * price
outputMsg <- paste("Martin will earn:", income) # format a readable string with the paste function
print(outputMsg, quote = FALSE)                 # output the concatenated string without quotes
## [1] Martin will earn: 1500

R provides the standard arithmetic operators, including * for multiply. These are used to compute sales and income. Although we could just print() income, I have added a string to make the output of the code easier to interpret. Since print() here outputs a single string, I use paste() to combine a character string literal (denoted by quotation marks) with the value of income.
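
The following sketch shows a few variations on paste(). The sep argument and the related paste0() function are part of base R; the variable names msgDefault, msgNoSpace and msgUnits are assumptions for this illustration.

```r
income <- 1500   # same value as the income computed earlier

# By default paste() separates its arguments with a single space
msgDefault <- paste("Martin will earn:", income)
msgDefault                                        # "Martin will earn: 1500"

# The sep argument changes the separator; here we use none at all
msgNoSpace <- paste("Martin will earn:", income, sep = "")
msgNoSpace                                        # "Martin will earn:1500"

# paste0() is a shorthand that never inserts a separator
msgUnits <- paste0("Martin will earn: ", income, " UKP")
msgUnits                                          # "Martin will earn: 1500 UKP"
```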

Other arithmetic operators include + for addition, - for subtraction, / for division, %% for modulus (the remainder after division), %/% for integer division and ^ or ** for exponentiation.
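
A short sketch of these operators in action, using two made-up values a and b (the variable names are assumptions for this example):

```r
a <- 17
b <- 5

quotient   <- a / b     # ordinary division: 3.4
remainder  <- a %% b    # modulus, the remainder after division: 2
wholePart  <- a %/% b   # integer division, discarding the remainder: 3
squared    <- a ^ 2     # exponentiation: 289
squaredAlt <- a ** 2    # ** is an alternative spelling of ^: 289

# Integer division and modulus fit together to recover the original value
wholePart * b + remainder   # 17
```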

It’s very important to get into the habit of writing easy-to-follow R code. For this reason I have added comments and tried to select meaningful variable names.

1.3.2 Functions, packages and libraries

We have discussed functions in an informal sense. It’s now worth thinking a little more precisely. There are four different parts to a function:

  • Function Name: this is how we call the function e.g., head().
  • Argument(s): an argument is a placeholder. There may be zero or more arguments. When a function is called, we pass values to match the arguments. These go in the parentheses, e.g., head(x), where x is the value we give the function when we call it, so it knows which variable we want to apply the function to. Sometimes arguments have default values so no explicit value is needed, e.g., head(x) returns the first 6 items of x by default, so if we want a different number we need to give a second argument, such as head(x, 3) if we only want the first 3 items.
  • Function Body: the function body contains a collection of statements that defines what the function does. For in-built functions these R statements are hidden from us since usually we don’t need to know how a function is actually implemented.
  • Return Value: the return value of a function is the result that it “gives back” after being called, e.g., if we call the function min using x <- min(z), the return value (which will be the smallest value in z) is placed in x.
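
The parts described above can be seen in a brief sketch using the built-in functions head() and min(). The vector z holds made-up values chosen only for illustration.

```r
# Illustrative data: the values in z are assumptions for this sketch
z <- c(12, 7, 3, 9, 15, 4, 8, 1)

# One argument: the second argument takes its default value of 6
firstSix <- head(z)
firstSix                    # 12 7 3 9 15 4

# Two arguments: we override the default and ask for 3 items
firstThree <- head(z, 3)
firstThree                  # 12 7 3

# Arguments can also be passed by name
head(z, n = 3)              # 12 7 3

# The return value of min() is stored in x
x <- min(z)
x                           # 1
```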

As we have discussed, R has many in-built functions which can be called directly in a program without defining them first. However, we can also create and use our own functions, referred to as user-defined functions. We will cover these in Week 3.

A function is a self-contained set of R statements grouped together to perform a specific task that should be described by the function name. When a function is called (by its name along with zero or more arguments) it performs its task, returns control to the caller and passes back any result. For instance, we might call a hypothetical function named doubleIt that takes a single argument, the number to be doubled, and returns a result which is hopefully twice the argument, e.g., x <- doubleIt(5) will result in x containing 10. (We avoid the name double because base R already has a function called double(), which does something different.)
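
As a preview, here is a minimal sketch of how such a doubling function might be written. The name doubleIt and its argument name number are hypothetical, chosen so as not to mask the base R function double(), which creates a numeric vector rather than doubling a value.

```r
# Hypothetical user-defined function: doubleIt is an assumed name
doubleIt <- function(number) {
  result <- number * 2   # function body: the statements that do the work
  return(result)         # return value: passed back to the caller
}

x <- doubleIt(5)         # call the function by name with one argument
x                        # 10
```

Because arithmetic in R works element by element, the same function also doubles every element of a numeric vector, e.g., doubleIt(c(1, 2, 3)).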

If you are unsure how an unfamiliar function works or what arguments it takes, you can look it up. For example, for paste(), enter ?paste into the Console Pane and R will show the relevant documentation. Alternatively, you could type the function name directly into the search box above the Help Pane in RStudio.

Plain vanilla R, usually referred to as Base R, comes with a wide range of functions and facilities. However, for more specialist analysis or richer functionality we frequently make use of packages. An example you will shortly encounter is the ggplot2 package, which provides a very powerful graphics capability and is a widely used workhorse for data visualisation (Kabacoff 2019).

Additional functions are included in external packages. You can use install.packages() to add a package to your local library of installed packages. To use the functions of a particular package, you load it with the library() function. The package need only be installed once; however, you need to load it every time you initialise your local runtime environment (usually each time you launch RStudio). Then you can call its functions by name.

Usually, we download packages from CRAN. This happens automatically if you use the install.packages() route. This is safest as any package added to CRAN goes through a number of quality and security checks. Sometimes, for various reasons, we may download a package from GitHub [4], in which case use a little more caution.

It’s easy to forget to install a package. If you see something like Error in library(ggplot2) : there is no package called ‘ggplot2’ this suggests that you need to run install.packages("ggplot2") before running library(ggplot2). NB install.packages() requires the package name in quotation marks whilst library() does not. You also need to be accurate about upper and lower case.
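
One way to guard against this error is to check whether a package is installed before trying to load it. The sketch below assumes the package of interest is ggplot2 and uses the base R function requireNamespace(); the variable names packageName and status are made up for this example. It only reports what to do and does not install anything itself.

```r
# Assumption for this sketch: the package we want is ggplot2
packageName <- "ggplot2"

# requireNamespace() returns TRUE if the package is installed, FALSE otherwise
if (requireNamespace(packageName, quietly = TRUE)) {
  # character.only = TRUE tells library() that packageName holds the name as a string
  library(packageName, character.only = TRUE)
  status <- paste(packageName, "is installed and now loaded")
} else {
  status <- paste0(packageName, " is not installed: run install.packages(\"",
                   packageName, "\") first")
}

print(status, quote = FALSE)
```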

There are presently in excess of 12,000 packages on CRAN, so it is well worth investigating whether there already exists a suitable set of functions for whatever it is that your application requires. This can save a lot of unnecessary coding!

References

Kabacoff, Robert. 2019. “Data Visualization with R.” https://rkabacoff.github.io/datavis/.


  1. There are actually three ways to assign a value to a variable in R. You can also use = or even a rightwards operator -> but <- is simplest, clearest and most widely used.

  2. I know the book is free but indulge me!

  3. The Week 1 Seminar Worksheet contains an extended version of this R code.

  4. To do this you need to install a package called devtools and then run devtools::install_github("author/package").