Session 3 Data Structures

3.1 Assigning values to a variable

We create variables in R with the assignment arrow, <-, which is a less-than sign followed by a hyphen with no space between them. It is also possible to use = to assign a value in the same way, but for ease of interpretation we use = only inside brackets, where it gives a value to a named argument of a function. A value supplied this way applies only within that function call and does not create a variable in the global environment. Note that the name of the object you assign to cannot contain spaces and should begin with a letter (names cannot begin with a number or _). A few other characters are permitted in names, e.g. _ and ., but most characters have special functions and will cause an error.

a <- 4
3 * a

## [1] 12

b <- a^5
c <- b + a

# why does this not work?
2 * a <- 1
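As a small illustration of the remark about using = inside brackets, the sketch below passes digits = 2 to the built-in round function. The object name rounded is chosen here for illustration only.

```r
# digits = 2 names an argument of round(); it does not create
# a variable called digits in the workspace.
rounded <- round(3.14159, digits = 2)
rounded

# FALSE here, unless you have created an object called digits yourself
exists("digits", inherits = FALSE)
```

The value 2 is used only while round is running, which is why no object called digits appears afterwards.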

There are different ways to assign values to a variable (or placeholder). The most common is the arrow <-, which assigns a value to the variable it is pointing at. The “assign” function can be useful when you want to save several results from a loop, because the variable name can be built with “paste0” so that each pass of the loop saves a new variable. Some examples:

x <- 2  # assigns a numeric value to a variable
y = 15  # same effect, but not the notation we usually follow.
assign('d', 7) # same effect
colour <- 'green' # assigns a character value to a variable
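The loop use of assign mentioned above could look something like the following minimal sketch. The prefix "result_" and the squared values are assumptions made for illustration; any name and any calculation could be used instead.

```r
# Create result_1, result_2 and result_3, one in each pass of the loop
for (i in 1:3) {
  var_name <- paste0("result_", i)  # builds "result_1", "result_2", ...
  assign(var_name, i^2)             # creates a variable with that name
}

result_2         # 4
get("result_3")  # 9; get() looks up a variable from its name given as text
```

For more than a handful of results, storing them in a single vector or list is usually easier to manage than creating many separate variables.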

We can also consider testing to see if two objects are equal. For example, when we write our own function, it is always a good idea to try and program it in two ways and check that the results are the same, e.g.

a <- 2^3
b <- 2 * 2 * 2
d <- 2 * 3
a == b

## [1] TRUE

identical(a, b)

## [1] TRUE

a == d

## [1] FALSE

identical(a, d)

## [1] FALSE
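The two checks behave differently on longer objects, and decimal arithmetic adds a further wrinkle. The sketch below uses made-up vectors p and q, and brings in the base R function all.equal, which is not used elsewhere in these notes, as one possible way to compare decimals.

```r
p <- c(1, 2, 3)
q <- c(1, 2, 4)

p == q           # compares element by element: TRUE TRUE FALSE
identical(p, q)  # one answer for the whole object: FALSE
all(p == q)      # collapses the elementwise result: FALSE

# Decimal arithmetic can differ in the last digits, so == may be FALSE
# even when the values agree for practical purposes.
sqrt(2)^2 == 2                   # FALSE
isTRUE(all.equal(sqrt(2)^2, 2))  # TRUE: equal within a small tolerance
```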

We avoided using c as a name in this example because c is already a built-in function in R. The function c combines values into a vector:

c(2, 3, 4)

## [1] 2 3 4

c(d, colour, 'Flowers')

## [1] "6"       "green"   "Flowers"

As mentioned, a vector is a sequence of data elements of the same basic type. We can think of it as simply a row of data. Members of a vector are officially called components. Note that d is a number, but when it is combined with text in a vector it is converted to the ‘character’ class (which you can think of as text). This is because all components of a vector must be stored as the same type. Here is a vector containing three numeric values 8, 1 and 3.

c(8, 1, 3)

## [1] 8 1 3

Because these are all numbers, they are stored as numeric, and the vector prints without quotation marks around them.
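One way to check how a vector has been stored is the class function. The names numbers and mixed in this short sketch are illustrative only.

```r
numbers <- c(8, 1, 3)
mixed   <- c(6, "green", "Flowers")

class(numbers)  # "numeric"
class(mixed)    # "character"

# Arithmetic works on the numeric vector
numbers * 2     # 16 2 6
# mixed * 2 would stop with an error, because "6" is stored as text
```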

Try to avoid giving objects names that are also the names of functions, as this can cause problems. For example, the following code replaces the default “mean” function with a new function that is also called “mean”, which then causes an error.

v <- c(1, 2, 3)
mean(v)

## [1] 2

mean <- function(){
  print("I 'accidentally' replaced the default mean function!")
}
mean()

## [1] "I 'accidentally' replaced the default mean function!"

mean(v)

## Error in mean(v): unused argument (v)

# remove this incorrect function so the default one works again
rm(mean)
mean(v)

## [1] 2
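If a name has been masked like this, the original can still be reached by stating which package it comes from. The sketch below assumes the masking function takes no arguments, as in the example above; the name result_while_masked is illustrative.

```r
v <- c(1, 2, 3)

# Deliberately mask mean with a function that takes no arguments
mean <- function() "not the real mean"

# base::mean always refers to the version in the base package
result_while_masked <- base::mean(v)
result_while_masked  # 2

# Remove the masking object so that mean(v) works as usual again
rm(mean)
mean(v)              # 2
```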

3.2 Simple summary functions

R has built-in functions for simple statistical measures:

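# f is not defined in this section; give it a value first or this line will fail
# the result is named vec because vector is already a built-in function
vec <- c(x, y, d, f, b, 50.3, 7^2)
vec
length(vec)
sum(vec)
mean(vec)
max(vec)
min(vec)
median(vec)
summary(vec)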
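The same functions can be tried on a small self-contained vector. The values and the name measurements below are made up for illustration, and range is an additional base R function not listed above.

```r
# Illustrative values only; they are not taken from the course data
measurements <- c(4, 8, 15, 16, 23, 42)

length(measurements)  # 6
sum(measurements)     # 108
mean(measurements)    # 18
median(measurements)  # 15.5, the average of the two middle values
range(measurements)   # 4 42, the minimum and maximum together
```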

3.3 Vectors and Subsetting

We can consider a vector as simply a row of data. Here is a vector containing three numeric values 8, 1 and 3.

b <- c(8, 1, 3)  # to input a vector we use the syntax c( ) with commas
b * 3  # R performs componentwise multiplication

## [1] 24  3  9

To extract the second component of a vector we can use square brackets

b[2]  # extract the second component of the vector b

## [1] 1

The c function can also be used inside square brackets to combine values of common type together to form a vector. For example, it can be used to access two components of b, e.g. the second and third

b[c(2, 3)]  # extracts the second and third components of the vector b

## [1] 1 3

You will notice that the following will produce an error

b[ 2, 3 ]

To understand why the error is produced, let’s create vectors of health care data, e.g. blood pressure, age and gender of 6 patients

id     <- c("N198","N805","N333","N117","N195","N298")
gender <- c(1, 0, 1, 1, 0, 1)  # 0 denotes male, 1 denotes female
age    <- c(30, 60, 26, 75, 19, 60)
blood  <- c(0.4, 0.2, 0.6, 0.2, 0.8, 0.1)

Vectors can be arranged in rows or columns to form a structure similar to a matrix (a rectangle of data). We can combine them together using the functions cbind or rbind which translate to binding the vectors together as columns or rows, respectively. We can assign the combined columns or rows to be a new object:

# When R reads this code,
# it first executes the cbind function to bind the vectors together,
# then assigns the result to a new object, health_data.

# Think of "cbind" as short for 'column bind',
# i.e. combining the objects as columns
health_data <- cbind(id, gender, age, blood) 
health_data

##      id     gender age  blood
## [1,] "N198" "1"    "30" "0.4"
## [2,] "N805" "0"    "60" "0.2"
## [3,] "N333" "1"    "26" "0.6"
## [4,] "N117" "1"    "75" "0.2"
## [5,] "N195" "0"    "19" "0.8"
## [6,] "N298" "1"    "60" "0.1"

#think of "rbind" as a shortening of 'row bind', i.e. combining the objects as rows
health_data_rbind <- rbind(id, gender, age, blood)
health_data_rbind

##        [,1]   [,2]   [,3]   [,4]   [,5]   [,6]  
## id     "N198" "N805" "N333" "N117" "N195" "N298"
## gender "1"    "0"    "1"    "1"    "0"    "1"   
## age    "30"   "60"   "26"   "75"   "19"   "60"  
## blood  "0.4"  "0.2"  "0.6"  "0.2"  "0.8"  "0.1"
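The quotation marks in the output above appear because id contains text. As a rough check, binding only the numeric vectors keeps the result numeric. The names by_col and by_row are illustrative.

```r
age   <- c(30, 60, 26, 75, 19, 60)
blood <- c(0.4, 0.2, 0.6, 0.2, 0.8, 0.1)

by_col <- cbind(age, blood)  # 6 rows, 2 columns
by_row <- rbind(age, blood)  # 2 rows, 6 columns

dim(by_col)         # 6 2
dim(by_row)         # 2 6
is.numeric(by_col)  # TRUE: no text column, so nothing is converted

# The two results hold the same values with rows and columns swapped
all(t(by_col) == by_row)  # TRUE
```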

Combining the vectors in this manner gives a matrix of information. The class of an object can be found using the function class. Note that if we try this on a vector, the class depends on the type of components, e.g. we have “numeric” for a vector of numbers and “character” for a vector of names. Similarly to a vector, the entries of a matrix must be of the same class, hence again we see that the numbers have been changed to have quotation marks around them as they are now treated as text.
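A short sketch of these checks, using shortened versions of the id and age vectors, is given below. It also suggests why two indices inside square brackets fail on a plain vector: a vector has no rows or columns to index.

```r
id  <- c("N198", "N805", "N333")
age <- c(30, 60, 26)

class(age)  # "numeric"
class(id)   # "character"

m <- cbind(id, age)
is.matrix(m)  # TRUE
dim(m)        # 3 2, i.e. 3 rows and 2 columns
dim(age)      # NULL: a plain vector has no rows or columns

m[2, 2]       # "60" (stored as text), from row 2, column 2
# age[2, 2] would fail, because age has only one dimension
```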

Further Reading: Chapter 4 and Chapter 6 of Wickham & Grolemund

3.4 Data Frames

Most of the time we use a structure called a data frame to store our data. It is a list of variables (columns) that all have the same number of rows, with unique row and column names. If no variables are included, the row names determine the number of rows.

my_data <- data.frame(id, gender, age, blood) 
my_data <- data.frame(ID = id, Sex = gender, Age = age, Blood = blood) # or specify your names
my_data

##     ID Sex Age Blood
## 1 N198   1  30   0.4
## 2 N805   0  60   0.2
## 3 N333   1  26   0.6
## 4 N117   1  75   0.2
## 5 N195   0  19   0.8
## 6 N298   1  60   0.1

The commands we have used above create an object of class “data.frame”.

What is the difference between a matrix and a data frame? They are different classes of object. A matrix stores every entry as the same class, whereas a data frame lets different columns have different classes. For example, if we use the command as.matrix(my_data) we can see that the entries all have quotation marks: R now treats them as characters, even though we would prefer some of them to be numbers. This does not mean that a matrix is not useful. Many functions only work for a specific class of object, and some, such as matrix multiplication, cannot be used on a data frame. Most of the time, however, you would use a data frame by default; it can easily be converted to a matrix with as.matrix() and back with as.data.frame(). Using a function on the wrong class of object is the same issue you would get if you tried to use the mean function on a vector containing characters. To understand it better, try the following:

mean(c("a","b","c"))
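The difference between the two classes discussed above can also be checked directly. The following is a minimal sketch on a shortened copy of the patient data; the names as_text, numeric_part and product are illustrative.

```r
my_data <- data.frame(ID = c("N198", "N805", "N333"),
                      Age = c(30, 60, 26),
                      Blood = c(0.4, 0.2, 0.6))

sapply(my_data, class)  # one class per column; Age and Blood are numeric

as_text <- as.matrix(my_data)
is.character(as_text)   # TRUE: every entry is now text

# Keeping only the numeric columns gives a numeric matrix,
# which can be used in matrix multiplication with %*%
numeric_part <- as.matrix(my_data[, c("Age", "Blood")])
is.numeric(numeric_part)  # TRUE
product <- t(numeric_part) %*% numeric_part
dim(product)              # 2 2
```

Selecting the numeric columns before converting avoids the numbers being turned into text.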

Some of the available functions which can be useful to check your data are as follows:

summary(my_data)
class(my_data)

head(my_data)  # first six lines
# note that if we want fewer lines, say 2, we can specify this:
head(my_data, 2)

tail(my_data)     # last six lines
tail(my_data, 2)  # last two lines

colnames(my_data)  # returns the column names of the data frame
nrow(my_data)  # number of rows
ncol(my_data)  # number of columns

Note that the summary output includes 1st Qu., Median and 3rd Qu. These are the 25th, 50th and 75th percentiles of the data, respectively. For example, if our data were the whole numbers from 0 to 10, the 25th percentile would be 2.5 (halfway between 2 and 3), the 50th percentile would be 5 (the middle number) and the 75th percentile would be 7.5 (halfway between 7 and 8).
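The worked example with the whole numbers from 0 to 10 can be reproduced with the base R function quantile, as in this short sketch. The name whole_numbers is illustrative, and the results shown rely on R's default method for calculating percentiles.

```r
whole_numbers <- 0:10

quantile(whole_numbers, probs = c(0.25, 0.5, 0.75))  # 2.5, 5 and 7.5
summary(whole_numbers)  # the same three values, plus the min, mean and max
```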

Similar to $(x,y)$ coordinates, the indices of a matrix or data frame always read [ROWS, COLUMNS]. To extract the single cell value in the second row and third column, we type

my_data[2, 3]

## [1] 60

We prefer to work with data frames from now on.

Omitting column values implies all columns; here all columns in row 2

my_data[2, ]

##     ID Sex Age Blood
## 2 N805   0  60   0.2

my_data[2, 1:ncol(my_data)] # The same command, note ncol works out the number of columns

##     ID Sex Age Blood
## 2 N805   0  60   0.2

Omitting the row value implies all rows; here all rows in column 3 (Age). Since you will select columns (variables) more often than rows, it is worth knowing the pros and cons of each of the following approaches:

# Implicit: very convenient, but if the data changes
# you might select the wrong column
my_data[ ,3]

## [1] 30 60 26 75 19 60

# Very specific, but you need to know the column names first.
# Because R is case-sensitive (it treats uppercase and lowercase differently),
# it is easy to make a typing mistake.
my_data$"Age"

## [1] 30 60 26 75 19 60

# May not work in some specific R environments,
# but can be useful in a function or loop.
my_data[, "Age"]

## [1] 30 60 26 75 19 60

# Allows you to choose multiple columns
my_data[, c("Age")]

## [1] 30 60 26 75 19 60

my_data[, c("Age", "ID")]

##   Age   ID
## 1  30 N198
## 2  60 N805
## 3  26 N333
## 4  75 N117
## 5  19 N195
## 6  60 N298
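The trade-off between selecting by position and selecting by name can be seen in a small sketch. It assumes a shortened copy of my_data whose columns later arrive in a different order; the names reordered and wanted are illustrative.

```r
my_data <- data.frame(ID = c("N198", "N805", "N333"),
                      Sex = c(1, 0, 1),
                      Age = c(30, 60, 26))

# Suppose the columns arrive in a different order
reordered <- my_data[, c("ID", "Age", "Sex")]

reordered[, 3]       # 1 0 1: position 3 is now Sex, not Age
reordered[, "Age"]   # 30 60 26: selecting by name is unaffected

# The name can be held in a variable, which is handy in functions and loops
wanted <- "Age"
reordered[, wanted]  # 30 60 26
```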

We can also use ranges - rows 2 and 3, columns 2 and 3

my_data[2:3, 2:3]

##   Sex Age
## 2   0  60
## 3   1  26

Exercise 1: What is the difference between cbind() and rbind()?

Exercise 2: We found out that the blood pressure instrument is under-recording, so every measurement is too low by 0.1. How would you add 0.1 to all values in the blood vector?

Exercise 3: We found out that the first patient is 33 years old. How would you change the first element of the vector age to 33?

3.5 Data available in R

There are many datasets already available in R, usually in data frame format. These can be discovered using the function data(). For example, the first dataset listed is called AirPassengers and contains data of airline passengers from 1949 to 1960.

When you load additional packages, more datasets may be available, for example the package nycflights13 contains a dataset which details flight information. As with the functions, we can use the help manual to learn more about these datasets, i.e., using the command ?AirPassengers.

Data are often arranged such that they are rectangular, with column headings. They are usually stored in R as objects which are classed as data frames or tibbles. As we have seen, the column names can be imported into the environment as objects which can be called.
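Built-in datasets can be inspected with the same functions used earlier in this session. The sketch below looks at AirPassengers and, as an additional example chosen here rather than taken from the text above, the mtcars dataset that also ships with R.

```r
class(AirPassengers)  # "ts": a time series rather than a data frame
start(AirPassengers)  # 1949 1 (January 1949)
end(AirPassengers)    # 1960 12 (December 1960)

class(mtcars)         # "data.frame"
dim(mtcars)           # 32 11
head(mtcars, 3)       # first three rows
```

Checking the class first is a quick way to find out which of the functions from this session will apply to a dataset.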

3.6 Further Reading

See Chapter 4 of R for Data Science for a more extensive run through of data structures in R.
See Chapter 9 of R for Data Science for a more extensive description of subsetting R objects.