# Chapter 1 Getting started with R

## 1.1 Introduction to R

This chapter gets you started with programming in R. We begin by familiarising you with the basics.

In R, you can do various arithmetic calculations. Some examples are presented below. Run this code in RStudio (in the Console tab) and see what happens. Also try some other mathematical operations to get more experience.

1 + 6
2 * 5
2 * 8/5
sqrt(16)
2 ^ 2
log(10)
exp(2)

If you want to refer to a result later on, you can give the calculation a name, which we call an object name. For example: a <- 2*3. We read this as “assign (<-) the result of the multiplication to an object named a”. If you type this object name in the R console, you will see the result of this computation (i.e., 6 in this case). Now, it is your turn to practise.
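Before you attempt the exercise, here is a minimal sketch of how a stored result can be reused. The object names total and half_total are hypothetical and chosen only for illustration.

```r
# Store the result of a calculation in an object (hypothetical name)
total <- 2 * 3

# Typing the object name prints the stored value
total

# The stored value can be reused in later calculations
half_total <- total / 2
half_total
```

Any valid name can be used in place of these; the value stays available until you overwrite it or end the session.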

Exercise: Calculate the square root of 16, multiply it by 7, and divide the result by 2. Assign this calculation to an object called b. Practise as many mathematical operations as you can in RStudio.

Tip: To take the square root of a number, use the sqrt() function.

Next, find the R help file for the exponential function (exp). Type the following code in your RStudio console and then check the Help tab for the help page:

? exp

## 1.2 Data Types

There are different types of data, such as discrete, continuous, nominal, and ordinal, to describe quantitative and qualitative measurements. In R, different names are used to identify data types. The ones that you are likely to use in this course are:

Logical - boolean values of TRUE and FALSE.

Character - simple character strings (e.g., words and sentences). Can be used as nominal or ordinal data. A related type designed specifically for categorical data is called Factor, which we will explore soon.

Integer - integer numerical values, without any decimal point (indicated with an L suffix, e.g., 2L). Can be used mainly for ordinal or interval data, but may be used as ratio data, such as counts, with some caution.

Numeric/double - Real numbers, stored as floating point numerical values. May be used for all scales of measurement, but is particularly suited to ratio scale measurements. It is the default numerical type when you work with a number in R.
Note: In R, integers are a subset of numeric values, so is.numeric() returns TRUE for an integer.

We can check the type of a variable using class() or typeof(). Let’s create some variables: a <- 2.5, b <- 3, and c <- "Hello".

a <- 2.5
b <- 3
c <- "Hello"

Now let’s check the types of these variables:

class(a)
## [1] "numeric"
class(b)
## [1] "numeric"
class(c)
## [1] "character"
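The two functions do not always report the same name. As a small sketch (the object names below are hypothetical), class() reports "numeric" for a decimal number, while typeof() reports the underlying storage type "double":

```r
# Example values (hypothetical names, chosen for illustration)
x_double  <- 2.5
x_integer <- 3L
x_text    <- "Hello"
x_logical <- TRUE

class(x_double)     # "numeric"
typeof(x_double)    # "double"
class(x_integer)    # "integer"
typeof(x_integer)   # "integer"
typeof(x_text)      # "character"
typeof(x_logical)   # "logical"

# Integers also count as numeric
is.numeric(x_integer)   # TRUE
```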

If you already expect a particular type and would like to confirm it, use is.integer() or is.numeric(), which return TRUE or FALSE. Now, go back to RStudio and try these two commands.

We can also use logical and comparison operators, such as

• & (AND),
• | (OR),
• ! (negation),
• > (greater than),
• < (less than),
• <= (less than or equal to),
• >= (greater than or equal to),
• == (equal to), and
• != (not equal to).
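The operators listed above can be tried out as follows. This is only an illustrative sketch; the values of m and n are arbitrary.

```r
# Two arbitrary values to compare (hypothetical names)
m <- 4
n <- 7

m < n                # TRUE
m == n               # FALSE
m != n               # TRUE
(m > 0) & (n > 0)    # TRUE: both conditions hold
(m > 5) | (n > 5)    # TRUE: at least one condition holds
!(m >= n)            # TRUE: negation of FALSE
```

Each expression returns a logical value (TRUE or FALSE), which can itself be stored in an object.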

Exercise: Given x=5, y=2, and z=10, code the following expressions and find the answers.
1. Is (x+y)/z greater than zero?
2. Does (z/x) return an integer?
3. Is (x*y) greater than or equal to z?

Solution:

# Creating variables
x <- 5
y <- 2
z <- 10
(x + y)/z > 0
## [1] TRUE
z/x
## [1] 2
is.integer(2)
## [1] FALSE
is.integer(2L)
## [1] TRUE
is.integer(z/x) # By default, R stores the result of a division as numeric (double)
## [1] FALSE
is.integer(as.integer(z/x)) # Convert z/x to an integer first, then test whether it is an integer
## [1] TRUE
x * y >= z
## [1] TRUE

Factors

Another type of data in R is factors. You can think of factors as special character vectors with some nice additional functions. They take on a limited number of different values; such variables are often referred to as categorical variables. Factors categorise the data and store the categories as levels. The levels can be created from both strings (text) and integers. Factors are useful for columns that have a limited number of unique values, such as Employed/Unemployed or True/False, and they are widely used in statistical modelling.

Factors are created using the factor() function, which takes a vector as input.

Exercise: Let’s create a vector called regions containing the observations "East", "West", "South", "North". We then print this vector to see the values it takes and check whether it is a factor.

Solution:

# Create a vector as input and test if it is a factor or not.
regions <- c("East","West","South","North")
print(regions)
## [1] "East"  "West"  "South" "North"
is.factor(regions)
## [1] FALSE

We see that regions is not a factor. Let’s apply the factor() function to convert it to a factor and test again whether it is a factor or not.

regions <- c("East","West","South","North")
factor_regions <- factor(regions)
is.factor(factor_regions)
## [1] TRUE

While it may be necessary to convert a numeric variable to a factor for a particular application, it is often very useful to convert the factor back to its original numeric values, since even simple arithmetic operations will fail when using factors. Since the as.numeric function will simply return the internal integer values of the factor, the conversion must be done using the levels attribute of the factor.
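The pitfall described above can be seen in a small sketch. The variable name dose and its values are made up for illustration.

```r
# A numeric variable stored as a factor (illustrative values)
dose <- factor(c(100, 250, 100, 500))

# as.numeric() returns the internal integer codes, not the original values
as.numeric(dose)                  # 1 2 1 3

# Going through the level labels recovers the original numbers
as.numeric(levels(dose))[dose]    # 100 250 100 500

# A common alternative that gives the same result
as.numeric(as.character(dose))    # 100 250 100 500
```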

Once you have converted a numeric or character variable to a factor, you can find its levels using the levels() function.

regions <- c("East","West","South","North")
factor_regions <- factor(regions)
levels(factor_regions)
## [1] "East"  "North" "South" "West"

Factors in R come in two varieties: ordered (e.g., small, medium, large) and unordered (e.g., pen, brush, pencil). For most analyses, it will not matter whether a factor is ordered or unordered. If the factor is ordered, then the specific order of the levels matters (small < medium < large). If the factor is unordered, then the levels will still appear in some order, but the specific order of the levels matters only for convenience (pen, pencil, brush): it will determine, for example, how output will be printed, or the arrangement of items on a graph.

One way to change the level order is to use factor() on the factor and specify the order directly. In this example, the function ordered() could be used instead of factor().

Suppose we are studying the effects of several levels of a fertilizer on the growth of a plant. We record the fertilizer levels in a vector called fert. To order the levels from the smallest to the largest value, we set ordered = TRUE as below:

fert = c(10,20,20,50,10,20,10,50,20)
fert = factor(fert,levels = c(10,20,50), ordered = TRUE)
fert
## [1] 10 20 20 50 10 20 10 50 20
## Levels: 10 < 20 < 50

If we wished to calculate the mean of the original numeric values of the fert variable, we would have to convert the factor back to numeric values by applying the as.numeric() function to its level labels:

fert = c(10,20,20,50,10,20,10,50,20)
fert = factor(fert,levels = c(10,20,50), ordered = TRUE)
mean(as.numeric(levels(fert)[fert]))
## [1] 23.33333

Note: If you do not specify the levels for a numeric variable, R will still order the levels numerically when creating the factor. However, if your variable has character values, such as small, medium, large, and you don’t specify the levels, R will put the levels in alphabetical order. Try the following in RStudio:

a = c("xlarge","medium","small","large")
b = factor(a, levels = c("small", "medium", "large", "xlarge"), ordered = TRUE)
c = factor(a, ordered = TRUE)
b
c
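The level order of an existing unordered factor can be changed in the same way, by calling factor() again with an explicit levels argument. The sketch below uses a made-up factor called tools.

```r
# An unordered factor (illustrative values)
tools <- factor(c("pen", "brush", "pencil", "pen"))
levels(tools)                    # "brush" "pen" "pencil" (alphabetical by default)

# Re-create the factor with the levels in the order we want
tools_reordered <- factor(tools, levels = c("pen", "pencil", "brush"))
levels(tools_reordered)          # "pen" "pencil" "brush"

# The data values themselves are unchanged
as.character(tools_reordered)    # "pen" "brush" "pencil" "pen"
```

Only the order of the levels changes, which affects how tables and plots arrange the categories.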

## 1.3 Data Structure

Data structures are used to store data in an organised fashion in order to make data manipulation and other data operations more efficient. When programming in R, we use the following data structures:
1. Vectors
2. Matrices
3. Dataframes
4. Lists

### 1.3.1 Vectors

Vectors are the basic data structure in R. Vectors are one-dimensional structures. All the numbers that we typed above were vectors of length one. Longer vectors are created using the c() function, where the “c” stands for combine or concatenate. We can query the length of a vector using length(), the class of a vector using class(), and the data type stored in a vector using typeof().

# This is a vector of length 1
25
## [1] 25
#Length of this vector is
length(25)
## [1] 1
#Let's create a new vector called "num_vec" that takes numeric values. Once we have created it, let's print what it is, and find its length
num_vec <- c(1,2,3,4)
num_vec
## [1] 1 2 3 4
length(num_vec)
## [1] 4

Exercise: Create a vector called “b” taking both numeric (e.g., 2,3) and character (also called “string”) (e.g., “HELLO,” “HOW,” “ARE,” “YOU,” “?”) values. Find the class and length of this vector as well.

Solution:

# Just an example:
b <- c(1,2, "HELLO", 4, "HOW", "ARE", "YOU?")
class(b)
## [1] "character"
length(b)
## [1] 7
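The class is "character" because a vector can hold only one type of data, so R converts the numbers to strings when they are combined with text. A short sketch (the name mixed is hypothetical):

```r
# Numbers combined with text are converted to character strings
mixed <- c(1, 2, "HELLO")
mixed           # "1" "2" "HELLO"
class(mixed)    # "character"

# Logical values combined with numbers are converted to numbers
c(TRUE, FALSE, 3)   # 1 0 3
```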

Now suppose we wanted to put the ages of our friends in a vector called “friend_ages”. We can again use the c() function:

friend_ages <- c(25L, 37L, 22L, 30L)
friend_ages
## [1] 25 37 22 30
class(friend_ages)
## [1] "integer"

Note the use of the L suffix here. This tells R to store the numbers as integers. If we leave out the L, the values are stored as "numeric" by default:

ages_numeric <- c(25, 37, 22, 30)
class(ages_numeric)
## [1] "numeric"

From a user’s perspective, there is not a huge difference in how these values are stored, but it is still a good habit to specify what class your variables are whenever possible to help with collaboration and documentation.

We can do maths with vectors:

num_vec <- c(1,2,3,4)
num_vec + 2 
## [1] 3 4 5 6
num_vec <- c(1,2,3,4)
# element-wise power
num_vec ^ 2 
## [1]  1  4  9 16
num_vec <- c(1,2,3,4)
# Sum of vector elements
sum(num_vec) 
## [1] 10
# Mean of vector
sum(num_vec) / length(num_vec) 
## [1] 2.5
#OR
mean(num_vec)
## [1] 2.5

Using the seq and rep functions

There are two handy R functions: seq() and rep(). The first one creates a sequence and the second one replicates observations in a vector or sequences.

# Let's create a sequence from 1 to 10.
one_to_ten <- seq(from = 1, to = 10)

# OR you can simply use ":" operator
1:10
##  [1]  1  2  3  4  5  6  7  8  9 10
# A sequence with a starting and ending value, and the amount by which to increment each step in the sequence

sequence_by_2 <- seq(from = 0L, to = 100L, by = 2L)
sequence_by_2
##  [1]   0   2   4   6   8  10  12  14  16  18  20  22  24  26  28
## [16]  30  32  34  36  38  40  42  44  46  48  50  52  54  56  58
## [31]  60  62  64  66  68  70  72  74  76  78  80  82  84  86  88
## [46]  90  92  94  96  98 100
# Let's repeat one_to_ten sequence 3 times
rep(one_to_ten, times = 3)
##  [1]  1  2  3  4  5  6  7  8  9 10  1  2  3  4  5  6  7  8  9 10
## [21]  1  2  3  4  5  6  7  8  9 10
# Let's repeat each element of this vector by 3 times
rep(one_to_ten, each = 3)
##  [1]  1  1  1  2  2  2  3  3  3  4  4  4  5  5  5  6  6  6  7  7
## [21]  7  8  8  8  9  9  9 10 10 10

You should now have a better sense of what the numbers in the [ ] before the output refer to. They help you keep track of where you are in the printed output: the number in brackets is the position of the first element shown on that line. So the first element, denoted by [1], is the first observation. In the sequence_by_2 vector, [1] is 0, the 18th entry ([18]) is 34, and the 35th entry ([35]) is 68. Practise creating a sequence and finding its elements in RStudio.
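These positions can be checked directly with square brackets, which are covered in more detail in Section 1.4. A minimal sketch:

```r
sequence_by_2 <- seq(from = 0L, to = 100L, by = 2L)

sequence_by_2[1]         # 0
sequence_by_2[18]        # 34
sequence_by_2[35]        # 68
length(sequence_by_2)    # 51
```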

### 1.3.2 Matrices

Matrices are the 2-dimensional extension of vectors. We can combine two vectors into a matrix using cbind() (combine by columns) or rbind() (combine by rows):

vec1 <- 1:5
vec2 <- rep(9, 5)
# combine by columns
mat1 <- cbind(vec1, vec2)
mat1
##      vec1 vec2
## [1,]    1    9
## [2,]    2    9
## [3,]    3    9
## [4,]    4    9
## [5,]    5    9

Once again, we can do element-wise arithmetic operations with matrices.

vec1 <- 1:5
vec2 <- rep(9, 5)
# combine by columns
mat1 <- cbind(vec1, vec2)
sqrt(mat1)
##          vec1 vec2
## [1,] 1.000000    3
## [2,] 1.414214    3
## [3,] 1.732051    3
## [4,] 2.000000    3
## [5,] 2.236068    3

Matrices can also be created with the matrix() function, which fills in the values column by column by default:

a <- matrix(c(1, 2, 3, 4), ncol = 2, nrow = 2)
a
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
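If you would rather fill the matrix row by row, matrix() accepts a byrow argument. The sketch below (object names are hypothetical) compares the two and uses dim() to report the number of rows and columns.

```r
# Default: values are filled column by column
by_column <- matrix(c(1, 2, 3, 4), nrow = 2, ncol = 2)

# byrow = TRUE: values are filled row by row
by_row <- matrix(c(1, 2, 3, 4), nrow = 2, ncol = 2, byrow = TRUE)
by_row
##      [,1] [,2]
## [1,]    1    2
## [2,]    3    4

# Number of rows and columns
dim(by_row)   # 2 2
```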

### 1.3.3 Dataframes

Matrices constrain us to store all data as the same type, which is not appealing given that we may have datasets composed of both numeric and nominal variables. A data.frame can handle this. A dataframe is a flexible 2-dimensional (i.e., having rows and columns) data structure in which each column can have its own type.

You can create dataframes using the data.frame() function.

df <- data.frame(1:5, c("A", "B", "C", "D", "E"))
df
##   X1.5 c..A....B....C....D....E..
## 1    1                          A
## 2    2                          B
## 3    3                          C
## 4    4                          D
## 5    5                          E

We can give names to the columns of the dataframe we created. Let’s call the first variable “number” and the second variable “character”:

df <- data.frame(number = 1:5, character = c("A", "B", "C", "D", "E"))
df
##   number character
## 1      1         A
## 2      2         B
## 3      3         C
## 4      4         D
## 5      5         E

You can extract the column and row names using colnames() and rownames(), where you simply type the dataframe name in the parentheses.
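As a brief sketch of these two functions (the object names df_named and df_unnamed are hypothetical), colnames() can also be used on the left-hand side of an assignment to rename the columns of an existing dataframe:

```r
df_named <- data.frame(number = 1:5, character = c("A", "B", "C", "D", "E"))

colnames(df_named)   # "number" "character"
rownames(df_named)   # "1" "2" "3" "4" "5"

# Renaming the columns of a dataframe that was created without names
df_unnamed <- data.frame(1:5, c("A", "B", "C", "D", "E"))
colnames(df_unnamed) <- c("number", "character")
colnames(df_unnamed) # "number" "character"
```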

Most data (csv files, Excel sheets) is organised in a 2-dimensional fashion and can be read directly into dataframes. With a big dataset, it is not always convenient to tell what types of variables there are just by looking at it; instead, we can use str() to investigate the structure of the dataframe holding the data.

df <- data.frame(1:5, c("A", "B", "C", "D", "E"))
str(df)
## 'data.frame':    5 obs. of  2 variables:
##  $ X1.5                      : int  1 2 3 4 5
##  $ c..A....B....C....D....E..: chr  "A" "B" "C" "D" ...

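Here we see that the dataframe has 5 rows (obs.) and 2 columns (variables). Because we did not name the columns this time, R generated the names X1.5 and c..A....B....C....D....E.. for them. The first variable is of type integer (int) and the second is of type character (chr). In R versions before 4.0.0, data.frame() converted character columns to factors by default, so older output may show Factor for the second variable.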

### 1.3.4 Lists

Lists are the most flexible data structure in R. Elements of lists can be of any type (unlike vectors and matrices) and of different lengths (unlike dataframes). Lists can be created using the list() function.

df <- data.frame(1:5, c("A", "B", "C", "D", "E"))
t <- list('A', c(1, 3, 4), df)
t
## [[1]]
## [1] "A"
##
## [[2]]
## [1] 1 3 4
##
## [[3]]
##   X1.5 c..A....B....C....D....E..
## 1    1                          A
## 2    2                          B
## 3    3                          C
## 4    4                          D
## 5    5                          E

Here I stored the first element as a character vector of length one, the second element as a numeric vector and the third element as a whole dataframe.

Again str() is useful to understand the structure of the list.

df <- data.frame(1:5, c("A", "B", "C", "D", "E"))
t <- list('A', c(1, 3, 4), df)
str(t)
## List of 3
##  $ : chr "A"
##  $ : num [1:3] 1 3 4
##  $ :'data.frame':    5 obs. of  2 variables:
##   ..$ X1.5                      : int [1:5] 1 2 3 4 5
##   ..$ c..A....B....C....D....E..: chr [1:5] "A" "B" "C" "D" ...

It tells us the object t is a list with 3 entries and then gives details of the elements. Note the entry for the last element is identical to the one from above.

## 1.4 Indexing and Subsetting

Let’s use the df dataframe we used earlier and extract smaller pieces of information from this dataframe. If we want to focus on any specific variable we can use the $ operator.

df <- data.frame(numbers = 1:5, characters = c("A", "B", "C", "D", "E"))
df$numbers
## [1] 1 2 3 4 5

### 1.4.1 Using [ ] with a vector/variable

We can use [ ] to select the 3rd to the 5th elements of the numbers variable from df dataframe:

df <- data.frame(numbers = 1:5, characters = c("A", "B", "C", "D", "E"))
df$numbers[3:5]
## [1] 3 4 5

Similarly, we can select only a few elements from a vector. For example, for the friend_names vector, we can specify the entries directly:

friend_names <- c("Amy", "Robert", "Andy", "Ida")
friend_names[c(1, 3)]
## [1] "Amy"  "Andy"

We can also use - to select everything but the elements listed after it:

friend_names <- c("Amy", "Robert", "Andy", "Ida")
friend_names[-c(2, 4)]
## [1] "Amy"  "Andy"

### 1.4.2 Using [ , ] with a dataframe

You have now seen how to select specific elements of a vector or a variable, but what if we want a subset of the values in the full dataframe across both rows (observations) and columns (variables)? We can use [ , ] where the spot before the comma corresponds to rows and the spot after the comma corresponds to columns. Let’s select rows 2 to 4 and columns 1, 2, and 4 from df_extended:

df_extended <- data.frame(numbers = 1:5, characters = c("A", "B", "C", "D", "E"), age = c(10,20,45,23,61), hair = c("Long", "Short", "Long", "Long", "Short"))
df_extended[2:4, c(1,2,4)]
##   numbers characters  hair
## 2       2          B Short
## 3       3          C  Long
## 4       4          D  Long

Exercise: Now, select rows 1 and 5, and columns 1, 3, and 4 from df_extended.

Solution:

df_extended <- data.frame(numbers = 1:5, characters = c("A", "B", "C", "D", "E"), age = c(10,20,45,23,61), hair = c("Long", "Short", "Long", "Long", "Short"))
df_extended[c(1, 5), c(1, 3, 4)]
##   numbers age  hair
## 1       1  10  Long
## 5       5  61 Short

### 1.4.3 Using logicals

As you’ve seen, we can specify directly which elements we’d like to select based on the integer values of the indices of the dataframe. Another way to select elements is by using a logical vector:

friend_names <- c("Amy", "Robert", "Andy", "Ida")
friend_names[c(TRUE, FALSE, TRUE, FALSE)]
## [1] "Amy"  "Andy"

This can be extended to choose specific elements from a dataframe based on the values in the “cells” of the dataframe. A logical vector like the one above (c(TRUE, FALSE, TRUE, FALSE)) can be generated based on our entries:

friend_names <- c("Amy", "Robert", "Andy", "Ida")
friend_names == "Amy"
## [1]  TRUE FALSE FALSE FALSE

We see that only the 1st element in this new vector is set to TRUE because "Amy" is the first entry in the friend_names vector.
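The same idea carries over to dataframes: a logical vector placed before the comma in [ , ] keeps only the rows where it is TRUE. The following is a minimal sketch using the df_extended dataframe from above; the names older and long_and_older are hypothetical.

```r
df_extended <- data.frame(
  numbers = 1:5,
  characters = c("A", "B", "C", "D", "E"),
  age = c(10, 20, 45, 23, 61),
  hair = c("Long", "Short", "Long", "Long", "Short")
)

# Logical vector: TRUE for rows where age exceeds 20
older <- df_extended$age > 20
older   # FALSE FALSE TRUE TRUE TRUE

# Use it in the row position; leave the column position empty to keep all columns
df_extended[older, ]

# Conditions can be combined with & and |; here we also keep only columns 2 and 3
long_and_older <- df_extended[df_extended$age > 20 & df_extended$hair == "Long", c(2, 3)]
long_and_older
```

The last call keeps rows 3 and 4, the only rows where both conditions are TRUE.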

There are many more complicated ways to subset a dataframe, and one can use the subset() function built into R, but in my experience, whenever you want to do anything more complicated than what we have done here, it is easier to use the dplyr package, which we will see later.

### 1.4.4 Congratulations

You’ve now met R’s basic data structures and learned how to inspect their contents.