2 Basic aspects of R

Install/load libraries used

if (!require(utils)) install.packages('utils') 
library(utils)
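if (!require(tidyverse)) install.packages('tidyverse')
library(tidyverse)
# lubridate is attached by tidyverse >= 2.0.0; loading it explicitly also covers older versions
if (!require(lubridate)) install.packages('lubridate')
library(lubridate)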

2.1 Data types and structures

The assignment operator in R is <- or, less commonly, =. For example, a <- 5 assigns the value 5 to a. In most cases the two operators produce the same result, but they are not identical: inside a function call, = names an argument, so using one instead of the other can change the result or produce an error.

If the result of an operation or function call is not assigned to a variable, it is printed and not stored.
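A minimal sketch of both points, using only base R. The names b, m1, m2 and y are illustrative and do not appear elsewhere in the course.

```r
a <- 5        # assignment: the value is stored and nothing is printed
a + 1         # not assigned: the result is printed and then discarded
b <- a + 1    # assigned: the result is stored in b

# Inside a function call, = names an argument while <- creates a variable
m1 <- mean(x = c(1, 2, 3))    # x is only the name of the argument of mean()
m2 <- mean(y <- c(1, 2, 3))   # y is also created in the workspace as a side effect
exists("y")
```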

Types of data

Type                 Example
character            "my text"
factor               as.factor("Europe")
numeric              5, as.numeric("5")
logical              TRUE, FALSE
data not available   NA
date                 as.Date("2022-12-31")

Coercion allows us to change data types. For example, as.Date("2022-12-31") converts the character string “2022-12-31” to a date. Other examples of coercion functions are as.numeric() and as.character().

When we read data from a table, we must examine the types assigned to the variables and check, in particular, that variables containing numbers and dates have been identified as such. If they have not, the first thing to do is correct the types.
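As an illustration of this check, the sketch below inspects and corrects the type of a column that was read as text. The vector raw and its contents are hypothetical. It uses only base R.

```r
raw <- c("5", "12", "n/a")                 # hypothetical column read as text
class(raw)                                 # "character"

# Entries that cannot be interpreted as numbers become NA (R also emits a warning)
num <- suppressWarnings(as.numeric(raw))
class(num)                                 # "numeric"
is.na(num)                                 # shows which entries could not be converted

d <- as.Date("31/12/2022", format = "%d/%m/%Y")
class(d)                                   # "Date"
```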

The factor type is used to classify and is associated with so-called categorical variables. For example, in a table with country information, the continent is typically a categorical variable stored as a factor. To speed up their management, factor variables are stored internally as integer codes together with a set of character labels (the levels). They are therefore not the same as character variables, and this can cause problems when comparing or converting both types of variables.
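One common pitfall of this kind appears when a factor whose labels look like numbers is converted to numeric. The following sketch, with a made-up factor f, shows the difference between the internal codes and the labels.

```r
f <- factor(c("10", "20", "10"))

levels(f)                              # the distinct labels: "10" "20"
codes <- as.integer(f)                 # internal integer codes: 1 2 1
vals  <- as.numeric(as.character(f))   # the labels read as numbers: 10 20 10
codes
vals
```

Converting through as.character() first is a simple way to recover the values that the labels represent.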

String manipulation

Character strings are an especially important data type, since we often have to manipulate them to get the results we need. Below we look at some basic string manipulation operations.

Function paste and paste0

The paste function combines variables to construct a string:

a <- 5
paste("the speed is",a, "km/h")
## [1] "the speed is 5 km/h"

The paste function inserts white spaces between the strings it combines. paste0 does the same as paste but without inserting whitespace.
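A short comparison of the two functions follows. The sep and collapse arguments of paste are standard base R arguments that the text above does not mention. The variable names are illustrative.

```r
p1 <- paste("a", "b")                           # "a b"
p2 <- paste0("a", "b")                          # "ab"
p3 <- paste("2022", "12", "31", sep = "-")      # "2022-12-31": custom separator
p4 <- paste0("file_", 1:3, ".csv")              # one string per element of 1:3
p5 <- paste(c("x", "y", "z"), collapse = "+")   # "x+y+z": a vector joined into one string
p4
```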

Extract a piece of a string

s <- "this is a string"
# we extract positions from 6 to 16
substr(s,6,16)
## [1] "is a string"

Extract a word from a string

When our string is a phrase with words separated by whitespace, the word function of the stringr library (loaded with the tidyverse) allows us to extract the word that occupies a given position in the phrase. For example, the following code extracts the second word from the string s declared earlier.

word(s,2)
## [1] "is"

Comparison of strings

s1 <- "ABC"
s2 <- "abc"
#literal comparison
s1==s2
## [1] FALSE
#comparison converting everything to lowercase
tolower(s1)==tolower(s2)
## [1] TRUE
#detect if one string is part of another
str_detect("UNITED NATIONS","NAT")
## [1] TRUE

Distance between strings with adist()

Sometimes we have to compare strings that are similar but not identical. For example, we may want the strings “Antonio”, “San antonio” and “S. Antonio” to be recognized as similar. The adist function of the utils library allows us to compute a distance between two strings that is zero when they are identical and that becomes greater the more they differ. This function has various parameters to adjust how we want to compare the strings. We will use this function in the future to obtain, given a string, the closest one to it in a vector of strings.

# literal comparison
adist("Antonio","San Antonio")
##      [,1]
## [1,]    4
adist("S. Antonio","San Antonio")
##      [,1]
## [1,]    2
# comparison ignoring lowercase/uppercase
adist("Antonio","San antonio",ignore.case=TRUE)
##      [,1]
## [1,]    4
# comparison of the first string with a piece of the second
adist("Antonio","S. antonio",ignore.case=TRUE,partial=TRUE)
##      [,1]
## [1,]    0
# with the partial=TRUE option, the order of the strings affects the result
adist("S. antonio","Antonio",ignore.case=TRUE,partial=TRUE)
##      [,1]
## [1,]    3

Date format handling

Since data often has dates associated with it, we will show below some examples of date format management:

as.Date("2022-12-25") # Coerce to date using standard date format
## [1] "2022-12-25"
as.Date("25/12/2022",format = "%d/%m/%Y") # reading non-standard date format
## [1] "2022-12-25"
# extract date information
format(as.Date("2022-12-25"),"year: %Y, month: %m, day: %d, day within the week: %u")
## [1] "year: 2022, month: 12, day: 25, day within the week: 7"
format(as.Date("2022-12-25"),"Week number of the year: %W, day of the year: %j")
## [1] "Week number of the year: 51, day of the year: 359"

For more information about date formats, run ?strptime in the console.

Some programs encode dates as numbers counted from a reference date (for example, in Excel the reference date is “1899-12-30”). We can convert this numeric encoding to a date with the statement:

# convert number 0 to date
as.Date(0, origin = "1899-12-30")
## [1] "1899-12-30"
# convert number 45000 to date
as.Date(45000, origin = "1899-12-30")
## [1] "2023-03-15"

Sometimes the dates associated with the data are given by months, for example “2008-02” or by quarter, for example “2008 Q2”. To handle dates in this way the lubridate library provides some useful tools. Let’s look at some examples:

ym("2008-02") # the ym() function returns the date as the first day of the month of the string "2008-02"
## [1] "2008-02-01"
ceiling_date(ym("2008-02"),"month")-1 # last day of the month
## [1] "2008-02-29"
ym("2008/02") #ym() argument supports multiple formats
## [1] "2008-02-01"
ym("200802") #ym() argument supports multiple formats
## [1] "2008-02-01"
my("02/2008") #if the month comes before the year we can use my()
## [1] "2008-02-01"
yq("2008 Q2") # the yq() function returns the date as the first day of the quarter "2008 Q2"
## [1] "2008-04-01"
ceiling_date(yq("2008 Q2"),"quarter")-1 # last day of the quarter
## [1] "2008-06-30"
yq("2008Q2") # yq() argument supports multiple formats
## [1] "2008-04-01"

Vectors

Creating a vector by concatenating values

v <- c(1,7,9)
paste("the second element of v is: ",v[2])
## [1] "the second element of v is:  7"

Creating a vector by repeating values

v <- rep(1,6)
v
## [1] 1 1 1 1 1 1
paste("the length of the vector is: ",length(v))
## [1] "the length of the vector is:  6"

Creating a vector using consecutive numbers

3:7
## [1] 3 4 5 6 7

Creating a vector with descending values

seq(7, 3, by = -1)
## [1] 7 6 5 4 3

Creating a vector of a given size filled with default values

numeric(5) # vector of numbers of size 5
## [1] 0 0 0 0 0
character(5) # vector of strings of size 5
## [1] "" "" "" "" ""

Creating a vector using consecutive dates

as.Date("2022-12-31")+0:3
## [1] "2022-12-31" "2023-01-01" "2023-01-02" "2023-01-03"

It is often necessary to create a vector of dates spaced by months, weeks, etc. We can achieve this with the seq function. For example, let’s see how a vector is constructed with the first six months of the year 2023.

seq(as.Date("2023-01-01"), as.Date("2023-06-01"), by = "months")
## [1] "2023-01-01" "2023-02-01" "2023-03-01" "2023-04-01" "2023-05-01"
## [6] "2023-06-01"

Remove elements from a vector

x <- c(2,3,2,3,5)
x[-2] # we eliminate the second position
## [1] 2 2 3 5
x[-(2:3)] # we eliminate the second and third positions
## [1] 2 3 5
x[! x %in% c(2,5)] #remove all existing 2s and 5s
## [1] 3 3

Replace elements of a vector

We are going to use the gsub function to replace text within the elements of a vector: every occurrence of the first argument is replaced by the second.

x <- c("my book", "my car","your car")
gsub('my','your',x)
## [1] "your book" "your car"  "your car"
gsub('car','tree',x)
## [1] "my book"   "my tree"   "your tree"
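Since gsub works on the text inside each element, whole elements are more naturally replaced through indexing. The sketch below shows both approaches together with sub, the base R variant that replaces only the first match. The vector contents are made up.

```r
x <- c("my book", "my car", "your car")

x[2] <- "my bike"                     # replace an element by position
x[x == "your car"] <- "your bike"     # replace the elements equal to a given value
x

first_only <- sub("a", "A", "banana")    # "bAnana": only the first match
all_match  <- gsub("a", "A", "banana")   # "bAnAnA": every match
```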

Find the position of a value in a vector

We are going to use the which function to find the position of an element in a vector:

x <- c("my book", "my car","your car")
which(x=='my car')
## [1] 2

Extract non-repeating elements from a vector

Often, when we read a table with factors, we need to know which categories (levels) are used. To do this we use the levels function, which returns the distinct values of a factor:

levels(as.factor(c("red","red","red","blue","blue")))
## [1] "blue" "red"
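For vectors that are not factors, the base R function unique gives the distinct elements directly. A brief sketch with illustrative data:

```r
v <- c("red", "red", "red", "blue", "blue")

u <- unique(v)                 # "red" "blue": distinct values in order of first appearance
n <- unique(c(3, 1, 3, 2, 1))  # 3 1 2
u
table(v)                       # how many times each value appears
```

Note that levels returns the categories sorted alphabetically, whereas unique keeps the order in which the values first appear.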

Dataframes and Tibbles

A dataframe is a table whose columns can hold different types of data, for example dates and numeric values. The tibble structure is an improved version of the dataframe provided by the tidyverse library. In this course we will mainly use this data structure to manage tables. In fact, unless otherwise indicated, when we talk about tables in the course we will be referring to tibbles. When we talk about variables we will be referring to the columns (col) of the table, and when we talk about records we will be referring to the rows (row) of the table.
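To see that each column keeps its own type, the following sketch builds a small base R data.frame (not a tibble) and inspects the class of every column. The table df and its label column are made up for illustration.

```r
df <- data.frame(
  date  = as.Date("2022-12-31") + 0:2,
  value = c(9, 6, 1),
  label = c("a", "b", "c"),
  stringsAsFactors = FALSE     # keep label as character in every R version
)

types <- sapply(df, class)     # class of each column
types
dim(df)                        # number of rows and columns
```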

2.1.0.1 Manual creation of a tibble

tb <- tibble(
   date=as.Date("2022-12-31")+0:5,
   value=c(9,6,1,7,5,8))
tb # printing tb 
## # A tibble: 6 × 2
##   date       value
##   <date>     <dbl>
## 1 2022-12-31     9
## 2 2023-01-01     6
## 3 2023-01-02     1
## 4 2023-01-03     7
## 5 2023-01-04     5
## 6 2023-01-05     8
paste("The second value of the tibble date variable is ",tb$date[2])
## [1] "The second value of the tibble date variable is  2023-01-01"
paste("The third value of the tibble's value variable is ",tb$value[3])
## [1] "The third value of the tibble's value variable is  1"
paste("The third value of the tibble's value variable is ",tb[3,2])
## [1] "The third value of the tibble's value variable is  1"

2.1.0.2 Access first and last values of the tibble

head(tb,3) # first 3 rows of tb
## # A tibble: 3 × 2
##   date       value
##   <date>     <dbl>
## 1 2022-12-31     9
## 2 2023-01-01     6
## 3 2023-01-02     1
tail(tb,4) # last 4 rows of tb
## # A tibble: 4 × 2
##   date       value
##   <date>     <dbl>
## 1 2023-01-02     1
## 2 2023-01-03     7
## 3 2023-01-04     5
## 4 2023-01-05     8

Basic handling of rows and columns

nrow(tb) # number of rows
## [1] 6
ncol(tb) # number of columns
## [1] 2
tb[1:3,] # selection of the first 3 rows
## # A tibble: 3 × 2
##   date       value
##   <date>     <dbl>
## 1 2022-12-31     9
## 2 2023-01-01     6
## 3 2023-01-02     1
tb[,2] # second column selection
## # A tibble: 6 × 1
##   value
##   <dbl>
## 1     9
## 2     6
## 3     1
## 4     7
## 5     5
## 6     8
tb[1:3,2] # combined selection
## # A tibble: 3 × 1
##   value
##   <dbl>
## 1     9
## 2     6
## 3     1
colnames(tb) # column names
## [1] "date"  "value"
tb2 <- set_names(tb,c("date_new","value_new")) # change the column names
tb2
## # A tibble: 6 × 2
##   date_new   value_new
##   <date>         <dbl>
## 1 2022-12-31         9
## 2 2023-01-01         6
## 3 2023-01-02         1
## 4 2023-01-03         7
## 5 2023-01-04         5
## 6 2023-01-05         8

2.1.0.3 Add a record (row) to a tibble

add_row(tb,date = as.Date("2023-01-25"),value=11)
## # A tibble: 7 × 2
##   date       value
##   <date>     <dbl>
## 1 2022-12-31     9
## 2 2023-01-01     6
## 3 2023-01-02     1
## 4 2023-01-03     7
## 5 2023-01-04     5
## 6 2023-01-05     8
## 7 2023-01-25    11

Concatenate records from multiple tibbles

The rbind() function allows us to concatenate the records of several tibbles that have the same variables. Let’s look at an example:

tb1 <- tibble(
   date=as.Date("2022-12-31")+0:2,
   value=c(9,6,1))
tb2 <- tibble(
   date=as.Date("2022-12-31")+3:5,
   value=c(7,5,8))
rbind(tb1,tb2)
## # A tibble: 6 × 2
##   date       value
##   <date>     <dbl>
## 1 2022-12-31     9
## 2 2023-01-01     6
## 3 2023-01-02     1
## 4 2023-01-03     7
## 5 2023-01-04     5
## 6 2023-01-05     8

2.1.0.4 Add a variable (column) to a tibble

To add a variable to a tibble, simply create a vector the size of the number of rows in the tibble and add the variable to the tibble.

tb$new_value <- 1:nrow(tb)
tb
## # A tibble: 6 × 3
##   date       value new_value
##   <date>     <dbl>     <int>
## 1 2022-12-31     9         1
## 2 2023-01-01     6         2
## 3 2023-01-02     1         3
## 4 2023-01-03     7         4
## 5 2023-01-04     5         5
## 6 2023-01-05     8         6

Change a value in the entire tibble

tb[tb==5] <- NA
tb
## # A tibble: 6 × 3
##   date       value new_value
##   <date>     <dbl>     <int>
## 1 2022-12-31     9         1
## 2 2023-01-01     6         2
## 3 2023-01-02     1         3
## 4 2023-01-03     7         4
## 5 2023-01-04    NA        NA
## 6 2023-01-05     8         6

2.1.0.5 Select records using slice

slice(tb,2:4)
## # A tibble: 3 × 3
##   date       value new_value
##   <date>     <dbl>     <int>
## 1 2023-01-01     6         2
## 2 2023-01-02     1         3
## 3 2023-01-03     7         4

Do an operation on all elements of selected variables

We are going to convert all the values of the value and new_value variables to type character

tb %>% mutate(across(value:new_value,as.character))
## # A tibble: 6 × 3
##   date       value new_value
##   <date>     <chr> <chr>    
## 1 2022-12-31 9     1        
## 2 2023-01-01 6     2        
## 3 2023-01-02 1     3        
## 4 2023-01-03 7     4        
## 5 2023-01-04 <NA>  <NA>     
## 6 2023-01-05 8     6

The mutate function belongs to the dplyr library and will be covered in detail later.

Delete a record from a tibble

# Deleting second tibble record
tb %>%
    filter(row_number()!=2)
## # A tibble: 5 × 3
##   date       value new_value
##   <date>     <dbl>     <int>
## 1 2022-12-31     9         1
## 2 2023-01-02     1         3
## 3 2023-01-03     7         4
## 4 2023-01-04    NA        NA
## 5 2023-01-05     8         6

Lists

A list is a collection of objects that can have different structures and sizes. We will make little use of lists in this course since our main source of information will be data tables that we will manage with tibbles.

list("date"=as.Date("2022-12-25"), "value1"= 1:3,"value2"=1:4) # create list
## $date
## [1] "2022-12-25"
## 
## $value1
## [1] 1 2 3
## 
## $value2
## [1] 1 2 3 4
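The elements of a list can be accessed by name or by position. A minimal sketch, storing the list above in a variable l (a name chosen here for illustration):

```r
l <- list(date = as.Date("2022-12-25"), value1 = 1:3, value2 = 1:4)

names(l)             # "date" "value1" "value2"
l$value1             # element accessed by name
l[["value2"]][4]     # fourth value of the element value2
l[[1]]               # element accessed by position
length(l)            # number of elements in the list
```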

2.2 Functions in R

Functions simplify and organize software development by packaging code so that it can be reused later in a compact and simple way. In R, functions are usually written in files with a .R extension. To use the functions defined in a .R file from other R code, place the following instruction at the beginning of that code:

source("FileName.R")

In general, functions have some input parameters, perform some type of operation and return something. The general way to define a function is

FunctionName <- function(parameter1, parameter2, ...){
   # code lines
   # .....
   return(result)
}

Next we will create a function that adds 2 numbers and use it to calculate 2+3. We call it MySum rather than sum so that it does not mask the built-in sum function.

MySum <- function(x,y){
   z <- x+y
   return(z)
}
MySum(2,3)
## [1] 5

Most functions in the standard libraries, such as paste or as.Date, are vectorized: when they receive a vector as an argument, they apply the function to each element and return another vector. For example:

d <- c("2008-01","2014-10","2023-12")
paste("ym() function converts",d,"into",ym(d))
## [1] "ym() function converts 2008-01 into 2008-01-01"
## [2] "ym() function converts 2014-10 into 2014-10-01"
## [3] "ym() function converts 2023-12 into 2023-12-01"

Likewise, when the function receives 2 arguments and both are vectors, the function can return a matrix with the result of applying the function to each combination of the elements of both vectors. Let’s see an example with the adist function to compare strings:

adist(c("Teror","Galdar"),c("Telde","Arucas","San Mateo"))
##      [,1] [,2] [,3]
## [1,]    3    6    9
## [2,]    4    5    7

This ability of functions to act based on the type of argument they receive depends on how the function has been implemented.
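To illustrate this, the sketch below defines a function that is not vectorized because it relies on if, which expects a single condition. The function SignLabel is hypothetical. It is applied element by element with sapply (presented later in this chapter) and compared with the vectorized base R function ifelse.

```r
SignLabel <- function(x) {
  # if() expects a single TRUE/FALSE value, so this function handles one number at a time
  if (x >= 0) "non-negative" else "negative"
}

SignLabel(3)

v  <- c(-1, 2, 0)
r1 <- sapply(v, SignLabel)                         # apply the function element by element
r2 <- ifelse(v >= 0, "non-negative", "negative")   # vectorized alternative
identical(r1, r2)
```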

Control structures

The following control structures allow you to control the flow of code execution:

  • if - else allows control flow through a condition.

  • for generates a loop with an iterator

  • while generates a loop with a stop condition

  • break forces the termination of a loop

  • next forces the next iteration of a loop.
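The structures listed above can be combined as in the following sketch, which uses only base R. The variable names are illustrative.

```r
total <- 0
for (i in 1:10) {
  if (i %% 2 == 0) next   # skip even numbers
  if (i > 7) break        # stop at the first odd number greater than 7
  total <- total + i
}
total                     # 1 + 3 + 5 + 7 = 16

n <- 0
while (2^n < 100) {       # repeat until 2^n reaches 100
  n <- n + 1
}
n                         # 7, since 2^7 = 128

if (total > n) {
  msg <- "total is larger"
} else {
  msg <- "n is larger or equal"
}
msg
```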

Logical operators

  • == equal to

  • x | y x OR y

  • x & y x AND y

  • !x not x

  • isTRUE(x) checks if x is TRUE

  • is.na(x) checks if x is NA
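A brief sketch of how these operators behave, in particular when NA is involved. The variable names are illustrative.

```r
x <- c(TRUE, FALSE, NA)

and_res <- x & TRUE        # TRUE FALSE NA
or_res  <- x | TRUE        # TRUE TRUE TRUE
not_res <- !x              # FALSE TRUE NA
na_pos  <- is.na(x)        # FALSE FALSE TRUE

cmp  <- NA == 1            # NA: comparing with NA does not give TRUE or FALSE
safe <- isTRUE(NA == 1)    # FALSE: isTRUE() only accepts a single TRUE
```

Because a comparison with NA returns NA, is.na is the reliable way to test for unavailable values.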

Examples of functions

We are going to implement a function that calculates the mean of a vector of numbers leaving out the possible unavailable values NA

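MyMean <- function(V){
   total <- 0     # running sum of the available values
   Nvalues <- 0   # number of available values
   for(i in seq_along(V)){
     if(is.na(V[i])) next   # skip unavailable values
     total <- total + V[i]
     Nvalues <- Nvalues + 1
   }
   return(total/Nvalues)
}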
V <- c(1.,NA,2.,3.)
MyMean(V)
## [1] 2

This same result can be obtained by calling the mean function but first omitting the NA from the calculation.

mean(na.omit(V))
## [1] 2

The na.omit function is widely used when we want to do numerical calculations on vectors that can contain NA.
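Many base R summary functions also accept the argument na.rm = TRUE, which has the same effect for this kind of calculation. A short sketch:

```r
V <- c(1, NA, 2, 3)

m_all  <- mean(V)                  # NA: a single NA makes the result unavailable
m_omit <- mean(V, na.rm = TRUE)    # 2: NA values are left out
s_omit <- sum(V, na.rm = TRUE)     # 6
n_na   <- sum(is.na(V))            # 1: number of unavailable values
```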

Let’s now implement a function that finds the first position in a vector of strings sV where the string s is located. If the string is not found, NA is returned.

FindString <- function(s,sV){
   i=1
   result<-NA
   while(i<=length(sV)){
     if(s==sV[i]){
       result <- i
       break
     }
     i=i+1
   }
   return(result)
}
sV <- c("abc","cde","efg")
FindString("cde",sV)
## [1] 2
FindString("cd",sV)
## [1] NA

A similar result can be obtained directly with the which function, with two differences: which returns the positions of all matches rather than only the first, and if it does not find the string it returns an empty vector, integer(0), instead of NA:

which(sV=="cde")
## [1] 2
which(sV=="cd")
## integer(0)
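The base R function match behaves like FindString: it returns the first position only, or NA when there is no match. The sketch below uses a vector with a repeated element, chosen here for illustration, to show the difference from which.

```r
sV <- c("abc", "cde", "efg", "cde")

first_pos <- match("cde", sV)        # 2: first position only
not_found <- match("cd", sV)         # NA when the string is absent
all_pos   <- which(sV == "cde")      # 2 4: every position
no_match  <- length(which(sV == "cd")) == 0   # TRUE: which() returned an empty vector
```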

The scope of functions

Functions are, in any language, the basis for sharing code and functionality between programs and users. At the first level, functions in R are stored in files with the .R extension. For example, we have created the file utilidades.R, where we have implemented some functions that we will need. To use these functions in our code, we place the following instruction before the first call to them:

source("utilidades.R")

At a second level, functions are packaged in libraries that we can share with other users, usually through a personal GitHub repository or the CRAN repository. There are thousands of R libraries, and a frequent problem is that several libraries loaded at the same time contain functions with the same name that do different things. By default, the code uses the function from the library that was loaded last, which is not always the one we want at that moment. For example, the dplyr library has functions such as filter() or rename() that easily conflict with functions of the same name in other libraries, and these conflicts easily cause errors when scripts are run.

When this occurs, we have to specify the library in the function call. For example, calling dplyr::filter() states that we want the filter() function from the dplyr library, which avoids the conflict. Ideally every function call would specify its library, but this is not usually done because it makes the code slower to write. If we want to publish our code in the CRAN repository, however, we will be required to do so to avoid conflicts between libraries.

A practical way to minimize these errors is to load the libraries in reverse order of importance, so that the most important ones are loaded last. For example, tidyverse is very important and should be one of the last to load.
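The masking mechanism and the library::function() notation can be reproduced with base R alone. In this sketch we deliberately define our own function called median, which hides the one in the stats library. This is an artificial conflict created only for illustration.

```r
# A user-defined function that shares its name with stats::median()
median <- function(x) "my own median"

v <- c(1, 5, 9)
r_own   <- median(v)          # the most recent definition is the one that is used
r_stats <- stats::median(v)   # the prefix selects the version from the stats library
r_own
r_stats

rm(median)                    # remove our definition so median() refers to stats again
```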

So many people work on R that the same functionality is often implemented in several libraries. To decide which library to use we can combine two criteria: the number of downloads the library has had on CRAN, and whether the library has been updated recently.

The most immediate way to get help using a function is to run the help(FunctionName) command in the console.

The apply family of functions

Suppose we have two string vectors, sV1 and sV2, that we want to compare. Using the FindString(s,sV) function implemented above, we could write a loop that goes through the elements of sV1 and looks for each one in sV2. However, the apply family of functions offers a more compact way to do this, performing loops and other nested operations in a single statement. For example, the sapply and lapply functions can both be used to apply FindString(s,sV) to every element of a vector. The difference between them is that lapply always returns a list, whereas sapply simplifies the result to a vector when possible. Let’s see how sapply applies, with a single statement, the function FindString(s,sV) to all the elements of sV1:

sV1 <- c("abc","cde","efg")
sV2 <- c("efg","abc","cde")
sapply(sV1,FindString,sV=sV2)
## abc cde efg 
##   2   3   1

The parameters of sapply() are: (1) the vector sV1, (2) the function FindString that acts on the elements of sV1 and (3) sV=sV2 the second argument of the function FindString.
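The difference between sapply and lapply can be seen with any function. The sketch below uses the base R function nchar, which counts characters, on a made-up vector. It also shows vapply, a stricter base R variant in which the expected type of each result is declared.

```r
words <- c("abc", "de", "fghi")

s <- sapply(words, nchar)               # named integer vector
l <- lapply(words, nchar)               # list with one element per word
v <- vapply(words, nchar, integer(1))   # like sapply, but each result must be one integer
s
l
```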

The functions of the apply family allow us to perform more complex operations that we will not study in this course.

The pipe %>% or |> operator

The %>% pipe operator comes from the magrittr package and is available when the tidyverse is loaded; since version 4.1, base R also includes a similar native pipe, |>. Both allow function calls to be chained more clearly, without the need for nested parentheses. That is, if we normally write the composition as

function2(function1(x))

Using the pipe operator %>%, the same call would be written as:

x %>% function1() %>% function2()

As we will see, writing function calls this way, feeding the output of one operation into the input of the next, is often more practical and natural than the nested form. In the nested form, the last function we apply is the first one we have to write, so the code is written in the opposite order to that in which the operations are carried out.
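A minimal sketch comparing both ways of writing the same calculation. It uses the native pipe |>, so it assumes R 4.1 or later and needs no additional library. With the tidyverse loaded, %>% could be used in the same way.

```r
x <- c(1, 4, 9)

nested <- sum(sqrt(x))            # written from the inside out
piped  <- x |> sqrt() |> sum()    # written in the order the operations are applied
nested
piped
```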
