2 Basic aspects of R
Install/load libraries used
if (!require(utils)) install.packages('utils')
library(utils)
if (!require(tidyverse)) install.packages('tidyverse')
library(tidyverse)
2.1 Data types and structures
The assignment operator in R is the symbol = or, more commonly, <-. For example, a <- 5 assigns the value 5 to a. The operators = and <- produce the same result in most cases, but they are not identical, and in certain cases using one instead of the other may produce an error.
If any operation or function call is performed through code and is not assigned to any variable, the result is printed and not stored.
Type of data
Type | Example |
---|---|
character | |
factor | |
numeric | |
logical | |
data not available | |
date | |
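The types listed in the table can be illustrated with a short sketch. The concrete values below are assumptions chosen for illustration; they are not the examples from the original table.

```r
# Illustrative values, one per type in the table (the values are assumptions)
x_character <- "Spain"
x_factor    <- factor(c("Europe", "Asia", "Europe"))
x_numeric   <- 3.14
x_logical   <- TRUE
x_missing   <- NA                      # data not available
x_date      <- as.Date("2022-12-31")

class(x_character)  # "character"
class(x_factor)     # "factor"
class(x_numeric)    # "numeric"
class(x_logical)    # "logical"
is.na(x_missing)    # TRUE
class(x_date)       # "Date"
```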
Coercion allows us to change data types. For example, as.Date("2022-12-31") converts the character string “2022-12-31” into a date. Other examples of coercion functions are as.numeric() and as.character().
When we read data from a table, we must pay special attention to the types assigned to the variables and check, in particular, that variables containing numbers and dates have been identified as such. If they have not, fixing this should be the first step.
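As a minimal sketch of this check, assume a small table (the hypothetical data frame raw below) whose columns were all read as text. We inspect the column types and then coerce them.

```r
# Hypothetical table in which dates and numbers were read as text
raw <- data.frame(date  = c("2022-12-31", "2023-01-01"),
                  value = c("9", "6"),
                  stringsAsFactors = FALSE)

sapply(raw, class)                  # both columns are "character"

# Coerce each column to the type it should have
raw$date  <- as.Date(raw$date)
raw$value <- as.numeric(raw$value)

types_after <- sapply(raw, class)   # date: "Date", value: "numeric"
types_after
```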
The factor type is used for classification and is associated with so-called categorical variables. For example, in a table with country information, the continent is typically a factor variable. To speed up their handling, factor variables are stored internally as integer codes together with a set of character labels (the levels). They are therefore not the same as character variables, and this can cause problems when comparing or converting between the two types.
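The following sketch, with made-up values, shows the internal representation of a factor and a typical pitfall when converting a factor that holds numbers.

```r
continent <- factor(c("Europe", "Asia", "Europe"))
levels(continent)       # "Asia" "Europe" (labels, sorted alphabetically)
as.integer(continent)   # 2 1 2 (the internal integer codes)

# Pitfall: converting a factor of numbers returns the codes, not the numbers
f <- factor(c("10", "20", "10"))
codes  <- as.numeric(f)                 # 1 2 1
values <- as.numeric(as.character(f))   # 10 20 10
```

Converting through as.character() first is the usual way to recover the original numbers.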
String manipulation
Character strings are an especially important data type, since we often have to manipulate them to get the results we need. Below we look at some basic string manipulation operations.
Functions paste and paste0
The paste function combines variables to construct a string:
## [1] "the speed is 5 km2"
The paste function inserts a space between the strings it combines. paste0 does the same as paste but without inserting any separator.
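The code chunk for this example is not shown in the text. One possible call that reproduces the printed string, assuming a variable a holding the value 5, is sketched below.

```r
a <- 5                                  # assumed value, as in the earlier assignment example
s <- paste("the speed is", a, "km2")    # paste() separates the pieces with spaces
s                                       # "the speed is 5 km2"

# With paste0() the spaces must be written explicitly
s0 <- paste0("the speed is ", a, " km2")
identical(s, s0)                        # TRUE
```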
Extract a word from a string
When our string is a phrase with words separated by whitespace, the word function of the stringr package (part of the tidyverse) allows us to extract the word that occupies a given position in the phrase. For example, extracting the second word from the string s declared earlier gives:
## [1] "is"
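A sketch of the call, assuming s holds the string built in the paste example and that the stringr package is installed:

```r
library(stringr)

s <- "the speed is 5 km2"     # assumed content of s
second <- word(s, 2)          # word at position 2
second                        # "is"
```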
Distance between strings with adist()
Sometimes we have to compare strings that are similar but not identical. For example, we may want the strings “Antonio”, “San antonio” and “S. Antonio” to be recognized as similar. The adist function of the utils library computes a distance between two strings (a generalized edit distance) that is zero when they are identical and grows the more they differ. This function has various parameters to adjust how we want to compare the strings. We will use this function later to obtain, given a string, the closest one to it in a vector of strings.
## [,1]
## [1,] 4
## [,1]
## [1,] 2
## [,1]
## [1,] 4
# comparison of the first string with a piece of the second
adist("Antonio","S. antonio",ignore.case=TRUE,partial=TRUE)
## [,1]
## [1,] 0
# with the partial=TRUE option, the order of the strings affects the result
adist("S. antonio","Antonio",ignore.case=TRUE,partial=TRUE)
## [,1]
## [1,] 3
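As a sketch of the use mentioned above (finding the closest string in a vector), we can combine adist with which.min. The vector candidates is a hypothetical example.

```r
candidates <- c("San antonio", "S. Antonio", "Madrid")   # hypothetical vector

# adist() returns a 1 x 3 matrix with the distance to each candidate
d <- adist("Antonio", candidates, ignore.case = TRUE)

closest <- candidates[which.min(d)]   # candidate with the smallest distance
closest                               # "S. Antonio"
```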
Date format handling
Since data often has dates associated with it, we will show below some examples of date format management:
## [1] "2022-12-25"
## [1] "2022-12-25"
# extract date information
format(as.Date("2022-12-25"),"year: %Y, month: %m, day: %d, day within the week: %u")
## [1] "year: 2022, month: 12, day: 25, day within the week: 7"
## [1] "Week number of the year: 51, day of the year: 359"
For more information about date formats, run ??strptim in the console and choose the base::format.POSIXct option.
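Some of the code chunks for these outputs are not shown. The sketch below uses base R only; the input format "%d/%m/%Y" and the choice of the %W code for the week number are assumptions, not necessarily what the original code used.

```r
d1 <- as.Date("2022-12-25")                        # ISO format is read directly
d2 <- as.Date("25/12/2022", format = "%d/%m/%Y")   # other formats need a format string
d1 == d2                                           # TRUE

day_of_year <- format(d1, "%j")   # "359"
week_number <- format(d1, "%W")   # "51" (weeks starting on Monday)
```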
Some programs encode dates as numbers counted from a reference date (for example, in Excel the reference date is “1899-12-30”). We can convert this numeric encoding to a date with the statement:
## [1] "1899-12-30"
## [1] "2023-03-15"
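A sketch of this conversion with as.Date and its origin argument; the serial number 45000 is an assumed example value.

```r
excel_origin <- "1899-12-30"

d0 <- as.Date(0, origin = excel_origin)       # "1899-12-30"
d  <- as.Date(45000, origin = excel_origin)   # "2023-03-15"

# And back: number of days elapsed since the reference date
serial <- as.numeric(as.Date("2023-03-15") - as.Date(excel_origin))   # 45000
```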
Sometimes the dates associated with the data are given by months, for example “2008-02” or by quarter, for example “2008 Q2”. To handle dates in this way the lubridate library provides some useful tools. Let’s look at some examples:
ym("2008-02") # the ym() function returns the date as the first day of the month of the string "2008-02"
## [1] "2008-02-01"
## [1] "2008-02-29"
## [1] "2008-02-01"
## [1] "2008-02-01"
## [1] "2008-02-01"
## [1] "2008-04-01"
## [1] "2008-06-30"
## [1] "2008-04-01"
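Most of the calls that produced these outputs are not shown. As a hedged sketch, assuming the lubridate package is installed, the first and last day of a month and the start of a quarter can be obtained as follows. The calls chosen here are one possibility, not necessarily the ones used originally.

```r
library(lubridate)

first_day <- ym("2008-02")                       # "2008-02-01"
last_day  <- first_day + months(1) - days(1)     # "2008-02-29"

# Start of the quarter that contains a given date
quarter_start <- floor_date(as.Date("2008-05-20"), "quarter")   # "2008-04-01"
```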
Vectors
Creating a vector using consecutive dates
## [1] "2022-12-31" "2023-01-01" "2023-01-02" "2023-01-03"
It is often necessary to create a vector of dates spaced by months, weeks, etc. We can achieve this with the seq function. For example, let’s see how to construct a vector with the first day of each of the first six months of 2023.
## [1] "2023-01-01" "2023-02-01" "2023-03-01" "2023-04-01" "2023-05-01"
## [6] "2023-06-01"
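A base R sketch that produces both date vectors shown above (the variable names are illustrative):

```r
# Consecutive days: adding integers to a Date advances it day by day
consecutive <- as.Date("2022-12-31") + 0:3

# Dates spaced by months with seq()
monthly <- seq(as.Date("2023-01-01"), by = "month", length.out = 6)
monthly
```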
Replace elements of a vector
We are going to use the gsub function to replace text within the elements of a vector.
## [1] "your book" "your car" "your car"
## [1] "my book" "my tree" "your tree"
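The original calls are not shown. Assuming a starting vector v like the one below, these two gsub calls reproduce the printed results.

```r
v <- c("my book", "my car", "your car")   # assumed input vector

r1 <- gsub("my", "your", v)    # "your book" "your car" "your car"
r2 <- gsub("car", "tree", v)   # "my book" "my tree" "your tree"
```

Note that gsub treats its first argument as a regular expression and replaces every match inside each element.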
Dataframes and Tibbles
A dataframe is a table whose columns can hold different types of data, for example dates and numeric values. The tibble structure is an improved version of the dataframe provided by the tidyverse library. In this course we will mainly use this data structure to manage tables. In fact, unless otherwise indicated, when we talk about tables in the course, we will be referring to tibble structures. When we talk about variables, we will be referring to the columns (col) of the table, and when we talk about records, we will be referring to the rows (row) of the table.
2.1.0.1 Manual creation of a tibble
## # A tibble: 6 × 2
## date value
## <date> <dbl>
## 1 2022-12-31 9
## 2 2023-01-01 6
## 3 2023-01-02 1
## 4 2023-01-03 7
## 5 2023-01-04 5
## 6 2023-01-05 8
## [1] "The second value of the tibble date variable is 2023-01-01"
## [1] "The third value of the tibble's value variable is 1"
## [1] "The third value of the tibble's value variable is 1"
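A sketch of how such a tibble could be created and how individual values can be accessed, assuming the tibble package is installed. The name tb is illustrative; the values are taken from the printed output.

```r
library(tibble)

tb <- tibble(
  date  = as.Date("2022-12-31") + 0:5,
  value = c(9, 6, 1, 7, 5, 8)
)

# Access by column name and position
m1 <- paste("The second value of the tibble date variable is", tb$date[2])
m2 <- paste("The third value of the tibble's value variable is", tb$value[3])
# The same value through [[ ]] indexing
m3 <- paste("The third value of the tibble's value variable is", tb[["value"]][3])
```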
2.1.0.2 Access first and last values of the tibble
## # A tibble: 3 × 2
## date value
## <date> <dbl>
## 1 2022-12-31 9
## 2 2023-01-01 6
## 3 2023-01-02 1
## # A tibble: 4 × 2
## date value
## <date> <dbl>
## 1 2023-01-02 1
## 2 2023-01-03 7
## 3 2023-01-04 5
## 4 2023-01-05 8
Basic handling of rows and columns
## [1] 6
## [1] 2
## # A tibble: 3 × 2
## date value
## <date> <dbl>
## 1 2022-12-31 9
## 2 2023-01-01 6
## 3 2023-01-02 1
## # A tibble: 6 × 1
## value
## <dbl>
## 1 9
## 2 6
## 3 1
## 4 7
## 5 5
## 6 8
## # A tibble: 3 × 1
## value
## <dbl>
## 1 9
## 2 6
## 3 1
## [1] "date" "value"
## # A tibble: 6 × 2
## date_new value_new
## <date> <dbl>
## 1 2022-12-31 9
## 2 2023-01-01 6
## 3 2023-01-02 1
## 4 2023-01-03 7
## 5 2023-01-04 5
## 6 2023-01-05 8
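The outputs of this section are consistent with the following basic operations, sketched here under the assumption that tb is the tibble created manually above (tb_renamed is an illustrative name).

```r
library(tibble)

tb <- tibble(date = as.Date("2022-12-31") + 0:5, value = c(9, 6, 1, 7, 5, 8))

nrow(tb)                   # number of records: 6
ncol(tb)                   # number of variables: 2
first_rows <- tb[1:3, ]    # first three records
second_col <- tb[, 2]      # second variable, still a tibble
block      <- tb[1:3, 2]   # first three records of the second variable
names(tb)                  # "date" "value"

# Rename the variables on a copy
tb_renamed <- tb
names(tb_renamed) <- c("date_new", "value_new")
```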
2.1.0.3 Add a record (row) to a tibble
## # A tibble: 7 × 2
## date value
## <date> <dbl>
## 1 2022-12-31 9
## 2 2023-01-01 6
## 3 2023-01-02 1
## 4 2023-01-03 7
## 5 2023-01-04 5
## 6 2023-01-05 8
## 7 2023-01-25 11
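One way to obtain this result is the add_row function of the tibble package. This is a sketch; the original call is not shown and may differ.

```r
library(tibble)

tb <- tibble(date = as.Date("2022-12-31") + 0:5, value = c(9, 6, 1, 7, 5, 8))

# Append one record; column names must match those of the tibble
tb7 <- add_row(tb, date = as.Date("2023-01-25"), value = 11)
```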
Concatenate records from multiple tibbles
The rbind() function allows us to concatenate the records of several tibbles that have the same variables. Let’s look at an example:
tb1 <- tibble(
date=as.Date("2022-12-31")+0:2,
value=c(9,6,1))
tb2 <- tibble(
date=as.Date("2022-12-31")+3:5,
value=c(7,5,8))
rbind(tb1,tb2)
## # A tibble: 6 × 2
## date value
## <date> <dbl>
## 1 2022-12-31 9
## 2 2023-01-01 6
## 3 2023-01-02 1
## 4 2023-01-03 7
## 5 2023-01-04 5
## 6 2023-01-05 8
2.1.0.4 Add a variable (column) to a tibble
To add a variable to a tibble, simply create a vector whose length equals the number of rows in the tibble and assign it to a new column.
## # A tibble: 6 × 3
## date value new_value
## <date> <dbl> <int>
## 1 2022-12-31 9 1
## 2 2023-01-01 6 2
## 3 2023-01-02 1 3
## 4 2023-01-03 7 4
## 5 2023-01-04 5 5
## 6 2023-01-05 8 6
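A minimal sketch of this step, assuming tb is the tibble used so far:

```r
library(tibble)

tb <- tibble(date = as.Date("2022-12-31") + 0:5, value = c(9, 6, 1, 7, 5, 8))

# New integer variable with one element per row
tb$new_value <- seq_len(nrow(tb))
```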
Change a value in the entire tibble
## # A tibble: 6 × 3
## date value new_value
## <date> <dbl> <int>
## 1 2022-12-31 9 1
## 2 2023-01-01 6 2
## 3 2023-01-02 1 3
## 4 2023-01-03 7 4
## 5 2023-01-04 NA NA
## 6 2023-01-05 8 6
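The output above shows the value 5 replaced by NA in both numeric variables. A column-by-column sketch that gives this result is shown below; the original code, which is not shown, may have used a more compact form.

```r
library(tibble)

tb <- tibble(date = as.Date("2022-12-31") + 0:5,
             value = c(9, 6, 1, 7, 5, 8),
             new_value = 1:6)

# Replace every occurrence of 5 by NA in the numeric variables
for (col in c("value", "new_value")) {
  hits <- which(tb[[col]] == 5)   # positions holding the value 5
  tb[[col]][hits] <- NA
}
```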
2.1.0.5 Select records using slice
## # A tibble: 3 × 3
## date value new_value
## <date> <dbl> <int>
## 1 2023-01-01 6 2
## 2 2023-01-02 1 3
## 3 2023-01-03 7 4
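A sketch of the selection with the slice function of dplyr, assuming records 2 to 4 are the ones requested, as the output suggests:

```r
library(tibble)
library(dplyr)

tb <- tibble(date = as.Date("2022-12-31") + 0:5,
             value = c(9, 6, 1, 7, NA, 8),
             new_value = c(1L, 2L, 3L, 4L, NA, 6L))

selected <- slice(tb, 2:4)   # records 2, 3 and 4
```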
Do an operation on all elements of selected variables
We are going to convert all the values of the value and new_value variables to type character:
## # A tibble: 6 × 3
## date value new_value
## <date> <chr> <chr>
## 1 2022-12-31 9 1
## 2 2023-01-01 6 2
## 3 2023-01-02 1 3
## 4 2023-01-03 7 4
## 5 2023-01-04 <NA> <NA>
## 6 2023-01-05 8 6
The mutate function belongs to the dplyr library and will be seen in detail later.
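As a hedged sketch, the conversion above can be written with mutate together with across, both from dplyr. The original call is not shown and may differ.

```r
library(tibble)
library(dplyr)

tb <- tibble(date = as.Date("2022-12-31") + 0:5,
             value = c(9, 6, 1, 7, NA, 8),
             new_value = c(1L, 2L, 3L, 4L, NA, 6L))

# Apply as.character() to the two selected variables only
tb_chr <- mutate(tb, across(c(value, new_value), as.character))
```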
Delete a record from a tibble
## # A tibble: 5 × 3
## date value new_value
## <date> <dbl> <int>
## 1 2022-12-31 9 1
## 2 2023-01-02 1 3
## 3 2023-01-03 7 4
## 4 2023-01-04 NA NA
## 5 2023-01-05 8 6
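The output corresponds to removing the second record. A sketch with negative indexing (tb_deleted is an illustrative name):

```r
library(tibble)

tb <- tibble(date = as.Date("2022-12-31") + 0:5,
             value = c(9, 6, 1, 7, NA, 8),
             new_value = c(1L, 2L, 3L, 4L, NA, 6L))

tb_deleted <- tb[-2, ]   # every record except the second
```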
2.1.0.6 Print information from a tibble
In addition to printing the tibble directly to the console, other ways to get information about the tibble are:
## # A tibble: 6 × 3
## date value new_value
## <date> <dbl> <int>
## 1 2022-12-31 9 1
## 2 2023-01-01 6 2
## 3 2023-01-02 1 3
## # ℹ 3 more rows
## tibble [6 × 3] (S3: tbl_df/tbl/data.frame)
## $ date : Date[1:6], format: "2022-12-31" "2023-01-01" ...
## $ value : num [1:6] 9 6 1 7 NA 8
## $ new_value: int [1:6] 1 2 3 4 NA 6
## date value new_value
## Min. :2022-12-31 Min. :1.0 Min. :1.0
## 1st Qu.:2023-01-01 1st Qu.:6.0 1st Qu.:2.0
## Median :2023-01-02 Median :7.0 Median :3.0
## Mean :2023-01-02 Mean :6.2 Mean :3.2
## 3rd Qu.:2023-01-03 3rd Qu.:8.0 3rd Qu.:4.0
## Max. :2023-01-05 Max. :9.0 Max. :6.0
## NA's :1 NA's :1
Lists
A list is a collection of objects that can have different structures and sizes. We will make little use of lists in this course since our main source of information will be data tables that we will manage with tibbles.
## $date
## [1] "2022-12-25"
##
## $value1
## [1] 1 2 3
##
## $value2
## [1] 1 2 3 4
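A sketch of a list with the three elements printed above, and of how its elements are accessed (the name l is illustrative):

```r
l <- list(date   = as.Date("2022-12-25"),
          value1 = 1:3,
          value2 = 1:4)

l$value1           # access an element by name: 1 2 3
l[["value2"]][4]   # fourth value of the element value2: 4
length(l)          # number of elements in the list: 3
```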
2.2 Functions in R
Functions simplify and organize software development by packaging code so that it can be reused later in a compact and simple way. In R, functions are usually written in files with a .R extension. To use the functions defined in a .R file from other R code, that file must be loaded at the beginning of the code, typically with a call to source().
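To illustrate loading functions from a file, the sketch below writes a one-line function to a temporary .R file and loads it with source(). The file and the function AddTwo are hypothetical; in practice the file would already exist.

```r
# Create a temporary .R file containing a function definition
f <- tempfile(fileext = ".R")
writeLines("AddTwo <- function(x, y) x + y", f)

source(f)                  # load the functions defined in the file
loaded <- AddTwo(2, 3)     # the function is now available: 5
```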
In general, functions have some input parameters, perform some type of operation and return something. The general way to define a function is
Next we will create a function to add 2 numbers and use it to calculate 2+3
## [1] 5
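A sketch of the general shape of a function definition and of a function that adds two numbers (AddNumbers is an illustrative name):

```r
# General shape:
#   FunctionName <- function(argument1, argument2, ...) {
#     ... operations ...
#     return(value)
#   }

AddNumbers <- function(x, y) {
  result <- x + y
  return(result)
}

total <- AddNumbers(2, 3)
total   # 5
```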
Most functions in the standard libraries, such as paste or as.Date, return another vector when they receive a vector as an argument, applying the function to each element of the vector. For example:
## [1] "ym() function converts 2008-01 into 2008-01-01"
## [2] "ym() function converts 2014-10 into 2014-10-01"
## [3] "ym() function converts 2023-12 into 2023-12-01"
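The output above was produced with ym() from lubridate. A base R sketch of the same element-wise behaviour, using paste and as.Date (the vector v is assumed from the output):

```r
v <- c("2008-01", "2014-10", "2023-12")

# paste0() and as.Date() both act on every element of v
first_days <- as.Date(paste0(v, "-01"))
msg <- paste(v, "becomes", first_days)
msg
```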
Likewise, when a function receives two arguments that are both vectors, it can return a matrix with the result of applying the function to each combination of the elements of the two vectors. Let’s see an example with the adist function to compare strings:
## [,1] [,2] [,3]
## [1,] 3 6 9
## [2,] 4 5 7
This ability of functions to act based on the type of argument they receive depends on how the function has been implemented.
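The vectors that produced the matrix above are not shown. A small sketch with assumed inputs illustrates the shape of the result: one row per element of the first vector and one column per element of the second.

```r
x <- c("a", "ab")             # assumed inputs
y <- c("a", "abc", "abcd")

m <- adist(x, y)   # 2 x 3 matrix of pairwise distances
m
dim(m)             # 2 3
```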
Control structures
The following control structures allow you to control the flow of code execution:
if-else: allows control flow through a condition.
for: generates a loop with an iterator.
while: generates a loop with a stop condition.
break: forces the termination of a loop.
next: forces the next iteration of a loop.
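The structures listed above can be combined as in this small sketch (the numbers are arbitrary):

```r
total <- 0
for (i in 1:10) {
  if (i %% 2 == 0) {
    next               # skip even numbers
  } else if (i > 7) {
    break              # stop the loop at the first odd number above 7
  }
  total <- total + i   # adds 1, 3, 5 and 7
}
total                  # 16

n <- 0
while (n < 3) {        # repeats until the condition is FALSE
  n <- n + 1
}
n                      # 3
```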
Logical operators
x == y: x is equal to y
x | y: x OR y
x & y: x AND y
!x: not x
isTRUE(x): checks if x is TRUE
is.na(x): checks if x is NA
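A short sketch of these operators acting on a logical vector that contains an NA:

```r
x <- c(TRUE, FALSE, NA)

eq  <- (3 == 3)     # TRUE
or  <- x | TRUE     # TRUE TRUE TRUE
and <- x & TRUE     # TRUE FALSE NA
neg <- !x           # FALSE TRUE NA
one <- isTRUE(x[1]) # TRUE (isTRUE expects a single value)
mis <- is.na(x)     # FALSE FALSE TRUE
```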
Examples of functions
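We are going to implement a function that calculates the mean of a vector of numbers, leaving out any unavailable values (NA):
MyMean <- function(V){
  # accumulate the sum and the count of the non-NA values
  total <- 0
  Nvalues <- 0
  for(i in seq_along(V)){   # seq_along() also handles an empty vector safely
    if(is.na(V[i])) next    # skip unavailable values
    total <- total + V[i]
    Nvalues <- Nvalues + 1
  }
  return(total/Nvalues)
}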
V <- c(1.,NA,2.,3.)
MyMean(V)
## [1] 2
This same result can be obtained by calling the mean function after first omitting the NA values from the calculation.
## [1] 2
The na.omit function is widely used when we want to do numerical calculations on vectors that can contain NA.
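A sketch of the call with na.omit, together with the equivalent na.rm argument of mean:

```r
V <- c(1, NA, 2, 3)

m1 <- mean(na.omit(V))       # drop the NA first, then average: 2
m2 <- mean(V, na.rm = TRUE)  # same result using the na.rm argument
m3 <- mean(V)                # NA, because the missing value propagates
```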
Let’s now implement a function that finds the first position in a vector of strings sV where the string s is located. If the string is not found, NA is returned:
FindString <- function(s,sV){
  i <- 1
  result <- NA
  while(i <= length(sV)){
    if(isTRUE(s == sV[i])){   # isTRUE() treats a comparison with NA as "no match"
      result <- i
      break
    }
    i <- i + 1
  }
  return(result)
}
sV <- c("abc","cde","efg")
FindString("cde",sV)
## [1] 2
## [1] NA
A similar result can be obtained directly using the which function, with two differences: which returns every matching position rather than only the first, and if it does not find the string it returns an empty vector, integer(0), instead of NA:
## [1] 2
## integer(0)
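A sketch of the calls with which, plus the base function match, which behaves like FindString: it returns the first position or NA. The string "xyz" is an assumed example of a string that is not in the vector.

```r
sV <- c("abc", "cde", "efg")

found     <- which(sV == "cde")   # 2
not_found <- which(sV == "xyz")   # integer(0), an empty vector

# match() returns the first position, or NA when there is no match
m_found     <- match("cde", sV)   # 2
m_not_found <- match("xyz", sV)   # NA
```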
The scope of functions
Functions in any language are the basis for sharing code and functionality between programs and users. At the first level, functions in R are stored in files with a .R extension. For example, we have created the file utilidades.R, where we have implemented some functions that we will need to use. To use the functions implemented in utilidades.R in my code, I load the file before calling them, for example with source("utilidades.R").
At the second level, functions are packaged in libraries that we can share with other users, usually through a personal repository on GitHub or through the CRAN repository. There are thousands of libraries in R, and a problem that arises quite frequently is that several libraries that we use at the same time have functions with the same name that do different things.
By default, the function that the code uses is normally that of the last library that was loaded, but sometimes the last one loaded is not the one we want to use at that moment. For example, the dplyr library has functions like filter() or rename() that easily conflict with similarly named functions from other libraries. These conflicts easily generate errors when the scripts run. When this occurs, we have to specify the library we want to use in the function call. For example, when we call dplyr::filter() we are specifying that we want to use the filter() function from the dplyr library. In this way we avoid these possible conflicts. Ideally, every function call would specify the library used, but this is not usually done because it makes the code slower to write. However, if we want to publish our code in the CRAN repository, we will be required to make explicit which library each function comes from, precisely to avoid conflicts between libraries. Usually, the best way to minimize errors is to load the libraries in reverse order of their importance. For example, tidyverse is very important and should be one of the last to load.
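The conflict can be seen with filter(), which exists both in dplyr and in the stats package that ships with R. In this sketch (with a made-up data frame df and assuming dplyr is installed), the prefix makes explicit which one is called.

```r
library(dplyr)

df <- data.frame(x = 1:5)

# dplyr::filter() keeps the rows that satisfy a condition
rows <- dplyr::filter(df, x > 3)

# stats::filter() applies a linear filter, here a 3-point moving average
ma <- stats::filter(1:5, rep(1/3, 3))

nrow(rows)       # 2
as.numeric(ma)   # NA 2 3 4 NA
```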
There are so many people working on R that we often find the same functionality implemented in several libraries. To decide which library to use, we can combine two criteria: the first is the number of downloads the library has had on CRAN, and the second is whether the library has been updated recently.
The most immediate way to get help on a function is to run the help(FunctionName) command in the console.
The apply family of functions
Suppose we have two string vectors, sV1 and sV2, that we want to compare. Using the FindString(s,sV) function implemented above, we could write a loop over all the elements of sV1 and compare each one with sV2. However, there is a more compact way to do this comparison using the apply family of functions, which allow loops and other types of nested operations to be expressed in a single call. For example, the sapply and lapply functions can be used to apply FindString(s,sV) to each element. The difference between them is that lapply returns a list, whereas sapply simplifies the result to a vector when possible. Let’s see how sapply applies, with a single statement, the function FindString(s,sV) to all the elements of sV1:
## abc cde efg
## 2 3 1
The parameters of sapply() are: (1) the vector sV1, (2) the function FindString that acts on the elements of sV1, and (3) sV=sV2, the second argument of the function FindString.
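The call itself is not shown. The sketch below reproduces the printed result under the assumption that sV1 and sV2 contain the same three strings in different orders. FindString is redefined here in compact form so that the example is self-contained.

```r
# Compact version of the FindString function described earlier
FindString <- function(s, sV) {
  for (i in seq_along(sV)) {
    if (isTRUE(s == sV[i])) return(i)
  }
  NA
}

sV1 <- c("abc", "cde", "efg")   # assumed vectors
sV2 <- c("efg", "abc", "cde")

res_vector <- sapply(sV1, FindString, sV = sV2)   # named vector: 2 3 1
res_list   <- lapply(sV1, FindString, sV = sV2)   # same values, as a list
```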
The functions of the apply family allow us to perform more complex operations that we will not study in this course.
The pipe %>% or |> operator
The %>% pipe operator comes from the magrittr package and is available after loading the tidyverse; since version 4.1, R itself also includes the native pipe |>. Both allow function calls to be chained more clearly and without the need for nested parentheses. That is, if we normally write the composition as
function2(function1(x))
using the pipe %>% operator, the same call would be written as:
x %>% function1() %>% function2()
As we will see, on many occasions this way of writing function calls, feeding the output of one operation into the input of the next, is more practical and natural than the usual way, which has the problem that the last thing we do (the call to the last function) is the first thing we have to write; that is, we write in the opposite order to the one in which the operations are performed.
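A small sketch comparing the nested form with the piped form. It uses the native pipe |>, so it assumes R 4.1 or later; with the tidyverse loaded, %>% could be used in the same positions.

```r
x <- c(1, 4, 9)

# Nested form: written from the last operation to the first
nested <- round(mean(sqrt(x)), 1)

# Piped form: written in the order in which the operations are performed
piped <- x |> sqrt() |> mean() |> round(1)

identical(nested, piped)   # TRUE
```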