Chapter 4 R programming

Tien Thuy Bui, Msc

Edited by: Nhan Thi Ho, MD, PhD

4.1 Main content

  • Data types in R: 1D, 2D, special data types
  • Subsetting
  • Control flow: choices, loops
  • Functions
  • Errors and error handling in practice
  • Coding styles
  • How to write a package

4.2 Data structures in R

  • R’s base data structures can be organised by their dimensionality (1d, 2d, or nd) and whether they’re homogeneous (all contents must be of the same type) or heterogeneous (the contents can be of different types). This gives rise to the five data types most often used in data analysis:
Dimension Homogeneous Heterogenuous
1D Atomic vector List
2D Matrix Data frame, Tibble
n-D Array
  • Almost all other objects are built upon these foundations
  • Individual numbers or strings or boolean, are actually vectors of length one.

4.2.1 Atomic vector

  • There are four common types of atomic vectors: logical, integer, double (often called numeric), and character.
  • All elements of an atomic vector must be the same type. If not, coercion.
  • Atomic vectors are usually created with c(), short for combine:
dbl_var <- c(1, 2.5, 4.5)

# With the L suffix, we get an integer rather than a double
int_var <- c(1L, 6L, 10L)

# Use TRUE and FALSE (or T and F) to create logical vectors
log_var <- c(TRUE, FALSE, T, F)

chr_var <- c("these are", "some strings")
  • Atomic vectors are always flat, even if we nest c()’s:
c(1, c(2, c(3, 4)))
## [1] 1 2 3 4
# the same as
c(1, 2, 3, 4)
## [1] 1 2 3 4

4.2.1.1 Type and test

Given a vector, we can determine its type with typeof(), or check if it’s a specific type with an “is” function: is.character(), is.double(), is.integer(), is.logical(), or, more generally, is.atomic() .

dbl_var <- c(1, 2.5, 4.5)
typeof(dbl_var)
## [1] "double"
is.double(dbl_var)
## [1] TRUE
is.atomic(dbl_var)
## [1] TRUE

4.2.1.2 Coercion

  • All elements of an atomic vector must be the same type, so when we attempt to combine different types they will be coerced to the most flexible type. Types from least to most flexible are: logical, integer, double, and character.
str(c("a", 1))
##  chr [1:2] "a" "1"
  • When a logical vector is coerced to an integer or double, TRUE becomes 1 and FALSE becomes 0. This is very useful in conjunction with sum() and mean().
x <- c(FALSE, FALSE, TRUE)
as.numeric(x)
## [1] 0 0 1
# Total number of TRUEs
sum(x)
## [1] 1
# Proportion that are TRUE
mean(x)
## [1] 0.3333333

4.2.1.3 Naming a vector

We can name a vector in three ways:

# When creating it: 
x <- c(a = 1, b = 2, c = 3)

# By assigning a character vector to names()
x <- 1:3
names(x) <- c("a", "b", "c")

# Inline, with setNames():
x <- setNames(1:3, c("a", "b", "c"))

x
## a b c 
## 1 2 3

4.2.2 Factor

  • Used for Categorical data, where values come from a fixed set of levels recorded in factor vectors.
  • levels of factor defines the set of allowed values.
x <- factor(c("a", "b", "b", "a"))
x
## [1] a b b a
## Levels: a b
  • When reading in csv data (or any table-formatted data), R functions like read.csv and data.frame automatically conver character vectors into factors.
  • If we want to suppress this behaviour, set the argument stringsAsFactors = FALSE

4.2.2.1 Ordered factor

  • For ordinal variable (categorical with clear ordering)
grade <- ordered(c("b", "b", "a", "c"), levels = c("c", "b", "a"))
grade
## [1] b b a c
## Levels: c < b < a

4.2.2.2 Other special vectors

  • Dates: built on top of double vector
# Date
today <- Sys.Date()
str(today)
##  Date[1:1], format: "2025-03-10"
date <- as.Date("1970-02-01")
str(date)
##  Date[1:1], format: "1970-02-01"
  • Date-times
# Date-times
now_ct <- as.POSIXct("2018-08-01 22:00", tz = "UTC")
str(now_ct)
##  POSIXct[1:1], format: "2018-08-01 22:00:00"
  • Duration
# Duration 
one_week_1 <- as.difftime(1, units = "weeks")
one_week_1
## Time difference of 1 weeks

4.2.3 Matrix

  • Similar to atomic vector, matrix contains data of the same type, but in 2-dimension
  • we can create matrices and arrays with matrix(), or by using the assignment form of dim():
a <- matrix(1:6, nrow = 2, ncol = 3)
a
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
# We can also modify an object in place by setting dim()
b <- 1:6
dim(b) <- c(3, 2)
b
##      [,1] [,2]
## [1,]    1    4
## [2,]    2    5
## [3,]    3    6

4.2.4 List

  • Lists are a step up in complexity from atomic vectors: each element can be any type, not just vectors.
  • We can construct a list with list() function
  • We can name components of a list by names() function, or directly name inside the list function
  • The typeof() a list is list. we can test for a list with is.list(), and coerce to a list with as.list() .
my_list2 <- list("my_df" = data.frame("col1"=1:5,
                                      "col2"=letters[1:5]), 
                 "my_char"="abc",
                 "my_num_list" = 1:10 )
my_list2
## $my_df
##   col1 col2
## 1    1    a
## 2    2    b
## 3    3    c
## 4    4    d
## 5    5    e
## 
## $my_char
## [1] "abc"
## 
## $my_num_list
##  [1]  1  2  3  4  5  6  7  8  9 10
names(my_list2)[3] <- "new_name_for_num_vec"
my_list2
## $my_df
##   col1 col2
## 1    1    a
## 2    2    b
## 3    3    c
## 4    4    d
## 5    5    e
## 
## $my_char
## [1] "abc"
## 
## $new_name_for_num_vec
##  [1]  1  2  3  4  5  6  7  8  9 10

4.2.5 Dataframe and tibble

  • Built on top of lists, both are named lists of vectors
  • Tibbles are provided by the tibble package and share the same structure as data frames. The only difference is that the class vector is longer, and includes tbl_df. This allows tibbles to behave differently :
  1. Coercion: tibbles never coerce their input
library(tibble)
df1 <- data.frame(x = 1:3, y = letters[1:3]) # y coerced into factor
df2 <- tibble(x = 1:3, y = letters[1:3]) # y not coerced
str(df1)
## 'data.frame':    3 obs. of  2 variables:
##  $ x: int  1 2 3
##  $ y: chr  "a" "b" "c"
str(df2)
## tibble [3 × 2] (S3: tbl_df/tbl/data.frame)
##  $ x: int [1:3] 1 2 3
##  $ y: chr [1:3] "a" "b" "c"
  1. Name transformation: While data frames automatically transform non-syntactic names (unless check.names = FALSE), tibbles do not.
names(data.frame(`1` = 1))
## [1] "X1"
names(tibble(`1` = 1))
## [1] "1"
  1. Recylcing rule: While every element of a data frame (or tibble) must have the same length, both data.frame() and tibble() will recycle shorter inputs. However, while data frames automatically recycle columns that are an integer multiple of the longest column, tibbles will only recycle vectors of length one.
# data frame recycle
data.frame(x = 1:4, y = 1:2)
##   x y
## 1 1 1
## 2 2 2
## 3 3 1
## 4 4 2
# second column is of length 1, recycling happens without error
tibble(x = 1:4, y=1)
## # A tibble: 4 × 2
##       x     y
##   <int> <dbl>
## 1     1     1
## 2     2     1
## 3     3     1
## 4     4     1
# this will throw an error
tibble(x = 1:4, y=1:2)
## Error in `tibble()`:
## ! Tibble columns must have compatible sizes.
## • Size 4: Existing data.
## • Size 2: Column `y`.
## ℹ Only values of size one are recycled.
  1. tibble() allows us to refer to variables created during construction, while data frame cannot.
tibble(
  x = 1:3,
  y = x * 2   # dataframe cannot do this
)
## # A tibble: 3 × 2
##       x     y
##   <int> <dbl>
## 1     1     2
## 2     2     4
## 3     3     6

4.2.6 NULL

  • NULL has a unique type, is always length zero
  • Commonly used to an empty vector (a vector of length zero) of arbitrary type. For example, if we use c() but don’t include any arguments, we get NULL, and concatenating NULL to a vector will leave it unchanged.
c()
## NULL

4.3 Subsetting

  • There are three subsetting operators, [[, [, and $.
  • Subsetting operators interact differently with different vector types (e.g., atomic vectors, lists, factors, matrices, and data frames).
  • Subsetting can be combined with assignment.

4.3.1 Subsetting atomic vector

  • Positive integers return elements at the specified positions. The first element is at position 1 (most programming language the first element is at position 0)
  • Negative integers exclude elements at the specified positions
x <- c(2.1, 4.2, 3.3, 5.4)
x[c(1,3,5)] # position 5 is linked in NA
## [1] 2.1 3.3  NA
x[c(-1,-3)] # exclude position 1 and 3
## [1] 4.2 5.4
  • Logical vectors select elements where the corresponding logical value is TRUE. This is probably the most useful type of subsetting because we write the expression that creates the logical vector:
  • If the logical vector is shorter than the vector being subsetted, it will be recycled to be the same length.
x <- c(2.1, 4.2, 3.3, 5.4)
# exclude element 2 and 4. Notice how the recycling rule is applied.
x[c(TRUE, FALSE)]
## [1] 2.1 3.3
# get values above 3. Notice the logical statement
x[x>3]
## [1] 4.2 3.3 5.4
x>3
## [1] FALSE  TRUE  TRUE  TRUE
  • If the vector is named, we can also use character vectors to return elements with matching names.
names(x) <- letters[1:4]
# subset by names
x[c("a","c")]
##   a   c 
## 2.1 3.3
  • Note that $ is invalid for atomic vector.

4.3.2 Subsetting a list

  • Subsetting a list works in the same way as subsetting an atomic vector. Using [ always returns a list; [[ and $ let we pull out elements of a list. $ must be followed by name of a single element.
my_list <- list("a" = 1:5, "b"=letters[3:7], "c"=LETTERS[9:15])
my_list
## $a
## [1] 1 2 3 4 5
## 
## $b
## [1] "c" "d" "e" "f" "g"
## 
## $c
## [1] "I" "J" "K" "L" "M" "N" "O"
# access the first element only, return the vector 1:5
my_list[[1]]  
## [1] 1 2 3 4 5
my_list[["a"]]
## [1] 1 2 3 4 5
# access the first element only by its name, return a vector 1:5
my_list$a
## [1] 1 2 3 4 5
# access the first and third element, return a sub-list
my_list[c(1,3)]
## $a
## [1] 1 2 3 4 5
## 
## $c
## [1] "I" "J" "K" "L" "M" "N" "O"

4.3.3 Subsetting matrix

  • The most common way: supply a 1D index for each dimension, separated by a comma.
  • Blank subsetting lets us keep all rows or all columns.
a <- matrix(1:9, nrow = 3)
colnames(a) <- c("A", "B", "C")
a
##      A B C
## [1,] 1 4 7
## [2,] 2 5 8
## [3,] 3 6 9
# first 2 rows, all columns
a[1:2, ]
##      A B C
## [1,] 1 4 7
## [2,] 2 5 8
# return row #1 and #3 of column B and A. Notice how recylcing rule applies here.
a[c(TRUE, FALSE), c("B", "A")]
##      B A
## [1,] 4 1
## [2,] 6 3
# row 0 is empty. exclude the 2nd column
a[0, -2]
##      A C
  • We can also subset with single 1D vector. Note that matrices/arrays in R are stored in column-major order
vals <- outer(1:5, 1:5, FUN = "paste", sep = ",")
vals
##      [,1]  [,2]  [,3]  [,4]  [,5] 
## [1,] "1,1" "1,2" "1,3" "1,4" "1,5"
## [2,] "2,1" "2,2" "2,3" "2,4" "2,5"
## [3,] "3,1" "3,2" "3,3" "3,4" "3,5"
## [4,] "4,1" "4,2" "4,3" "4,4" "4,5"
## [5,] "5,1" "5,2" "5,3" "5,4" "5,5"
vals[c(4, 15)]
## [1] "4,1" "5,3"

4.3.4 Subsetting dataframe

  • Data frames have the characteristics of both lists and matrices:
  • When subsetting with a single index, they behave like lists and index the columns, so df[1:2] selects the first two columns.
  • When subsetting with two indices, they behave like matrices, so df[1:3, ] selects the first three rows and all the columns.
my_df <- data.frame(x = 1:3, y = 3:1, z = letters[1:3])
my_df
##   x y z
## 1 1 3 a
## 2 2 2 b
## 3 3 1 c
my_df[my_df$x == 2, ]
##   x y z
## 2 2 2 b
my_df[c(1, 3), ]
##   x y z
## 1 1 3 a
## 3 3 1 c
# There are two ways to select columns from a data frame
# Like a list
my_df[c("x", "z")]
##   x z
## 1 1 a
## 2 2 b
## 3 3 c
# Like a matrix
my_df[, c("x", "z")]
##   x z
## 1 1 a
## 2 2 b
## 3 3 c
  • If we select a single column, matrix subsetting returns a simplified vector, while list subsetting returns a dataframe.
  • $ operator also returns a vector
# subset like a list, returns a 1-column data frame
str(my_df["x"])
## 'data.frame':    3 obs. of  1 variable:
##  $ x: int  1 2 3
# subset like a matrix, returns a vector
str(my_df[, "x"])
##  int [1:3] 1 2 3
# subset with $, returns a vector
str(my_df$x)
##  int [1:3] 1 2 3

4.4 control flow

  • There are two primary tools of control flow: choices and loops.
  • Choices, like if statements and switch() calls, allow us to run different code depending on the input. - Loops, like for and while, allow us to repeatedly run code, typically with changing options.

4.4.1 Choices (if - else)

The basic syntax of an if statement in R:

if (condition) { true_action }
if (condition) { true_action } else { false_action }
  • Condition to evaluate must have a single value of either TRUE or FALSE
  • Condition is wrapped inside ( )
  • Typically the actions are compound statements contained within { }
x = 56.1

if (x > 90) {
  print("A")
} else if (x > 80) {
  print("B")
} else if (x > 50) {
  print("C")
} else {
  print("F")
}
## [1] "C"

4.4.2 switch statememt

Closely related to if is the switch()-statement. It’s a compact, special purpose equivalent that lets we replace code like:

x = "a"
if (x == "a") {
  "option 1"
} else if (x == "b") {
  "option 2" 
} else if (x == "c") {
  "option 3"
} else {
  stop("Invalid `x` value")
}
## [1] "option 1"

with the more succinct:

x = "a"
switch(x,
  a = "option 1",
  b = "option 2",
  c = "option 3",
  stop("Invalid `x` value")
)
## [1] "option 1"

It is recommended to use switch() only with character inputs.

4.5 Loops

4.5.1 for loops

  • for loops are used to iterate over items in a vector. They have the following basic form: (Notice the use of () - wraps around the sequence and {} - wrap around loop body )
for (item in vector) { 
  perform_action 
}
  • For each item in vector, perform_action is called once; updating the value of item each time. Consider this very simple example that calculates median value of each column in a dataframe.
set.seed(10)
my_df <- data.frame(a = rnorm(10),
                    b = rnorm(10),
                    c = rnorm(10),
                    d = rnorm(10))

median_vec <- vector("double", ncol(my_df))  # 1. hold-out vector
for (i in seq_along(my_df)) {                # 2. sequence
  median_vec[i] <- median(my_df[, i])        # 3. body
}
median_vec
## [1] -0.3100772  0.6121843 -0.6812107 -0.4880535

4.5.2 seq_along()

  • The seq_along() function: generates a sequence the same length of the argument passed. In R, the length of a dataframe is the number of columns.
# create a vector from 1 to 4 (starts from 1 by default)
seq_along(my_df) # 4 columns
## [1] 1 2 3 4

4.5.3 Unknown output length

In the case we want to simulate some random vectors of random lengths. we might be tempted to solve this problem by progressively growing the vector (time complexity is O(n^2)).

means <- c(0, 1, 2)
# not efficient
output <- NULL
for (i in seq_along(means)) {
  n <- sample(100, 1)
  output <- c(output, rnorm(n, means[[i]]))
}
str(output)

But this is not very efficient because in each iteration, R has to copy all the data from the previous iterations. This results in O(n^2) complexity! A better solution to initialize a vector or list of fixed length.

means <- c(0, 1, 2)
# efficient
out <- vector("list", length(means))
out
## [[1]]
## NULL
## 
## [[2]]
## NULL
## 
## [[3]]
## NULL
for (i in seq_along(means)) {
  n <- sample(100, 1)
  out[[i]] <- rnorm(n, means[[i]])
}
str(unlist(out))
##  num [1:84] -0.0894 0.3148 -2.0605 -0.5998 0.9712 ...

We use unlist() to flatten a list of vectors into a single vector.

4.5.4 Terminate a loop early

  • next exits the current iteration, and move to the next iteration
  • break exits the entire for loop.
for (i in 1:10) {
  if (i < 3) 
    next
  print(i)
  if (i >= 5)
    break
}
## [1] 3
## [1] 4
## [1] 5

4.5.5 Looping patterns

  • The most general pattern is looping over the numeric indices with for (i in seq_along(xs)), and extracting the value with x[[i]].
  • We can loop over the names: for (nm in names(xs)). We can use to access the value with xs[[nm]]. This is useful if we want to use the name in a plot title or a file name. If we’re creating named output, make sure to name the results vector like so:
results <- vector("list", length(x))
names(results) <- names(x)
  • We can loop over the elements: for (x in xs). This is most useful if we only care about side-effects, like plotting or saving a file, because it’s difficult to save the output efficiently this way.

4.5.6 Looping with While and repeat

  • for loops are useful if we know in advance the set of values that we want to iterate over. If we don’t know, there are two related tools with more flexible specifications:

  • while(condition) action : performs action while condition is TRUE.

  • repeat(action) : repeats action forever (i.e. until it encounters break).

  • R does not have an equivalent to the do {action} while (condition) syntax found in other languages.

  • We can rewrite any for loop to use while instead, and we can rewrite any while loop to use repeat, but the converses are NOT true. That means while is more flexible than for, and repeat is more flexible than while. It’s good practice, however, to use the least-flexible solution to a problem, so we should use for wherever possible.

# Example with while
i = 3
while(i <= 5) {
  print(i)
  i = i + 1 # there should be an increment to update the index. What happend if we dont?
}
## [1] 3
## [1] 4
## [1] 5
# Example with repeat
i = 3
repeat {
  print(i)
  if (i >= 5)
    break
  i = i + 1 # there should be an increment to update the index. What happend if we dont?
}
## [1] 3
## [1] 4
## [1] 5

4.6 The apply family functions

  • Manipulate slices of data from matrices, arrays, lists and dataframes in a repetitive way.
  • Cross the data in a number of ways and avoid explicit use of loop constructs
  • Act on an input list, matrix or array and apply a named function with one or several optional arguments
  • Made up of the apply(), lapply() , sapply(), mapply(), vapply(), rapply(), and tapply()
  • We will touch on the first 4 functions. We can explore the rest with [this tutorial] (https://ademos.people.uic.edu/Chapter4.html#3_lapply,_sapply,_and_vapply)

4.6.1 How to use apply()

Operates on arrays (including the 2D matrices). General function call:

apply(X, MARGIN, FUN, ...)

Where: - X is an array or a matrix - MARGIN is a variable defining how the function is applied: when MARGIN=1, it applies over rows, whereas with MARGIN=2, it works over columns. If MARGIN=c(1,2), it applies to both rows and columns; and - FUN, which is the function that we want to apply to the data. - Additional arguments to FUN are passed to ...

# Construct a 5x6 matrix
X <- matrix(rnorm(30), nrow=5, ncol=6)
X
##            [,1]       [,2]       [,3]       [,4]        [,5]       [,6]
## [1,]  0.2487580  0.6334359  0.4953317 -1.2051854  0.53064987 -0.2498675
## [2,] -1.0626228 -1.9968156  0.7258175 -1.9632525  0.10198345  1.1551047
## [3,] -0.3639822 -0.6818322  0.6672987  1.4707523  1.33778247 -0.8647272
## [4,] -1.2069949 -0.4600555  0.9547864  0.3724723  0.08723477 -0.8666783
## [5,]  1.4292128 -0.9830692 -1.6753322  1.0658793 -0.39110421 -2.3210170
# Sum the values of each column with `apply()`
apply(X, MARGIN=2, FUN=sum)
## [1] -0.9556291 -3.4883366  1.1679022 -0.2593339  1.6665463 -3.1471854

Looping through a vector with apply() will throw an error (why?).

vec <- c(1:10)
apply(vec, 1, sum)
# Error in apply(vec, 1, sum) : dim(X) must have a positive length
How it works:
How it works:

4.6.2 The lapply() and sapply() functions

  • The general function call:
*apply(X, FUN, ...)
  • The lapply() applies a given function to every element of an object such as list/vector/dataframe/…, and results in a list which has the same number of elements as the object passed to it.
  • It does not matter what the input data type for X is, the output of lapply() is always a list. To suppess this, feed simplify = TRUE into argument (default value is FALSE).
  • sapply works the same way, but it tries to simplify the output (usually it returns a vector for 1 dimension output or array for higher dimension).
# Create a list of matrices
A <- matrix(1:9, ncol = 3)
B <- matrix(1:16, ncol = 4)
C <- cbind(8:10, 8:10)

my_list <- list(A=A, B=B, C=C) # this is a named list
my_list
## $A
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9
## 
## $B
##      [,1] [,2] [,3] [,4]
## [1,]    1    5    9   13
## [2,]    2    6   10   14
## [3,]    3    7   11   15
## [4,]    4    8   12   16
## 
## $C
##      [,1] [,2]
## [1,]    8    8
## [2,]    9    9
## [3,]   10   10
# Extract the 2nd column from `my_list` with the selection operator `[` with `lapply()`
# Notice how we pass 2 arguments into the subset operator
lapply(X = my_list, FUN = "[", , 2)
## $A
## [1] 4 5 6
## 
## $B
## [1] 5 6 7 8
## 
## $C
## [1]  8  9 10
# Extract the first 2 rows of the 2nd column from `my_list` with the selection operator `[` with `lapply()`
sapply(X = my_list, FUN = "[", 1:2, 2)
##      A B C
## [1,] 4 5 8
## [2,] 5 6 9

4.6.3 mapply()

  • multivariate version of sapply
  • General function call:
mapply(X, FUN, MoreArgs, ...)
  • Let us take an example
# times and x are arguments of function rep. Here we apply the rep() functions to 4 values: 1,2,3,4
mapply(x = 1:4, FUN = rep, times = 4)
##      [,1] [,2] [,3] [,4]
## [1,]    1    2    3    4
## [2,]    1    2    3    4
## [3,]    1    2    3    4
## [4,]    1    2    3    4

4.7 User defined functions

  • R functions are objects in their own right, a R language property often called “first-class functions”.
  • The format for defining our own function is
function|Name <- function(arg1, arg2, arg3=user_default_value){
  statement1
  statement2
}
  • We can set defaulted value to arguments, and can override them when calling the function

4.7.1 User defined functions - Example

  • We will define a function that performs one sample t-test
set.seed(1)
oneSampleTTest <- function(input.data, mu0){
  n    <- length(input.data)
  xbar <- mean(input.data)
  s    <- sd(input.data)
  t    <- (xbar - mu0)/(s / sqrt(n))
  if( t < 0 ){
    p.value <- 2 * pt(t, df=n-1)
  }else{
    p.value <- 2 * (1-pt(t, df=n-1))
  }
  cat('t =', t, 'and p.value =', p.value, "\n")
  return( list("t"= t, "p.value" = p.value) )
}

test_data <- runif(25, min = 0, max = 25) # 25 random numbers between the values 0 and 10
ttest_val1 <- oneSampleTTest(test_data, mu0=2 )
## t = 7.672742 and p.value = 6.556742e-08
ttest_val1
## $t
## [1] 7.672742
## 
## $p.value
## [1] 6.556742e-08

4.7.2 Default arguments

  • Error occurs if we do not pecify both argument
# test_data <- runif(25, min = 0, max = 25)
oneSampleTTest(test_data)
## Error in oneSampleTTest(test_data): argument "mu0" is missing, with no default
  • We can specify the defaulted value for arguments when defining function.
oneSampleTTest1 <- function(input.data, mu0 = 1){
  n    <- length(input.data)
  xbar <- mean(input.data)
  s    <- sd(input.data)
  t    <- (xbar - mu0)/(s / sqrt(n))
  if( t < 0 ){
    p.value <- 2 * pt(t, df=n-1)
  }else{
    p.value <- 2 * (1-pt(t, df=n-1))
  }
  cat('t =', t, 'and p.value =', p.value, "\n")
  return( list("t"= t, "p.value" = p.value) )
}

# test_data <- runif(25, min = 0, max = 25) # 25 random numbers between the values 0 and 10
oneSampleTTest1(test_data )
## t = 8.352068 and p.value = 1.460246e-08
## $t
## [1] 8.352068
## 
## $p.value
## [1] 1.460246e-08

4.7.3 Missing arguments

  • Alternatively, we can check existence of the argument value with if test via function missing()
oneSampleTTest2 <- function(input.data){
  if(missing(mu0) ){
    mu0 <- 1
  }
  
  n    <- length(input.data)
  xbar <- mean(input.data)
  s    <- sd(input.data)
  t    <- (xbar - mu0)/(s / sqrt(n))
  if( t < 0 ){
    p.value <- 2 * pt(t, df=n-1)
  }else{
    p.value <- 2 * (1-pt(t, df=n-1))
  }
  cat('t =', t, 'and p.value =', p.value, "\n")
  return( list("t"= t, "p.value" = p.value) )
}
oneSampleTTest2(test_data )
## Error in oneSampleTTest2(test_data): 'missing(mu0)' did not find an argument

4.7.4 Invoke a function with do.call

do.call() has two arguments. The function to call, and a list containing the function arguments:

do.call(oneSampleTTest, list(test_data, 2) )
## t = 7.672742 and p.value = 6.556742e-08
## $t
## [1] 7.672742
## 
## $p.value
## [1] 6.556742e-08

4.7.5 … (dot dot dot) argument

Many functions in R take an arbitrary number of inputs:

sum(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
## [1] 55
  • These functions rely on a special argument: … (pronounced dot-dot-dot). This special argument captures any number of arguments that aren’t otherwise matched
  • It’s useful because we can then send those … on to another function.
commas <- function(...) paste(..., collapse = ", ")
commas(letters[1:10])
## [1] "a, b, c, d, e, f, g, h, i, j"
# pass ... to another function
printCommas <- function(...){
  print(paste("There are", length(...), "elements:", commas(...) ))
}
printCommas(letters[1:10])
## [1] "There are 10 elements: a, b, c, d, e, f, g, h, i, j"

4.7.6 Defining a function inside apply() functions

Functions can be defined inside an *apply() call. For example:

# Create a function in the arguments
apply(X, MARGIN=1, FUN = function(x) (sum(x)+10)/length(x) )
## [1] 1.742187 1.160036 1.927549 1.480127 1.187428

4.7.7 Choosing names for arguments

  • The names of the arguments are also important. R doesn’t care, but the readers of our code (including future-us!) will.
  • Generally we prefer longer, more descriptive names, but there are a handful of very common, very short names:
  • x, y, z: vectors.
  • w: a vector of weights.
  • df: a data frame.
  • i, j: numeric indices (typically rows and columns).
  • n: length, or number of rows.
  • p: number of columns.

4.7.8 Name masking

Names defined inside a function mask names defined outside a function. This is illustrated in the following example:

x <- 10
y <- 20
g02 <- function() {
  x <- 1
  y <- 2
  cat("x =", x, "y =", y)
}
g02()
## x = 1 y = 2

If a name isn’t defined inside a function, R looks one level up.

x <- 2
g03 <- function() {
  y <- 1
  cat("x =", x, "y =", y)
}
g03()
## x = 2 y = 1

What if we call y in global environment?

y
## [1] 20

The same rules apply if a function is defined inside another function. First, R looks inside the current function. Then, it looks where that function was defined (and so on, all the way up to the global environment). Finally, it looks in other loaded packages.

x <- 1
g04 <- function() {
  y <- 2
  i <- function() {
    z <- 3
    c(x, y, z)
  }
  i()
}
g04()
## [1] 1 2 3

4.7.9 Dynamic lookup and lazy evaluation

my_func <- function(x) {
  x + y
} 
  • In many programming languages, this would be an error, because y is not defined inside the function. In R, this is valid code because:
  1. The function only evaluated when it is called. We have not called function my_func() yet. This behaviour is called lazy evaluation
  2. R will try to find the value associated with y when the function is called.
  • Since y is not defined inside the function, R will look in the environment where the function was defined. Within the scope of this class, every function is created within the global environment of R.
y <- 100
my_func(10)
## [1] 110
y <- 1000
my_func(10)
## [1] 1010
  • Disadvantage: bugs-proned. If we make a spelling mistake in our code, we won’t get an error message when we create the function. And if the variable with the same name is defined in global environment, we will not even get an error message when we run the function.
    To detect this problem, use codetools::findGlobals(). This function lists all the external dependencies (unbound symbols) within a function:
codetools::findGlobals(my_func)
## [1] "{" "+" "y"

4.7.10 Return values - Implicit

There are two ways that a function can return a value: implicitly (without a return statement) and explicitly (with a return statement)
For implicit return, the last evaluated expression is the return value.

j01 <- function(x) {
  if (x < 10) {
    0
  } else {
    10
  }
}

The same example, with a return statement:

j02 <- function(x) {
  if (x < 10) {
    return(0)
  } else {
    return(10)
  }
}

4.7.11 Invisible values

From the previous example, the functions return visibly: calling the function in an interactive context prints the result. We can prevent automatic printing by applying invisible() to the last value

data(iris)
show_missings <- function(df) {
  n <- sum(is.na(df))
  # cat("Missing values: ", n, "\n", sep = "")
  if( n == 0)
    invisible(df)
  else 
    NULL
}
show_missings(iris) # print nothing

To verify that this value does indeed exist, we can explicitly print it or wrap it in parentheses:

print(head(show_missings(iris)))
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

4.7.12 Errors

Errors most often occur when code is used in a way that it is not intended to be used. For example adding two strings together produces the following error:

"hello" + "world"
## Error in "hello" + "world": non-numeric argument to binary operator

4.7.12.1 Generating errors in functions

  • An error indicates that something has gone wrong, and forces the user to deal with the problem
  • The stop() function will generate an error (with a message)
  • A good error message shouls tell the users what is wrong and point to the right direction to fix the problem.
add <- function(x, y){
  if (!is.numeric(x) || !is.numeric(y))  # OR operator
    stop("Error! x and y must be numeric")
  x + y
}
add(3, 2)
## [1] 5
add("hello", "world")
## Error in add("hello", "world"): Error! x and y must be numeric

4.7.13 Warning

Warnings, signalled by warning(), are weaker than errors: they signal that something has gone wrong, but the code has been able to recover and continue.

sqrt(-6)
## [1] NaN

In a user defined function:

sqrtExt <- function(x){ # square root of complex number
  if(! is.numeric(x))
    stop("Error! x must be numeric")
  else if (x < 0) {
    warning("x has negative value. Complex number produced! \n")
    x = x + 0i
  }
  return(sqrt(x))
}

sqrtExt(-4)
## [1] 0+2i

4.7.14 Messages

Messages, signalled by message(), are informational; we generate a message to tell the user that the code has done something on their behalf. Good messages are a balancing act: we want to provide just enough information so the user knows what’s going on, but not so much that they’re overwhelmed.
Remember this histogram sketch from ggplot2?

library(ggplot2)
rnorm_df <- data.frame(x = rnorm(n=100))
ggplot(data = rnorm_df, mapping = aes(x=x)) + geom_histogram()

We can provide a message in our own function:

plot_histogram <- function(x, bins, ...){
  if (missing (bins)) # check if argument bins is provided (indicating number of bins)
    message("`plot_histogram()` uses `bins=30` by default. Pick better value with `bins=` ")
    bins = 30
  hist(x, breaks=bins)
}
plot_histogram(rnorm_df$x)
## `plot_histogram()` uses `bins=30` by default. Pick better value with `bins=`

4.7.14.1 When to use a message?

  • When a default argument requires some non-trivial amount of computation and we want to tell the user what value was used. For example, ggplot2 reports the number of bins used if we don’t supply a binwidth.
  • When we’re about to start a long running process with no intermediate output. A progress bar (e.g. with progress) is better, but a message is a good place to start.
  • When writing a package, we sometimes want to notify users that the package is loaded.

4.7.14.2 Supress messages upon loading a package

This can be useful when knitting our rmd to a report. The spell is:

suppressPackageStartupMessages(library(THE_PACKAGE_NAME)) 

For multiple libraries:

suppressPackageStartupMessages({
    library(ggplot2)
    library(ggpubr)
})

4.7.15 Error handling

  • Imagine writing a program that will take a long time to complete because of a complex calculation or because we’re handling a large amount of data.
  • If an error occurs, the program stops and probably returns nothing. We can lose all of the results that were calculated before the error, or our program may not finish a critical task that a program further down our pipeline is depending on.
  • If we can anticipate the possible errors occurring during the execution of our program then we can design our program to handle them appropriately.
  • Exception or Error handling is a process of responding to anomalous occurrences in the code that disrupt the flow of the code. In general, the scope for the exception handlers begins with try and ends with a catch.
  • R provides try() and trycatch() function for the same.
  • The `tryCatch() function is the workhorse of handling errors and warnings in R.
  • The try() function is a wrapper function for trycatch() which prints the error and then continues. On the other hand, trycatch() gives us the control of the error function and also optionally, continues the process of the function.

4.7.15.1 tryCatch()

  • The first argument of this function is any R expression,
  • Second argument is the conditions which specify how to handle an error or a warning.
  • The last argument specifies a function or expression that will be executed after the expression no matter what, even in the event of an error or a warning.
4.7.15.1.1 tryCatch - example

Let’s first define a function that takes an expression as an argument and tries to evaluate it.

beera <- function(expr) expr
beera(2 + 2)
## [1] 4
beera({
  as.numeric(c(1, "two", 3))
})
## [1]  1 NA  3
beera(2 + "two")
## Error in 2 + "two": non-numeric argument to binary operator

What if the beera(2 + "two") evaluation was called the first ? We cannot proceed to the other two lines.

Now let’s use tryCatch() to catch the errors and warnings.

beera <- function(expr){
  tryCatch(expr,
         error = function(e){
           message("An error occurred:", e) # e is the error message generated by R
         },
         warning = function(w){
           message("A warning occured:", w)
         },
         finally = { # this always shows
           message("Finally done!")
         })
}

beera(2 + "two")
beera(2 + 2)
## [1] 4
beera({
  as.numeric(c(1, "two", 3))
})
4.7.15.1.2 tryCatch - another example

Consider our self defined is_even() function.

is_even <- function(n){
  n %% 2 == 0
}
is_even("two")
## Error in n%%2: non-numeric argument to binary operator

Let’s catch the error, and return some indicating value when an error happens.

is_even <- function(n){
  tryCatch(n %% 2 == 0,
           error = function(e){
             message("An error occurred: ", e, "Returning NA value")
             return(NA)
           })
}

is_even(714)
## [1] TRUE
is_even(13)
## [1] FALSE
is_even("eight")
## [1] NA

4.7.16 Comparing R output - print(), cat(), and message()

  • All the 3 functions print output to console
  • print() lacks the lack of embedded concatenation of terms. We have to relay on paste() for concatenation
  • print() also prints out quotation mark
print(paste("Hello","World!"))
## [1] "Hello World!"
  • cat() addresses all of these critiques. By default, cat() adds space between terms. To suppress or change this, we can use sep argument.
x= 10
cat("our x value is ", x, "!", sep ="")
## our x value is 10!
  • The message() function is one step further than cat() ! It changed the color from standard black output to red output to catch the users eye. (similar to stop and warning).
message("our x value is ", x, "!")
  • However, the purposes of cat() and message() are different. Use cat() when the primary role of the function is to print to the console, like print() or str() methods. Use message() as a side-channel to print to the console, like when we test our R programs.

4.7.17 Coding style

R is a high-level programming language used primarily for statistical computing and graphics. The goal of the R Programming Style Guide is to make our R code easier to read, share, and verify. There are a few conventions recommended by Google’s R style guide.

  • For assignment, Use <- , not =. Hot key for RStudio is Alt + = keys.
# Good
x <- 5
# Bad
x = 5
  • Variable names should use only lowercase letters, numbers, and _
  • Function names should use BigCamelCase
  • Generally, strive to use verb for function names, and use nouns for variable names
  • It is not recommended to use dot . in our object names. It can cause confusion with methods in S3 class in R, and in other object-oriented programming languages.
# Good
doNothing <- function() {
  return(invisible(NULL))
}

first_name <- "Ngoc"

# not recommended: "." in object name.
first.name <- "Sang"
  • Avoid re-using names of common functions and variables. The defaulted values will be overidden, and this will cause confusion for the readers of our code.
# Bad
T <- FALSE
c <- 10
mean <- function(x) sum(x)
  • It is NOT recommended to use attach() when loading a package. The possibilities for creating errors when using attach() are numerous. library or require are for this purpose
  • It is NOT recommended to use right-hand assignment
suppressPackageStartupMessages(library(dplyr))

# Bad
iris %>%
  dplyr::summarize(max_petal = max(Petal.Width)) -> results
  • Finally, using explicit return (with return()) makes functions clearer.

4.8 Bonus topic:Making a package

4.8.1 Set up

  • Hadley Wickham has provided the R community with devtools which helps with building R packages. We will be using this package to make our lives easier:
  • We’ll also need roxygen2 for documenting our functions (see below).
install.packages("devtools")
install.packages("roxygen2")

4.8.2 Creating the framework

  • The first thing we want to do is create the framework for our R package. We can do this using devtools:
devtools::create("myfirstpackage")
  • This ends up creating a folder with the same name as our package name with 4 files inside the folder:
  1. DESCRIPTION: This is where all the meta-data about our package goes. This includes: Dependencies, Versioning, Title and Description, and Author information. Refer to Hadley’s detailed explanation on the detailed description of each field
  2. myfirstpackage.Rproj: This is a RStudio specific file.
  3. NAMESPACE: In short, this file indicates what needs to be exposed to users for our R package. From my experience, I’ve never edited this file as devtools takes care of the changes.
  4. R: This is where all our R code goes for our package.
  • We now have the bare bones of our first R package. First let’s start by filling out the details in the DESCRIPTION file. When that is done, we can start adding some functions!

4.8.3 Add R functions

  • All of our R functions that we want in our R package belong in the R directory. We can create an .R file that has the same name as the function we want in it.
  • Let’s create a file called R/load_mat.R and add the following contents to the file:
#' @export
load_mat <- function(infile){
  in.dt <- data.table::fread(infile, header = TRUE)
  in.dt <- in.dt[!duplicated(in.dt[, 1]), ]
  in.mat <- as.matrix(in.dt[, -1, with = FALSE])
  rownames(in.mat) <- unlist(in.dt[, 1, with = FALSE])
  in.mat
}
  • Note that one .R file can contain multiple functions. In general, try to group related functions into the same .R file
  • IMPORTANT: we need to add the @export tag above the functions to make it available for users to use. Otherwise they cannot assess these functions when installing the package in their system
  • The #' @export syntax is actually an Roxygen tag. By doing this, this ensures that the load_mat() function gets added to the NAMESPACE (when we run devtools::document()) to indicate that it needs to be exposed.

4.8.4 External dependencies

  • load_mat() function actually depends on the data.table::fread() function to read in files super quickly. NOTICE the double colon “:” used to directly call fread() from data.table package
  • NEVER USE library() or require() when writing an R package
  • We will need to add this information to DESCRIPTION file under the Imports content. For this case, we need the data.table R package, so we added the following to our DESCRIPTION file:
# Notice how we also specified the version of the data.table
Imports:
    data.table (>= 1.9.4)
  dplyr

4.8.5 Documenting My Functions

  • We can leverage off the roxygen2 which provides a very simple way of documenting our functions and then produces man/load_mat.Rd files which is what we see when we go ?load_mat.
  • For instance, here is how we might document the load_mat() function:
#' Load a Matrix
#'
#' This function loads a file as a matrix. It assumes that the first column
#' contains the rownames and the subsequent columns are the sample identifiers.
#' Any rows with duplicated row names will be dropped with the first one being
#' kepted.
#'
#' @param infile Path to the input file
#' @return A matrix of the infile
#' @export
load_mat <- function(infile){
  in.dt <- data.table::fread(infile, header = TRUE)
  in.dt <- in.dt[!duplicated(in.dt[, 1]), ]
  in.mat <- as.matrix(in.dt[, -1, with = FALSE])
  rownames(in.mat) <- unlist(in.dt[, 1, with = FALSE])
  in.mat
}
  • Once we’ve got our documentation completed, we can simply run:
devtools::document()
  • We will get one .Rd file for each function in our R package.
  • Each time we add new documentation to our R function, we need to run devtools::document() again to re-generate the .Rd files.
  • In our example, the load_mat.Rd file is generated in the man folder:
% Generated by roxygen2 (4.1.0): do not edit by hand
% Please edit documentation in R/load.R
\name{load_mat}
\alias{load_mat}
\title{Load a Matrix}
\usage{
load_mat(infile)
}
\arguments{
\item{infile}{Path to the input file}
}
\value{
A matrix of the infile
}
\description{
This function loads a file as a matrix. It assumes that the first column
contains the rownames and the subsequent columns are the sample identifiers.
Any rows with duplicated row names will be dropped with the first one being
kepted.
}

4.8.6 Making Vignettes

  • Vignettes are extremely important to give people a high-level understanding of what our R package can do. To get started with generating a vignette, we can use the devtools::use_vignette() function for this. For instance,
devtools::use_vignette("introduction")
  • Note that starting in devtools 2.1.0 the use_vignette function was moved to the usethis package. So if we are using a newer version of devtools, we can run:
usethis::use_vignette("introduction")
  • This will create a vignette/introduction.Rmd file. This is a vignette template Rmarkdown file that we can then use to fill out steps on how we can use our package.

4.8.7 Install/Use my R Package

  • For now, we have:
  1. Our functions (.R files) the R folder
  2. Documentation (.Rd) files in in the man folder
  • We can use the devtools::load_all() function which will load our R package into memory exposing all the functions and data that we highlighted above. However as soon as we close our R session, the package will no longer be available.
  • To actually install our package, we use the devtools::install() function which installs our R package into our R system library. Then we will be able to load up our package with:
library("myfirstpackage")

4.8.8 Distributing an R package

  • The easiest way is to distribute it through Github.
  • We will have to commit a set of core files:
    • R/*.R files
    • man/*.Rd files
    • DESCRIPTON
    • NAMESPACE
  • Pushing R package to github can be referred from Hadley Wickham’s R package textbook
  • Once this is all done and we’ve pushed them to GitHub, anyone can install it using the following command:
devtools::install_github("yourusername/myfirstpackage")

References

———. 2019. Advanced r. Chapman; Hall/CRC. https://doi.org/10.1201/9781351201315.
Wickham, Hadley, and Jennifer Bryan. 2023. R Packages. O’reilly.
Wickham, Hadley, Mine Çetinkaya-Rundel, and Garrett Grolemund. 2023. R for Data Science. 2nd ed. O’Reilly.