5 Week 5 : Base R

5.1 Objects, variables, and assignment operator

  • In R (as in any programming language), objects, variables, and the assignment operator are closely related concepts.

  • The official R Language Definition states those concepts as follows:

“In every computer language variables provide a means of accessing the data stored in memory. R does not provide direct access to the computer’s memory but rather provides a number of specialized data structures we will refer to as objects. These objects are referred to through symbols or variables.” — R Language Definition

  • Simply put, data are stored in the computer’s memory in the form of an object, and a variable name points to (or binds to, or references) that object.

  • For example, the following R code

    • creates an object, the vector c(1,2,3), in the computer’s memory
    • and binds the object to a name x using the assignment operator <-
# The assignment operator binds the object c(1,2,3) to a name `x`
x <- c(1,2,3)

x points to a vector in memory (this image came from Hadley Wickham’s Advanced R)
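A small sketch of what binding means in practice. The name y is introduced here only for illustration and is not part of the original example.

```r
x <- c(1, 2, 3)
y <- x           # y is bound to the same values as x
y[1] <- 100      # modifying y ...
x                # ... leaves x unchanged: 1 2 3
y                # y is now 100 2 3
```

R copies the object when y is modified, so the two names no longer refer to the same data.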

5.2 Functions

  • In R (or in any programming language), a function is a block of code that performs a single task when the function is called.

  • A function requires

    • arguments, whose values are used when the function is called,
    • and a body, which is a group of R expressions enclosed in curly braces ({ and }).
  • A function can return values as a result of the task defined by the body of the function.

  • In R, both the arguments that we provide when we call the function and the result of the function execution are R objects.

    • Learning the different types of R objects (data structures) is therefore important for using functions effectively.

5.2.1 An Example of Functions

# call (execute or run) the mean() function
mean(c(1,2,3,4))    
## [1] 2.5
# a will store the object returned by the mean()
a <- mean(c(1,2,3,4))   
# mean() returns NA if any element is NA
mean(c(1,2,NA,4))     
## [1] NA
# When na.rm = TRUE, NA will be removed before computation
mean(c(1,2,NA,4), na.rm = TRUE)    
## [1] 2.333333

5.2.2 User-Defined Functions

  • We can write our own functions easily
  • function.name <- function(arg1, arg2, arg3){body}
# Define the se() function, which calculates the standard error of the mean
se <- function(x) {
  v <- var(x)
  n <- length(x)
  return(sqrt(v/n))
}
mySample <- rnorm(n=100, mean=20, sd=4)
se(mySample)
## [1] 0.3905337
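As a possible extension, a function can also take arguments with default values. The sketch below assumes a hypothetical variant called se2() with an na.rm argument; neither the name nor the argument is part of the original se().

```r
# se2() is a hypothetical variant of se() with an na.rm argument
se2 <- function(x, na.rm = FALSE) {
  if (na.rm) {
    x <- x[!is.na(x)]        # drop missing values before computing
  }
  sqrt(var(x) / length(x))   # the last evaluated expression is returned
}
se2(c(2, 4, 6, NA))                 # NA, because var() returns NA
se2(c(2, 4, 6, NA), na.rm = TRUE)   # same as se2(c(2, 4, 6))
```

Because na.rm defaults to FALSE, callers who do not mention it get the same behavior as the original se().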

5.2.3 Exercise on functions

The following code generates two numeric vectors randomly sampled from N(0,1) and N(3,2), where the second parameter is the standard deviation.

x1 <- rnorm(100, mean=0, sd=1)  # generate 100 random numbers from Normal(0,1)
x2 <- rnorm(100, mean=3, sd=2)  # generate 100 random numbers from Normal(3,2)

Write your own function that returns (simplified) Cohen’s d = \(\frac{mean(x_2)-mean(x_1)}{sd(x_1)}\). Specifically, your function should take the two vectors above, x1 and x2, as arguments and return d. For fun, let’s use your own name as the name of this function. Check whether your function actually works by running your_name(x1,x2).

5.2.4 Some Comments on Functions

  • Functions are a fundamental building block of R.
  • We can create our own functions, but we usually use functions made by others.
  • Packages are a collection of functions made by others.
  • In many cases, our job is to build a pipeline of data flow by connecting many available functions.
  • To do that, we have to handle the input objects (argument) and output objects (returned objects) of functions, which requires knowledge about data structure (e.g., creating, subsetting).
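To make the idea of a pipeline concrete, here is a minimal sketch in which the output of one function becomes the input of the next. The vector x and the intermediate names roots and avg are illustrative.

```r
x <- c(1, 4, 9, NA, 16)
# nested calls run from the inside out: sqrt() -> mean() -> round()
round(mean(sqrt(x), na.rm = TRUE), 1)
# the same pipeline, one step at a time
roots <- sqrt(x)                    # numeric vector: 1 2 3 NA 4
avg <- mean(roots, na.rm = TRUE)    # single number: 2.5
round(avg, 1)
```

Each step only works if the object it receives has the structure it expects, which is why the following sections focus on data structures.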

5.3 Operators

  • Arithmetic Operators
Operator   Description
+          addition
-          subtraction
*          multiplication
/          division
^ or **    exponentiation
  • Logical Operators
Operator   Description
<          less than
<=         less than or equal to
>          greater than
>=         greater than or equal to
==         exactly equal to
!=         not equal to
!x         not x
x | y      x OR y
x & y      x AND y
1 == 2
## [1] FALSE
"a" != "b"
## [1] TRUE
(1 == 2) | ("a" != "b")
## [1] TRUE
(1 == 2) & ("a" != "b")
## [1] FALSE
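The examples above cover only the logical operators, so here is a short sketch of the arithmetic ones, followed by the same operators applied to whole vectors. The numbers are arbitrary.

```r
7 + 2     # 9
7 - 2     # 5
7 * 2     # 14
7 / 2     # 3.5
7 ^ 2     # 49
7 ** 2    # 49, ** is treated the same as ^
# operators work element by element on vectors
c(1, 2, 3) * 2        # 2 4 6
c(1, 5, 10) > 3       # FALSE TRUE TRUE
!(c(1, 5, 10) > 3)    # TRUE FALSE FALSE
```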

5.4 Data Structure

  • R has base data structures.
  • Almost all other objects are built upon base data structures.
  • R base data structures can be organized by their dimensionality:
Dimension   Homogeneous     Heterogeneous
1D          Atomic vector   List
2D          Matrix          Data frame
nD          Array
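As a rough illustration of the table, the sketch below creates one object of each kind. All object names and values are made up for this example.

```r
v   <- c(1, 2, 3)                                 # 1D, homogeneous: atomic vector
l   <- list(1, "a", TRUE)                         # 1D, heterogeneous: list
m   <- matrix(1:6, nrow = 2)                      # 2D, homogeneous: matrix
df  <- data.frame(id = 1:2, name = c("a", "b"))   # 2D, heterogeneous: data frame
arr <- array(1:24, dim = c(2, 3, 4))              # nD, homogeneous: array
dim(m)     # 2 3
dim(df)    # 2 2
dim(arr)   # 2 3 4
```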

5.5 Vectors

5.5.1 Vectors Come in Two Flavours

  • Atomic vectors (homogeneous)
    • All elements of an atomic vector must be the same type.
    • There are six types of atomic vectors:
      • Logical (TRUE or FALSE), integer, double, and character (+ rarely used complex and raw)
    • Atomic vectors are usually created with c(), short for combine:
      • a <- c(TRUE, FALSE, T, F) # logical
      • a <- c(1L, 6L, 5L) # integer
      • a <- c(1, 2.5, 3.8) # double
      • a <- c("apple", "orange") # character
  • Lists (heterogeneous)
    • Lists are different from atomic vectors because their elements can be of any type.
    • Lists are created with list():
    • x <- list(1:3, "a", c(TRUE, FALSE))

5.5.2 A Vector Has Three Properties

  • Type: typeof() returns the type of an object.
typeof(c(1,2,3))
## [1] "double"
  • Length: length() returns the number of elements in a vector
length(c(1,2,3))
## [1] 3
  • Attributes: attributes() returns additional arbitrary metadata
attributes(c(1,2,3))
## NULL

5.5.3 Attributes

  • All objects can have attributes to store metadata about the object.
  • Attributes can be considered as a named list.
  • Attributes can be accessed individually with attr() or all at once with attributes().
  • Names are attributes of a vector. You can name a vector in two ways:
a <- c(x=1,y=2,z=3)   # when creating
names(a)
## [1] "x" "y" "z"
names(a) <- c("l", "m", "n")   # by modifying existing names
a
## l m n 
## 1 2 3
attributes(a)    # names are attributes
## $names
## [1] "l" "m" "n"
attributes(mtcars)
## $names
##  [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
## [11] "carb"
## 
## $row.names
##  [1] "Mazda RX4"           "Mazda RX4 Wag"       "Datsun 710"         
##  [4] "Hornet 4 Drive"      "Hornet Sportabout"   "Valiant"            
##  [7] "Duster 360"          "Merc 240D"           "Merc 230"           
## [10] "Merc 280"            "Merc 280C"           "Merc 450SE"         
## [13] "Merc 450SL"          "Merc 450SLC"         "Cadillac Fleetwood" 
## [16] "Lincoln Continental" "Chrysler Imperial"   "Fiat 128"           
## [19] "Honda Civic"         "Toyota Corolla"      "Toyota Corona"      
## [22] "Dodge Challenger"    "AMC Javelin"         "Camaro Z28"         
## [25] "Pontiac Firebird"    "Fiat X1-9"           "Porsche 914-2"      
## [28] "Lotus Europa"        "Ford Pantera L"      "Ferrari Dino"       
## [31] "Maserati Bora"       "Volvo 142E"         
## 
## $class
## [1] "data.frame"
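The examples above use attributes() and names(); the sketch below shows attr() for reading or setting a single attribute. The "unit" attribute is invented purely for illustration.

```r
a <- c(x = 1, y = 2, z = 3)
attr(a, "names")           # "x" "y" "z", the same as names(a)
attr(a, "unit") <- "cm"    # "unit" is a made-up attribute
attributes(a)              # a list with $names and $unit
attr(mtcars, "class")      # "data.frame"
```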

5.5.4 Type Coercion (Conversion)

  • All elements of an atomic vector must have the same type. If they do not, R automatically converts them to a common type (type coercion).
v <- c(4, 7, 23.5, 76.2, 80, "rrt")
v
## [1] "4"    "7"    "23.5" "76.2" "80"   "rrt"
typeof(v)
## [1] "character"
  • Functions can also convert data types automatically.
sum(c(TRUE, FALSE, TRUE))
## [1] 2
  • You can explicitly convert data type with as.character(), as.double(), as.integer(), and as.logical().
a <- c(1,2,3)
a
## [1] 1 2 3
b <- as.character(a)
b
## [1] "1" "2" "3"
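Coercion follows a fixed order among the common types: logical, then integer, then double, then character, with the most general type winning. A brief sketch, with arbitrary values:

```r
typeof(c(TRUE, 1L))     # "integer"
typeof(c(1L, 2.5))      # "double"
typeof(c(2.5, "a"))     # "character"
# explicit conversion that cannot succeed gives NA and a warning
as.integer(c("1", "2", "x"))   # 1 2 NA
```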

5.5.5 NA represents missing values

u <- c(4, 6, NA, 2)
u
## [1]  4  6 NA  2
k <- c(TRUE, FALSE, FALSE, NA, TRUE)
k
## [1]  TRUE FALSE FALSE    NA  TRUE
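One consequence worth seeing is that NA propagates through most operations, including comparisons. The following sketch reuses the vector u from above.

```r
u <- c(4, 6, NA, 2)
u + 1       # 5 7 NA 3
u > 3       # TRUE TRUE NA FALSE
NA == NA    # NA, so == cannot be used to test for missing values
is.na(u)    # FALSE FALSE TRUE FALSE
```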

5.5.6 Generate a vector

# we can manually type the element of a vector using c()
a <- c(1,2,3,4,5)
a
## [1] 1 2 3 4 5
# c() also combines vectors
a <- c(1,2,3)
b <- c(4,5,6)
c <- c(a, b)
c
## [1] 1 2 3 4 5 6
# k:n generates a vector whose elements are the sequence of numbers from k to n
1:10
##  [1]  1  2  3  4  5  6  7  8  9 10
# seq() generates regular sequence
# seq(from, to)
seq(1, 10)
##  [1]  1  2  3  4  5  6  7  8  9 10
# seq(from, to, by)
seq(1, 10, 2)
## [1] 1 3 5 7 9
# rep(x, times) replicates the values in x multiple times
# x can be a number or vector
# replicates 1 5 times
rep(1, 5)
## [1] 1 1 1 1 1
# replicates c(1, 2) 5 times
rep(c(1,2), 5)
##  [1] 1 2 1 2 1 2 1 2 1 2
# each element of c(1,2) is repeated 5 times
rep(c(1,2), each = 5)
##  [1] 1 1 1 1 1 2 2 2 2 2
# rnorm(n, mean = 0, sd = 1) generates a vector of n random samples 
# from a normal distribution with specific mean and sd. 
rnorm(100)
##   [1] -0.7180948330  0.5391891437  1.5436734274  0.1209715539 -1.7306197877
##   [6] -1.2776020806 -0.6434212709 -1.1957157850  0.2179217839  0.4557368071
##  [11]  0.5860089462 -0.0323879822  0.8551165114 -0.4237932823 -0.4179127188
##  [16] -0.1542839645  0.5290145282 -0.7989648116  0.8599920831 -0.3835471731
##  [21]  1.3633815454  0.5676958216 -0.0004990567 -0.9497627516  0.0052359549
##  [26] -1.4146958207  0.3731987529 -0.3769838777  1.0002989430 -2.2101492850
##  [31]  0.3562861959  0.3532722858  0.2573906109 -0.7265632275 -0.6786082192
##  [36] -1.1494774882  0.1189402384  0.6295624473  1.3207825402  0.4491728678
##  [41] -0.1505318935 -0.1593003107  0.1739473389 -1.2291071312 -0.0130660614
##  [46]  0.2146579092 -1.9959293494 -0.3084441939  1.1961857993  0.2103710966
##  [51] -0.7564139262 -0.2065193330 -0.4542232589 -0.3724556770 -0.8097182728
##  [56]  1.3213539156  0.7009281891  0.3491607539  1.0155210752  0.1172642753
##  [61]  1.6852905497 -0.0366362356 -1.0323032961  1.2348999048 -0.1518885884
##  [66]  1.1510090637  0.2412658126  0.5529593725 -0.3296161204  1.7468428604
##  [71] -0.6256847616  1.9019679060 -0.4384540752 -0.0207426879  1.5810853045
##  [76] -2.4959277880 -0.6389141424 -0.0451968249 -1.1581737633 -0.2167397981
##  [81]  1.6454215982 -0.8625133711 -1.3101668161 -1.0731171922  0.5360852370
##  [86]  0.0778258527  1.4601409939 -0.1220270369 -1.4273582283 -0.0072564232
##  [91] -0.1308164852 -0.1548073707  1.4216644864 -1.3414068796  0.6189477353
##  [96]  0.1321823036 -0.0502072752 -0.1368422528 -0.7747828035  0.7517000663
# qplot() comes from the ggplot2 package, which must be loaded first
library(ggplot2)
qplot(rnorm(10000))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# runif(n, min, max) generates a vector of n random samples 
# from a uniform distribution whose limits are min and max. 
runif(100, 0, 1)
##   [1] 0.81354465 0.21861048 0.28608343 0.12356698 0.69885886 0.41381644
##   [7] 0.89452620 0.43277901 0.37289700 0.18774224 0.67713300 0.23101691
##  [13] 0.10629809 0.25905796 0.21457139 0.71882659 0.62056976 0.77182181
##  [19] 0.91383027 0.13874984 0.33887862 0.80470997 0.13547774 0.65482471
##  [25] 0.54126442 0.41955851 0.85619562 0.64067250 0.86007035 0.72061981
##  [31] 0.28361612 0.30942196 0.30185632 0.95063064 0.20594480 0.02923260
##  [37] 0.58505394 0.93286413 0.39669431 0.93745377 0.56418798 0.82057556
##  [43] 0.49091330 0.74150270 0.42791715 0.38614300 0.34936964 0.65066701
##  [49] 0.24700738 0.71361028 0.57609348 0.90738513 0.65623436 0.47134943
##  [55] 0.63641301 0.63262033 0.31924140 0.63335207 0.97635655 0.03923099
##  [61] 0.64211933 0.54652254 0.34121130 0.37036342 0.59060138 0.59090314
##  [67] 0.29011722 0.23497430 0.19492350 0.77632450 0.82934722 0.80451646
##  [73] 0.24035925 0.22780299 0.46375171 0.04772902 0.33726879 0.10170134
##  [79] 0.98735725 0.59672916 0.51721048 0.70481074 0.93369468 0.74178588
##  [85] 0.35106198 0.28999117 0.02344419 0.01086638 0.94888899 0.48949057
##  [91] 0.84047274 0.01659442 0.45316665 0.56737620 0.30589947 0.57554391
##  [97] 0.87676138 0.16610022 0.18435890 0.04656778
qplot(runif(10000, 0, 1))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
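Because rnorm() and runif() draw new random numbers on every call, the output above will differ each time the code runs. A minimal sketch of making the draws reproducible with set.seed(); the seed value 123 is arbitrary.

```r
set.seed(123)        # fix the state of the random number generator
first <- rnorm(3)
set.seed(123)        # reset to the same state
second <- rnorm(3)
identical(first, second)   # TRUE
```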

5.5.7 Indexing or subsetting a Vector

  • You can access particular elements of a vector by placing an index inside square brackets, the indexing (subsetting) operator.

  • Positive integers return elements at the specified positions.

x <- c(2,3,4,5,6,7)
x[c(3,1)]
## [1] 4 2
  • Negative integers omit elements at the specified positions:
x[-c(3,1)]
## [1] 3 5 6 7
  • Logical vectors select elements where the corresponding logical value is TRUE. This logical indexing is very useful because we can subset a vector or dataframe based on conditions.
x[c(TRUE, TRUE, FALSE, FALSE, TRUE, TRUE)]
## [1] 2 3 6 7
x > 3
## [1] FALSE FALSE  TRUE  TRUE  TRUE  TRUE
# This is called logical indexing, which is a very powerful tool.
# > : greater than (Logical Operators)
x[x > 3]   
## [1] 4 5 6 7
x[x > 3 & x < 5]
## [1] 4
# %in% operator 
# v1 %in% v2 returns a logical vector indicating 
# whether the elements of v1 are included in v2. 
c(1,2,3) %in% c(2,3,4,5,6)
## [1] FALSE  TRUE  TRUE
a <- c(1,2,3,4,5)
a
## [1] 1 2 3 4 5
# we replace an element of a vector using the indexing and assignment operators. 
a[3] <- 100
a
## [1]   1   2 100   4   5
a[c(1,5)] <- 100
a
## [1] 100   2 100   4 100
a <- c(1,2,3,NA,5,6,NA)
a
## [1]  1  2  3 NA  5  6 NA
# is.na indicates which elements are missing
is.na(a)  # returns TRUE when missing
## [1] FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE
# Type conversion: TRUE and FALSE will be converted into 1 and 0, respectively. 
# This expression answers the question: How many NAs are in a?
sum(is.na(a))
## [1] 2
# !x = not x (negation)
!is.na(a)  # returns TRUE when not missing
## [1]  TRUE  TRUE  TRUE FALSE  TRUE  TRUE FALSE
# This expression answers the question: How many non-NAs are in a?
sum(!is.na(a))
## [1] 5
# logical indexing
a[is.na(a)] <- 999
a
## [1]   1   2   3 999   5   6 999
# create a vector with names
a <- c(x = 1, y = 2, z = 3)
a
## x y z 
## 1 2 3
# a named vector can be indexed by its names
a[c("x", "z")]
## x z 
## 1 3
# R uses a "recycling rule" by repeating the shorter vector
# In this example, R recycled c(TRUE, FALSE) to produce c(TRUE, FALSE, TRUE, FALSE)
i <- c(TRUE, FALSE)
a <- c(1,2,3,4)
a[i]
## [1] 1 3
# R uses a "recycling rule" by repeating the shorter vector
v1 <- c(4,5,6,7)
v2 <- c(10,10)
v1+v2
## [1] 14 15 16 17
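Two edge cases are not shown above and may be worth knowing. This is a sketch with arbitrary values: recycling still happens when the longer length is not a multiple of the shorter one, but R warns, and indexing past the end of a vector returns NA.

```r
v1 <- c(1, 2, 3)
v2 <- c(10, 20)
v1 + v2    # 11 22 13, with a warning about mismatched lengths
v1[5]      # NA: there is no fifth element
```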

5.5.8 Arrange a vector

# sort(x, decreasing = FALSE) 
# By default, sort() sorts in ascending order.
sort(c(5,6,4))
## [1] 4 5 6
# sorts into descending order
sort(c(5,6,4), decreasing = TRUE)
## [1] 6 5 4
# rev() provides a reversed version of its argument
rev(c(5,6,4))
## [1] 4 6 5
# rank() returns the sample ranks of the elements in a vector
rank(c(5,6,4))
## [1] 2 3 1
# order() returns a permutation which rearranges 
# its first argument into ascending or descending order. 
# What this means is that order(c(5,6,4)) 
# 1) first sorts the vector in ascending order to produce c(4,5,6)
# 2) and returns the indices of the sorted elements in the original vector. 
# e.g., we have 3 first b/c the index of 4 in the original vector is 3
# e.g., we have 1 second b/c the index of 5 in the original vector is 1
# e.g., we have 2 third b/c the index of 6 in the original vector is 2
order(c(5,6,4))
## [1] 3 1 2
# We use order() to sort a vector or dataframe
a <- c(5,6,4)
a[order(a)]
## [1] 4 5 6
# sort a dataframe
head(mtcars[order(mtcars$mpg), ])
##                      mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Cadillac Fleetwood  10.4   8  472 205 2.93 5.250 17.98  0  0    3    4
## Lincoln Continental 10.4   8  460 215 3.00 5.424 17.82  0  0    3    4
## Camaro Z28          13.3   8  350 245 3.73 3.840 15.41  0  0    3    4
## Duster 360          14.3   8  360 245 3.21 3.570 15.84  0  0    3    4
## Chrysler Imperial   14.7   8  440 230 3.23 5.345 17.42  0  0    3    4
## Maserati Bora       15.0   8  301 335 3.54 3.570 14.60  0  1    5    8
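A short sketch contrasting rank() and order(), and sorting in descending order. The column selection in the last line is only there to keep the output narrow.

```r
a <- c(5, 6, 4)
rank(a)     # 2 3 1: where each element lands in the sorted vector
order(a)    # 3 1 2: which element comes first, second, third
a[order(a, decreasing = TRUE)]     # 6 5 4
identical(a[order(a)], sort(a))    # TRUE
# sort a data frame by mpg, largest first
head(mtcars[order(mtcars$mpg, decreasing = TRUE), c("mpg", "cyl")], 3)
```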

5.5.9 Vectorization of Functions

  • One of the most powerful aspects of R is the vectorization of functions.
  • Many R functions can be applied to a vector of values producing an equal-sized vector of results.
v <- c(1,4,25)
sqrt(v)
## [1] 1 2 5
v <- c(1,2,3)
v^2
## [1] 1 4 9
v1 <- c(4,5,6,7)
v2 <- c(10,2,1,2)
v1+v2
## [1] 14  7  7  9
# R uses a "recycling rule" by repeating the shorter vector
v1 <- c(4,5,6,7)
v2 <- c(10,2)
v1+v2
## [1] 14  7 16  9
# mean will be subtracted from every element of v1
v1 <- c(1,2,3,4)
v1 - mean(v1)
## [1] -1.5 -0.5  0.5  1.5
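For comparison, here is roughly what a vectorized call saves us from writing. The explicit loop is shown only for illustration; the names out and i are arbitrary.

```r
v <- c(1, 4, 25)
# loop version: fill the result one element at a time
out <- numeric(length(v))
for (i in seq_along(v)) {
  out[i] <- sqrt(v[i])
}
out                        # 1 2 5
# vectorized version: one call, same result
identical(out, sqrt(v))    # TRUE
```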

5.5.10 Some more functions

# table() creates a frequency table
a <- c(1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 4, 4)
table(a)
## a
## 1 2 3 4 
## 2 3 3 5
# unique() returns a vector of unique elements
unique(a)
## [1] 1 2 3 4
a <- c(1,2,3,NA,5)
# By default, mean() returns NA when there are NAs in a vector
mean(a)
## [1] NA
# na.rm = TRUE removes NAs before computation
mean(a, na.rm = TRUE)
## [1] 2.75

5.5.11 Generating Sequences

# creating a vector containing integers between 1 and 10
1:10 
##  [1]  1  2  3  4  5  6  7  8  9 10
5:0
## [1] 5 4 3 2 1 0
seq(from=1, to=3, by=0.5)
## [1] 1.0 1.5 2.0 2.5 3.0
# rep() replicates the values in its first argument
rep(5,3)
## [1] 5 5 5
rep(1:2, 3)
## [1] 1 2 1 2 1 2
rep(1:2, each=3)
## [1] 1 1 1 2 2 2
# gl() generates sequences involving factors
# gl(k,n), k = the number of levels, 
# n = the number of repetitions.
gl(5,3)
##  [1] 1 1 1 2 2 2 3 3 3 4 4 4 5 5 5
## Levels: 1 2 3 4 5
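A few more sequence variants, sketched with arbitrary values: seq() with length.out instead of by, a decreasing seq(), seq_len(), and rep() with a separate count for each element.

```r
seq(0, 1, length.out = 5)           # 0.00 0.25 0.50 0.75 1.00
seq(10, 1, by = -3)                 # 10 7 4 1
seq_len(4)                          # 1 2 3 4
rep(c("a", "b"), times = c(2, 3))   # "a" "a" "b" "b" "b"
```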

5.5.12 Exercise on vectors

mtcars is a dataframe about fuel economy of various cars. In the dataset, mpg represents miles per gallon. mtcars$mpg allows us to access the mpg variable in the mtcars dataframe.

a <- mtcars$mpg
  • Calculate the length of the vector a.
length(a)
## [1] 32
  • Calculate the mean of a using sum() and length() functions.
sum(a)/length(a)
## [1] 20.09062
  • Calculate the mean of a using mean() function.
mean(a)
## [1] 20.09062
  • Calculate the variance of a using sd() function.
sd(a)^2
## [1] 36.3241
  • Calculate the variance of a using var() function.
var(a)
## [1] 36.3241
  • Calculate the variance of a by directly calculating the following expression: \([(a_1 - \bar{a})^2 + (a_2 - \bar{a})^2 + \dots + (a_n - \bar{a})^2]/(n-1) = \frac{\sum_{i=1}^{n}(a_i-\bar{a})^2}{n-1}\), where \(a = (a_1, a_2, \dots, a_n)\) and \(\bar{a} = mean(a)\)
sum((a-mean(a))^2)/(length(a)-1)
## [1] 36.3241
  • Standardize the vector a, i.e., \(z = \frac{a-\bar{a}}{sd(a)}\).
(a-mean(a))/sd(a)
##  [1]  0.15088482  0.15088482  0.44954345  0.21725341 -0.23073453 -0.33028740
##  [7] -0.96078893  0.71501778  0.44954345 -0.14777380 -0.38006384 -0.61235388
## [13] -0.46302456 -0.81145962 -1.60788262 -1.60788262 -0.89442035  2.04238943
## [19]  1.71054652  2.29127162  0.23384555 -0.76168319 -0.81145962 -1.12671039
## [25] -0.14777380  1.19619000  0.98049211  1.71054652 -0.71190675 -0.06481307
## [31] -0.84464392  0.21725341
  • Use scale() function to standardize a and compare the results with your manual calculation.
# check the help document of scale() by typing ?scale for more details
scale(a)
##              [,1]
##  [1,]  0.15088482
##  [2,]  0.15088482
##  [3,]  0.44954345
##  [4,]  0.21725341
##  [5,] -0.23073453
##  [6,] -0.33028740
##  [7,] -0.96078893
##  [8,]  0.71501778
##  [9,]  0.44954345
## [10,] -0.14777380
## [11,] -0.38006384
## [12,] -0.61235388
## [13,] -0.46302456
## [14,] -0.81145962
## [15,] -1.60788262
## [16,] -1.60788262
## [17,] -0.89442035
## [18,]  2.04238943
## [19,]  1.71054652
## [20,]  2.29127162
## [21,]  0.23384555
## [22,] -0.76168319
## [23,] -0.81145962
## [24,] -1.12671039
## [25,] -0.14777380
## [26,]  1.19619000
## [27,]  0.98049211
## [28,]  1.71054652
## [29,] -0.71190675
## [30,] -0.06481307
## [31,] -0.84464392
## [32,]  0.21725341
## attr(,"scaled:center")
## [1] 20.09062
## attr(,"scaled:scale")
## [1] 6.026948
  • Calculate the difference between the largest and smallest numbers in a.
max(a)-min(a)
## [1] 23.5
# another solution
diff(range(a))
## [1] 23.5
  • Normalize the vector a, i.e., \(n = \frac{a-min(a)}{max(a)-min(a)}\).
# your maximum value will be 1, and minimum value will be 0. 
(a-min(a))/(max(a)-min(a))
##  [1] 0.4510638 0.4510638 0.5276596 0.4680851 0.3531915 0.3276596 0.1659574
##  [8] 0.5957447 0.5276596 0.3744681 0.3148936 0.2553191 0.2936170 0.2042553
## [15] 0.0000000 0.0000000 0.1829787 0.9361702 0.8510638 1.0000000 0.4723404
## [22] 0.2170213 0.2042553 0.1234043 0.3744681 0.7191489 0.6638298 0.8510638
## [29] 0.2297872 0.3957447 0.1957447 0.4680851
  • Plot the histogram of a using qplot().
qplot(a)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# To set aesthetics, wrap in I()
qplot(a, color = I("red"), fill = I("blue"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

  • How many elements in a are larger than 20? (use length())
# creates a logical vector in which TRUE indicates the element that is larger than 20
a>20
##  [1]  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE
## [13] FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE
## [25] FALSE  TRUE  TRUE  TRUE FALSE FALSE FALSE  TRUE
# This is logical indexing: the logical vector 
# within the subsetting operator (i.e., []) selects the elements larger than 20. 
a[a>20]
##  [1] 21.0 21.0 22.8 21.4 24.4 22.8 32.4 30.4 33.9 21.5 27.3 26.0 30.4 21.4
length(a[a>20])
## [1] 14
  • How many elements in a are larger than 20? (use sum())
# same result
# this happens because of "vectorization" and "type conversion"
sum(a>20)
## [1] 14

txhousing is a tibble in ggplot2 containing information about the housing market in Texas provided by the TAMU real estate center. In the dataset, median represents median sale price. txhousing$median allows us to access the median variable in the txhousing tibble (or dataframe).

b <- txhousing$median
  • Calculate the length of the vector b.

  • How many missing values (or NAs) are in b?

  • Calculate the mean of b using sum() and length() functions.

  • Calculate the mean of b using mean() function.

  • Are the two means the same? If not, why? How can we get the same result?

  • Calculate the variance of b using sd() function.

  • Calculate the variance of b using var() function.

  • Plot the histogram of b using qplot().

  • Create a new vector c by removing all missing values from b.

  • (Using c) What percentage of the median sale prices are larger than $200000?

5.6 Factors

5.6.1 What is a factor?

  • Factors are used to represent categorical data (e.g., gender, states).
  • Factors are stored as a vector of integer values associated with a set of character values (levels) to use when the factor is displayed.
  • Factors have two attributes:
    • the class(), “factor”, which makes factors behave differently from regular integer vectors, and
    • the levels(), which defines the set of allowed values.

5.6.2 Creating a factor

  • The function factor() is used to encode a numeric or character vector as a factor.
# levels are the set of allowed values
f1 <- factor(c(2,1,1,3,2,1,1))
f1
## [1] 2 1 1 3 2 1 1
## Levels: 1 2 3
  • Factors are built on top of integers, and have a levels attribute
typeof(f1)
## [1] "integer"
attributes(f1)
## $levels
## [1] "1" "2" "3"
## 
## $class
## [1] "factor"
  • levels() displays the levels of a factor
levels(f1)
## [1] "1" "2" "3"
  • The levels of a factor are stored as a character vector.
# test for objects of type "character"
is.character(levels(f1))
## [1] TRUE
levels(f1) <- c("one", "two", "three")
f1
## [1] two   one   one   three two   one   one  
## Levels: one two three
  • By default, the levels of a factor are sorted in alphabetical order.
f2 <- factor(c("Dec", "Apr", "Jan", "Mar"))
f2
## [1] Dec Apr Jan Mar
## Levels: Apr Dec Jan Mar
sort(f2)
## [1] Apr Dec Jan Mar
## Levels: Apr Dec Jan Mar
  • The levels argument can be used to change the order of the levels from the default sorted order.
month_levels <- c(
  "Jan", "Feb", "Mar", "Apr", "May", "Jun", 
  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
)
f3 <- factor(c("Dec", "Apr", "Jan", "Mar"), levels = month_levels)
f3
## [1] Dec Apr Jan Mar
## Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
# In many cases, this is the result that we expect. 
sort(f3)
## [1] Jan Mar Apr Dec
## Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
table(f3)
## f3
## Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec 
##   1   0   1   1   0   0   0   0   0   0   0   1
table(f2)
## f2
## Apr Dec Jan Mar 
##   1   1   1   1
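Since a factor is stored as integer codes plus labels, converting a factor whose labels look like numbers requires some care. A small sketch, with a made-up factor f:

```r
f <- factor(c("10", "20", "10", "30"))
as.integer(f)                  # 1 2 1 3: the underlying integer codes
as.numeric(as.character(f))    # 10 20 10 30: the values the labels represent
nlevels(f)                     # 3
```

Going through as.character() first recovers the labels rather than the codes.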

5.6.3 unordered vs ordered factor

  • Although the levels of a factor created by factor() have a display order, such a factor is called an unordered factor because its levels carry no meaningful ordering. Comparison operators such as < and > do not work with an unordered factor. When the order is meaningful, we can create an ordered factor instead.
# the default level is in alphabetical order 
f4 <- factor(c("high", "low", "medium", "medium", "high"))
f4
## [1] high   low    medium medium high  
## Levels: high low medium
sort(f4)
## [1] high   high   low    medium medium
## Levels: high low medium
f5 <- factor(c("high", "low", "medium", "medium", "high"), levels = c("low", "medium", "high"))
f5
## [1] high   low    medium medium high  
## Levels: low medium high
sort(f5)
## [1] low    medium medium high   high  
## Levels: low medium high
  • min(f5) will produce an error, and f5[1] < f5[3] will return NA with a warning, because f5 is an unordered factor.

  • With the ordered = TRUE option, the levels are treated as ordered.

f6 <- factor(c("high", "low", "medium", "medium", "high"), levels = c("low", "medium", "high"), ordered = TRUE)
f6
## [1] high   low    medium medium high  
## Levels: low < medium < high
min(f6)
## [1] low
## Levels: low < medium < high
f6[1] > f6[2]
## [1] TRUE
  • ordered() function also creates an ordered factor.
f7 <- ordered(c("high", "low", "medium", "medium", "high"), levels = c("low", "medium", "high"))
f7
## [1] high   low    medium medium high  
## Levels: low < medium < high
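With an ordered factor, comparisons against a level name can also be used for logical indexing. The sketch below recreates f6 so that it runs on its own.

```r
f6 <- factor(c("high", "low", "medium", "medium", "high"),
             levels = c("low", "medium", "high"), ordered = TRUE)
is.ordered(f6)         # TRUE
f6 >= "medium"         # TRUE FALSE TRUE TRUE TRUE
f6[f6 >= "medium"]     # keeps the medium and high responses
max(f6)                # high
```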

5.6.4 Why factors?

  • Factors are an efficient way to store character values, because each unique character value is stored only once, and the factor itself is stored as an integer vector.
  • Factors help prevent typos because they only accept the pre-defined values.
  • Factors allow us to encode ordering structure.
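A brief sketch of the typo protection mentioned above. The factor f and the misspelled value are invented for this example.

```r
f <- factor(c("low", "high"), levels = c("low", "medium", "high"))
f[2] <- "medium"    # fine: "medium" is one of the levels
f[1] <- "hihg"      # typo: not a level, so R stores NA and gives a warning
f
```

The misspelled value never enters the data silently; it shows up as NA.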

5.6.5 Some more comments

  • Be careful. Before R 4.0.0, many base R functions (e.g., data.frame() and read.csv()) automatically converted character vectors into factors. To suppress this behavior in older versions, use the stringsAsFactors = FALSE option within the function; since R 4.0.0, FALSE is the default. You can explicitly convert data types with as.character(), as.double(), as.integer(), and as.logical().
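To see the effect of stringsAsFactors regardless of the installed R version, the option can be set explicitly. The data frames d1 and d2 are illustrative.

```r
d1 <- data.frame(g = c("a", "b"), stringsAsFactors = FALSE)
d2 <- data.frame(g = c("a", "b"), stringsAsFactors = TRUE)
class(d1$g)    # "character"
class(d2$g)    # "factor"
```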

5.6.6 Exercise on factors

  • You have the following responses of a five-point Likert scale survey item: x <- c("Agree", "Disagree", "Neutral", "Agree" ,"Agree", "Strongly disagree", "Neutral"). Create an ordered factor for the five-point Likert scale responses (Notice that you don’t have “Strongly agree” in x, but include “Strongly agree” in your factor levels).
x <- c("Agree", "Disagree", "Neutral", "Agree" ,"Agree", "Strongly disagree", "Neutral")
# you may want to do this
factor(x, levels = c("Strongly disagree", "Disagree", "Neutral",  "Agree", "Strongly agree"))
## [1] Agree             Disagree          Neutral           Agree            
## [5] Agree             Strongly disagree Neutral          
## Levels: Strongly disagree Disagree Neutral Agree Strongly agree
# not this
factor(x)
## [1] Agree             Disagree          Neutral           Agree            
## [5] Agree             Strongly disagree Neutral          
## Levels: Agree Disagree Neutral Strongly disagree
  • Using the following character vector x = c("male", "male", "female", "male", "female"), create a factor with levels reversed from its default levels order.
x = c("male", "male", "female", "male", "female")
# by default, female becomes the first level
factor(x)
## [1] male   male   female male   female
## Levels: female male
# The exercise asks you to change the default alphabetical order using the levels argument. 
factor(x, levels = c("male", "female"))
## [1] male   male   female male   female
## Levels: male female
  • Run the following code and explain what the code is doing.
# This exercise introduces the cut() function
set.seed(7)
x <- rnorm(100)
cut(x, breaks = quantile(x))
##   [1] (0.72,2.72]    (-1.79,-0.559] (-1.79,-0.559] (-0.559,0.106] (-1.79,-0.559]
##   [6] (-1.79,-0.559] (0.72,2.72]    (-0.559,0.106] (0.106,0.72]   (0.72,2.72]   
##  [11] (0.106,0.72]   (0.72,2.72]    (0.72,2.72]    (0.106,0.72]   (0.72,2.72]   
##  [16] (0.106,0.72]   (-1.79,-0.559] (-0.559,0.106] (-0.559,0.106] (0.72,2.72]   
##  [21] (0.72,2.72]    (0.106,0.72]   (0.72,2.72]    (-1.79,-0.559] (0.72,2.72]   
##  [26] (0.106,0.72]   (0.72,2.72]    (0.106,0.72]   (-1.79,-0.559] (-0.559,0.106]
##  [31] (-1.79,-0.559] (0.106,0.72]   (0.106,0.72]   (-0.559,0.106] (-0.559,0.106]
##  [36] (-1.79,-0.559] (0.72,2.72]    (-1.79,-0.559] (-0.559,0.106] (0.106,0.72]  
##  [41] (0.72,2.72]    (-1.79,-0.559] (-0.559,0.106] (-1.79,-0.559] (-0.559,0.106]
##  [46] (-0.559,0.106] (0.72,2.72]    (0.106,0.72]   (-0.559,0.106] (0.72,2.72]   
##  [51] (-0.559,0.106] (-0.559,0.106] (0.106,0.72]   (0.72,2.72]    (0.72,2.72]   
##  [56] (0.106,0.72]   (-1.79,-0.559] (0.106,0.72]   (0.106,0.72]   (-1.79,-0.559]
##  [61] (-0.559,0.106] (0.106,0.72]   (0.106,0.72]   (0.106,0.72]   (0.106,0.72]  
##  [66] (0.72,2.72]    (0.72,2.72]    (0.72,2.72]    (0.72,2.72]    (0.106,0.72]  
##  [71] (0.106,0.72]   (-1.79,-0.559] (-1.79,-0.559] (-1.79,-0.559] (-1.79,-0.559]
##  [76] (-1.79,-0.559] (-1.79,-0.559] (-0.559,0.106] (-0.559,0.106] (0.72,2.72]   
##  [81] (0.106,0.72]   (-0.559,0.106] (-0.559,0.106] (-0.559,0.106] (-1.79,-0.559]
##  [86] (-0.559,0.106] (-0.559,0.106] (-0.559,0.106] <NA>           (0.106,0.72]  
##  [91] (0.72,2.72]    (-1.79,-0.559] (0.106,0.72]   (0.106,0.72]   (0.72,2.72]   
##  [96] (-1.79,-0.559] (0.72,2.72]    (-1.79,-0.559] (-0.559,0.106] (-0.559,0.106]
## Levels: (-1.79,-0.559] (-0.559,0.106] (0.106,0.72] (0.72,2.72]
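In short, cut() divides a numeric vector into intervals and returns a factor, and quantile(x) supplies the minimum, the three quartiles, and the maximum as break points. The single <NA> in the output above arises because the intervals are open on the left by default, so the minimum of x falls outside the first interval. A sketch of one way to include it, using the include.lowest argument; the names q, g1, and g2 are illustrative.

```r
set.seed(7)
x <- rnorm(100)
q <- quantile(x)                                   # min, 25%, 50%, 75%, max
g1 <- cut(x, breaks = q)                           # the minimum is left out
g2 <- cut(x, breaks = q, include.lowest = TRUE)    # the minimum is kept
sum(is.na(g1))    # 1
sum(is.na(g2))    # 0
table(g2)         # counts per quartile group
```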

5.7 Lists

5.7.1 What is a list?

  • A list is a one-dimensional heterogeneous data structure.
    • Because a list is a one-dimensional data structure, we can index the elements of a list using a single number.
    • Unlike an atomic vector, a list is a heterogeneous data structure, meaning that the elements of a list can be any object in R.

5.7.2 Creating a list

  • list() is used to create a list.
x <- list(1:6, "a", c(TRUE, TRUE, FALSE), c(1.2, 3.3, 4.6, 6.6))
x
## [[1]]
## [1] 1 2 3 4 5 6
## 
## [[2]]
## [1] "a"
## 
## [[3]]
## [1]  TRUE  TRUE FALSE
## 
## [[4]]
## [1] 1.2 3.3 4.6 6.6
# str() displays the internal structure of an R object
str(x)
## List of 4
##  $ : int [1:6] 1 2 3 4 5 6
##  $ : chr "a"
##  $ : logi [1:3] TRUE TRUE FALSE
##  $ : num [1:4] 1.2 3.3 4.6 6.6
typeof(x)
## [1] "list"
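A short sketch of indexing a list. Single brackets return a smaller list, while double brackets return the element itself. The named list y is added only for illustration.

```r
x <- list(1:6, "a", c(TRUE, TRUE, FALSE), c(1.2, 3.3, 4.6, 6.6))
x[1]         # a list of length 1 that contains 1:6
x[[1]]       # the integer vector 1:6 itself
x[[1]][2]    # 2
y <- list(num = 1:3, chr = "a")   # elements can also have names
y$num        # 1 2 3
y[["chr"]]   # "a"
```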

5.7.3 Why lists?

  • Because of its flexible structure, many R functions store their outputs as a list, and return the list.
# In R, lm() is a function that fits a regression model to data.
# In the following R expression, `mpg` is the dependent variable
# and `disp` and `cyl` are the independent variables. 
fit <- lm(mpg ~ disp + cyl, data = mtcars)
fit
## 
## Call:
## lm(formula = mpg ~ disp + cyl, data = mtcars)
## 
## Coefficients:
## (Intercept)         disp          cyl  
##    34.66099     -0.02058     -1.58728
typeof(fit)
## [1] "list"
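Because the fitted model is a list, its components can be pulled out with the usual list tools. A minimal sketch that refits the same model so it runs on its own:

```r
fit <- lm(mpg ~ disp + cyl, data = mtcars)
names(fit)              # the names of the list components
fit$coefficients        # named numeric vector of estimates
fit[["df.residual"]]    # 29
class(fit)              # "lm"
```

The class attribute "lm" is what makes printing fit show a tidy summary rather than the raw list.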
str(fit)
## List of 12
##  $ coefficients : Named num [1:3] 34.661 -0.0206 -1.5873
##   ..- attr(*, "names")= chr [1:3] "(Intercept)" "disp" "cyl"
##  $ residuals    : Named num [1:32] -0.844 -0.844 -3.289 1.573 4.147 ...
##   ..- attr(*, "names")= chr [1:32] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...
##  $ effects      : Named num [1:32] -113.65 -28.44 -6.81 2.04 4.06 ...
##   ..- attr(*, "names")= chr [1:32] "(Intercept)" "disp" "cyl" "" ...
##  $ rank         : int 3
##  $ fitted.values: Named num [1:32] 21.8 21.8 26.1 19.8 14.6 ...
##   ..- attr(*, "names")= chr [1:32] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...
##  $ assign       : int [1:3] 0 1 2
##  $ qr           :List of 5
##   ..$ qr   : num [1:32, 1:3] -5.657 0.177 0.177 0.177 0.177 ...
##   .. ..- attr(*, "dimnames")=List of 2
##   .. .. ..$ : chr [1:32] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...
##   .. .. ..$ : chr [1:3] "(Intercept)" "disp" "cyl"
##   .. ..- attr(*, "assign")= int [1:3] 0 1 2
##   ..$ qraux: num [1:3] 1.18 1.09 1.19
##   ..$ pivot: int [1:3] 1 2 3
##   ..$ tol  : num 1e-07
##   ..$ rank : int 3
##   ..- attr(*, "class")= chr "qr"
##  $ df.residual  : int 29
##  $ xlevels      : Named list()
##  $ call         : language lm(formula = mpg ~ disp + cyl, data = mtcars)
##  $ terms        :Classes 'terms', 'formula'  language mpg ~ disp + cyl
##   .. ..- attr(*, "variables")= language list(mpg, disp, cyl)
##   .. ..- attr(*, "factors")= int [1:3, 1:2] 0 1 0 0 0 1
##   .. .. ..- attr(*, "dimnames")=List of 2
##   .. .. .. ..$ : chr [1:3] "mpg" "disp" "cyl"
##   .. .. .. ..$ : chr [1:2] "disp" "cyl"
##   .. ..- attr(*, "term.labels")= chr [1:2] "disp" "cyl"
##   .. ..- attr(*, "order")= int [1:2] 1 1
##   .. ..- attr(*, "intercept")= int 1
##   .. ..- attr(*, "response")= int 1
##   .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
##   .. ..- attr(*, "predvars")= language list(mpg, disp, cyl)
##   .. ..- attr(*, "dataClasses")= Named chr [1:3] "numeric" "numeric" "numeric"
##   .. .. ..- attr(*, "names")= chr [1:3] "mpg" "disp" "cyl"
##  $ model        :'data.frame':	32 obs. of  3 variables:
##   ..$ mpg : num [1:32] 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##   ..$ disp: num [1:32] 160 160 108 258 360 ...
##   ..$ cyl : num [1:32] 6 6 4 6 8 6 8 4 4 6 ...
##   ..- attr(*, "terms")=Classes 'terms', 'formula'  language mpg ~ disp + cyl
##   .. .. ..- attr(*, "variables")= language list(mpg, disp, cyl)
##   .. .. ..- attr(*, "factors")= int [1:3, 1:2] 0 1 0 0 0 1
##   .. .. .. ..- attr(*, "dimnames")=List of 2
##   .. .. .. .. ..$ : chr [1:3] "mpg" "disp" "cyl"
##   .. .. .. .. ..$ : chr [1:2] "disp" "cyl"
##   .. .. ..- attr(*, "term.labels")= chr [1:2] "disp" "cyl"
##   .. .. ..- attr(*, "order")= int [1:2] 1 1
##   .. .. ..- attr(*, "intercept")= int 1
##   .. .. ..- attr(*, "response")= int 1
##   .. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
##   .. .. ..- attr(*, "predvars")= language list(mpg, disp, cyl)
##   .. .. ..- attr(*, "dataClasses")= Named chr [1:3] "numeric" "numeric" "numeric"
##   .. .. .. ..- attr(*, "names")= chr [1:3] "mpg" "disp" "cyl"
##  - attr(*, "class")= chr "lm"
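
  • The str() output above is long. As a lighter-weight sketch, the component names alone are often enough to find what you need, and comparing typeof() with class() shows that an "lm" object is an ordinary list with a class attribute attached.

```r
fit <- lm(mpg ~ disp + cyl, data = mtcars)

# typeof() reports the underlying storage type; class() reports the S3 class
typeof(fit)
## [1] "list"
class(fit)
## [1] "lm"
is.list(fit)
## [1] TRUE

# names() lists the 12 components without printing their contents
names(fit)
```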

5.7.4 Subsetting a List

  • Subsetting a list works in the same way as subsetting an atomic vector. Using [ ] will always return a list; [[ ]] and $ let you pull out the components of the list.
my.lst <- list(stud.id=34453,      # create a list
               stud.name="John", 
               stud.marks=c(14.3,12,15,19))
my.lst
## $stud.id
## [1] 34453
## 
## $stud.name
## [1] "John"
## 
## $stud.marks
## [1] 14.3 12.0 15.0 19.0
# [ ] extracts a sub-list 
my.lst[1]
## $stud.id
## [1] 34453
typeof(my.lst[1])
## [1] "list"
# [[ ]] extracts the value of an individual element 
my.lst[[1]]
## [1] 34453
typeof(my.lst[[1]])
## [1] "double"
# my.lst[[3]] will index the third element of a list, which is a numeric vector
# my.lst[[3]][2] will index the second element of the numeric vector
my.lst[[3]][2]
## [1] 12
# In the case of lists with named elements
# $ extracts the value of an individual element 
my.lst$stud.id
## [1] 34453
typeof(my.lst$stud.id)
## [1] "double"
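
  • Names work with [ ] and [[ ]] as well as with $. The sketch below rebuilds the same list and shows both forms; the variable names `sub` and `field` are illustrative only.

```r
my.lst <- list(stud.id = 34453,
               stud.name = "John",
               stud.marks = c(14.3, 12, 15, 19))

# [[ ]] accepts a name; this is equivalent to my.lst$stud.name
my.lst[["stud.name"]]
## [1] "John"

# [ ] with several names returns a shorter list
sub <- my.lst[c("stud.id", "stud.marks")]
length(sub)
## [1] 2
typeof(sub)
## [1] "list"

# With [[ ]] the name can be stored in a variable, which $ does not support
field <- "stud.marks"
mean(my.lst[[field]])
## [1] 15.075
```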

5.7.5 Exercise on lists

fit is a list that contains the outputs of the lm() function for linear regression. Explore the structure of the fit object using str().

fit <- lm(mpg ~ disp + cyl, data = mtcars)
fit
## 
## Call:
## lm(formula = mpg ~ disp + cyl, data = mtcars)
## 
## Coefficients:
## (Intercept)         disp          cyl  
##    34.66099     -0.02058     -1.58728
  • Extract the coefficient of “Intercept” with indexing using a positive integer.
# fit$coefficients is a named numeric vector
fit$coefficients
## (Intercept)        disp         cyl 
## 34.66099474 -0.02058363 -1.58727681
# So, we can subset the first element using the following expression
fit$coefficients[1]
## (Intercept) 
##    34.66099
  • Extract the coefficient of “Intercept” with indexing using a name.
# We can also use the name of element for indexing. 
fit$coefficients["(Intercept)"]
## (Intercept) 
##    34.66099

5.7.6 Data frames

  • A data frame is a list of equal-length vectors.
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
typeof(iris)
## [1] "list"
str(iris)
## 'data.frame':	150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
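
  • Because a data frame is a list of columns, the list tools from the previous sections apply to it directly. A short sketch using the built-in iris data:

```r
# A data frame is a list whose elements are the columns
is.list(iris)
## [1] TRUE
length(iris)                 # the number of columns, not rows
## [1] 5

# List-style subsetting works on columns
class(iris[["Species"]])     # [[ ]] extracts the column itself
## [1] "factor"
class(iris["Species"])       # [ ] returns a one-column data frame
## [1] "data.frame"

# Every column has the same length, which is the number of rows
unique(lengths(iris))
## [1] 150
```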

5.7.7 The apply family of functions

  • The apply() family of functions refers to apply(), lapply(), sapply(), vapply(), mapply(), rapply(), and tapply().
  • Why do we need them?
    • They make your code much shorter by replacing copied-and-pasted lines that repeat the same operation on each element.
# A motivating example: check the number of missing values in each column of the following data frame 'm'
m <- data.frame(matrix(c(1,2,3,4,NA,6,7,NA,NA,NA,NA,NA), ncol = 4))
m
##   X1 X2 X3 X4
## 1  1  4  7 NA
## 2  2 NA NA NA
## 3  3  6 NA NA
sum(is.na(m$X1))
## [1] 0
sum(is.na(m$X2))
## [1] 1
sum(is.na(m$X3))
## [1] 2
sum(is.na(m$X4))
## [1] 3
  • lapply(X, FUN)
    • X = a list object in R
    • FUN = a function in R
  • lapply()
    • takes a function (FUN)
    • applies it to each element of a list (X)
    • and returns the results in the form of a list
# This one line of code still works even when the number of columns is 1000 or more.
lapply(m, function(x) sum(is.na(x)))
## $X1
## [1] 0
## 
## $X2
## [1] 1
## 
## $X3
## [1] 2
## 
## $X4
## [1] 3
# lapply() returns a list, whereas sapply() returns a vector, matrix, or array. 
sapply(m, function(x) sum(is.na(x)))
## X1 X2 X3 X4 
##  0  1  2  3
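
  • vapply(), also listed above, behaves like sapply() but asks you to declare what each call returns. The sketch below applies it to the same data frame; the name `na_counts` is illustrative.

```r
m <- data.frame(matrix(c(1,2,3,4,NA,6,7,NA,NA,NA,NA,NA), ncol = 4))

# The third argument, integer(1), says each call must return exactly one integer;
# vapply() raises an error if any result does not match
na_counts <- vapply(m, function(x) sum(is.na(x)), integer(1))
na_counts
## X1 X2 X3 X4 
##  0  1  2  3

# The result is a named vector, so it can be subset like any other vector
names(na_counts)[na_counts >= 2]
## [1] "X3" "X4"
```

  • Declaring the return type makes the result predictable: sapply() may return a vector, a matrix, or a list depending on the input, whereas vapply() either returns the declared shape or fails.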

5.7.8 Exercise

  • How many columns in the bfi dataset have more than 20 missing values?
library(psych)
## 
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
head(bfi)
##       A1 A2 A3 A4 A5 C1 C2 C3 C4 C5 E1 E2 E3 E4 E5 N1 N2 N3 N4 N5 O1 O2 O3 O4
## 61617  2  4  3  4  4  2  3  3  4  4  3  3  3  4  4  3  4  2  2  3  3  6  3  4
## 61618  2  4  5  2  5  5  4  4  3  4  1  1  6  4  3  3  3  3  5  5  4  2  4  3
## 61620  5  4  5  4  4  4  5  4  2  5  2  4  4  4  5  4  5  4  2  3  4  2  5  5
## 61621  4  4  6  5  5  4  4  3  5  5  5  3  4  4  4  2  5  2  4  1  3  3  4  3
## 61622  2  3  3  4  5  4  4  5  3  2  2  2  5  4  5  2  3  4  4  3  3  3  4  3
## 61623  6  6  5  6  5  6  6  6  1  3  2  1  6  5  6  3  5  2  2  3  4  3  5  6
##       O5 gender education age
## 61617  3      1        NA  16
## 61618  3      2        NA  18
## 61620  2      2        NA  17
## 61621  5      2        NA  17
## 61622  3      1        NA  17
## 61623  1      2         3  21

5.7.9 More resources

5.8 Matrices and Arrays

  • Matrices and arrays are implemented as vectors with special attributes.
  • Adding a dim attribute to an atomic vector allows it to behave like a multi-dimensional array.
a <- 1:6
a
## [1] 1 2 3 4 5 6
# attr() gets or sets a specific attribute of an object.
attr(a, "dim") <- c(3,2)
a
##      [,1] [,2]
## [1,]    1    4
## [2,]    2    5
## [3,]    3    6
# by default, a matrix is filled by column
a <- matrix(1:6, ncol=3, nrow=2)
a
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
a <- matrix(1:6, ncol=3)
a
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
# a matrix can be filled by row using `byrow = TRUE`
a <- matrix(1:6, ncol=3, nrow=2, byrow = TRUE)
a
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6
attributes(a)
## $dim
## [1] 2 3
dim(a)
## [1] 2 3
b <- array(1:12, c(2,3,2))
b
## , , 1
## 
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
## 
## , , 2
## 
##      [,1] [,2] [,3]
## [1,]    7    9   11
## [2,]    8   10   12
dim(b)
## [1] 2 3 2
  • length() generalises to nrow() and ncol() for matrices, and dim() for arrays.
  • names() generalises to rownames() and colnames() for matrices, and dimnames(), a list of character vectors, for arrays.
results <- matrix(c(10, 30, 40, 50, 43, 56, 21, 30), 2, 4, byrow = TRUE)
colnames(results) <- c("1qrt", "2qrt", "3qrt", "4qrt")
rownames(results) <- c("store1", "store2")
results
##        1qrt 2qrt 3qrt 4qrt
## store1   10   30   40   50
## store2   43   56   21   30
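
  • As a brief sketch of the functions named above, the following rebuilds the `results` matrix and queries its size and names.

```r
results <- matrix(c(10, 30, 40, 50, 43, 56, 21, 30), 2, 4, byrow = TRUE)
colnames(results) <- c("1qrt", "2qrt", "3qrt", "4qrt")
rownames(results) <- c("store1", "store2")

nrow(results)
## [1] 2
ncol(results)
## [1] 4
length(results)      # total number of cells
## [1] 8

# dimnames() returns a list: row names first, then column names
dimnames(results)
## [[1]]
## [1] "store1" "store2"
## 
## [[2]]
## [1] "1qrt" "2qrt" "3qrt" "4qrt"

# Row and column names can be used for subsetting
results["store2", "3qrt"]
## [1] 21
```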

5.8.1 Subsetting a Matrix and Array

  • You can supply a 1d index for each dimension, separated by a comma.
a <- matrix(1:9, nrow = 3)
colnames(a) <- c("A", "B", "C")
a
##      A B C
## [1,] 1 4 7
## [2,] 2 5 8
## [3,] 3 6 9
a[c(TRUE, FALSE, TRUE), c("B", "A")]
##      B A
## [1,] 4 1
## [2,] 6 3
a[1, c(2,3)]
## B C 
## 4 7
# If you omit any dimension, you obtain full columns or rows
a[2,]
## A B C 
## 2 5 8
a[,3]
## [1] 7 8 9
a
##      A B C
## [1,] 1 4 7
## [2,] 2 5 8
## [3,] 3 6 9
a > 3
##          A    B    C
## [1,] FALSE TRUE TRUE
## [2,] FALSE TRUE TRUE
## [3,] FALSE TRUE TRUE
a[a>3] <- NA
a
##      A  B  C
## [1,] 1 NA NA
## [2,] 2 NA NA
## [3,] 3 NA NA
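
  • Two details of matrix subsetting are easy to miss: a single row or column is simplified to a plain vector, and a single index without a comma treats the matrix as the vector it is built on. A minimal sketch, starting again from the original 3 x 3 matrix:

```r
a <- matrix(1:9, nrow = 3)
colnames(a) <- c("A", "B", "C")

# Selecting a single column drops the result to a plain vector by default
dim(a[, 3])
## NULL

# drop = FALSE keeps the result as a 3 x 1 matrix
dim(a[, 3, drop = FALSE])
## [1] 3 1

# A matrix is a vector with a dim attribute, so a single index
# counts through the cells column by column
a[4]
## [1] 4
```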

5.8.2 Exercise

  • mtcars is a fuel economy dataset. Subset the mtcars dataset so that you keep only the mpg, cyl, and gear variables for cars with 6 cylinders.

  • Subset the mtcars dataset so that you keep only the mpg, cyl, disp, hp, drat, wt, qsec, and am variables for cars with 4 or 6 cylinders.

  • Subset the mtcars dataset so that you keep only the mpg, cyl, disp, hp, drat, wt, qsec, and am variables for cars with 4 or 6 cylinders and mpg larger than 20.
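
  • One possible approach to the first item is sketched below; it is not the only valid answer. A data frame accepts the same two-index subsetting as a matrix, with a logical vector selecting rows and a character vector selecting columns. The name `six_cyl` is illustrative.

```r
# Rows: cars with 6 cylinders; columns: mpg, cyl, and gear
six_cyl <- mtcars[mtcars$cyl == 6, c("mpg", "cyl", "gear")]
dim(six_cyl)
## [1] 7 3
```

  • For the remaining items, conditions can be combined with & (and) and | (or), and `%in%` tests membership in a set of values.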

5.8.3 Combine Matrices by Columns or Rows

# combine by columns
cbind(a,a)   
##      A  B  C A  B  C
## [1,] 1 NA NA 1 NA NA
## [2,] 2 NA NA 2 NA NA
## [3,] 3 NA NA 3 NA NA
# combine by rows
rbind(a,a)   
##      A  B  C
## [1,] 1 NA NA
## [2,] 2 NA NA
## [3,] 3 NA NA
## [4,] 1 NA NA
## [5,] 2 NA NA
## [6,] 3 NA NA

5.8.4 Names of the Columns and Rows of Matrices

colnames(a) 
## [1] "A" "B" "C"
rownames(a) <- c("D", "E", "F")
rownames(a)
## [1] "D" "E" "F"

5.9 Data Frames

  • A data frame is a list of equal-length vectors.
  • A data frame is the most common way of storing data in R.
df <- data.frame(x=1:3, y=c("a", "b", "c"))
df
##   x y
## 1 1 a
## 2 2 b
## 3 3 c
# display the internal structure of an R object
str(df) 
## 'data.frame':	3 obs. of  2 variables:
##  $ x: int  1 2 3
##  $ y: chr  "a" "b" "c"
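  • Before R 4.0.0, data.frame() converted strings into factors by default; since R 4.0.0, strings stay as character vectors (as in the output above).
    • The old default can cause serious problems in some cases, so check which version of R you are using.
    • stringsAsFactors = FALSE makes the behaviour explicit and suppresses the conversion in any version.
    • Using str() to check data types is always good practice.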
df <- data.frame(x=1:3, y=c("a", "b", "c"), stringsAsFactors = FALSE)
str(df) # display the internal structure of an R object
## 'data.frame':	3 obs. of  2 variables:
##  $ x: int  1 2 3
##  $ y: chr  "a" "b" "c"
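
  • The following sketch sets stringsAsFactors explicitly both ways, so the difference is visible regardless of the R version in use. The names `df_chr` and `df_fct` are illustrative.

```r
df_chr <- data.frame(x = 1:3, y = c("a", "b", "c"), stringsAsFactors = FALSE)
df_fct <- data.frame(x = 1:3, y = c("a", "b", "c"), stringsAsFactors = TRUE)

class(df_chr$y)
## [1] "character"
class(df_fct$y)
## [1] "factor"

# A factor stores integer codes plus a set of labels (levels),
# so converting it to integers returns the codes, not the text
as.integer(df_fct$y)
## [1] 1 2 3
levels(df_fct$y)
## [1] "a" "b" "c"
```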

5.10 Control Flow

  • Control flow = the order in which individual statements are executed (or evaluated)

5.10.1 if-else

  • Selection
    • if (condition) expression: If the condition is TRUE, the expression gets executed.
    • if (condition) expression1 else expression2: The else part is only executed if the condition is FALSE.
x <- -5
if (x>0) {
  print("Positive number")
} else {
  print("Negative number")
}
## [1] "Negative number"
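
  • The example above treats every non-positive value, including zero, as negative. As a sketch of two common extensions, an else if branch can handle the third case, and ifelse() applies a condition to every element of a vector. The names `v` and `labels` are illustrative.

```r
x <- 0
if (x > 0) {
  print("Positive number")
} else if (x < 0) {
  print("Negative number")
} else {
  print("Zero")
}
## [1] "Zero"

# if () expects a single TRUE or FALSE; ifelse() is the vectorised counterpart
v <- c(-5, 0, 3)
labels <- ifelse(v > 0, "positive", "not positive")
labels
## [1] "not positive" "not positive" "positive"
```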

5.10.2 for

  • for (value in sequence) {statements}
  • A for loop lets us iterate over the elements of a vector and run the code inside the curly brackets once for each element.
for (i in 1:3) {
  print(i^2)
}
## [1] 1
## [1] 4
## [1] 9
# count the number of even numbers
x <- c(2,5,3,9,8,11,6)
count <- 0
for (val in x) {
  if (val %% 2 == 0) count <- count + 1
}
print(count)
## [1] 3
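
  • Many loops like the one above can be replaced by vectorised expressions, because R operators work on whole vectors. The sketch below counts the even numbers without a loop, and then shows a loop over positions for cases where a loop is still needed. The name `squares` is illustrative.

```r
x <- c(2, 5, 3, 9, 8, 11, 6)

# x %% 2 == 0 is a logical vector; sum() counts the TRUE values
sum(x %% 2 == 0)
## [1] 3

# When the position is needed, loop over seq_along(x) and
# fill a result vector that was created in advance
squares <- numeric(length(x))
for (i in seq_along(x)) {
  squares[i] <- x[i]^2
}
squares
## [1]   4  25   9  81  64 121  36
```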

5.11 Further reading

  • Wickham, H. (2014). Advanced R. Chapman and Hall/CRC
    • http://adv-r.had.co.nz
    • This is a nice book to read after you become comfortable in base R (not required in this course)