Chapter 6 Base-R
6.1 Objects, variables, and assignment operator
In R, as in any programming language, objects, variables, and the assignment operator are closely related concepts.
The official R Language Definition states those concepts as follows:
“In every computer language variables provide a means of accessing the data stored in memory. R does not provide direct access to the computer’s memory but rather provides a number of specialized data structures we will refer to as objects. These objects are referred to through symbols or variables.” — R Language Definition
Put simply, data are stored in the computer’s memory in the form of an object, and a variable name points to (or binds to, or references) that object.
For example, the following R code
- creates an object, a vector of values c(1,2,3), in the computer’s memory
- and binds the object to a name x using the assignment operator <-
# The assignment operator binds the object c(1,2,3) to a name `x`
x <- c(1,2,3)

x points to a vector in memory (this image came from Hadley Wickham’s Advanced R)
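A minimal sketch of what binding means in practice, assuming a fresh R session. The second name y is introduced only for illustration: assigning x to y binds another name to the same values, and modifying y afterwards leaves x unchanged.

```r
# Bind the vector c(1, 2, 3) to the name x
x <- c(1, 2, 3)

# Bind a second (illustrative) name y to the same values
y <- x

# Modifying y does not change x; R copies the vector when it is modified
y[1] <- 100

x   # still 1 2 3
y   # now 100 2 3
```

This is why a name is better thought of as a label attached to an object than as a box that holds it.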
6.2 Functions
In R (or in any programming language), a function is a block of code that performs a single task when the function is called.
A function requires
- arguments whose values will be used when the function is called
- and a body, which is a group of R expressions contained in curly braces ({ and })
A function can return values as a result of the task defined by the body of the function.
In R, both the arguments that we provide when we call the function and the result of the function execution are R objects.
- Learning different types of R objects or data structure in R is important in effectively using functions in R.
6.2.1 An Example of Functions
# call (execute or run) the mean() function
mean(c(1,2,3,4))
## [1] 2.5
# a will store the object returned by mean()
a <- mean(c(1,2,3,4))
# mean() returns NA when the vector contains NA
mean(c(1,2,NA,4))
## [1] NA
# When na.rm = TRUE, NA will be removed before computation
mean(c(1,2,NA,4), na.rm = TRUE)
## [1] 2.333333
6.2.2 User-Defined Functions
- We can write our own functions easily
- function.name <- function(arg1, arg2, arg3){body}
# Define the se() function that calculates the standard error of the mean
se <- function(x) {
  v <- var(x)     # sample variance
  n <- length(x)  # sample size
  return(sqrt(v/n))
}
mySample <- rnorm(n=100, mean=20, sd=4)
se(mySample)
## [1] 0.45517
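As a hedged extension of the se() example, the sketch below adds an argument with a default value. The function name se_na and its na.rm handling are illustrative assumptions, not part of the original example.

```r
# se_na() is a hypothetical variant of se() that can drop missing values
se_na <- function(x, na.rm = FALSE) {
  if (na.rm) {
    x <- x[!is.na(x)]   # keep only the non-missing elements
  }
  v <- var(x)           # sample variance
  n <- length(x)        # number of observations used
  sqrt(v / n)           # the last evaluated expression is returned
}

se_na(c(2, 4, 6, NA))                # NA, because var() sees a missing value
se_na(c(2, 4, 6, NA), na.rm = TRUE)  # same as se_na(c(2, 4, 6))
```

Because na.rm has a default, callers who do not need it can keep writing se_na(x), which mirrors how mean() treats its own na.rm argument.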
6.2.3 Exercise on functions
The following code will generate two numeric vectors randomly sampled from N(0,1) and N(3,2).
x1 <- rnorm(100, mean=0, sd=1) # generate 100 random numbers from Normal(0,1)
x2 <- rnorm(100, mean=3, sd=2) # generate 100 random numbers from Normal(3,2)
Write your own function that returns a (simplified) Cohen’s d, $d = \frac{\text{mean}(x_2) - \text{mean}(x_1)}{\text{sd}(x_1)}$. Specifically, your function should take the above two vectors x1 and x2 as function arguments and return d. For fun, use your own name as the name of this function. Check whether your function actually works by running your_name(x1,x2).
6.2.4 Some Comments on Functions
- Functions are a fundamental building block of R.
- We can create our own functions, but we usually use functions made by others.
- Packages are a collection of functions made by others.
- In many cases, our job is to build a pipeline of data flow by connecting many available functions.
- To do that, we have to handle the input objects (argument) and output objects (returned objects) of functions, which requires knowledge about data structure (e.g., creating, subsetting).
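The idea of connecting functions into a pipeline can be sketched with base R functions only. The intermediate names (roots, avg, rounded) are illustrative.

```r
x <- c(1, 4, 9, NA)

# Step by step: each function's output becomes the next function's input
roots   <- sqrt(x)                     # 1 2 3 NA
avg     <- mean(roots, na.rm = TRUE)   # 2
rounded <- round(avg, 1)

# The same pipeline written as nested calls
round(mean(sqrt(x), na.rm = TRUE), 1)
```

Both forms give the same result. The step-by-step form makes it easier to inspect the object that each function returns.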
6.3 Operators
- Arithmetic Operators
Operator | Description |
---|---|
+ | addition |
- | subtraction |
* | multiplication |
/ | division |
^ or ** | exponentiation |
- Logical Operators
Operator | Description |
---|---|
< | less than |
<= | less than or equal to |
> | greater than |
>= | greater than or equal to |
== | exactly equal to |
!= | not equal to |
!x | Not x |
x|y | x OR y |
x&y | x AND y |
1 == 2
## [1] FALSE
"a" != "b"
## [1] TRUE
(1 == 2) | ("a" != "b")
## [1] TRUE
(1 == 2) & ("a" != "b")
## [1] FALSE
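The operators in the tables above also work element by element on vectors. A small sketch, with illustrative vector names v1 and v2:

```r
v1 <- c(1, 2, 3)
v2 <- c(3, 2, 1)

v1 + v2                # element-wise addition: 4 4 4
v1 ^ 2                 # element-wise exponentiation: 1 4 9
v1 == v2               # element-wise comparison: FALSE TRUE FALSE
(v1 < 3) & (v2 < 3)    # TRUE only where both conditions hold
(v1 < 3) | (v2 < 3)    # TRUE where at least one condition holds
!(v1 == v2)            # negation of the comparison
```

Wrapping each comparison in parentheses, as in the examples above, keeps the intended grouping explicit.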
6.4 Data Structure
- R has base data structures.
- Almost all other objects are built upon base data structures.
- R base data structures can be organized by their dimensionality:
Dimension | Homogeneous | Heterogeneous |
---|---|---|
1D | Atomic vector | List |
2D | Matrix | Data frame |
nD | Array |
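To make the table concrete, the sketch below creates one small object of each kind and checks its dimensions. All object names are illustrative.

```r
v  <- c(1, 2, 3)                                 # 1D, homogeneous: atomic vector
l  <- list(1, "a", TRUE)                         # 1D, heterogeneous: list
m  <- matrix(1:6, nrow = 2)                      # 2D, homogeneous: matrix
df <- data.frame(id = 1:2, name = c("a", "b"))   # 2D, heterogeneous: data frame
ar <- array(1:24, dim = c(2, 3, 4))              # nD, homogeneous: array

dim(v)    # NULL: a plain vector has no dim attribute
dim(m)    # 2 3
dim(df)   # 2 2
dim(ar)   # 2 3 4
```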
6.5 Vectors
6.5.1 Vectors Come in Two Flavours
- Atomic vectors (homogeneous)
- All elements of an atomic vector must be the same type.
- There are 6 types of atomic vectors
- Logical (TRUE or FALSE), integer, double, and character (+ rarely used complex and raw)
- Atomic vectors are usually created with c(), short for combine:
a <- c(TRUE, FALSE, T, F) # logical
a <- c(1L, 6L, 5L) # integer
a <- c(1, 2.5, 3.8) # double
a <- c("apple", "orange") # character
- Lists (heterogeneous)
- Lists are different from atomic vectors because their elements can be of any type.
- Lists are created by list():
x <- list(1:3, "a", c(TRUE, FALSE))
6.5.2 A Vector Has Three Properties
- Type: typeof() returns the type of an object.
typeof(c(1,2,3))
## [1] "double"
- Length: length() returns the number of elements in a vector.
length(c(1,2,3))
## [1] 3
- Attributes: attributes() returns additional arbitrary metadata.
attributes(c(1,2,3))
## NULL
6.5.3 Attributes
- All objects can have attributes to store metadata about the object.
- Attributes can be considered as a named list.
- Attributes can be accessed individually with attr() or all at once with attributes().
- Names are attributes of a vector. You can name a vector in two ways:
a <- c(x=1,y=2,z=3) # when creating
names(a)
## [1] "x" "y" "z"
names(a) <- c("l", "m", "n") # by modifying existing names
a
## l m n
## 1 2 3
attributes(a) # names are attributes
## $names
## [1] "l" "m" "n"
attributes(mtcars)
## $names
## [1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear"
## [11] "carb"
##
## $row.names
## [1] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710"
## [4] "Hornet 4 Drive" "Hornet Sportabout" "Valiant"
## [7] "Duster 360" "Merc 240D" "Merc 230"
## [10] "Merc 280" "Merc 280C" "Merc 450SE"
## [13] "Merc 450SL" "Merc 450SLC" "Cadillac Fleetwood"
## [16] "Lincoln Continental" "Chrysler Imperial" "Fiat 128"
## [19] "Honda Civic" "Toyota Corolla" "Toyota Corona"
## [22] "Dodge Challenger" "AMC Javelin" "Camaro Z28"
## [25] "Pontiac Firebird" "Fiat X1-9" "Porsche 914-2"
## [28] "Lotus Europa" "Ford Pantera L" "Ferrari Dino"
## [31] "Maserati Bora" "Volvo 142E"
##
## $class
## [1] "data.frame"
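The examples above use attributes() to see everything at once. A short sketch of attr(), which reads or sets a single attribute, follows. The "unit" attribute is made up for illustration.

```r
a <- c(x = 1, y = 2, z = 3)

# attr() reads or sets one attribute at a time
attr(a, "names")           # "x" "y" "z"
attr(a, "unit") <- "cm"    # "unit" is a made-up attribute for illustration
attributes(a)              # now lists both names and unit

# attr() works the same way on a data frame
attr(mtcars, "class")      # "data.frame"
```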
6.5.4 Type Coercion (Conversion)
- All elements of an atomic vector must belong to the same base data type. If they do not, R will automatically force them to a common type by type coercion.
v <- c(4, 7, 23.5, 76.2, 80, "rrt")
v
## [1] "4" "7" "23.5" "76.2" "80" "rrt"
typeof(v)
## [1] "character"
- Functions can automatically convert data type.
sum(c(TRUE, FALSE, TRUE))
## [1] 2
- You can explicitly convert data type with as.character(), as.double(), as.integer(), and as.logical().
a <- c(1,2,3)
a
## [1] 1 2 3
b <- as.character(a)
b
## [1] "1" "2" "3"
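When types are mixed, coercion moves toward the more flexible type: logical, then integer, then double, then character. The sketch below checks this with typeof() and shows what happens when an explicit conversion cannot succeed. The name n is illustrative.

```r
typeof(c(TRUE, 1L))    # "integer":   logical is coerced to integer
typeof(c(1L, 2.5))     # "double":    integer is coerced to double
typeof(c(2.5, "a"))    # "character": double is coerced to character

# An explicit conversion that cannot succeed yields NA (with a warning,
# which is suppressed here to keep the output clean)
n <- suppressWarnings(as.integer(c("1", "2", "three")))
n                      # 1 2 NA
```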
6.5.5 NA represents missing
u <- c(4, 6, NA, 2)
u
## [1] 4 6 NA 2
k <- c(TRUE, FALSE, FALSE, NA, TRUE)
k
## [1] TRUE FALSE FALSE NA TRUE
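Missing values propagate through most computations, which is worth seeing once. A brief sketch using the same vector u:

```r
u <- c(4, 6, NA, 2)

u > 3                  # TRUE TRUE NA FALSE: comparing with NA gives NA
u == NA                # all NA: do not use == to test for missing values
is.na(u)               # FALSE FALSE TRUE FALSE: the right way to find NAs
sum(u)                 # NA
sum(u, na.rm = TRUE)   # 12
typeof(NA)             # "logical"
typeof(c(4, NA))       # "double": NA takes on the type of the vector
```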
6.5.6 Generate a vector
# we can manually type the elements of a vector using c()
a <- c(1,2,3,4,5)
a
## [1] 1 2 3 4 5
# c() also combines vectors
a <- c(1,2,3)
b <- c(4,5,6)
c <- c(a, b)
c
## [1] 1 2 3 4 5 6
# k:n generates a vector whose elements are the sequence of numbers from k to n
1:10
## [1] 1 2 3 4 5 6 7 8 9 10
# seq() generates regular sequence
# seq(from, to)
seq(1, 10)
## [1] 1 2 3 4 5 6 7 8 9 10
# seq(from, to, by)
seq(1, 10, 2)
## [1] 1 3 5 7 9
# rep(x, times) replicates the values in x multiple times
# x can be a number or vector
# replicates 1 5 times
rep(1, 5)
## [1] 1 1 1 1 1
# replicates c(1, 2) 5 times
rep(c(1,2), 5)
## [1] 1 2 1 2 1 2 1 2 1 2
# each element of c(1,2) is repeated 5 times
rep(c(1,2), each = 5)
## [1] 1 1 1 1 1 2 2 2 2 2
# rnorm(n, mean = 0, sd = 1) generates a vector of n random samples
# from a normal distribution with specific mean and sd.
rnorm(100)
## [1] 0.01993784 1.48480109 1.01758025 -0.04327038 -0.66580359 -0.40248806
## [7] 0.62492477 -0.92819343 0.71222285 0.68158038 0.42934877 -0.01682348
## [13] 0.97320467 -0.23433145 -0.31647808 -1.57655313 1.00020496 -0.01317028
## [19] 1.46074667 0.86551482 0.28131980 -0.31500696 1.62326402 0.86858797
## [25] -1.33757483 1.80150759 0.79348494 0.19087593 -0.20010248 0.82153238
## [31] 1.63473717 -1.32261107 0.02758227 1.48082578 0.56794389 0.87745602
## [37] -1.59443938 -0.30765237 -2.00130430 -1.57606762 1.86642288 0.70180292
## [43] 1.13255015 0.45058330 1.84840744 -1.04859539 -0.73222609 1.22042445
## [49] -1.45149724 0.14080772 0.07018727 -0.75059543 0.45704957 2.43294339
## [55] 0.55411211 -0.92091760 0.04196133 -0.48739717 0.76808535 -1.01178935
## [61] 0.76532125 -0.29779978 0.21386134 -0.10682567 -0.70469257 -1.27726467
## [67] -0.50450863 0.19968719 0.39181244 -0.31095053 1.54885368 -0.34714271
## [73] -2.37086512 0.01804768 -0.82663708 -1.33785980 -0.03791656 -0.09592142
## [79] -0.52930509 -0.77187358 -1.00313831 -0.91319466 -0.49277719 0.44397271
## [85] 1.14344009 0.65048022 0.50833201 -0.84290666 0.25549230 -1.01348976
## [91] 0.80621553 0.67463819 0.13923279 0.24937504 2.17365007 0.03373820
## [97] 0.81496730 0.52293452 0.18806759 -0.31497023
qplot(rnorm(10000))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# runif(n, min, max) generates a vector of n random samples
# from a uniform distribution whose limits are min and max.
runif(100, 0, 1)
## [1] 0.418469591 0.561151230 0.465792540 0.786275225 0.144390083 0.925961058
## [7] 0.657084666 0.703066631 0.071834422 0.984284761 0.083141750 0.588476214
## [13] 0.990862661 0.159720705 0.775142886 0.309792053 0.261499164 0.019645744
## [19] 0.015890926 0.043579119 0.304720480 0.709337480 0.030107162 0.634534175
## [25] 0.011311088 0.008566018 0.305042146 0.364608347 0.994999808 0.859505695
## [31] 0.991257707 0.800189355 0.577100459 0.845425456 0.822921794 0.142139739
## [37] 0.045782780 0.174550571 0.130172416 0.450726892 0.641249115 0.769479783
## [43] 0.745834810 0.783584967 0.744180604 0.103646253 0.807139959 0.835034265
## [49] 0.113758690 0.512117621 0.568449151 0.866547070 0.478700738 0.381051510
## [55] 0.855795536 0.757560557 0.117304625 0.279881398 0.167309735 0.977391176
## [61] 0.184686979 0.355559352 0.886610948 0.025888799 0.346062207 0.385148498
## [67] 0.301892403 0.220895139 0.022666838 0.475881494 0.501830950 0.376417655
## [73] 0.334891197 0.929900997 0.846852859 0.670005210 0.134692398 0.384538731
## [79] 0.625540269 0.298543758 0.414706733 0.611129413 0.110473520 0.333423811
## [85] 0.766405945 0.753000806 0.073391479 0.180365260 0.498357934 0.896524156
## [91] 0.973049854 0.048122177 0.460173925 0.800946199 0.310137282 0.627349771
## [97] 0.918671186 0.839990146 0.653250212 0.886067909
qplot(runif(10000, 0, 1))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
6.5.7 Indexing or subsetting a Vector
You can access particular elements of a vector by placing an index inside square brackets, the indexing (subsetting) operator.
- Positive integers return elements at the specified positions:
x <- c(2,3,4,5,6,7)
x[c(3,1)]
## [1] 4 2
- Negative integers omit elements at the specified positions:
x[-c(3,1)]
## [1] 3 5 6 7
- Logical vectors select elements where the corresponding logical value is TRUE. Logical indexing is very useful because we can subset a vector or dataframe based on conditions.
x[c(TRUE, TRUE, FALSE, FALSE, TRUE, TRUE)]
## [1] 2 3 6 7
x > 3
## [1] FALSE FALSE TRUE TRUE TRUE TRUE
# This is called logical indexing, which is a very powerful tool.
# > : greater than (Logical Operators)
x[x > 3]
## [1] 4 5 6 7
x[x > 3 & x < 5]
## [1] 4
# %in% operator
# v1 %in% v2 returns a logical vector indicating
# whether the elements of v1 are included in v2.
c(1,2,3) %in% c(2,3,4,5,6)
## [1] FALSE TRUE TRUE
a <- c(1,2,3,4,5)
a
## [1] 1 2 3 4 5
# we replace an element of a vector using the indexing and assignment operators.
a[3] <- 100
a
## [1] 1 2 100 4 5
a[c(1,5)] <- 100
a
## [1] 100 2 100 4 100
a <- c(1,2,3,NA,5,6,NA)
a
## [1] 1 2 3 NA 5 6 NA
# is.na indicates which elements are missing
is.na(a) # returns TRUE when missing
## [1] FALSE FALSE FALSE TRUE FALSE FALSE TRUE
# Type conversion: TRUE and FALSE will be converted into 1 and 0, respectively.
# This expression answers the question: How many NAs are in a?
sum(is.na(a))
## [1] 2
# !x = not x (negation)
!is.na(a) # returns TRUE when not missing
## [1] TRUE TRUE TRUE FALSE TRUE TRUE FALSE
# This expression answers the question: How many non-NAs are in a?
sum(!is.na(a))
## [1] 5
# logical indexing
a[is.na(a)] <- 999
a
## [1] 1 2 3 999 5 6 999
# create a vector with names
a <- c(x = 1, y = 2, z = 3)
a
## x y z
## 1 2 3
# named vectors can be indexed using their names
a[c("x", "z")]
## x z
## 1 3
# R uses a "recycling rule" by repeating the shorter vector
# In this example, R recycled c(TRUE, FALSE) to produce c(TRUE, FALSE, TRUE, FALSE)
i <- c(TRUE, FALSE)
a <- c(1,2,3,4)
a[i]
## [1] 1 3
# R uses a "recycling rule" by repeating the shorter vector
v1 <- c(4,5,6,7)
v2 <- c(10,10)
v1+v2
## [1] 14 15 16 17
6.5.8 Arrange a vector
# sort(x, decreasing = FALSE)
# By default, sort() sorts in ascending order.
sort(c(5,6,4))
## [1] 4 5 6
# sorts into descending order
sort(c(5,6,4), decreasing = TRUE)
## [1] 6 5 4
# rev() provides a reversed version of its argument
rev(c(5,6,4))
## [1] 4 6 5
# rank() returns the sample ranks of the elements in a vector
rank(c(5,6,4))
## [1] 2 3 1
# order() returns a permutation which rearranges
# its first argument into ascending or descending order.
# What this means is that order(c(5,6,4))
# 1) first sorts the vector in ascending order to produce c(4,5,6)
# 2) and returns the indices of the sorted elements in the original vector.
# e.g., we have 3 first b/c the index of 4 in the original vector is 3
# e.g., we have 1 second b/c the index of 5 in the original vector is 1
# e.g., we have 2 third b/c the index of 6 in the original vector is 2
order(c(5,6,4))
## [1] 3 1 2
# We use order() to sort a vector or dataframe
a <- c(5,6,4)
a[order(a)]
## [1] 4 5 6
# sort a dataframe
head(mtcars[order(mtcars$mpg), ])
## mpg cyl disp hp drat wt qsec vs am gear carb
## Cadillac Fleetwood 10.4 8 472 205 2.93 5.250 17.98 0 0 3 4
## Lincoln Continental 10.4 8 460 215 3.00 5.424 17.82 0 0 3 4
## Camaro Z28 13.3 8 350 245 3.73 3.840 15.41 0 0 3 4
## Duster 360 14.3 8 360 245 3.21 3.570 15.84 0 0 3 4
## Chrysler Imperial 14.7 8 440 230 3.23 5.345 17.42 0 0 3 4
## Maserati Bora 15.0 8 301 335 3.54 3.570 14.60 0 1 5 8
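A few more uses of order(), sketched under the assumption that only base R and the built-in mtcars data are available. The vectors item and price are illustrative.

```r
a <- c(5, 6, 4)

identical(a[order(a)], sort(a))    # TRUE: indexing by order() reproduces sort()
a[order(a, decreasing = TRUE)]     # 6 5 4

# order() can arrange one vector by the values of another
item  <- c("A", "B", "C")
price <- c(30, 10, 20)
item[order(price)]                 # "B" "C" "A"

# sort a dataframe by two columns: cyl ascending, then mpg descending
head(mtcars[order(mtcars$cyl, -mtcars$mpg), c("mpg", "cyl")], 3)
```

Unlike sort(), order() returns positions rather than values, which is what makes it suitable for rearranging the rows of a dataframe.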
6.5.9 Vectorization of Functions
- One of the most powerful aspects of R is the vectorization of functions.
- Many R functions can be applied to a vector of values producing an equal-sized vector of results.
v <- c(1,4,25)
sqrt(v)
## [1] 1 2 5
v <- c(1,2,3)
v^2
## [1] 1 4 9
v1 <- c(4,5,6,7)
v2 <- c(10,2,1,2)
v1+v2
## [1] 14 7 7 9
# R uses a "recycling rule" by repeating the shorter vector
v1 <- c(4,5,6,7)
v2 <- c(10,2)
v1+v2
## [1] 14 7 16 9
# mean will be subtracted from every element of v1
v1 <- c(1,2,3,4)
v1 - mean(v1)
## [1] -1.5 -0.5 0.5 1.5
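To see what vectorization saves, the sketch below computes the same square roots with an explicit loop and with a single vectorized call. The name out is illustrative.

```r
v <- c(1, 4, 25)

# Loop version: compute the square root one element at a time
out <- numeric(length(v))
for (i in seq_along(v)) {
  out[i] <- sqrt(v[i])
}

# Vectorized version: one call handles the whole vector
identical(out, sqrt(v))   # TRUE
```

The vectorized form is shorter and is usually faster as well, because the looping happens inside R's compiled code.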
6.5.10 Some more functions
# table() creates a frequency table
a <- c(1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 4, 4)
table(a)
## a
## 1 2 3 4
## 2 3 3 5
# unique() returns a vector of unique elements
unique(a)
## [1] 1 2 3 4
a <- c(1,2,3,NA,5)
# By default, mean() produces NA when there are NAs in a vector
mean(a)
## [1] NA
# na.rm = TRUE removes NAs before computation
mean(a, na.rm = TRUE)
## [1] 2.75
6.5.11 Generating Sequences
# creating a vector containing integers between 1 and 10
1:10
## [1] 1 2 3 4 5 6 7 8 9 10
5:0
## [1] 5 4 3 2 1 0
seq(from=1, to=3, by=0.5)
## [1] 1.0 1.5 2.0 2.5 3.0
# rep() replicates the values in its first argument
rep(5,3)
## [1] 5 5 5
rep(1:2, 3)
## [1] 1 2 1 2 1 2
rep(1:2, each=3)
## [1] 1 1 1 2 2 2
# gl() generates sequences involving factors
# gl(k,n), k = the number of levels,
# n = the number of repetitions.
gl(5,3)
## [1] 1 1 1 2 2 2 3 3 3 4 4 4 5 5 5
## Levels: 1 2 3 4 5
6.5.12 Exercise on vectors
mtcars is a dataframe about fuel economy of various cars. In the dataset, mpg represents miles per gallon. mtcars$mpg allows us to access the mpg variable in the mtcars dataframe.
a <- mtcars$mpg
- Calculate the length of the vector a.
length(a)
## [1] 32
- Calculate the mean of a using sum() and length() functions.
sum(a)/length(a)
## [1] 20.09062
- Calculate the mean of a using mean() function.
mean(a)
## [1] 20.09062
- Calculate the variance of a using sd() function.
sd(a)^2
## [1] 36.3241
- Calculate the variance of a using var() function.
var(a)
## [1] 36.3241
- Calculate the variance of a by directly calculating the following expression: $\frac{(a_1-\bar{a})^2+(a_2-\bar{a})^2+\dots+(a_n-\bar{a})^2}{n-1}=\frac{\sum_{i=1}^{n}(a_i-\bar{a})^2}{n-1}$, where $a=(a_1,a_2,\dots,a_n)$ and $\bar{a}=\text{mean}(a)$.
sum((a-mean(a))^2)/(length(a)-1)
## [1] 36.3241
- Standardize the vector a, i.e., $z=\frac{a-\bar{a}}{\text{sd}(a)}$.
(a-mean(a))/sd(a)
## [1] 0.15088482 0.15088482 0.44954345 0.21725341 -0.23073453 -0.33028740
## [7] -0.96078893 0.71501778 0.44954345 -0.14777380 -0.38006384 -0.61235388
## [13] -0.46302456 -0.81145962 -1.60788262 -1.60788262 -0.89442035 2.04238943
## [19] 1.71054652 2.29127162 0.23384555 -0.76168319 -0.81145962 -1.12671039
## [25] -0.14777380 1.19619000 0.98049211 1.71054652 -0.71190675 -0.06481307
## [31] -0.84464392 0.21725341
- Use scale() function to standardize a and compare the results with your manual calculation.
# check the help document of scale() by typing ?scale for more details
scale(a)
## [,1]
## [1,] 0.15088482
## [2,] 0.15088482
## [3,] 0.44954345
## [4,] 0.21725341
## [5,] -0.23073453
## [6,] -0.33028740
## [7,] -0.96078893
## [8,] 0.71501778
## [9,] 0.44954345
## [10,] -0.14777380
## [11,] -0.38006384
## [12,] -0.61235388
## [13,] -0.46302456
## [14,] -0.81145962
## [15,] -1.60788262
## [16,] -1.60788262
## [17,] -0.89442035
## [18,] 2.04238943
## [19,] 1.71054652
## [20,] 2.29127162
## [21,] 0.23384555
## [22,] -0.76168319
## [23,] -0.81145962
## [24,] -1.12671039
## [25,] -0.14777380
## [26,] 1.19619000
## [27,] 0.98049211
## [28,] 1.71054652
## [29,] -0.71190675
## [30,] -0.06481307
## [31,] -0.84464392
## [32,] 0.21725341
## attr(,"scaled:center")
## [1] 20.09062
## attr(,"scaled:scale")
## [1] 6.026948
- Calculate the difference between the largest and smallest numbers in a.
max(a)-min(a)
## [1] 23.5
# another solution
diff(range(a))
## [1] 23.5
- Normalize the vector a, i.e., $n=\frac{x-\min(x)}{\max(x)-\min(x)}$.
# your maximum value will be 1, and minimum value will be 0.
(a-min(a))/(max(a)-min(a))
## [1] 0.4510638 0.4510638 0.5276596 0.4680851 0.3531915 0.3276596 0.1659574
## [8] 0.5957447 0.5276596 0.3744681 0.3148936 0.2553191 0.2936170 0.2042553
## [15] 0.0000000 0.0000000 0.1829787 0.9361702 0.8510638 1.0000000 0.4723404
## [22] 0.2170213 0.2042553 0.1234043 0.3744681 0.7191489 0.6638298 0.8510638
## [29] 0.2297872 0.3957447 0.1957447 0.4680851
- Plot the histogram of a using qplot().
qplot(a)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# To set aesthetics, wrap in I()
qplot(a, color = I("red"), fill = I("blue"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
- How many elements in a are larger than 20? (use length())
# creates a logical vector in which TRUE indicates the element that is larger than 20
a>20
## [1] TRUE TRUE TRUE TRUE FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE
## [13] FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE FALSE FALSE FALSE
## [25] FALSE TRUE TRUE TRUE FALSE FALSE FALSE TRUE
# This is logical indexing: the logical vector within the subsetting
# operator (i.e., []) keeps only the elements larger than 20.
a[a>20]
## [1] 21.0 21.0 22.8 21.4 24.4 22.8 32.4 30.4 33.9 21.5 27.3 26.0 30.4 21.4
length(a[a>20])
## [1] 14
- How many elements in a are larger than 20? (use sum())
# same result
# this happens because of "vectorization" and "type conversion"
sum(a>20)
## [1] 14
txhousing is a tibble in ggplot2 containing information about the housing market in Texas provided by the TAMU real estate center. In the dataset, median represents median sale price. txhousing$median allows us to access the median variable in the txhousing tibble (or dataframe).
b <- txhousing$median
- Calculate the length of the vector b.
- How many missing values (or NAs) are in b?
- Calculate the mean of b using sum() and length() functions.
- Calculate the mean of b using mean() function.
- Are the two means the same? If not, why? How do we get the same result?
- Calculate the variance of b using sd() function.
- Calculate the variance of b using var() function.
- Plot the histogram of b using qplot().
- Create a new vector c by removing all missing values from b.
- (Using c) What percentage of houses have a median sale price larger than $200000?
6.6 Factors
6.6.1 What is a factor?
- Factors are used to represent categorical data (e.g., gender, states).
- Factors are stored as a vector of integer values associated with a set of character values (levels) to use when the factor is displayed.
- Factors have two attributes
- the class(), “factor”, which makes factors behave differently from regular integer vectors, and
- the levels(), which defines the set of allowed values.
6.6.2 Creating a factor
- The function factor() is used to encode a numeric or character vector as a factor.
# levels are the set of allowed values
f1 <- factor(c(2,1,1,3,2,1,1))
f1
## [1] 2 1 1 3 2 1 1
## Levels: 1 2 3
- Factors are built on top of integers, and have a levels attribute
typeof(f1)
## [1] "integer"
attributes(f1)
## $levels
## [1] "1" "2" "3"
##
## $class
## [1] "factor"
- levels() displays the levels of a factor
levels(f1)
## [1] "1" "2" "3"
- The factor’s level is a character vector.
# test for objects of type "character"
is.character(levels(f1))
## [1] TRUE
- More test functions in R
- We can change levels
levels(f1) <- c("one", "two", "three")
f1
## [1] two one one three two one one
## Levels: one two three
- By default, the levels of a factor will be displayed in alphabetical order.
f2 <- factor(c("Dec", "Apr", "Jan", "Mar"))
f2
## [1] Dec Apr Jan Mar
## Levels: Apr Dec Jan Mar
sort(f2)
## [1] Apr Dec Jan Mar
## Levels: Apr Dec Jan Mar
- The levels option can be used to change the order in which the levels will be displayed from their default sorted order.
month_levels <- c(
  "Jan", "Feb", "Mar", "Apr", "May", "Jun",
  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
)
f3 <- factor(c("Dec", "Apr", "Jan", "Mar"), levels = month_levels)
f3
## [1] Dec Apr Jan Mar
## Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
# In many cases, this is the result that we expect.
sort(f3)
## [1] Jan Mar Apr Dec
## Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
table(f3)
## f3
## Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 1 0 1 1 0 0 0 0 0 0 0 1
table(f2)
## f2
## Apr Dec Jan Mar
## 1 1 1 1
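Because a factor is stored as integer codes, converting a factor of number-like labels needs some care. The sketch below uses an illustrative factor f and also shows what happens to a value that is not among the allowed levels.

```r
f <- factor(c("10", "20", "10", "30"))

as.integer(f)                  # 1 2 1 3: the underlying integer codes
levels(f)                      # "10" "20" "30"
as.numeric(as.character(f))    # 10 20 10 30: the original numbers

# Values outside the allowed levels become NA
factor(c("Jan", "Jam"), levels = c("Jan", "Feb"))
```

Going through as.character() first recovers the labels themselves rather than their positions in the levels.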
6.6.3 unordered vs ordered factor
- Although the levels of a factor created by the factor() function have an order for display, such a factor is called an unordered factor in the sense that it does not have any meaningful ordering structure. Comparison operators such as < will not work with an unordered factor. Sometimes, we want to specify a meaningful order by creating an ordered factor.
# the default level is in alphabetical order
f4 <- factor(c("high", "low", "medium", "medium", "high"))
f4
## [1] high low medium medium high
## Levels: high low medium
sort(f4)
## [1] high high low medium medium
## Levels: high low medium
f5 <- factor(c("high", "low", "medium", "medium", "high"), levels = c("low", "medium", "high"))
f5
## [1] high low medium medium high
## Levels: low medium high
sort(f5)
## [1] low medium medium high high
## Levels: low medium high
- min(f5) will produce an error, and f5[1] < f5[3] returns NA with a warning, because these operations are not meaningful for an unordered factor.
- With the ordered = TRUE option, the levels are regarded as ordered.
f6 <- factor(c("high", "low", "medium", "medium", "high"), levels = c("low", "medium", "high"), ordered = TRUE)
f6
## [1] high low medium medium high
## Levels: low < medium < high
min(f6)
## [1] low
## Levels: low < medium < high
f6[1] > f6[2]
## [1] TRUE
- The ordered() function also creates an ordered factor.
f7 <- ordered(c("high", "low", "medium", "medium", "high"), levels = c("low", "medium", "high"))
f7
## [1] high low medium medium high
## Levels: low < medium < high
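Once a factor is ordered, comparisons can be used for counting and subsetting. A small sketch that rebuilds f6 so that it runs on its own:

```r
f6 <- factor(c("high", "low", "medium", "medium", "high"),
             levels = c("low", "medium", "high"), ordered = TRUE)

f6 >= "medium"         # TRUE FALSE TRUE TRUE TRUE
sum(f6 >= "medium")    # 4: how many responses are at least "medium"
max(f6)                # high
is.ordered(f6)         # TRUE
```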
6.6.4 Why factors?
- Factors are an efficient way to store character values, because each unique character value is stored only once, and the factor itself is stored as an integer vector.
- Factors prevent typos because they only allow us to input the pre-defined values.
- Factors allow us to encode ordering structure.
6.6.5 Some more comments
- Be careful. In versions of R before 4.0.0, many base R functions (e.g., data.frame() and read.csv()) automatically convert character vectors into factors. To suppress this behavior, use the stringsAsFactors = FALSE option within the function; since R 4.0.0, this is the default. You can explicitly convert data type with as.character(), as.double(), as.integer(), and as.logical().
6.6.6 Exercise on factors
- You have the following responses to a five-point Likert scale survey item: x <- c("Agree", "Disagree", "Neutral", "Agree" ,"Agree", "Strongly disagree", "Neutral"). Create an ordered factor for the five-point Likert scale responses (notice that you don’t have “Strongly agree” in x, but include “Strongly agree” in your factor levels).
x <- c("Agree", "Disagree", "Neutral", "Agree" ,"Agree", "Strongly disagree", "Neutral")
# you may want to do this
factor(x, levels = c("Strongly disagree", "Disagree", "Neutral", "Agree", "Strongly agree"))
## [1] Agree Disagree Neutral Agree
## [5] Agree Strongly disagree Neutral
## Levels: Strongly disagree Disagree Neutral Agree Strongly agree
# not this
factor(x)
## [1] Agree Disagree Neutral Agree
## [5] Agree Strongly disagree Neutral
## Levels: Agree Disagree Neutral Strongly disagree
- Using the following character vector x = c("male", "male", "female", "male", "female"), create a factor with levels reversed from its default levels order.
x = c("male", "male", "female", "male", "female")
# by default, female becomes first
factor(x)
## [1] male male female male female
## Levels: female male
# What I've asked you to do is to change the default alphabetical order using the levels option.
factor(x, levels = c("male", "female"))
## [1] male male female male female
## Levels: male female
- Run the following code and explain what the code is doing.
# I just wanted to introduce the `cut()` function
set.seed(7)
x <- rnorm(100)
cut(x, breaks = quantile(x))
## [1] (0.72,2.72] (-1.79,-0.559] (-1.79,-0.559] (-0.559,0.106] (-1.79,-0.559]
## [6] (-1.79,-0.559] (0.72,2.72] (-0.559,0.106] (0.106,0.72] (0.72,2.72]
## [11] (0.106,0.72] (0.72,2.72] (0.72,2.72] (0.106,0.72] (0.72,2.72]
## [16] (0.106,0.72] (-1.79,-0.559] (-0.559,0.106] (-0.559,0.106] (0.72,2.72]
## [21] (0.72,2.72] (0.106,0.72] (0.72,2.72] (-1.79,-0.559] (0.72,2.72]
## [26] (0.106,0.72] (0.72,2.72] (0.106,0.72] (-1.79,-0.559] (-0.559,0.106]
## [31] (-1.79,-0.559] (0.106,0.72] (0.106,0.72] (-0.559,0.106] (-0.559,0.106]
## [36] (-1.79,-0.559] (0.72,2.72] (-1.79,-0.559] (-0.559,0.106] (0.106,0.72]
## [41] (0.72,2.72] (-1.79,-0.559] (-0.559,0.106] (-1.79,-0.559] (-0.559,0.106]
## [46] (-0.559,0.106] (0.72,2.72] (0.106,0.72] (-0.559,0.106] (0.72,2.72]
## [51] (-0.559,0.106] (-0.559,0.106] (0.106,0.72] (0.72,2.72] (0.72,2.72]
## [56] (0.106,0.72] (-1.79,-0.559] (0.106,0.72] (0.106,0.72] (-1.79,-0.559]
## [61] (-0.559,0.106] (0.106,0.72] (0.106,0.72] (0.106,0.72] (0.106,0.72]
## [66] (0.72,2.72] (0.72,2.72] (0.72,2.72] (0.72,2.72] (0.106,0.72]
## [71] (0.106,0.72] (-1.79,-0.559] (-1.79,-0.559] (-1.79,-0.559] (-1.79,-0.559]
## [76] (-1.79,-0.559] (-1.79,-0.559] (-0.559,0.106] (-0.559,0.106] (0.72,2.72]
## [81] (0.106,0.72] (-0.559,0.106] (-0.559,0.106] (-0.559,0.106] (-1.79,-0.559]
## [86] (-0.559,0.106] (-0.559,0.106] (-0.559,0.106] <NA> (0.106,0.72]
## [91] (0.72,2.72] (-1.79,-0.559] (0.106,0.72] (0.106,0.72] (0.72,2.72]
## [96] (-1.79,-0.559] (0.72,2.72] (-1.79,-0.559] (-0.559,0.106] (-0.559,0.106]
## Levels: (-1.79,-0.559] (-0.559,0.106] (0.106,0.72] (0.72,2.72]
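One detail worth noticing in the output above is the single <NA>. By default, cut() builds intervals that are open on the left, so the minimum of x, which equals the first break, falls outside every interval. A sketch of one way to keep it, using the include.lowest argument of cut(), follows. The names g1 and g2 are illustrative.

```r
set.seed(7)
x <- rnorm(100)

g1 <- cut(x, breaks = quantile(x))                          # minimum becomes NA
g2 <- cut(x, breaks = quantile(x), include.lowest = TRUE)   # minimum is kept

sum(is.na(g1))   # 1
sum(is.na(g2))   # 0
table(g2)        # number of observations in each quartile group
```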
6.7 Lists
6.7.1 What is a list?
- A list is a one-dimensional heterogeneous data structure.
- Because a list is a one-dimensional data structure, we can index the elements of a list using a single number.
- Unlike an atomic vector, a list is a heterogeneous data structure, meaning that the elements of a list can be any object in R.
6.7.2 Creating a list
- list() is used to create a list.
x <- list(1:6, "a", c(TRUE, TRUE, FALSE), c(1.2, 3.3, 4.6, 6.6))
x
## [[1]]
## [1] 1 2 3 4 5 6
##
## [[2]]
## [1] "a"
##
## [[3]]
## [1] TRUE TRUE FALSE
##
## [[4]]
## [1] 1.2 3.3 4.6 6.6
# str() displays the internal structure of an R object
str(x)
## List of 4
## $ : int [1:6] 1 2 3 4 5 6
## $ : chr "a"
## $ : logi [1:3] TRUE TRUE FALSE
## $ : num [1:4] 1.2 3.3 4.6 6.6
typeof(x)
## [1] "list"
6.7.3 Why lists?
- Because of its flexible structure, many R functions store their outputs as a list, and return the list.
# In R, lm() is a function that fits a regression model to data.
# In the following R expression, `mpg` is the dependent variable
# and `disp` and `cyl` are the independent variables.
fit <- lm(mpg ~ disp + cyl, data = mtcars)
fit
##
## Call:
## lm(formula = mpg ~ disp + cyl, data = mtcars)
##
## Coefficients:
## (Intercept) disp cyl
## 34.66099 -0.02058 -1.58728
typeof(fit)
## [1] "list"
str(fit)
## List of 12
## $ coefficients : Named num [1:3] 34.661 -0.0206 -1.5873
## ..- attr(*, "names")= chr [1:3] "(Intercept)" "disp" "cyl"
## $ residuals : Named num [1:32] -0.844 -0.844 -3.289 1.573 4.147 ...
## ..- attr(*, "names")= chr [1:32] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...
## $ effects : Named num [1:32] -113.65 -28.44 -6.81 2.04 4.06 ...
## ..- attr(*, "names")= chr [1:32] "(Intercept)" "disp" "cyl" "" ...
## $ rank : int 3
## $ fitted.values: Named num [1:32] 21.8 21.8 26.1 19.8 14.6 ...
## ..- attr(*, "names")= chr [1:32] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...
## $ assign : int [1:3] 0 1 2
## $ qr :List of 5
## ..$ qr : num [1:32, 1:3] -5.657 0.177 0.177 0.177 0.177 ...
## .. ..- attr(*, "dimnames")=List of 2
## .. .. ..$ : chr [1:32] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...
## .. .. ..$ : chr [1:3] "(Intercept)" "disp" "cyl"
## .. ..- attr(*, "assign")= int [1:3] 0 1 2
## ..$ qraux: num [1:3] 1.18 1.09 1.19
## ..$ pivot: int [1:3] 1 2 3
## ..$ tol : num 0.0000001
## ..$ rank : int 3
## ..- attr(*, "class")= chr "qr"
## $ df.residual : int 29
## $ xlevels : Named list()
## $ call : language lm(formula = mpg ~ disp + cyl, data = mtcars)
## $ terms :Classes 'terms', 'formula' language mpg ~ disp + cyl
## .. ..- attr(*, "variables")= language list(mpg, disp, cyl)
## .. ..- attr(*, "factors")= int [1:3, 1:2] 0 1 0 0 0 1
## .. .. ..- attr(*, "dimnames")=List of 2
## .. .. .. ..$ : chr [1:3] "mpg" "disp" "cyl"
## .. .. .. ..$ : chr [1:2] "disp" "cyl"
## .. ..- attr(*, "term.labels")= chr [1:2] "disp" "cyl"
## .. ..- attr(*, "order")= int [1:2] 1 1
## .. ..- attr(*, "intercept")= int 1
## .. ..- attr(*, "response")= int 1
## .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
## .. ..- attr(*, "predvars")= language list(mpg, disp, cyl)
## .. ..- attr(*, "dataClasses")= Named chr [1:3] "numeric" "numeric" "numeric"
## .. .. ..- attr(*, "names")= chr [1:3] "mpg" "disp" "cyl"
## $ model :'data.frame': 32 obs. of 3 variables:
## ..$ mpg : num [1:32] 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## ..$ disp: num [1:32] 160 160 108 258 360 ...
## ..$ cyl : num [1:32] 6 6 4 6 8 6 8 4 4 6 ...
## ..- attr(*, "terms")=Classes 'terms', 'formula' language mpg ~ disp + cyl
## .. .. ..- attr(*, "variables")= language list(mpg, disp, cyl)
## .. .. ..- attr(*, "factors")= int [1:3, 1:2] 0 1 0 0 0 1
## .. .. .. ..- attr(*, "dimnames")=List of 2
## .. .. .. .. ..$ : chr [1:3] "mpg" "disp" "cyl"
## .. .. .. .. ..$ : chr [1:2] "disp" "cyl"
## .. .. ..- attr(*, "term.labels")= chr [1:2] "disp" "cyl"
## .. .. ..- attr(*, "order")= int [1:2] 1 1
## .. .. ..- attr(*, "intercept")= int 1
## .. .. ..- attr(*, "response")= int 1
## .. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
## .. .. ..- attr(*, "predvars")= language list(mpg, disp, cyl)
## .. .. ..- attr(*, "dataClasses")= Named chr [1:3] "numeric" "numeric" "numeric"
## .. .. .. ..- attr(*, "names")= chr [1:3] "mpg" "disp" "cyl"
## - attr(*, "class")= chr "lm"
6.7.4 Subsetting a List
- Subsetting a list works in the same way as subsetting an atomic vector. Using [ ] will always return a list; [[ ]] and $ let you pull out the components of the list.
# create a list
my.lst <- list(stud.id=34453,
               stud.name="John",
               stud.marks=c(14.3,12,15,19))
my.lst
## $stud.id
## [1] 34453
##
## $stud.name
## [1] "John"
##
## $stud.marks
## [1] 14.3 12.0 15.0 19.0
# [ ] extracts a sub-list
my.lst[1]
## $stud.id
## [1] 34453
typeof(my.lst[1])
## [1] "list"
# [[ ]] extracts the value of an individual element
my.lst[[1]]
## [1] 34453
typeof(my.lst[[1]])
## [1] "double"
# my.lst[[3]] will index the third element of a list, which is a numeric vector
# my.lst[[3]][2] will index the second element of the numeric vector
my.lst[[3]][2]
## [1] 12
# In the case of lists with named elements
# $ extracts the value of an individual element
my.lst$stud.id
## [1] 34453
typeof(my.lst$stud.id)
## [1] "double"
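The three operators also accept names, and [ ] can keep several components at once. The following is a minimal sketch that rebuilds the my.lst example from above; the helper variables sub, field, and marks are hypothetical names introduced only for illustration.

```r
# Rebuild the example list from the text
my.lst <- list(stud.id = 34453,
               stud.name = "John",
               stud.marks = c(14.3, 12, 15, 19))

# [ ] with a character vector keeps several components, still as a list
sub <- my.lst[c("stud.id", "stud.name")]
length(sub)
class(sub)

# [[ ]] also accepts a name, which is handy when the name is stored in a variable
field <- "stud.marks"
marks <- my.lst[[field]]
mean(marks)
```

Note that my.lst$field would look for a component literally called "field", so [[ ]] is the safer choice when the name is held in a variable.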
6.7.5 Exercise on lists
fit is a list that contains the outputs of the lm() function for linear regression. Explore the structure of the fit object using str().
fit <- lm(mpg ~ disp + cyl, data = mtcars)
fit
##
## Call:
## lm(formula = mpg ~ disp + cyl, data = mtcars)
##
## Coefficients:
## (Intercept) disp cyl
## 34.66099 -0.02058 -1.58728
- Extract the coefficient of “Intercept” with indexing using a positive integer.
# fit$coefficients is a named numeric vector
fit$coefficients
## (Intercept) disp cyl
## 34.66099474 -0.02058363 -1.58727681
# So, we can subset the first element using the following expression
fit$coefficients[1]
## (Intercept)
## 34.66099
- Extract the coefficient of “Intercept” with indexing using a name.
# We can also use the name of element for indexing.
fit$coefficients["(Intercept)"]
## (Intercept)
## 34.66099
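As a possible alternative to indexing the list directly, base R also provides the coef() extractor function. The sketch below assumes the same model as in the exercise; the variable name intercept is introduced here only for illustration.

```r
fit <- lm(mpg ~ disp + cyl, data = mtcars)

# coef() is the standard extractor for model coefficients
coef(fit)["(Intercept)"]

# [[ ]] drops the name and returns the bare number
intercept <- coef(fit)[["(Intercept)"]]
intercept
```

Using [[ ]] instead of [ ] is a small design choice: the result carries no name attribute, which is often more convenient in later arithmetic.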
6.7.6 Data frames
- A data frame is a list of equal-length vectors.
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
typeof(iris)
## [1] "list"
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
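Because a data frame is a list underneath, the list tools from the previous sections apply to it directly. The sketch below uses the built-in iris data; the variable names one_col_df, species, and col_lengths are hypothetical and chosen only for this example.

```r
# Each column of a data frame is one element of the underlying list
length(iris)    # number of columns
names(iris)     # column names are the list names

# List-style subsetting works on data frames
one_col_df <- iris["Species"]     # [ ] returns a data frame (still a list)
species    <- iris[["Species"]]   # [[ ]] returns the column itself
class(one_col_df)
class(species)

# Every column has the same length, which equals the number of rows
col_lengths <- sapply(iris, length)
col_lengths
```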
6.7.7 The apply family of functions
- The apply() family of functions refers to apply(), lapply(), sapply(), vapply(), mapply(), rapply(), and tapply().
- Why do we need them?
  - They make your code much shorter by replacing repeated copy-and-paste code.
# A motivating example: check the number of missing values in each column of the following data frame 'm'
m <- data.frame(matrix(c(1,2,3,4,NA,6,7,NA,NA,NA,NA,NA), ncol = 4))
m
## X1 X2 X3 X4
## 1 1 4 7 NA
## 2 2 NA NA NA
## 3 3 6 NA NA
sum(is.na(m$X1))
## [1] 0
sum(is.na(m$X2))
## [1] 1
sum(is.na(m$X3))
## [1] 2
sum(is.na(m$X4))
## [1] 3
lapply(X, FUN)
- X = a list object in R
- FUN = a function in R
lapply()
- takes a function (FUN)
- applies it to each element of a list (X)
- and returns the results in the form of a list

What lapply() does
# This one line of code will still work even when the number of columns is 1000 or more.
lapply(m, function(x) sum(is.na(x)))
## $X1
## [1] 0
##
## $X2
## [1] 1
##
## $X3
## [1] 2
##
## $X4
## [1] 3
# lapply() returns a list, whereas sapply() returns a vector, matrix, or array.
sapply(m, function(x) sum(is.na(x)))
## X1 X2 X3 X4
## 0 1 2 3
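The other members of the family follow the same idea of applying a function repeatedly. The following is a rough sketch of four of them on small inputs; the result names (na_per_col, na_per_row, mean_by_species, pair_sums) are hypothetical and used only for illustration.

```r
m <- data.frame(matrix(c(1,2,3,4,NA,6,7,NA,NA,NA,NA,NA), ncol = 4))

# vapply(): like sapply(), but each result must match the declared template
na_per_col <- vapply(m, function(x) sum(is.na(x)), FUN.VALUE = integer(1))

# apply(): works over the rows (MARGIN = 1) or columns (MARGIN = 2) of a matrix
na_per_row <- apply(as.matrix(m), 1, function(x) sum(is.na(x)))

# tapply(): applies a function to groups of a vector defined by a factor
mean_by_species <- tapply(iris$Sepal.Length, iris$Species, mean)

# mapply(): applies a function to several vectors in parallel
pair_sums <- mapply(function(a, b) a + b, 1:3, 4:6)

na_per_col
na_per_row
mean_by_species
pair_sums
```

vapply() is often preferred over sapply() inside functions, because it stops with an error if a result does not have the declared type and length.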
6.7.8 Exercise
- How many columns in the bfi dataset have more than 20 missing values?
library(psych)
##
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
head(bfi)
## A1 A2 A3 A4 A5 C1 C2 C3 C4 C5 E1 E2 E3 E4 E5 N1 N2 N3 N4 N5 O1 O2 O3 O4
## 61617 2 4 3 4 4 2 3 3 4 4 3 3 3 4 4 3 4 2 2 3 3 6 3 4
## 61618 2 4 5 2 5 5 4 4 3 4 1 1 6 4 3 3 3 3 5 5 4 2 4 3
## 61620 5 4 5 4 4 4 5 4 2 5 2 4 4 4 5 4 5 4 2 3 4 2 5 5
## 61621 4 4 6 5 5 4 4 3 5 5 5 3 4 4 4 2 5 2 4 1 3 3 4 3
## 61622 2 3 3 4 5 4 4 5 3 2 2 2 5 4 5 2 3 4 4 3 3 3 4 3
## 61623 6 6 5 6 5 6 6 6 1 3 2 1 6 5 6 3 5 2 2 3 4 3 5 6
## O5 gender education age
## 61617 3 1 NA 16
## 61618 3 2 NA 18
## 61620 2 2 NA 17
## 61621 5 2 NA 17
## 61622 3 1 NA 17
## 61623 1 2 3 21
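One possible approach is to count the missing values per column with sapply() and then count how many of those totals exceed the threshold. The sketch below deliberately uses a small made-up data frame (toy) and a threshold of 1 rather than bfi itself, so the names and numbers are assumptions for illustration only.

```r
# Toy data frame standing in for bfi (hypothetical data, not the real dataset)
toy <- data.frame(a = c(1, NA, NA, 4),
                  b = c(NA, NA, NA, 4),
                  c = c(1, 2, 3, 4))
threshold <- 1   # the exercise itself uses 20

# Number of missing values in each column
na_counts <- sapply(toy, function(x) sum(is.na(x)))
na_counts

# Number of columns whose count exceeds the threshold
n_cols_over <- sum(na_counts > threshold)
n_cols_over
```

The same pattern should carry over to bfi by replacing toy with bfi and setting threshold to 20; colSums(is.na(toy)) is an equivalent way to obtain the per-column counts.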
6.7.9 More resources
- For more details about a vector, factor, and list, see Ch20 in R for Data Science (https://r4ds.had.co.nz/vectors.html).
- For more details about the apply family of functions, see a nice introduction in Data Camp (https://www.datacamp.com/community/tutorials/r-tutorial-apply-family).
6.8 Matrices and Arrays
- Matrices and arrays are implemented as vectors with special attributes.
- Adding a dim attribute to an atomic vector allows it to behave like a multi-dimensional array.
a <- 1:6
a
## [1] 1 2 3 4 5 6
# Get or set specific attributes of an object.
attr(a, "dim") <- c(3,2)
a
## [,1] [,2]
## [1,] 1 4
## [2,] 2 5
## [3,] 3 6
# by default, a matrix is filled by column
a <- matrix(1:6, ncol=3, nrow=2)
a
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
a <- matrix(1:6, ncol=3)
a
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
# a matrix can be filled by row using `byrow = TRUE`
a <- matrix(1:6, ncol=3, nrow=2, byrow = TRUE)
a
## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 4 5 6
attributes(a)
## $dim
## [1] 2 3
dim(a)
## [1] 2 3
b <- array(1:12, c(2,3,2))
b
## , , 1
##
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
##
## , , 2
##
## [,1] [,2] [,3]
## [1,] 7 9 11
## [2,] 8 10 12
dim(b)
## [1] 2 3 2
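Since a matrix is a vector plus a dim attribute, vector operations still work on it, and changing or removing dim only changes how the same data are viewed. A minimal sketch, with the variable name fifth introduced only for illustration:

```r
a <- matrix(1:6, nrow = 2)

# The underlying data are still a plain vector, so a single index works
fifth <- a[5]   # fifth element in column-major order
fifth
length(a)       # total number of elements

# Changing dim reshapes the same data
dim(a) <- c(3, 2)
a

# Removing dim turns it back into an ordinary vector
dim(a) <- NULL
is.matrix(a)
a
```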
- length() generalises to nrow() and ncol() for matrices, and dim() for arrays.
- names() generalises to rownames() and colnames() for matrices, and dimnames(), a list of character vectors, for arrays.
results <- matrix(c(10, 30, 40, 50, 43, 56, 21, 30), 2, 4, byrow = TRUE)
colnames(results) <- c("1qrt", "2qrt", "3qrt", "4qrt")
rownames(results) <- c("store1", "store2")
results
## 1qrt 2qrt 3qrt 4qrt
## store1 10 30 40 50
## store2 43 56 21 30
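The size and name functions can be checked on the same results matrix. The sketch below builds it in one step using the dimnames argument of matrix(), which is an alternative to the separate colnames() and rownames() assignments; the variable name store2_q3 is hypothetical.

```r
results <- matrix(c(10, 30, 40, 50, 43, 56, 21, 30), 2, 4, byrow = TRUE,
                  dimnames = list(c("store1", "store2"),
                                  c("1qrt", "2qrt", "3qrt", "4qrt")))

nrow(results)      # number of rows
ncol(results)      # number of columns
dim(results)       # both at once
length(results)    # total number of elements

# dimnames() returns a list: row names first, then column names
dimnames(results)

# Row and column names can be used for indexing
store2_q3 <- results["store2", "3qrt"]
store2_q3
```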
6.8.1 Subsetting a Matrix and Array
- You can supply a 1d index for each dimension, separated by commas.
a <- matrix(1:9, nrow = 3)
colnames(a) <- c("A", "B", "C")
a
## A B C
## [1,] 1 4 7
## [2,] 2 5 8
## [3,] 3 6 9
a[c(TRUE, FALSE, TRUE), c("B", "A")]
## B A
## [1,] 4 1
## [2,] 6 3
a[1, c(2,3)]
## B C
## 4 7
# If you omit any dimension, you obtain full columns or rows
a[2,]
## A B C
## 2 5 8
a[,3]
## [1] 7 8 9
a
## A B C
## [1,] 1 4 7
## [2,] 2 5 8
## [3,] 3 6 9
a > 3
## A B C
## [1,] FALSE TRUE TRUE
## [2,] FALSE TRUE TRUE
## [3,] FALSE TRUE TRUE
a[a>3] <- NA
a
## A B C
## [1,] 1 NA NA
## [2,] 2 NA NA
## [3,] 3 NA NA
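Selecting a single row or column simplifies the result to a plain vector by default. If the matrix shape should be kept, the drop = FALSE argument of [ ] can be used. A minimal sketch, with row_vec, row_mat, and big as hypothetical names:

```r
a <- matrix(1:9, nrow = 3)
colnames(a) <- c("A", "B", "C")

row_vec <- a[2, ]                 # simplified to a named vector
row_mat <- a[2, , drop = FALSE]   # stays a 1 x 3 matrix
dim(row_vec)
dim(row_mat)

# A logical matrix can also be used to extract (not only replace) values
big <- a[a > 3]
big
```

Keeping drop = FALSE in code that later calls nrow(), ncol(), or matrix multiplication avoids surprises when a selection happens to return exactly one row or column.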
6.8.2 Exercise
- mtcars is a fuel economy dataset. Subset the mtcars dataset such that you only keep the mpg, cyl, and gear variables for cars with 6 cylinders.
- Subset the mtcars dataset such that you only keep the mpg, cyl, disp, hp, drat, wt, qsec, and am variables for cars with 4 or 6 cylinders.
- Subset the mtcars dataset such that you only keep the mpg, cyl, disp, hp, drat, wt, qsec, and am variables for cars with 4 or 6 cylinders and mpg larger than 20.
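One way to approach these exercises is to put a logical condition in the row position and a character vector of column names in the column position. The following is a sketch rather than the only solution; the object names six, keep_cols, four_or_six, and efficient are hypothetical, and the column is spelled drat as it appears in mtcars.

```r
# Rows: a logical condition on cyl; columns: a character vector of names
six <- mtcars[mtcars$cyl == 6, c("mpg", "cyl", "gear")]
head(six)

# %in% tests membership in a set of values
keep_cols <- c("mpg", "cyl", "disp", "hp", "drat", "wt", "qsec", "am")
four_or_six <- mtcars[mtcars$cyl %in% c(4, 6), keep_cols]

# & combines two conditions element by element
efficient <- mtcars[mtcars$cyl %in% c(4, 6) & mtcars$mpg > 20, keep_cols]
nrow(efficient)
```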
6.8.3 Combine Matrices by Columns or Rows
# combine by columns
cbind(a,a)
## A B C A B C
## [1,] 1 NA NA 1 NA NA
## [2,] 2 NA NA 2 NA NA
## [3,] 3 NA NA 3 NA NA
# combine by rows
rbind(a,a)
## A B C
## [1,] 1 NA NA
## [2,] 2 NA NA
## [3,] 3 NA NA
## [4,] 1 NA NA
## [5,] 2 NA NA
## [6,] 3 NA NA
6.8.4 Names of the Columns and Rows of Matrices
colnames(a)
## [1] "A" "B" "C"
rownames(a) <- c("D", "E", "F")
rownames(a)
## [1] "D" "E" "F"
6.9 Data Frames
- A data frame is a list of equal-length vectors.
- A data frame is the most common way of storing data in R.
df <- data.frame(x=1:3, y=c("a", "b", "c"))
df
## x y
## 1 1 a
## 2 2 b
## 3 3 c
# display the internal structure of an R object
str(df)
## 'data.frame': 3 obs. of 2 variables:
## $ x: int 1 2 3
## $ y: chr "a" "b" "c"
- In R versions before 4.0.0, data.frame() converted strings into factors by default. Since R 4.0.0, strings are kept as character vectors, as in the output above.
  - The older default can cause serious problems in some cases.
  - stringsAsFactors = FALSE suppresses the conversion.
- Using str() to check data types is always a good practice.
df <- data.frame(x=1:3, y=c("a", "b", "c"), stringsAsFactors = FALSE)
str(df) # display the internal structure of an R object
## 'data.frame': 3 obs. of 2 variables:
## $ x: int 1 2 3
## $ y: chr "a" "b" "c"
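To see what the stringsAsFactors argument changes, and why an unnoticed factor can be a problem, the two settings can be compared side by side. This is a small sketch; df_chr, df_fct, and f are hypothetical names used only here.

```r
df_chr <- data.frame(x = 1:3, y = c("a", "b", "c"), stringsAsFactors = FALSE)
df_fct <- data.frame(x = 1:3, y = c("a", "b", "c"), stringsAsFactors = TRUE)
class(df_chr$y)
class(df_fct$y)

# A common pitfall: as.numeric() on a factor returns the internal level codes
f <- factor(c("10", "5", "20"))
as.numeric(f)                 # level codes, not the original numbers
as.numeric(as.character(f))   # the original numbers
```

Setting stringsAsFactors explicitly, whichever value is wanted, keeps the behaviour the same across R versions.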
6.10 Control Flow
- Control flow = the order in which individual statements are executed (or evaluated)
6.10.1 if-else
- Selection
  - if (condition) expression: If the condition is TRUE, the expression gets executed.
  - if (condition) expression1 else expression2: The else part is only executed if the condition is FALSE.
x <- -5
if (x>0) {
  print("Positive number")
} else {
  print("Negative number")
}
## [1] "Negative number"
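When more than two outcomes are possible, conditions can be chained with else if. The sketch below wraps the chain in a hypothetical helper function, sign_label(), so that zero is handled separately, and also shows ifelse(), the vectorised counterpart in base R.

```r
# Hypothetical helper: returns a label instead of printing it
sign_label <- function(x) {
  if (x > 0) {
    "Positive number"
  } else if (x < 0) {
    "Negative number"
  } else {
    "Zero"
  }
}

sign_label(-5)
sign_label(0)

# if () expects a single TRUE or FALSE; ifelse() works element by element
labels <- ifelse(c(-5, 3) > 0, "Positive number", "Not positive")
labels
```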
6.10.2 for
- for (value in sequence) {statements}
- A for loop allows us to iterate over the elements of a vector and run the code inside the curly brackets once for each element.
for (i in 1:3) {
print(i^2)
}
## [1] 1
## [1] 4
## [1] 9
# count the number of even numbers
x <- c(2,5,3,9,8,11,6)
count <- 0
for (val in x) {
  if(val %% 2 == 0) count <- count+1
}
print(count)
## [1] 3
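Many loops of this kind can also be written without an explicit loop, because arithmetic and comparison operators in R are vectorised. The sketch below shows a vectorised version of the even-number count and a loop over positions using seq_along(); n_even and squares are hypothetical names.

```r
x <- c(2, 5, 3, 9, 8, 11, 6)

# x %% 2 == 0 is a logical vector; sum() counts the TRUE values
n_even <- sum(x %% 2 == 0)
n_even

# Looping over positions rather than values, storing one result per element
squares <- numeric(length(x))
for (i in seq_along(x)) {
  squares[i] <- x[i]^2
}
squares
```

seq_along(x) is generally safer than 1:length(x), since it produces an empty sequence when x has length zero.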
6.11 Further reading
- Wickham, H. (2014). Advanced R. Chapman and Hall/CRC
- http://adv-r.had.co.nz
- This is a nice book to read after you become comfortable in base R (not required in this course)