Chapter 6 Base R

Thursday, October 1, 202AIE17 송채은

6.1 Objects, variables, and assignment operator

x <- c(1,2,3)
x
## [1] 1 2 3

6.2 Functions

A function is a block of code that performs a single task when it is called

  • A function requires
    • arguments: the values the function uses when it is called
    • body: a group of R expressions enclosed in curly braces ({ and })
  • Functions are a fundamental building block of R
  • Packages are collections of functions written by others
  • Our job is to build a pipeline of data flow by connecting the many available functions
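
The two parts listed above can be inspected directly. The following is a minimal sketch; add_scaled and its factor argument are hypothetical names chosen only for illustration.

```r
# Hypothetical function used only for illustration
add_scaled <- function(x, factor = 2) {
  # body: the expressions evaluated when the function is called
  x * factor
}

names(formals(add_scaled))   # the argument names
body(add_scaled)             # the expressions inside { }
add_scaled(3)                # uses the default factor = 2
add_scaled(3, factor = 10)   # overrides the default
```

An argument with a default value (here factor) can be omitted when the function is called.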

1. An Example of Functions

mean(c(1,2,3,4))
## [1] 2.5
a <- mean(c(1,2,3,4))

mean() returns NA when the vector contains an NA

mean(c(1,2,NA,4))
## [1] NA

When na.rm = TRUE, NA will be removed before computation

mean(c(1,2,NA,4), na.rm = TRUE)
## [1] 2.333333

2. User-Defined Functions

se <- function(x) {
  v <- var(x)
  n <- length(x)
  return(sqrt(v/n))
}
mySample <- rnorm(n=100, mean=20, sd=4)
se(mySample)
## [1] 0.4158247

3. Exercise on functions

x1 <- rnorm(100, mean=0, sd=1)
x2 <- rnorm(100, mean=3, sd=2)
chaeeun <- mean(x2)-mean(x1)/sd(x1)
chaeeun
## [1] 2.820914
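
Note that division binds more tightly than subtraction, so the expression above is evaluated as mean(x2) - (mean(x1) / sd(x1)). The sketch below uses small fixed vectors (an assumption, chosen so the arithmetic is easy to follow) to show how parentheses change the result; which form the exercise intended is not stated in the source.

```r
x1 <- c(2, 4, 6)    # mean 4, sd 2
x2 <- c(7, 9, 11)   # mean 9

# Division happens first: 9 - (4 / 2)
as_written <- mean(x2) - mean(x1) / sd(x1)

# Parentheses force the subtraction first: (9 - 4) / 2
with_parens <- (mean(x2) - mean(x1)) / sd(x1)

as_written
with_parens
```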

6.3 Operators

1 == 2
## [1] FALSE
"a" != "b"
## [1] TRUE
1 == 2 | "a" != "b"
## [1] TRUE
1 == 2 & "a" != "b"
## [1] FALSE

6.4 Data Structure

R's base data structures can be organized by their dimensionality, that is, by the number of indices needed to locate an element: one (1D), two (2D), or n (nD)
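
As a rough illustration of dimensionality, the sketch below builds a vector, a matrix, and an array and locates one element in each. The object names are made up for this example.

```r
v   <- c(10, 20, 30)                  # 1D: one index
m   <- matrix(1:6, nrow = 2)          # 2D: row and column (filled column by column)
arr <- array(1:24, dim = c(2, 3, 4))  # 3D: three indices

v[2]          # second element
m[2, 3]       # row 2, column 3
arr[2, 3, 4]  # last element of the array

dim(v)        # NULL: a plain vector has no dim attribute
dim(m)
dim(arr)
```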

6.5 Vectors

Atomic vectors (homogeneous)

  • All elements of an atomic vector must be of the same type; atomic vectors are usually created with c()
    • logical (TRUE or FALSE)
    • integer
    • double (real numbers)
    • character (strings)
    • complex
    • raw (bytes)

Lists (heterogeneous): their elements can be of any type, and lists are created with list()
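
The six atomic types and the list type can be checked with typeof(). This is a small sketch; the variable name mixed is an assumption for illustration.

```r
typeof(TRUE)     # logical
typeof(1L)       # integer (the L suffix marks an integer literal)
typeof(1)        # double
typeof("a")      # character
typeof(1i)       # complex
typeof(raw(1))   # raw

# A list can hold elements of different types
mixed <- list(1, "a", TRUE)
typeof(mixed)
sapply(mixed, typeof)
```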

6.5.1 Vector Has Three Properties

1. Type

typeof() returns the type of an object

typeof(c(1,2,3))
## [1] "double"

2. Length

length() returns the number of elements in a vector

length(c(1,2,3))
## [1] 3

3. Attributes

attr() and attributes() return additional arbitrary metadata attached to an object

attributes(c(1,2,3))
## NULL
a <- c(x=1,y=2,z=3)
names(a)
## [1] "x" "y" "z"

modifying existing names

names(a) <- c("l", "m", "n")
a
## l m n 
## 1 2 3

names are attributes

attributes(a)
## $names
## [1] "l" "m" "n"
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
attributes(mtcars)
## $names
##  [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear" "carb"
## 
## $row.names
##  [1] "Mazda RX4"           "Mazda RX4 Wag"       "Datsun 710"          "Hornet 4 Drive"      "Hornet Sportabout"  
##  [6] "Valiant"             "Duster 360"          "Merc 240D"           "Merc 230"            "Merc 280"           
## [11] "Merc 280C"           "Merc 450SE"          "Merc 450SL"          "Merc 450SLC"         "Cadillac Fleetwood" 
## [16] "Lincoln Continental" "Chrysler Imperial"   "Fiat 128"            "Honda Civic"         "Toyota Corolla"     
## [21] "Toyota Corona"       "Dodge Challenger"    "AMC Javelin"         "Camaro Z28"          "Pontiac Firebird"   
## [26] "Fiat X1-9"           "Porsche 914-2"       "Lotus Europa"        "Ford Pantera L"      "Ferrari Dino"       
## [31] "Maserati Bora"       "Volvo 142E"         
## 
## $class
## [1] "data.frame"
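
The examples above use attributes(); attr() reads or sets a single attribute by name. A minimal sketch follows, where the attribute name "unit" is made up for illustration and has no special meaning to R.

```r
height <- c(170, 165, 180)

# Attach one arbitrary piece of metadata
attr(height, "unit") <- "cm"

attr(height, "unit")   # read a single attribute
attributes(height)     # list all attributes

# Setting an attribute to NULL removes it
attr(height, "unit") <- NULL
attributes(height)
```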

6.5.2 Type Conversion

All elements of an atomic vector must belong to the same base data type. If they do not, R automatically coerces them to a common type

v <- c(4, 7, 23.5, 76.2, 80, "rrt")
v
## [1] "4"    "7"    "23.5" "76.2" "80"   "rrt"
typeof(v)
## [1] "character"
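
Coercion follows a fixed order, from least to most flexible: logical, integer, double, character. The sketch below mixes adjacent types to show which one wins.

```r
typeof(c(TRUE, 2L))    # logical + integer   -> integer
typeof(c(2L, 3.5))     # integer + double    -> double
typeof(c(3.5, "a"))    # double  + character -> character

# The values themselves are converted, not just the label
c(TRUE, 2L)
c(TRUE, "a")
```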

Functions can automatically convert data types: TRUE becomes 1 and FALSE becomes 0

sum(c(TRUE, FALSE, TRUE))
## [1] 2

You can convert data types explicitly with as.character(), as.double(), as.integer(), and as.logical()

a <- c(1,2,3)
a
## [1] 1 2 3
b <- as.character(a)
b
## [1] "1" "2" "3"
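
Explicit conversion can lose information or fail. The sketch below shows three common cases; suppressWarnings() is used only to keep the example quiet, since as.numeric() warns when it introduces NA.

```r
# Doubles are truncated toward zero, not rounded
as.integer(3.9)

# Zero becomes FALSE; any other number becomes TRUE
as.logical(c(0, 1, 2))

# Strings that are not numbers become NA (with a warning)
suppressWarnings(as.numeric(c("1.5", "abc")))
```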

6.5.3 Generate a vector

a <- c(1,2,3,4,5)
a
## [1] 1 2 3 4 5

c() also combines vectors

a <- c(1,2,3)
b <- c(4,5,6)
c <- c(a, b)
c
## [1] 1 2 3 4 5 6

k:n generates a vector whose elements are the sequence of numbers from k to n

1:10
##  [1]  1  2  3  4  5  6  7  8  9 10

seq() generates regular sequences: seq(from, to)

seq(1, 10)
##  [1]  1  2  3  4  5  6  7  8  9 10

seq(from, to, by)

seq(1, 10, 2)
## [1] 1 3 5 7 9
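
Besides by, seq() accepts length.out to fix the number of elements instead of the step size, and by may be negative. A short sketch:

```r
# Five evenly spaced values from 0 to 1
seq(0, 1, length.out = 5)

# A negative step counts downward
seq(10, 1, by = -3)
```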

rep(x, times) replicates the values in x multiple times (x can be a number or vector)

rep(1, 5)
## [1] 1 1 1 1 1
rep(c(1,2), 5)
##  [1] 1 2 1 2 1 2 1 2 1 2
rep(c(1,2), each = 5)
##  [1] 1 1 1 1 1 2 2 2 2 2

rnorm(n, mean, sd) generates a vector of n random samples from a normal distribution with the given mean and standard deviation (by default 0 and 1)

rnorm(100)
##   [1]  1.05430020 -0.70747447 -0.32920300 -0.46654981  0.25314117  1.43733792  0.10544477 -1.05807539  0.43049233 -1.98590939
##  [11]  1.49579578 -0.04953275 -0.95942917 -0.65197615  0.69324827 -0.72920304 -0.86872850  0.87616007  0.70878155 -0.29516434
##  [21] -0.14410707  0.92548801 -0.44549329  0.19453602 -1.60115211  1.39350959  3.32606250 -0.79648802 -1.30996456 -0.46003636
##  [31] -0.53669533  0.82104467  1.78821753 -1.97608002  1.52531876 -0.25585319  1.21006074  1.95607085  0.04230925  0.46803578
##  [41]  2.01670378  1.19368046  1.20120153 -0.49669237 -0.18918695 -0.42362620  0.41901524 -0.39265277 -0.90423529 -1.68525596
##  [51] -0.99853779 -0.05946837 -0.12272501  1.19990084 -0.25277591  0.06012662 -1.08792087  0.76878385 -0.06246595 -2.07270115
##  [61] -1.31755700 -0.89706424  0.34031477  0.90086963 -0.11628450 -0.24003826 -1.18040086 -0.24690402 -0.83260657 -1.00306421
##  [71] -0.30850349 -0.34659285  0.43716883 -2.36077103 -0.23116019  1.87449880  1.38009796  0.95581840 -0.08384473  0.26942021
##  [81]  1.08039932  0.40561762  0.15912815  1.13572438  0.10386879  0.86585554  0.79493803 -0.43080560  0.77649804 -0.32221706
##  [91]  1.50550855  0.32032912  0.24727775 -0.82197481 -1.42758024  1.10942098  1.39620746  0.08895578  0.46403878 -0.54669577
library(ggplot2)
qplot(rnorm(10000))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
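
Random samples differ on every run, so the numbers printed above will not be reproduced exactly. If reproducibility is needed, one common approach is to fix the seed with set.seed() first. The seed value 123 below is arbitrary.

```r
set.seed(123)          # arbitrary seed value
first <- rnorm(5)

set.seed(123)          # same seed, so the same draws
second <- rnorm(5)

identical(first, second)
```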

runif(n, min, max) generates a vector of n random samples from a uniform distribution whose limits are min and max

runif(100, 0, 1)
##   [1] 0.472724686 0.307018266 0.720369834 0.565719134 0.260214073 0.291630986 0.882164101 0.051605914 0.455890893 0.165143798
##  [11] 0.851892580 0.040269077 0.366332942 0.673759001 0.057252024 0.706318196 0.217254666 0.262953555 0.404861972 0.751126933
##  [21] 0.126246933 0.472105791 0.134025459 0.689038690 0.783065907 0.164032331 0.101259435 0.565697048 0.060336350 0.765656369
##  [31] 0.511318344 0.724996882 0.869950069 0.912974228 0.905732265 0.743389372 0.374935023 0.532313017 0.094557391 0.724723988
##  [41] 0.379362820 0.882813673 0.996418042 0.331454388 0.511362910 0.083599654 0.792371807 0.218166791 0.266773995 0.006076652
##  [51] 0.064438633 0.565171633 0.112191624 0.796797916 0.464214930 0.866476413 0.754352485 0.269016077 0.259146014 0.664641577
##  [61] 0.988021544 0.914366963 0.661401265 0.768460279 0.666844471 0.408162299 0.735091421 0.749441098 0.768186237 0.223100694
##  [71] 0.946561704 0.962713837 0.938126939 0.832991195 0.182621158 0.988980525 0.530086909 0.550687216 0.148937527 0.271781892
##  [81] 0.011120157 0.151779840 0.667246365 0.367090994 0.024391824 0.755280936 0.867969546 0.033019193 0.211942921 0.705608915
##  [91] 0.724538469 0.230190331 0.617799667 0.790173059 0.995027459 0.348505143 0.956797696 0.023629957 0.851972289 0.025694436
qplot(runif(10000, 0, 1))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

6.5.4 Indexing or subsetting a Vector

Access particular elements of a vector by placing an index between square brackets, the indexing (subsetting) operator

1. Positive integers

return elements at the specified positions

x <- c(2,3,4,5,6,7)
x[c(3,1)]
## [1] 4 2

2. Negative integers

omit elements at the specified positions

x[-c(3,1)]
## [1] 3 5 6 7

3. Logical vectors

select elements where the corresponding logical value is TRUE

x[c(TRUE, TRUE, FALSE, FALSE, TRUE, TRUE)]
## [1] 2 3 6 7
x > 3
## [1] FALSE FALSE  TRUE  TRUE  TRUE  TRUE

This is called logical indexing

x[x > 3]
## [1] 4 5 6 7
x[x > 3 & x < 5]
## [1] 4

v1 %in% v2 returns a logical vector indicating whether the elements of v1 are included in v2

c(1,2,3) %in% c(2,3,4,5,6)
## [1] FALSE  TRUE  TRUE
a <- c(1,2,3,4,5)
a
## [1] 1 2 3 4 5
a[3] <- 100
a
## [1]   1   2 100   4   5
a[c(1,5)] <- 100
a
## [1] 100   2 100   4 100
a <- c(1,2,3,NA,5,6,NA)
a
## [1]  1  2  3 NA  5  6 NA
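
Because %in% returns a logical vector, it can be combined with logical indexing and with replacement by index. A minimal sketch, reusing the values of x from earlier in this section:

```r
x <- c(2, 3, 4, 5, 6, 7)

# Keep only the elements found in the lookup set
x[x %in% c(3, 5, 9)]

# Replace those elements in place
x[x %in% c(3, 5)] <- 0
x
```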

is.na() indicates which elements are missing

is.na(a)
## [1] FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE

TRUE and FALSE will be converted into 1 and 0

sum(is.na(a))
## [1] 2
!is.na(a)
## [1]  TRUE  TRUE  TRUE FALSE  TRUE  TRUE FALSE
sum(!is.na(a))
## [1] 5
a[is.na(a)] <- 999
a
## [1]   1   2   3 999   5   6 999
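
Instead of a fixed placeholder such as 999, missing values are sometimes replaced with a summary of the observed values. The sketch below uses the mean; whether that is appropriate depends on the analysis, so treat it as one option rather than a recommendation.

```r
a <- c(1, 2, 3, NA, 5, 6, NA)

anyNA(a)           # is anything missing?
which(is.na(a))    # positions of the missing values

# Replace each NA with the mean of the observed values (17 / 5)
a[is.na(a)] <- mean(a, na.rm = TRUE)
a
```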

create a vector with names

a <- c(x = 1, y = 2, z = 3)
a
## x y z 
## 1 2 3

Named vectors can be indexed by their names

a[c("x", "z")]
## x z 
## 1 3

R uses a “recycling rule” by repeating the shorter vector

i <- c(TRUE, FALSE)
a <- c(1,2,3,4)
a[i]
## [1] 1 3
v1 <- c(4,5,6,7)
v2 <- c(10,10)
v1 + v2
## [1] 14 15 16 17

6.5.5 Arrange a vector

sort() sorts into ascending order

sort(c(5,6,4))
## [1] 4 5 6

decreasing = TRUE sorts into descending order

sort(c(5,6,4), decreasing = TRUE)
## [1] 6 5 4

rev() provides a reversed version of its argument

rev(c(5,6,4))
## [1] 4 6 5

rank() returns the sample ranks of the elements in a vector

rank(c(5,6,4))
## [1] 2 3 1

order() returns the permutation that rearranges a vector into ascending order. Conceptually, it 1) sorts the vector in ascending order to produce c(4,5,6) and 2) returns the position of each sorted element in the original vector

order(c(5,6,4))
## [1] 3 1 2
a <- c(5,6,4)
a[order(a)]
## [1] 4 5 6
head(mtcars[order(mtcars$mpg), ])
##                      mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Cadillac Fleetwood  10.4   8  472 205 2.93 5.250 17.98  0  0    3    4
## Lincoln Continental 10.4   8  460 215 3.00 5.424 17.82  0  0    3    4
## Camaro Z28          13.3   8  350 245 3.73 3.840 15.41  0  0    3    4
## Duster 360          14.3   8  360 245 3.21 3.570 15.84  0  0    3    4
## Chrysler Imperial   14.7   8  440 230 3.23 5.345 17.42  0  0    3    4
## Maserati Bora       15.0   8  301 335 3.54 3.570 14.60  0  1    5    8
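
order() and rank() are easy to confuse. The sketch below contrasts them on the same vector and then sorts mtcars by two keys: cyl ascending, and mpg descending within each cyl group. The name sorted is made up for this example.

```r
x <- c(5, 6, 4)

order(x)          # where each sorted value sits in x
rank(x)           # the rank of each element of x
order(order(x))   # equals rank(x) when there are no ties

# Several keys: negate a numeric key to sort it in descending order
sorted <- mtcars[order(mtcars$cyl, -mtcars$mpg), ]
head(sorted[, c("mpg", "cyl")], 3)
```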

6.5.6 Vectorization of Functions

Many R functions can be applied to a vector of values, producing an equal-sized vector of results

v <- c(1,4,25)
sqrt(v)
## [1] 1 2 5
v <- c(1,2,3)
v^2
## [1] 1 4 9
v1 <- c(4,5,6,7)
v2 <- c(10,2,1,2)
v1+v2
## [1] 14  7  7  9

R uses a “recycling rule” by repeating the shorter vector

v1 <- c(4,5,6,7)
v2 <- c(10,2)
v1+v2
## [1] 14  7 16  9
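
Recycling works silently only when the longer length is a multiple of the shorter one. Otherwise R still recycles but issues a warning. The sketch below captures that warning with tryCatch(); the name warned is chosen for illustration.

```r
# Length 3 is not a multiple of length 2: the result is computed, with a warning
res <- suppressWarnings(c(1, 2, 3) + c(10, 20))
res   # 1 + 10, 2 + 20, 3 + 10

# Detect the warning programmatically
warned <- tryCatch({
  c(1, 2, 3) + c(10, 20)
  FALSE
}, warning = function(w) TRUE)
warned
```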

The mean is subtracted from every element of v1

v1 <- c(1,2,3,4)
v1 - mean(v1)
## [1] -1.5 -0.5  0.5  1.5
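
To see what vectorization saves, the sketch below writes the same computation as an explicit loop and checks that it matches the vectorized call. The name out is made up for this example.

```r
v <- c(1, 4, 25)

# Explicit loop: one element at a time
out <- numeric(length(v))
for (i in seq_along(v)) {
  out[i] <- sqrt(v[i])
}

# Vectorized: the whole vector at once
identical(out, sqrt(v))
```

The vectorized form is shorter and is usually faster, since the loop runs in compiled code inside sqrt().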

6.5.7 Some more functions

table() creates a frequency table

a <- c(1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 4, 4)
table(a)
## a
## 1 2 3 4 
## 2 3 3 5
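
A frequency table can be converted to proportions with prop.table(), and NAs are counted only when requested via useNA. A small sketch, using the same values as above under the made-up name grades:

```r
grades <- c(1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 4, 4)
counts <- table(grades)

prop.table(counts)                  # share of each value
names(counts)[which.max(counts)]    # the most frequent value, as a string

# NAs are dropped by default; useNA = "ifany" keeps them
table(c(1, NA, 1), useNA = "ifany")
```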

unique() returns a vector of unique elements

unique(a)
## [1] 1 2 3 4
a <- c(1,2,3,NA,5)

By default, mean() produces NA when there are NAs in a vector

mean(a)
## [1] NA

na.rm = TRUE removes NAs before computation

mean(a, na.rm = TRUE)
## [1] 2.75

6.5.8 Generating Sequences

creating a vector containing integers between 1 and 10

1:10
##  [1]  1  2  3  4  5  6  7  8  9 10
5:0
## [1] 5 4 3 2 1 0
seq(from=1, to=3, by=0.5)
## [1] 1.0 1.5 2.0 2.5 3.0
seq(1,3,0.5)
## [1] 1.0 1.5 2.0 2.5 3.0
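
As 5:0 above shows, the colon operator counts downward when the left side is larger. This matters when the upper bound can be zero: 1:0 yields two elements, not none. seq_len() and seq_along() are often used to avoid that. A minimal sketch:

```r
n <- 0

1:n          # counts down: 1 0
seq_len(n)   # empty integer vector

x <- c("a", "b", "c")
seq_along(x) # one index per element of x
```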

rep() replicates the values of its first argument

rep(5,3)
## [1] 5 5 5
rep(1:2, 3)
## [1] 1 2 1 2 1 2
rep(1:2, each=3)
## [1] 1 1 1 2 2 2

gl(k, n) generates a factor by specifying the pattern of its levels

  • k : the number of levels
  • n : the number of repetitions

gl(5,3)
##  [1] 1 1 1 2 2 2 3 3 3 4 4 4 5 5 5
## Levels: 1 2 3 4 5
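
gl() also takes a labels argument, which is handy for naming experimental groups. The labels "control" and "treatment" below are made up for illustration.

```r
group <- gl(2, 3, labels = c("control", "treatment"))
group

levels(group)       # the label of each level
as.integer(group)   # the underlying integer codes
table(group)        # three observations per level
```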

6.5.9 Exercise on vectors

a <- mtcars$mpg
length(a)
## [1] 32

Calculate the mean of a using sum() and length() functions

sum(a)/length(a)
## [1] 20.09062
mean(a)
## [1] 20.09062

Calculate the variance of a using sd() function

sd(a)^2
## [1] 36.3241

Calculate the variance of a from its definition, using sum(), mean(), and length() (the result matches var(a))

sum((a-mean(a))^2)/(length(a)-1)
## [1] 36.3241

Standardize a manually by subtracting the mean and dividing by the standard deviation

(a-mean(a))/sd(a)
##  [1]  0.15088482  0.15088482  0.44954345  0.21725341 -0.23073453 -0.33028740 -0.96078893  0.71501778  0.44954345 -0.14777380
## [11] -0.38006384 -0.61235388 -0.46302456 -0.81145962 -1.60788262 -1.60788262 -0.89442035  2.04238943  1.71054652  2.29127162
## [21]  0.23384555 -0.76168319 -0.81145962 -1.12671039 -0.14777380  1.19619000  0.98049211  1.71054652 -0.71190675 -0.06481307
## [31] -0.84464392  0.21725341

Use scale() function to standardize a and compare the results with your manual calculation

scale(a)
##              [,1]
##  [1,]  0.15088482
##  [2,]  0.15088482
##  [3,]  0.44954345
##  [4,]  0.21725341
##  [5,] -0.23073453
##  [6,] -0.33028740
##  [7,] -0.96078893
##  [8,]  0.71501778
##  [9,]  0.44954345
## [10,] -0.14777380
## [11,] -0.38006384
## [12,] -0.61235388
## [13,] -0.46302456
## [14,] -0.81145962
## [15,] -1.60788262
## [16,] -1.60788262
## [17,] -0.89442035
## [18,]  2.04238943
## [19,]  1.71054652
## [20,]  2.29127162
## [21,]  0.23384555
## [22,] -0.76168319
## [23,] -0.81145962
## [24,] -1.12671039
## [25,] -0.14777380
## [26,]  1.19619000
## [27,]  0.98049211
## [28,]  1.71054652
## [29,] -0.71190675
## [30,] -0.06481307
## [31,] -0.84464392
## [32,]  0.21725341
## attr(,"scaled:center")
## [1] 20.09062
## attr(,"scaled:scale")
## [1] 6.026948
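
Note that scale() returns a one-column matrix with extra attributes, not a plain vector. To compare it with the manual calculation, one option is to drop the matrix structure with as.vector() and compare with all.equal(), which tolerates tiny floating-point differences. The names manual and scaled are chosen for illustration.

```r
a <- mtcars$mpg

manual <- (a - mean(a)) / sd(a)
scaled <- scale(a)

is.matrix(scaled)                       # scale() returns a matrix
dim(scaled)                             # 32 rows, 1 column
all.equal(as.vector(scaled), manual)    # same values once flattened
```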

Calculate the difference between the largest and smallest numbers in a

max(a)-min(a)
## [1] 23.5
diff(range(a))
## [1] 23.5

Normalize the vector to the range from 0 to 1 (min-max scaling)

(a-min(a))/(max(a)-min(a))
##  [1] 0.4510638 0.4510638 0.5276596 0.4680851 0.3531915 0.3276596 0.1659574 0.5957447 0.5276596 0.3744681 0.3148936 0.2553191
## [13] 0.2936170 0.2042553 0.0000000 0.0000000 0.1829787 0.9361702 0.8510638 1.0000000 0.4723404 0.2170213 0.2042553 0.1234043
## [25] 0.3744681 0.7191489 0.6638298 0.8510638 0.2297872 0.3957447 0.1957447 0.4680851
qplot(a)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

To set aesthetics, wrap in I()

qplot(a, color = I("red"), fill = I("blue"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

How many elements in a are larger than 20?

a>20
##  [1]  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE
## [22] FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE FALSE  TRUE

Placing a > 20 within the subsetting operator (i.e., []) creates a vector of the elements larger than 20

a[a>20]
##  [1] 21.0 21.0 22.8 21.4 24.4 22.8 32.4 30.4 33.9 21.5 27.3 26.0 30.4 21.4
length(a[a>20])
## [1] 14

Alternatively, sum the logical vector directly, since each TRUE counts as 1

sum(a>20)
## [1] 14
head(txhousing)
## # A tibble: 6 x 9
##   city     year month sales   volume median listings inventory  date
##   <chr>   <int> <int> <dbl>    <dbl>  <dbl>    <dbl>     <dbl> <dbl>
## 1 Abilene  2000     1    72  5380000  71400      701       6.3 2000 
## 2 Abilene  2000     2    98  6505000  58700      746       6.6 2000.
## 3 Abilene  2000     3   130  9285000  58100      784       6.8 2000.
## 4 Abilene  2000     4    98  9730000  68600      785       6.9 2000.
## 5 Abilene  2000     5   141 10590000  67300      794       6.8 2000.
## 6 Abilene  2000     6   156 13910000  66900      780       6.6 2000.
b <- txhousing$median
length(b)
## [1] 8602
sum(is.na(b))
## [1] 616
sum(b, na.rm = TRUE) / length(b)
## [1] 118955.8
mean(b, na.rm = TRUE)
## [1] 128131.4
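
The last two results differ because sum(b, na.rm = TRUE) drops the 616 NAs from the numerator while length(b) still counts them in the denominator. Dividing by the number of non-missing values gives the same answer as mean(b, na.rm = TRUE). The sketch below shows this on a tiny made-up vector so it runs without ggplot2.

```r
b <- c(10, 20, NA, 30)

sum(b, na.rm = TRUE) / length(b)         # 60 / 4: the NA still counts in length()
sum(b, na.rm = TRUE) / sum(!is.na(b))    # 60 / 3: only observed values counted
mean(b, na.rm = TRUE)                    # same as the line above
```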