4.1 More R basics

In order to implement some of the contents of this chapter we need to cover more R basics, mostly related with flexible plotting that is not implemented directly in R Commander. The R functions we will are also very useful for simplifying some R Commander approaches.

In the following sections, type – not copy and paste systematically – the code in the 'R Script' panel and send it to the output panel. Remember that you should get the same outputs (which are preceded by ## [1]).

4.1.1 Data frames revisited

# Let's begin importing the iris dataset
data(iris)

# names gives you the variables in the data frame
names(iris)
## [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"
## [5] "Species"

# The beginning of the data
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

# So we can access variables by $or as in a matrix iris$Sepal.Length[1:10]
##  [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9
iris[1:10, 1]
##  [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9
iris[3, 1]
## [1] 4.7

# Information on the dimension of the data frame
dim(iris)
## [1] 150   5

# str gives the structure of any object in R
str(iris)
## 'data.frame':    150 obs. of  5 variables:
##  $Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ... ##$ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ... ##$ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ... # Recall the species variable: it is a categorical variable (or factor), # not a numeric variable iris$Species[1:10]
##  [1] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
## Levels: setosa versicolor virginica

# Factors can only take certain values
levels(iris$Species) ## [1] "setosa" "versicolor" "virginica" # If a file contains a variable with character strings as observations (either # encapsulated by quotation marks or not), the variable will become a factor # when imported into R Do the following: • Import auto.txt into R as the data frame auto. Check how the character strings in the file give rise to factor variables. • Get the dimensions of auto and show beginning of the data. • Retrieve the fifth observation of horsepower in two different ways. • Compute the levels of name. 4.1.2 Vector-related functions # The function seq creates sequences of numbers equally separated seq(0, 1, by = 0.1) ## [1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 seq(0, 1, length.out = 5) ## [1] 0.00 0.25 0.50 0.75 1.00 # You can short the latter argument seq(0, 1, l = 5) ## [1] 0.00 0.25 0.50 0.75 1.00 # Repeat number rep(0, 5) ## [1] 0 0 0 0 0 # Reverse a vector myVec <- c(1:5, -1:3) rev(myVec) ## [1] 3 2 1 0 -1 5 4 3 2 1 # Another way myVec[length(myVec):1] ## [1] 3 2 1 0 -1 5 4 3 2 1 # Count repetitions in your data table(iris$Sepal.Length)
##
## 4.3 4.4 4.5 4.6 4.7 4.8 4.9   5 5.1 5.2 5.3 5.4 5.5 5.6 5.7 5.8 5.9   6
##   1   3   1   4   2   5   6  10   9   4   1   6   7   6   8   7   3   6
## 6.1 6.2 6.3 6.4 6.5 6.6 6.7 6.8 6.9   7 7.1 7.2 7.3 7.4 7.6 7.7 7.9
##   6   4   9   7   5   2   8   3   4   1   1   3   1   1   1   4   1
table(iris$Species) ## ## setosa versicolor virginica ## 50 50 50 Do the following: • Create the vector x = (0.3, 0.6, 0.9, 1.2). • Create a vector of length 100 ranging from 0 to 1 with entries equally separated. • Compute the amount of zeros and ones in x <- c(0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0). Check that they are the same as in rev(x). • Compute the vector (0.1, 1.1, 2.1, …, 100.1) in four different ways using seq and rev. Do the same but using : instead of seq. (Hint: add 0.1) 4.1.3 Logical conditions and subsetting # Relational operators: x < y, x > y, x <= y, x >= y, x == y, x!= y # They return TRUE or FALSE # Smaller than 0 < 1 ## [1] TRUE # Greater than 1 > 1 ## [1] FALSE # Greater or equal to 1 >= 1 # Remember: ">="" and not "=>"" ! ## [1] TRUE # Smaller or equal to 2 <= 1 # Remember: "<="" and not "=<"" ! ## [1] FALSE # Equal 1 == 1 # Tests equality. Remember: "=="" and not "="" ! ## [1] TRUE # Unequal 1 != 0 # Tests inequality ## [1] TRUE # TRUE is encoded as 1 and FALSE as 0 TRUE + 1 ## [1] 2 FALSE + 1 ## [1] 1 # In a vector-like fashion x <- 1:5 y <- c(0, 3, 1, 5, 2) x < y ## [1] FALSE TRUE FALSE TRUE FALSE x == y ## [1] FALSE FALSE FALSE FALSE FALSE x != y ## [1] TRUE TRUE TRUE TRUE TRUE # Subsetting of vectors x ## [1] 1 2 3 4 5 x[x >= 2] ## [1] 2 3 4 5 x[x < 3] ## [1] 1 2 # Easy way of work with parts of the data data <- data.frame(x = c(0, 1, 3, 3, 0), y = 1:5) data ## x y ## 1 0 1 ## 2 1 2 ## 3 3 3 ## 4 3 4 ## 5 0 5 # Data such that x is zero data0 <- data[data$x == 0, ]
data0
##   x y
## 1 0 1
## 5 0 5

# Data such that x is larger than 2
data2 <- data[data$x > 2, ] data2 ## x y ## 3 3 3 ## 4 3 4 # In an example iris$Sepal.Width[iris$Sepal.Width > 3] ## [1] 3.5 3.2 3.1 3.6 3.9 3.4 3.4 3.1 3.7 3.4 4.0 4.4 3.9 3.5 3.8 3.8 3.4 ## [18] 3.7 3.6 3.3 3.4 3.4 3.5 3.4 3.2 3.1 3.4 4.1 4.2 3.1 3.2 3.5 3.6 3.4 ## [35] 3.5 3.2 3.5 3.8 3.8 3.2 3.7 3.3 3.2 3.2 3.1 3.3 3.1 3.2 3.4 3.1 3.3 ## [52] 3.6 3.2 3.2 3.8 3.2 3.3 3.2 3.8 3.4 3.1 3.1 3.1 3.1 3.2 3.3 3.4 # Problem - what happened? data[x > 2, ] ## x y ## 3 3 3 ## 4 3 4 ## 5 0 5 # In an example summary(iris) ## Sepal.Length Sepal.Width Petal.Length Petal.Width ## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100 ## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300 ## Median :5.800 Median :3.000 Median :4.350 Median :1.300 ## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199 ## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800 ## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500 ## Species ## setosa :50 ## versicolor:50 ## virginica :50 ## ## ## summary(iris[iris$Sepal.Width > 3, ])
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width
##  Min.   :4.400   Min.   :3.100   Min.   :1.000   Min.   :0.1000
##  1st Qu.:5.000   1st Qu.:3.200   1st Qu.:1.450   1st Qu.:0.2000
##  Median :5.400   Median :3.400   Median :1.600   Median :0.4000
##  Mean   :5.684   Mean   :3.434   Mean   :2.934   Mean   :0.9075
##  3rd Qu.:6.400   3rd Qu.:3.600   3rd Qu.:5.000   3rd Qu.:1.8000
##  Max.   :7.900   Max.   :4.400   Max.   :6.700   Max.   :2.5000
##        Species
##  setosa    :42
##  versicolor: 8
##  virginica :17
##
##
##

# On the factor variable only makes sense == and !=
summary(iris[iris$Species == "setosa", ]) ## Sepal.Length Sepal.Width Petal.Length Petal.Width ## Min. :4.300 Min. :2.300 Min. :1.000 Min. :0.100 ## 1st Qu.:4.800 1st Qu.:3.200 1st Qu.:1.400 1st Qu.:0.200 ## Median :5.000 Median :3.400 Median :1.500 Median :0.200 ## Mean :5.006 Mean :3.428 Mean :1.462 Mean :0.246 ## 3rd Qu.:5.200 3rd Qu.:3.675 3rd Qu.:1.575 3rd Qu.:0.300 ## Max. :5.800 Max. :4.400 Max. :1.900 Max. :0.600 ## Species ## setosa :50 ## versicolor: 0 ## virginica : 0 ## ## ## # Subset argument in lm lm(Sepal.Width ~ Petal.Length, data = iris, subset = Sepal.Width > 3) ## ## Call: ## lm(formula = Sepal.Width ~ Petal.Length, data = iris, subset = Sepal.Width > ## 3) ## ## Coefficients: ## (Intercept) Petal.Length ## 3.59439 -0.05455 lm(Sepal.Width ~ Petal.Length, data = iris, subset = iris$Sepal.Width > 3)
##
## Call:
## lm(formula = Sepal.Width ~ Petal.Length, data = iris, subset = iris$Sepal.Width > ## 3) ## ## Coefficients: ## (Intercept) Petal.Length ## 3.59439 -0.05455 # Both iris$Sepal.Width and Sepal.Width in subset are fine: data = iris
# tells R to look for Sepal.Width in the iris dataset

# Same thing for the subset field in R Commander's menus

# AND operator &
TRUE & TRUE
## [1] TRUE
TRUE & FALSE
## [1] FALSE
FALSE & FALSE
## [1] FALSE

# OR operator |
TRUE | TRUE
## [1] TRUE
TRUE | FALSE
## [1] TRUE
FALSE | FALSE
## [1] FALSE

# Both operators are useful for checking for ranges of data
y
## [1] 0 3 1 5 2
index1 <- (y <= 3) & (y > 0)
y[index1]
## [1] 3 1 2
index2 <- (y < 2) | (y > 4)
y[index2]
## [1] 0 1 5

# In an example
summary(iris[iris$Sepal.Width > 3 & iris$Sepal.Width < 3.5, ])
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width
##  Min.   :4.400   Min.   :3.100   Min.   :1.200   Min.   :0.100
##  1st Qu.:4.925   1st Qu.:3.125   1st Qu.:1.500   1st Qu.:0.200
##  Median :5.950   Median :3.200   Median :4.450   Median :1.400
##  Mean   :5.781   Mean   :3.245   Mean   :3.460   Mean   :1.145
##  3rd Qu.:6.700   3rd Qu.:3.400   3rd Qu.:5.375   3rd Qu.:2.075
##  Max.   :7.200   Max.   :3.400   Max.   :6.000   Max.   :2.500
##        Species
##  setosa    :20
##  versicolor: 8
##  virginica :14
##
##
## 

Do the following for the iris dataset:

• Compute the subset corresponding to Petal.Length either smaller than 1.5 or larger than 2. Save this dataset as irisPetal.
• Compute and summarize a linear regression of Sepal.Width into Petal.Width + Petal.Length for the dataset irisPetal. What is the R2? (Solution: 0.101)
• Check that the previous model is the same as regressing Sepal.Width into Petal.Width + Petal.Length for the dataset iris with the appropriate subset expression.
• Compute the variance for Petal.Width when Petal.Width is smaller or equal that 1.5 and larger than 0.3. (Solution: 0.1266541)

4.1.4 Plotting functions

# plot is the main function for plotting in R
# It has a different behavior depending on the kind of object that it receives

# For example, for a regression model, it produces diagnostic plots
mod <- lm(Sepal.Width ~ Sepal.Length, data = iris)
plot(mod, 1)


# How to plot some data
plot(iris$Sepal.Length, iris$Sepal.Width, main = "Sepal.Length vs Sepal.Width")


# Change the axis limits
plot(iris$Sepal.Length, iris$Sepal.Width, xlim = c(0, 10), ylim = c(0, 10))


# How to plot a curve (a parabola)
x <- seq(-1, 1, l = 50)
y <- x^2
plot(x, y)

plot(x, y, main = "A dotted parabola")

plot(x, y, main = "A parabola", type = "l")

plot(x, y, main = "A red and thick parabola", type = "l", col = "red", lwd = 3)


# Plotting a more complicated curve between -pi and pi
x <- seq(-pi, pi, l = 50)
y <- (2 + sin(10 * x)) * x^2
plot(x, y, type = "l") # Kind of rough...


# More detailed plot
x <- seq(-pi, pi, l = 500)
y <- (2 + sin(10 * x)) * x^2
plot(x, y, type = "l")


# Remember that we are joining points for creating a curve!

# For more options in the plot customization see
?plot
?par

# plot is a first level plotting function. That means that whenever is called,
# it creates a new plot. If we want to add information to an existing plot, we
# have to use a second level plotting function such as points, lines or abline

plot(x, y) # Create a plot
lines(x, x^2, col = "red") # Add lines
points(x, y + 10, col = "blue") # Add points
abline(a = 5, b = 1, col = "orange", lwd = 2) # Add a straight line y = a + b * x

4.1.5 Distributions

The operations on distributions described here are implemented in R Commander through the menu 'Distributions', but is convenient for you to grasp how are they working.

# R allows to sample [r], compute density/probability mass [d],
# compute distribution function [p] and compute quantiles [q] for several
# continuous and discrete distributions. The format employed is [rdpq]name,
# where name stands for:
# - norm -> Normal
# - unif -> Uniform
# - exp -> Exponential
# - t -> Student's t
# - f -> Snedecor's F
# - chisq -> Chi squared
# - pois -> Poisson
# - binom -> Binomial
# More distributions:
?Distributions

# Sampling from a Normal - 100 random points from a N(0, 1)
rnorm(n = 10, mean = 0, sd = 1)
##  [1] -1.8367426 -1.3366952 -0.4906582  1.0215158  0.1637865  2.5039127
##  [7]  1.3113124 -0.1352548  0.1846896 -0.6373963

# If you want to have always the same result, set the seed of the random number
# generator
set.seed(45678)
rnorm(n = 10, mean = 0, sd = 1)
##  [1]  1.4404800 -0.7195761  0.6709784 -0.4219485  0.3782196 -1.6665864
##  [7] -0.5082030  0.4433822 -1.7993868 -0.6179521

# Plotting the density of a N(0, 1) - the Gauss bell
x <- seq(-4, 4, l = 100)
y <- dnorm(x = x, mean = 0, sd = 1)
plot(x, y, type = "l")


# Plotting the distribution function of a N(0, 1)
x <- seq(-4, 4, l = 100)
y <- pnorm(q = x, mean = 0, sd = 1)
plot(x, y, type = "l")


# Computing the 95% quantile for a N(0, 1)
qnorm(p = 0.95, mean = 0, sd = 1)
## [1] 1.644854

# All distributions have the same syntax: rname(n,...), dname(x,...), dname(p,...)
# and qname(p,...), but the parameters in ... change. Look them in ?Distributions
# For example, here is the same for the uniform distribution

# Sampling from a U(0, 1)
set.seed(45678)
runif(n = 10, min = 0, max = 1)
##  [1] 0.9251342 0.3339988 0.2358930 0.3366312 0.7488829 0.9327177 0.3365313
##  [8] 0.2245505 0.6473663 0.0807549

# Plotting the density of a U(0, 1)
x <- seq(-2, 2, l = 100)
y <- dunif(x = x, min = 0, max = 1)
plot(x, y, type = "l")


# Computing the 95% quantile for a U(0, 1)
qunif(p = 0.95, min = 0, max = 1)
## [1] 0.95

# Sampling from a Bi(10, 0.5)
set.seed(45678)
samp <- rbinom(n = 200, size = 10, prob = 0.5)
table(samp) / 200
## samp
##     1     2     3     4     5     6     7     8     9
## 0.010 0.060 0.115 0.220 0.210 0.215 0.115 0.045 0.010

# Plotting the probability mass of a Bi(10, 0.5)
x <- 0:10
y <- dbinom(x = x, size = 10, prob = 0.5)
plot(x, y, type = "h") # Vertical bars


# Plotting the distribution function of a Bi(10, 0.5)
x <- 0:10
y <- pbinom(q = x, size = 10, prob = 0.5)
plot(x, y, type = "h")

Do the following:

• Compute the 90%, 95% and 99% quantiles of a F distribution with df1 = 1 and df2 = 5. (Answer: c(4.060420, 6.607891, 16.258177))
• Plot the distribution function of a U(0, 1). Does it make sense with its density function?
• Sample 100 points from a Poisson with lambda = 5.
• Sample 100 points from a U(−1, 1) and compute its mean.
• Plot the density of a t distribution with df = 1 (use a sequence spanning from -4 to 4). Add lines of different colors with the densities for df = 5, df = 10, df = 50 and df = 100. Do you see any pattern?

4.1.6 Defining functions

# A function is a way of encapsulating a block of code so it can be reused easily
# They are useful for simplifying repetitive tasks and organize the analysis
# For example, in Setion 3.7 we had to make use of simpleAnova for computing
# the simple ANOVA table in multiple regression.

# This is a silly function that takes x and y and returns its sum
add <- function(x, y) {
x + y
}

# Calling add - you need to run the definition of the function first!
## [1] 2
add(x = 1, y = 2)
## [1] 3

# A more complex function: computes a linear model and its posterior summary.
# Saves us a few keystrokes when computing a lm and a summary
lmSummary <- function(formula, data) {
model <- lm(formula = formula, data = data)
summary(model)
}

# Usage
lmSummary(Sepal.Length ~ Petal.Width, iris)
##
## Call:
## lm(formula = formula, data = data)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -1.38822 -0.29358 -0.04393  0.26429  1.34521
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  4.77763    0.07293   65.51   <2e-16 ***
## Petal.Width  0.88858    0.05137   17.30   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.478 on 148 degrees of freedom
## Multiple R-squared:  0.669,  Adjusted R-squared:  0.6668
## F-statistic: 299.2 on 1 and 148 DF,  p-value: < 2.2e-16

# Recall: there is no variable called model in the workspace.
# The function works on its own workspace!
model
##
## Call:
## lm(formula = medv ~ crim + lstat + zn + nox, data = Boston)
##
## Coefficients:
## (Intercept)         crim        lstat           zn          nox
##    30.93462     -0.08297     -0.90940      0.03493      5.42234

# Add a line to a plot
addLine <- function(x, beta0, beta1) {
lines(x, beta0 + beta1 * x, lwd = 2, col = 2)
}

# Usage
plot(x, y)
addLine(x, beta0 = 0.1, beta1 = 0)