4.1 More R basics

In order to implement some of the contents of this chapter we need to cover more R basics, mostly related with flexible plotting that is not implemented directly in R Commander. The R functions we will are also very useful for simplifying some R Commander approaches.

In the following sections, type – not copy and paste systematically – the code in the 'R Script' panel and send it to the output panel. Remember that you should get the same outputs (which are preceded by ## [1]).

4.1.1 Data frames revisited

# Let's begin importing the iris dataset
data(iris)

# names gives you the variables in the data frame
names(iris)
## [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"

# The beginning of the data
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

# So we can access variables by $ or as in a matrix
iris$Sepal.Length[1:10]
##  [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9
iris[1:10, 1]
##  [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9
iris[3, 1]
## [1] 4.7

# Information on the dimension of the data frame
dim(iris)
## [1] 150   5

# str gives the structure of any object in R
str(iris)
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

# Recall the species variable: it is a categorical variable (or factor),
# not a numeric variable
iris$Species[1:10]
##  [1] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
## Levels: setosa versicolor virginica

# Factors can only take certain values
levels(iris$Species)
## [1] "setosa"     "versicolor" "virginica"

# If a file contains a variable with character strings as observations (either
# encapsulated by quotation marks or not), the variable will become a factor
# when imported into R

Do the following:

  • Import auto.txt into R as the data frame auto. Check how the character strings in the file give rise to factor variables.
  • Get the dimensions of auto and show beginning of the data.
  • Retrieve the fifth observation of horsepower in two different ways.
  • Compute the levels of name.

4.1.2 Vector-related functions

# The function seq creates sequences of numbers equally separated
seq(0, 1, by = 0.1)
##  [1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
seq(0, 1, length.out = 5)
## [1] 0.00 0.25 0.50 0.75 1.00

# You can short the latter argument
seq(0, 1, l = 5)
## [1] 0.00 0.25 0.50 0.75 1.00

# Repeat number
rep(0, 5)
## [1] 0 0 0 0 0

# Reverse a vector
myVec <- c(1:5, -1:3)
rev(myVec)
##  [1]  3  2  1  0 -1  5  4  3  2  1

# Another way
myVec[length(myVec):1]
##  [1]  3  2  1  0 -1  5  4  3  2  1

# Count repetitions in your data
table(iris$Sepal.Length)
## 
## 4.3 4.4 4.5 4.6 4.7 4.8 4.9   5 5.1 5.2 5.3 5.4 5.5 5.6 5.7 5.8 5.9   6 6.1 6.2 
##   1   3   1   4   2   5   6  10   9   4   1   6   7   6   8   7   3   6   6   4 
## 6.3 6.4 6.5 6.6 6.7 6.8 6.9   7 7.1 7.2 7.3 7.4 7.6 7.7 7.9 
##   9   7   5   2   8   3   4   1   1   3   1   1   1   4   1
table(iris$Species)
## 
##     setosa versicolor  virginica 
##         50         50         50

Do the following:

  • Create the vector \(x=(0.3, 0.6, 0.9, 1.2)\).
  • Create a vector of length 100 ranging from \(0\) to \(1\) with entries equally separated.
  • Compute the amount of zeros and ones in x <- c(0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0). Check that they are the same as in rev(x).
  • Compute the vector \((0.1, 1.1, 2.1, ..., 100.1)\) in four different ways using seq and rev. Do the same but using : instead of seq. (Hint: add 0.1)

4.1.3 Logical conditions and subsetting

# Relational operators: x < y, x > y, x <= y, x >= y, x == y, x!= y
# They return TRUE or FALSE

# Smaller than
0 < 1
## [1] TRUE

# Greater than
1 > 1
## [1] FALSE

# Greater or equal to
1 >= 1 # Remember: ">="" and not "=>"" !
## [1] TRUE

# Smaller or equal to
2 <= 1 # Remember: "<="" and not "=<"" !
## [1] FALSE

# Equal
1 == 1 # Tests equality. Remember: "=="" and not "="" !
## [1] TRUE

# Unequal
1 != 0 # Tests inequality
## [1] TRUE

# TRUE is encoded as 1 and FALSE as 0
TRUE + 1
## [1] 2
FALSE + 1
## [1] 1

# In a vector-like fashion
x <- 1:5
y <- c(0, 3, 1, 5, 2)
x < y
## [1] FALSE  TRUE FALSE  TRUE FALSE
x == y
## [1] FALSE FALSE FALSE FALSE FALSE
x != y
## [1] TRUE TRUE TRUE TRUE TRUE

# Subsetting of vectors
x
## [1] 1 2 3 4 5
x[x >= 2]
## [1] 2 3 4 5
x[x < 3]
## [1] 1 2

# Easy way of work with parts of the data
data <- data.frame(x = c(0, 1, 3, 3, 0), y = 1:5)
data
##   x y
## 1 0 1
## 2 1 2
## 3 3 3
## 4 3 4
## 5 0 5

# Data such that x is zero
data0 <- data[data$x == 0, ]
data0
##   x y
## 1 0 1
## 5 0 5

# Data such that x is larger than 2
data2 <- data[data$x > 2, ]
data2
##   x y
## 3 3 3
## 4 3 4

# In an example
iris$Sepal.Width[iris$Sepal.Width > 3]
##  [1] 3.5 3.2 3.1 3.6 3.9 3.4 3.4 3.1 3.7 3.4 4.0 4.4 3.9 3.5 3.8 3.8 3.4 3.7 3.6
## [20] 3.3 3.4 3.4 3.5 3.4 3.2 3.1 3.4 4.1 4.2 3.1 3.2 3.5 3.6 3.4 3.5 3.2 3.5 3.8
## [39] 3.8 3.2 3.7 3.3 3.2 3.2 3.1 3.3 3.1 3.2 3.4 3.1 3.3 3.6 3.2 3.2 3.8 3.2 3.3
## [58] 3.2 3.8 3.4 3.1 3.1 3.1 3.1 3.2 3.3 3.4

# Problem - what happened?
data[x > 2, ]
##   x y
## 3 3 3
## 4 3 4
## 5 0 5

# In an example
summary(iris)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
## 
summary(iris[iris$Sepal.Width > 3, ])
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width    
##  Min.   :4.400   Min.   :3.100   Min.   :1.000   Min.   :0.1000  
##  1st Qu.:5.000   1st Qu.:3.200   1st Qu.:1.450   1st Qu.:0.2000  
##  Median :5.400   Median :3.400   Median :1.600   Median :0.4000  
##  Mean   :5.684   Mean   :3.434   Mean   :2.934   Mean   :0.9075  
##  3rd Qu.:6.400   3rd Qu.:3.600   3rd Qu.:5.000   3rd Qu.:1.8000  
##  Max.   :7.900   Max.   :4.400   Max.   :6.700   Max.   :2.5000  
##        Species  
##  setosa    :42  
##  versicolor: 8  
##  virginica :17  
##                 
##                 
## 

# On the factor variable only makes sense == and !=
summary(iris[iris$Species == "setosa", ])
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.300   Min.   :1.000   Min.   :0.100  
##  1st Qu.:4.800   1st Qu.:3.200   1st Qu.:1.400   1st Qu.:0.200  
##  Median :5.000   Median :3.400   Median :1.500   Median :0.200  
##  Mean   :5.006   Mean   :3.428   Mean   :1.462   Mean   :0.246  
##  3rd Qu.:5.200   3rd Qu.:3.675   3rd Qu.:1.575   3rd Qu.:0.300  
##  Max.   :5.800   Max.   :4.400   Max.   :1.900   Max.   :0.600  
##        Species  
##  setosa    :50  
##  versicolor: 0  
##  virginica : 0  
##                 
##                 
## 

# Subset argument in lm
lm(Sepal.Width ~ Petal.Length, data = iris, subset = Sepal.Width > 3)
## 
## Call:
## lm(formula = Sepal.Width ~ Petal.Length, data = iris, subset = Sepal.Width > 
##     3)
## 
## Coefficients:
##  (Intercept)  Petal.Length  
##      3.59439      -0.05455
lm(Sepal.Width ~ Petal.Length, data = iris, subset = iris$Sepal.Width > 3)
## 
## Call:
## lm(formula = Sepal.Width ~ Petal.Length, data = iris, subset = iris$Sepal.Width > 
##     3)
## 
## Coefficients:
##  (Intercept)  Petal.Length  
##      3.59439      -0.05455
# Both iris$Sepal.Width and Sepal.Width in subset are fine: data = iris
# tells R to look for Sepal.Width in the iris dataset

# Same thing for the subset field in R Commander's menus

# AND operator &
TRUE & TRUE
## [1] TRUE
TRUE & FALSE
## [1] FALSE
FALSE & FALSE
## [1] FALSE

# OR operator |
TRUE | TRUE
## [1] TRUE
TRUE | FALSE
## [1] TRUE
FALSE | FALSE
## [1] FALSE

# Both operators are useful for checking for ranges of data
y
## [1] 0 3 1 5 2
index1 <- (y <= 3) & (y > 0)
y[index1]
## [1] 3 1 2
index2 <- (y < 2) | (y > 4)
y[index2]
## [1] 0 1 5

# In an example
summary(iris[iris$Sepal.Width > 3 & iris$Sepal.Width < 3.5, ])
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.400   Min.   :3.100   Min.   :1.200   Min.   :0.100  
##  1st Qu.:4.925   1st Qu.:3.125   1st Qu.:1.500   1st Qu.:0.200  
##  Median :5.950   Median :3.200   Median :4.450   Median :1.400  
##  Mean   :5.781   Mean   :3.245   Mean   :3.460   Mean   :1.145  
##  3rd Qu.:6.700   3rd Qu.:3.400   3rd Qu.:5.375   3rd Qu.:2.075  
##  Max.   :7.200   Max.   :3.400   Max.   :6.000   Max.   :2.500  
##        Species  
##  setosa    :20  
##  versicolor: 8  
##  virginica :14  
##                 
##                 
## 

Do the following for the iris dataset:

  • Compute the subset corresponding to Petal.Length either smaller than 1.5 or larger than 2. Save this dataset as irisPetal.
  • Compute and summarize a linear regression of Sepal.Width into Petal.Width + Petal.Length for the dataset irisPetal. What is the \(R^2\)? (Solution: 0.101)
  • Check that the previous model is the same as regressing Sepal.Width into Petal.Width + Petal.Length for the dataset iris with the appropriate subset expression.
  • Compute the variance for Petal.Width when Petal.Width is smaller or equal that 1.5 and larger than 0.3. (Solution: 0.1266541)

4.1.4 Plotting functions

# plot is the main function for plotting in R
# It has a different behavior depending on the kind of object that it receives

# For example, for a regression model, it produces diagnostic plots
mod <- lm(Sepal.Width ~ Sepal.Length, data = iris)
plot(mod, 1)


# How to plot some data
plot(iris$Sepal.Length, iris$Sepal.Width, main = "Sepal.Length vs Sepal.Width")


# Change the axis limits
plot(iris$Sepal.Length, iris$Sepal.Width, xlim = c(0, 10), ylim = c(0, 10))


# How to plot a curve (a parabola)
x <- seq(-1, 1, l = 50)
y <- x^2
plot(x, y)

plot(x, y, main = "A dotted parabola")

plot(x, y, main = "A parabola", type = "l")

plot(x, y, main = "A red and thick parabola", type = "l", col = "red", lwd = 3)


# Plotting a more complicated curve between -pi and pi
x <- seq(-pi, pi, l = 50)
y <- (2 + sin(10 * x)) * x^2
plot(x, y, type = "l") # Kind of rough...


# More detailed plot
x <- seq(-pi, pi, l = 500)
y <- (2 + sin(10 * x)) * x^2
plot(x, y, type = "l")


# Remember that we are joining points for creating a curve!

# For more options in the plot customization see
?plot
## Help on topic 'plot' was found in the following packages:
## 
##   Package               Library
##   base                  /Library/Frameworks/R.framework/Resources/library
##   graphics              /Library/Frameworks/R.framework/Versions/4.1-arm64/Resources/library
## 
## 
## Using the first match ...
?par

# plot is a first level plotting function. That means that whenever is called,
# it creates a new plot. If we want to add information to an existing plot, we
# have to use a second level plotting function such as points, lines or abline

plot(x, y) # Create a plot
lines(x, x^2, col = "red") # Add lines
points(x, y + 10, col = "blue") # Add points
abline(a = 5, b = 1, col = "orange", lwd = 2) # Add a straight line y = a + b * x

4.1.5 Distributions

The operations on distributions described here are implemented in R Commander through the menu 'Distributions', but is convenient for you to grasp how are they working.

# R allows to sample [r], compute density/probability mass [d],
# compute distribution function [p] and compute quantiles [q] for several
# continuous and discrete distributions. The format employed is [rdpq]name,
# where name stands for:
# - norm -> Normal
# - unif -> Uniform
# - exp -> Exponential
# - t -> Student's t
# - f -> Snedecor's F
# - chisq -> Chi squared
# - pois -> Poisson
# - binom -> Binomial
# More distributions:
?Distributions

# Sampling from a Normal - 100 random points from a N(0, 1)
rnorm(n = 10, mean = 0, sd = 1)
##  [1] -1.8367426 -1.3366952 -0.4906582  1.0215158  0.1637865  2.5039127
##  [7]  1.3113124 -0.1352548  0.1846896 -0.6373963

# If you want to have always the same result, set the seed of the random number
# generator
set.seed(45678)
rnorm(n = 10, mean = 0, sd = 1)
##  [1]  1.4404800 -0.7195761  0.6709784 -0.4219485  0.3782196 -1.6665864
##  [7] -0.5082030  0.4433822 -1.7993868 -0.6179521

# Plotting the density of a N(0, 1) - the Gauss bell
x <- seq(-4, 4, l = 100)
y <- dnorm(x = x, mean = 0, sd = 1)
plot(x, y, type = "l")


# Plotting the distribution function of a N(0, 1)
x <- seq(-4, 4, l = 100)
y <- pnorm(q = x, mean = 0, sd = 1)
plot(x, y, type = "l")


# Computing the 95% quantile for a N(0, 1)
qnorm(p = 0.95, mean = 0, sd = 1)
## [1] 1.644854

# All distributions have the same syntax: rname(n,...), dname(x,...), dname(p,...)  
# and qname(p,...), but the parameters in ... change. Look them in ?Distributions
# For example, here is the same for the uniform distribution

# Sampling from a U(0, 1)
set.seed(45678)
runif(n = 10, min = 0, max = 1)
##  [1] 0.9251342 0.3339988 0.2358930 0.3366312 0.7488829 0.9327177 0.3365313
##  [8] 0.2245505 0.6473663 0.0807549

# Plotting the density of a U(0, 1)
x <- seq(-2, 2, l = 100)
y <- dunif(x = x, min = 0, max = 1)
plot(x, y, type = "l")


# Computing the 95% quantile for a U(0, 1)
qunif(p = 0.95, min = 0, max = 1)
## [1] 0.95

# Sampling from a Bi(10, 0.5)
set.seed(45678)
samp <- rbinom(n = 200, size = 10, prob = 0.5)
table(samp) / 200
## samp
##     1     2     3     4     5     6     7     8     9 
## 0.010 0.060 0.115 0.220 0.210 0.215 0.115 0.045 0.010

# Plotting the probability mass of a Bi(10, 0.5)
x <- 0:10
y <- dbinom(x = x, size = 10, prob = 0.5)
plot(x, y, type = "h") # Vertical bars


# Plotting the distribution function of a Bi(10, 0.5)
x <- 0:10
y <- pbinom(q = x, size = 10, prob = 0.5)
plot(x, y, type = "h")

Do the following:

  • Compute the 90%, 95% and 99% quantiles of a \(F\) distribution with df1 = 1 and df2 = 5. (Answer: c(4.060420, 6.607891, 16.258177))
  • Plot the distribution function of a \(U(0,1)\). Does it make sense with its density function?
  • Sample 100 points from a Poisson with lambda = 5.
  • Sample 100 points from a \(U(-1,1)\) and compute its mean.
  • Plot the density of a \(t\) distribution with df = 1 (use a sequence spanning from -4 to 4). Add lines of different colors with the densities for df = 5, df = 10, df = 50 and df = 100. Do you see any pattern?

4.1.6 Defining functions

# A function is a way of encapsulating a block of code so it can be reused easily
# They are useful for simplifying repetitive tasks and organize the analysis
# For example, in Section 3.7 we had to make use of simpleAnova for computing
# the simple ANOVA table in multiple regression.

# This is a silly function that takes x and y and returns its sum
add <- function(x, y) {
  x + y
}

# Calling add - you need to run the definition of the function first!
add(1, 1)
## [1] 2
add(x = 1, y = 2)
## [1] 3

# A more complex function: computes a linear model and its posterior summary.
# Saves us a few keystrokes when computing a lm and a summary
lmSummary <- function(formula, data) {
  model <- lm(formula = formula, data = data)
  summary(model)
}

# Usage
lmSummary(Sepal.Length ~ Petal.Width, iris)
## 
## Call:
## lm(formula = formula, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.38822 -0.29358 -0.04393  0.26429  1.34521 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.77763    0.07293   65.51   <2e-16 ***
## Petal.Width  0.88858    0.05137   17.30   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.478 on 148 degrees of freedom
## Multiple R-squared:  0.669,  Adjusted R-squared:  0.6668 
## F-statistic: 299.2 on 1 and 148 DF,  p-value: < 2.2e-16

# Recall: there is no variable called model in the workspace.
# The function works on its own workspace!
model
## 
## Call:
## lm(formula = medv ~ crim + lstat + zn + nox, data = Boston)
## 
## Coefficients:
## (Intercept)         crim        lstat           zn          nox  
##    30.93462     -0.08297     -0.90940      0.03493      5.42234

# Add a line to a plot
addLine <- function(x, beta0, beta1) {
  lines(x, beta0 + beta1 * x, lwd = 2, col = 2)
}

# Usage
plot(x, y)
addLine(x, beta0 = 0.1, beta1 = 0)