17.4 Quantitative prediction tasks

Goal: Predicting the value of some variable on a quantitative (numeric) scale.

There are two parts to a model:

  1. Defining a family of models that express a precise, but generic, pattern that you want to capture (e.g., as an equation that contains free parameters). For instance, linear models (that predict y from x) have the general form y = p1 + p2 * x and two parameters: an intercept p1 and a slope p2.

  2. Generating a fitted model by finding the model from the family that is the closest to our data (assuming some distance measure).

Thus, the “best” model is not “true,” but the “closest” based on some criterion of distance.

Code from Chapter 23: Model basics of the r4ds book (Wickham & Grolemund, 2017):

# setup:
library(tidyverse)
library(modelr)
library(ds4psy)
library(unikn)
options(na.action = na.warn)

# data:
dat <- modelr::sim1

Figure 17.1 shows a scatterplot of the data:

Figure 17.1: A scatterplot of the to-be-predicted data.
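
The figure could be re-created as follows (a minimal sketch, assuming the setup code from above; the plot title and styling details are assumptions):

# A sketch for re-creating Figure 17.1:
ggplot(dat, aes(x = x, y = y)) + 
  geom_point(color = "steelblue", size = 2, alpha = 2/3) + 
  labs(title = "The data to predict", caption = "Data from modelr::sim1") + 
  theme_ds4psy()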

Random models

Generate n linear models of the form y = p1 + p2 * x:

n <- 100
set.seed(246)  # for reproducible randomness. 

# Sample n random values for p1 and p2: 
models <- tibble(
  p1 = runif(n, -20, 20),
  p2 = runif(n,  -5,  5)
)

ggplot(dat, aes(x, y)) + 
  geom_abline(aes(intercept = p1, slope = p2), data = models, alpha = 1/5) +
  geom_point(color = "steelblue", size = 2, alpha = 2/3) + 
  labs(title = "The data to predict and random linear models", caption = "Data from modelr::sim1") + 
  theme_ds4psy()

Note the use of set.seed() to make the randomness reproducible.

Some of these random linear models appear to be okay, but most are pretty bad (i.e., they do not describe the location and trend of the data points very well).

We can improve the quality of our models by narrowing down the range of parameters from which we sample:

set.seed(123)  # for reproducible randomness. 

# Sample n random values for p1 and p2: 
models <- tibble(
  p1 = runif(n, 0, 10),
  p2 = runif(n, 0, 5)
)

ggplot(dat, aes(x, y)) + 
  geom_abline(aes(intercept = p1, slope = p2), data = models, alpha = 1/5) +
  geom_point(color = "steelblue", size = 2, alpha = 2/3) + 
  labs(title = "The data to predict and random linear models", caption = "Data from modelr::sim1") + 
  theme_ds4psy()

The new set of models is much better. But to determine the best candidate among them, we need some way of comparing them to the data in a quantitative fashion.

To measure the performance of a model, we need to quantify the distance between the model and the actual data points.

A function that computes the predictions of a model (i.e., the model’s predicted y values for the input values data$x), given the model parameters mp:

# Model predictions (mp ... model parameters): 
model1 <- function(mp, data) {
  mp[1] + mp[2] * data$x 
}

# Generate predictions for various model parameters and dat:
model1(c(6, 3), dat)
#>  [1]  9  9  9 12 12 12 15 15 15 18 18 18 21 21 21 24 24 24 27 27 27 30 30 30 33
#> [26] 33 33 36 36 36
model1(c(6, 2), dat)
#>  [1]  8  8  8 10 10 10 12 12 12 14 14 14 16 16 16 18 18 18 20 20 20 22 22 22 24
#> [26] 24 24 26 26 26
model1(c(5, 3), dat)
#>  [1]  8  8  8 11 11 11 14 14 14 17 17 17 20 20 20 23 23 23 26 26 26 29 29 29 32
#> [26] 32 32 35 35 35
model1(c(5, 2), dat)
#>  [1]  7  7  7  9  9  9 11 11 11 13 13 13 15 15 15 17 17 17 19 19 19 21 21 21 23
#> [26] 23 23 25 25 25

Adding a distance function: the root mean squared deviation (RMSD). The following function computes the differences between actual and predicted values, squares them, averages them, and then takes the square root (i.e., sqrt(mean((y - y_predicted)^2))):

distance <- function(mp, data) {
  diff <- data$y - model1(mp, data)
  sqrt(mean(diff ^ 2))
}

# Measure distance for models and dat: 
distance(c(6, 3), dat)
#> [1] 7.803283
distance(c(6, 2), dat)
#> [1] 2.60544
distance(c(5, 3), dat)
#> [1] 6.920964
distance(c(5, 2), dat)
#> [1] 2.190166

An alternative distance function dist() that takes the two model parameters as two distinct arguments and always uses the data of dat:

dist <- function(p1, p2) {
  distance(c(p1, p2), data = dat)
}

# Checks: 
all.equal(dist(6, 3), distance(c(6, 3), dat))
#> [1] TRUE
all.equal(dist(6, 2), distance(c(6, 2), dat))
#> [1] TRUE

We can now compute the distance of each model in models:

# using a for loop:
md <- rep(NA, nrow(models))

for (i in 1:nrow(models)){
  md[i] <- dist(models$p1[i], models$p2[i])  
}
md

# Or directly:
md_2 <- purrr::map2_dbl(models$p1, models$p2, dist)

# Verify equality:
all.equal(md, md_2)

In one step: Compute the distance of each model (generated above) and arrange the models by increasing distance:

models <- models %>% 
  mutate(dist = purrr::map2_dbl(p1, p2, dist)) %>%
  arrange(dist)

models  # with a column for dist
#> # A tibble: 100 x 3
#>       p1    p2  dist
#>    <dbl> <dbl> <dbl>
#>  1  4.15  2.05  2.13
#>  2  1.88  2.33  2.41
#>  3  2.32  2.20  2.43
#>  4  6.53  1.61  2.48
#>  5  5.51  2.05  2.50
#>  6  6.93  1.60  2.50
#>  7  6.56  1.85  2.52
#>  8  6.68  1.56  2.56
#>  9  3.80  1.86  2.64
#> 10  6.41  1.54  2.67
#> # … with 90 more rows

We can now select the top_n models (i.e., those with the lowest overall distance to the sim1 data) and compare them to the data:

# Plot the top models:
top_n <- 10

# subset of models to plot:
m2p <- models[1:top_n, ]  # base R solution (works as models are sorted by increasing dist)
m2p <- dplyr::filter(models, rank(dist) <= top_n)  # equivalent dplyr solution
# m2p

pt <- paste0("The data to predict and the top ", top_n, " random linear models")

ggplot(dat, aes(x, y)) + 
  geom_abline(aes(intercept = p1, slope = p2, color = dist), 
              data = m2p, alpha = 4/5) +
  geom_point(color = "steelblue", size = 2, alpha = 2/3) + 
  labs(title = pt, caption = "Data from modelr::sim1") + 
  scale_color_gradient(low = "gold", high = Bordeaux) + 
  theme_ds4psy()

An alternative way of looking at the models is to plot the parameter pair p1 and p2 of each model as a scatterplot. To gain an impression of each model’s performance, we can color the corresponding point by its dist value (i.e., models with a lower distance value are shown in a lighter color). Additionally, the following command first draws the subset of the top_n models in a larger size and a distinct color, effectively highlighting these models in the full scatterplot:

ggplot(models, aes(p1, p2)) +
  geom_point(data = filter(models, rank(dist) <= top_n), 
             size = 4, color = Seegruen) +
  geom_point(aes(color = dist), size = 2) +
  scale_color_gradient(low = "gold", high = Bordeaux) + 
  theme_ds4psy()
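
Instead of sampling parameter values at random, we could also evaluate dist() for all combinations of a regular grid of parameter values (a sketch of a simple grid search, assuming the dist() function from above; the grid ranges and resolution are assumptions):

# Grid search (sketch): evaluate dist() on a regular grid of parameter values
grid <- expand.grid(p1 = seq(0, 10, length.out = 25), 
                    p2 = seq(0,  5, length.out = 25)) %>% 
  mutate(dist = purrr::map2_dbl(p1, p2, dist)) %>% 
  arrange(dist)  # sort by increasing distance
head(grid, 3)    # the best parameter combinations on the grid

As the grid becomes finer, the best grid point approaches the optimal parameter values.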

Minimizing distance

An important part of quantitative modeling is finding the parameter values that optimize (i.e., maximize or minimize) some measure.

There are multiple ways of doing this. Given the scatterplot of our data (shown in Figure 17.1), we have a pretty good idea that we are looking for a linear model. Here, we explore two different methods (both from the stats package that comes with R):

  1. General-purpose optimization (using optim())

  2. Fitting a linear model (using lm())

  • ad 1: Using optim() to determine the best parameter pair:

Intuition: Given a function fn = distance and some data, optim() uses an iterative algorithm to find the parameter combination par that minimizes the value of distance.

best <- optim(par = c(0, 0), fn = distance, data = dat)
best$par # vector of the 2 best parameters (i.e., minimizing fn for data)
#> [1] 4.222248 2.051204

ggplot(dat, aes(x, y)) + 
  geom_point(color = "steelblue", size = 2, alpha = 2/3) + 
  geom_abline(intercept = best$par[1], slope = best$par[2], color = "gold") + 
  labs(title = "The best linear model to predict the data", 
       caption = "Data from modelr::sim1") + 
  theme_ds4psy()

  • ad 2: Using lm() for fitting a linear model:

sim1_mod <- lm(y ~ x, data = dat)
coef(sim1_mod)
#> (Intercept)           x 
#>    4.220822    2.051533
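
Both methods yield virtually identical parameter values (small discrepancies reflect the numerical tolerance of optim()):

# Comparing both solutions (rounded to 2 decimals):
round(best$par, 2)                # from optim()
#> [1] 4.22 2.05
round(unname(coef(sim1_mod)), 2)  # from lm()
#> [1] 4.22 2.05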

For fitting other types of data and functions, using glm() (for generalized linear models) or nlm() (for general non-linear minimization) may be indicated.
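
For instance, a Gaussian glm() (with its default identity link) fits the same model as lm() above (a minimal sketch):

# A Gaussian glm() yields the same coefficients as lm():
sim1_glm <- glm(y ~ x, data = dat, family = gaussian)
coef(sim1_glm)  # same values as coef(sim1_mod)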

17.4.1 ToDo 2

Generalize and use a more meaningful dataset:

  • Candidate data: father.son or galton (both from the UsingR package)
  • Analysis: repeat the modeling steps from above for the new data
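
A minimal sketch of such an analysis, assuming that the UsingR package is installed (the variable names fheight and sheight are taken from the documentation of the father.son data):

# Sketch: fit a linear model to the father.son data (assumes UsingR): 
fs <- UsingR::father.son  # heights of fathers and sons (in inches)
fs_mod <- lm(sheight ~ fheight, data = fs)
coef(fs_mod)  # intercept and slope (illustrating regression to the mean)

ggplot(fs, aes(x = fheight, y = sheight)) + 
  geom_point(color = "steelblue", size = 1, alpha = 1/3) + 
  geom_abline(intercept = coef(fs_mod)[1], slope = coef(fs_mod)[2], color = "gold") + 
  labs(title = "Predicting a son's height from his father's height", 
       caption = "Data from UsingR::father.son") + 
  theme_ds4psy()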