19.4 Quantitative prediction tasks
Rather than forecasting some probability, we now predict values by some model (e.g., a linear regression model).
Goal: Predicting some value on a scale.
The two parts to a model:

1. Defining a family of models that express a precise, but generic, pattern that you want to capture (e.g., as an equation that contains free parameters). For instance, linear models (that predict y from x) have the general form y = p1 + p2 * x and two parameters: an intercept p1 and a slope p2.
2. Generating a fitted model by finding the model from the family that is closest to our data (assuming some distance measure).
Thus, the “best” model is not “true,” but the “closest” based on some criterion of distance.
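Restated as a formula (using the notation of the two parts above), the family of linear models is

\[ y = p_1 + p_2 \cdot x , \]

and fitting means choosing the parameter pair \((p_1, p_2)\) whose predictions lie closest to the observed data points (under the chosen distance measure).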
Code from Chapter 23: Model basics of the r4ds book (Wickham & Grolemund, 2017):
# setup:
library(tidyverse)
library(modelr)
library(ds4psy)
library(unikn)
options(na.action = na.warn)
# data:
dat <- modelr::sim1
Figure 19.1 shows a scatterplot of the data (dat):

Figure 19.1: A scatterplot of the to-be-predicted data.
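The plotting code for Figure 19.1 is not shown in the text; a minimal sketch that would reproduce such a scatterplot (the title and caption are assumptions, the rest uses the setup above) is:

# Sketch of a scatterplot of the to-be-predicted data (dat):
ggplot(dat, aes(x, y)) +
  geom_point(color = "steelblue", size = 2, alpha = 2/3) +
  labs(title = "The data to predict", caption = "Data from modelr::sim1") +
  theme_ds4psy()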
Random models
Generate n linear models of the form y = p1 + p2 * x:
n <- 100
set.seed(246)  # for reproducible randomness.

# Sample n random values for p1 and p2:
models <- tibble(
  p1 = runif(n, -20, 20),
  p2 = runif(n, -5, 5)
)
ggplot(sim1, aes(x, y)) +
geom_abline(aes(intercept = p1, slope = p2), data = models, alpha = 1/5) +
geom_point(color = "steelblue", size = 2, alpha = 2/3) +
labs(title = "The data to predict and random linear models", caption = "Data from modelr::sim1") +
theme_ds4psy()
Note the use of set.seed() for reproducible randomness.
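Calling set.seed() with the same seed before drawing random numbers yields identical draws. A minimal illustration (the seed 246 matches the chunk above; any integer would do):

set.seed(246)
a <- runif(3)
set.seed(246)
b <- runif(3)
identical(a, b)  # TRUE: same seed, same "random" numbers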
Some random linear models appear to be ok, but most of them are pretty bad (i.e., do not describe the locations and relations of the data points very well).
We can improve the quality of our models by narrowing down the range of parameters from which we sample:
set.seed(123)  # for reproducible randomness.

# Sample n random values for p1 and p2:
models <- tibble(
  p1 = runif(n, 0, 10),
  p2 = runif(n, 0, 5)
)
ggplot(sim1, aes(x, y)) +
geom_abline(aes(intercept = p1, slope = p2), data = models, alpha = 1/5) +
geom_point(color = "steelblue", size = 2, alpha = 2/3) +
labs(title = "The data to predict and random linear models", caption = "Data from modelr::sim1") +
theme_ds4psy()
The new set of models is much better. But to determine which is the best candidate among them, we need some way of comparing them to the data in a quantitative fashion.
To measure the performance of a model, we need to quantify the distance between the model and the actual data points.
A function that computes the predictions of a model (i.e., the model's y values for some input data data$x), given the model parameters mp:
# mp ... model parameters

# Model predictions:
model1 <- function(mp, data) {
  mp[1] + mp[2] * data$x
}
# Generate predictions for various model parameters and dat:
model1(c(6, 3), dat)
#> [1] 9 9 9 12 12 12 15 15 15 18 18 18 21 21 21 24 24 24 27 27 27 30 30 30 33
#> [26] 33 33 36 36 36
model1(c(6, 2), dat)
#> [1] 8 8 8 10 10 10 12 12 12 14 14 14 16 16 16 18 18 18 20 20 20 22 22 22 24
#> [26] 24 24 26 26 26
model1(c(5, 3), dat)
#> [1] 8 8 8 11 11 11 14 14 14 17 17 17 20 20 20 23 23 23 26 26 26 29 29 29 32
#> [26] 32 32 35 35 35
model1(c(5, 2), dat)
#> [1] 7 7 7 9 9 9 11 11 11 13 13 13 15 15 15 17 17 17 19 19 19 21 21 21 23
#> [26] 23 23 25 25 25
Adding a distance function: the root mean-squared deviation. The following function computes the differences between actual and predicted values, squares them, averages them, and then takes the square root:
distance <- function(model, data) {
  diff <- data$y - model1(model, data)
  sqrt(mean(diff ^ 2))
}
# Measure distance for models and dat:
distance(c(6, 3), dat)
#> [1] 7.803283
distance(c(6, 2), dat)
#> [1] 2.60544
distance(c(5, 3), dat)
#> [1] 6.920964
distance(c(5, 2), dat)
#> [1] 2.190166
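In formula form, the distance() function above computes the root mean-squared deviation (RMSD) between the \(n\) observed values \(y_i\) and the corresponding model predictions \(\hat{y}_i\):

\[ \mathrm{RMSD} = \sqrt{\frac{1}{n} \sum_{i = 1}^{n} \left(y_i - \hat{y}_i\right)^2} \]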
An alternative distance function (using the 2 model parameters as two distinct arguments and only using the data of dat):
dist <- function(p1, p2) {
  distance(c(p1, p2), data = dat)
}
# Checks:
all.equal(dist(6, 3), distance(c(6, 3), dat))
#> [1] TRUE
all.equal(dist(6, 2), distance(c(6, 2), dat))
#> [1] TRUE
We can now compute the distance of each model in models:
# using a for loop:
md <- rep(NA, nrow(models))

for (i in 1:nrow(models)){
  md[i] <- dist(models$p1[i], models$p2[i])
}
md

# Or directly:
md_2 <- purrr::map2_dbl(models$p1, models$p2, dist)

# Verify equality:
all.equal(md, md_2)
In one step: Compute the distance of all models (generated above) and arrange them by lowest distance:
models <- models %>%
  mutate(dist = purrr::map2_dbl(p1, p2, dist)) %>%
  arrange(dist)

models  # with a column for dist
#> # A tibble: 100 x 3
#> p1 p2 dist
#> <dbl> <dbl> <dbl>
#> 1 4.15 2.05 2.13
#> 2 1.88 2.33 2.41
#> 3 2.32 2.20 2.43
#> 4 6.53 1.61 2.48
#> 5 5.51 2.05 2.50
#> 6 6.93 1.60 2.50
#> 7 6.56 1.85 2.52
#> 8 6.68 1.56 2.56
#> 9 3.80 1.86 2.64
#> 10 6.41 1.54 2.67
#> # … with 90 more rows
We can now select the top_n models (i.e., those with the lowest overall distance to the sim1 data) and compare them to the data:
# Plot the top models:
top_n <- 10

# Subset of models to plot:
m2p <- models[1:top_n, ]  # base R solution (given that models are sorted by dist)
m2p <- dplyr::filter(models, rank(dist) <= top_n)  # dplyr solution
# m2p

pt <- paste0("The data to predict and the top ", top_n, " random linear models")
ggplot(sim1, aes(x, y)) +
geom_abline(aes(intercept = p1, slope = p2, color = dist),
data = m2p, alpha = 4/5) +
geom_point(color = "steelblue", size = 2, alpha = 2/3) +
labs(title = pt, caption = "Data from modelr::sim1") +
scale_color_gradient(low = "gold", high = Bordeaux) +
theme_ds4psy()
An alternative way of looking at the models is to plot the parameter pair p1 and p2 of each model as a scatterplot.
To gain an impression of a model's performance, we can color the corresponding point by dist (i.e., models with a lower distance value will be colored in a lighter color).
Additionally, the following command first draws the subset of the top_n models in a larger size and brighter color, effectively highlighting those models in the full scatterplot:
ggplot(models, aes(p1, p2)) +
geom_point(data = filter(models, rank(dist) <= top_n),
size = 4, color = Seegruen) +
geom_point(aes(color = dist), size = 2) +
scale_color_gradient(low = "gold", high = Bordeaux) +
theme_ds4psy()
Grid search
Rather than generating model parameters randomly (within constraints), we could be more systematic and generate an evenly spaced grid of parameter values (within the same constraints). The following performs such a grid search for \(20\times20\) parameter pairs:
grids <- expand.grid(
  p1 = seq(0, 10, length = 20),
  p2 = seq(0, 5, length = 20)
) %>%
  mutate(dist = purrr::map2_dbl(p1, p2, dist)) %>%
  arrange(dist)
# Data with grid of linear models:
ggplot(sim1, aes(x, y)) +
geom_abline(aes(intercept = p1, slope = p2, color = dist),
data = grids, alpha = 1/5) +
geom_point(color = "steelblue", size = 2, alpha = 2/3) +
labs(title = "Grid search of linear models",
caption = "Data from modelr::sim1") +
scale_color_gradient(low = "gold", high = Bordeaux) +
theme_ds4psy()
# Parameters (as scatterplot):
ggplot(grids, aes(p1, p2)) +
geom_point(data = filter(grids, rank(dist) <= top_n),
size = 4, colour = Seegruen) +
geom_point(aes(colour = dist), size = 2) +
scale_color_gradient(low = "gold", high = Bordeaux) +
theme_ds4psy()
We could repeat this process for a finer grid to incrementally zoom in to the best parameters.
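This iterative zooming is not implemented above. A minimal sketch of how it could work (the helper zoom_grid() and its width_1/width_2 arguments are hypothetical, not part of the book's code) re-centers a finer grid on the current best parameter pair:

# Re-center a 20 x 20 grid on the current best parameters (hypothetical helper):
zoom_grid <- function(best_p1, best_p2, width_1 = 1, width_2 = 0.5) {
  expand.grid(
    p1 = seq(best_p1 - width_1, best_p1 + width_1, length = 20),
    p2 = seq(best_p2 - width_2, best_p2 + width_2, length = 20)
  ) %>%
    mutate(dist = purrr::map2_dbl(p1, p2, dist)) %>%
    arrange(dist)
}

# Zoom in once around the best grid point found above:
grids_2 <- zoom_grid(grids$p1[1], grids$p2[1])
head(grids_2, 3)  # the top distances should be no larger than before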
Minimizing distance
An important part of quantitative modeling is to find parameters that optimize (i.e., maximize or minimize) some measure.
There are multiple ways of doing this.
If we are considering a one-dimensional range of values, the optimize() function (of the stats package) searches this range (from some lower to some upper value) for a minimum (or maximum) of the first argument of some function f:
curve(expr = (x - 1/3)^2, from = -1, to = 2, col = "deepskyblue", lwd = 2, ylab = "y")
# As a function:
fn <- function(x, a) {(x - a)^2}  # x as 1st parameter (to be optimized)

(mn <- optimize(fn, c(-1, 2), a = 1/3))
#> $minimum
#> [1] 0.3333333
#>
#> $objective
#> [1] 0
(mx <- optimize(fn, c(-1, 2), a = 1/3, maximum = TRUE))
#> $maximum
#> [1] 1.999924
#>
#> $objective
#> [1] 2.777525
(mx_2 <- optimize(fn, c(-1, 1), a = 1/3, maximum = TRUE))  # Note limited range
#> $maximum
#> [1] -0.999959
#>
#> $objective
#> [1] 1.777668
# Note: Results are lists. To convert into a named vector:
# (mx <- unlist(mx))
points(x = mn$minimum, y = mn$objective, col = "olivedrab3", pch = 16, cex = 2) # using mn
points(x = mx$maximum, y = mx$objective, col = "brown2", pch = 17, cex = 2) # using mx
points(x = mx_2$maximum, y = mx_2$objective, col = "steelblue", pch = 18, cex = 2) # using mx_2
Examples based on the documentation of optimize():
# Example 2:
fn <- function(x) {x^3 * (x - 1)}
plot(fn, col = "deepskyblue", lwd = 2)

# Minimum:
(mn <- optimize(fn, lower = 0, upper = 2))
#> $minimum
#> [1] 0.7500078
#>
#> $objective
#> [1] -0.1054687
points(x = mn$minimum, y = mn$objective, col = "olivedrab3", pch = 16, cex = 2) # using mn
# See where a function is evaluated:
optimize(function(x) x^3 * (print(x) - 1), lower = 0, upper = 2)
#> [1] 0.763932
#> [1] 1.236068
#> [1] 0.472136
#> [1] 0.6666667
#> [1] 0.7777988
#> [1] 0.7471242
#> [1] 0.7502268
#> [1] 0.7499671
#> [1] 0.7500078
#> [1] 0.7500485
#> [1] 0.7500078
#> $minimum
#> [1] 0.7500078
#>
#> $objective
#> [1] -0.1054687
# Warning:
# A "wrong" solution with an unlucky interval and piecewise constant f():
f  <- function(x) ifelse(x > -1, ifelse(x < 4, exp(-1/abs(x - 1)), 10), 10)
fp <- function(x) { print(x); f(x) }

plot(f, -2, 5, ylim = 0:1, col = "red")

optimize(fp, c(-4, 20))  # fails to find the minimum
#> [1] 5.167184
#> [1] 10.83282
#> [1] 14.33437
#> [1] 16.49845
#> [1] 17.83592
#> [1] 18.66253
#> [1] 19.1734
#> [1] 19.48913
#> [1] 19.68427
#> [1] 19.80487
#> [1] 19.8794
#> [1] 19.92547
#> [1] 19.95393
#> [1] 19.97153
#> [1] 19.9824
#> [1] 19.98913
#> [1] 19.99328
#> [1] 19.99585
#> [1] 19.99743
#> [1] 19.99841
#> [1] 19.99902
#> [1] 19.99939
#> [1] 19.99963
#> [1] 19.99977
#> [1] 19.99986
#> [1] 19.99991
#> [1] 19.99995
#> [1] 19.99995
#> $minimum
#> [1] 19.99995
#>
#> $objective
#> [1] 10
optimize(fp, c(-7, 20)) # finds minimum
#> [1] 3.313082
#> [1] 9.686918
#> [1] -0.6261646
#> [1] 1.244956
#> [1] 1.250965
#> [1] 0.771827
#> [1] 0.2378417
#> [1] 1.000451
#> [1] 0.9906964
#> [1] 0.9955736
#> [1] 0.9980122
#> [1] 0.9992315
#> [1] 0.9998411
#> [1] 0.9996083
#> [1] 0.9994644
#> [1] 0.9993754
#> [1] 0.9993204
#> [1] 0.9992797
#> [1] 0.9992797
#> $minimum
#> [1] 0.9992797
#>
#> $objective
#> [1] 0
optimize(fp, c(-2, 5))
#> [1] 0.6737621
#> [1] 2.326238
#> [1] -0.3475242
#> [1] 0.9935203
#> [1] 1.075221
#> [1] 1.034342
#> [1] 1.013931
#> [1] 1.003726
#> [1] 0.998623
#> [1] 1.001174
#> [1] 0.9998987
#> [1] 1.000538
#> [1] 1.000294
#> [1] 1.000143
#> [1] 1.00005
#> [1] 0.999992
#> [1] 0.9999513
#> [1] 0.9999513
#> $minimum
#> [1] 0.9999513
#>
#> $objective
#> [1] 0
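Note that optimize() handles only a single parameter. As a sketch connecting it to the two-parameter models above (the value of p1_fixed is chosen arbitrarily here, just for illustration), we could fix the intercept p1 and optimize only the slope p2:

# Fix the intercept p1 at an arbitrary value and optimize only the slope p2
# over the range used for the grid search above:
p1_fixed <- 4
optimize(function(p2) dist(p1_fixed, p2), lower = 0, upper = 5)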
General intuition: Given the scatterplot of our data (shown in Figure 19.1), we have a pretty good idea that we are looking for a linear model. Here, we explore two different methods for fitting one (both from the stats package that comes with R):

1. General-purpose optimization (using optim())
2. Fitting a linear model (using lm())
- ad 1: Using optim() to determine the best parameter pair:
Intuition: Given fn = distance and data, optim() will use an algorithm to find the parameter combination par that minimizes the value of distance.
# sim1
best <- optim(par = c(0, 0), fn = distance, data = sim1)
best$par  # vector of the 2 best parameters (i.e., minimizing fn for data)
#> [1] 4.222248 2.051204
ggplot(sim1, aes(x, y)) +
geom_point(color = "steelblue", size = 2, alpha = 2/3) +
geom_abline(intercept = best$par[1], slope = best$par[2], color = "gold") +
labs(title = "The best linear model to predict the data",
caption = "Data from modelr::sim1") +
theme_ds4psy()
- ad 2: Using lm() for fitting a linear model:
sim1_mod <- lm(y ~ x, data = sim1)
coef(sim1_mod)
#> (Intercept) x
#> 4.220822 2.051533
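A quick sketch comparing the two approaches (both objects were computed above): the parameters found by optim() and the coefficients fitted by lm() agree closely for this dataset:

# Compare the optim() solution with the lm() coefficients:
round(best$par, 2)                # from optim()
round(unname(coef(sim1_mod)), 2)  # from lm()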
For fitting other datasets and types of functions, using glm() or nlm() may be indicated.
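As a quick plausibility check (a sketch, not from the book): for the sim1 data, a Gaussian glm() should reproduce the lm() coefficients, since a linear model is a special case of a generalized linear model:

# Fit the same model as a generalized linear model (Gaussian family):
sim1_glm <- glm(y ~ x, family = gaussian, data = sim1)
coef(sim1_glm)  # should agree with coef(sim1_mod) up to numerical tolerance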
19.4.1 ToDo 2
Generalize and use a more meaningful dataset.
Candidate data: father.son or galton (both from the UsingR package).
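A possible starting point for this ToDo (a sketch only; it assumes that the UsingR package is installed and that father.son contains the columns fheight and sheight):

# Fit a linear model predicting sons' heights from fathers' heights
# (the column names fheight/sheight are assumptions about UsingR::father.son):
fs_mod <- lm(sheight ~ fheight, data = UsingR::father.son)
coef(fs_mod)  # fitted intercept and slope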