15 Overview of the Generalized Linear Model

Along with Kruschke’s text, in this part of the project we’re moving away from simple Bernoulli coin-flipping examples to more complicated analyses of the type we’d actually see in applied data analysis. As Kruschke explained, we’ll be using a

versatile family of models known as the generalized linear model (GLM; McCullagh & Nelder, 1989; Nelder & Wedderburn, 1972). This family of models comprises the traditional “off the shelf” analyses such as \(t\) tests, analysis of variance (ANOVA), multiple linear regression, logistic regression, log-linear models, etc. (p. 420)

15.1 Types of variables

“To understand the GLM and its many specific cases, we must build up a variety of component concepts regarding relationships between variables and how variables are measured in the first place” (p. 420).

15.1.1 Predictor and predicted variables.

It’s worth repeating the second paragraph of this subsection in its entirety.

The key mathematical difference between predictor and predicted variables is that the likelihood function expresses the probability of values of the predicted variable as a function of values of the predictor variable. The likelihood function does not describe the probabilities of values of the predictor variable. The value of the predictor variable comes from outside the system being modeled, whereas the value of the predicted variable depends on the value of the predictor variable. (p. 420)

This is one of those fine points that can be easy to miss when you’re struggling through the examples in this book or chest-deep in the murky waters of your own real-world data problem. But write this down on a sticky note and put it in your sock drawer or something. There are good reasons to fret about the distributional properties of your predictor variables, but rules of thumb about the likelihood aren’t among them.

15.1.2 Scale types: metric, ordinal, nominal, and count.

I don’t know that I’m interested in detailing the content of this section. But it’s worthwhile considering what Kruschke wrote at its close.

Why we care: We care about the scale type because the likelihood function must specify a probability distribution on the appropriate scale. If the scale has two nominal values, then a Bernoulli likelihood function may be appropriate. If the scale is metric, then a normal distribution may be appropriate as a probability distribution to describe the data. Whenever we are choosing a model for data, we must answer the question, What kind of scale are we dealing with? (p. 423, emphasis in the original)

15.2 Linear combination of predictors

“The core of the GLM is expressing the combined influence of predictors as their weighted sum. The following sections build this idea by scaffolding from the simplest intuitive cases” (p. 423).

15.2.1 Linear function of a single metric predictor.

“A linear function is the generic, ‘vanilla,’ off-the-shelf dependency that is used in statistical models” (p. 424). Its basic form is

\[y = \beta_0 + \beta_1 x,\]

where \(y\) is the variable being predicted and \(x\) is the predictor. \(\beta_0\) is the intercept (i.e., the expected value when \(x\) is zero) and \(\beta_1\) is the expected increase in \(y\) after a one-unit increase in \(x\).

We’ll fire up the tidyverse to make the left panel of Figure 15.1.
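Here’s a minimal sketch of that kind of code. The particular intercepts and slopes are placeholders, not necessarily the values behind the original figure.

```r
# fire up the tidyverse
library(tidyverse)

# two lines of the form y = beta_0 + beta_1 * x; the coefficients are placeholders
tibble(x = seq(from = -3, to = 7, by = 0.1)) %>% 
  mutate(y_1 = -5 + 2 * x,
         y_2 = 10 + 2 * x) %>% 
  pivot_longer(-x, names_to = "line", values_to = "y") %>% 
  
  ggplot(aes(x = x, y = y, linetype = line)) +
  geom_line()
```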

Did you notice how we followed the form of the basic linear function when we defined the values for y_1 and y_2? It’s that simple! Here’s the right panel.
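A second pair of placeholder coefficients will do for a sketch.

```r
# same idea, different placeholder intercepts and slopes
tibble(x = seq(from = -3, to = 7, by = 0.1)) %>% 
  mutate(y_1 = 10 + 0.5 * x,
         y_2 = 10 + 2   * x) %>% 
  pivot_longer(-x, names_to = "line", values_to = "y") %>% 
  
  ggplot(aes(x = x, y = y, linetype = line)) +
  geom_line()
```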

Summary of why we care. The likelihood function includes the form of the dependency of \(y\) on \(x\). When \(y\) and \(x\) are metric variables, the simplest form of dependency, both mathematically and intuitively, is one that preserves proportionality. The mathematical expression of this relation is a so-called linear function. The usual mathematical expression of a line is the \(y\)-intercept form, but sometimes a more intuitive expression is the \(x\) threshold form. Linear functions form the core of the GLM. (pp. 424–425, emphasis in the original)

15.2.2 Additive combination of metric predictors.

The linear combination of \(K\) predictors has the general form

\[\begin{align*} y & = \beta_0 + \beta_1 x_1 + \dots + \beta_K x_K \\ & = \beta_0 + \sum_{k = 1}^K \beta_k x_k. \end{align*}\]

In the special case where \(K = 0\), you have an intercept-only model,

\[y = \beta_0,\]

in which \(\beta_0\) simply models the mean of \(y\).

We won’t be able to reproduce the wireframe plots of Figure 15.2 exactly. But we can use some geom_raster() tricks from back in Chapter 10 to express the third dimension, y, as a fill gradient.
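Here’s a sketch of that approach; the grid resolution and the \(\beta\) weights are placeholders.

```r
# additive combination of two metric predictors, shown as a fill gradient;
# the beta weights are placeholders
crossing(x_1 = seq(from = 0, to = 10, by = 0.5),
         x_2 = seq(from = 0, to = 10, by = 0.5)) %>% 
  # y = beta_0 + beta_1 * x_1 + beta_2 * x_2
  mutate(y = 0 + 1 * x_1 + 2 * x_2) %>% 
  
  ggplot(aes(x = x_1, y = x_2, fill = y)) +
  geom_raster() +
  scale_fill_viridis_c() +
  coord_equal()
```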

Here we’ve captured some of Kruschke’s grid aesthetic by keeping the by argument within seq() somewhat coarse and omitting the interpolate = T argument from geom_raster(). If you’d prefer smoother fill transitions for the y values, set by = .1 and interpolate = T.

15.2.3 Nonadditive interaction of metric predictors.

As it turns out, “the combined influence of two predictors does not have to be additive” (p. 427). Let’s explore what that can look like with our version of Figure 15.3.
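Here’s a sketch of the kind of code involved; the weights, including the one on the \(x_1 x_2\) product, are placeholders.

```r
# the same grid, now with a multiplicative interaction term;
# all beta weights are placeholders
crossing(x_1 = seq(from = 0, to = 10, by = 0.5),
         x_2 = seq(from = 0, to = 10, by = 0.5)) %>% 
  # y = beta_0 + beta_1 * x_1 + beta_2 * x_2 + beta_3 * x_1 * x_2
  mutate(y = 0 + 1 * x_1 + 2 * x_2 + 1.5 * x_1 * x_2) %>% 
  
  ggplot(aes(x = x_1, y = x_2, fill = y)) +
  geom_raster() +
  scale_fill_viridis_c() +
  coord_equal()
```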

Did you notice all those * operators in the mutate() code? We’re no longer just additive. We’re square in multiplicative town!

There is a subtlety in the use of the term “linear” that can sometimes cause confusion in this context. The interactions shown in Figure 15.3 are not linear on the two predictors \(x_1\) and \(x_2\). But if the product of the two predictors, \(x_1 x_2\), is thought of as a third predictor, then the model is linear on the three predictors, because the predicted value of \(y\) is a weighted additive combination of the three predictors. This reconceptualization can be useful for implementing nonlinear interactions in software for linear models, but we will not be making that semantic leap to a third predictor, and instead we will think of a nonadditive combination of two predictors.

A nonadditive interaction of predictors does not have to be multiplicative. Other types of interaction are possible. (p. 428, emphasis in the original)

15.2.4 Nominal predictors.

15.2.4.1 Linear model for a single nominal predictor.

Instead of representing the value of the nominal predictor by a single scalar value \(x\), we will represent the nominal predictor by a vector \(\vec{x} = \langle x_{[1]},...,x_{[J]} \rangle\) where \(J\) is the number of categories that the predictor has…

We will denote the baseline value of the prediction as \(\beta_0\). The deflection for the \(j\)th level of the predictor is denoted \(\beta_{[j]}\). Then the predicted value is

\[\begin{align*} y & = \beta_0 + \beta_{[1]} x_{[1]} + \dots + \beta_{[J]} x_{[J]} \\ & = \beta_0 + \vec{\beta} \cdot \vec{x} \end{align*}\]

where the notation \(\vec{\beta} \cdot \vec{x}\) is sometimes called the ‘dot product’ of the vectors. (p. 429)
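To make the dot-product notation concrete, here’s a toy calculation for a hypothetical four-level predictor. The baseline and deflection values are made up.

```r
# a toy dot-product calculation for a four-level nominal predictor;
# the baseline and deflections are made-up values
beta_0 <- 100                  # the baseline
beta   <- c(4, -2, -1, -1)     # deflections, which sum to zero
x      <- c(0, 1, 0, 0)        # indicator vector for an observation at level 2

beta_0 + sum(beta * x)         # the predicted value, beta_0 + beta . x
```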

15.2.4.2 Additive combination of nominal predictors.

Both panels in Figure 15.4 are going to require three separate data objects. Here’s the code for the top panel.
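What follows is a simplified sketch rather than the original three-object version; the baseline and deflection values are placeholders.

```r
# a single nominal predictor: baseline plus per-category deflections
# (all values are placeholders)
baseline <- 1.5                                        # beta_0

tibble(x    = 1:5,                                     # the categories
       beta = c(0.75, 0.25, -0.5, -0.25, -0.25)) %>%   # deflections, summing to zero
  mutate(y = baseline + beta) %>% 
  
  ggplot(aes(x = x, y = y)) +
  geom_hline(yintercept = baseline, linetype = 2) +    # the baseline
  geom_segment(aes(xend = x, yend = baseline)) +       # the deflections
  geom_point(size = 3) +
  scale_x_continuous(breaks = 1:5)
```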

Here’s the code for the bottom panel.
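As before, this is a simplified sketch with placeholder values rather than the original three-object version.

```r
# two nominal predictors combined additively (all values are placeholders)
crossing(x_1 = 1:2,
         x_2 = 1:3) %>% 
  mutate(beta_1 = c(1, -1)[x_1],            # deflections for the first predictor
         beta_2 = c(1, 0, -1)[x_2],         # deflections for the second predictor
         y      = 5 + beta_1 + beta_2) %>%  # baseline plus both deflections
  
  ggplot(aes(x = x_2, y = y)) +
  geom_hline(yintercept = 5, linetype = 2) +
  geom_point(size = 3) +
  scale_x_continuous(breaks = 1:3) +
  facet_wrap(~ x_1, labeller = label_both)
```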

Before we make our versions of the two panels in Figure 15.5, we should note that the values in the prose are at odds with the values implied in the figure. For simplicity, our versions of Figure 15.5 will match up with the values in the original figure, not the prose. For a look at what the figure might have looked like had it been based on the values in the prose, check out Kruschke’s Corrigenda.

Here’s the code for the left panel of Figure 15.5.
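Below is a simplified stand-in with placeholder factor levels and cell values. The noteworthy bit is the filter() move inside one of the geoms, which hands that layer just a subset of the data.

```r
# a simplified sketch of the left panel; levels and cell means are placeholders
crossing(x_1 = c("a", "b"),
         x_2 = c("u", "v", "w")) %>% 
  mutate(y = c(4, 6, 8, 1, 3, 5)) %>% 
  
  ggplot(aes(x = x_2, y = y, group = x_1)) +
  geom_line() +
  geom_point(size = 3) +
  # the filter() trick: this layer only gets the rows where x_2 == "w"
  geom_text(data = . %>% filter(x_2 == "w"),
            aes(label = str_c("x_1 = ", x_1)),
            hjust = 0, nudge_x = 0.1)
```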

Did you notice our filter() trick in the code?

15.3 Linking from combined predictors to noisy predicted data

15.3.1 From predictors to predicted central tendency.

After the predictor variables are combined, they need to be mapped to the predicted variable. This mathematical mapping is called the (inverse) link function, and is denoted by \(f()\) in the following equation:

\[y = f(\operatorname{lin}(x))\]

Until now, we have been assuming that the link function is merely the identity function, \(f(\operatorname{lin}(x)) = \operatorname{lin}(x)\). (p. 436, emphasis in the original)

Yet as we’ll see, many models use links other than the identity function.

15.3.1.1 The logistic function.

The logistic link function follows the form

\[y = \operatorname{logistic}(x) = \frac{1}{ \big (1 + \exp (-x) \big )}.\]

We can write the logistic function for a univariable metric predictor as

\[y = \operatorname{logistic}(x; \beta_0, \beta_1) = \frac{1}{\big (1 + \exp (-(\beta_0 + \beta_1 x)) \big )}.\]

And if we prefer to parameterize it in terms of gain \(\gamma\) and threshold \(\theta\), it’d be

\[y = \operatorname{logistic}(x; \gamma, \theta) = \frac{1}{ \big (1 + \exp (-\gamma (x - \theta)) \big )}.\]

We can make the sexy logistic curves of Figure 15.6 with stat_function(), into which we’ll plug our very own custom make_logistic() function. Here’s the left panel.
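Here’s a sketch of that approach, with make_logistic() written in the gain-and-threshold parameterization; the particular \(\gamma\) and \(\theta\) settings are placeholders.

```r
# the logistic in the gain (gamma) and threshold (theta) parameterization
make_logistic <- function(x, gamma = 1, theta = 0) {
  1 / (1 + exp(-gamma * (x - theta)))
}

# left panel sketch: vary the threshold, hold the gain constant
# (the specific settings are placeholders)
tibble(x = c(-11, 11)) %>% 
  ggplot(aes(x = x)) +
  stat_function(fun = make_logistic, args = list(gamma = 1, theta = -1)) +
  stat_function(fun = make_logistic, args = list(gamma = 1, theta = 3),
                linetype = 2) +
  ylab("y")
```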

For kicks, we’ll take a different approach for the right panel. Instead of pumping values through stat_function() within our plot code, we’ll use our make_logistic() function within mutate() before beginning the plot code.
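Something like this gets the idea across, again with placeholder \(\gamma\) and \(\theta\) values.

```r
# compute y with make_logistic() before plotting, varying the gain
crossing(gamma = c(0.5, 2),
         x     = seq(from = -11, to = 11, by = 0.1)) %>% 
  mutate(y = make_logistic(x, gamma = gamma, theta = 4)) %>% 
  
  ggplot(aes(x = x, y = y, linetype = factor(gamma))) +
  geom_line() +
  labs(linetype = "gamma")
```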

I don’t know that one plotting approach is better than the other. It’s good to have options.

To make our two-dimensional version of the wireframe plots of Figure 15.7, we’ll first want to define the logistic() function.
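Here’s one simple way to write it.

```r
# the logistic (inverse-logit) function
logistic <- function(x) {
  1 / (1 + exp(-x))
}
```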

Now we’ll just extend the same method we used for Figures 15.2 and 15.3.
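A sketch with placeholder \(\beta\) weights:

```r
# the raster method again, now passing a linear combination of x_1 and x_2
# through logistic(); the beta weights are placeholders
crossing(x_1 = seq(from = -3, to = 3, by = 0.2),
         x_2 = seq(from = -3, to = 3, by = 0.2)) %>% 
  mutate(y = logistic(0 + 1 * x_1 + 1 * x_2)) %>% 
  
  ggplot(aes(x = x_1, y = x_2, fill = y)) +
  geom_raster() +
  scale_fill_viridis_c(limits = c(0, 1)) +
  coord_equal()
```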

“The threshold determines the position of the logistical cliff. In other words, the threshold determines the \(x\) values for which \(y = 0.5\)… The gain determines the steepness of the logistical cliff” (pp. 437–439, emphasis in the original).

15.3.1.2 The cumulative normal function.

The cumulative normal is denoted \(\Phi (x; \mu, \sigma)\), where \(x\) is a real number and where \(\mu\) and \(\sigma\) are parameter values, called the mean and standard deviation of the normal distribution. The parameter \(\mu\) governs the point at which the cumulative normal, \(\Phi(x)\), equals 0.5. In other words, \(\mu\) plays the same role as the threshold in the logistic. The parameter \(\sigma\) governs the steepness of the cumulative normal function at \(x = \mu\), but inversely, such that a smaller value of \(\sigma\) corresponds to a steeper cumulative normal. (p. 440, emphasis in the original)

Here we plot the standard normal density in the top panel of Figure 15.8.
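Here’s a sketch of the kind of code involved.

```r
# the standard normal density, shading the region where x <= 1
tibble(x = seq(from = -3.5, to = 3.5, by = 0.01)) %>% 
  mutate(density = dnorm(x, mean = 0, sd = 1)) %>% 
  
  ggplot(aes(x = x)) +
  # the filter() trick: this layer only gets the rows where x <= 1
  geom_ribbon(data = . %>% filter(x <= 1),
              aes(ymin = 0, ymax = density),
              fill = "grey67") +
  geom_line(aes(y = density))
```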

Did you notice our data = . %>% filter(x <= 1) trick in geom_ribbon()? Anyway, here’s the analogous cumulative normal function depicted in the bottom panel of Figure 15.8.
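Here’s a sketch using pnorm().

```r
# the cumulative normal, Phi(x), via pnorm()
tibble(x = seq(from = -3.5, to = 3.5, by = 0.01)) %>% 
  mutate(p = pnorm(x, mean = 0, sd = 1)) %>% 
  
  ggplot(aes(x = x, y = p)) +
  geom_line() +
  ylab(expression(Phi(x)))
```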

Terminology: The inverse of the cumulative normal is called the probit function. (“Probit” stands for “probability unit”; Bliss, 1934). The probit function maps a value \(p\), for \(0.0 \leq p \leq 1.0\), onto the infinite real line, and a graph of the probit function looks very much like the logit function. (p. 439, emphasis in the original)

15.3.2 From predicted central tendency to noisy data.

In the real world, there is always variation in \(y\) that we cannot predict from \(x\). This unpredictable “noise” in \(y\) might be deterministically caused by sundry factors we have neither measured nor controlled, or the noise might be caused by inherent non-determinism in \(y\). It does not matter either way because in practice the best we can do is predict the probability that \(y\) will have any particular value, dependent upon \(x\).

To make this notion of probabilistic tendency precise, we need to specify a probability distribution for \(y\) that depends on \(f (\operatorname{lin} (x))\). To keep the notation tractable, first define \(\mu = f (\operatorname{lin} (x))\). The value \(\mu\) represents the central tendency of the predicted \(y\) values, which might or might not be the mean. With this notation, we then denote the probability distribution of \(y\) as some to-be-specified probability density function, abbreviated as “pdf”:

\[y \sim \operatorname{pdf} \big ( \mu, [\text{scale, shape, etc.}] \big )\]

As indicated by the bracketed terms after \(\mu\), the pdf might have various additional parameters that control the distribution’s scale (i.e., standard deviation), shape, etc.

The form of the pdf depends on the measurement scale of the predicted variable. (pp. 440–441, emphasis in the original)

The top panel of Figure 15.9 is tricky. One way to make those multiple densities tipped on their sides is with ggridges::geom_ridgeline() followed by coord_flip(). However, we’ll explore a different approach that’ll come in handy in many of the plots we’ll be making in later chapters. The method requires that we make a data set with the necessary coordinates for the side-tipped densities. Let’s walk through that step slowly.
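Here’s a sketch of how one might build that data set: choose a few \(x\) values at which to center the densities, compute the conditional mean and plotting limits for each, lay out a sequence of \(y\) values between those limits, compute their normal densities, and then offset the \(x\) coordinate by the rescaled density so each curve is tipped on its side. The intercept, slope, \(\sigma\), and scaling constant below are assumptions, so the numbers won’t line up exactly with the output that follows.

```r
# placeholder parameter values
beta_0 <- 10
beta_1 <- 2
sigma  <- 2

curves <-
  # a few x values at which to draw the densities
  tibble(x = c(-7.5, -2.5, 2.5, 7.5)) %>% 
  # the conditional mean and plotting limits for each
  mutate(y_mean = beta_0 + beta_1 * x,
         ll     = y_mean - 3 * sigma,
         ul     = y_mean + 3 * sigma) %>% 
  # a sequence of y values between those limits
  mutate(y = map2(ll, ul, seq, length.out = 100)) %>% 
  unnest(y) %>% 
  # the normal density of each y, tipped on its side by subtracting the
  # rescaled density from x
  mutate(density = dnorm(y, mean = y_mean, sd = sigma),
         x       = x - density * 10)

curves %>% str()
```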

## Classes 'tbl_df', 'tbl' and 'data.frame':    400 obs. of  6 variables:
##  $ x      : num  -7.54 -7.55 -7.55 -7.56 -7.57 ...
##  $ y_mean : num  -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 ...
##  $ ll     : num  -10.6 -10.6 -10.6 -10.6 -10.6 ...
##  $ ul     : num  0.614 0.614 0.614 0.614 0.614 ...
##  $ y      : num  -10.6 -10.5 -10.4 -10.3 -10.2 ...
##  $ density: num  0.00388 0.00454 0.0053 0.00617 0.00715 ...

In case it’s not clear, we’ll be plotting with the x and y columns. Think of the other columns as showing our work. But now that we’ve got the curves data, we’re ready to simulate the points and plot.
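Here’s a sketch of that last step, still leaning on the placeholder parameters from above.

```r
# simulate noisy data around the same line and overlay the side-tipped densities
set.seed(15)

tibble(x = runif(n = 50, min = -10, max = 10)) %>% 
  mutate(y = rnorm(n = n(), mean = beta_0 + beta_1 * x, sd = sigma)) %>% 
  
  ggplot(aes(x = x, y = y)) +
  geom_abline(intercept = beta_0, slope = beta_1) +
  geom_path(data = curves, aes(group = y_mean)) +
  geom_point(alpha = 1/2)
```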

We’ll revisit this method in Chapter 17. The wireframe plots at the bottom of Figure 15.9 and in Figure 15.10 are outside of our ggplot2 purview.

15.4 Formal expression of the GLM

We can write the GLM as

\[\begin{align*} y & \sim \operatorname{pdf} \big (\mu, [\text{parameters}] \big ), \text{where} \\ \mu & = f \big (\operatorname{lin}(x), [\text{parameters}] \big ). \end{align*}\]

As has been previously explained, the predictors \(x\) are combined in the linear function \(\operatorname{lin}(x)\), and the function \(f\) in [the first equation] is called the inverse link function. The data, \(y\), are distributed around the central tendency \(\mu\) according to the probability density function labeled “pdf.” (p. 444)

15.4.1 Cases of the GLM.

When a client brings an application to a [statistical] consultant, one of the first things the consultant does is find out from the client which data are supposed to be predictors and which data are supposed to be predicted, and the measurement scales of the data… When you are considering how to analyze data, your first task is to be your own consultant and find out which data are predictors, which are predicted, and what measurement scales they are. (pp. 445–446)

Soak this last bit in. It’s gold. I’ve found it to be true in my exchanges with other researchers and with my own data problems, too.

Session info

## R version 3.6.0 (2019-04-26)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS High Sierra 10.13.6
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] forcats_0.4.0   stringr_1.4.0   dplyr_0.8.3     purrr_0.3.3    
## [5] readr_1.3.1     tidyr_1.0.0     tibble_2.1.3    ggplot2_3.2.1  
## [9] tidyverse_1.2.1
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.2        cellranger_1.1.0  pillar_1.4.2     
##  [4] compiler_3.6.0    tools_3.6.0       zeallot_0.1.0    
##  [7] digest_0.6.21     viridisLite_0.3.0 lubridate_1.7.4  
## [10] jsonlite_1.6      evaluate_0.14     lifecycle_0.1.0  
## [13] nlme_3.1-139      gtable_0.3.0      lattice_0.20-38  
## [16] pkgconfig_2.0.3   rlang_0.4.1       cli_1.1.0        
## [19] rstudioapi_0.10   yaml_2.2.0        haven_2.1.0      
## [22] xfun_0.10         withr_2.1.2       xml2_1.2.0       
## [25] httr_1.4.0        knitr_1.23        hms_0.4.2        
## [28] generics_0.0.2    vctrs_0.2.0       grid_3.6.0       
## [31] tidyselect_0.2.5  glue_1.3.1.9000   R6_2.4.0         
## [34] readxl_1.3.1      rmarkdown_1.13    modelr_0.1.4     
## [37] magrittr_1.5      ellipsis_0.3.0    backports_1.1.5  
## [40] scales_1.0.0      htmltools_0.4.0   rvest_0.3.4      
## [43] assertthat_0.2.1  colorspace_1.4-1  labeling_0.3     
## [46] stringi_1.4.3     lazyeval_0.2.2    munsell_0.5.0    
## [49] broom_0.5.2       crayon_1.3.4