Examples and applications
Case study: The Challenger disaster
The Challenger disaster occurred on January 28, 1986, when the NASA Space Shuttle orbiter Challenger broke apart 73 seconds into its flight, leading to the deaths of its seven crew members. The accident deeply shocked US society, in part due to the attention the mission had received because of the presence of Christa McAuliffe, who would have been the first teacher in space. Because of this, NASA TV broadcast the launch live to US public schools, and as a result millions of schoolchildren witnessed the accident. The accident seriously damaged NASA's credibility and resulted in a 32-month interruption of the shuttle program. The Presidential Rogers Commission (whose members included astronaut Neil A. Armstrong and Nobel laureate Richard P. Feynman) was created to investigate the disaster.
The Rogers Commission produced a report (Presidential Commission on the Space Shuttle Challenger Accident 1986) with all the findings. The commission determined that the disintegration began with the failure of an O-ring seal in the solid rocket motor, caused by the unusually cold temperature (−0.6 Celsius degrees) at launch. This failure let burning gas breach the solid rocket motor, which compromised the whole shuttle structure and led to its disintegration under extreme aerodynamic forces. The problem with the O-rings was known: the night before the launch, there was a three-hour teleconference between motor engineers and NASA management, discussing the effect of the low temperature forecast for the launch on the O-ring performance. The conclusion, influenced by Figure 4.2a, was:
“Temperature data [are] not conclusive on predicting primary O-ring blowby.”
The Rogers Commission noted a major flaw in Figure 4.2a: the flights with zero incidents were excluded from the plot because it was felt that these flights did not contribute any information about the temperature effect (Figure 4.2b). The Rogers Commission concluded:
“A careful analysis of the flight history of O-ring performance would have revealed the correlation of O-ring damage in low temperature”.
The purpose of this case study, inspired by Dalal, Fowlkes, and Hoadley (1989), is to quantify the influence of temperature on the probability of having at least one incident related to the O-rings. Specifically, we want to address the following questions:
- Q1. Is the temperature associated with O-ring incidents?
- Q2. How was the temperature affecting the probability of O-ring incidents?
- Q3. What was the predicted probability of an O-ring incident at the temperature of the launch day?
To try to answer these questions we have the challenger dataset (download). The dataset contains (as shown in Table 4.1) information on the state of the solid rocket boosters after launch for 23 flights. Each row has, among others, the following variables:

- fail.field, fail.nozzle: binary variables indicating whether there was an incident with the O-rings in the field joints or in the nozzles of the solid rocket boosters. 1 codifies an incident and 0 its absence. In the analysis, we focus on the O-rings of the field joints, as they were the most determinant for the accident.
- temp: temperature on the day of launch, measured in Celsius degrees.
- pres.field, pres.nozzle: leak-check pressure tests of the O-rings. These tests assured that the rings would seal the joint.
Table 4.1: The challenger dataset.

| Flight | Date | fail.field | fail.nozzle | temp |
|--------|----------|------------|-------------|------|
| 1 | 12/04/81 | 0 | 0 | 18.9 |
| 2 | 12/11/81 | 1 | 0 | 21.1 |
| 3 | 22/03/82 | 0 | 0 | 20.6 |
| 5 | 11/11/82 | 0 | 0 | 20.0 |
| 6 | 04/04/83 | 0 | 1 | 19.4 |
| 7 | 18/06/83 | 0 | 0 | 22.2 |
| 8 | 30/08/83 | 0 | 0 | 22.8 |
| 9 | 28/11/83 | 0 | 0 | 21.1 |
| 41-B | 03/02/84 | 1 | 1 | 13.9 |
| 41-C | 06/04/84 | 1 | 1 | 17.2 |
| 41-D | 30/08/84 | 1 | 1 | 21.1 |
| 41-G | 05/10/84 | 0 | 0 | 25.6 |
| 51-A | 08/11/84 | 0 | 0 | 19.4 |
| 51-C | 24/01/85 | 1 | 1 | 11.7 |
| 51-D | 12/04/85 | 0 | 1 | 19.4 |
| 51-B | 29/04/85 | 0 | 1 | 23.9 |
| 51-G | 17/06/85 | 0 | 1 | 21.1 |
| 51-F | 29/07/85 | 0 | 0 | 27.2 |
| 51-I | 27/08/85 | 0 | 0 | 24.4 |
| 51-J | 03/10/85 | 0 | 0 | 26.1 |
| 61-A | 30/10/85 | 1 | 0 | 23.9 |
| 61-B | 26/11/85 | 0 | 1 | 24.4 |
| 61-C | 12/01/86 | 1 | 1 | 14.4 |
Let’s begin the analysis by replicating Figures 4.2a and 4.2b and checking that linear regression is not the right tool for answering Q1–Q3. For that, we make two scatterplots of nfails.field (the total number of incidents in the field joints) versus temp: the first one excluding the launches without incidents (subset = nfails.field > 0) and the second one for all the data. Doing it through R Commander as we saw in Chapter 2, you should get something similar to:
scatterplot(nfails.field ~ temp, reg.line = lm, smooth = FALSE, spread = FALSE, boxplots = FALSE, data = challenger, subset = nfails.field > 0)
scatterplot(nfails.field ~ temp, reg.line = lm, smooth = FALSE, spread = FALSE, boxplots = FALSE, data = challenger)
There is a fundamental problem in using linear regression for these data: the response is not continuous. As a consequence, there is no linearity and the errors around the mean are not normal (indeed, they are strongly non-normal). Let’s check this with the corresponding diagnostic plots:
mod <- lm(nfails.field ~ temp, data = challenger)
par(mfrow = 1:2)
plot(mod, 1)
plot(mod, 2)
Although linear regression is not the adequate tool for these data, it is able to detect the obvious difference between the two plots:

- The trend for launches with incidents is flat, hence suggesting no dependence on the temperature (Figure 4.2a). This was one of the arguments behind NASA's decision to launch the rocket at a temperature of −0.6 degrees.
- However, the trend for all launches indicates a clear negative dependence between temperature and number of incidents (Figure 4.2b)! Think about it in this way: the minimum temperature ever recorded for a launch without incidents was above 18 degrees, and the Challenger was launched at −0.6 degrees without clearly knowing the effects of such low temperatures.
Instead of trying to predict the number of incidents, we will concentrate on modelling the probability of having at least one incident given the temperature, a simpler but also revealing approach. In other words, we seek to estimate the following curve:
\[p(x)=\mathbb{P}(\text{incident}=1|\text{temperature}=x)\] from
fail.field
and
temp
. This probability cannot be properly modeled as a linear function like \(\beta_0+\beta_1x\), since it will inevitably fall outside \([0,1]\) for some value of \(x\) (yielding negative probabilities or probabilities larger than one). The technique that solves this problem is the
logistic regression. The idea behind it is quite simple: transform a linear model
\(\beta_0+\beta_1x\) – which is aimed for a response in
\(\mathbb{R}\) – so that it yields a value in
\([0,1]\). This is achieved by the
logistic function \[\begin{align}\text{logistic}(t)=\frac{e^t}{1+e^t}=\frac{1}{1+e^{-t}}.\tag{4.1}\end{align}\]

The logistic model in this case is \[\mathbb{P}(\text{incident}=1|\text{temperature}=x)=\text{logistic}\left(\beta_0+\beta_1x\right)=\frac{1}{1+e^{-(\beta_0+\beta_1x)}},\] with \(\beta_0\) and \(\beta_1\) unknown. Let’s fit the model to the data by estimating \(\beta_0\) and \(\beta_1\).
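As a quick sanity check of (4.1), we can code the logistic function ourselves (the name `logistic` is our own choice, not an R built-in) and verify that it always returns values in \((0,1)\):

```r
# The logistic function (4.1): maps any real t into (0, 1)
logistic <- function(t) 1 / (1 + exp(-t))

logistic(0)           # 0.5: t = 0 gives equal probabilities for Y = 1 and Y = 0
logistic(c(-10, 10))  # very close to 0 and 1, but never outside (0, 1)
# Both expressions in (4.1) agree
all.equal(logistic(2), exp(2) / (1 + exp(2)))  # TRUE
```

Note how the sign of \(t\) decides on which side of \(1/2\) the probability falls, which is the key to interpreting the linear predictor later on.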
In order to fit a logistic regression to the data, go to 'Statistics' -> 'Fit models' -> 'Generalized linear model...'
. A window like the one in Figure 4.3 will pop up, which you should fill in as indicated.
A code like this will be generated:
nasa <- glm(fail.field ~ temp, family = "binomial", data = challenger)
summary(nasa)
## 
## Call:
## glm(formula = fail.field ~ temp, family = "binomial", data = challenger)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.0566  -0.7575  -0.3818   0.4571   2.2195  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept)   7.5837     3.9146   1.937   0.0527 .
## temp         -0.4166     0.1940  -2.147   0.0318 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 28.267  on 22  degrees of freedom
## Residual deviance: 20.335  on 21  degrees of freedom
## AIC: 24.335
## 
## Number of Fisher Scoring iterations: 5
exp(coef(nasa)) # Exponentiated coefficients ("odds ratios")
## (Intercept)        temp 
## 1965.9743592    0.6592539
The summary of the logistic model is notably different from that of the linear regression, as the underlying methodology is quite different. Nevertheless, we have tests for the significance of each coefficient. Here we obtain that temp is significantly different from zero, at least at level \(\alpha=0.05\). Therefore we can conclude that the temperature is indeed affecting the probability of an incident with the O-rings (this answers Q1).
The precise interpretation of the coefficients will be given in the next section. For now, the coefficient of temp, \(\hat\beta_1\), can be regarded as the “correlation between the temperature and the probability of having at least one incident”. This correlation, as evidenced by the sign of \(\hat\beta_1\), is negative. Let’s plot the fitted logistic curve to see that indeed the probability of incident and temperature are negatively correlated:
# Plot data
plot(challenger$temp, challenger$fail.field, xlim = c(-1, 30),
     xlab = "Temperature", ylab = "Incident probability")
# Draw the fitted logistic curve
x <- seq(-1, 30, l = 200)
y <- exp(-(nasa$coefficients[1] + nasa$coefficients[2] * x))
y <- 1 / (1 + y)
lines(x, y, col = 2, lwd = 2)
# The Challenger
points(-0.6, 1, pch = 16)
text(-0.6, 1, labels = "Challenger", pos = 4)
In view of this curve and the summary of the model, we can conclude that lower temperatures increased the probability of an O-ring incident (Q2). Indeed, the confidence intervals for the coefficients show a significant negative correlation at level \(\alpha=0.05\):
confint(nasa, level = 0.95)
##                  2.5 %     97.5 %
## (Intercept)  1.3364047 17.7834329
## temp        -0.9237721 -0.1089953
Finally, the probability of having at least one incident with the O-rings in the launch day was \(0.9996\) according to the fitted logistic model (Q3). This is easily obtained:
predict(nasa, newdata = data.frame(temp = -0.6), type = "response")
##        1 
## 0.999604
Be aware that type = "response" has a different meaning in logistic regression than in linear regression. As you can see, it does not return a CI for the prediction as in linear models. Instead, type = "response" means that the probability should be returned, instead of the value of the link function, which is returned with type = "link" (the default).
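The link/response distinction can be checked by hand. The following sketch refits the model from the fail.field and temp values transcribed from Table 4.1, so it is self-contained; the names chal and nasa2 are local choices to avoid overwriting the loaded dataset, and because the table shows temperatures rounded to one decimal, the estimates may differ marginally from the summary above:

```r
# fail.field and temp transcribed from Table 4.1 (one row per flight, in order)
chal <- data.frame(
  fail.field = c(0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1),
  temp = c(18.9, 21.1, 20.6, 20.0, 19.4, 22.2, 22.8, 21.1, 13.9, 17.2, 21.1, 25.6,
           19.4, 11.7, 19.4, 23.9, 21.1, 27.2, 24.4, 26.1, 23.9, 24.4, 14.4)
)
nasa2 <- glm(fail.field ~ temp, family = "binomial", data = chal)

# "link" returns the linear predictor; "response" its logistic transform
new <- data.frame(temp = -0.6)
link <- predict(nasa2, newdata = new, type = "link")
prob <- predict(nasa2, newdata = new, type = "response")
all.equal(unname(prob), unname(1 / (1 + exp(-link))))  # TRUE
prob  # ~0.9996, the probability for the launch day
```

The identity response = logistic(link) holds exactly, since the response is just the link passed through the inverse of the logit.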
Recall that there is a serious problem of extrapolation in this prediction, which makes it less precise (or more variable). But this extrapolation, together with the evidence raised by a simple analysis like the one we did, should have been a strong argument for postponing the launch.
To conclude this section, we refer to a funny and comprehensive exposition by Juan Cuesta (University of Cantabria) on the flawed statistical analysis that contributed to the Challenger disaster.
Model formulation and estimation by maximum likelihood
As seen in Section 3.2, the multiple linear model described the relation between the random variables \(X_1,\ldots,X_k\) and \(Y\) by assuming the linear relation
\[\begin{align*}Y = \beta_0 + \beta_1 X_1 + \ldots + \beta_k X_k + \varepsilon.\end{align*}\]Since we assume
\(\mathbb{E}[\varepsilon|X_1=x_1,\ldots,X_k=x_k]=0\), the previous equation was equivalent to
\[\begin{align}\mathbb{E}[Y|X_1=x_1,\ldots,X_k=x_k]=\beta_0+\beta_1x_1+\ldots+\beta_kx_k,\tag{4.2}\end{align}\]where \(\mathbb{E}[Y|X_1=x_1,\ldots,X_k=x_k]\) is the mean of \(Y\) for a particular value of the set of predictors. As remarked in Section 3.3, a necessary condition for the normality of the errors, and hence for the linear model assumptions, was that \(Y\) be continuous. In other words, the linear model is designed for a continuous response.
The situation when \(Y\) is discrete (naturally ordered values; e.g., number of fails, number of students) or categorical (non-ordered categories; e.g., territorial divisions, ethnic groups) requires a special treatment. The simplest situation is when \(Y\) is binary (or dichotomous): it can only take two values, codified for convenience as \(1\) (success) and \(0\) (failure). For example, in the Challenger case study we used fail.field as an indicator of whether “there was at least one incident with the O-rings” (1 = yes, 0 = no). For binary variables there is no fundamental distinction between the treatment of discrete and categorical variables.
More formally, a binary variable is known as a
Bernoulli variable, which is the simplest non-trivial random variable. We say that
\(Y\sim\mathrm{Ber}(p)\),
\(0\leq p\leq1\), if
\[Y=\left\{\begin{array}{ll}1,&\text{with probability }p,\\0,&\text{with probability }1-p,\end{array}\right.\] or, equivalently, if
\(\mathbb{P}[Y=1]=p\) and
\(\mathbb{P}[Y=0]=1-p\), which can be written compactly as
\[\begin{align}\mathbb{P}[Y=y]=p^y(1-p)^{1-y},\quad y=0,1.\tag{4.3}\end{align}\]Recall that a binomial variable with size \(n\) and probability \(p\), \(\mathrm{Bi}(n,p)\), is obtained by summing \(n\) independent \(\mathrm{Ber}(p)\) (so \(\mathrm{Ber}(p)\) is the same as \(\mathrm{Bi}(1,p)\)). This is why we need to use a family = "binomial"
in glm
, to indicate that the response is binomial.
A Bernoulli variable \(Y\) is completely determined by the probability \(p\), and so are its mean and variance:

- \(\mathbb{E}[Y]=p\times 1+(1-p)\times 0=p\)
- \(\mathbb{V}\mathrm{ar}[Y]=p(1-p)\)

In particular, recall that \(\mathbb{P}[Y=1]=\mathbb{E}[Y]=p\).

This is something relatively uncommon (compare with a \(\mathcal{N}(\mu,\sigma^2)\), where \(\mu\) determines the mean and \(\sigma^2\) the variance) that has an important consequence for the logistic model: we do not need a \(\sigma^2\).
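A quick simulation (sample size and seed are arbitrary choices) illustrates both identities:

```r
# Empirical check of E[Y] = p and Var[Y] = p(1 - p) for Y ~ Ber(p)
set.seed(1)
p <- 0.3
y <- rbinom(1e5, size = 1, prob = p)  # Ber(p) = Bi(1, p)
mean(y)  # close to p = 0.3
var(y)   # close to p * (1 - p) = 0.21
```

Note that there is no free variance parameter to tune: once \(p\) is fixed, the variance is fixed too.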
Are these Bernoulli variables? If so, what is the value of \(p\), and what could the codes \(0\) and \(1\) represent?

- The toss of a fair coin.
- A variable with mean \(p\) and variance \(p(1-p)\).
- The roll of a die.
- A binary variable with mean \(0.5\) and variance \(0.45\).
- The winner of an election with two candidates.
Assume then that \(Y\) is a binary/Bernoulli variable and that \(X_1,\ldots,X_k\) are predictors associated to \(Y\) (no particular assumptions on them). The purpose of logistic regression is to estimate \[p(x_1,\ldots,x_k)=\mathbb{P}[Y=1|X_1=x_1,\ldots,X_k=x_k]=\mathbb{E}[Y|X_1=x_1,\ldots,X_k=x_k],\] that is, how the probability of \(Y=1\) changes according to particular values, denoted by \(x_1,\ldots,x_k\), of the random variables \(X_1,\ldots,X_k\). Here \(p(x_1,\ldots,x_k)=\mathbb{P}[Y=1|X_1=x_1,\ldots,X_k=x_k]\) stands for the conditional probability of \(Y=1\) given \(X_1,\ldots,X_k\). In view of (4.2), a tempting possibility is to consider the model \[p(x_1,\ldots,x_k)=\beta_0+\beta_1x_1+\ldots+\beta_kx_k.\] However, such a model will inevitably run into serious problems: negative probabilities and probabilities larger than one. The solution is to consider a function that takes the value \(z=\beta_0+\beta_1x_1+\ldots+\beta_kx_k\), in \(\mathbb{R}\), and maps it back to \([0,1]\). There are several alternatives to do so, based on distribution functions \(F:\mathbb{R}\longrightarrow[0,1]\) that deliver \(y=F(z)\in[0,1]\) (see Figure 4.5). Different choices of \(F\) give rise to different models:
- Uniform. Truncate \(z\) to \(0\) and \(1\) when \(z<0\) and \(z>1\), respectively.
- Logit. Consider the logistic distribution function: \[F(z)=\mathrm{logistic}(z)=\frac{e^z}{1+e^z}=\frac{1}{1+e^{-z}}.\]
- Probit. Consider the normal distribution function, this is, \(F=\Phi\).
The
logistic transformation is the most employed due to its
tractability, interpretability and smoothness. Its inverse,
\(F^{-1}:[0,1]\longrightarrow\mathbb{R}\), known as the
logit function, is
\[\mathrm{logit}(p)=\mathrm{logistic}^{-1}(p)=\log\frac{p}{1-p}.\] This is a link function, that is, a function that maps a given space (in this case \([0,1]\)) into \(\mathbb{R}\). The term link function is employed in generalized linear models, which follow exactly the same philosophy as logistic regression – mapping the domain of \(Y\) to \(\mathbb{R}\) in order to apply a linear model there. We will concentrate here exclusively on the logit as a link function. Therefore, the
logistic model is
\[\begin{align}p(x_1,\ldots,x_k)=\mathrm{logistic}(\beta_0+\beta_1x_1+\ldots+\beta_kx_k)=\frac{1}{1+e^{-(\beta_0+\beta_1x_1+\ldots+\beta_kx_k)}}.\tag{4.4}\end{align}\]The linear form inside the exponent has a clear interpretation:
- If \(\beta_0+\beta_1x_1+\ldots+\beta_kx_k=0\), then \(p(x_1,\ldots,x_k)=\frac{1}{2}\) (\(Y=1\) and \(Y=0\) are equally likely).
- If \(\beta_0+\beta_1x_1+\ldots+\beta_kx_k<0\), then \(p(x_1,\ldots,x_k)<\frac{1}{2}\) (\(Y=1\) less likely).
- If \(\beta_0+\beta_1x_1+\ldots+\beta_kx_k>0\), then \(p(x_1,\ldots,x_k)>\frac{1}{2}\) (\(Y=1\) more likely).
To be more precise on the interpretation of the coefficients
\(\beta_0,\ldots,\beta_k\) we need to introduce the
odds. The
odds is an equivalent way of expressing the distribution of probabilities in a binary variable. Since
\(\mathbb{P}[Y=1]=p\) and
\(\mathbb{P}[Y=0]=1-p\), both the success and failure probabilities can be inferred from
\(p\). Instead of using
\(p\) to characterize the distribution of
\(Y\), we can use
\[\begin{align}\mathrm{odds}(Y)=\frac{p}{1-p}=\frac{\mathbb{P}[Y=1]}{\mathbb{P}[Y=0]}.\tag{4.5}\end{align}\]The odds is the ratio between the probability of success and the probability of failure. It is extensively used in betting due to its better interpretability. For example, if a horse \(Y\) has a probability \(p=2/3\) of winning a race (\(Y=1\)), then the odds of the horse is \[\text{odds}=\frac{p}{1-p}=\frac{2/3}{1/3}=2.\] This means that the probability of the horse winning is twice as large as the probability of it losing. This is sometimes written as \(2:1\) or \(2 \times 1\) (spelled “two-to-one”). Conversely, if the odds of \(Y\) is given, we can easily compute the probability of success \(p\) by inverting (4.5): \[p=\mathbb{P}[Y=1]=\frac{\text{odds}(Y)}{1+\text{odds}(Y)}.\] For example, if the odds of the horse were \(5\), that would correspond to a probability of winning \(p=5/6\).
Recall that the odds is a number in \([0,+\infty]\). The values \(0\) and \(+\infty\) are attained for \(p=0\) and \(p=1\), respectively. The log-odds (or logit) is a number in \([-\infty,+\infty]\).
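The probability-odds conversions are one-liners; a sketch reproducing the horse examples (the helper names are our own):

```r
# Probability -> odds, as in (4.5), and back
odds <- function(p) p / (1 - p)
p_from_odds <- function(o) o / (1 + o)

odds(2 / 3)      # 2: winning is twice as likely as losing
p_from_odds(5)   # 0.8333..., the p = 5/6 of the second horse example
# The two maps are mutual inverses
all.equal(p_from_odds(odds(0.42)), 0.42)  # TRUE
```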
We can rewrite
(4.4) in terms of the odds
(4.5). If we do so, we have:
\[\begin{align}\mathrm{odds}(Y|X_1=x_1,\ldots,X_k=x_k)&=\frac{p(x_1,\ldots,x_k)}{1-p(x_1,\ldots,x_k)}\nonumber\\&=e^{\beta_0+\beta_1x_1+\ldots+\beta_kx_k}\nonumber\\&=e^{\beta_0}e^{\beta_1x_1}\ldots e^{\beta_kx_k}.\tag{4.6}\end{align}\]or, taking logarithms, the
log-odds (or logit)
\[\begin{align}\log(\mathrm{odds}(Y|X_1=x_1,\ldots,X_k=x_k))=\beta_0+\beta_1x_1+\ldots+\beta_kx_k.\tag{4.7}\end{align}\]The conditional log-odds (4.7) plays here the role of the conditional mean for multiple linear regression. Therefore, we have an analogous interpretation for the coefficients:
- \(\beta_0\): is the log-odds when \(X_1=\ldots=X_k=0\).
- \(\beta_j\), \(1\leq j\leq k\): is the additive increment of the log-odds for an increment of one unit in \(X_j=x_j\), provided that the remaining variables \(X_1,\ldots,X_{j-1},X_{j+1},\ldots,X_k\) do not change.
The log-odds is not as easy to interpret as the odds. For that reason, an equivalent way of interpreting the coefficients, this time based on (4.6), is:
- \(e^{\beta_0}\): is the odds when \(X_1=\ldots=X_k=0\).
- \(e^{\beta_j}\), \(1\leq j\leq k\): is the multiplicative increment of the odds for an increment of one unit in \(X_j=x_j\), provided that the remaining variables \(X_1,\ldots,X_{j-1},X_{j+1},\ldots,X_k\) do not change. If the increment in \(X_j\) is of \(r\) units, then the multiplicative increment in the odds is \((e^{\beta_j})^r\).
As a consequence of this last interpretation, we have:
If
\(\beta_j>0\) (respectively,
\(\beta_j<0\)) then
\(e^{\beta_j}>1\) (
\(e^{\beta_j}<1\)) in
(4.6). Therefore, an increment of one unit in
\(X_j\), provided that the remaining variables
\(X_1,\ldots,X_{j-1},X_{j+1},\ldots,X_k\) do not change, results in an increment (decrement) of the odds, that is, in an increment (decrement) of
\(\mathbb{P}[Y=1]\).
Since the relationship between \(p(X_1,\ldots,X_k)\) and \(X_1,\ldots,X_k\) is not linear, \(\beta_j\) does not correspond to the change in \(p(X_1,\ldots,X_k)\) associated with a one-unit increase in \(X_j\).
Let’s quickly visualize these concepts with the output of the Challenger case study:
nasa <- glm(fail.field ~ temp, family = "binomial", data = challenger)
summary(nasa)
## 
## Call:
## glm(formula = fail.field ~ temp, family = "binomial", data = challenger)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.0566  -0.7575  -0.3818   0.4571   2.2195  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept)   7.5837     3.9146   1.937   0.0527 .
## temp         -0.4166     0.1940  -2.147   0.0318 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 28.267  on 22  degrees of freedom
## Residual deviance: 20.335  on 21  degrees of freedom
## AIC: 24.335
## 
## Number of Fisher Scoring iterations: 5
exp(coef(nasa)) # Exponentiated coefficients ("odds ratios")
## (Intercept)        temp 
## 1965.9743592    0.6592539
# Plot data
plot(challenger$temp, challenger$fail.field, xlim = c(-1, 30),
     xlab = "Temperature", ylab = "Incident probability")
# Draw the fitted logistic curve
x <- seq(-1, 30, l = 200)
y <- exp(-(nasa$coefficients[1] + nasa$coefficients[2] * x))
y <- 1 / (1 + y)
lines(x, y, col = 2, lwd = 2)
# The Challenger
points(-0.6, 1, pch = 16)
text(-0.6, 1, labels = "Challenger", pos = 4)
The exponentials of the estimated coefficients are:
- \(e^{\hat\beta_0}=1965.974\). This means that, when the temperature is zero, the fitted odds is \(1965.974\), so the probability of having an incident (\(Y=1\)) is \(1965.974\) times larger than the probability of not having an incident (\(Y=0\)). In other words, the probability of having an incident at temperature zero is \(\frac{1965.974}{1965.974+1}=0.999\).
- \(e^{\hat\beta_1}=0.659\). This means that each Celsius degree increment in the temperature multiplies the fitted odds by a factor of \(0.659\approx\frac{2}{3}\), hence reducing it.
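These two interpretations can be verified numerically. A sketch using only the rounded estimates \(\hat\beta_0=7.5837\) and \(\hat\beta_1=-0.4166\) reported in the summary above:

```r
beta0 <- 7.5837; beta1 <- -0.4166   # estimates from the summary above
p <- function(x) 1 / (1 + exp(-(beta0 + beta1 * x)))
o <- function(x) p(x) / (1 - p(x))  # fitted odds at temperature x

exp(beta0)                     # fitted odds at temperature zero, ~1966
exp(beta0) / (1 + exp(beta0))  # probability of incident at temperature zero, ~0.999
# One extra degree multiplies the odds by exp(beta1) ~ 0.659, at any temperature
all.equal(o(21) / o(20), exp(beta1))  # TRUE
(exp(beta1))^5  # a 5-degree increase multiplies the odds by ~0.125
```

The last line illustrates the rule for \(r\)-unit increments: the odds gets multiplied by \((e^{\hat\beta_1})^r\).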
The estimation of
\(\boldsymbol{\beta}=(\beta_0,\beta_1,\ldots,\beta_k)\) from a sample
\((\mathbf{X}_{1},Y_1),\ldots,(\mathbf{X}_{n},Y_n)\) is different from that in linear regression. It is not based on minimizing the RSS but on the principle of
Maximum Likelihood Estimation (MLE). MLE is based on the following
leitmotiv:
what are the coefficients \(\boldsymbol{\beta}\) that make the sample more likely? Or in other words,
what coefficients make the observed sample most probable under the model? Since
\(Y_i\sim \mathrm{Ber}(p(\mathbf{X}_i))\),
\(i=1,\ldots,n\), the likelihood of
\(\boldsymbol{\beta}\) is
\[\begin{align}\text{lik}(\boldsymbol{\beta})=\prod_{i=1}^np(\mathbf{X}_i)^{Y_i}(1-p(\mathbf{X}_i))^{1-Y_i}.\tag{4.8}\end{align}\]\(\text{lik}(\boldsymbol{\beta})\) is the probability of the data based on the model. Therefore, it is a number between \(0\) and \(1\). Its detailed interpretation is the following:
- \(\prod_{i=1}^n\) appears because the sample elements are assumed to be independent and we are computing the probability of observing the whole sample \((\mathbf{X}_{1},Y_1),\ldots,(\mathbf{X}_{n},Y_n)\). This probability is equal to the product of the probabilities of observing each \((\mathbf{X}_{i},Y_i)\).
- \(p(\mathbf{X}_i)^{Y_i}(1-p(\mathbf{X}_i))^{1-Y_i}\) is the probability of observing \((\mathbf{X}_{i},Y_i)\), as given by (4.3). Remember that \(p\) depends on \(\boldsymbol{\beta}\) due to (4.4).
Usually, the log-likelihood is considered instead of the likelihood for numerical stability – the estimates obtained are exactly the same: \[\hat{\boldsymbol{\beta}}=\arg\max_{\boldsymbol{\beta}\in\mathbb{R}^{k+1}}\log \text{lik}(\boldsymbol{\beta}).\] Unfortunately, due to the non-linearity of the optimization problem there are no explicit expressions for \(\hat{\boldsymbol{\beta}}\). These have to be obtained numerically by means of an iterative procedure (the number of iterations required is printed in the output of summary). In small sample situations with perfect classification, the iterative procedure may not converge.
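The maximization above can be reproduced with a generic optimizer. The following sketch simulates data with the same mechanism as the first pattern of Figure 4.6, codes the log-likelihood from (4.8), and checks that optim arrives at essentially the same \(\hat{\boldsymbol{\beta}}\) as glm (the clamping of p is a numerical safeguard we add, not part of the model):

```r
# Simulated data (same mechanism as the first pattern of Figure 4.6)
set.seed(34567)
x <- rnorm(50, sd = 1.5)
y <- rbinom(50, size = 1, prob = 1 / (1 + exp(-(-0.5 + 3 * x))))

# Log-likelihood of beta = (beta0, beta1), following (4.8)
logLik2 <- function(beta) {
  p <- 1 / (1 + exp(-(beta[1] + beta[2] * x)))
  p <- pmin(pmax(p, 1e-12), 1 - 1e-12)  # guard against log(0)
  sum(y * log(p) + (1 - y) * log(1 - p))
}

# optim minimizes, hence the sign flip
opt <- optim(c(0, 0), function(beta) -logLik2(beta))
opt$par                                # numerical MLE
coef(glm(y ~ x, family = "binomial"))  # essentially the same values
```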
Figure
4.6 shows how the log-likelihood changes with respect to the values for
\((\beta_0,\beta_1)\) in three data patterns.
The data of the illustration has been generated with the following code:
# Data
set.seed(34567)
x <- rnorm(50, sd = 1.5)
y1 <- -0.5 + 3 * x
y2 <- 0.5 - 2 * x
y3 <- -2 + 5 * x
y1 <- rbinom(50, size = 1, prob = 1 / (1 + exp(-y1)))
y2 <- rbinom(50, size = 1, prob = 1 / (1 + exp(-y2)))
y3 <- rbinom(50, size = 1, prob = 1 / (1 + exp(-y3)))
dataMle <- data.frame(x = x, y1 = y1, y2 = y2, y3 = y3)
Let’s check that indeed the coefficients given by glm are the ones that maximize the likelihood shown in the animation of Figure 4.6. We do so for y1 ~ x.
mod <- glm(y1 ~ x, family = "binomial", data = dataMle)
mod$coefficients
## (Intercept)           x 
##  -0.1691947   2.4281626
For the regressions y2 ~ x and y3 ~ x, do the following:
- Check that \(\hat{\boldsymbol{\beta}}\) indeed maximizes the likelihood, as compared with Figure 4.6.
- Plot the fitted logistic curve and compare it with the one in Figure 4.6.
In linear regression we relied on least squares estimation, in other words, the minimization of the RSS. Why do we need MLE in logistic regression and not least squares? The answer is two-fold:
- MLE is asymptotically optimal when estimating the unknown parameters of a model. This means that, when the sample size \(n\) is large, no other estimation method performs better asymptotically. Therefore, considering a least squares approach for logistic regression would result in suboptimal estimates.
- In multiple linear regression, due to the normality assumption, MLE and least squares estimation coincide. So MLE is hidden under the form of the least squares, which is a more intuitive estimation procedure. Indeed, the maximized likelihood \(\text{lik}(\hat{\boldsymbol{\beta}})\) in the linear model and the RSS are intimately related.
As in the linear model, the inclusion of a new predictor changes the coefficient estimates of the logistic model.
Inference for model parameters
The assumptions on which the logistic model is constructed allow us to specify the asymptotic distribution of the random vector \(\hat{\boldsymbol{\beta}}\). Again, the distribution is derived conditionally on the sample predictors \(\mathbf{X}_1,\ldots,\mathbf{X}_n\). In other words, we assume that the randomness of \(Y\) comes only from \(Y|(X_1=x_1,\ldots,X_k=x_k)\sim\mathrm{Ber}(p(\mathbf{x}))\) and not from the predictors. To denote this, we employ lowercase for the sample predictors \(\mathbf{x}_1,\ldots,\mathbf{x}_n\).
There is an important difference between the inference results for the linear model and for logistic regression:
- In linear regression the inference is exact. This is due to the nice properties of the normal distribution, least squares estimation, and linearity. As a consequence, the distributions of the coefficients are perfectly known, provided that the assumptions hold.
- In logistic regression the inference is asymptotic. This means that the distributions of the coefficients are unknown except for large sample sizes \(n\), for which we have approximations. The reason is the greater complexity of the model, in terms of its non-linearity. This is the usual situation for the majority of regression models.
Distributions of the fitted coefficients
The distribution of
\(\hat{\boldsymbol{\beta}}\) is given by the asymptotic theory of MLE:
\[\begin{align}\hat{\boldsymbol{\beta}}\sim\mathcal{N}_{k+1}\left(\boldsymbol\beta,I(\hat{\boldsymbol{\beta}})^{-1}\right)\tag{4.10}\end{align}\]where
\(\sim\) must be understood as
approximately distributed as […] when \(n\to\infty\) for the rest of this chapter.
\(I(\boldsymbol\beta)\) is known as the
Fisher information matrix, and receives that name because
it measures the information available in the sample for estimating \(\boldsymbol\beta\). Therefore, the
larger the matrix is, the more precise is the estimation of
\(\boldsymbol\beta\), because that results in smaller variances in
(4.10). The inverse of the Fisher information matrix is
\[\begin{align}I(\hat{\boldsymbol{\beta}})^{-1}=(\mathbf{X}^T\mathbf{V}\mathbf{X})^{-1},\tag{4.11}\end{align}\]where \(\mathbf{V}\) is a diagonal matrix containing the different variances for each \(Y_i\) (remember that \(p(\mathbf{x})=1/(1+e^{-(\beta_0+\beta_1x_1+\ldots+\beta_kx_k)})\)): \[\mathbf{V}=\begin{pmatrix}p(\mathbf{X}_1)(1-p(\mathbf{X}_1)) & & &\\& p(\mathbf{X}_2)(1-p(\mathbf{X}_2)) & & \\& & \ddots & \\& & & p(\mathbf{X}_n)(1-p(\mathbf{X}_n))\end{pmatrix}\] In the case of the multiple linear regression, \(I(\hat{\boldsymbol{\beta}})^{-1}=\sigma^2(\mathbf{X}^T\mathbf{X})^{-1}\) (see (3.6)), so the presence of \(\mathbf{V}\) here is revealing the heteroskedasticity of the model.
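Identity (4.11) can be checked numerically. On simulated data (seed and true coefficients are arbitrary choices), the matrix \((\mathbf{X}^T\hat{\mathbf{V}}\mathbf{X})^{-1}\) built by hand coincides with the variance-covariance matrix that R reports:

```r
# Simulated logistic data (arbitrary choice of coefficients)
set.seed(1)
x <- rnorm(100)
y <- rbinom(100, size = 1, prob = 1 / (1 + exp(-(1 - 2 * x))))
fit <- glm(y ~ x, family = "binomial")

X <- cbind(1, x)         # design matrix
p <- fitted(fit)         # hat-p(X_i)
V <- diag(p * (1 - p))   # hat-V, the diagonal matrix of Bernoulli variances
solve(t(X) %*% V %*% X)  # equals vcov(fit), i.e., I(hat-beta)^{-1}
vcov(fit)
```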
The interpretation of (4.10) and (4.11) gives some useful insights into what affects the quality of the estimation:
- Bias. The estimates are asymptotically unbiased.
- Variance. It depends on:
  - Sample size \(n\). Hidden inside \(\mathbf{X}^T\mathbf{V}\mathbf{X}\). As \(n\) grows, the precision of the estimators increases.
  - Weighted predictor sparsity \((\mathbf{X}^T\mathbf{V}\mathbf{X})^{-1}\). The sparser the predictors are (small \(|(\mathbf{X}^T\mathbf{V}\mathbf{X})^{-1}|\)), the more precise \(\hat{\boldsymbol{\beta}}\) is.
The precision of \(\hat{\boldsymbol{\beta}}\) is affected by the value of \(\boldsymbol{\beta}\), which is hidden inside
\(\mathbf{V}\). This contrasts sharply with the linear model, where the precision of the least squares estimator was not affected by the value of the unknown coefficients (see
(3.6)). The reason is partially due to the
heteroskedasticity of logistic regression, which implies a dependence of the variance of
\(Y\) on the logistic curve, hence on
\(\boldsymbol{\beta}\).
Similar to linear regression, the problem with
(4.10) and
(4.11) is that
\(\mathbf{V}\) is unknown in practice because it depends on
\(\boldsymbol{\beta}\). Plugging-in the estimate
\(\hat{\boldsymbol{\beta}}\) to
\(\boldsymbol{\beta}\) in
\(\mathbf{V}\) results in
\(\hat{\mathbf{V}}\). Now we can use
\(\hat{\mathbf{V}}\) to get
\[\begin{align}\frac{\hat\beta_j-\beta_j}{\hat{\mathrm{SE}}(\hat\beta_j)}\sim \mathcal{N}(0,1),\quad\hat{\mathrm{SE}}(\hat\beta_j)^2=v_j^2\tag{4.12}\end{align}\]where \[v_j\text{ is the $j$-th element of the diagonal of }(\mathbf{X}^T\hat{\mathbf{V}}\mathbf{X})^{-1}.\] The LHS of (4.12) is the Wald statistic for \(\beta_j\), \(j=0,\ldots,k\). These statistics are employed for building confidence intervals and hypothesis tests.
Confidence intervals for the coefficients
Thanks to
(4.12), we can have the
\(100(1-\alpha)\%\) CI for the coefficient
\(\beta_j\),
\(j=0,\ldots,k\):
\[\begin{align}\left(\hat\beta_j\pm\hat{\mathrm{SE}}(\hat\beta_j)z_{\alpha/2}\right)\tag{4.13}\end{align}\]where \(z_{\alpha/2}\) is the \(\alpha/2\)-upper quantile of the \(\mathcal{N}(0,1)\). If we are interested in the CI for \(e^{\beta_j}\), we can simply take the exponential of the above CI. So the \(100(1-\alpha)\%\) CI for \(e^{\beta_j}\), \(j=0,\ldots,k\), is \[e^{\left(\hat\beta_j\pm\hat{\mathrm{SE}}(\hat\beta_j)z_{\alpha/2}\right)}.\] Of course, this CI is not the same as \(\left(e^{\hat\beta_j}\pm e^{\hat{\mathrm{SE}}(\hat\beta_j)z_{\alpha/2}}\right)\), which is not a CI for \(e^{\beta_j}\).
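The CI (4.13) can be computed by hand and compared with confint.default, which implements exactly this Wald interval (note that plain confint on a glm uses profile likelihood instead, which is why its output differs slightly). A self-contained sketch on simulated data (seed and coefficients arbitrary):

```r
# Simulated logistic data (arbitrary choice of coefficients)
set.seed(1)
x <- rnorm(100)
y <- rbinom(100, size = 1, prob = 1 / (1 + exp(-(1 - 2 * x))))
fit <- glm(y ~ x, family = "binomial")

beta <- coef(fit)
se <- sqrt(diag(vcov(fit)))   # hat-SE(hat-beta_j)
z <- qnorm(0.975)             # z_{alpha/2} for alpha = 0.05
cbind(beta - z * se, beta + z * se)       # matches confint.default(fit)
exp(cbind(beta - z * se, beta + z * se))  # CI for exp(beta_j)
```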
Let’s see how we can compute the CIs. We return to the challenger dataset; in case you do not have it loaded, you can download it here. We analyse the CIs for the coefficients of fail.field ~ temp.
# Fit model
nasa <- glm(fail.field ~ temp, family = "binomial", data = challenger)
# Confidence intervals at 95%
confint(nasa)
## Waiting for profiling to be done...
##                  2.5 %     97.5 %
## (Intercept)  1.3364047 17.7834329
## temp        -0.9237721 -0.1089953
# Confidence intervals at other levels
confint(nasa, level = 0.90)
## Waiting for profiling to be done...
##                    5 %       95 %
## (Intercept)  2.2070301 15.7488590
## temp        -0.8222858 -0.1513279
# Confidence intervals for the factors affecting the odds
exp(confint(nasa))
## Waiting for profiling to be done...
##                 2.5 %       97.5 %
## (Intercept) 3.8053375 5.287456e+07
## temp        0.3970186 8.967346e-01
In this example, the 95% confidence interval for \(\beta_0\) is \((1.3364, 17.7834)\) and for \(\beta_1\) is \((-0.9238, -0.1090)\). For \(e^{\beta_0}\) and \(e^{\beta_1}\), the CIs are \((3.8053, 5.2874\times10^7)\) and \((0.3970, 0.8967)\), respectively. Therefore, we can say with a 95% confidence that:
- When temp=0, the probability of fail.field=1 is significantly larger than the probability of fail.field=0 (using the CI for \(\beta_0\)). Indeed, fail.field=1 is between \(3.8053\) and \(5.2874\times10^7\) times more likely than fail.field=0 (using the CI for \(e^{\beta_0}\)).
- temp has a significantly negative effect on the probability of fail.field=1 (using the CI for \(\beta_1\)). Indeed, each unit increase in temp reduces the odds of fail.field=1 by a factor between \(0.3970\) and \(0.8967\) (using the CI for \(e^{\beta_1}\)).
Compute and interpret the CIs for the exponentiated coefficients, at level \(\alpha=0.05\), for the following regressions (challenger
dataset):
fail.field ~ temp + pres.field
fail.nozzle ~ temp + pres.nozzle
fail.field ~ temp + pres.nozzle
fail.nozzle ~ temp + pres.field
The interpretation of the variables is given above Table
4.1.
Testing on the coefficients
The distributions in (4.12) also allow us to conduct formal hypothesis tests on the coefficients \(\beta_j\), \(j=0,\ldots,k\). For example, the test for significance:
\[\begin{align*}H_0:\beta_j=0\end{align*}\]for \(j=0,\ldots,k\). The test of \(H_0:\beta_j=0\) with \(1\leq j\leq k\) is especially interesting, since it allows us to answer whether the variable \(X_j\) has a significant effect on \(\mathbb{P}[Y=1]\). The statistic used for testing for significance is the Wald statistic
\[\begin{align*}\frac{\hat\beta_j-0}{\hat{\mathrm{SE}}(\hat\beta_j)},\end{align*}\]which is asymptotically distributed as a \(\mathcal{N}(0,1)\) under the null hypothesis. \(H_0\) is tested against the two-sided alternative hypothesis \(H_1:\beta_j\neq 0\).
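The Wald statistic and its two-sided p-value can be reproduced by hand from the coefficient table, which is exactly how the "z value" and "Pr(>|z|)" columns of summary are obtained. A minimal sketch, again using mtcars as a stand-in dataset:

```r
# Wald tests for H0: beta_j = 0, reproducing summary()'s "z value" and
# "Pr(>|z|)" columns (mtcars is a stand-in dataset, not challenger)
fit <- glm(am ~ wt, family = "binomial", data = mtcars)
coefs <- summary(fit)$coefficients
wald <- coefs[, "Estimate"] / coefs[, "Std. Error"]  # Wald statistics
pval <- 2 * pnorm(-abs(wald))                        # two-sided N(0, 1) p-values
cbind(wald, pval)
```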
The tests for significance are built into the summary function. However, a note of caution is required when applying the following rule of thumb:
Is the CI for \(\beta_j\) below (above) \(0\) at level \(\alpha\)?
- Yes \(\rightarrow\) reject \(H_0\) at level \(\alpha\).
- No \(\rightarrow\) the criterion is not conclusive.
The significances given in summary and the output of confint are slightly incoherent, and the previous rule of thumb does not apply. The reason is that MASS's confint method for glm uses a more sophisticated approach (the profile likelihood) to build the CI for \(\beta_j\), rather than the asymptotic distribution behind the Wald statistic.
By changing confint to R's default confint.default, the results will be completely coherent with the significances in summary, and the rule of thumb will remain valid. For the contents of this course we prefer confint.default due to its better interpretability.
To illustrate this we consider the regression of fail.field ~ temp + pres.field
:
# Significances with asymptotic approximation for the standard errors
nasa2 <- glm(fail.field ~ temp + pres.field, family = "binomial", data = challenger)
summary(nasa2)
## 
## Call:
## glm(formula = fail.field ~ temp + pres.field, family = "binomial", 
##     data = challenger)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.2109  -0.6081  -0.4292   0.3498   2.0913  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)  
## (Intercept)  6.642709   4.038547   1.645   0.1000  
## temp        -0.435032   0.197008  -2.208   0.0272 *
## pres.field   0.009376   0.008821   1.063   0.2878  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 28.267  on 22  degrees of freedom
## Residual deviance: 19.078  on 20  degrees of freedom
## AIC: 25.078
## 
## Number of Fisher Scoring iterations: 5

# CIs with asymptotic approximation - coherent with summary
confint.default(nasa2, level = 0.90)
##                      5 %        95 %
## (Intercept) -0.000110501 13.28552771
## temp        -0.759081468 -0.11098301
## pres.field  -0.005132393  0.02388538
confint.default(nasa2, level = 0.99)
##                   0.5 %      99.5 %
## (Intercept) -3.75989977 17.04531697
## temp        -0.94249107  0.07242659
## pres.field  -0.01334432  0.03209731

# CIs with profile likelihood - incoherent with summary
confint(nasa2, level = 0.90)  # intercept still significant
## Waiting for profiling to be done...
##                      5 %        95 %
## (Intercept)  0.945372123 14.93392497
## temp        -0.845250023 -0.16532086
## pres.field  -0.004184814  0.02602181
confint(nasa2, level = 0.99)  # temp still significant
## Waiting for profiling to be done...
##                   0.5 %      99.5 %
## (Intercept) -1.86541750 21.49637422
## temp        -1.17556090 -0.04317904
## pres.field  -0.01164943  0.03836968
For the previous exercise, check the differences between using confint and confint.default for computing the CIs.
Prediction
Prediction in logistic regression focuses mainly on predicting the values of the logistic curve \[p(x_1,\ldots,x_k)=\mathbb{P}[Y=1|X_1=x_1,\ldots,X_k=x_k]=\frac{1}{1+e^{-(\beta_0+\beta_1x_1+\ldots+\beta_kx_k)}}\] by means of \[\hat p(x_1,\ldots,x_k)=\hat{\mathbb{P}}[Y=1|X_1=x_1,\ldots,X_k=x_k]=\frac{1}{1+e^{-(\hat\beta_0+\hat\beta_1x_1+\ldots+\hat\beta_kx_k)}}.\] From the perspective of the linear model, this is the same as predicting the conditional mean (not the conditional response) of the response, but this time this conditional mean is also a conditional probability. The prediction of the conditional response is not so interesting since it follows immediately from \(\hat p(x_1,\ldots,x_k)\): \[\hat{Y}|(X_1=x_1,\ldots,X_k=x_k)=\left\{\begin{array}{ll}1,&\text{with probability }\hat p(x_1,\ldots,x_k),\\0,&\text{with probability }1-\hat p(x_1,\ldots,x_k).\end{array}\right.\] As a consequence, we can predict \(Y\) as \(1\) if \(\hat p(x_1,\ldots,x_k)>\frac{1}{2}\) and as \(0\) if \(\hat p(x_1,\ldots,x_k)<\frac{1}{2}\).
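The prediction and classification rule above can be sketched in a few lines; plogis is R's logistic function, so the probabilities returned by predict agree with the closed-form expression (mtcars serves again as a stand-in dataset):

```r
# Predicting hat p(x) and classifying Y, sketched with mtcars (am ~ wt
# as a stand-in for a logistic regression on the challenger data)
fit <- glm(am ~ wt, family = "binomial", data = mtcars)
pHat <- predict(fit, type = "response")  # hat p(x) over the sample
yHat <- as.numeric(pHat > 0.5)           # classify as 1 if hat p(x) > 1/2
# plogis() is the logistic function, so pHat equals the closed form
pManual <- plogis(coef(fit)[1] + coef(fit)[2] * mtcars$wt)
head(cbind(pHat, yHat))
```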
Let’s focus then on how to make predictions and compute CIs in practice with predict
. Similarly to the linear model, the objects required for predict
are: first, the output of glm
; second, a data.frame
containing the locations \(\mathbf{x}=(x_1,\ldots,x_k)\) where we want to predict \(p(x_1,\ldots,x_k)\). However, there are two differences with respect to the use of predict
for lm
:
- The argument type. type = "link" gives the predictions in the log-odds, that is, it returns \(\log\frac{\hat p(x_1,\ldots,x_k)}{1-\hat p(x_1,\ldots,x_k)}\). type = "response" gives the predictions in the probability space \([0,1]\), that is, it returns \(\hat p(x_1,\ldots,x_k)\).
- There is no interval argument when using predict for glm. That means that there is no easy way of computing CIs for the prediction.
Since it is a bit cumbersome to compute the CIs by yourself, the function predictCIsLogistic coded below computes them automatically for you.
# Data for which we want a prediction
# Important! You have to name the column with the predictor name!
newdata <- data.frame(temp = -0.6)

# Prediction of the conditional log-odds - the default
predict(nasa, newdata = newdata, type = "link")
##        1 
## 7.833731

# Prediction of the conditional probability
predict(nasa, newdata = newdata, type = "response")
##        1 
## 0.999604

# Function for computing the predictions and CIs for the conditional probability
predictCIsLogistic <- function(object, newdata, level = 0.95) {
  
  # Compute predictions in the log-odds
  pred <- predict(object = object, newdata = newdata, se.fit = TRUE)
  
  # CI in the log-odds
  za <- qnorm(p = (1 - level) / 2)
  lwr <- pred$fit + za * pred$se.fit
  upr <- pred$fit - za * pred$se.fit
  
  # Transform to probabilities
  fit <- 1 / (1 + exp(-pred$fit))
  lwr <- 1 / (1 + exp(-lwr))
  upr <- 1 / (1 + exp(-upr))
  
  # Return a matrix with column names "fit", "lwr" and "upr"
  result <- cbind(fit, lwr, upr)
  colnames(result) <- c("fit", "lwr", "upr")
  return(result)
  
}

# Simple call
predictCIsLogistic(nasa, newdata = newdata)
##        fit       lwr       upr
## 1 0.999604 0.4838505 0.9999999
# The CI is large because there is no data around temp = -0.6 and
# that makes the prediction more variable (and also because we only
# have 23 observations)
For the challenger dataset, do the following:
- Regress fail.nozzle on temp and pres.nozzle.
- Compute the predicted probability of fail.nozzle=1 for temp=15 and pres.nozzle=200. What is the predicted probability for fail.nozzle=0?
- Compute the confidence intervals for the two predicted probabilities at level 95%.
Finally, Figure 4.9 gives an interactive visualization of the CIs for the conditional probability in simple logistic regression. Their interpretation is very similar to the CIs for the conditional mean in the simple linear model, see Section 2.6 and Figure 2.23.
Deviance and model fit
The deviance is a key concept in logistic regression. Intuitively, it measures the deviation of the fitted logistic model with respect to a perfect model for \(\mathbb{P}[Y=1|X_1=x_1,\ldots,X_k=x_k]\). This perfect model, known as the saturated model, denotes an abstract model that fits the sample perfectly, that is, the model such that \[\hat{\mathbb{P}}[Y=1|X_1=X_{i1},\ldots,X_k=X_{ik}]=Y_i,\quad i=1,\ldots,n.\] This model assigns probability \(0\) or \(1\) to \(Y\) depending on the actual value of \(Y_i\). To clarify this concept, Figure 4.10 shows a saturated model and a fitted logistic regression.
More precisely, the deviance is defined through the difference of the log-likelihoods of the fitted model and the saturated model: \[D=-2\log\text{lik}(\hat{\boldsymbol{\beta}})+2\log\text{lik}(\text{saturated model}).\] Since the likelihood of the saturated model is exactly one, its log-likelihood is zero and the deviance is simply another expression of the likelihood of the fitted model: \[D=-2\log\text{lik}(\hat{\boldsymbol{\beta}}).\] As a consequence, the deviance is always greater than or equal to zero, being zero only if the fit is perfect.
A benchmark for evaluating the magnitude of the deviance is the null deviance, \[D_0=-2\log\text{lik}(\hat{\beta}_0),\] which is the deviance of the worst model, the one fitted without any predictor, with respect to the perfect model: \[Y|(X_1=x_1,\ldots,X_k=x_k)\sim \mathrm{Ber}(\mathrm{logistic}(\beta_0)).\] In this case, \(\hat\beta_0=\mathrm{logit}(\frac{m}{n})=\log\frac{\frac{m}{n}}{1-\frac{m}{n}}\), where \(m\) is the number of \(1\)'s in \(Y_1,\ldots,Y_n\) (see Figure 4.10).
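These identities are easy to verify numerically: deviance() returns \(D\), logLik() the log-likelihood, and the intercept-only estimate is the logit of the sample proportion of \(1\)'s. A short sketch using mtcars as a stand-in dataset:

```r
# Verifying D = -2 log lik and hat beta_0 = logit(m / n), with the
# built-in mtcars dataset as a stand-in (am is the binary response)
fit <- glm(am ~ wt, family = "binomial", data = mtcars)
null <- glm(am ~ 1, family = "binomial", data = mtcars)
c(deviance(fit), -2 * as.numeric(logLik(fit)))    # D, computed two ways
c(deviance(null), -2 * as.numeric(logLik(null)))  # D0, computed two ways
c(coef(null), qlogis(mean(mtcars$am)))            # hat beta_0 = logit(m / n)
```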
The null deviance serves for comparing how much the model has improved by adding the predictors \(X_1,\ldots,X_k\). This can be done by means of the \(R^2\) statistic, which is a generalization of the determination coefficient in multiple linear regression:
\[\begin{align}R^2=1-\frac{D}{D_0}=1-\frac{\text{deviance(fitted logistic, saturated model)}}{\text{deviance(null model, saturated model)}}.\tag{4.14}\end{align}\]This global measure of fit shares some important properties with the determination coefficient in linear regression:
- It is a quantity between \(0\) and \(1\).
- If the fit is perfect, then \(D=0\) and \(R^2=1\). If the predictors do not add anything to the regression, then \(D=D_0\) and \(R^2=0\).
In logistic regression, however, \(R^2\) does not have the same interpretation as in linear regression:
- It is not the percentage of variance explained by the logistic model, but rather a ratio indicating how close the fit is to being perfect (or to being the worst).
- It is not related to any correlation coefficient.
The \(R^2\) in (4.14) is valid for the whole family of generalized linear models, of which linear and logistic regression are particular cases. The connection between (4.14) and the determination coefficient is given by the expressions of the deviance and the null deviance for the linear model:
\[D=\mathrm{SSE}\text{ (or $D=\mathrm{RSS}$) and }D_0=\mathrm{SST}.\]Let's see how these concepts appear in the summary function:
# Summary of model
nasa <- glm(fail.field ~ temp, family = "binomial", data = challenger)
summaryLog <- summary(nasa)
summaryLog  # 'Residual deviance' is the deviance; 'Null deviance' is the null deviance
## 
## Call:
## glm(formula = fail.field ~ temp, family = "binomial", data = challenger)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.0566  -0.7575  -0.3818   0.4571   2.2195  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept)   7.5837     3.9146   1.937   0.0527 .
## temp         -0.4166     0.1940  -2.147   0.0318 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 28.267  on 22  degrees of freedom
## Residual deviance: 20.335  on 21  degrees of freedom
## AIC: 24.335
## 
## Number of Fisher Scoring iterations: 5

# Null model - only intercept
null <- glm(fail.field ~ 1, family = "binomial", data = challenger)
summaryNull <- summary(null)
summaryNull
## 
## Call:
## glm(formula = fail.field ~ 1, family = "binomial", data = challenger)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.8519  -0.8519  -0.8519   1.5425   1.5425  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept)  -0.8267     0.4532  -1.824   0.0681 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 28.267  on 22  degrees of freedom
## Residual deviance: 28.267  on 22  degrees of freedom
## AIC: 30.267
## 
## Number of Fisher Scoring iterations: 4

# Computation of the R^2 with a function - useful for repetitive computations
r2Log <- function(model) {
  summaryLog <- summary(model)
  1 - summaryLog$deviance / summaryLog$null.deviance
}

# R^2
r2Log(nasa)
## [1] 0.280619
r2Log(null)
## [1] -2.220446e-16
Another way of evaluating the model fit is its predictive accuracy. The motivation is that most of the time we are interested simply in classifying, for an observation of the predictors, the value of \(Y\) as either \(0\) or \(1\), but not in predicting the value of \(p(x_1,\ldots,x_k)=\mathbb{P}[Y=1|X_1=x_1,\ldots,X_k=x_k]\). The classification in prediction is simply done by the rule \[\hat{Y}=\left\{\begin{array}{ll}1,&\hat{p}(x_1,\ldots,x_k)>\frac{1}{2},\\0,&\hat{p}(x_1,\ldots,x_k)<\frac{1}{2}.\end{array}\right.\] The overall predictive accuracy can be summarized with the hit matrix
|          | \(\hat{Y}=0\)       | \(\hat{Y}=1\)       |
|----------|---------------------|---------------------|
| \(Y=0\)  | Correct\(_{0}\)     | Incorrect\(_{01}\)  |
| \(Y=1\)  | Incorrect\(_{10}\)  | Correct\(_{1}\)     |
and with the hit ratio \(\frac{\text{Correct}_0+\text{Correct}_1}{n}\). The hit matrix is easily computed with the table
function. The function, whenever called with two vectors, computes the cross-table between the two vectors.
# Fitted probabilities for Y = 1
nasa$fitted.values
##          1          2          3          4          5          6 
## 0.42778935 0.23014393 0.26910358 0.32099837 0.37772880 0.15898364 
##          7          8          9         10         11         12 
## 0.12833090 0.23014393 0.85721594 0.60286639 0.23014393 0.04383877 
##         13         14         15         16         17         18 
## 0.37772880 0.93755439 0.37772880 0.08516844 0.23014393 0.02299887 
##         19         20         21         22         23 
## 0.07027765 0.03589053 0.08516844 0.07027765 0.82977495

# Classified Y's
yHat <- nasa$fitted.values > 0.5

# Hit matrix:
# - 16 correctly classified as 0
# - 4 correctly classified as 1
# - 3 incorrectly classified as 0
tab <- table(challenger$fail.field, yHat)
tab
##    yHat
##     FALSE TRUE
##   0    16    0
##   1     3    4

# Hit ratio (ratio of correct classification)
(16 + 4) / 23  # Manually
## [1] 0.8695652
sum(diag(tab)) / sum(tab)  # Automatically
## [1] 0.8695652
It is important to recall that the hit matrix will always be biased towards unrealistically good classification rates if it is computed on the same sample used for fitting the logistic model. A familiar analogy is asking your mother (data) whether you (model) are a good-looking human being (good predictive accuracy) – the answer will be highly positively biased. To get a fair hit matrix, the right approach is to split the sample randomly into two: a training dataset, used for fitting the model, and a test dataset, used for evaluating its predictive accuracy.
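This train/test splitting can be sketched in a few lines. The sketch below uses mtcars as a stand-in dataset, and the split sizes are arbitrary choices for illustration:

```r
# Random train/test split to get a fair hit ratio (mtcars as a stand-in
# dataset; the 22/10 split is an arbitrary illustrative choice)
set.seed(1)
n <- nrow(mtcars)
idTrain <- sample(n, size = 22)  # roughly 2/3 of the sample for training
train <- mtcars[idTrain, ]
test <- mtcars[-idTrain, ]

# Fit on the training dataset only
fit <- glm(am ~ wt, family = "binomial", data = train)

# Classify the test dataset and compute its (fair) hit matrix and hit ratio
pHat <- predict(fit, newdata = test, type = "response")
yHat <- pHat > 0.5
tab <- table(test$am, yHat)              # hit matrix on unseen data
hitRatio <- mean((test$am == 1) == yHat) # proportion of correct classifications
hitRatio
```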
Model selection and multicollinearity
The same discussion we had in Section 3.7 is applicable to logistic regression with small changes:
- The deviance of the model always decreases (reciprocally, the likelihood and the \(R^2\) always increase) with the inclusion of more predictors – no matter whether they are significant or not.
- An excess of predictors in the model is paid for with a larger variability in the estimation of the model, which results in less precise predictions.
- Multicollinearity may hide significant variables, change their sign, and inflate the variability of the estimation at the expense of little improvement in the fit.
The use of information criteria, stepwise, and vif allows us to deal with these issues efficiently. Let's review them quickly from the perspective of logistic regression.
First, remember that the BIC/AIC information criteria are based on a balance between the model fit, given by the likelihood, and its complexity. In logistic regression, the BIC is
\[\begin{align*}\text{BIC}(\text{model}) &= -2\log \text{lik}(\hat{\boldsymbol{\beta}}) + (k + 1)\times\log n\\&=D+(k+1)\times \log n,\end{align*}\]where \(\text{lik}(\hat{\boldsymbol{\beta}})\) is the likelihood of the model. The AIC replaces \(\log n\) by \(2\), hence penalizing model complexity less. The BIC and AIC can be computed in R through the functions BIC and AIC, and we can check manually that they match their definitions.
# Models
nasa <- glm(fail.field ~ temp, family = "binomial", data = challenger)
nasa2 <- glm(fail.field ~ temp + pres.field, family = "binomial", data = challenger)

# nasa
summary1 <- summary(nasa)
summary1
## 
## Call:
## glm(formula = fail.field ~ temp, family = "binomial", data = challenger)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.0566  -0.7575  -0.3818   0.4571   2.2195  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept)   7.5837     3.9146   1.937   0.0527 .
## temp         -0.4166     0.1940  -2.147   0.0318 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 28.267  on 22  degrees of freedom
## Residual deviance: 20.335  on 21  degrees of freedom
## AIC: 24.335
## 
## Number of Fisher Scoring iterations: 5
BIC(nasa)
## [1] 26.60584
summary1$deviance + 2 * log(dim(challenger)[1])
## [1] 26.60584
AIC(nasa)
## [1] 24.33485
summary1$deviance + 2 * 2
## [1] 24.33485

# nasa2
summary2 <- summary(nasa2)
summary2
## 
## Call:
## glm(formula = fail.field ~ temp + pres.field, family = "binomial", 
##     data = challenger)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.2109  -0.6081  -0.4292   0.3498   2.0913  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)  
## (Intercept)  6.642709   4.038547   1.645   0.1000  
## temp        -0.435032   0.197008  -2.208   0.0272 *
## pres.field   0.009376   0.008821   1.063   0.2878  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 28.267  on 22  degrees of freedom
## Residual deviance: 19.078  on 20  degrees of freedom
## AIC: 25.078
## 
## Number of Fisher Scoring iterations: 5
BIC(nasa2)
## [1] 28.48469
summary2$deviance + 3 * log(dim(challenger)[1])
## [1] 28.48469
AIC(nasa2)
## [1] 25.07821
summary2$deviance + 3 * 2
## [1] 25.07821
Second, stepwise works analogously to the linear regression situation. Here is an illustration for a binary variable that measures whether a Boston suburb (Boston dataset) is wealthy or not. The binary variable is medv > 25: it is TRUE (1) for suburbs with a median house value larger than $25,000 and FALSE (0) otherwise. The cutoff $25,000 corresponds to the 25% richest suburbs.
# Boston dataset
data(Boston)

# Model whether a suburb has a median house value larger than 25000$
mod <- glm(I(medv > 25) ~ ., data = Boston, family = "binomial")
summary(mod)
## 
## Call:
## glm(formula = I(medv > 25) ~ ., family = "binomial", data = Boston)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.3498  -0.2806  -0.0932  -0.0006   3.3781  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  5.312511   4.876070   1.090 0.275930    
## crim        -0.011101   0.045322  -0.245 0.806503    
## zn           0.010917   0.010834   1.008 0.313626    
## indus       -0.110452   0.058740  -1.880 0.060060 .  
## chas         0.966337   0.808960   1.195 0.232266    
## nox         -6.844521   4.483514  -1.527 0.126861    
## rm           1.886872   0.452692   4.168 3.07e-05 ***
## age          0.003491   0.011133   0.314 0.753853    
## dis         -0.589016   0.164013  -3.591 0.000329 ***
## rad          0.318042   0.082623   3.849 0.000118 ***
## tax         -0.010826   0.004036  -2.682 0.007314 ** 
## ptratio     -0.353017   0.122259  -2.887 0.003884 ** 
## black       -0.002264   0.003826  -0.592 0.554105    
## lstat       -0.367355   0.073020  -5.031 4.88e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 563.52  on 505  degrees of freedom
## Residual deviance: 209.11  on 492  degrees of freedom
## AIC: 237.11
## 
## Number of Fisher Scoring iterations: 7
r2Log(mod)
## [1] 0.628923

# With BIC - ends up with only the significant variables and a similar R^2
modBIC <- stepwise(mod, trace = 0)
## 
## Direction:  backward/forward
## Criterion:  BIC
summary(modBIC)
## 
## Call:
## glm(formula = I(medv > 25) ~ indus + rm + dis + rad + tax + ptratio + 
##     lstat, family = "binomial", data = Boston)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.3077  -0.2970  -0.0947  -0.0005   3.2552  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  1.556433   3.948818   0.394 0.693469    
## indus       -0.143236   0.054771  -2.615 0.008918 ** 
## rm           1.950496   0.441794   4.415 1.01e-05 ***
## dis         -0.426830   0.111572  -3.826 0.000130 ***
## rad          0.301060   0.076542   3.933 8.38e-05 ***
## tax         -0.010240   0.003631  -2.820 0.004800 ** 
## ptratio     -0.404964   0.112086  -3.613 0.000303 ***
## lstat       -0.384823   0.069121  -5.567 2.59e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 563.52  on 505  degrees of freedom
## Residual deviance: 215.03  on 498  degrees of freedom
## AIC: 231.03
## 
## Number of Fisher Scoring iterations: 7
r2Log(modBIC)
## [1] 0.6184273
Finally, multicollinearity can also be present in logistic regression. Despite the nonlinear logistic curve, the predictors are combined linearly in (4.4). Due to this, if two or more predictors are highly correlated between them, the fit of the model will be compromised since the individual linear effect of each predictor is hard to disentangle from the rest of correlated predictors.
In addition to inspecting the correlation matrix and looking for high correlations, a powerful tool for detecting multicollinearity is the generalized Variance Inflation Factor (gVIF) of each coefficient \(\hat\beta_j\). This is a measure of how much the variability of the estimate has increased due to the addition of the predictor \(X_j\). The following rule of thumb gives direct insight into which predictors are multicollinear:
- gVIF close to 1: absence of multicollinearity.
- gVIF larger than 5 or 10: problematic amount of multicollinearity. It is advised to remove the predictor with the largest gVIF.
Here is an example illustrating the use of the gVIF, through vif, in practice. It also shows how simple inspection of the correlation matrix is not enough for detecting multicollinearity in tricky situations.
# Create predictors with multicollinearity: x4 depends on the rest
set.seed(45678)
x1 <- rnorm(100)
x2 <- 0.5 * x1 + rnorm(100)
x3 <- 0.5 * x2 + rnorm(100)
x4 <- -x1 + x2 + rnorm(100, sd = 0.25)

# Response
z <- 1 + 0.5 * x1 + 2 * x2 - 3 * x3 - x4
y <- rbinom(n = 100, size = 1, prob = 1 / (1 + exp(-z)))
data <- data.frame(x1 = x1, x2 = x2, x3 = x3, x4 = x4, y = y)

# Correlations - none seems suspicious
cor(data)
##            x1         x2         x3         x4           y
## x1  1.0000000 0.38254782  0.2142011 -0.5261464  0.20198825
## x2  0.3825478 1.00000000  0.5167341  0.5673174  0.07456324
## x3  0.2142011 0.51673408  1.0000000  0.2500123 -0.49853746
## x4 -0.5261464 0.56731738  0.2500123  1.0000000 -0.11188657
## y   0.2019882 0.07456324 -0.4985375 -0.1118866  1.00000000

# Generalized variance inflation factors abnormal: largest for x4, we remove it
modMultiCo <- glm(y ~ x1 + x2 + x3 + x4, family = "binomial")
vif(modMultiCo)
##       x1       x2       x3       x4 
## 27.84756 36.66514  4.94499 36.78817

# Without x4
modClean <- glm(y ~ x1 + x2 + x3, family = "binomial")

# Comparison
summary(modMultiCo)
## 
## Call:
## glm(formula = y ~ x1 + x2 + x3 + x4, family = "binomial")
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4743  -0.3796   0.1129   0.4052   2.3887  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   1.2527     0.4008   3.125  0.00178 ** 
## x1           -3.4269     1.8225  -1.880  0.06007 .  
## x2            6.9627     2.1937   3.174  0.00150 ** 
## x3           -4.3688     0.9312  -4.691 2.71e-06 ***
## x4           -5.0047     1.9440  -2.574  0.01004 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 132.81  on 99  degrees of freedom
## Residual deviance:  59.76  on 95  degrees of freedom
## AIC: 69.76
## 
## Number of Fisher Scoring iterations: 7
summary(modClean)
## 
## Call:
## glm(formula = y ~ x1 + x2 + x3, family = "binomial")
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0952  -0.4144   0.1839   0.4762   2.5736  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   0.9237     0.3221   2.868 0.004133 ** 
## x1            1.2803     0.4235   3.023 0.002502 ** 
## x2            1.7946     0.5290   3.392 0.000693 ***
## x3           -3.4838     0.7491  -4.651 3.31e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 132.813  on 99  degrees of freedom
## Residual deviance:  68.028  on 96  degrees of freedom
## AIC: 76.028
## 
## Number of Fisher Scoring iterations: 6
r2Log(modMultiCo)
## [1] 0.5500437
r2Log(modClean)
## [1] 0.4877884

# Generalized variance inflation factors normal
vif(modClean)
##       x1       x2       x3 
## 1.674300 2.724351 3.743940
For the Boston dataset, do the following:
- Compute the hit matrix and hit ratio for the regression I(medv > 25) ~ . (hint: do table(medv > 25, …)).
- Fit I(medv > 25) ~ . but now using only the first 300 observations of Boston, the training dataset (hint: use subset).
- For the previous model, predict the probability of the responses and classify them into 0 or 1 for the last 206 observations, the testing dataset (hint: use predict on that subset).
- Compute the hit matrix and hit ratio for the new predictions. Check that this hit ratio is smaller than the one in the first point. The hit ratio on the testing dataset, and not the first hit ratio, is an estimator of how well the model will classify future observations.