4.3 Model formulation and estimation by maximum likelihood

As we saw in Section 3.2, the multiple linear model described the relation between the random variables \(X_1,\ldots,X_k\) and \(Y\) by assuming the linear relation

\[Y=\beta_0+\beta_1X_1+\cdots+\beta_kX_k+\varepsilon.\]

Since we assume \(\mathbb{E}[\varepsilon|X_1=x_1,\ldots,X_k=x_k]=0\), the previous equation was equivalent to

\[\mathbb{E}[Y|X_1=x_1,\ldots,X_k=x_k]=\beta_0+\beta_1x_1+\cdots+\beta_kx_k,\tag{4.2}\]

where \(\mathbb{E}[Y|X_1=x_1,\ldots,X_k=x_k]\) is the mean of \(Y\) for a particular value of the set of predictors. As remarked in Section 3.3, it was a necessary condition that \(Y\) was continuous in order to satisfy the normality of the errors, and hence the linear model assumptions. In other words, the linear model is designed for a continuous response.

The situation when \(Y\) is discrete (naturally ordered values; e.g., number of failures, number of students) or categorical (non-ordered categories; e.g., territorial divisions, ethnic groups) requires a special treatment. The simplest situation is when \(Y\) is binary (or dichotomous): it can only take two values, codified for convenience as 1 (success) and 0 (failure). For example, in the Challenger case study we used fail.field as an indicator of whether “there was at least an incident with the O-rings” (1 = yes, 0 = no). For binary variables there is no fundamental distinction between the treatment of discrete and categorical variables.

More formally, a binary variable is known as a Bernoulli variable, which is the simplest non-trivial random variable. We say that \(Y\sim\mathrm{Ber}(p)\), \(0\leq p\leq 1\), if

\[Y=\begin{cases}1,&\text{with probability }p,\\0,&\text{with probability }1-p,\end{cases}\]

or, equivalently, if \(\mathbb{P}[Y=1]=p\) and \(\mathbb{P}[Y=0]=1-p\), which can be written compactly as

\[\mathbb{P}[Y=y]=p^y(1-p)^{1-y},\quad y=0,1.\tag{4.3}\]

Recall that a binomial variable with size \(n\) and probability \(p\), \(\mathrm{Bi}(n,p)\), is obtained by summing \(n\) independent \(\mathrm{Ber}(p)\) variables (so \(\mathrm{Ber}(p)\) is the same as \(\mathrm{Bi}(1,p)\)). This is why we need to use family = "binomial" in glm, to indicate that the response is binomial.

A Bernoulli variable \(Y\) is completely determined by the probability \(p\). So are its mean and variance:

  • \(\mathbb{E}[Y]=p\times 1+(1-p)\times 0=p\)
  • \(\mathbb{Var}[Y]=p(1-p)\)

In particular, recall that \(\mathbb{P}[Y=1]=\mathbb{E}[Y]=p\).
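
These identities can be checked empirically by simulation. A minimal sketch (the sample size and the value \(p=0.3\) are arbitrary choices for illustration):

# Simulate a Bernoulli sample and compare its sample mean and variance
# with p and p * (1 - p)
set.seed(123)
p <- 0.3
y <- rbinom(n = 1e4, size = 1, prob = p)  # Ber(p) is the same as Bi(1, p)
mean(y)  # Should be close to p = 0.3
var(y)   # Should be close to p * (1 - p) = 0.21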

This is something relatively uncommon (in a \(\mathcal{N}(\mu,\sigma^2)\), \(\mu\) determines the mean and \(\sigma^2\) the variance) that has important consequences for the logistic model: we do not need a \(\sigma^2\).

Are these Bernoulli variables? If so, what is the value of \(p\) and what could the codes 0 and 1 represent?

  • The toss of a fair coin.
  • A variable with mean \(p\) and variance \(p(1-p)\).
  • The roll of a die.
  • A binary variable with mean 0.5 and variance 0.45.
  • The winner of an election with two candidates.

Assume then that \(Y\) is a binary/Bernoulli variable and that \(X_1,\ldots,X_k\) are predictors associated to \(Y\) (no particular assumptions on them). The purpose in logistic regression is to estimate

\[p(x_1,\ldots,x_k)=\mathbb{P}[Y=1|X_1=x_1,\ldots,X_k=x_k]=\mathbb{E}[Y|X_1=x_1,\ldots,X_k=x_k],\]

that is, how the probability of \(Y=1\) changes according to particular values, denoted by \(x_1,\ldots,x_k\), of the random variables \(X_1,\ldots,X_k\). Here \(p(x_1,\ldots,x_k)=\mathbb{P}[Y=1|X_1=x_1,\ldots,X_k=x_k]\) stands for the conditional probability of \(Y=1\) given \(X_1,\ldots,X_k\). In view of (4.2), a tempting possibility is to consider the model

\[p(x_1,\ldots,x_k)=\beta_0+\beta_1x_1+\cdots+\beta_kx_k.\]

However, such a model will inevitably run into serious problems: negative probabilities and probabilities larger than one. A solution is to consider a function that takes the value \(z=\beta_0+\beta_1x_1+\cdots+\beta_kx_k\), in \(\mathbb{R}\), and maps it back to \([0,1]\). There are several alternatives to do so, based on distribution functions \(F:\mathbb{R}\to[0,1]\) that deliver \(y=F(z)\in[0,1]\) (see Figure 4.5). Different choices of \(F\) give rise to different models:

  • Uniform. Truncate \(z\) to 0 and 1 when \(z<0\) and \(z>1\), respectively.
  • Logit. Consider the logistic distribution function: \(F(z)=\mathrm{logistic}(z)=\frac{e^z}{1+e^z}=\frac{1}{1+e^{-z}}\).
  • Probit. Consider the normal distribution function, that is, \(F=\Phi\).

Figure 4.5: Different transformations mapping the response of a simple linear regression \(z=\beta_0+\beta_1x\) to \([0,1]\).
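
Figure 4.5 can be roughly reproduced with the built-in distribution functions punif, plogis, and pnorm; the following sketch uses arbitrary plotting choices:

# The three transformations F mapping z into [0, 1]
z <- seq(-5, 5, length.out = 200)
plot(z, punif(z, min = 0, max = 1), type = "l", lwd = 2, ylim = c(0, 1),
     ylab = "F(z)", main = "Transformations mapping z into [0, 1]")
lines(z, plogis(z), col = 2, lwd = 2)  # Logistic: 1 / (1 + exp(-z))
lines(z, pnorm(z), col = 3, lwd = 2)   # Probit: standard normal cdf
legend("topleft", legend = c("Uniform (truncation)", "Logit", "Probit"),
       col = 1:3, lwd = 2)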

The logistic transformation is the most employed due to its tractability, interpretability, and smoothness. Its inverse, \(F^{-1}:[0,1]\to\mathbb{R}\), known as the logit function, is

\[\mathrm{logit}(p)=\mathrm{logistic}^{-1}(p)=\log\frac{p}{1-p}.\]

This is a link function, that is, a function that maps a given space (in this case \([0,1]\)) into \(\mathbb{R}\). The term link function is employed in generalized linear models, which follow exactly the same philosophy as logistic regression: mapping the domain of \(Y\) to \(\mathbb{R}\) in order to apply there a linear model. We will concentrate here exclusively on the logit as a link function. Therefore, the logistic model is

\[p(x_1,\ldots,x_k)=\mathrm{logistic}(\beta_0+\beta_1x_1+\cdots+\beta_kx_k)=\frac{1}{1+e^{-(\beta_0+\beta_1x_1+\cdots+\beta_kx_k)}}.\tag{4.4}\]

The linear form inside the exponent has a clear interpretation:

  • If \(\beta_0+\beta_1x_1+\cdots+\beta_kx_k=0\), then \(p(x_1,\ldots,x_k)=\frac{1}{2}\) (\(Y=1\) and \(Y=0\) are equally likely).
  • If \(\beta_0+\beta_1x_1+\cdots+\beta_kx_k<0\), then \(p(x_1,\ldots,x_k)<\frac{1}{2}\) (\(Y=1\) is less likely).
  • If \(\beta_0+\beta_1x_1+\cdots+\beta_kx_k>0\), then \(p(x_1,\ldots,x_k)>\frac{1}{2}\) (\(Y=1\) is more likely).
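
These three facts can be verified with plogis, R's implementation of the logistic function:

# The logistic function evaluated at zero, negative, and positive values
plogis(0)            # Exactly 0.5
plogis(c(-2, -0.5))  # Below 0.5: Y = 1 less likely
plogis(c(0.5, 2))    # Above 0.5: Y = 1 more likely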

To be more precise on the interpretation of the coefficients \(\beta_0,\ldots,\beta_k\) we need to introduce the odds. The odds is an equivalent way of expressing the distribution of probabilities in a binary variable. Since \(\mathbb{P}[Y=1]=p\) and \(\mathbb{P}[Y=0]=1-p\), both the success and failure probabilities can be inferred from \(p\). Instead of using \(p\) to characterize the distribution of \(Y\), we can use

\[\mathrm{odds}(Y)=\frac{p}{1-p}=\frac{\mathbb{P}[Y=1]}{\mathbb{P}[Y=0]}.\tag{4.5}\]

The odds is the ratio between the probability of success and the probability of failure. It is extensively used in betting¹ due to its better interpretability. For example, if a horse \(Y\) has a probability \(p=2/3\) of winning a race (\(Y=1\)), then the odds of the horse is

\[\mathrm{odds}=\frac{p}{1-p}=\frac{2/3}{1/3}=2.\]

This means that the horse has a probability of winning that is twice as large as the probability of losing. This is sometimes written as 2:1 or \(2\times 1\) (spelled “two-to-one”). Conversely, if the odds of \(Y\) is given, we can easily obtain the probability of success \(p\) by using the inverse of (4.5):

\[p=\mathbb{P}[Y=1]=\frac{\mathrm{odds}(Y)}{1+\mathrm{odds}(Y)}.\]

For example, if the odds of the horse were 5, that would correspond to a probability of winning \(p=5/6\).

Recall that the odds is a number in \([0,+\infty]\). The values 0 and \(+\infty\) are attained for \(p=0\) and \(p=1\), respectively. The log-odds (or logit) is a number in \([-\infty,+\infty]\).
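
A small numerical illustration of (4.5) and its inverse (the probabilities are arbitrary):

# From probabilities to odds and back
p <- c(0.5, 2/3, 5/6)
odds <- p / (1 - p)  # 1, 2, 5
odds / (1 + odds)    # Recovers p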

We can rewrite (4.4) in terms of the odds (4.5). If we do so, we have

\[\mathrm{odds}(Y|X_1=x_1,\ldots,X_k=x_k)=\frac{p(x_1,\ldots,x_k)}{1-p(x_1,\ldots,x_k)}=e^{\beta_0+\beta_1x_1+\cdots+\beta_kx_k}=e^{\beta_0}e^{\beta_1x_1}\cdots e^{\beta_kx_k}\tag{4.6}\]

or, taking logarithms, the log-odds (or logit)

\[\log(\mathrm{odds}(Y|X_1=x_1,\ldots,X_k=x_k))=\beta_0+\beta_1x_1+\cdots+\beta_kx_k.\tag{4.7}\]

The conditional log-odds (4.7) plays here the role of the conditional mean for multiple linear regression. Therefore, we have an analogous interpretation for the coefficients:

  • \(\beta_0\): is the log-odds when \(X_1=\cdots=X_k=0\).
  • \(\beta_j\), \(1\leq j\leq k\): is the additive increment of the log-odds for an increment of one unit in \(X_j=x_j\), provided that the remaining variables \(X_1,\ldots,X_{j-1},X_{j+1},\ldots,X_k\) do not change.

The log-odds is not as easy to interpret as the odds. For that reason, an equivalent way of interpreting the coefficients, this time based on (4.6), is:

  • \(e^{\beta_0}\): is the odds when \(X_1=\cdots=X_k=0\).
  • \(e^{\beta_j}\), \(1\leq j\leq k\): is the multiplicative increment of the odds for an increment of one unit in \(X_j=x_j\), provided that the remaining variables \(X_1,\ldots,X_{j-1},X_{j+1},\ldots,X_k\) do not change. If the increment in \(X_j\) is of \(r\) units, then the multiplicative increment in the odds is \((e^{\beta_j})^r\).
As a consequence of this last interpretation, we have:
If \(\beta_j>0\) (respectively, \(\beta_j<0\)), then \(e^{\beta_j}>1\) (\(e^{\beta_j}<1\)) in (4.6). Therefore, an increment of one unit in \(X_j\), provided that the remaining variables \(X_1,\ldots,X_{j-1},X_{j+1},\ldots,X_k\) do not change, results in an increment (decrement) of the odds, that is, in an increment (decrement) of \(\mathbb{P}[Y=1]\).

Since the relationship between \(p(X_1,\ldots,X_k)\) and \(X_1,\ldots,X_k\) is not linear, \(\beta_j\) does not correspond to the change in \(p(X_1,\ldots,X_k)\) associated with a one-unit increase in \(X_j\).
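
A quick sketch of this fact, with the arbitrary choices \(\beta_0=0\) and \(\beta_1=1\): a one-unit increase in \(x\) always multiplies the odds by \(e^{\beta_1}\), but the induced change in the probability depends on the value of \(x\).

# With beta0 = 0 and beta1 = 1, the odds are always multiplied by exp(1),
# but the probability changes by different amounts
p <- function(x) plogis(0 + 1 * x)
p(1) - p(0)  # Change in probability near the center, approx. 0.23
p(4) - p(3)  # Change in probability at the tail, approx. 0.03
exp(1)       # Constant multiplicative change in the odds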

Let’s visualize these concepts quickly with the output of the Challenger case study:

nasa <- glm(fail.field ~ temp, family = "binomial", data = challenger)
summary(nasa)
## 
## Call:
## glm(formula = fail.field ~ temp, family = "binomial", data = challenger)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.0566  -0.7575  -0.3818   0.4571   2.2195  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept)   7.5837     3.9146   1.937   0.0527 .
## temp         -0.4166     0.1940  -2.147   0.0318 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 28.267  on 22  degrees of freedom
## Residual deviance: 20.335  on 21  degrees of freedom
## AIC: 24.335
## 
## Number of Fisher Scoring iterations: 5
exp(coef(nasa)) # Exponentiated coefficients ("odds ratios")
##  (Intercept)         temp 
## 1965.9743592    0.6592539

# Plot data
plot(challenger$temp, challenger$fail.field, xlim = c(-1, 30), xlab = "Temperature",
     ylab = "Incident probability")

# Draw the fitted logistic curve
x <- seq(-1, 30, l = 200)
y <- exp(-(nasa$coefficients[1] + nasa$coefficients[2] * x))
y <- 1 / (1 + y)
lines(x, y, col = 2, lwd = 2)

# The Challenger
points(-0.6, 1, pch = 16)
text(-0.6, 1, labels = "Challenger", pos = 4)

The exponentials of the estimated coefficients are:

  • \(e^{\hat\beta_0}=1965.974\). This means that, when the temperature is zero, the fitted odds is 1965.974, so the probability of having an incident (\(Y=1\)) is 1965.974 times larger than the probability of not having an incident (\(Y=0\)). In other words, the probability of having an incident at temperature zero is \(\frac{1965.974}{1965.974+1}=0.999\).
  • \(e^{\hat\beta_1}=0.659\). This means that each Celsius degree increment in the temperature multiplies the fitted odds by a factor of 0.65923, hence reducing it.
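
These interpretations can be checked from the fitted model: the ratio of the fitted odds at two temperatures one degree apart equals \(e^{\hat\beta_1}\), and the fitted probability at temperature zero matches the value computed above.

# Fitted probabilities at temperatures 0, 10, and 11 degrees Celsius
p_hat <- predict(nasa, newdata = data.frame(temp = c(0, 10, 11)),
                 type = "response")
p_hat[1]                         # Approx. 0.999
odds_hat <- p_hat / (1 - p_hat)  # Fitted odds
odds_hat[3] / odds_hat[2]        # Equals exp(coef(nasa)["temp"]) = 0.659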

The estimation of \(\boldsymbol{\beta}=(\beta_0,\beta_1,\ldots,\beta_k)\) from a sample \((\mathbf{X}_1,Y_1),\ldots,(\mathbf{X}_n,Y_n)\) is different than in linear regression. It is not based on minimizing the RSS but on the principle of Maximum Likelihood Estimation (MLE). MLE is based on the following leitmotiv: which coefficients \(\boldsymbol{\beta}\) make the sample most likely? Or, in other words, which coefficients make the model most probable, based on the sample? Since \(Y_i\sim\mathrm{Ber}(p(\mathbf{X}_i))\), \(i=1,\ldots,n\), the likelihood of \(\boldsymbol{\beta}\) is

\[\mathrm{lik}(\boldsymbol{\beta})=\prod_{i=1}^n p(\mathbf{X}_i)^{Y_i}(1-p(\mathbf{X}_i))^{1-Y_i}.\tag{4.8}\]

\(\mathrm{lik}(\boldsymbol{\beta})\) is the probability of the data based on the model. Therefore, it is a number between 0 and 1. Its detailed interpretation is the following:

  • \(\prod_{i=1}^n\) appears because the sample elements are assumed to be independent and we are computing the probability of observing the whole sample \((\mathbf{X}_1,Y_1),\ldots,(\mathbf{X}_n,Y_n)\). This probability is equal to the product of the probabilities of observing each \((\mathbf{X}_i,Y_i)\).
  • \(p(\mathbf{X}_i)^{Y_i}(1-p(\mathbf{X}_i))^{1-Y_i}\) is the probability of observing \((\mathbf{X}_i,Y_i)\), as given by (4.3). Remember that \(p\) depends on \(\boldsymbol{\beta}\) due to (4.4).

Usually, the log-likelihood is considered instead of the likelihood for stability reasons: the estimates obtained are exactly the same and are

\[\hat{\boldsymbol{\beta}}=\arg\max_{\boldsymbol{\beta}\in\mathbb{R}^{k+1}}\log\mathrm{lik}(\boldsymbol{\beta}).\]

Unfortunately, due to the non-linearity of the optimization problem, there are no explicit expressions for \(\hat{\boldsymbol{\beta}}\). These have to be obtained numerically by means of an iterative procedure (the number of iterations required is printed in the output of summary). In low sample size situations with perfect classification, the iterative procedure may not converge.
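
To make the optimization explicit, the following sketch maximizes the log-likelihood of (4.8) for the Challenger data with a general-purpose optimizer and compares the result with glm's coefficients. The use of optim with BFGS is just one possible numerical routine, not the Fisher scoring algorithm employed internally by glm.

# Minus log-likelihood of beta = (beta0, beta1) for fail.field ~ temp,
# written in terms of the linear predictor for numerical stability
minusLogLik <- function(beta) {
  eta <- beta[1] + beta[2] * challenger$temp
  -sum(challenger$fail.field * plogis(eta, log.p = TRUE) +
         (1 - challenger$fail.field) * plogis(-eta, log.p = TRUE))
}

# Numerical maximization (optim minimizes, hence the minus)
opt <- optim(par = c(0, 0), fn = minusLogLik, method = "BFGS")
opt$par      # Close to coef(nasa)
-opt$value   # Close to logLik(nasa)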

Figure 4.6 shows how the log-likelihood changes with respect to the values of \((\beta_0,\beta_1)\) in three data patterns.

Figure 4.6: The logistic regression fit and its dependence on \(\beta_0\) (horizontal displacement) and \(\beta_1\) (steepness of the curve). Recall the effect of the sign of \(\beta_1\) on the curve: if positive, the logistic curve has an ‘s’ form; if negative, the form is a reflected ‘s.’

The data for the illustration have been generated with the following code:

# Simulated data: one predictor and three binary responses
set.seed(34567)
x <- rnorm(50, sd = 1.5)

# Linear predictors for three different logistic models
y1 <- -0.5 + 3 * x
y2 <- 0.5 - 2 * x
y3 <- -2 + 5 * x

# Binary responses sampled from the corresponding Bernoulli probabilities
y1 <- rbinom(50, size = 1, prob = 1 / (1 + exp(-y1)))
y2 <- rbinom(50, size = 1, prob = 1 / (1 + exp(-y2)))
y3 <- rbinom(50, size = 1, prob = 1 / (1 + exp(-y3)))

# Gather everything in a data frame
dataMle <- data.frame(x = x, y1 = y1, y2 = y2, y3 = y3)

Let’s check that indeed the coefficients given by glm are the ones that maximize the likelihood given in the animation of Figure 4.6. We do so for the regression y1 ~ x.

mod <- glm(y1 ~ x, family = "binomial", data = dataMle)
mod$coefficients
## (Intercept)           x 
##  -0.1691947   2.4281626

For the regressions y2 ~ x and y3 ~ x, do the following:

  • Check that the true \(\boldsymbol{\beta}\) is close to maximizing the likelihood computed in Figure 4.6.
  • Plot the fitted logistic curve and compare it with the one in Figure 4.6.

In linear regression we relied on least squares estimation, in other words, the minimization of the RSS. Why do we need MLE in logistic regression and not least squares? The answer is two-fold:

  1. MLE is asymptotically optimal when estimating the unknown parameters of a model. This means that, when the sample size \(n\) is large, it is guaranteed to perform at least as well as any other estimation method. Therefore, considering a least squares approach for logistic regression will result in suboptimal estimates.
  2. In multiple linear regression, due to the normality assumption, MLE and least squares estimation coincide. So MLE is hidden under the form of least squares, which is a more intuitive estimation procedure. Indeed, the maximized likelihood \(\mathrm{lik}(\hat{\boldsymbol{\beta}})\) in the linear model and the RSS are intimately related, as the quick check below illustrates.
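
A minimal sketch of this relation with simulated data. It assumes the identity \(\log\mathrm{lik}(\hat{\boldsymbol{\beta}})=-\frac{n}{2}\left(\log\left(2\pi\,\mathrm{RSS}/n\right)+1\right)\), which follows from plugging the MLE of the variance, \(\hat\sigma^2=\mathrm{RSS}/n\), into the normal log-likelihood:

# In the linear model, the maximized log-likelihood is a decreasing
# function of the RSS
set.seed(12345)
n <- 100
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
fit <- lm(y ~ x)
RSS <- sum(residuals(fit)^2)
logLik(fit)                           # Maximized log-likelihood reported by R
-n / 2 * (log(2 * pi * RSS / n) + 1)  # The same value, computed from the RSS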

As in the linear model, the inclusion of a new predictor changes the coefficient estimates of the logistic model.
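
A minimal sketch of this phenomenon with simulated data (the data-generating choices below are arbitrary and only for illustration): the estimated coefficient of x1 changes once the correlated predictor x2 enters the model.

# Two correlated predictors and a binary response depending on both
set.seed(1234)
n <- 200
x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n, sd = 0.5)
y <- rbinom(n, size = 1, prob = plogis(-0.5 + x1 + x2))
coef(glm(y ~ x1, family = "binomial"))       # Coefficient of x1 alone
coef(glm(y ~ x1 + x2, family = "binomial"))  # Changes when x2 is included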


  1. Recall that the result of a bet is binary: you either win or lose the bet.