5.2 Model formulation and estimation

For simplicity, we first study logistic regression and then the general case of generalized linear models.

5.2.1 Logistic regression

As we saw in Section 2.2, the multiple linear model described the relation between the random variables X1, …, Xp and Y by assuming a linear relation in the conditional expectation:

\[
E[Y \mid X_1 = x_1, \ldots, X_p = x_p] = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p. \tag{5.1}
\]

In addition, it made three more assumptions on the data (see Section 2.3), which resulted in the following one-line summary of the linear model:

\[
Y \mid (X_1 = x_1, \ldots, X_p = x_p) \sim \mathcal{N}(\beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p, \sigma^2).
\]

Recall that a necessary condition for the linear model to hold is that Y is continuous, in order to satisfy the normality of the errors. Therefore, the linear model is designed for a continuous response.

The situation when Y is discrete (naturally ordered values) or categorical (non-ordered categories) requires a different treatment. The simplest situation is when Y is binary: it can only take two values, coded for convenience as 1 (success) and 0 (failure). For binary variables there is no fundamental distinction between the treatment of discrete and categorical variables. Formally, a binary variable is referred to as a Bernoulli variable:144 Y ∼ Ber(p), 0 ≤ p ≤ 1,145 if

\[
Y = \begin{cases} 1, & \text{with probability } p, \\ 0, & \text{with probability } 1 - p, \end{cases}
\]

or, equivalently, if

\[
P[Y = y] = p^y (1 - p)^{1 - y}, \quad y = 0, 1. \tag{5.2}
\]

Recall that a Bernoulli variable is completely determined by the probability p. Therefore, so are its mean and variance:

\[
E[Y] = P[Y = 1] = p \quad \text{and} \quad \mathrm{Var}[Y] = p(1 - p).
\]

Assume then that Y is a Bernoulli variable and that X1, …, Xp are predictors associated with Y. The purpose of logistic regression is to model

\[
E[Y \mid X_1 = x_1, \ldots, X_p = x_p] = P[Y = 1 \mid X_1 = x_1, \ldots, X_p = x_p], \tag{5.3}
\]

that is, to model how the conditional expectation of Y or, equivalently, the conditional probability of Y = 1, changes according to particular values of the predictors. In view of (5.1), a tempting possibility is to consider the model

\[
E[Y \mid X_1 = x_1, \ldots, X_p = x_p] = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p =: \eta.
\]

However, such a model inevitably runs into serious problems: it can produce negative probabilities and probabilities larger than one.

A solution is to consider a link function g that takes the value of E[Y | X1 = x1, …, Xp = xp] and maps it to R. Or, alternatively, a function g⁻¹ that takes η ∈ R and maps it to [0, 1], the range of possible values of E[Y | X1 = x1, …, Xp = xp]. There are several link functions g with associated inverses g⁻¹. Each link generates a different model:

  • Uniform link. Based on the truncation g⁻¹(η) = η·1{0 < η < 1} + 1{η ≥ 1}.
  • Probit link. Based on the normal cdf, that is, g⁻¹(η) = Φ(η).
  • Logit link. Based on the logistic cdf:146

\[
g^{-1}(\eta) = \mathrm{logistic}(\eta) := \frac{e^{\eta}}{1 + e^{\eta}} = \frac{1}{1 + e^{-\eta}}.
\]


Figure 5.4: Transformations g⁻¹ associated with different link functions. The transformations g⁻¹ map the response of a linear regression η = β0 + β1x1 + ⋯ + βpxp to [0, 1].

The logistic transformation is the most employed due to its tractability, interpretability, and smoothness.147 Its inverse, g : [0, 1] → R, is known as the logit function:

\[
\mathrm{logit}(p) := \mathrm{logistic}^{-1}(p) = \log\left(\frac{p}{1 - p}\right).
\]
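In R, the logistic and logit transformations are readily available as plogis() and qlogis(), the cdf and the quantile function of the (standard) logistic distribution. A quick numerical check that they match the definitions above and are mutual inverses:

# plogis() is the logistic function; qlogis() is the logit function
plogis(0)                                        # logistic(0) = 0.5
qlogis(0.5)                                      # logit(0.5) = 0
all.equal(plogis(-3:3), 1 / (1 + exp(-(-3:3))))  # matches the definition
qlogis(plogis(2.5))                              # recovers 2.5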

In conclusion, with the logit link function we can map the conditional expectation of Y, which lives in [0, 1], to R in order to apply a linear model. The logistic model can then be equivalently stated as

\[
E[Y \mid X_1 = x_1, \ldots, X_p = x_p] = \mathrm{logistic}(\eta) = \frac{1}{1 + e^{-\eta}}, \tag{5.4}
\]

or as

\[
\mathrm{logit}(E[Y \mid X_1 = x_1, \ldots, X_p = x_p]) = \eta, \tag{5.5}
\]

where recall that

\[
\eta = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p.
\]

There is a clear interpretation of the role of the linear predictor η in (5.4) when we come back to (5.3):

  • If η = 0, then P[Y = 1 | X1 = x1, …, Xp = xp] = 1/2 (Y = 1 and Y = 0 are equally likely).
  • If η < 0, then P[Y = 1 | X1 = x1, …, Xp = xp] < 1/2 (Y = 1 is less likely).
  • If η > 0, then P[Y = 1 | X1 = x1, …, Xp = xp] > 1/2 (Y = 1 is more likely).

To be more precise about the interpretation of the coefficients, we need to introduce the odds. The odds is an equivalent way of expressing the distribution of probabilities in a binary variable Y. Instead of using p to characterize the distribution of Y, we can use

\[
\mathrm{odds}(Y) := \frac{p}{1 - p} = \frac{P[Y = 1]}{P[Y = 0]}. \tag{5.6}
\]

The odds is thus the ratio between the probability of success and the probability of failure.148 It is extensively used in betting149 due to its better interpretability.150 Conversely, if the odds of Y is given, we can easily recover the probability of success p by using the inverse of (5.6):151

\[
p = P[Y = 1] = \frac{\mathrm{odds}(Y)}{1 + \mathrm{odds}(Y)}.
\]

Recall that the odds is a number in [0, +∞]. The 0 and +∞ values are attained for p = 0 and p = 1, respectively. The log-odds (or logit) is a number in [−∞, +∞].
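A small numerical illustration of (5.6) and its inverse, using the horse-race probabilities mentioned in the footnotes:

# From probability to odds and back
p <- 2 / 3
odds <- p / (1 - p)  # 2: winning is twice as likely as losing
odds / (1 + odds)    # recovers p = 2/3
log(odds)            # the log-odds (logit); equals qlogis(p)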

We can rewrite (5.4) in terms of the odds (5.6),152 obtaining:

\[
\mathrm{odds}(Y \mid X_1 = x_1, \ldots, X_p = x_p) = e^{\eta} = e^{\beta_0} e^{\beta_1 x_1} \cdots e^{\beta_p x_p}. \tag{5.7}
\]

Alternatively, taking logarithms, we have the log-odds (or logit)

\[
\log(\mathrm{odds}(Y \mid X_1 = x_1, \ldots, X_p = x_p)) = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p. \tag{5.8}
\]

The conditional log-odds (5.8) plays the role of the conditional mean for multiple linear regression. Therefore, we have an analogous interpretation for the coefficients:

  • β0: is the log-odds when X1 = ⋯ = Xp = 0.
  • βj, 1 ≤ j ≤ p: is the additive increment of the log-odds for an increment of one unit in Xj = xj, provided that the remaining variables X1, …, Xj−1, Xj+1, …, Xp do not change.

The log-odds is not as easy to interpret as the odds. For that reason, an equivalent way of interpreting the coefficients, this time based on (5.7), is:

  • e^{β0}: is the odds when X1 = ⋯ = Xp = 0.
  • e^{βj}, 1 ≤ j ≤ p: is the multiplicative increment of the odds for an increment of one unit in Xj = xj, provided that the remaining variables X1, …, Xj−1, Xj+1, …, Xp do not change. If the increment in Xj is of r units, then the multiplicative increment in the odds is (e^{βj})^r.

As a consequence of this last interpretation, we have:

If βj > 0 (respectively, βj < 0), then e^{βj} > 1 (e^{βj} < 1) in (5.7). Therefore, an increment of one unit in Xj, provided that the remaining variables do not change, results in a positive (negative) increment in the odds and in P[Y = 1 | X1 = x1, …, Xp = xp].

Case study application

In the Challenger case study we used fail.field as an indicator of whether “there was at least an incident with the O-rings” (1 = yes, 0 = no). Let’s see if the temperature was associated with O-ring incidents (Q1). For that, we compute the logistic regression of fail.field on temp and we plot the fitted logistic curve.

# Logistic regression: computed with glm and family = "binomial"
nasa <- glm(fail.field ~ temp, family = "binomial", data = challenger)

# Plot data
plot(challenger$temp, challenger$fail.field, xlim = c(-1, 30),
     xlab = "Temperature", ylab = "Incident probability")

# Draw the fitted logistic curve
x <- seq(-1, 30, l = 200)
y <- exp(-(nasa$coefficients[1] + nasa$coefficients[2] * x))
y <- 1 / (1 + y)
lines(x, y, col = 2, lwd = 2)

# The Challenger
points(-0.6, 1, pch = 16)
text(-0.6, 1, labels = "Challenger", pos = 4)

In view of this curve and the summary, it seems that the temperature was affecting the probability of an O-ring incident (Q1). Let’s quantify this statement and answer Q2 by looking at the coefficients of the model:

# Exponentiated coefficients ("odds ratios")
exp(coef(nasa))
##  (Intercept)         temp 
## 1965.9743592    0.6592539

The exponentials of the estimated coefficients are:

  • e^{β̂0} = 1965.974. This means that, when the temperature is zero, the fitted odds is 1965.974, so the (estimated) probability of having an incident (Y = 1) is 1965.974 times larger than the probability of not having an incident (Y = 0). Or, in other words, the probability of having an incident at temperature zero is 1965.974 / (1965.974 + 1) ≈ 0.999.
  • e^{β̂1} = 0.659. This means that each Celsius-degree increment in the temperature multiplies the fitted odds by a factor of approximately 0.659, hence reducing it.
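These fitted odds translate directly into predicted probabilities. As a sketch (reusing the nasa fit above), the estimated probability of an incident at the Challenger launch temperature of −0.6 degrees, the point highlighted in the previous plot, can be obtained with predict():

# Estimated incident probability at temp = -0.6; type = "response" applies
# the logistic transformation to the linear predictor
predict(nasa, newdata = data.frame(temp = -0.6), type = "response")

# Equivalent manual computation from the coefficients
plogis(nasa$coefficients[1] + nasa$coefficients[2] * (-0.6))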

However, for the moment we cannot say whether these findings are significant or are just an artifact of the randomness of the data, since we do not have information on the variability of the estimates of β. We will need inference for that.

Estimation by maximum likelihood

The estimation of β from a sample {(xi, Yi)}, i = 1, …, n,153 is done by Maximum Likelihood Estimation (MLE). As can be seen in Appendix A.2, in the linear model, under the assumptions mentioned in Section 2.3, MLE is equivalent to least squares estimation. In the logistic model, we assume that154

\[
Y_i \mid (X_1 = x_{i1}, \ldots, X_p = x_{ip}) \sim \mathrm{Ber}(\mathrm{logistic}(\eta_i)), \quad i = 1, \ldots, n,
\]

where ηi := β0 + β1xi1 + ⋯ + βpxip. Denoting pi(β) := logistic(ηi), the log-likelihood of β is

\[
\ell(\beta) = \log\left(\prod_{i=1}^n p_i(\beta)^{Y_i} (1 - p_i(\beta))^{1 - Y_i}\right) = \sum_{i=1}^n \left[Y_i \log(p_i(\beta)) + (1 - Y_i) \log(1 - p_i(\beta))\right]. \tag{5.9}
\]

The ML estimate of β is

\[
\hat{\beta} := \arg\max_{\beta \in \mathbb{R}^{p+1}} \ell(\beta).
\]

Unfortunately, due to the nonlinearity of (5.9), there is no explicit expression for β̂ and it has to be obtained numerically by means of an iterative procedure. We will see that in more detail in the next section. Just be aware that this iterative procedure may fail to converge in low sample size situations with perfect classification, where the likelihood might be numerically unstable.
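A toy sketch of this issue, with hypothetical data that is perfectly separated by the predictor: glm() typically emits warnings (for instance, about fitted probabilities being numerically 0 or 1) and returns huge coefficients, since the likelihood keeps increasing as the slope grows.

# Perfect classification: all failures below x = 0, all successes above
x_sep <- c(-3, -2, -1, 1, 2, 3)
y_sep <- c(0, 0, 0, 1, 1, 1)
glm(y_sep ~ x_sep, family = "binomial")  # expect warnings and a huge slope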

Figure 5.5: The logistic regression fit and its dependence on β0 (horizontal displacement) and β1 (steepness of the curve). Recall the effect of the sign of β1 in the curve: if positive, the logistic curve has an ‘s’ form; if negative, the form is a reflected ‘s’.

Figure 5.5 shows how the log-likelihood changes with respect to the values of (β0, β1) in three data patterns. The data for the illustration have been generated with the next chunk of code.

# Data
set.seed(34567)
x <- rnorm(50, sd = 1.5)
y1 <- -0.5 + 3 * x
y2 <- 0.5 - 2 * x
y3 <- -2 + 5 * x
y1 <- rbinom(50, size = 1, prob = 1 / (1 + exp(-y1)))
y2 <- rbinom(50, size = 1, prob = 1 / (1 + exp(-y2)))
y3 <- rbinom(50, size = 1, prob = 1 / (1 + exp(-y3)))

# Data
dataMle <- data.frame(x = x, y1 = y1, y2 = y2, y3 = y3)

For fitting a logistic model we employ glm, which has the syntax glm(formula = response ~ predictor, family = "binomial", data = data), where response is a binary variable. Note that family = "binomial" refers to the fact that the response is a binomial variable (since a Bernoulli is a binomial with size one). Let’s check that indeed the coefficients given by glm are the ones that maximize the likelihood given in the animation of Figure 5.5. We do so for y1 ~ x.

# Call glm
mod <- glm(y1 ~ x, family = "binomial", data = dataMle)
mod$coefficients
## (Intercept)           x 
##  -0.1691947   2.4281626

# -loglik(beta)
minusLogLik <- function(beta) {
  p <- 1 / (1 + exp(-(beta[1] + beta[2] * x)))
  -sum(y1 * log(p) + (1 - y1) * log(1 - p))
}

# Optimization using as starting values beta = c(0, 0)
opt <- optim(par = c(0, 0), fn = minusLogLik)
opt
## $par
## [1] -0.1691366  2.4285119
## 
## $value
## [1] 14.79376
## 
## $counts
## function gradient 
##       73       NA 
## 
## $convergence
## [1] 0
## 
## $message
## NULL

# Visualization of the log-likelihood surface
beta0 <- seq(-3, 3, l = 50)
beta1 <- seq(-2, 8, l = 50)
L <- matrix(nrow = length(beta0), ncol = length(beta1))
for (i in seq_along(beta0)) {
  for (j in seq_along(beta1)) {
    L[i, j] <- minusLogLik(c(beta0[i], beta1[j]))
  }
}
filled.contour(beta0, beta1, -L, color.palette = viridis::viridis,
               xlab = expression(beta[0]), ylab = expression(beta[1]),
               plot.axes = {
                 axis(1); axis(2)
                 points(mod$coefficients[1], mod$coefficients[2],
                        col = 2, pch = 16)
                 points(opt$par[1], opt$par[2], col = 4)
               })

Figure 5.6: Log-likelihood surface ℓ(β0, β1) and its global maximum (β̂0, β̂1).

# The plot.axes argument is a hack to add graphical information within the
# coordinates of the main panel (behind filled.contour there is a layout()...)

For the regressions y2 ~ x and y3 ~ x, do the following:

  1. Check that the true β is close to maximizing the likelihood computed in Figure 5.5.
  2. Plot the fitted logistic curve and compare it with the one in Figure 5.5.

The extension of the logistic model to the case of a categorical response with more than two levels is sketched in Appendix A.3.

5.2.2 General case

The same idea we used in logistic regression, namely transforming the conditional expectation of Y into something that can be modeled by a linear model (that is, a quantity that lives in R), can be generalized. This gives rise to the family of generalized linear models, which extends the linear model to different kinds of response variables and provides a convenient parametric framework.

The first ingredient is a link function g, monotone and differentiable, which produces a transformed expectation155 to be modeled by a linear combination of the predictors:

\[
g(E[Y \mid X_1 = x_1, \ldots, X_p = x_p]) = \eta
\]

or, equivalently,

\[
E[Y \mid X_1 = x_1, \ldots, X_p = x_p] = g^{-1}(\eta),
\]

where

\[
\eta := \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p
\]

is the linear predictor.

The second ingredient of generalized linear models is a distribution for Y | (X1, …, Xp), just as the linear model assumes normality or the logistic model assumes a Bernoulli random variable. Thus, we have two linked generalizations with respect to the usual linear model:

  1. The conditional mean may be modeled by a transformation g⁻¹ of the linear predictor η.
  2. The distribution of Y | (X1, …, Xp) may be different from the normal.
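In R, these two ingredients are encapsulated in the family object passed to glm(): each family stores, among other things, the link g, its inverse g⁻¹, and the variance function. A quick sketch inspecting the binomial family:

# A family object collects the link, its inverse, and the variance function
fam <- binomial()  # logit link by default
fam$link           # "logit"
fam$linkfun(0.25)  # g(0.25) = logit(0.25)
fam$linkinv(1)     # g^{-1}(1) = logistic(1)
fam$variance(0.25) # variance function at mu = 0.25, here 0.25 * (1 - 0.25)
binomial(link = "probit")$linkinv(1)  # the probit alternative, pnorm(1)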

Generalized linear models are intimately related to the exponential family,156 157 which is the family of distributions with pdf expressible as

\[
f(y; \theta, \phi) = \exp\left\{\frac{y\theta - b(\theta)}{a(\phi)} + c(y, \phi)\right\}, \tag{5.10}
\]

where a(·), b(·), and c(·, ·) are specific functions. If Y has the pdf (5.10), then we write Y ∼ E(θ, ϕ, a, b, c). If the scale parameter ϕ is known, this is an exponential family with canonical parameter θ (if ϕ is unknown, then it may or may not be a two-parameter exponential family).

Distributions from the exponential family have some nice properties. Importantly, if Y ∼ E(θ, ϕ, a, b, c), then

\[
\mu := E[Y] = b'(\theta), \qquad \sigma^2 := \mathrm{Var}[Y] = b''(\theta)\, a(\phi). \tag{5.11}
\]

The canonical link function is the function g that transforms μ = b′(θ) into the canonical parameter θ. For E(θ, ϕ, a, b, c), this happens if

\[
\theta = g(\mu) \tag{5.12}
\]

or, more explicitly due to (5.11), if

\[
g(\mu) = (b')^{-1}(\mu). \tag{5.13}
\]

In the case of a canonical link function, the one-line summary of the generalized linear model is (independence is implicit)

\[
Y \mid (X_1 = x_1, \ldots, X_p = x_p) \sim \mathrm{E}(\eta, \phi, a, b, c). \tag{5.14}
\]

Expression (5.14) gives insight into what a generalized linear model does:

  1. Select a member of the exponential family in (5.10) for modeling Y.
  2. The canonical link function g is g(μ) = (b′)⁻¹(μ). In this case, θ = g(μ).
  3. The generalized linear model associated with that member of the exponential family and g models the conditional θ, given X1, …, Xp, by means of the linear predictor η. This is equivalent to modeling the conditional μ by means of g⁻¹(η).

The linear model arises as a particular case of (5.14) with

\[
a(\phi) = \phi, \qquad b(\theta) = \frac{\theta^2}{2}, \qquad c(y, \phi) = -\frac{1}{2}\left\{\frac{y^2}{\phi} + \log(2\pi\phi)\right\},
\]

and scale parameter ϕ = σ². In this case, μ = θ and the canonical link function g is the identity.
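As a quick check of (5.11) in this Gaussian case (a verification of the previous statements, not new material):

\[
b'(\theta) = \theta = \mu, \qquad b''(\theta)\, a(\phi) = 1 \cdot \phi = \sigma^2,
\]

which recovers the mean and variance of the normal distribution, and g(μ) = (b′)⁻¹(μ) = μ confirms that the canonical link is indeed the identity.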

Show that the normal, Bernoulli, exponential, and Poisson distributions are members of the exponential family. To do so, express their pdfs in the form of (5.10) and identify θ and ϕ.
Show that the binomial and gamma (which includes the exponential and chi-squared) distributions are members of the exponential family. To do so, express their pdfs in the form of (5.10) and identify θ and ϕ.

The following table lists some useful generalized linear models. Recall that the linear and logistic models of Sections 2.2.3 and 5.2.1 are obtained from the first and second rows, respectively.

| Support of Y | Generating distribution | Link g(μ) | Expectation g⁻¹(η) | Scale ϕ | Distribution of Y given X = x |
|---|---|---|---|---|---|
| R | N(μ, σ²) | μ | η | σ² | N(η, σ²) |
| {0, 1} | Ber(p) | logit(μ) | logistic(η) | 1 | Ber(logistic(η)) |
| {0, …, N} | B(N, p) | log(μ / (N − μ)) | N logistic(η) | 1 | B(N, logistic(η)) |
| {0, 1, …} | Pois(λ) | log(μ) | e^η | 1 | Pois(e^η) |
| (0, ∞) | Γ(a, ν)158 | −1/μ | −1/η | 1/ν | Γ(−ην, ν)159 |
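The links in the table match the canonical (default) links of the corresponding R families, which can be checked directly. A small sketch (for the gamma, R's "inverse" link is 1/μ, that is, the canonical link up to sign):

# Default (canonical) links of the families in the table
gaussian()$link  # "identity"
binomial()$link  # "logit"
poisson()$link   # "log"
Gamma()$link     # "inverse"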
Obtain the canonical link function for the exponential distribution Exp(λ). What is the scale parameter? What is the distribution of Y | (X1 = x1, …, Xp = xp) in such a model?

Poisson regression

Poisson regression is usually employed for modeling count data that arises from the recording of the frequencies of a certain phenomenon. It considers that

\[
Y \mid (X_1 = x_1, \ldots, X_p = x_p) \sim \mathrm{Pois}(e^{\eta}),
\]

that is,

\[
E[Y \mid X_1 = x_1, \ldots, X_p = x_p] = \lambda(Y \mid X_1 = x_1, \ldots, X_p = x_p) = e^{\beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p}. \tag{5.15}
\]

Let’s see how to apply a Poisson regression. For that aim, we consider the species dataset. The goal is to analyze whether the Biomass and the pH (a factor) of the terrain are influential on the number of Species. Incidentally, it will serve to illustrate that the use of factors within glm is completely analogous to what we did with lm.

# Read data
species <- read.table("species.txt", header = TRUE)
species$pH <- as.factor(species$pH)
# Plot data
plot(Species ~ Biomass, data = species, col = as.numeric(pH))
legend("topright", legend = c("High pH", "Medium pH", "Low pH"),
       col = c(1, 3, 2), lwd = 2) # colors according to as.numeric(pH)


# Fit Poisson regression
species1 <- glm(Species ~ ., data = species, family = poisson)
summary(species1)
## 
## Call:
## glm(formula = Species ~ ., family = poisson, data = species)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5959  -0.6989  -0.0737   0.6647   3.5604  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  3.84894    0.05281  72.885  < 2e-16 ***
## pHlow       -1.13639    0.06720 -16.910  < 2e-16 ***
## pHmed       -0.44516    0.05486  -8.114 4.88e-16 ***
## Biomass     -0.12756    0.01014 -12.579  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 452.346  on 89  degrees of freedom
## Residual deviance:  99.242  on 86  degrees of freedom
## AIC: 526.43
## 
## Number of Fisher Scoring iterations: 4
# Took 4 iterations of the IRLS

# Interpretation of the coefficients:
exp(species1$coefficients)
## (Intercept)       pHlow       pHmed     Biomass 
##  46.9433686   0.3209744   0.6407222   0.8802418
# - 46.9433 is the average number of species when Biomass = 0 and the pH is high
# - For each increment in one unit in Biomass, the number of species decreases
#   by a factor of 0.88 (12% reduction)
# - If pH decreases to med (low), then the number of species decreases by a factor
#   of 0.6407 (0.3209)

# With interactions
species2 <- glm(Species ~ Biomass * pH, data = species, family = poisson)
summary(species2)
## 
## Call:
## glm(formula = Species ~ Biomass * pH, family = poisson, data = species)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4978  -0.7485  -0.0402   0.5575   3.2297  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    3.76812    0.06153  61.240  < 2e-16 ***
## Biomass       -0.10713    0.01249  -8.577  < 2e-16 ***
## pHlow         -0.81557    0.10284  -7.931 2.18e-15 ***
## pHmed         -0.33146    0.09217  -3.596 0.000323 ***
## Biomass:pHlow -0.15503    0.04003  -3.873 0.000108 ***
## Biomass:pHmed -0.03189    0.02308  -1.382 0.166954    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 452.346  on 89  degrees of freedom
## Residual deviance:  83.201  on 84  degrees of freedom
## AIC: 514.39
## 
## Number of Fisher Scoring iterations: 4
exp(species2$coefficients)
##   (Intercept)       Biomass         pHlow         pHmed Biomass:pHlow Biomass:pHmed 
##    43.2987424     0.8984091     0.4423865     0.7178730     0.8563910     0.9686112
# - If pH decreases to med (low), then the effect of the biomass in the number
#   of species decreases by a factor of 0.9686 (0.8564). The higher the pH, the
#   stronger the effect of the Biomass in Species

# Draw fits
plot(Species ~ Biomass, data = species, col = as.numeric(pH))
legend("topright", legend = c("High pH", "Medium pH", "Low pH"),
       col = c(1, 3, 2), lwd = 2) # colors according to as.numeric(pH)

# Without interactions
bio <- seq(0, 10, l = 100)
z <- species1$coefficients[1] + species1$coefficients[4] * bio
lines(bio, exp(z), col = 1)
lines(bio, exp(species1$coefficients[2] + z), col = 2)
lines(bio, exp(species1$coefficients[3] + z), col = 3)

# With interactions seems to provide a significant improvement
bio <- seq(0, 10, l = 100)
z <- species2$coefficients[1] + species2$coefficients[2] * bio
lines(bio, exp(z), col = 1, lty = 2)
lines(bio, exp(species2$coefficients[3] + species2$coefficients[5] * bio + z),
      col = 2, lty = 2)
lines(bio, exp(species2$coefficients[4] + species2$coefficients[6] * bio + z),
      col = 3, lty = 2)
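Expected counts for new observations follow from (5.15) via predict() with type = "response", which exponentiates the linear predictor. A sketch with the interaction model, assuming the pH factor levels are "high", "low", and "med" (as suggested by the coefficient names):

# Expected number of species at Biomass = 5 for each pH level
newdat <- data.frame(Biomass = 5, pH = c("high", "low", "med"))
predict(species2, newdata = newdat, type = "response")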

For the challenger dataset, do the following:

  1. Do a Poisson regression of the total number of incidents, nfails.field + nfails.nozzle, on temp.
  2. Plot the data and the fitted Poisson regression curve.
  3. Predict the expected number of incidents at temperatures −0.6 and 11.67.

Binomial regression

Binomial regression is an extension of logistic regression that allows modeling discrete responses Y in {0, 1, …, N}, where N is fixed. In its most vanilla version, it considers the model

\[
Y \mid (X_1 = x_1, \ldots, X_p = x_p) \sim \mathrm{B}(N, \mathrm{logistic}(\eta)), \tag{5.16}
\]

that is,

\[
E[Y \mid X_1 = x_1, \ldots, X_p = x_p] = N \, \mathrm{logistic}(\eta). \tag{5.17}
\]

Comparing (5.17) with (5.4), it is clear that the logistic regression is a particular case with N=1. The interpretation of the coefficients is therefore clear from the interpretation of (5.4), given that logistic(η) models the probability of success of each of the N experiments of the binomial B(N,logistic(η)).

The extra flexibility of binomial regression offers interesting applications. First, we can use (5.16) as an approach to model proportions160 Y/N ∈ [0, 1]. In this case, (5.17) becomes161

\[
E[Y/N \mid X_1 = x_1, \ldots, X_p = x_p] = \mathrm{logistic}(\eta).
\]

Second, we can let N depend on the predictors to accommodate group structures, which is perhaps the most common usage of binomial regression:

\[
Y \mid \mathbf{X} = \mathbf{x} \sim \mathrm{B}(N_{\mathbf{x}}, \mathrm{logistic}(\eta)), \tag{5.18}
\]

where the size of the binomial distribution, N_x, depends on the values of the predictors. For example, imagine that the predictors are two quantitative variables and two dummy variables encoding three categories. Then X = (X1, X2, D1, D2) and x = (x1, x2, d1, d2). In this case, N_x could for example take the form

\[
N_{\mathbf{x}} = \begin{cases} 30, & d_1 = 0,\ d_2 = 0, \\ 25, & d_1 = 1,\ d_2 = 0, \\ 50, & d_1 = 0,\ d_2 = 1, \end{cases}
\]

that is, we have a different number of experiments on each category, and we want to model the number (or, equivalently, the proportion) of successes for each one, also taking into account the effects of other qualitative variables. This is a very common situation in practice, when one encounters the sample version of (5.18):

\[
Y_i \mid \mathbf{X}_i = \mathbf{x}_i \sim \mathrm{B}(N_i, \mathrm{logistic}(\eta_i)), \quad i = 1, \ldots, n. \tag{5.19}
\]

Let’s see an example of binomial regression that illustrates the particular usage of glm() in this case. The example is a data application from Wood (2006) featuring different binomial sizes. It employs the heart dataset. The goal is to investigate whether the level of creatinine kinase present in the blood, ck, is a good diagnostic for determining if a patient is likely to have a future heart attack. The number of patients that did not have a heart attack (ok) and that had a heart attack (ha) was established after ck was measured. In total, there are 226 patients that have been aggregated into n = 12 categories162 of different sizes, created according to the average level of ck. Table 5.2 shows the data.

# Read data
heart <- read.table("heart.txt", header = TRUE)

# Sizes for each observation (Ni's)
heart$Ni <- heart$ok + heart$ha

# Proportions of patients with heart attacks
heart$prop <- heart$ha / (heart$ha + heart$ok)
Table 5.2: The heart dataset with Ni (Ni) and Yi/Ni (prop) added.

| ck | ha | ok | Ni | prop |
|---|---|---|---|---|
| 20 | 2 | 88 | 90 | 0.022 |
| 60 | 13 | 26 | 39 | 0.333 |
| 100 | 30 | 8 | 38 | 0.789 |
| 140 | 30 | 5 | 35 | 0.857 |
| 180 | 21 | 0 | 21 | 1.000 |
| 220 | 19 | 1 | 20 | 0.950 |
| 260 | 18 | 1 | 19 | 0.947 |
| 300 | 13 | 1 | 14 | 0.929 |
| 340 | 19 | 1 | 20 | 0.950 |
| 380 | 15 | 0 | 15 | 1.000 |
| 420 | 7 | 0 | 7 | 1.000 |
| 460 | 8 | 0 | 8 | 1.000 |
# Plot of proportions versus ck: twelve observations, each requiring
# Ni patients to determine the proportion
plot(heart$ck, heart$prop, xlab = "Creatinine kinase level",
     ylab = "Proportion of heart attacks")

# Fit binomial regression: recall the cbind() to pass the number of successes
# and failures
heart1 <- glm(cbind(ha, ok) ~ ck, family = binomial, data = heart)
summary(heart1)
## 
## Call:
## glm(formula = cbind(ha, ok) ~ ck, family = binomial, data = heart)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -3.08184  -1.93008   0.01652   0.41772   2.60362  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.758358   0.336696  -8.192 2.56e-16 ***
## ck           0.031244   0.003619   8.633  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 271.712  on 11  degrees of freedom
## Residual deviance:  36.929  on 10  degrees of freedom
## AIC: 62.334
## 
## Number of Fisher Scoring iterations: 6

# Alternatively: put proportions as responses, but then it is required to
# inform about the binomial size of each observation
heart1 <- glm(prop ~ ck, family = binomial, data = heart, weights = Ni)
summary(heart1)
## 
## Call:
## glm(formula = prop ~ ck, family = binomial, data = heart, weights = Ni)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -3.08184  -1.93008   0.01652   0.41772   2.60362  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.758358   0.336696  -8.192 2.56e-16 ***
## ck           0.031244   0.003619   8.633  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 271.712  on 11  degrees of freedom
## Residual deviance:  36.929  on 10  degrees of freedom
## AIC: 62.334
## 
## Number of Fisher Scoring iterations: 6

# Add fitted line
ck <- 0:500
newdata <- data.frame(ck = ck)
logistic <- function(eta) 1 / (1 + exp(-eta))
lines(ck, logistic(cbind(1, ck) %*% heart1$coefficients))

# It seems that a polynomial fit could better capture the "wiggly" pattern
# of the data
heart2 <- glm(prop ~ poly(ck, 2, raw = TRUE), family = binomial, data = heart,
              weights = Ni)
heart3 <- glm(prop ~ poly(ck, 3, raw = TRUE), family = binomial, data = heart,
              weights = Ni)
heart4 <- glm(prop ~ poly(ck, 4, raw = TRUE), family = binomial, data = heart,
              weights = Ni)

# Best fit given by heart3
BIC(heart1, heart2, heart3, heart4)
##        df      BIC
## heart1  2 63.30371
## heart2  3 44.27018
## heart3  4 35.59736
## heart4  5 37.96360
summary(heart3)
## 
## Call:
## glm(formula = prop ~ poly(ck, 3, raw = TRUE), family = binomial, 
##     data = heart, weights = Ni)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -0.99572  -0.08966   0.07468   0.17815   1.61096  
## 
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)    
## (Intercept)              -5.786e+00  9.268e-01  -6.243 4.30e-10 ***
## poly(ck, 3, raw = TRUE)1  1.102e-01  2.139e-02   5.153 2.57e-07 ***
## poly(ck, 3, raw = TRUE)2 -4.649e-04  1.381e-04  -3.367  0.00076 ***
## poly(ck, 3, raw = TRUE)3  6.448e-07  2.544e-07   2.535  0.01125 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 271.7124  on 11  degrees of freedom
## Residual deviance:   4.2525  on  8  degrees of freedom
## AIC: 33.658
## 
## Number of Fisher Scoring iterations: 6

# All fits together
lines(ck, logistic(cbind(1, poly(ck, 2, raw = TRUE)) %*% heart2$coefficients),
      col = 2)
lines(ck, logistic(cbind(1, poly(ck, 3, raw = TRUE)) %*% heart3$coefficients),
      col = 3)
lines(ck, logistic(cbind(1, poly(ck, 4, raw = TRUE)) %*% heart4$coefficients),
      col = 4)
legend("bottomright", legend = c("Linear", "Quadratic", "Cubic", "Quartic"),
       col = 1:4, lwd = 2)
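On the probability scale, predictions from these fits do not involve the binomial sizes Ni: predict() with type = "response" returns the estimated probability of a heart attack for a given ck. A quick sketch with the cubic fit:

# Estimated probability of heart attack at ck = 100 and ck = 300 (cubic fit)
predict(heart3, newdata = data.frame(ck = c(100, 300)), type = "response")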

Estimation by maximum likelihood

The estimation of β by MLE can be done in a unified framework, for all generalized linear models, thanks to the exponential family (5.10). Given a sample {(xi, Yi)}, i = 1, …, n,163 and employing a canonical link function (5.13), we have that

\[
Y_i \mid (X_1 = x_{i1}, \ldots, X_p = x_{ip}) \sim \mathrm{E}(\theta_i, \phi, a, b, c), \quad i = 1, \ldots, n,
\]

where

\[
\theta_i := \eta_i := \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}, \qquad \mu_i := E[Y_i \mid X_1 = x_{i1}, \ldots, X_p = x_{ip}] = g^{-1}(\eta_i).
\]

Then, the log-likelihood is

\[
\ell(\beta) = \sum_{i=1}^n \left( \frac{Y_i \theta_i - b(\theta_i)}{a(\phi)} + c(Y_i, \phi) \right). \tag{5.20}
\]

Differentiating with respect to β gives

\[
\frac{\partial \ell(\beta)}{\partial \beta} = \sum_{i=1}^n \frac{Y_i - b'(\theta_i)}{a(\phi)} \frac{\partial \theta_i}{\partial \beta},
\]

which, exploiting the properties of the exponential family, can be reduced to

\[
\frac{\partial \ell(\beta)}{\partial \beta} = \sum_{i=1}^n \frac{Y_i - \mu_i}{g'(\mu_i) V_i}\, \mathbf{x}_i, \tag{5.21}
\]

where xi now represents the i-th row of the design matrix X and Vi := Var[Yi] = a(ϕ) b″(θi). Solving explicitly the system of equations ∂ℓ(β)/∂β = 0 is not possible in general and a numerical procedure is required. Newton–Raphson is usually employed, which is based on obtaining β_new from the linear system164

\[
\frac{\partial^2 \ell(\beta)}{\partial \beta \, \partial \beta'} \bigg|_{\beta = \beta_{\mathrm{old}}} (\beta_{\mathrm{new}} - \beta_{\mathrm{old}}) = -\frac{\partial \ell(\beta)}{\partial \beta} \bigg|_{\beta = \beta_{\mathrm{old}}}. \tag{5.22}
\]

A simplifying trick is to consider the expectation of ∂²ℓ(β)/∂β∂β′ |β=β_old in (5.22), rather than its actual value. By doing so, we arrive at a neat iterative algorithm called Iterative Reweighted Least Squares (IRLS). We use the following well-known property of the Fisher information matrix from MLE theory:

\[
E\left[ \frac{\partial^2 \ell(\beta)}{\partial \beta \, \partial \beta'} \right] = -E\left[ \frac{\partial \ell(\beta)}{\partial \beta} \left( \frac{\partial \ell(\beta)}{\partial \beta} \right)' \right].
\]

Then, it can be seen that165

\[
-E\left[ \frac{\partial^2 \ell(\beta)}{\partial \beta \, \partial \beta'} \bigg|_{\beta = \beta_{\mathrm{old}}} \right] = \sum_{i=1}^n w_i \mathbf{x}_i \mathbf{x}_i' = \mathbf{X}' \mathbf{W} \mathbf{X}, \tag{5.23}
\]

where wi := 1 / (Vi (g′(μi))²) and W := diag(w1, …, wn). Using this notation and (5.21),

\[
\frac{\partial \ell(\beta)}{\partial \beta} \bigg|_{\beta = \beta_{\mathrm{old}}} = \mathbf{X}' \mathbf{W} (\mathbf{Y} - \boldsymbol{\mu}_{\mathrm{old}})\, g'(\boldsymbol{\mu}_{\mathrm{old}}). \tag{5.24}
\]

Substituting (5.23) and (5.24) in (5.22), we have:

\[
\begin{aligned}
\beta_{\mathrm{new}} &= \beta_{\mathrm{old}} - E\left[ \frac{\partial^2 \ell(\beta)}{\partial \beta \, \partial \beta'} \bigg|_{\beta = \beta_{\mathrm{old}}} \right]^{-1} \frac{\partial \ell(\beta)}{\partial \beta} \bigg|_{\beta = \beta_{\mathrm{old}}} \\
&= \beta_{\mathrm{old}} + (\mathbf{X}' \mathbf{W} \mathbf{X})^{-1} \mathbf{X}' \mathbf{W} (\mathbf{Y} - \boldsymbol{\mu}_{\mathrm{old}})\, g'(\boldsymbol{\mu}_{\mathrm{old}}) \\
&= (\mathbf{X}' \mathbf{W} \mathbf{X})^{-1} \mathbf{X}' \mathbf{W} \mathbf{z},
\end{aligned} \tag{5.25}
\]

where z := Xβ_old + (Y − μ_old) g′(μ_old) is the working vector.

As a consequence, fitting a generalized linear model by IRLS amounts to performing a series of weighted linear models with changing weights and responses given by the working vector. IRLS can be summarized as:

  1. Set β_old to some initial estimate.
  2. Compute μ_old, W, and z.
  3. Compute β_new using (5.25).
  4. Set β_old as β_new.
  5. Iterate Steps 2–4 until convergence, then set β̂ = β_new.
In general, E[∂²ℓ(β)/∂β∂β′] ≠ ∂²ℓ(β)/∂β∂β′. Therefore, IRLS in general departs from the standard Newton–Raphson. However, if the canonical link is used, it can be seen that the equality of ∂²ℓ(β)/∂β∂β′ with its expectation is guaranteed, and IRLS is exactly the same as Newton–Raphson. In that case, wi = 1/g′(μi) (which simplifies the computation of W in the algorithm above).
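To make the algorithm concrete, below is a minimal IRLS sketch for the logistic model (canonical link, so wi = μi(1 − μi) and g′(μi) = 1/(μi(1 − μi))), reusing the x and y1 data and the mod fit from Section 5.2.1. It is an illustration of (5.25), not a replacement for glm().

# Minimal IRLS for the logistic regression of y1 on x
X <- cbind(1, x)   # design matrix with intercept
beta <- c(0, 0)    # initial estimate
for (k in 1:25) {
  eta <- drop(X %*% beta)   # linear predictor
  mu <- plogis(eta)         # mu_old = logistic(eta)
  w <- mu * (1 - mu)        # weights w_i for the canonical link
  z <- eta + (y1 - mu) / w  # working vector
  beta <- drop(solve(t(X) %*% (w * X), t(X) %*% (w * z)))  # weighted LS step
}
beta              # essentially matches the coefficients given by glm
mod$coefficients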

References

Wood, S. N. 2006. Generalized Additive Models. Texts in Statistical Science Series. Boca Raton: Chapman & Hall/CRC. https://doi.org/10.1201/9781420010404.

  1. Recall that a binomial variable with size n and probability p, B(n,p), is obtained by summing n independent Ber(p), so Ber(p) is the same distribution as B(1,p).↩︎

  2. Do not confuse this p with the number of predictors in the model, represented by p. The context should make unambiguous the use of p.↩︎

  3. The fact that the logistic function is a cdf allows remembering that the logistic is to be applied to map R into [0,1], as opposed to the logit function.↩︎

  4. And also, as we will see later, because it is the canonical link function.↩︎

  5. Consequently, the name “odds” used in this context is singular, as it refers to a single ratio.↩︎

  6. Recall that (traditionally) the result of a bet is binary: one either wins or loses it.↩︎

  7. For example, if a horse Y has probability p = 2/3 of winning a race (Y = 1), then the odds of the horse is p/(1 − p) = (2/3)/(1/3) = 2. This means that the horse has a probability of winning that is twice as large as the probability of losing. This is sometimes written as 2:1 or 2×1 (spelled “two-to-one”).↩︎

  8. For the previous example: if the odds of the horse was 5, then the probability of winning would be p=5/6.↩︎

  9. To do so, apply (5.6) to (5.4) and use (5.3).↩︎

  10. As in the linear model, we assume the randomness comes from the error present in Y once X is given, not from the X, and we therefore denote by xi the i-th observation of X.↩︎

  11. Section 5.7 discusses in detail the assumptions of generalized linear models.↩︎

  12. Notice that this approach is very different from directly transforming the response as g(Y), as outlined in Section 3.5.1. Indeed, in generalized linear models one transforms E[Y | X1, …, Xp], not Y. Of course, g(E[Y | X1, …, Xp]) ≠ E[g(Y) | X1, …, Xp].↩︎

  13. Not to be confused with the exponential distribution Exp(λ), which is a member of the exponential family.↩︎

  14. This is the so-called canonical form of the exponential family. Generalizations of the family are possible, though we do not consider them.↩︎

  15. The pdf of a Γ(a, ν) is f(x; a, ν) = (a^ν / Γ(ν)) x^{ν−1} e^{−ax}, for a, ν > 0 and x ∈ (0, ∞) (the pdf is zero otherwise). The expectation is E[Γ(a, ν)] = ν/a.↩︎

  16. If the argument −ην is not positive, then the probability assigned by Γ(−ην, ν) is zero. This delicate case may complicate the estimation of the model. Valid starting values for β are required.↩︎

  17. Note this situation is very different from logistic regression, for which we either have observations with the values 0 or 1. In binomial regression, we can naturally have proportions.↩︎

  18. Clearly, E[B(N,p)/N]=p because E[B(N,p)]=Np.↩︎

  19. The sample size here is n = 12, not 226. There are N1, …, N12 binomial sizes corresponding to each observation, and N1 + ⋯ + N12 = 226.↩︎

  20. We assume the randomness comes from the error present in Y once X is given, not from the X. This is implicit in the considered expectations.↩︎

  21. The system stems from a first-order Taylor expansion of the function ∂ℓ(β)/∂β : R^{p+1} → R^{p+1} about the root β̂, where ∂ℓ(β)/∂β |β=β̂ = 0.↩︎

  22. Recall that E[(Yi − μi)(Yj − μj)] = Cov[Yi, Yj] = Vi if i = j, and 0 if i ≠ j, because of independence.↩︎