5.2 Model formulation and estimation
For simplicity, we first study logistic regression and then the general case of a generalized linear model.
5.2.1 Logistic regression
As we saw in Section 2.2, the multiple linear model described the relation between the random variables $X_1,\ldots,X_p$ and $Y$ by assuming a linear relation in the conditional expectation:

$$\mathbb{E}[Y|X_1=x_1,\ldots,X_p=x_p]=\beta_0+\beta_1x_1+\cdots+\beta_px_p.$$

In addition, it made three more assumptions on the data (see Section 2.3), which resulted in the following one-line summary of the linear model:

$$Y|(X_1=x_1,\ldots,X_p=x_p)\sim\mathcal{N}(\beta_0+\beta_1x_1+\cdots+\beta_px_p,\sigma^2).$$
Recall that a necessary condition for the linear model to hold is that $Y$ is continuous, in order to satisfy the normality of the errors. Therefore, the linear model is designed for a continuous response.
The situation when $Y$ is discrete (naturally ordered values) or categorical (non-ordered categories) requires a different treatment. The simplest situation is when $Y$ is binary: it can only take two values, codified for convenience as $1$ (success) and $0$ (failure). For binary variables there is no fundamental distinction between the treatment of discrete and categorical variables. Formally, a binary variable is referred to as a Bernoulli variable,144 $Y\sim\mathrm{Ber}(p)$,145 if

$$Y=\begin{cases}1,&\text{with probability }p,\\0,&\text{with probability }1-p,\end{cases}$$

or, equivalently, if

$$\mathbb{P}[Y=y]=p^y(1-p)^{1-y},\quad y=0,1.$$

Recall that a Bernoulli variable is completely determined by the probability $p$. Therefore, so are its mean and variance:

$$\mathbb{E}[Y]=p,\qquad\mathbb{V}\mathrm{ar}[Y]=p(1-p).$$
Assume then that $Y$ is a Bernoulli variable and that $X_1,\ldots,X_p$ are predictors associated to $Y$. The purpose in logistic regression is to model

$$\mathbb{E}[Y|X_1=x_1,\ldots,X_p=x_p]=\mathbb{P}[Y=1|X_1=x_1,\ldots,X_p=x_p]=:p(x_1,\ldots,x_p),\tag{5.1}$$

that is, to model how the conditional expectation of $Y$ or, equivalently, the conditional probability of $Y=1$, changes according to particular values of the predictors. At sight of (5.1), a tempting possibility is to consider the model

$$p(x_1,\ldots,x_p)=\beta_0+\beta_1x_1+\cdots+\beta_px_p.$$
However, such a model will run into serious problems inevitably: negative probabilities and probabilities larger than one may happen.
A solution is to consider a link function $g$ to encapsulate the value of $p(x_1,\ldots,x_p)\in[0,1]$ and map it back to $\mathbb{R}$. Or, alternatively, a function $g^{-1}$ that takes $\eta=\beta_0+\beta_1x_1+\cdots+\beta_px_p\in\mathbb{R}$ and maps it to $[0,1]$, the support of $p(x_1,\ldots,x_p)$. There are several link functions with associated $g^{-1}$'s. Each link generates a different model:
- Uniform link. Based on the truncation of the linear predictor $\eta$ into $[0,1]$: $p(x_1,\ldots,x_p)=\min(\max(\eta,0),1)$.
- Probit link. Based on the normal cdf, this is, $p(x_1,\ldots,x_p)=\Phi(\eta)$.
- Logit link. Based on the logistic cdf:146 $p(x_1,\ldots,x_p)=\mathrm{logistic}(\eta)=\dfrac{1}{1+e^{-\eta}}$.

Figure 5.4: Transformations $g^{-1}$ associated to different link functions. The transformations map the response of a linear regression, $\eta=\beta_0+\beta_1x_1+\cdots+\beta_px_p$, into $[0,1]$.
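The three transformations can also be evaluated numerically in R. A minimal sketch (plogis() and pnorm() are the logistic and standard normal cdfs; the values of eta are arbitrary illustrative values of the linear predictor):
# Mapping some values of the linear predictor into [0, 1]
eta <- c(-2, -0.5, 0, 0.5, 2)
pmin(pmax(eta, 0), 1) # Uniform link: truncation into [0, 1]
pnorm(eta) # Probit link: standard normal cdf
plogis(eta) # Logit link: logistic cdf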
The logistic transformation is the most employed due to its tractability, interpretability, and smoothness.147 Its inverse, $g:[0,1]\longrightarrow\mathbb{R}$, is known as the logit function:

$$\mathrm{logit}(p):=\mathrm{logistic}^{-1}(p)=\log\frac{p}{1-p}.$$
In conclusion, with the logit link function we can map the domain of $p(x_1,\ldots,x_p)$, $[0,1]$, into $\mathbb{R}$ in order to apply a linear model. The logistic model can be then equivalently stated as

$$p(x_1,\ldots,x_p)=\mathrm{logistic}(\beta_0+\beta_1x_1+\cdots+\beta_px_p)=\frac{1}{1+e^{-(\beta_0+\beta_1x_1+\cdots+\beta_px_p)}},\tag{5.3}$$

or as

$$\mathrm{logit}(p(x_1,\ldots,x_p))=\log\frac{p(x_1,\ldots,x_p)}{1-p(x_1,\ldots,x_p)}=\beta_0+\beta_1x_1+\cdots+\beta_px_p,\tag{5.4}$$

where recall that $p(x_1,\ldots,x_p)=\mathbb{P}[Y=1|X_1=x_1,\ldots,X_p=x_p]=\mathbb{E}[Y|X_1=x_1,\ldots,X_p=x_p]$.
There is a clear interpretation of the role of the linear predictor $\beta_0+\beta_1x_1+\cdots+\beta_px_p$ in (5.4) when we come back to (5.3):

- If $\beta_0+\beta_1x_1+\cdots+\beta_px_p=0$, then $p(x_1,\ldots,x_p)=\frac{1}{2}$ ($Y=1$ and $Y=0$ are equally likely).
- If $\beta_0+\beta_1x_1+\cdots+\beta_px_p<0$, then $p(x_1,\ldots,x_p)<\frac{1}{2}$ ($Y=1$ is less likely).
- If $\beta_0+\beta_1x_1+\cdots+\beta_px_p>0$, then $p(x_1,\ldots,x_p)>\frac{1}{2}$ ($Y=1$ is more likely).
To be more precise on the interpretation of the coefficients $\beta_0,\beta_1,\ldots,\beta_p$ we need to introduce the odds. The odds is an equivalent way of expressing the distribution of probabilities in a binary variable $Y$. Instead of using $p=\mathbb{P}[Y=1]$ to characterize the distribution of $Y$, we can use

$$\mathrm{odds}(Y)=\frac{p}{1-p}=\frac{\mathbb{P}[Y=1]}{\mathbb{P}[Y=0]}.\tag{5.6}$$
The odds is thus the ratio between the probability of success and the probability of failure.148 It is extensively used in betting149 due to its better interpretability.150 Conversely, if the odds of $Y$ is given, we can easily know what is the probability of success $p$ using the inverse of (5.6):151

$$p=\mathbb{P}[Y=1]=\frac{\mathrm{odds}(Y)}{1+\mathrm{odds}(Y)}.$$
Recall that the odds is a number in $[0,+\infty]$. The $0$ and $+\infty$ values are attained for $p=0$ and $p=1$, respectively. The log-odds (or logit) is a number in $[-\infty,+\infty]$.
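A quick numerical illustration of the odds, the log-odds, and the inverse mapping (the probability p = 2 / 3 is just an illustrative value):
# Odds and log-odds associated to a probability p
p <- 2 / 3
odds <- p / (1 - p) # 2: success is twice as likely as failure
log(odds) # Log-odds
odds / (1 + odds) # Recovers p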
We can rewrite (5.4) in terms of the odds (5.6),152 so we get:

$$\mathrm{odds}(Y|X_1=x_1,\ldots,X_p=x_p)=\frac{p(x_1,\ldots,x_p)}{1-p(x_1,\ldots,x_p)}=e^{\beta_0+\beta_1x_1+\cdots+\beta_px_p}=e^{\beta_0}e^{\beta_1x_1}\cdots e^{\beta_px_p}.\tag{5.7}$$
Alternatively, taking logarithms, we have the log-odds (or logit)

$$\log(\mathrm{odds}(Y|X_1=x_1,\ldots,X_p=x_p))=\beta_0+\beta_1x_1+\cdots+\beta_px_p.\tag{5.8}$$
The conditional log-odds (5.8) plays the role of the conditional mean for multiple linear regression. Therefore, we have an analogous interpretation for the coefficients:
- $\beta_0$: is the log-odds when $x_1=\cdots=x_p=0$.
- $\beta_j$, $1\leq j\leq p$: is the additive increment of the log-odds for an increment of one unit in $x_j$, provided that the remaining variables $x_1,\ldots,x_{j-1},x_{j+1},\ldots,x_p$ do not change.
The log-odds is not as easy to interpret as the odds. For that reason, an equivalent way of interpreting the coefficients, this time based on (5.7), is:
- $e^{\beta_0}$: is the odds when $x_1=\cdots=x_p=0$.
- $e^{\beta_j}$, $1\leq j\leq p$: is the multiplicative increment of the odds for an increment of one unit in $x_j$, provided that the remaining variables do not change. If the increment in $x_j$ is of $r$ units, then the multiplicative increment in the odds is $(e^{\beta_j})^r$.
As a consequence of this last interpretation, we have:

- If $\beta_j>0$, then $e^{\beta_j}>1$, and an increment in $x_j$ increases the odds of $Y=1$.
- If $\beta_j<0$, then $e^{\beta_j}<1$, and an increment in $x_j$ decreases the odds of $Y=1$.
Case study application
In the Challenger case study we used fail.field as an indicator of whether “there was at least an incident with the O-rings” (1 = yes, 0 = no). Let’s see if the temperature was associated with O-ring incidents (Q1). For that, we compute the logistic regression of fail.field on temp and we plot the fitted logistic curve.
# Logistic regression: computed with glm and family = "binomial"
nasa <- glm(fail.field ~ temp, family = "binomial", data = challenger)
# Plot data
plot(challenger$temp, challenger$fail.field, xlim = c(-1, 30),
xlab = "Temperature", ylab = "Incident probability")
# Draw the fitted logistic curve
x <- seq(-1, 30, l = 200)
y <- exp(-(nasa$coefficients[1] + nasa$coefficients[2] * x))
y <- 1 / (1 + y)
lines(x, y, col = 2, lwd = 2)
# The Challenger
points(-0.6, 1, pch = 16)
text(-0.6, 1, labels = "Challenger", pos = 4)
At the sight of this curve and the summary it seems that the temperature was affecting the probability of an O-ring incident (Q1). Let’s quantify this statement and answer Q2 by looking at the coefficients of the model:
# Exponentiated coefficients ("odds ratios")
exp(coef(nasa))
## (Intercept) temp
## 1965.9743592 0.6592539
The exponentials of the estimated coefficients are:
- $e^{\hat\beta_0}\approx 1965.97$. This means that, when the temperature is zero, the fitted odds is $1965.97$, so the (estimated) probability of having an incident ($Y=1$) is $1965.97$ times larger than the probability of not having an incident ($Y=0$). Or, in other words, the probability of having an incident at temperature zero is $\frac{1965.97}{1+1965.97}\approx 0.9995$.
- $e^{\hat\beta_1}\approx 0.66$. This means that each Celsius degree increment on the temperature multiplies the fitted odds by a factor of approximately $0.66$, hence reducing it.
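These fitted probabilities can be evaluated at any temperature with predict(). A minimal sketch, using the Challenger launch temperature highlighted in the previous plot:
# Fitted probability of an O-ring incident at the launch temperature (-0.6 Celsius)
# type = "response" returns probabilities; the default type = "link" returns log-odds
predict(nasa, newdata = data.frame(temp = -0.6), type = "response")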
However, for the moment we cannot say whether these findings are significant or are just an artifact of the randomness of the data, since we do not have information on the variability of the estimates of $\boldsymbol{\beta}$. We will need inference for that.
Estimation by maximum likelihood
The estimation of $\boldsymbol{\beta}=(\beta_0,\beta_1,\ldots,\beta_p)'$ from a sample $(\mathbf{x}_1,Y_1),\ldots,(\mathbf{x}_n,Y_n)$153 is done by Maximum Likelihood Estimation (MLE). As it can be seen in Appendix A.2, in the linear model, under the assumptions mentioned in Section 2.3, MLE is equivalent to least squares estimation. In the logistic model, we assume that154

$$Y_i|(X_{i1}=x_{i1},\ldots,X_{ip}=x_{ip})\sim\mathrm{Ber}(p(x_{i1},\ldots,x_{ip})),\quad i=1,\ldots,n,$$

where $p(x_{i1},\ldots,x_{ip})=\mathrm{logistic}(\beta_0+\beta_1x_{i1}+\cdots+\beta_px_{ip})$. Denoting $p_i:=p(x_{i1},\ldots,x_{ip})$, the log-likelihood of $\boldsymbol{\beta}$ is

$$\ell(\boldsymbol{\beta})=\sum_{i=1}^n\left[Y_i\log(p_i)+(1-Y_i)\log(1-p_i)\right].\tag{5.9}$$

The ML estimate of $\boldsymbol{\beta}$ is

$$\hat{\boldsymbol{\beta}}:=\arg\max_{\boldsymbol{\beta}\in\mathbb{R}^{p+1}}\ell(\boldsymbol{\beta}).$$
Unfortunately, due to the nonlinearity of (5.9), there is no explicit expression for $\hat{\boldsymbol{\beta}}$ and it has to be obtained numerically by means of an iterative procedure. We will see this in more detail in the next section. Just be aware that this iterative procedure may fail to converge in low sample size situations with perfect classification, where the likelihood might be numerically unstable.
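A minimal illustration of this degenerate situation with made-up, perfectly classified data (the object names are arbitrary): glm() warns about non-convergence and about fitted probabilities numerically 0 or 1, and returns a huge, unstable slope estimate.
# Perfect classification: y is 1 exactly when x is positive
xSep <- c(-3, -2, -1, 1, 2, 3)
ySep <- c(0, 0, 0, 1, 1, 1)
glm(ySep ~ xSep, family = "binomial") # Warnings + exploding slope estimate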
Figure 5.5: The logistic regression fit and its dependence on $\beta_0$ (horizontal displacement) and $\beta_1$ (steepness of the curve). Recall the effect of the sign of $\beta_1$ in the curve: if positive, the logistic curve has an ‘s’ form; if negative, the form is a reflected ‘s’. Application available here.
Figure 5.5 shows how the log-likelihood changes with respect to the values for $(\beta_0,\beta_1)$ in three data patterns. The data of the illustration has been generated with the next chunk of code.
# Data
set.seed(34567)
x <- rnorm(50, sd = 1.5)
y1 <- -0.5 + 3 * x
y2 <- 0.5 - 2 * x
y3 <- -2 + 5 * x
y1 <- rbinom(50, size = 1, prob = 1 / (1 + exp(-y1)))
y2 <- rbinom(50, size = 1, prob = 1 / (1 + exp(-y2)))
y3 <- rbinom(50, size = 1, prob = 1 / (1 + exp(-y3)))
# Data
dataMle <- data.frame(x = x, y1 = y1, y2 = y2, y3 = y3)
For fitting a logistic model we employ glm, which has the syntax glm(formula = response ~ predictor, family = "binomial", data = data), where response is a binary variable. Note that family = "binomial" is referring to the fact that the response is a binomial variable (since it is a Bernoulli). Let’s check that indeed the coefficients given by glm are the ones that maximize the likelihood given in the animation of Figure 5.5. We do so for y1 ~ x.
# Call glm
mod <- glm(y1 ~ x, family = "binomial", data = dataMle)
mod$coefficients
## (Intercept) x
## -0.1691947 2.4281626
# -loglik(beta)
minusLogLik <- function(beta) {
  p <- 1 / (1 + exp(-(beta[1] + beta[2] * x)))
  -sum(y1 * log(p) + (1 - y1) * log(1 - p))
}
# Optimization using as starting values beta = c(0, 0)
opt <- optim(par = c(0, 0), fn = minusLogLik)
opt
## $par
## [1] -0.1691366 2.4285119
##
## $value
## [1] 14.79376
##
## $counts
## function gradient
## 73 NA
##
## $convergence
## [1] 0
##
## $message
## NULL
# Visualization of the log-likelihood surface
beta0 <- seq(-3, 3, l = 50)
beta1 <- seq(-2, 8, l = 50)
L <- matrix(nrow = length(beta0), ncol = length(beta1))
for (i in seq_along(beta0)) {
  for (j in seq_along(beta1)) {
    L[i, j] <- minusLogLik(c(beta0[i], beta1[j]))
  }
}
filled.contour(beta0, beta1, -L, color.palette = viridis::viridis,
               xlab = expression(beta[0]), ylab = expression(beta[1]),
               plot.axes = {
                 axis(1); axis(2)
                 points(mod$coefficients[1], mod$coefficients[2],
                        col = 2, pch = 16)
                 points(opt$par[1], opt$par[2], col = 4)
               })

Figure 5.6: Log-likelihood surface and its global maximum
# The plot.axes argument is a hack to add graphical information within the
# coordinates of the main panel (behind filled.contour there is a layout()...)
For the regressions y2 ~ x and y3 ~ x, do the following:
5.2.2 General case
The same idea we used in logistic regression, namely transforming the conditional expectation of $Y$ into something that can be modeled by a linear model (this is, a quantity that lives in $\mathbb{R}$), can be generalized. This gives rise to the family of generalized linear models, which extends the linear model to different kinds of response variables and provides a convenient parametric framework.
The first ingredient is a link function $g$, monotonic and differentiable, which is going to produce a transformed expectation155 to be modeled by a linear combination of the predictors:

$$g(\mathbb{E}[Y|X_1=x_1,\ldots,X_p=x_p])=\beta_0+\beta_1x_1+\cdots+\beta_px_p,$$

or, equivalently,

$$\mathbb{E}[Y|X_1=x_1,\ldots,X_p=x_p]=g^{-1}(\beta_0+\beta_1x_1+\cdots+\beta_px_p),$$

where

$$\eta:=\beta_0+\beta_1x_1+\cdots+\beta_px_p$$

is the linear predictor.
The second ingredient of generalized linear models is a distribution for $Y|(X_1=x_1,\ldots,X_p=x_p)$, just as the linear model assumes normality or the logistic model assumes a Bernoulli random variable. Thus, we have two linked generalizations with respect to the usual linear model:
- The conditional mean $\mathbb{E}[Y|X_1=x_1,\ldots,X_p=x_p]$ may be modeled by a transformation $g^{-1}$ of the linear predictor $\eta$.
- The distribution of $Y|(X_1=x_1,\ldots,X_p=x_p)$ may be different from the normal.
Generalized linear models are intimately related with the exponential family,156 157 which is the family of distributions with pdf expressible as

$$f(y;\theta,\phi)=\exp\left\{\frac{y\theta-b(\theta)}{a(\phi)}+c(y,\phi)\right\},\tag{5.10}$$

where $a(\cdot)$, $b(\cdot)$, and $c(\cdot,\cdot)$ are specific functions. If $Y$ has the pdf (5.10), then we write $Y\sim\mathrm{E}(\theta,\phi,a,b,c)$. If the scale parameter $\phi$ is known, this is an exponential family with canonical parameter $\theta$ (if $\phi$ is unknown, then it may or may not be a two-parameter exponential family).
Distributions from the exponential family have some nice properties. Importantly, if $Y\sim\mathrm{E}(\theta,\phi,a,b,c)$, then

$$\mathbb{E}[Y]=b'(\theta)=:\mu,\qquad\mathbb{V}\mathrm{ar}[Y]=b''(\theta)a(\phi).\tag{5.11}$$
The canonical link function is the function $g$ that transforms $\mu=\mathbb{E}[Y]$ into the canonical parameter $\theta$. For $Y\sim\mathrm{E}(\theta,\phi,a,b,c)$, this happens if

$$\theta=g(\mu),$$

or, more explicitly due to (5.11), if

$$g=(b')^{-1}.\tag{5.13}$$
In the case of canonical link function, the one-line summary of the generalized linear model is (independence is implicit)

$$Y_i|(X_{i1}=x_{i1},\ldots,X_{ip}=x_{ip})\sim\mathrm{E}(\theta_i,\phi,a,b,c),\quad\theta_i=\eta_i:=\beta_0+\beta_1x_{i1}+\cdots+\beta_px_{ip}.\tag{5.14}$$
Expression (5.14) gives insight on what a generalized linear model does:
- Select a member of the exponential family in (5.10) for modeling $Y$.
- The canonical link function is $g=(b')^{-1}$. In this case, $g(\mu)=\theta$.
- The generalized linear model associated to that member of the exponential family and $g$ models the conditional $\theta$, given $X_1,\ldots,X_p$, by means of the linear predictor $\eta$. This is equivalent to modeling the conditional expectation $\mu$ by means of $g^{-1}(\eta)$.
The linear model arises as a particular case of (5.14) with

$$a(\phi)=\phi,\quad b(\theta)=\frac{\theta^2}{2},\quad c(y,\phi)=-\frac{1}{2}\left\{\frac{y^2}{\phi}+\log(2\pi\phi)\right\},$$

and scale parameter $\phi=\sigma^2$. In this case, $\mu=b'(\theta)=\theta$ and the canonical link function is the identity.
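This equivalence can be checked quickly in R. A minimal sketch with simulated data (the canonical link of each family is also stored in the corresponding family object):
# The linear model is a glm with family = gaussian: same coefficients as lm
set.seed(12345)
xx <- rnorm(100)
yy <- 1 + 2 * xx + rnorm(100)
coef(lm(yy ~ xx))
coef(glm(yy ~ xx, family = gaussian))
# Canonical links of some families
gaussian()$link # "identity"
binomial()$link # "logit"
poisson()$link # "log"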
The following table lists some useful generalized linear models. Recall that the linear and logistic models of Sections 2.2.3 and 5.2.1 are obtained from the first and second rows, respectively.
Support of $Y$ | Generating distribution | Link $g(\mu)$ | Expectation $\mu=g^{-1}(\eta)$ | Scale $\phi$ | Distribution of $Y|(X_1=x_1,\ldots,X_p=x_p)$
---|---|---|---|---|---
$\mathbb{R}$ | $\mathcal{N}(\mu,\sigma^2)$ | $g(\mu)=\mu$ | $\mu=\eta$ | $\sigma^2$ | $\mathcal{N}(\eta,\sigma^2)$
$\{0,1\}$ | $\mathrm{Ber}(p)$ | $g(\mu)=\mathrm{logit}(\mu)$ | $\mu=\mathrm{logistic}(\eta)$ | $1$ | $\mathrm{Ber}(\mathrm{logistic}(\eta))$
$\{0,1,2,\ldots\}$ | $\mathrm{Pois}(\lambda)$158 | $g(\mu)=\log(\mu)$159 | $\mu=e^{\eta}$ | $1$ | $\mathrm{Pois}(e^{\eta})$
Poisson regression
Poisson regression is usually employed for modeling count data that arises from the recording of the frequencies of a certain phenomenon. It considers that

$$Y_i|(X_{i1}=x_{i1},\ldots,X_{ip}=x_{ip})\sim\mathrm{Pois}(\lambda(x_{i1},\ldots,x_{ip})),$$

this is,

$$\lambda(x_{i1},\ldots,x_{ip})=\mathbb{E}[Y_i|X_{i1}=x_{i1},\ldots,X_{ip}=x_{ip}]=e^{\beta_0+\beta_1x_{i1}+\cdots+\beta_px_{ip}}.$$
Let’s see how to apply a Poisson regression. For that aim we consider the species (download) dataset. The goal is to analyze whether the Biomass and the pH (a factor) of the terrain are influential on the number of Species. Incidentally, it will serve to illustrate that the use of factors within glm is completely analogous to what we did with lm.
# Plot data
plot(Species ~ Biomass, data = species, col = as.numeric(pH))
legend("topright", legend = c("High pH", "Medium pH", "Low pH"),
col = c(1, 3, 2), lwd = 2) # colors according to as.numeric(pH)
# Fit Poisson regression
species1 <- glm(Species ~ ., data = species, family = poisson)
summary(species1)
##
## Call:
## glm(formula = Species ~ ., family = poisson, data = species)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5959 -0.6989 -0.0737 0.6647 3.5604
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.84894 0.05281 72.885 < 2e-16 ***
## pHlow -1.13639 0.06720 -16.910 < 2e-16 ***
## pHmed -0.44516 0.05486 -8.114 4.88e-16 ***
## Biomass -0.12756 0.01014 -12.579 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 452.346 on 89 degrees of freedom
## Residual deviance: 99.242 on 86 degrees of freedom
## AIC: 526.43
##
## Number of Fisher Scoring iterations: 4
# Took 4 iterations of the IRLS
# Interpretation of the coefficients:
exp(species1$coefficients)
## (Intercept) pHlow pHmed Biomass
## 46.9433686 0.3209744 0.6407222 0.8802418
# - 46.9433 is the average number of species when Biomass = 0 and the pH is high
# - For each increment in one unit in Biomass, the number of species decreases
# by a factor of 0.88 (12% reduction)
# - If pH decreases to med (low), then the number of species decreases by a factor
# of 0.6407 (0.3209)
# With interactions
species2 <- glm(Species ~ Biomass * pH, data = species, family = poisson)
summary(species2)
##
## Call:
## glm(formula = Species ~ Biomass * pH, family = poisson, data = species)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4978 -0.7485 -0.0402 0.5575 3.2297
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.76812 0.06153 61.240 < 2e-16 ***
## Biomass -0.10713 0.01249 -8.577 < 2e-16 ***
## pHlow -0.81557 0.10284 -7.931 2.18e-15 ***
## pHmed -0.33146 0.09217 -3.596 0.000323 ***
## Biomass:pHlow -0.15503 0.04003 -3.873 0.000108 ***
## Biomass:pHmed -0.03189 0.02308 -1.382 0.166954
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 452.346 on 89 degrees of freedom
## Residual deviance: 83.201 on 84 degrees of freedom
## AIC: 514.39
##
## Number of Fisher Scoring iterations: 4
exp(species2$coefficients)
## (Intercept) Biomass pHlow pHmed Biomass:pHlow Biomass:pHmed
## 43.2987424 0.8984091 0.4423865 0.7178730 0.8563910 0.9686112
# - If pH decreases to med (low), then the effect of the biomass in the number
# of species decreases by a factor of 0.9686 (0.8564). The higher the pH, the
# stronger the effect of the Biomass in Species
# Draw fits
plot(Species ~ Biomass, data = species, col = as.numeric(pH))
legend("topright", legend = c("High pH", "Medium pH", "Low pH"),
col = c(1, 3, 2), lwd = 2) # colors according to as.numeric(pH)
# Without interactions
bio <- seq(0, 10, l = 100)
z <- species1$coefficients[1] + species1$coefficients[4] * bio
lines(bio, exp(z), col = 1)
lines(bio, exp(species1$coefficients[2] + z), col = 2)
lines(bio, exp(species1$coefficients[3] + z), col = 3)
# With interactions seems to provide a significant improvement
bio <- seq(0, 10, l = 100)
z <- species2$coefficients[1] + species2$coefficients[2] * bio
lines(bio, exp(z), col = 1, lty = 2)
lines(bio, exp(species2$coefficients[3] + species2$coefficients[5] * bio + z),
col = 2, lty = 2)
lines(bio, exp(species2$coefficients[4] + species2$coefficients[6] * bio + z),
col = 3, lty = 2)
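Fitted expected counts for concrete values of the predictors can be obtained with predict() and type = "response". A minimal sketch (Biomass = 5 is an arbitrary illustrative value, and the pH levels are assumed to be coded as "high", "med", and "low", as the coefficient names suggest):
# Expected number of species at Biomass = 5 for the three pH levels
newSpecies <- data.frame(Biomass = 5, pH = c("high", "med", "low"))
predict(species1, newdata = newSpecies, type = "response")
predict(species2, newdata = newSpecies, type = "response")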
For the challenger dataset, do the following:

- Do a Poisson regression of the total number of incidents, nfails.field + nfails.nozzle, on temp.
- Plot the data and the fitted Poisson regression curve.
- Predict the expected number of incidents at temperatures and .
Binomial regression
Binomial regression is an extension of logistic regression that allows modeling discrete responses in $\{0,1,\ldots,N\}$, where $N$ is fixed. In its most vanilla version, it considers the model

$$Y_i|(X_{i1}=x_{i1},\ldots,X_{ip}=x_{ip})\sim\mathrm{B}(N,p(x_{i1},\ldots,x_{ip})),\tag{5.16}$$

this is,

$$p(x_{i1},\ldots,x_{ip})=\mathrm{logistic}(\beta_0+\beta_1x_{i1}+\cdots+\beta_px_{ip}).\tag{5.17}$$

Comparing (5.17) with (5.4), it is clear that the logistic regression is a particular case with $N=1$. The interpretation of the coefficients is therefore clear from the interpretation of (5.4), given that $p(x_{i1},\ldots,x_{ip})$ models the probability of success of each of the $N$ experiments of the binomial $\mathrm{B}(N,p(x_{i1},\ldots,x_{ip}))$.
The extra flexibility that binomial regression has offers interesting applications. First, we can use (5.16) as an approach to model proportions.160 In this case, (5.17) becomes161

$$\mathbb{E}\left[\left.\frac{Y_i}{N}\,\right|X_{i1}=x_{i1},\ldots,X_{ip}=x_{ip}\right]=p(x_{i1},\ldots,x_{ip}).$$
Second, we can let $N$ be dependent on the predictors to accommodate group structures, perhaps the most common usage of binomial regression:

$$Y_i|(X_{i1}=x_{i1},\ldots,X_{ip}=x_{ip})\sim\mathrm{B}(N(x_{i1},\ldots,x_{ip}),p(x_{i1},\ldots,x_{ip})),\tag{5.18}$$

where $N(x_{i1},\ldots,x_{ip})$, the size of the binomial distribution, depends on the values of the predictors. For example, imagine that the predictors are two quantitative variables, $X_1$ and $X_2$, and two dummy variables, $X_3$ and $X_4$, encoding three categories. Then $p=4$ and, in this case, $N$ could for example take the form

$$N(x_1,x_2,x_3,x_4)=N_1 1_{\{x_3=0,\,x_4=0\}}+N_2 1_{\{x_3=1\}}+N_3 1_{\{x_4=1\}},$$

that is, we have a different number of experiments on each category, and we want to model the number (or, equivalently, the proportion) of successes for each one, also taking into account the effects of the remaining variables. This is a very common situation in practice, when one encounters the sample version of (5.18):

$$Y_i|(X_{i1}=x_{i1},\ldots,X_{ip}=x_{ip})\sim\mathrm{B}(N_i,p(x_{i1},\ldots,x_{ip})),\quad i=1,\ldots,n.$$
Let’s see an example of binomial regression that illustrates the particular usage of glm() in this case. The example is a data application from Wood (2006) featuring different binomial sizes. It employs the heart (download) dataset. The goal is to investigate whether the level of creatinine kinase present in the blood, ck, is a good diagnostic for determining if a patient is likely to have a future heart attack. The number of patients that did not have a heart attack (ok) and that had a heart attack (ha) was established after ck was measured. In total, there are $326$ patients that have been aggregated into $12$ categories162 of different sizes that have been created according to the average level of ck. Table 5.2 shows the data.
# Read data
heart <- read.table("heart.txt", header = TRUE)
# Sizes for each observation (Ni's)
heart$Ni <- heart$ok + heart$ha
# Proportions of patients with heart attacks
heart$prop <- heart$ha / (heart$ha + heart$ok)
ck | ha | ok | Ni | prop |
---|---|---|---|---|
20 | 2 | 88 | 90 | 0.022 |
60 | 13 | 26 | 39 | 0.333 |
100 | 30 | 8 | 38 | 0.789 |
140 | 30 | 5 | 35 | 0.857 |
180 | 21 | 0 | 21 | 1.000 |
220 | 19 | 1 | 20 | 0.950 |
260 | 18 | 1 | 19 | 0.947 |
300 | 13 | 1 | 14 | 0.929 |
340 | 19 | 1 | 20 | 0.950 |
380 | 15 | 0 | 15 | 1.000 |
420 | 7 | 0 | 7 | 1.000 |
460 | 8 | 0 | 8 | 1.000 |
# Plot of proportions versus ck: twelve observations, each requiring
# Ni patients to determine the proportion
plot(heart$ck, heart$prop, xlab = "Creatinine kinase level",
ylab = "Proportion of heart attacks")
# Fit binomial regression: recall the cbind() to pass the number of successes
# and failures
heart1 <- glm(cbind(ha, ok) ~ ck, family = binomial, data = heart)
summary(heart1)
##
## Call:
## glm(formula = cbind(ha, ok) ~ ck, family = binomial, data = heart)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.08184 -1.93008 0.01652 0.41772 2.60362
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.758358 0.336696 -8.192 2.56e-16 ***
## ck 0.031244 0.003619 8.633 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 271.712 on 11 degrees of freedom
## Residual deviance: 36.929 on 10 degrees of freedom
## AIC: 62.334
##
## Number of Fisher Scoring iterations: 6
# Alternatively: put proportions as responses, but then it is required to
# inform about the binomial size of each observation
heart1 <- glm(prop ~ ck, family = binomial, data = heart, weights = Ni)
summary(heart1)
##
## Call:
## glm(formula = prop ~ ck, family = binomial, data = heart, weights = Ni)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.08184 -1.93008 0.01652 0.41772 2.60362
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.758358 0.336696 -8.192 2.56e-16 ***
## ck 0.031244 0.003619 8.633 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 271.712 on 11 degrees of freedom
## Residual deviance: 36.929 on 10 degrees of freedom
## AIC: 62.334
##
## Number of Fisher Scoring iterations: 6
# Add fitted line
ck <- 0:500
newdata <- data.frame(ck = ck)
logistic <- function(eta) 1 / (1 + exp(-eta))
lines(ck, logistic(cbind(1, ck) %*% heart1$coefficients))
# It seems that a polynomial fit could better capture the "wiggly" pattern
# of the data
heart2 <- glm(prop ~ poly(ck, 2, raw = TRUE), family = binomial, data = heart,
weights = Ni)
heart3 <- glm(prop ~ poly(ck, 3, raw = TRUE), family = binomial, data = heart,
weights = Ni)
heart4 <- glm(prop ~ poly(ck, 4, raw = TRUE), family = binomial, data = heart,
weights = Ni)
# Best fit given by heart3
BIC(heart1, heart2, heart3, heart4)
## df BIC
## heart1 2 63.30371
## heart2 3 44.27018
## heart3 4 35.59736
## heart4 5 37.96360
summary(heart3)
##
## Call:
## glm(formula = prop ~ poly(ck, 3, raw = TRUE), family = binomial,
## data = heart, weights = Ni)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.99572 -0.08966 0.07468 0.17815 1.61096
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.786e+00 9.268e-01 -6.243 4.30e-10 ***
## poly(ck, 3, raw = TRUE)1 1.102e-01 2.139e-02 5.153 2.57e-07 ***
## poly(ck, 3, raw = TRUE)2 -4.649e-04 1.381e-04 -3.367 0.00076 ***
## poly(ck, 3, raw = TRUE)3 6.448e-07 2.544e-07 2.535 0.01125 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 271.7124 on 11 degrees of freedom
## Residual deviance: 4.2525 on 8 degrees of freedom
## AIC: 33.658
##
## Number of Fisher Scoring iterations: 6
# All fits together
lines(ck, logistic(cbind(1, poly(ck, 2, raw = TRUE)) %*% heart2$coefficients),
col = 2)
lines(ck, logistic(cbind(1, poly(ck, 3, raw = TRUE)) %*% heart3$coefficients),
col = 3)
lines(ck, logistic(cbind(1, poly(ck, 4, raw = TRUE)) %*% heart4$coefficients),
col = 4)
legend("bottomright", legend = c("Linear", "Quadratic", "Cubic", "Quartic"),
col = 1:4, lwd = 2)
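The estimated probability of a future heart attack at a given level of ck follows from predict(). A minimal sketch (the value ck = 100 is just illustrative):
# Estimated probability of heart attack at ck = 100 under the cubic fit
predict(heart3, newdata = data.frame(ck = 100), type = "response")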
Estimation by maximum likelihood
The estimation of $\boldsymbol{\beta}$ by MLE can be done in a unified framework, for all generalized linear models, thanks to the exponential family (5.10). Given a sample $(\mathbf{x}_1,Y_1),\ldots,(\mathbf{x}_n,Y_n)$,163 and employing a canonical link function (5.13), we have that

$$\theta_i=\eta_i=\beta_0+\beta_1x_{i1}+\cdots+\beta_px_{ip}=\mathbf{x}_i'\boldsymbol{\beta},$$

where $\mathbf{x}_i'=(1,x_{i1},\ldots,x_{ip})$. Then, the log-likelihood is

$$\ell(\boldsymbol{\beta})=\sum_{i=1}^n\left[\frac{Y_i\theta_i-b(\theta_i)}{a(\phi)}+c(Y_i,\phi)\right].$$
Differentiating with respect to $\boldsymbol{\beta}$ gives

$$\frac{\partial\ell(\boldsymbol{\beta})}{\partial\boldsymbol{\beta}}=\frac{1}{a(\phi)}\sum_{i=1}^n\left(Y_i-b'(\theta_i)\right)\mathbf{x}_i,$$

which, exploiting the properties of the exponential family, can be reduced to

$$\frac{\partial\ell(\boldsymbol{\beta})}{\partial\boldsymbol{\beta}}=\frac{1}{a(\phi)}\sum_{i=1}^n\left(Y_i-\mu_i\right)\mathbf{x}_i,\tag{5.21}$$

where now $\mathbf{x}_i'$ represents the $i$-th row of the design matrix $\mathbf{X}$ and $\mu_i=b'(\theta_i)$. Solving explicitly the system of equations $\frac{\partial\ell(\boldsymbol{\beta})}{\partial\boldsymbol{\beta}}=\mathbf{0}$ is not possible in general and a numerical procedure is required. Newton–Raphson is usually employed, which is based on obtaining $\boldsymbol{\beta}^{\mathrm{new}}$ from the linear system164

$$\frac{\partial^2\ell(\boldsymbol{\beta}^{\mathrm{old}})}{\partial\boldsymbol{\beta}\,\partial\boldsymbol{\beta}'}\left(\boldsymbol{\beta}^{\mathrm{new}}-\boldsymbol{\beta}^{\mathrm{old}}\right)=-\frac{\partial\ell(\boldsymbol{\beta}^{\mathrm{old}})}{\partial\boldsymbol{\beta}}.\tag{5.22}$$
A simplifying trick is to consider the expectation of $\frac{\partial^2\ell(\boldsymbol{\beta}^{\mathrm{old}})}{\partial\boldsymbol{\beta}\,\partial\boldsymbol{\beta}'}$ in (5.22), rather than its actual value. By doing so, we can arrive at a neat iterative algorithm called Iterative Reweighted Least Squares (IRLS). We use the following well-known property of the Fisher information matrix of the MLE theory:

$$\mathbb{E}\left[\frac{\partial^2\ell(\boldsymbol{\beta})}{\partial\boldsymbol{\beta}\,\partial\boldsymbol{\beta}'}\right]=-\mathbb{E}\left[\frac{\partial\ell(\boldsymbol{\beta})}{\partial\boldsymbol{\beta}}\frac{\partial\ell(\boldsymbol{\beta})}{\partial\boldsymbol{\beta}'}\right].$$

Then, it can be seen that165

$$\mathbb{E}\left[\frac{\partial^2\ell(\boldsymbol{\beta})}{\partial\boldsymbol{\beta}\,\partial\boldsymbol{\beta}'}\right]=-\frac{1}{a(\phi)}\mathbf{X}'\mathbf{W}\mathbf{X},\tag{5.23}$$

where $v_i:=b''(\theta_i)$ and $\mathbf{W}:=\mathrm{diag}(v_1,\ldots,v_n)$. Using this notation and from (5.21),

$$\frac{\partial\ell(\boldsymbol{\beta})}{\partial\boldsymbol{\beta}}=\frac{1}{a(\phi)}\mathbf{X}'(\mathbf{Y}-\boldsymbol{\mu}),\tag{5.24}$$

with $\mathbf{Y}:=(Y_1,\ldots,Y_n)'$ and $\boldsymbol{\mu}:=(\mu_1,\ldots,\mu_n)'$. Substituting (5.23) and (5.24) in (5.22), we have:

$$\boldsymbol{\beta}^{\mathrm{new}}=\boldsymbol{\beta}^{\mathrm{old}}+(\mathbf{X}'\mathbf{W}\mathbf{X})^{-1}\mathbf{X}'(\mathbf{Y}-\boldsymbol{\mu})=(\mathbf{X}'\mathbf{W}\mathbf{X})^{-1}\mathbf{X}'\mathbf{W}\mathbf{z},\tag{5.25}$$

where $\mathbf{z}:=\mathbf{X}\boldsymbol{\beta}^{\mathrm{old}}+\mathbf{W}^{-1}(\mathbf{Y}-\boldsymbol{\mu})$ is the working vector.
As a consequence, fitting a generalized linear model by IRLS amounts to performing a series of weighted linear models with changing weights and responses given by the working vector. IRLS can be summarized as:
- Set $\boldsymbol{\beta}^{\mathrm{old}}$ with some initial estimation.
- Compute $\mathbf{W}$ and $\mathbf{z}$.
- Compute $\boldsymbol{\beta}^{\mathrm{new}}$ using (5.25).
- Set $\boldsymbol{\beta}^{\mathrm{old}}$ as $\boldsymbol{\beta}^{\mathrm{new}}$.
- Iterate Steps 2–4 until convergence, then set $\hat{\boldsymbol{\beta}}=\boldsymbol{\beta}^{\mathrm{new}}$.
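A minimal R sketch of these steps for the logistic model (canonical logit link) is given below. The function name and arguments are illustrative, not part of any package; applied to the y1 ~ x data of Figure 5.5, it should replicate the coefficients reported by glm() up to numerical tolerance.
# IRLS for logistic regression; X is the design matrix (with a column of ones)
irlsLogistic <- function(X, y, tol = 1e-8, maxIter = 25) {
  beta <- rep(0, ncol(X)) # Step 1: initial estimate
  for (k in 1:maxIter) {
    eta <- drop(X %*% beta) # Linear predictor
    mu <- 1 / (1 + exp(-eta)) # Fitted probabilities
    W <- diag(mu * (1 - mu)) # Step 2: weights, since b''(theta_i) = mu_i * (1 - mu_i)
    z <- eta + (y - mu) / (mu * (1 - mu)) # Step 2: working vector
    betaNew <- drop(solve(t(X) %*% W %*% X, t(X) %*% W %*% z)) # Step 3: weighted LS, (5.25)
    done <- max(abs(betaNew - beta)) < tol
    beta <- betaNew # Step 4: update
    if (done) break # Step 5: convergence
  }
  beta
}
# Check against glm() on the y1 ~ x data
irlsLogistic(X = cbind(1, dataMle$x), y = dataMle$y1)
mod$coefficients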
References
Recall that a binomial variable $\mathrm{B}(N,p)$, with size $N$ and probability $p$, is obtained by summing $N$ independent $\mathrm{Ber}(p)$ variables, so $\mathrm{Ber}(p)$ is the same distribution as $\mathrm{B}(1,p)$.↩︎
Do not confuse this $p$ with the number of predictors in the model, also represented by $p$. The context should make unambiguous the use of $p$.↩︎
The fact that the logistic function is a cdf allows remembering that the logistic is to be applied to map $\mathbb{R}$ into $[0,1]$, as opposed to the logit function.↩︎
And also, as we will see later, because it is the canonical link function.↩︎
Consequently, the name “odds” used in this context is singular, as it refers to a single ratio.↩︎
Recall that (traditionally) the result of a bet is binary: one either wins or loses it.↩︎
For example, if a horse has probability $2/3$ of winning a race ($\mathbb{P}[\text{win}]=2/3$), then the odds of the horse is $\frac{2/3}{1/3}=2$. This means that the horse has a probability of winning that is twice as large as the probability of losing. This is sometimes written as a $2:1$ or $2/1$ (spelled “two-to-one”).↩︎
For the previous example: if the odds of the horse was $2$, then the probability of winning would be $\frac{2}{1+2}=\frac{2}{3}$.↩︎
As in the linear model, we assume the randomness comes from the error present in $Y$ once $X_1,\ldots,X_p$ are given, not from the predictors, and we therefore denote by $x_{ij}$ the $i$-th observation of $X_j$.↩︎
Section 5.7 discusses in detail the assumptions of generalized linear models.↩︎
Notice that this approach is very different from directly transforming the response as $g(Y)$, as outlined in Section 3.5.1. Indeed, in generalized linear models one transforms $\mathbb{E}[Y|X_1=x_1,\ldots,X_p=x_p]$, not $Y$. Of course, $g(\mathbb{E}[Y|X_1=x_1,\ldots,X_p=x_p])\neq\mathbb{E}[g(Y)|X_1=x_1,\ldots,X_p=x_p]$ in general.↩︎
Not to be confused with the exponential distribution which is a member of the exponential family.↩︎
This is the so-called canonical form of the exponential family. Generalizations of the family are possible, though we do not consider them.↩︎
The pdf of a $\mathrm{Pois}(\lambda)$ is $\mathbb{P}[Y=y]=\frac{e^{-\lambda}\lambda^y}{y!}$ for $y=0,1,2,\ldots$ and $\lambda>0$ (the pdf is zero otherwise). The expectation is $\mathbb{E}[Y]=\lambda$.↩︎
If the argument is not positive, then the probability assigned is zero. This delicate case may complicate the estimation of the model. Valid starting values for $\boldsymbol{\beta}$ are required.↩︎
Note this situation is very different from logistic regression, for which we either have observations with the values $0$ or $1$. In binomial regression, we can naturally have proportions.↩︎
Clearly, $\mathbb{E}\left[\left.\frac{Y_i}{N}\,\right|X_{i1}=x_{i1},\ldots,X_{ip}=x_{ip}\right]=p(x_{i1},\ldots,x_{ip})$ because $\mathbb{E}[Y_i|X_{i1}=x_{i1},\ldots,X_{ip}=x_{ip}]=Np(x_{i1},\ldots,x_{ip})$.↩︎
The sample size here is not the total number of patients, $\sum_{i=1}^{12}N_i=326$. There are $12$ binomial sizes $N_i$, one corresponding to each observation, and $n=12$.↩︎
We assume the randomness comes from the error present in $Y_i$ once $\mathbf{x}_i$ is given, not from the predictors. This is implicit in the considered expectations.↩︎
The system stems from a first-order Taylor expansion of the function $\boldsymbol{\beta}\mapsto\frac{\partial\ell(\boldsymbol{\beta})}{\partial\boldsymbol{\beta}}$ about the root $\hat{\boldsymbol{\beta}}$, where $\frac{\partial\ell(\hat{\boldsymbol{\beta}})}{\partial\boldsymbol{\beta}}=\mathbf{0}$.↩︎
Recall that $\mathbb{E}\left[\frac{\partial\ell(\boldsymbol{\beta})}{\partial\boldsymbol{\beta}}\frac{\partial\ell(\boldsymbol{\beta})}{\partial\boldsymbol{\beta}'}\right]=\frac{1}{a(\phi)^2}\mathbf{X}'\mathbb{E}\left[(\mathbf{Y}-\boldsymbol{\mu})(\mathbf{Y}-\boldsymbol{\mu})'\right]\mathbf{X}=\frac{1}{a(\phi)}\mathbf{X}'\mathbf{W}\mathbf{X}$ because of independence.↩︎