Chapter 10 Solutions to Selected Exercises
10.1 Chapter 1
Question 5
y_i = weight of the i-th person, x_{i1} = their age, x_{i2} = sex, x_{i3} = height, x_{i4} = mean daily food intake, x_{i5} = mean daily energy expenditure. Distribution of y_i could be Gaussian. Linear component is \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \beta_4 x_{i4} + \beta_5 x_{i5}.
y_i = proportion of infected mice among the 20 mice in the i-th experiment, x_{ij} = indicator variable for exposure at level j, with j = 1, 2, 3, 4. Distribution of y_i could be rescaled binomial. Linear component is \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \beta_4 x_{i4}.
y_i = number of trips to the supermarket per week, x_{i1} = number of people in household, x_{i2} = household income, x_{i3} = distance to the supermarket. Distribution of y_i could be Poisson. Linear component is \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3}.
10.2 Chapter 2
Question 1b
For log link
We have S(\boldsymbol{\beta}) = \sum_{i=1}^{n_A + n_B} ( y_i - \mu_i ) \, \boldsymbol{x}_i, where \boldsymbol{x}_i = (1, x_i)^T. Note that data in the same group have the same estimated mean since x_i is the same within each group. From the score equation:
\boldsymbol{0} = S(\boldsymbol{\hat{\beta}}) = \sum_{i=1}^{n_A} ( y_i - \hat{\mu}_A ) \begin{pmatrix} 1 \\ 1 \end{pmatrix} + \sum_{i=n_A+1}^{n_A+n_B} ( y_i - \hat{\mu}_B ) \begin{pmatrix} 1 \\ 0 \end{pmatrix}.
This implies:
\sum_{i=1}^{n_A} ( y_i - \hat{\mu}_A ) + \sum_{i=n_A+1}^{n_A+n_B} ( y_i - \hat{\mu}_B ) = 0 and
\sum_{i=1}^{n_A} ( y_i - \hat{\mu}_A ) = 0. These two equations imply that \hat{\mu}_A = \frac{1}{n_A} \sum_{i=1}^{n_A} y_i and \hat{\mu}_B = \frac{1}{n_B} \sum_{i=n_A+1}^{n_A+n_B} y_i.
For general link
With a general link, data in the same group still have the same estimated mean and linear predictor. Using the general form of the score function, we have:
\boldsymbol{0} = S(\boldsymbol{\hat{\beta}}) = \frac{1}{\phi} \sum_{i=1}^{n_A} ( y_i - \hat{\mu}_A ) \frac{h'(\hat{\eta}_A)}{\mathcal{V}(\hat{\mu}_A)} \begin{pmatrix} 1 \\ 1 \end{pmatrix} + \frac{1}{\phi} \sum_{i=n_A+1}^{n_A+n_B} ( y_i - \hat{\mu}_B ) \frac{h'(\hat{\eta}_B)}{\mathcal{V}(\hat{\mu}_B)} \begin{pmatrix} 1 \\ 0 \end{pmatrix}. This implies:
\sum_{i=1}^{n_A} ( y_i - \hat{\mu}_A ) \frac{h'(\hat{\eta}_A)}{\mathcal{V}(\hat{\mu}_A)} + \sum_{i=n_A+1}^{n_A+n_B} ( y_i - \hat{\mu}_B ) \frac{h'(\hat{\eta}_B)}{\mathcal{V}(\hat{\mu}_B)} = 0 and
\sum_{i=1}^{n_A} ( y_i - \hat{\mu}_A ) \frac{h'(\hat{\eta}_A)}{\mathcal{V}(\hat{\mu}_A)} = 0. These two equations also imply that \hat{\mu}_A = \frac{1}{n_A} \sum_{i=1}^{n_A} y_i and \hat{\mu}_B = \frac{1}{n_B} \sum_{i=n_A+1}^{n_A+n_B} y_i.
Question 2
- This link function tries to capture E[y_i] = \mu_i = (\boldsymbol{\beta}^T \boldsymbol{x}_i)^2.
- \displaystyle S(\boldsymbol{\beta}) = \frac{2}{\sigma^2} \sum_i \left( y_i - (\boldsymbol{\beta}^T \boldsymbol{x}_i)^2 \right) (\boldsymbol{\beta}^T \boldsymbol{x}_i) \boldsymbol{x}_i.
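As a small illustration, this score function could be coded in R as follows (the function name and arguments are placeholders, not part of the exercise):
score <- function(beta, X, y, sigma2) {
  eta <- drop(X %*% beta)                              # linear predictors beta^T x_i
  drop((2 / sigma2) * t(X) %*% ((y - eta^2) * eta))    # (2/sigma^2) * sum_i (y_i - eta_i^2) eta_i x_i
}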
Question 3
- First we identify the EDF components. An exponential distribution is an EDF with \phi = 1, \theta = -\lambda, and b(\theta) = -\ln(-\theta). This gives \mu = -1/\theta = 1/\lambda and \text{Var}[y] = 1/\theta^{2} = \mu^{2} as expected. Recall that \text{Var}[y] = \phi \;\mathcal{V}(\mu), defining \mathcal{V}. The response function h is \mu = h(\eta) = e^{\eta}, so that \mathcal{V}(\mu) = \mu^{2} = e^{2\eta}. Putting all this together in the standard formula gives
S(\boldsymbol{\beta}) = \sum_i (y_i - e^{\boldsymbol{\beta}^T \boldsymbol{x}_i}) \;e^{-2 \boldsymbol{\beta}^T \boldsymbol{x}_i} \;e^{\boldsymbol{\beta}^T \boldsymbol{x}_i} \boldsymbol{x}_i = \sum_i \Biggl( {y_i \over e^{\boldsymbol{\beta}^T \boldsymbol{x}_i}} - 1 \Biggr) \boldsymbol{x}_i.
- Differentiating gives
F_{\text{obs}}(\boldsymbol{\beta}) = -{\partial S\over \partial\boldsymbol{\beta}} = \sum_i {y_i \over e^{\boldsymbol{\beta}^T \boldsymbol{x}_i}} \;\boldsymbol{x}_{i} \boldsymbol{x}_{i}^{T}.
- Expectation will only act on the y_i in each term, giving \mu_i, so
F(\boldsymbol{\beta}) = \sum_i {\mu_i \over e^{\boldsymbol{\beta}^T \boldsymbol{x}_i}} \;\boldsymbol{x}_i \boldsymbol{x}_i^T = \sum_i \boldsymbol{x}_i \boldsymbol{x}_i^T.
Notice that this is not even a function of \boldsymbol{\beta}.
Question 5
We have \eta_i = g(\mu_i) = \log(\mu_i) = \log(n_i \lambda_i) = \log n_i + \boldsymbol{\beta}^T \boldsymbol{x}_i. We can also think of this model as a standard Poisson GLM with \eta_i = \tilde{\boldsymbol{\beta}}^T \tilde{\boldsymbol{x}}_i, where \tilde{\boldsymbol{\beta}} = \begin{pmatrix} 1 \\ \boldsymbol{\beta} \end{pmatrix} and \tilde{\boldsymbol{x}}_i = \begin{pmatrix} \log n_i \\ \boldsymbol{x}_i \end{pmatrix}. See Section 3.5.3 in the textbook for an example of this model.
This model is thus a standard Poisson GLM with an additional \log n_i term (an offset) in the linear predictor, which does not affect the general forms of the score function and Fisher information (you should verify this). Thus, we can use the expressions for the score function and Fisher information in the lecture notes. Assume the data are ungrouped (grouped data would in any case be unrealistic here, since we typically cannot obtain several responses for each population). Since log is the natural link for this model, we have:
S(\boldsymbol{\beta}) = \sum_i (y_i - \mu_i) \boldsymbol{x}_i = \sum_i \left( y_i - n_i \exp(\boldsymbol{\beta}^T \boldsymbol{x}_i) \right) \boldsymbol{x}_i. Furthermore, for the natural link, the observed Fisher information and expected Fisher information are the same, with
F_{\text{obs}}(\boldsymbol{\beta}) = F(\boldsymbol{\beta}) = \sum_i h'(\eta_i) \boldsymbol{x}_i\boldsymbol{x}_i^T = \sum_i n_i \exp(\boldsymbol{\beta}^T \boldsymbol{x}_i) \boldsymbol{x}_i\boldsymbol{x}_i^T.
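In practice, such a rate model is usually fitted with the \log n_i term supplied as an offset. A minimal R sketch (with placeholder names dat, y, x, and n for the data frame, response, covariate, and population sizes) is:
glm(y ~ x, family = poisson(link = log), offset = log(n), data = dat)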
Question 6
The score function and Fisher information in matrix notation are:
\begin{align} \boldsymbol{S} &= \boldsymbol{X}^{T} \boldsymbol{D} \boldsymbol{\Sigma}^{-1} (\boldsymbol{Y} - \boldsymbol{\mu}) \\ \boldsymbol{F} &= \boldsymbol{X}^{T}\boldsymbol{D}^{T}\boldsymbol{\Sigma}^{-1}\boldsymbol{D}\boldsymbol{X} \end{align}
where
\begin{align} \boldsymbol{X} &= \begin{pmatrix} 1 & 0 \\ 1 & 1 \\ 1 & 2 \\ 1 & 3 \end{pmatrix} \\ \boldsymbol{Y} &= \begin{pmatrix} 12 \\ 10 \\ 8 \\ 7 \end{pmatrix} \\ \boldsymbol{\mu} &= \begin{pmatrix} e^{\beta_0} \\ e^{\beta_0 + \beta_1} \\ e^{\beta_0 + 2 \beta_1} \\ e^{\beta_0 + 3 \beta_1} \end{pmatrix} \\ \boldsymbol{D} &= \begin{pmatrix} e^{\beta_0} & 0 & 0 & 0 \\ 0 & e^{\beta_0 + \beta_1} & 0 & 0 \\ 0 & 0 & e^{\beta_0 + 2 \beta_1} & 0 \\ 0 & 0 & 0 & e^{\beta_0 + 3 \beta_1} \end{pmatrix} \\ \boldsymbol{\Sigma} &= \begin{pmatrix} \frac{e^{\beta_0}}{20} & 0 & 0 & 0 \\ 0 & \frac{e^{\beta_0 + \beta_1}}{20} & 0 & 0 \\ 0 & 0 & \frac{e^{\beta_0 + 2 \beta_1}}{10} & 0 \\ 0 & 0 & 0 & \frac{e^{\beta_0 + 3 \beta_1}}{10} \end{pmatrix}. \end{align}
Furthermore, since log link is the natural link and \phi=1 for the Poisson GLM, we can further simplify:
\begin{align} \boldsymbol{S} &= \boldsymbol{X}^{T} \boldsymbol{G} (\boldsymbol{Y} - \boldsymbol{\mu}) \\ \boldsymbol{F} & = \boldsymbol{X}^{T}\boldsymbol{G}^{T}\boldsymbol{\Sigma}\boldsymbol{G}\boldsymbol{X} \end{align}
where \boldsymbol{G} is the grouping matrix: \boldsymbol{G} = \begin{pmatrix} 20 & 0 & 0 & 0 \\ 0 & 20 & 0 & 0 \\ 0 & 0 & 10 & 0 \\ 0 & 0 & 0 & 10 \end{pmatrix}.
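These expressions can be evaluated numerically for any candidate \boldsymbol{\beta}; for instance, in R (the value of beta below is purely illustrative):
X <- cbind(1, 0:3)                          # design matrix
Y <- c(12, 10, 8, 7)                        # grouped responses
n <- c(20, 20, 10, 10)                      # group sizes
G <- diag(n)                                # grouping matrix
beta <- c(2.5, -0.2)                        # illustrative parameter value
mu <- drop(exp(X %*% beta))                 # mean vector
Sigma <- diag(mu / n)                       # covariance matrix of the grouped responses
S <- t(X) %*% G %*% (Y - mu)                # score
Finfo <- t(X) %*% G %*% Sigma %*% G %*% X   # Fisher information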
Question 9
- From \boldsymbol{S} = \boldsymbol{X}^{T} \boldsymbol{D} \boldsymbol{\Sigma}^{-1} (\boldsymbol{Y} - \boldsymbol{\mu}), we have:
\begin{align} E[\boldsymbol{S}] = \boldsymbol{X}^{T} \boldsymbol{D} \boldsymbol{\Sigma}^{-1} (E[\boldsymbol{Y}] - \boldsymbol{\mu}) = \boldsymbol{X}^{T} \boldsymbol{D} \boldsymbol{\Sigma}^{-1} (\boldsymbol{\mu} - \boldsymbol{\mu}) = 0 \end{align}
and
\begin{align} \text{Var}[\boldsymbol{S}] &= \boldsymbol{X}^{T} \boldsymbol{D} \boldsymbol{\Sigma}^{-1} \text{Var}(\boldsymbol{Y} - \boldsymbol{\mu}) \boldsymbol{\Sigma}^{-1} \boldsymbol{D} \boldsymbol{X} \\ &= \boldsymbol{X}^{T} \boldsymbol{D} \boldsymbol{\Sigma}^{-1} \text{Var}(\boldsymbol{Y}) \boldsymbol{\Sigma}^{-1} \boldsymbol{D} \boldsymbol{X} \\ &= \boldsymbol{X}^{T} \boldsymbol{D} \boldsymbol{\Sigma}^{-1} \boldsymbol{\Sigma} \boldsymbol{\Sigma}^{-1} \boldsymbol{D} \boldsymbol{X} \\ &= \boldsymbol{X}^{T} \boldsymbol{D} \boldsymbol{\Sigma}^{-1} \boldsymbol{D} \boldsymbol{X}. \end{align}
- Taking the derivative of \boldsymbol{S} and applying the product rule for derivatives, we have:
\begin{align} \frac{\partial \boldsymbol{S}}{\partial \boldsymbol{\beta}} &= \boldsymbol{X}^{T} \frac{\partial }{\partial \boldsymbol{\beta}}(\boldsymbol{D} \boldsymbol{\Sigma}^{-1}) (\boldsymbol{Y} - \boldsymbol{\mu}) + \boldsymbol{X}^{T} \boldsymbol{D} \boldsymbol{\Sigma}^{-1} \frac{\partial }{\partial \boldsymbol{\beta}} (\boldsymbol{Y} - \boldsymbol{\mu}) \\ &= \boldsymbol{X}^{T} \frac{\partial }{\partial \boldsymbol{\beta}}(\boldsymbol{D} \boldsymbol{\Sigma}^{-1}) (\boldsymbol{Y} - \boldsymbol{\mu}) - \boldsymbol{X}^{T} \boldsymbol{D} \boldsymbol{\Sigma}^{-1} \frac{\partial \boldsymbol{\mu}}{\partial \boldsymbol{\beta}}. \end{align}
Thus,
\begin{align} \boldsymbol{F} = E \left[- \frac{\partial \boldsymbol{S}}{\partial \boldsymbol{\beta}} \right] &= - E \left[ \boldsymbol{X}^{T} \frac{\partial }{\partial \boldsymbol{\beta}}(\boldsymbol{D} \boldsymbol{\Sigma}^{-1}) ( \boldsymbol{Y} - \boldsymbol{\mu} ) \right] + E \left[ \boldsymbol{X}^{T} \boldsymbol{D} \boldsymbol{\Sigma}^{-1} \frac{\partial \boldsymbol{\mu}}{\partial \boldsymbol{\beta}} \right] \\ &= - \boldsymbol{X}^{T} \frac{\partial }{\partial \boldsymbol{\beta}}( \boldsymbol{D} \boldsymbol{\Sigma}^{-1}) E [ \boldsymbol{Y} - \boldsymbol{\mu} ] + \boldsymbol{X}^{T} \boldsymbol{D} \boldsymbol{\Sigma}^{-1} \frac{\partial \boldsymbol{\mu}}{\partial \boldsymbol{\beta}} \\ &= \boldsymbol{X}^{T} \boldsymbol{D} \boldsymbol{\Sigma}^{-1} \frac{\partial \boldsymbol{\mu}}{\partial \boldsymbol{\eta}} \frac{\partial \boldsymbol{\eta}}{\partial \boldsymbol{\beta}} \\ &= \boldsymbol{X}^{T} \boldsymbol{D} \boldsymbol{\Sigma}^{-1} \boldsymbol{D} \boldsymbol{X}. \end{align}
10.3 Chapter 3
Question 2a
Let x_{i1} = {\tt time}_i, x_{i2} = {\tt I(cos(2*pi*time/12))}_i, and x_{i3} = {\tt I(sin(2*pi*time/12))}_i, where the subscript i indicates the i-th data point. We perform the test with
\begin{aligned} & \mathcal{H}_0: \eta_i = \beta_0 + \beta_1 x_{i1} \\ \text{against } \quad & \mathcal{H}_1: \eta_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3}. \end{aligned} Let \boldsymbol{\beta} = (\beta_0, \beta_1, \beta_2, \beta_3)^T, then
\begin{aligned} C & = \begin{pmatrix} 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix} \\ \gamma & = \begin{pmatrix} 0 \\ 0 \end{pmatrix}. \end{aligned}
The variance that we are looking for is:
C \, F(\hat{\boldsymbol\beta})^{-1} \, C^T = C \, \text{Var} [ \hat{\boldsymbol\beta} ] \, C^T = \text{Var} \left[ \begin{pmatrix} \hat{\beta_2} \\ \hat{\beta_3} \end{pmatrix} \right].
Thus, the Wald statistic is:
\begin{aligned} W &= (C \hat{\boldsymbol\beta} - \gamma)^T \left( C F(\hat{\boldsymbol\beta})^{-1} C^T \right)^{-1} (C \hat{\boldsymbol\beta} - \gamma) \\ &= (\hat{\beta_2} \;\; \hat{\beta_3}) \, \text{Var} \left[ \begin{pmatrix} \hat{\beta_2} \\ \hat{\beta_3} \end{pmatrix} \right]^{-1} \begin{pmatrix} \hat{\beta_2} \\ \hat{\beta_3} \end{pmatrix}. \end{aligned}
From R, we can get (\hat{\beta_2} \;\; \hat{\beta_3}) and the corresponding covariance matrix by:
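For example, assuming the fitted model object from Exercise 2a is available as polio1.glm, the estimates and the corresponding covariance block can be extracted with:
beta_hat <- coef(polio1.glm)[3:4]    # the cos and sin coefficients
V <- vcov(polio1.glm)[3:4, 3:4]      # their estimated covariance matrix
beta_hat
V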
## I(cos(2 * pi * time/12)) I(sin(2 * pi * time/12))
## 0.1812537 -0.4231874
## I(cos(2 * pi * time/12)) I(sin(2 * pi * time/12))
## I(cos(2 * pi * time/12)) 0.0092468054 -0.0002083338
## I(sin(2 * pi * time/12)) -0.0002083338 0.0095238337
Thus, the Wald statistic can be computed by:
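Continuing the sketch above, one way to do this is:
W <- t(beta_hat) %*% solve(V) %*% beta_hat
W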
## [,1]
## [1,] 22.00497
Under \mathcal{H}_0, this statistic is \chi^{2}(2) distributed with the p-value:
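This p-value can be obtained with, for instance:
pchisq(W, df = 2, lower.tail = FALSE)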
## [,1]
## [1,] 1.666024e-05
Since this p-value is very small, we reject \mathcal{H}_0 at both 5% and 1% significance levels.
Note that we can check our result by using the wald.test function from the mdscore package (to be installed if necessary), e.g.,
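(Here polio1.glm is again assumed to be the fitted model from Exercise 2a, with the cosine and sine terms as its third and fourth coefficients; check ?wald.test for the exact argument names.)
library(mdscore)
wald.test(polio1.glm, terms = 3:4)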
## Warning: package 'mdscore' was built under R version 4.3.3
## $W
## [1] 22.00497
##
## $pvalue
## [1] 1.666024e-05
##
## attr(,"class")
## [1] "wald.test"
Question 3
- From exercise 3 of Chapter 2, when \boldsymbol{x}_i = (1, \texttt{wbc}_i)^T, the Fisher information becomes:
F = \sum_i \begin{pmatrix} 1 & \texttt{wbc}_i \\ \texttt{wbc}_i & \texttt{wbc}_i^2 \end{pmatrix}.
Using the data, we can find:
F = \begin{pmatrix} 17.0000 & 69.6300 \\ 69.6300 & 291.4571 \end{pmatrix}.
Its inverse is:
F^{-1} = \begin{pmatrix} 2.7384 & -0.6542 \\ -0.6542 & 0.1597 \end{pmatrix}.
- For a new value \texttt{wbc}_0 = 3, the estimated value of the linear predictor is given by \hat\eta_0 = \hat{\boldsymbol\beta}^T \boldsymbol{x}_0 = \hat{\beta}_1 + 3 \hat{\beta}_2 = 5.150. The estimated expected value of \texttt{time} is thus e^{5.15} = 172.4.
From the value of F^{-1}, we find the standard error of this value to be:
SE = \sqrt{ \boldsymbol{x}_0^T F^{-1} \boldsymbol{x}_0 } = \sqrt{ \begin{pmatrix} 1 & 3 \end{pmatrix} \begin{pmatrix} 2.7384 & -0.6542 \\ -0.6542 & 0.1597 \end{pmatrix} \begin{pmatrix} 1 \\ 3 \end{pmatrix} } = \sqrt{0.251} = 0.5.
Thus the normal-approximation confidence interval on the linear predictor scale is: CI = 5.15 \pm 1.96 \times 0.5 = [4.17, 6.13].
and thus on the response scale, the confidence interval is: CI = \exp([4.17, 6.13]) = [64.6, 460].
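As a numerical check, these steps can be reproduced in R from the quantities above:
Finfo <- matrix(c(17.0000, 69.6300, 69.6300, 291.4571), nrow = 2)   # Fisher information
x0 <- c(1, 3)                                     # covariate vector for wbc_0 = 3
eta0 <- 5.150                                     # estimated linear predictor
se <- sqrt(drop(t(x0) %*% solve(Finfo) %*% x0))   # standard error of eta0
exp(eta0 + c(-1, 1) * 1.96 * se)                  # confidence interval on the response scale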
- The value of \hat\beta_2 implies that \mu = e^{\beta_1 - 1.109 \, {\tt wbc}}. Thus, as the white blood cell count increases, the expected survival time decreases exponentially. In fact, an increase of 1 in the count multiplies the expected survival time by e^{-1.109} = 0.33, i.e. reduces it by about two-thirds.
As it is not specified here which test to use, it is acceptable to apply any of the Wald test, z-test, or t-test, or even to conduct a test via a confidence interval for {\beta}_2. We can simply conduct the z-test, remembering that \hat{\boldsymbol\beta} - \boldsymbol\beta is asymptotically normally distributed with mean 0 and covariance F^{-1}(\hat{\boldsymbol\beta}). From the value of F^{-1}, we have that under \mathcal{H}_0, \text{Var}[\hat\beta_2] = 0.1597. This gives a z value of -1.109/\sqrt{0.1597} = -2.78. The probability of observing this value or something more extreme in the same direction is then \Phi(-2.78) = 0.00276. This is significant at the 5\% level (and also at the 1\% level).
Question 4
- Let there be n matches and p teams. We can construct a match matrix M \in \mathbb{R}^{n \times p} where each column represents a team and each row a match by assigning:
x_{ij} = \begin{cases} +1 & \mbox{if team } j \mbox{ played at home in match } i \\ -1 & \mbox{if team } j \mbox{ played away in match } i \\ 0 & \mbox{if team } j \mbox{ did not play in match } i \end{cases}
However, notice that:
- x_{i1} = - \sum_{j=2}^p x_{ij}, i.e. the columns of M are linearly dependent, so M is rank-deficient and cannot serve as a design matrix. In particular, the team strengths are not identifiable.
- Further, this matrix does not provide for an intercept.
Hence, we adjust the above matrix into the design matrix X \in \mathbb{R}^{n \times p}, where the first column is (1,\ldots, 1)^T and where the remaining columns are composed of any subset of p-1 columns of M. From the result table, the strength for Arsenal is omitted, which means that the column for Arsenal is not included in X. In this setting, Arsenal is defined to be strength zero and all other strengths are fitted relative to this.
The intercept term represents the home team advantage as it is a constant additive effect to the home team strength.
- The simple model predicts odds on Chelsea winning of:
e^{\hat{\beta}_0 + \hat{\beta}_{\text{Chelsea}} - \hat{\beta}_{\text{Newcastle}}} = e^{-0.284+(-0.498)-(-1.507)} = e^{0.725} = 2.06.
This is quite different from the bookies.
10.4 Chapter 4
Question 2
We will only compare polio.glm and polio1.glm here (see the solutions of Exercise 2a in Chapter 3 for the test hypothesis).
- First, we fit polio3.glm again but with an appropriate order for the variables and then get the analysis of deviance table:
# 14 yearly temperature values, each repeated for the 12 months of that year
temp_data <- rep(c(5.195, 5.138, 5.316, 5.242, 5.094, 5.108, 5.260, 5.153,
                   5.155, 5.231, 5.234, 5.142, 5.173, 5.167), each = 12)
# rescale the temperatures to lie between 0 and 10 and attach them to the data
scaled_temp <- 10 * (temp_data - min(temp_data))/(max(temp_data) - min(temp_data))
uspolio$temp <- scaled_temp
# refit the model with temp entered last, so that the sequential analysis of
# deviance adds it after the trend and seasonal terms
polio3.glm <- glm(cases~time + I(cos(2*pi*time/12)) + I(sin(2*pi*time/12)) +
                  I(cos(2*pi*time/6)) + I(sin(2*pi*time/6)) + temp,
                  family=poisson(link=log), data=uspolio)
anova(polio3.glm)
## Analysis of Deviance Table
##
## Model: poisson, link: log
##
## Response: cases
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev
## NULL 167 343.00
## time 1 9.4538 166 333.55
## I(cos(2 * pi * time/12)) 1 3.4306 165 330.12
## I(sin(2 * pi * time/12)) 1 19.3938 164 310.72
## I(cos(2 * pi * time/6)) 1 21.3637 163 289.36
## I(sin(2 * pi * time/6)) 1 0.5036 162 288.85
## temp 1 12.0192 161 276.84
From the table, the test statistic for \mathcal{H}_0: polio.glm and \mathcal{H}_1: polio1.glm is:
\frac{D({\tt polio.glm}, {\tt polio1.glm})}{\hat{\phi}} = \frac{333.55 - 310.72}{1} = 22.83.
Under \mathcal{H}_0, this statistic is approximately \chi^{2}(2) distributed with the p-value:
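This p-value can be computed directly, for instance:
pchisq(333.55 - 310.72, df = 2, lower.tail = FALSE)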
## [1] 1.102881e-05
Since this p-value is very small, we reject \mathcal{H}_0 at both 5% and 1% significance levels.
- We can check our result above using the following R code, which should give us a very similar p-value.
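Assuming polio.glm and polio1.glm are the fitted models from Chapter 3, the comparison is:
anova(polio.glm, polio1.glm, test = "Chisq")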
## Analysis of Deviance Table
##
## Model 1: cases ~ time
## Model 2: cases ~ time + I(cos(2 * pi * time/12)) + I(sin(2 * pi * time/12))
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 166 333.55
## 2 164 310.72 2 22.825 1.106e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
- If we only have the summary of polio3.glm, we can use the null deviance and residual deviance in the summary to test the null model against polio3.glm.
Question 3
- From the table in Section 4.5.3, the test statistic for \mathcal{H}_0: \mathcal{M}_1 and \mathcal{H}_1: \mathcal{M}_3 is:
\frac{D(\mathcal{M}_1, \mathcal{M}_3)}{\hat{\phi}} = \frac{8.172 - 5.785}{0.266} = 8.97.
Under \mathcal{H}_0, this statistic is approximately \chi^{2}(2) distributed with the p-value:
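For instance, the p-value is obtained by:
pchisq(8.97, df = 2, lower.tail = FALSE)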
## [1] 0.01127689
Thus, we reject \mathcal{H}_0 at the 5% significance level but not at the 1% level.
The full model \mathcal{M}_7 adds further parameters that do not help to reduce the deviance while using up degrees of freedom. As a result, we fail to reject \mathcal{H}_0.
This is because R uses different dispersion estimates for the two anova calls. The first table uses the estimated dispersion of \mathcal{M}_7, while the second table uses the estimated dispersion of \mathcal{M}_2 (the bigger model).
## [1] 0.02258198
## [1] 0.04149395
Question 4
- A = 1, C = 80, D = 72.807 + 5.2574 = 78.064, B = 88.268 - 78.064 = 10.204.
- We compare the models: \begin{aligned} M_0 &: 1 + \texttt{age} + \texttt{sex} \\ M_1 &: 1 + \texttt{age} + \texttt{sex} + \texttt{ward} \end{aligned}
The test statistic is:
\frac{D(M_0, M_1)}{\hat\phi} = \frac{72.807 - 71.719}{0.9856213} = 1.1035.
This is a likelihood ratio statistic, which approximately follows a \chi^2 distribution. The number of degrees of freedom equals the difference in the number of parameters between the two models, which is 2. The relevant quantile is:
\chi^2_{2, 0.05} = 5.99
so we do not reject M_0. There is no evidence that the inclusion of \texttt{ward} leads to a better model compared to the model with only \texttt{age} and \texttt{sex}.
- The dispersion is very close to 1, so the shape parameter (1/dispersion) is also close to 1. Since the Exponential GLM is the special case of the Gamma GLM with dispersion 1 (equivalently, shape parameter 1), one may consider modelling the response variable with an Exponential rather than a Gamma distribution.
10.5 Chapter 5
Question 2
We will only compare polio.glm and polio1.glm here (continuing from the solutions of Exercise 2 in Chapter 4).
- The Pearson estimated dispersion of polio1.glm is:
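One way to compute it, assuming polio1.glm is the fitted Poisson model from before, is via the Pearson residuals:
sum(residuals(polio1.glm, type = "pearson")^2) / df.residual(polio1.glm)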
## [1] 2.347206
The test statistic for \mathcal{H}_0: polio.glm and \mathcal{H}_1: polio1.glm is:
\frac{D({\tt polio.glm}, {\tt polio1.glm})}{\hat{\phi}} = \frac{333.55 - 310.72}{2.347} = 9.727.
Under \mathcal{H}_0, this statistic is approximately \chi^{2}(2) distributed with the p-value:
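The corresponding p-value can be obtained with, e.g.:
pchisq(9.727, df = 2, lower.tail = FALSE)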
## [1] 0.007723405
Thus, we reject \mathcal{H}_0 at both 5% and 1% significance levels.
- We fit the quasi-Poisson models and compare them using the anova function. The p-value here is similar to the one obtained from part (a).
polio.quasi <- glm(cases ~ time, family=quasipoisson(link=log), data=uspolio)
polio1.quasi <- glm(cases ~ time + I(cos(2*pi*time/12)) + I(sin(2*pi*time/12)),
family=quasipoisson(link=log), data=uspolio)
anova(polio.quasi, polio1.quasi, test="Chisq")
## Analysis of Deviance Table
##
## Model 1: cases ~ time
## Model 2: cases ~ time + I(cos(2 * pi * time/12)) + I(sin(2 * pi * time/12))
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 166 333.55
## 2 164 310.72 2 22.825 0.007737 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Question 3
- Note that if we model Y \sim \text{Bin}(m, p), then E(Y) = mp and \mbox{Var}(Y) = mp(1-p). There will be overdispersion if the actual variance of Y is greater than mp(1-p).
Since Y = \sum_{i=1}^m Y_i, the actual variance of Y is:
\mbox{Var}(Y) = \sum_{i=1}^m \mbox{Var}(Y_i) + 2 \sum_{i<j} \mbox{Cov}(Y_i, Y_j).
We know that \mbox{Var}(Y_i) = p(1-p) and \mbox{Cov}(Y_i, Y_j) = \mbox{Corr}(Y_i, Y_j) \sqrt{\mbox{Var}(Y_i)} \sqrt{\mbox{Var}(Y_j)} = \tau p(1-p). Thus,
\mbox{Var}(Y) = mp(1-p) + (m-1) m \, \tau p(1-p).
If \tau > 0 then \mbox{Var}(Y) > mp(1-p), which implies overdispersion.
If \tau < 0 then \mbox{Var}(Y) < mp(1-p), which implies underdispersion. However, we also need \mbox{Var}(Y) > 0, which means \tau > - \frac{1}{m-1}. This case rarely happens in a real poll, but it may happen when the respondents compete with each other to give an affirmative answer.
Now the number of respondents M is random: Y = \sum_{i=1}^M Y_i with E(M) = \mu_M and \text{Var}(M) = \sigma_M^2. Using the law of total expectation, we have:
E(Y) = E(E(Y | M)) = E(E(\sum_{i=1}^M Y_i | M)) = E(M p) = p E(M) = p \mu_M.
Using the law of total variance, the previous computation, and the computation from part (a), we have:
\begin{aligned} \text{Var}(Y) &= E(\text{Var}(Y | M)) + \text{Var}(E(Y|M)) \\ &= E(Mp(1-p) + (M-1) M \tau p(1-p)) + \text{Var}(M p) \\ &= E(Mp(1-p)) + E(M^2 \tau p(1-p)) - E(M \tau p(1-p)) + \text{Var}(M p) \\ &= \mu_M p(1-p) + (\sigma_M^2 + \mu_M^2) \tau p(1-p) - \mu_M \tau p(1-p) + p^2 \sigma_M^2. \end{aligned}
There is overdispersion in the binomial model if \text{Var}(Y) > \mu_M p(1-p), which means:
(\sigma_M^2 + \mu_M^2) \tau p(1-p) - \mu_M \tau p(1-p) + p^2 \sigma_M^2 > 0.
This is equivalent to:
\tau > - \frac{p \sigma_M^2}{(1-p)(\sigma_M^2 + \mu_M^2 - \mu_M)}.
Note that part (a) is a special case of this condition with \sigma_M = 0 and \mu_M = m.
Question 4
a) i. The mean of an EDF is given by \mu = b'(\theta). In this case, we have \mu = -1/\theta.
- The log-likelihood is given by:
\ell(\beta) = \sum_{i=1}^n \left\{ \frac{y_i \theta_i - b(\theta_i)}{\phi} + c(y_i, \phi) \right\}.
In this case, b(\theta) = -\ln(-\theta) = \ln \mu and \phi = 1/\nu. The log-likelihood therefore becomes:
\begin{aligned} \ell(\beta) &= \sum_{i=1}^n \left\{ \frac{ - \frac{y_i}{\mu_i} - \ln \mu_i}{\phi} + \text{const} \right\} = -\frac{1}{\phi} \sum_{i=1}^n \left\{ \frac{y_i}{\mu_i} + \ln \mu_i + \text{const} \right\}. \end{aligned}
- In the saturated model, maximization with respect to the separate \beta_i’s, one for each data point, is the same as maximization with respect to the \mu_i, leading to:
\frac{\partial \ell}{\partial \mu_i} = \frac{1}{\phi} \left\{ \frac{y_i}{\mu_i^2} - \frac{1}{\mu_i} \right\} = 0,
so that its MLE becomes \mu_i = y_i. Thus, the log-likelihood of the saturated model is:
\ell_{\text{sat}} = -\frac{1}{\phi} \sum_{i=1}^n \left\{ 1 + \ln y_i + \text{const} \right\}.
- Using the definition of the deviance, we have:
D = 2 \phi (\ell_{\text{sat}} - \ell(\beta)) = 2 \sum_{i=1}^n \left\{ \frac{y_i}{\mu_i} - \ln \frac{y_i}{\mu_i} - 1 \right\}.
b) The rough (quick-and-dirty) estimate of the dispersion is: \hat{\phi} = \frac{D}{n-p} = \frac{5.109}{25-8} = 0.3005.
c) i. C = \begin{pmatrix} 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \end{pmatrix} \in \mathbb{R}^{2 \times 8}.
ii. C = \begin{pmatrix} 0 & 0 & 0 & 0 & 0 & 1 & -1 & -1 \end{pmatrix} \in \mathbb{R}^{1 \times 8}.
iii. C = \begin{pmatrix} 0 & 1 & 0 & \ldots & 0 \\ 0 & 0 & 1 & \ldots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \ldots & 1 \end{pmatrix} \in \mathbb{R}^{7 \times 8}.
d) The hypothesis is H_0: \beta_2 = 0 and \beta_3 = 0, or H_0 : C \beta = 0.
- Recall that for the Wald test, under H_0 : C \beta = \gamma:
W = (C \hat\beta - \gamma)^T (C F^{-1}(\hat\beta) C^T)^{-1} (C \hat\beta - \gamma) \stackrel{a}{\sim} \chi^2(s),
where C F^{-1}(\hat\beta) C^T = \text{Var}(C \hat\beta). Then we can reject H_0 at significance level \alpha if W > \chi^2_{s, \alpha}.
For this question, \text{Var}(C \hat\beta) = \text{Var}((\hat\beta_2, \hat\beta_3)^T), which can be obtained from the appropriate submatrix of the full covariance:
\text{Var}(C \hat\beta) = \begin{pmatrix} 0.00004767174 & -0.0001699029 \\ -0.0001699029 & 0.0660872869 \end{pmatrix}.
To calculate W, we need its inverse:
\text{Var}(C \hat\beta)^{-1} = \begin{pmatrix} 21170.77 & 54.43 \\ 54.43 & 15.27 \end{pmatrix}.
Now, the vector C \hat\beta = (0.009675, 0.05091)^T, so that:
W = \begin{pmatrix} 0.009675 & 0.05091 \end{pmatrix} \begin{pmatrix} 21170.77 & 54.43 \\ 54.43 & 15.27 \end{pmatrix} \begin{pmatrix} 0.009675 \\ 0.05091 \end{pmatrix} = 2.0749.
Since \chi^2_{2, 0.05} = 5.99, the conclusion is not to reject H_0.
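As a check, the Wald statistic can be reproduced numerically from the quantities above, e.g. in R:
V <- matrix(c(0.00004767174, -0.0001699029,
              -0.0001699029, 0.0660872869), nrow = 2)   # Var(C beta_hat)
cb <- c(0.009675, 0.05091)                              # C beta_hat
drop(t(cb) %*% solve(V) %*% cb)                         # Wald statistic, approximately 2.07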
- Recall that for the likelihood ratio test,
\Lambda = 2 (\ell (\hat\beta) - \ell(\hat\beta_0)) = \frac{D_0 - D}{\hat\phi} \stackrel{a}{\sim} \chi^2(s),
where \hat\beta is the MLE of the full model, \hat\beta_0 is the MLE of the reduced model under H_0, D_0 is the deviance of the reduced model, and D is the deviance of the full model. Then we can reject H_0 at significance level \alpha if \Lambda > \chi^2_{s, \alpha}.
The deviance of the full model is, according to the R code:
D = 2 \phi (\ell_{\text{sat}} - \ell (\hat\beta)) = 5.109.
The deviance of the reduced model with \beta_2 = \beta_3 = 0 is:
D_0 = 2 \phi (\ell_{\text{sat}} - \ell (\hat\beta_0)) = 5.604.
Therefore,
\Lambda = \frac{5.604 - 5.109}{0.3005} = 1.6473.
Since \chi^2_{2, 0.05} = 5.99, the conclusion is again not to reject H_0; i.e., the full model is not sufficiently better than the reduced model.
10.6 Chapter 6
Question 3
- With \boldsymbol{\mu}= \boldsymbol{X}\boldsymbol{\beta}, we have the mean estimate \hat{\boldsymbol{\mu}}= \boldsymbol{X}\hat{\boldsymbol{\beta}}, and so
\begin{aligned} \boldsymbol{X}^T\boldsymbol{\Sigma}^{-1}(\boldsymbol{Y}-\hat{\boldsymbol{\mu}}) &= \boldsymbol{X}^T\boldsymbol{\Sigma}^{-1}(\boldsymbol{Y}- \boldsymbol{X}\hat{\boldsymbol{\beta}}) \\ &= \boldsymbol{X}^T\boldsymbol{\Sigma}^{-1}(\boldsymbol{Y}- \boldsymbol{X}(\boldsymbol{X}^T\boldsymbol{\Sigma}^{-1}\boldsymbol{X})^{-1}\boldsymbol{X}^T\boldsymbol{\Sigma}^{-1}\boldsymbol{Y}) \\ &= \boldsymbol{X}^T\boldsymbol{\Sigma}^{-1}\boldsymbol{Y}-(\boldsymbol{X}^T\boldsymbol{\Sigma}^{-1}\boldsymbol{X})(\boldsymbol{X}^T\boldsymbol{\Sigma}^{-1}\boldsymbol{X})^{-1}\boldsymbol{X}^T\boldsymbol{\Sigma}^{-1}\boldsymbol{Y} \\ &= \boldsymbol{X}^T\boldsymbol{\Sigma}^{-1}\boldsymbol{Y}- \boldsymbol{X}^T\boldsymbol{\Sigma}^{-1}\boldsymbol{Y} \\ &= 0. \end{aligned}
- Note that for any non-random matrix \boldsymbol{A}, we have \mbox{Var}(\boldsymbol{A}\boldsymbol{Y}) = \boldsymbol{A} \mbox{Var}(\boldsymbol{Y}) \boldsymbol{A}^T. Also note that \mbox{Var}(\boldsymbol{Y})=\boldsymbol{\Sigma}. Thus,
\begin{aligned} \mbox{Var}(\hat{\boldsymbol{\beta}}) &= (\boldsymbol{X}^T\boldsymbol{\Sigma}^{-1}\boldsymbol{X})^{-1} \boldsymbol{X}^T\boldsymbol{\Sigma}^{-1}\mbox{Var}(\boldsymbol{Y})\boldsymbol{\Sigma}^{-1}\boldsymbol{X}(\boldsymbol{X}^T\boldsymbol{\Sigma}^{-1}\boldsymbol{X})^{-1} \\ &= (\boldsymbol{X}^T\boldsymbol{\Sigma}^{-1}\boldsymbol{X})^{-1} \boldsymbol{X}^T\boldsymbol{\Sigma}^{-1}\boldsymbol{X}(\boldsymbol{X}^T\boldsymbol{\Sigma}^{-1}\boldsymbol{X})^{-1} \\ &= (\boldsymbol{X}^T\boldsymbol{\Sigma}^{-1}\boldsymbol{X})^{-1}. \end{aligned}
- In this case, one can write \boldsymbol{\Sigma}= \phi\tilde{\boldsymbol{\Sigma}}, where the dispersion parameter \phi is unknown and the matrix \tilde{\boldsymbol{\Sigma}} is known. Hence, one has \boldsymbol{\Sigma}^{-1}= \phi^{-1}\tilde{\boldsymbol{\Sigma}}^{-1}, and so
\begin{aligned} \hat{\boldsymbol{\beta}} &= (\boldsymbol{X}^T\phi^{-1}\tilde{\boldsymbol{\Sigma}}^{-1}\boldsymbol{X})^{-1}\boldsymbol{X}^T\phi^{-1}\tilde{\boldsymbol{\Sigma}}^{-1}\boldsymbol{Y} \\ &= \phi \phi^{-1}(\boldsymbol{X}^T\tilde{\boldsymbol{\Sigma}}^{-1}\boldsymbol{X})^{-1}\boldsymbol{X}^T\tilde{\boldsymbol{\Sigma}}^{-1}\boldsymbol{Y} \\ &= (\boldsymbol{X}^T\tilde{\boldsymbol{\Sigma}}^{-1}\boldsymbol{X})^{-1}\boldsymbol{X}^T\tilde{\boldsymbol{\Sigma}}^{-1}\boldsymbol{Y}. \end{aligned}
This does not depend on \phi and hence does not require its estimation.
- Since we do not want to model temporal autocorrelations, we can use the exchangeable correlation structure. Thus, one way of writing \boldsymbol{\Sigma} is:
\begin{pmatrix} \sigma^2 & \alpha \sigma^2 & \alpha \sigma^2 & & & \\ \alpha \sigma^2 & \sigma^2 & \alpha \sigma^2 & & & \\ \alpha \sigma^2 & \alpha \sigma^2 & \sigma^2 & & & \\ &&& \sigma^2 & \alpha \sigma^2 & \alpha \sigma^2\\ &&& \alpha \sigma^2 & \sigma^2 & \alpha \sigma^2\\ &&& \alpha \sigma^2 & \alpha \sigma^2 & \sigma^2 \end{pmatrix}
where \alpha<1 is some constant. (If one wants to make a link to part (c), which was not requested, one could observe that \phi=\sigma^2 is just the error variance and then \alpha plays the role of the intra-class correlation.)
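In R, such an exchangeable working correlation could be requested with, for example, the geepack package; a sketch with placeholder names (dat, y, x, cluster) is:
library(geepack)
geeglm(y ~ x, id = cluster, family = gaussian, corstr = "exchangeable", data = dat)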
Question 4
- The naive variance estimate is:
\text{Var}(\hat\beta) = \frac{\hat\sigma^2 (1 - \hat\alpha^2)}{\sum_i \left( a_i^2 + b_i^2 - 2 \hat\alpha \, a_i b_i \right)}
where a_i = x_{i1} h'(\hat\beta x_{i1}) and b_i = x_{i2} h'(\hat\beta x_{i2}).
In the first case, a_i = b_i, implying \text{Var}(\hat\beta) = \hat\sigma^2 (1 + \hat\alpha)/(\sum_i 2a_i^2). So, \text{Var}(\hat\beta) increases as \hat\alpha increases from 0.
In the second case, a_i = 0, implying \text{Var}(\hat\beta) = \hat\sigma^2 (1 - \hat\alpha^2)/(\sum_i b_i^2). So, \text{Var}(\hat\beta) decreases as \hat\alpha increases from 0 to 1.
Question 5
- From the notations for \boldsymbol{X}, \boldsymbol{\Sigma}, and \boldsymbol{Y} in Section 6.3, we can easily verify the expressions for \hat{\boldsymbol\beta} and \text{Var}(\hat{\boldsymbol\beta}). Thus,
E(\hat\beta) = \big( \sum_i \boldsymbol{x}^T_i \boldsymbol{\Sigma}_i^{-1} \boldsymbol{x}_i \big)^{-1} \big( \sum_i \boldsymbol{x}^T_i \boldsymbol{\Sigma}_i^{-1} E ( \boldsymbol{Y}_i ) \big) = \big( \sum_i \boldsymbol{x}^T_i \boldsymbol{\Sigma}_i^{-1} \boldsymbol{x}_i \big)^{-1} \big( \sum_i \boldsymbol{x}^T_i \boldsymbol{\Sigma}_i^{-1} \boldsymbol{x}_i \big) \, \beta = \beta.
- We have \text{Var}(\hat{\beta}) = \big( \sum_i \boldsymbol{x}^T_i \boldsymbol{\Sigma}_i^{-1} \boldsymbol{x}_i \big)^{-1} and
\begin{aligned} \boldsymbol{x}^T_i \boldsymbol{\Sigma}_i^{-1} \boldsymbol{x}_i &= \left(\begin{array}{ccc} x_{i1} & \dots & x_{im} \end{array}\right) \, c \left(\begin{array}{cccc} 1 & \phi & \dots & \phi \\ \phi & 1 & \dots & \phi \\ \vdots & \vdots & \ddots & \vdots \\ \phi & \phi & \dots & 1 \end{array}\right) \left(\begin{array}{c} x_{i1} \\ \vdots \\ x_{im} \end{array}\right) \\ &= c \, \left(\begin{array}{ccc} x_{i1} & \dots & x_{im} \end{array}\right) \left(\begin{array}{c} \phi \sum_j x_{ij} + (1-\phi) x_{i1} \\ \vdots \\ \phi \sum_j x_{ij} + (1-\phi) x_{im} \end{array}\right) \\ &= c \, \sum_k x_{ik} (\phi \sum_j x_{ij} + (1-\phi) x_{ik}) \\ &= c \left( \sum_j x_{ij}^2 + \phi \Big[ \big(\sum_j x_{ij}\big)^2 - \sum_j x_{ij}^2 \Big] \right). \end{aligned}
Thus,
\begin{aligned} \text{Var}(\hat{\beta}) &= \big( \sum_i \boldsymbol{x}^T_i \boldsymbol{\Sigma}_i^{-1} \boldsymbol{x}_i \big)^{-1} = \left( \sum_i c \, \Big\{ \sum_j x_{ij}^2 + \phi \Big[ \big(\sum_j x_{ij}\big)^2 - \sum_j x_{ij}^2 \Big] \Big\} \right)^{-1} \\ &= \frac{c^{-1}}{\sum_i \Big\{ \sum_j x_{ij}^2 + \phi \Big[ \big(\sum_j x_{ij}\big)^2 - \sum_j x_{ij}^2 \Big] \Big\}} \\ &= \frac{\sigma^2(1 + (m-1)\phi\rho)}{\sum_i \Big\{ \sum_j x_{ij}^2 + \phi \Big[ \big(\sum_j x_{ij}\big)^2 - \sum_j x_{ij}^2 \Big] \Big\}}. \end{aligned}
- If the clustering is ignored, then \boldsymbol{Y} \sim \mathcal{N}(\boldsymbol{X} \beta, \sigma^2 \boldsymbol{I}). Thus, b = (\boldsymbol{X}^T \boldsymbol{X})^{-1} \boldsymbol{X}^T \boldsymbol{Y} and \text{Var}(b) = \sigma^2 (\boldsymbol{X}^T \boldsymbol{X})^{-1} = \sigma^2/\sum_i \sum_j x_{ij}^2.