Module 7: Review Solutions


Module 1: Multiple Linear Regression Review

The linear regression model in matrix form is:

\[ \mathbf{Y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\epsilon}, \quad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I}) \]

Here:

  • \(\mathbf{Y}\): \(n \times 1\) vector of responses.

  • \(\mathbf{X}\): \(n \times p\) design matrix of predictors.

  • \(\boldsymbol{\beta}\): \(p \times 1\) vector of coefficients.

  • \(\boldsymbol{\epsilon}\): \(n \times 1\) vector of errors.

The Ordinary Least Squares (OLS) estimate for \(\boldsymbol{\beta}\) is:

\[ \hat{\boldsymbol{\beta}} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{Y} \]

Simulation in R: The code below constructs a design matrix and a response vector using a true coefficient vector of length 3.

  1. Write code to compute \(\hat{\beta}\).

solution: see below

  1. What are the properties of the OLS estimator \(\hat{\beta}\)? Hint: review the relevant theorems, and consider when the OLS estimator coincides with the MLE.

  2. Interpret the coefficients based on your calculation.

solution:

  • Intercept (\(\beta_0 = -1.223\)): When both \(X_1\) and \(X_2\) are 0, the expected value of \(Y\) is approximately \(-1.223\).

  • Slope for \(X_1\) (\(\beta_1 = 2.926\)): For every one-unit increase in \(X_1\), \(Y\) is expected to increase by approximately \(2.926\), holding \(X_2\) constant.

  • Slope for \(X_2\) (\(\beta_2 = 16.217\)): For every one-unit increase in \(X_2\), \(Y\) is expected to increase by approximately \(16.217\), holding \(X_1\) constant.

  1. Inference for Coefficients: Write out the covariance of \(\hat{\beta}\) and add code to calculate the covariance.

solution: \(\text{cov}(\hat{\beta}) = \sigma^2 (X'X)^{-1}\), estimated by plugging in \(\hat{\sigma}^2 = \frac{1}{n-p} \sum_i (y_i - \hat{y}_i)^2\), where \(n\) is the number of observations and \(p\) is the number of predictors including the intercept. See the code below for the result.

  1. Explain what the diagonal and off-diagonal elements of \(\text{Var}(\hat{\boldsymbol{\beta}})\) represent and how they relate to confidence intervals for the coefficients.

solution:

  • Each diagonal element corresponds to the variance of an individual coefficient estimate, e.g., \(\text{Var}(\hat{\beta}_j)\). A smaller value indicates a more precise estimate of the coefficient. The off-diagonal elements capture the covariance between \(\hat{\beta}_j\) and \(\hat{\beta}_{j'}\). The standard error is used to compute confidence intervals for each coefficient. For a 95% confidence level, the interval is: \[ \hat{\beta}_j \pm t^* \cdot \sqrt{\text{Var}(\hat{\beta}_j)} \] where \(t^*\) is the critical value from the \(t\)-distribution with \(n - p\) degrees of freedom.
# Example Data
beta_true <- c(-0.5, 2, 1)
N = 100
set.seed(1232)
X <- cbind(1, rnorm(N, 0, 0.1), rnorm(N, 0, 0.1))  # Design matrix (with intercept)
Y <- X %*% beta_true + rnorm(N, 0, sd = 10)  # Response vector

# Compute beta-hat
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% Y


# Variance of beta-hat?
# Calculate predicted values
Y_hat <- X %*% beta_hat

# Residuals
residuals <- Y - Y_hat

# Number of observations and predictors
n <- nrow(X)
p <- ncol(X)

# Estimate of sigma^2
sigma2 <- sum(residuals^2) / (n - p)

# Variance of beta hat
var_beta <- sigma2 * solve(t(X) %*% X)
var_beta
          [,1]        [,2]        [,3]
[1,] 1.0628145   0.8354875   0.3820135
[2,] 0.8354875 114.3273692   7.1107418
[3,] 0.3820135   7.1107418 124.2301963
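
As a sketch (reusing the objects beta_hat, var_beta, n, and p computed above), the 95% confidence intervals described in the solution can be obtained as follows:

# 95% confidence intervals from the estimated covariance matrix
se_beta <- sqrt(diag(var_beta))      # standard errors of each coefficient
t_crit <- qt(0.975, df = n - p)      # critical value with n - p degrees of freedom
cbind(lower = beta_hat - t_crit * se_beta,
      upper = beta_hat + t_crit * se_beta)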
  1. What is the difference between \(\hat{\sigma}^2\) and \(s^2\)?

solution: \(\hat{\sigma}^2\) is an estimate of the true variance of the error term (\(\sigma^2\)) in the underlying population, which captures the overall variability of the observed data \(Y\). It is calculated as: \[ \hat{\sigma}^2 = \frac{\sum (Y_i - \hat{Y}_i)^2}{n} \] But this estimated variance is biased because it does not account for the estimated coefficients in the calculation of \(\hat{Y}\). \(s^2\) is an unbiased estimate of the population variance \(\sigma^2\), adjusted for the loss of degrees of freedom due to estimating parameters. It is calculated as: \[ s^2 = \frac{\sum (Y_i - \hat{Y}_i)^2}{n - p} \] where \(p\) is the number of parameters (including the intercept) in the model. The adjustment by \(n - p\) accounts for the degrees of freedom used to estimate the model parameters, making \(s^2\) an unbiased estimator.
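
A quick numerical sketch of this distinction, reusing the residuals, n, and p computed in the code above:

# Biased (divide by n) versus unbiased (divide by n - p) variance estimates
sigma2_hat <- sum(residuals^2) / n
s2 <- sum(residuals^2) / (n - p)
c(sigma2_hat = sigma2_hat, s2 = s2)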

Interactions and Effect Modification

Interactions allow the effect of one predictor to depend on the value of another. The model:

\[ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 (X_1 \cdot X_2) + \epsilon \]

Example:

set.seed(123)
# Simulated data
beta <- c(1,2,4,3)

X1 = rnorm(N, 0, 10)
X2 = rnorm(N, 0, 10)
Y = beta[1] + beta[2] * X1 + beta[3] * X2 + beta[4] * X1* X2 + rnorm(N, 0, 1)


# Fit a model with interaction
model <- lm(Y ~ X1 * X2)
summary(model)

Call:
lm(formula = Y ~ X1 * X2)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.8719 -0.6777 -0.1086  0.5897  2.3166 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 1.140983   0.095777   11.91   <2e-16 ***
X1          1.990719   0.010834  183.75   <2e-16 ***
X2          4.003434   0.009881  405.15   <2e-16 ***
X1:X2       3.001591   0.001145 2621.64   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.9468 on 96 degrees of freedom
Multiple R-squared:      1, Adjusted R-squared:      1 
F-statistic: 2.406e+06 on 3 and 96 DF,  p-value: < 2.2e-16

Interpretation: The coefficient for \(X1:X2\) quantifies the interaction effect. If significant, it means the effect of \(X1\) on \(Y\) depends on \(X2\).

  1. Write the derivation for the prediction interval for a new observation.

solution: Let \(\tilde{y}_i\) be a new observation with covariate vector \(x_i\). The variance of the prediction error is \(\tilde{\sigma}^2 = \hat{\sigma}^2 x_i' (X'X)^{-1} x_i + \hat{\sigma}^2\), which is the variance of the fitted value \(x_i'\hat{\beta}\) plus the uncertainty due to the error term \(\epsilon_i\). The 95% prediction interval is given by:

\[ x_i'\hat{\beta} \pm 1.96\, \tilde{\sigma} \]

  1. Write the code to calculate the 95% prediction interval using the results above.
X <- cbind(1, X1, X2, X1 * X2)
# Fit the linear model
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% Y  # Coefficients SHOULD BE THE SAME IN MODEL FIT.
Y_hat <- X %*% beta_hat  # Predicted values
invXtX = solve(t(X) %*% X)

# Residual variance (s^2)
n <- nrow(X)
p <- ncol(X)
residuals <- Y - Y_hat
sigmahat <- sum(residuals^2) / (n - p) #use the unbiased estimate of sigma^2
V= sigmahat * invXtX 

# Create New observation
X_new <- c(1, 0.5, -0.2, 0.5 * -0.2)  # Example new observation

# Variance of the predicted value for the new observation
var_pred <- as.numeric(t(X_new) %*% V %*% X_new) + sigmahat
se_pred <- sqrt(var_pred)

# Predicted value for the new observation
Y_tilde <- sum(X_new * beta_hat)

# Prediction interval
t_value <- 1.96  # Approximate critical value for 95% confidence level
lower_bound <- Y_tilde - t_value * se_pred
upper_bound <- Y_tilde + t_value * se_pred

# Output results
list(
  Predicted_Value = Y_tilde,
  Lower_Bound = lower_bound,
  Upper_Bound = upper_bound
)
$Predicted_Value
[1] 1.035496
Confounding

A variable is a confounder if it influences both the predictor and the outcome, potentially biasing the estimated relationship.

Example:

# Example dataset
data <- data.frame(
  smoking = c(0, 1, 0, 1),
  age = c(20, 30, 40, 50),
  disease = c(0, 1, 1, 1)
)

# Fit a model without adjusting for age
model1 <- glm(disease ~ smoking, family = binomial, data = data)

# Fit a model adjusting for age
model2 <- glm(disease ~ smoking + age, family = binomial, data = data)
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(model1)

Call:
glm(formula = disease ~ smoking, family = binomial, data = data)

Deviance Residuals: 
       1         2         3         4  
-1.17741   0.00008   1.17741   0.00008  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.238e-16  1.414e+00   0.000    1.000
smoking      1.957e+01  7.604e+03   0.003    0.998

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 4.4987  on 3  degrees of freedom
Residual deviance: 2.7726  on 2  degrees of freedom
AIC: 6.7726

Number of Fisher Scoring iterations: 18
summary(model2)

Call:
glm(formula = disease ~ smoking + age, family = binomial, data = data)

Deviance Residuals: 
         1           2           3           4  
-1.197e-05   1.197e-05   1.197e-05   2.110e-08  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept)    -70.078 160246.558       0        1
smoking         23.359  87770.712       0        1
age              2.336   5067.441       0        1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 4.4987e+00  on 3  degrees of freedom
Residual deviance: 4.2978e-10  on 1  degrees of freedom
AIC: 6

Number of Fisher Scoring iterations: 23
  1. Compare the coefficient for smoking in both models to see how controlling for age affects the relationship. Interpret the findings for this coefficient in both settings.

solution: In the first model, the coefficient \(\beta_1\) represents the effect of smoking on the log-odds of disease without considering age; confounding by age can bias this estimate. The results show a 19.57-unit increase in the log-odds of disease for smokers compared to non-smokers. In the second model, the coefficient for smoking reflects the relationship between smoking and disease after accounting for age as a confounder; here there is a 23.36-unit increase in the log-odds of disease comparing smokers to non-smokers. The change in the smoking coefficient between models 1 and 2 shows an increase in the estimated effect of smoking after accounting for age. If the smoking coefficient were similar in both models, this would indicate that age does not confound the relationship between smoking and disease; here, however, we see a change.

  1. Compare the models to find the optimal model using a deviance (likelihood-ratio) test.

solution: The Analysis of Deviance Table compares two models: the unadjusted model (Model 1), which includes only smoking as a predictor, and the adjusted model (Model 2), which includes both smoking and age. The residual deviance for Model 1 is 2.7726 with 2 degrees of freedom, while the residual deviance for Model 2 is reduced to 0.0000 with 1 degree of freedom. The difference in deviance is 2.7726, with a corresponding p-value of 0.09589. Adding age to the model reduces the deviance, so it explains additional variability in the outcome (disease); however, the p-value of 0.09589 is not significant at the 0.05 level, suggesting the improvement in model fit is modest.

In conclusion, while age appears to contribute to the model, its effect is not strong enough to confidently reject the null hypothesis that it has no additional impact on the outcome beyond smoking. Depending on the context, further investigation or a larger sample size may be warranted to clarify its role.

# Perform F-test to compare models
anova_result <- anova(model1, model2, test = "Chisq")

# Display the result
anova_result
Analysis of Deviance Table

Model 1: disease ~ smoking
Model 2: disease ~ smoking + age
  Resid. Df Resid. Dev Df Deviance Pr(>Chi)  
1         2     2.7726                       
2         1     0.0000  1   2.7726  0.09589 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Module 2 Linear Mixed Models

Linear mixed models (LMMs) are designed to handle correlated data. This arises when:

  • Multiple measurements are taken on the same individuals (e.g., repeated measures).

  • Data are grouped into clusters (e.g., students nested within schools).

The general structure of a mixed model is: \[ Y_{ij} = \beta_0 + \beta_1 X_{ij} + u_j + \epsilon_{ij} \] Where:

  • \(Y_{ij}\): Response for individual \(i\) in group \(j\).

  • \(\beta_0 + \beta_1 X_{ij}\): Fixed effects (global trends or population-level effects).

  • \(u_j\): Random effect for group \(j\), \(u_j \sim \mathcal{N}(0, \tau^2)\).

  • \(\epsilon_{ij}\): Residual error, \(\epsilon_{ij} \sim \mathcal{N}(0, \sigma^2)\).

Fixed vs. Random Effects

  1. Fixed Effects:
    • Parameters of interest (e.g., \(\beta_0\), \(\beta_1\)).
    • Represent the population-level effects.
  2. Random Effects:
    • Account for variability between clusters or groups.
    • Allow for partial pooling of information across groups.

Example: In a study of students’ test scores:

  • Fixed effect: Average effect of study hours on scores.

  • Random effect: Variation in scores across schools.

Variance Components

Mixed models decompose variability into:

1. Between-cluster variability (\(\tau^2\)): How much clusters differ.

2. Within-cluster variability (\(\sigma^2\)): How much individuals within a cluster differ.

Example Dataset: Sleep study with repeated measures of reaction times for individuals under sleep deprivation.

  1. Fit a linear mixed model regressing \(Reaction\) on \(Days\) with a random intercept for \(Subject\).

# Load the lme4 package
library(lme4)
Loading required package: Matrix
# Example: Sleep study dataset
data("sleepstudy", package = "lme4")

# Fit a mixed-effects model
# Reaction time ~ Days of sleep deprivation + (Random intercept for Subject)
model <- lmer(Reaction ~ Days + (1 | Subject), data = sleepstudy)
summary(model)
Linear mixed model fit by REML ['lmerMod']
Formula: Reaction ~ Days + (1 | Subject)
   Data: sleepstudy

REML criterion at convergence: 1786.5

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-3.2257 -0.5529  0.0109  0.5188  4.2506 

Random effects:
 Groups   Name        Variance Std.Dev.
 Subject  (Intercept) 1378.2   37.12   
 Residual              960.5   30.99   
Number of obs: 180, groups:  Subject, 18

Fixed effects:
            Estimate Std. Error t value
(Intercept) 251.4051     9.7467   25.79
Days         10.4673     0.8042   13.02

Correlation of Fixed Effects:
     (Intr)
Days -0.371
  1. Write out the interpretation of the fixed effects \(\hat{\beta}\) in this model.

solution: \(\beta_0\) represents the predicted reaction time when Days = 0 (baseline reaction time without sleep deprivation). \(\beta_1\) represents the average change in reaction time per additional day of sleep deprivation. A positive \(\beta_1 = 10.47\) indicates that reaction time increases (worsens) with more days of sleep deprivation.

  1. What are the estimates for the between-subject variability and the within-subject variability?

solution: The between subject variance is given by:

\[ \text{Var}(\text{Intercept}) = 1378.2 \quad (\text{Std. Dev.} = \sqrt{1378.2} = 37.12) \] The within subject variance is given by:

\[ \text{Residual Variance} = 960.5 \quad (\text{Std. Dev.} = \sqrt{960.5} = 30.99) \]

  1. Calculate and interpret the Intraclass Correlation.

solution: The intraclass correlation coefficient (ICC) measures the proportion of total variability in the outcome (reaction time) that is attributable to between-subject variability. It is calculated as:

\[ \text{ICC} = \frac{\sigma^2_{\text{between}}}{\sigma^2_{\text{between}} + \sigma^2_{\text{within}}} \]

Substituting the values:

\[ \text{ICC} = \frac{1378.2}{1378.2 + 960.5} = \frac{1378.2}{2338.7} \approx 0.5895 \]

The ICC is approximately 0.59, meaning that about 59% of the total variability in reaction time is attributable to differences between subjects, while the remaining 41% is due to variability within subjects.
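
A short sketch of this calculation in R, assuming the lmer fit stored in model above:

# Extract the variance components and compute the ICC
vc <- as.data.frame(VarCorr(model))
tau2 <- vc$vcov[vc$grp == "Subject"]     # between-subject variance
sigma2 <- vc$vcov[vc$grp == "Residual"]  # within-subject (residual) variance
tau2 / (tau2 + sigma2)                   # ICC, approximately 0.59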

  1. Extend this model to include both random intercepts AND slopes. Write out the model, fit the model with the code below, and interpret.

solution:

The extended model includes both random intercepts and random slopes. This allows each subject to have: 1. A unique baseline reaction time (random intercept). 2. A unique effect of Days on reaction time (random slope).

The model can be written as:

\[ \text{Reaction}_{ij} = \beta_0 + \beta_1 \cdot \text{Days}_{ij} + u_{0j} + u_{1j} \cdot \text{Days}_{ij} + \epsilon_{ij} \]

Where:

  • \(\beta_0\): Fixed intercept (average reaction time when Days = 0 across all subjects).
  • \(\beta_1\): Fixed slope (average effect of Days across all subjects).
  • \(u_{0j}\): Random intercept for subject \(j\), accounting for individual baseline differences.
  • \(u_{1j}\): Random slope for subject \(j\), accounting for individual differences in the effect of Days.
  • \(\epsilon_{ij}\): Residual error for observation \(i\) in subject \(j\).

# Random slope and intercept model
model2 <- lmer(Reaction ~ Days + (Days | Subject), data = sleepstudy)
summary(model2)
Linear mixed model fit by REML ['lmerMod']
Formula: Reaction ~ Days + (Days | Subject)
   Data: sleepstudy

REML criterion at convergence: 1743.6

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-3.9536 -0.4634  0.0231  0.4634  5.1793 

Random effects:
 Groups   Name        Variance Std.Dev. Corr
 Subject  (Intercept) 612.10   24.741       
          Days         35.07    5.922   0.07
 Residual             654.94   25.592       
Number of obs: 180, groups:  Subject, 18

Fixed effects:
            Estimate Std. Error t value
(Intercept)  251.405      6.825  36.838
Days          10.467      1.546   6.771

Correlation of Fixed Effects:
     (Intr)
Days -0.138

Comparing both models, the estimated fixed effects do not change. However, there is a large decrease in the between-subject intercept variance, from 1378 to 612.10, due to the inclusion of random slopes. The between-subject variance in slopes is 35.07. This shows that the effect of sleep deprivation on reaction time varies between subjects, but the variability in slopes is smaller than the variability in random intercepts. The correlation between random intercepts and slopes is 0.07, indicating a weak positive relationship between baseline reaction times and the effect of sleep deprivation.

  1. Is the random intercept or random intercept and slope model preferred? To compare models, we use the \(AIC\). Calculate the AICs and determine which model is better.

solution: Comparing AIC values, model 2 has the lower AIC (1755.6 vs. 1794.5), indicating that the random intercept and slope model provides a better fit than the random intercept only model.

AIC(model, model2)
       df      AIC
model   4 1794.465
model2  6 1755.628

Module 3 Generalized Linear Models

  • GLMs extend linear regression to handle response variables that are not normally distributed.
  • A GLM consists of three components:
    1. A probability distribution in the Exponential Family (e.g., Binomial, Poisson).
    2. Systematic Component: A linear model \(\eta = X\beta\), where \(X\) is the design matrix and \(\beta\) are the coefficients.
    3. Link Function: Transforms the expected value of the response variable to the scale of the linear predictor, \(g(\mu) = \eta\).

The probability distribution must be in the Exponential Family.

  1. Write out the general form of the exponential family and show that the Poisson distribution fits this form.

solution: The general form of a distribution in the exponential family is written as:

\[ f(y; \theta, \phi) = \exp \left( \frac{y \cdot \theta - b(\theta)}{a(\phi)} + c(y, \phi) \right) \]

Where:

  • \(y\): Observed data (response variable).
  • \(\theta\): Canonical (natural) parameter.
  • \(\phi\): Dispersion parameter (often fixed to 1 for distributions like the Poisson).
  • \(a(\phi)\): A function of the dispersion parameter.
  • \(b(\theta)\): A function that normalizes the distribution (ensures a proper probability distribution).
  • \(c(y, \phi)\): A function involving the observed data and dispersion parameter.

The probability mass function (PMF) of the Poisson distribution is:

\[ f(y; \lambda) = \frac{\lambda^y e^{-\lambda}}{y!}, \quad y = 0, 1, 2, \dots \]

  1. Start with the PMF:

\[ f(y; \lambda) = \frac{\lambda^y e^{-\lambda}}{y!} \]

  1. Rewrite \(\lambda\) in terms of the natural parameter \(\theta\). For the Poisson distribution:

    • \(\lambda = e^{\theta}\), so \(\theta = \log(\lambda)\).
  2. Substitute \(\lambda = e^{\theta}\) into the PMF:

\[ f(y; \lambda) = \frac{e^{\theta y} e^{-e^{\theta}}}{y!} \]

  1. Rearrange terms to match the exponential family form:

\[ f(y; \lambda) = \exp \left( y \theta - e^{\theta} - \log(y!) \right) \]

  1. Identify the components of the exponential family:

    • \(\theta\): Canonical (natural) parameter, \(\log(\lambda)\).
    • \(b(\theta)\): Normalizing function, \(b(\theta) = e^{\theta}\).
    • \(a(\phi)\): Dispersion function, \(a(\phi) = 1\) (fixed for Poisson).
    • \(c(y, \phi)\): Function of \(y\), \(c(y, \phi) = -\log(y!)\).

Thus, the Poisson distribution can be written in the general form of the exponential family as:

\[ f(y; \theta) = \exp \left( y \theta - b(\theta) + c(y) \right) \]

Where: \(b(\theta) = e^{\theta}\) and \(c(y) = -\log(y!)\)

Common GLMs:

  1. Logistic Regression:
    • Used for binary data.
    • Link function: \(g(\mu) = \log\left(\frac{\mu}{1 - \mu}\right)\) (logit link).
    • Response variable \(y_i \sim \text{Binomial}(n_i, \pi_i)\).
  2. Probit Regression:
    • Also for binary data.
    • Link function: \(g(\mu) = \Phi^{-1}(\mu)\), where \(\Phi\) is the standard normal cumulative distribution function.
  3. Poisson Regression:
    • Used for count data.
    • Link function: \(g(\mu) = \log(\mu)\).
    • Response variable \(y_i \sim \text{Poisson}(\lambda_i)\).

Consider a study on whether students pass an exam (\(y_i = 1\) if pass, \(y_i = 0\) otherwise). The model is:

\[ \text{logit}(\pi_i) = \beta_0 + \beta_1 \text{Hours}_i \]

  1. Write the logistic regression model where outcomes \(y_i\) regress on the number of hours spent studying \(hours_i\).

solution:

\[ \text{logit}(\pi_i) = \log\left(\frac{\pi_i}{1 - \pi_i}\right) = \beta_0 + \beta_1 \cdot \text{Hours}_i \] where \(\pi_i = P(y_i = 1)\).

  1. Write out the expectation.

solution: The expectation is written as:

\[ \mathbb{E}[y_i] = \pi_i = P(y_i = 1 \mid \text{Hours}_i) = \frac{1}{1 + \exp\left(-(\beta_0 + \beta_1 \cdot \text{Hours}_i)\right)} \]

  4. Interpret the coefficient \(\beta_1\) in the context of the study.

solution: \(\beta_1\): The slope, representing the change in the log-odds of passing for each additional hour of studying.

  1. If \(\beta_1 = 0.5\), compute the odds of passing for a student who studies 4 hours compared to one who studies 2 hours.

solution:

The odds ratio (OR) quantifies the change in odds for a one-unit increase in the predictor. For two students with \(\text{Hours}_1 = 4\) and \(\text{Hours}_2 = 2\), the odds ratio is:

\[ \text{OR} = \frac{\text{Odds for Hours}_1}{\text{Odds for Hours}_2} \]

The odds of passing are calculated as:

\[ \text{Odds} = e^{\beta_0 + \beta_1 \cdot \text{Hours}} \]

However, the intercept (\(\beta_0\)) cancels when comparing odds for the same model. Thus, the odds ratio simplifies to:

\[ \text{OR} = e^{\beta_1 \cdot (\text{Hours}_1 - \text{Hours}_2)} \]

Substituting Values \(\beta_1 = 0.5\) , \(\text{Hours}_1 = 4\) , \(\text{Hours}_2 = 2\)

\[ \text{OR} = e^{0.5 \cdot (4 - 2)} = e^{0.5 \cdot 2} = e^1 \approx 2.718 \]

The odds of passing for a student who studies 4 hours are approximately 2.718 times the odds of passing for a student who studies 2 hours.

  1. Calculate the 95% CIs for the OR. (Assume \(\text{SE}(\beta_1) = 0.1\))

solution:

The odds ratio (OR) is calculated from the coefficient \(\beta_1\) of the logistic regression model. The 95% confidence interval (CI) for the OR is derived as:

\[ \text{CI} = \left( e^{\beta_1 - 1.96 \cdot \text{SE}(\beta_1)}, \, e^{\beta_1 + 1.96 \cdot \text{SE}(\beta_1)} \right) \] Where: \(\beta_1 = 0.5\) (log-odds per unit increase in hours), \(\text{SE}(\beta_1)\): The standard error of \(\beta_1\), \(1.96\): The critical value for a 95% confidence level (from the standard normal distribution).

  1. Lower Bound: \[ e^{\beta_1 - 1.96 \cdot \text{SE}(\beta_1)} = e^{0.5 - 1.96 \cdot 0.1} = e^{0.5 - 0.196} = e^{0.304} \approx 1.355 \]

  2. Upper Bound: \[ e^{\beta_1 + 1.96 \cdot \text{SE}(\beta_1)} = e^{0.5 + 1.96 \cdot 0.1} = e^{0.5 + 0.196} = e^{0.696} \approx 2.006 \]

    The 95% confidence interval for the odds ratio is approximately:

\[ (1.355, 2.006) \]
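
A minimal sketch of these calculations in R, plugging in the values \(\beta_1 = 0.5\) and \(\text{SE}(\beta_1) = 0.1\) given above:

beta1 <- 0.5
se_beta1 <- 0.1
exp(beta1 * (4 - 2))                    # OR comparing 4 vs 2 hours of study, ~2.718
exp(beta1 + c(-1.96, 1.96) * se_beta1)  # 95% CI for the per-hour OR, ~(1.355, 2.006)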

Load the mtcars dataset in R and fit a logistic regression model to predict whether a car has an automatic transmission (am) using mpg and hp as predictors.

  1. Write the R code to fit the model using the glm() function.

solution

data(mtcars)

# Fit the logistic regression model
model <- glm(am ~ mpg + hp, data = mtcars, family = binomial)

# Display the summary of the model
summary(model)

Call:
glm(formula = am ~ mpg + hp, family = binomial, data = mtcars)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-1.41460  -0.42809  -0.07021   0.16041   1.66500  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)  
(Intercept) -33.60517   15.07672  -2.229   0.0258 *
mpg           1.25961    0.56747   2.220   0.0264 *
hp            0.05504    0.02692   2.045   0.0409 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 43.230  on 31  degrees of freedom
Residual deviance: 19.233  on 29  degrees of freedom
AIC: 25.233

Number of Fisher Scoring iterations: 7
  1. Interpret the coefficients of mpg and hp in the output.

solution:

  1. mpg (1.25961): For each one-unit increase in mpg, the log-odds of having an automatic transmission increase by 1.25961, holding hp constant. In terms of odds, the odds of having an automatic transmission multiply by \(\exp(1.25961) \approx 3.52\) for every additional mile per gallon.
  2. hp (0.05504): For each one-unit increase in hp, the log-odds of having an automatic transmission increase by 0.05504, holding mpg constant. In terms of odds, the odds of having an automatic transmission multiply by \(\exp(0.05504) \approx 1.057\) for every additional unit of horsepower.

  1. Use the model to predict the probability of an automatic transmission for a car with mpg = 25 and hp = 100.

solution: The predicted log-odds for a car with mpg = 25 and hp = 100 are calculated as:

\[ \text{logit}(\pi) = -33.60517 + 1.25961 \cdot 25 + 0.05504 \cdot 100 \] Compute the log-odds: \[ \text{logit}(\pi) = -33.60517 + (1.25961 \cdot 25) + (0.05504 \cdot 100) \]

\[ \text{logit}(\pi) = -33.60517 + 31.49025 + 5.504 = 3.38908 \]

Convert log-odds to probability: \[ \pi = \frac{1}{1 + \exp(-\text{logit}(\pi))} = \frac{1}{1 + \exp(-3.38908)} \]

\[ \pi \approx \frac{1}{1 + 0.0337} \approx 0.967 \]

The predicted probability of the car having an automatic transmission is approximately 96.7% for a car with mpg = 25 and hp = 100.
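
Equivalently, this prediction can be obtained directly from the fitted glm object (a sketch, assuming the fit above is still stored in model):

# Predicted probability of an automatic transmission for mpg = 25, hp = 100
new_car <- data.frame(mpg = 25, hp = 100)
predict(model, newdata = new_car, type = "response")  # ~0.967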

Poisson Regression with Random Intercept

A hospital administrator is analyzing the number of patients admitted to different wards each day. The number of daily admissions is modeled as a Poisson-distributed outcome. The wards are suspected to have differing baseline admission rates, so a random intercept is included in the model.

The model for daily admissions \(y_{ij}\) in ward \(i\) on day \(j\) is given by:

\[ \begin{align*} y_{ij} & \sim Pois(\mu_{ij})\\ \log(\mu_{ij}) &= \beta_0 + \beta_1 \text{DayType}_{ij} + u_i \end{align*} \]

  • \(u_i \sim N(0, \sigma^2_u)\) is the random intercept for ward \(i\).
  • \(\text{DayType}_{ij}\) is a binary indicator (0 = weekday, 1 = weekend).
  • \(\beta_0\) is the overall fixed intercept.
  • \(\beta_1\) is the fixed effect of the day type.
  1. Identify the fixed and random components of this model.

solution:

Fixed effects:

  • \(\beta_0\): The fixed intercept represents the log of the baseline admission rate on weekdays across all wards.
  • \(\beta_1\): The fixed effect of \(\text{DayType}_{ij}\) represents the change in the log of the admission rate when moving from a weekday (\(\text{DayType} = 0\)) to a weekend (\(\text{DayType} = 1\)). This captures the overall effect of day type across all wards.

Random Effects:

  • \(u_i \sim N(0, \sigma^2_u)\): The random intercept for ward \(i\) accounts for differences in baseline admission rates between wards. Each ward has its own unique offset from the overall baseline admission rate (\(\beta_0\)), modeled as a random draw from a normal distribution with mean 0 and variance \(\sigma^2_u\).
  1. Write the expected value of \(y_{ij}\) in terms of \(\mu_{ij}\).

solution:

For the Poisson distribution, the expected value of the outcome \(y_{ij}\) is equal to its mean, \(\mu_{ij}\). Therefore:

\[ \mathbb{E}[y_{ij}] = \mu_{ij} \]

The relationship between \(\mu_{ij}\) and the predictors is defined by the model:

\[ \log(\mu_{ij}) = \beta_0 + \beta_1 \cdot \text{DayType}_{ij} + u_i \]

Exponentiating both sides to express \(\mu_{ij}\):

\[ \mu_{ij} = \exp(\beta_0 + \beta_1 \cdot \text{DayType}_{ij} + u_i) \]

Substituting this into the expectation:

\[ \mathbb{E}[y_{ij}] = \exp(\beta_0 + \beta_1 \cdot \text{DayType}_{ij} + u_i) \]

  1. Interpret the meaning of \(\beta_1\) in this model.

solution: \(\beta_1\) represents the fixed effect of day type on the log of the expected number of daily admissions.

  1. What does \(\sigma^2_u\) tell you about the variability across wards?

solution: \(\sigma^2_u\) quantifies how much the baseline log admission rates vary between wards.

  1. Write R code to simulate data from this model. Assume the estimated coefficients are \(\beta_0 = 2.5\), \(\beta_1 = -0.3\), and \(\sigma^2_u = 0.2\)

solution:

set.seed(123)


# Parameters
beta_0 <- 2.5      # Fixed intercept
beta_1 <- -0.3     # Fixed effect of DayType
sigma_u <- sqrt(0.2)  # Standard deviation of random intercepts
n_wards <- 15     # Number of wards
n_days <- 30       # Number of days per ward

# Simulate random intercepts for wards
random_intercepts <- rnorm(n_wards, mean = 0, sd = sigma_u)

# Create data frame
data <- expand.grid(Ward = 1:n_wards, Day = 1:n_days)
data$DayType <- sample(0:1, size = nrow(data), replace = TRUE)  # Randomly assign weekdays/weekends
data$u_i <- random_intercepts[data$Ward]  # Assign random intercepts to wards

# Calculate the log mean (linear predictor)
data$log_mu <- beta_0 + beta_1 * data$DayType + data$u_i

# Calculate the mean (expected number of admissions)
data$mu <- exp(data$log_mu)

# Simulate Poisson outcomes
data$y <- rpois(nrow(data), lambda = data$mu)

# View a sample of the simulated data
head(data)
  Ward Day DayType         u_i   log_mu        mu  y
1    1   1       0 -0.25065233 2.249348  9.481549  7
2    2   1       1 -0.10293850 2.097061  8.142209 11
3    3   1       0  0.69707555 3.197076 24.460891 21
4    4   1       1  0.03153231 2.231532  9.314127  6
5    5   1       1  0.05781923 2.257819  9.562213  6
6    6   1       0  0.76700038 3.267000 26.232534 28
library(ggplot2)

# Plot: Admissions over Days for Each Ward
ggplot(data, aes(x = Day, y = y, color = factor(Ward))) +
  geom_line() +
  geom_point() +
  labs(
    title = "Simulated Admissions Over Time by Ward",
    x = "Day",
    y = "Number of Admissions",
    color = "Ward"
  ) +
  theme_minimal() +
  theme(legend.position = "right")

  1. Predict the expected number of admissions for: A weekday (\(\text{DayType} = 0\)) in Ward A, where \(u_A = 0.1\).

solution:

The expected number of admissions is:

\[ \mu_{ij} = \exp(\beta_0 + \beta_1 \cdot \text{DayType}_{ij} + u_i) \]

Substituting the values: \[ \log(\mu_{ij}) = 2.5 + (-0.3 \cdot 0) + 0.1 = 2.6 \]

Compute \(\mu_{ij}\) by exponentiating: \[ \mu_{ij} = \exp(2.6) \approx 13.46 \]

The expected number of admissions for a weekday (\(\text{DayType} = 0\)) in Ward A, where \(u_A = 0.1\), is approximately 13.46 admissions.
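
The same arithmetic as a one-line check in R:

exp(2.5 + (-0.3) * 0 + 0.1)  # expected weekday admissions in Ward A, ~13.46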

Module 4: Generalized Estimating Equations (GEEs)

  • GEEs are an extension of GLMs used for correlated or clustered data, such as repeated measures.
  • Unlike random-effects models, GEEs estimate population-averaged (marginal) effects rather than subject-specific effects.
  • Key features include:
    • Estimating Equation: An iterative method for parameter estimation.
    • Correlation Structures: Specify the within-cluster correlation (e.g., exchangeable, autoregressive, unstructured).
    • Robust Standard Errors: Account for misspecification of the correlation structure.

Correlation Structures

  1. Exchangeable: Equal correlation among all observations within a cluster.
  2. Autoregressive (AR1): Correlation decreases with increasing time separation.
  3. Unstructured: No specific structure; each correlation is estimated.

The respiratory dataset includes longitudinal data from a respiratory study. The main variables are:

  • outcome: Binary outcome indicating the presence or absence of infection.
  • age: Age of the subject.
  • visit: Time of the visit.
  • treat: Treatment group (placebo or drug).

  1. Load the dataset and fit a GEE with an exchangeable correlation structure:
library(geepack)
library(gee)
set.seed(1232)
# Load the respiratory dataset
data("respiratory", package = "geepack")

# Fit the GEE
fit_gee <-  gee(outcome ~ age +  visit + treat, data = respiratory, id = id, corstr = "exchangeable", family = binomial)
Beginning Cgee S-function, @(#) geeformula.q 4.13 98/01/27
running glm to get initial regression estimate
(Intercept)         age       visit      treatP 
 1.31881870 -0.01217949 -0.06244781 -0.98377461 
# Summary of the model
summary(fit_gee)

 GEE:  GENERALIZED LINEAR MODELS FOR DEPENDENT DATA
 gee S-function, version 4.13 modified 98/01/27 (1998) 

Model:
 Link:                      Logit 
 Variance to Mean Relation: Binomial 
 Correlation Structure:     Exchangeable 

Call:
gee(formula = outcome ~ age + visit + treat, id = id, data = respiratory, 
    family = binomial, corstr = "exchangeable")

Summary of Residuals:
       Min         1Q     Median         3Q        Max 
-0.7445718 -0.4523645  0.2814629  0.3964276  0.6710428 


Coefficients:
               Estimate Naive S.E.    Naive z Robust S.E.   Robust z
(Intercept)  1.31321054 0.47893788  2.7419225  0.47503342  2.7644593
age         -0.01205922 0.01143541 -1.0545504  0.01160090 -1.0395071
visit       -0.06245418 0.06350982 -0.9833784  0.06561084 -0.9518882
treatP      -0.98039084 0.31316711 -3.1305677  0.31275820 -3.1346607

Estimated Scale Parameter:  1.010102
Number of Iterations:  2

Working Correlation
          [,1]      [,2] [,3] [,4] [,5] [,6] [,7] [,8]
[1,] 1.0000000 0.4890786    0    0    0    0    0    0
[2,] 0.4890786 0.4890786    0    0    0    0    0    0
[3,] 0.4890786 1.0000000    0    0    0    0    0    0
[4,] 0.4890786 0.4890786    0    0    0    0    0    0
[5,] 0.4890786 0.4890786    0    0    0    0    0    0
[6,] 1.0000000 0.4890786    0    0    0    0    0    0
[7,] 0.4890786 0.4890786    0    0    0    0    0    0
[8,] 0.4890786 1.0000000    0    0    0    0    0    0
  1. Interpret the coefficients for age and treatment.

solution: The coefficient for age (\(\beta_1 = -0.012\)) represents the change in the log-odds of the outcome for a one-unit increase in age. The odds ratio is given by \(\exp(\beta_1) = \exp(-0.012) = 0.99\), which quantifies how the odds of the outcome change with age. Since \(\beta_1 < 0\), older individuals have lower odds of the outcome.

The coefficient for treat (\(\beta_3 = -0.9804\)) represents the change in the log-odds of the outcome for individuals receiving treatment compared to those not receiving treatment. The odds ratio is \(\exp(\beta_3) = 0.375\), which quantifies the effect of treatment on the odds of the outcome.
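
A quick sketch for obtaining these odds ratios from the fitted GEE object:

# Exponentiate the GEE coefficients to obtain odds ratios
exp(coef(fit_gee))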

  1. Compare the robust and model-based standard errors in the output. What do they indicate about model assumptions?

solution:

  • Since the naive and robust standard errors are very similar, it suggests that the assumed correlation structure (exchangeable) aligns well with the actual data structure. There is little evidence of misspecification in the correlation structure.
  • Choice of S.E.: The robust standard errors are generally preferred as they are consistent even if the correlation structure is misspecified. In this case, the similarity between the two types of standard errors implies that either can be used with confidence.
  1. Explain how misspecifying the correlation structure affects the model’s results.

solution: If the correlation structure is misspecified, model-based standard errors (which assume the correct correlation structure) will likely be biased. This leads to over- or underestimation of the variability of the coefficient estimates. Robust (sandwich) standard errors, which are less sensitive to misspecification of the correlation structure, may still provide reliable estimates, even when the assumed correlation structure is incorrect.

Consider a study on school performance, where test scores are modeled as a function of teacher experience and classroom size, clustered by school.

  1. Write the general form of a GEE for this scenario, using \(y_{ij}\) for test scores, \(x_1\) for teacher experience, and \(x_2\) for classroom size.

solution:

Here is an example GEE formula. You may have chosen a different model which can be fine as long as you justify it:

\[ E[y_{ij}] = \mu_{ij} = \beta_0 + \beta_1 \cdot x_{1ij} + \beta_2 \cdot x_{2ij} \]

Where: \(y_{ij}\) is the test score for student \(j\) in school \(i\), \(x_{1ij}\) is the teacher experience, and \(x_{2ij}\) is the classroom size. In a GEE, the within-school dependence is handled through the working correlation structure rather than through a school-level random effect.

The general form of the working correlation structure for GEE is specified as:

\[ \text{cor}(y_{ij}, y_{ik}) = \rho \quad \text{(exchangeable structure)} \]

  1. How does the interpretation of coefficients differ between a GEE and a mixed-effects model?

solution: GEE coefficients are population-averaged (marginal) effects: they describe the average change in the outcome across the population for a unit change in the predictor. Mixed-effects model coefficients are subject-specific (conditional) effects: they describe the change for a given cluster or individual, holding its random effects fixed.

Simulate data for a Poisson GEE: Run the code below to simulate clustered count data where the response variable represents the number of tasks completed by workers in teams, with predictors for experience (exp) and workload (load).

library(geepack)
library(gee)
# Simulate clustered Poisson data
set.seed(123)
n_teams <- 10
n_workers <- 5
exp <- rnorm(n_teams * n_workers, mean = 5, sd = 2)
load <- rpois(n_teams * n_workers, lambda = 10)
team <- rep(1:n_teams, each = n_workers)
tasks <- rpois(n_teams * n_workers, lambda = exp(0.5 + 0.3 * exp - 0.1 * load))

# Create a data frame
data <- data.frame(team, exp, load, tasks)

# Fit a Poisson GEE
fit_pois_gee <- gee(tasks ~ exp + load, data = data, id = team, corstr = "exchangeable", family = poisson)
Beginning Cgee S-function, @(#) geeformula.q 4.13 98/01/27
running glm to get initial regression estimate
(Intercept)         exp        load 
 0.65183432  0.25481011 -0.09221407 
# Summary of the model
summary(fit_pois_gee)

 GEE:  GENERALIZED LINEAR MODELS FOR DEPENDENT DATA
 gee S-function, version 4.13 modified 98/01/27 (1998) 

Model:
 Link:                      Logarithm 
 Variance to Mean Relation: Poisson 
 Correlation Structure:     Exchangeable 

Call:
gee(formula = tasks ~ exp + load, id = team, data = data, family = poisson, 
    corstr = "exchangeable")

Summary of Residuals:
       Min         1Q     Median         3Q        Max 
-2.5792704 -0.9928465 -0.2308147  0.8542432  3.1320614 


Coefficients:
               Estimate Naive S.E.   Naive z Robust S.E.  Robust z
(Intercept)  0.68090576 0.30291226  2.247865  0.33007871  2.062859
exp          0.25262895 0.03879274  6.512274  0.04602186  5.489325
load        -0.09397882 0.02651737 -3.544047  0.01494948 -6.286427

Estimated Scale Parameter:  0.6964406
Number of Iterations:  2

Working Correlation
          [,1]      [,2]      [,3]      [,4]      [,5]
[1,] 1.0000000 0.1138303 0.1138303 0.1138303 0.1138303
[2,] 0.1138303 1.0000000 0.1138303 0.1138303 0.1138303
[3,] 0.1138303 0.1138303 1.0000000 0.1138303 0.1138303
[4,] 0.1138303 0.1138303 0.1138303 1.0000000 0.1138303
[5,] 0.1138303 0.1138303 0.1138303 0.1138303 1.0000000
  1. Interpret the coefficients for exp and load.

solution: \(\hat{\beta}_1 = 0.25\) represents the change in the log of the expected number of tasks for a one-unit increase in experience (exp), holding workload (load) constant. The exponentiated value \(\exp(0.25)\) gives the relative change in the expected number of tasks for each additional unit of experience. \(\hat{\beta}_2 = -0.094\) represents the change in the log of the expected number of tasks for a one-unit increase in workload (load), holding experience (exp) constant. The exponentiated value \(\exp(-0.094)\) gives the relative change in the expected number of tasks for each additional unit of workload.

  1. Change the correlation structure to AR(1) and compare the results. What differences do you observe?

solution:

# Fit a Poisson GEE
fit_pois_gee2 <- gee(tasks ~ exp + load, data = data, id = team, corstr = "AR-M", family = poisson)
Beginning Cgee S-function, @(#) geeformula.q 4.13 98/01/27
running glm to get initial regression estimate
(Intercept)         exp        load 
 0.65183432  0.25481011 -0.09221407 
# Summary of the model
summary(fit_pois_gee2)

 GEE:  GENERALIZED LINEAR MODELS FOR DEPENDENT DATA
 gee S-function, version 4.13 modified 98/01/27 (1998) 

Model:
 Link:                      Logarithm 
 Variance to Mean Relation: Poisson 
 Correlation Structure:     AR-M , M = 1 

Call:
gee(formula = tasks ~ exp + load, id = team, data = data, family = poisson, 
    corstr = "AR-M")

Summary of Residuals:
       Min         1Q     Median         3Q        Max 
-2.6058032 -0.9899380 -0.2396283  0.8584646  3.1402227 


Coefficients:
               Estimate Naive S.E.   Naive z Robust S.E.  Robust z
(Intercept)  0.64257506 0.30582240  2.101138  0.29645731  2.167513
exp          0.25799275 0.03980350  6.481660  0.04290350  6.013326
load        -0.09305641 0.02690561 -3.458625  0.01584915 -5.871382

Estimated Scale Parameter:  0.6984726
Number of Iterations:  2

Working Correlation
             [,1]        [,2]        [,3]        [,4]         [,5]
[1,] 1.0000000000 0.074917778 0.005612673 0.000420489 0.0000315021
[2,] 0.0749177775 1.000000000 0.074917778 0.005612673 0.0004204890
[3,] 0.0056126734 0.074917778 1.000000000 0.074917778 0.0056126734
[4,] 0.0004204890 0.005612673 0.074917778 1.000000000 0.0749177775
[5,] 0.0000315021 0.000420489 0.005612673 0.074917778 1.0000000000

solution: Between the two models, the estimates of the fixed effects change very little. The working correlation matrix for the AR(1) model shows the correlation decreasing for observations that are farther apart and remaining higher for observations that are closer together. The robust standard error for \(exp\) is larger than the naive SE, giving a more conservative estimate, whereas the robust standard error for \(load\) is smaller than the naive SE; where the two differ, the naive SEs can be misleading because they rely on the assumed correlation structure.

Robust Standard Errors
  • GEEs provide robust standard errors to account for potential misspecification of the working correlation structure.
  • These standard errors are derived using a sandwich estimator, ensuring valid inference even if the assumed correlation structure is incorrect.

Formula for Robust Variance-Covariance Matrix: The robust variance-covariance matrix of the estimated coefficients \(\hat{\beta}\) is given by:

\[ \text{Var}(\hat{\beta}) = A^{-1} B A^{-1}, \]

where:

  • \(A = -\frac{\partial U(\beta)}{\partial \beta}\) is the sensitivity matrix.

  • \(B = \text{Cov}[U(\beta)]\) is the variability of the score function \(U(\beta)\).
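
As a sketch, the two variance estimates can be compared directly from the fitted GEE object above (assuming the naive.variance and robust.variance components returned by the gee package):

# Model-based vs sandwich standard errors of beta-hat from the Poisson GEE fit
sqrt(diag(fit_pois_gee$naive.variance))   # naive (model-based) SEs
sqrt(diag(fit_pois_gee$robust.variance))  # robust (sandwich) SEs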

  1. Compare the implications of using model-based standard errors versus robust standard errors when the working correlation structure is misspecified.

solution: In the model above we include an AR covariance structure, which accounts for correlations that decrease over time or distance within groups. The exchangeable structure in the first model assumes constant correlation among all observations in a cluster. If the data exhibit temporal or sequential patterns, the AR structure is more appropriate. If the working correlation structure is misspecified, e.g., you assume an exchangeable structure when AR(1) is the true structure, the model-based (naive) SEs can badly misrepresent the uncertainty in the coefficient estimates. Robust SEs account for this additional noise or covariance and therefore remain valid under misspecification of the working correlation structure.

  1. Based on the \(R\) output above compare and interpret the model-based and Robust SEs for the covariate coefficients. What do differences indicate about the correlation structure?

solution: For \(exp\), the robust S.E. is slightly larger than the naive S.E., indicating that the robust estimator accounts for additional uncertainty from potential misspecification of the correlation structure. The AR model assumes that correlations decay over time, and the robust standard errors remain valid even if this assumption is imperfect.

  • For exp (Experience): The larger difference between the naive and robust standard errors suggests that the assumed correlation structure (exchangeable) may not fully capture the true dependencies between observations. The AR1 model, which accounts for decaying correlations over time, provides more reliable estimates of the coefficient for exp, as reflected by the smaller robust standard errors.

  • For load (Workload): The robust standard errors being smaller than the naive standard errors suggests that the AR1 structure better accounts for the correlation between observations within teams, leading to more precise estimates of the effect of load.

Variance-Covariance Derivation in GEEs

  • GEEs handle within-cluster correlations using a working correlation matrix, denoted by \(R(\alpha)\), where \(\alpha\) is a parameter that specifies the correlation structure (e.g., exchangeable, AR1).
  • The total variance-covariance matrix is derived as:

\[ \Sigma_i = \phi V(\mu_i)^{1/2} R(\alpha) V(\mu_i)^{1/2}, \]

where:

  • \(V(\mu_i)\) is the variance function (e.g., \(\mu_i\) for Poisson, \(\mu_i(1-\mu_i)\) for Binomial).

  • \(\phi\) is the dispersion parameter.
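
A toy numerical sketch of this construction for a Poisson outcome in a cluster of size 3, using hypothetical values for \(\mu_i\), \(\phi\), and \(\alpha\):

# Sigma_i = phi * V(mu)^(1/2) %*% R(alpha) %*% V(mu)^(1/2), with Poisson variance V(mu) = mu
mu <- c(4, 6, 9)                        # hypothetical cluster means
phi <- 1                                # dispersion parameter
alpha <- 0.3                            # exchangeable correlation
R <- matrix(alpha, 3, 3); diag(R) <- 1  # working correlation matrix
V_half <- diag(sqrt(mu))
Sigma_i <- phi * V_half %*% R %*% V_half
Sigma_i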

Module 5: Smoothing Functions

Basis Expansion and Parametric Splines

Consider the regression model below:

\[ \begin{align*} y_i &= g(x_i) + \epsilon_i\\ g(x_i)& = \sum_{m=1}^M \beta_m b_m (x_i) \end{align*} \]

  1. Write out the cubic spline model.

solution: For cubic splines, the basis functions \(b_m(x_i)\) are typically piecewise cubic polynomials that are continuous and have continuous first and second derivatives. The number of basis functions \(M\) depends on the number of knots and the degree of the spline (in this case, cubic, so degree 3).

The cubic spline basis functions, \(b_m(x_i)\), are designed to:

  • Interpolate the data at certain values of \(x_i\) called knots,
  • Ensure that the spline is smooth at the knots (i.e., the first and second derivatives are continuous).

The cubic spline model can be written as:

\[ g(x_i) = \beta_0 + \beta_1 b_1(x_i) + \beta_2 b_2(x_i) + \dots + \beta_M b_M(x_i) \]

Where each \(b_m(x_i)\) is a cubic polynomial that takes different forms depending on the location of the knots. The typical cubic spline basis function \(b_m(x_i)\) for a given knot is defined in such a way that:

\[ b_m(x) = (x - \xi_m)_+^3 = \begin{cases} (x - \xi_m)^3, & \text{if } x \geq \xi_m \\ 0, & \text{otherwise} \end{cases} \] Where \(\xi_m\) is the \(m\)-th knot (a truncated power basis function).

This gives a piecewise smooth function for \(g(x_i)\), ensuring flexibility in modeling the relationship between \(x_i\) and \(y_i\) while maintaining smoothness at the knots.
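
A small sketch of a truncated power basis of this kind (hypothetical knots at 20, 30, and 40; in practice bs() or ns() from the splines package constructs such bases):

# Cubic truncated power basis: global polynomial terms plus one truncated cubic per knot
trunc_cubic <- function(x, knot) pmax(x - knot, 0)^3
x <- seq(10, 50, by = 1)
B <- cbind(1, x, x^2, x^3, sapply(c(20, 30, 40), trunc_cubic, x = x))
dim(B)  # one column per polynomial term plus one per knot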

  1. Using the splines package, fit a piecewise cubic spline model to the GAbirth dataset.
library(splines)

# Load GAbirth dataset
load(paste0(getwd(), "/data/GAbirth.RData"))
GAbirth <- dat
# Fit cubic spline model
fit_spline <- lm(bw ~ ns(age, knots = c(20, 30, 40)), data = GAbirth)

# Summarize results
summary(fit_spline)

Call:
lm(formula = bw ~ ns(age, knots = c(20, 30, 40)), data = GAbirth)

Residuals:
     Min       1Q   Median       3Q      Max 
-2555.55   -66.43    68.93   168.19   576.66 

Coefficients:
                                Estimate Std. Error t value Pr(>|t|)    
(Intercept)                      3111.81      25.01 124.401  < 2e-16 ***
ns(age, knots = c(20, 30, 40))1   291.67      26.12  11.168  < 2e-16 ***
ns(age, knots = c(20, 30, 40))2   231.37      38.82   5.961 2.69e-09 ***
ns(age, knots = c(20, 30, 40))3   300.16      66.00   4.548 5.55e-06 ***
ns(age, knots = c(20, 30, 40))4   146.26      61.86   2.364   0.0181 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 309.5 on 4995 degrees of freedom
Multiple R-squared:  0.05464,   Adjusted R-squared:  0.05388 
F-statistic: 72.18 on 4 and 4995 DF,  p-value: < 2.2e-16
  1. Interpret the coefficients for age and explain how the knots affect the model fit.

solution:

  • The estimated intercept is 3111.8, which represents the estimated value of the dependent variable (bw, birth weight) when age is at the reference level (below 20 years, since the first knot is at 20). Significance: The intercept is highly significant with a p-value of < 2e-16, suggesting that the baseline birth weight when age is below 20 is statistically different from zero.
  • This coefficient, 291.7, represents the change in birth weight for an increase in age between 20 and 30 years compared to the reference group (below 20 years). Significance: The coefficient is highly significant (p-value < 2e-16), indicating a strong positive effect of age between 20 and 30 years on birth weight.
  • The coefficient, 231.4, represents the change in birth weight for an increase in age between 30 and 40 years compared to the reference group (below 20 years). Significance: The coefficient is highly significant (p-value = 2.7e-09), suggesting a positive effect of age between 30 and 40 years on birth weight, although the effect is slightly smaller than the previous age range (20-30 years).
  • The coefficient, 300.2, represents the change in birth weight for an increase in age between 40 and 50 years compared to the reference group (below 20 years). Significance: This effect is also highly significant (p-value = 5.6e-06), suggesting a strong positive effect of age between 40 and 50 years on birth weight, even though the sample size in this age range may be smaller.
  • The coefficient, 146.3, represents the change in birth weight for an increase in age between 50 and 60 years compared to the reference group (below 20 years). Significance: This effect is statistically significant (p-value = 0.018), though the magnitude of the effect is smaller than the previous age ranges.
  1. Use the AIC function to compare this model to a simple linear regression of bw ~ age.

solution:

Based on the AIC values, the spline model has a better fit (lower AIC) than the simple linear regression on age.

fit_linear <- lm(bw ~ age, data = GAbirth)

AIC(fit_spline, fit_linear)
           df      AIC
fit_spline  6 71545.23
fit_linear  3 71584.76
  1. We pre-defined the knots in the above code. Now let’s use automatic knot selection with the L2-penalty.

    a. Write out the objective function with the penalty constraint.

solution: In the context of cubic splines with automatic knot selection using an L2-penalty, the objective function consists of two parts: the least squares term (fit) and the L2-penalty term (smoothing penalty).

The least squares term measures how well the model fits the data:

\[ \text{Fit Term} = \sum_{i=1}^n \left(y_i - g(x_i)\right)^2 \]

Where \(y_i\) is the observed value for the \(i\)-th observation, \(g(x_i)\) is the cubic spline function fitted to \(x_i\), \(n\) is the number of observations.

The L2-penalty term controls the smoothness of the spline by penalizing the total curvature:

\[ \text{Penalty Term} = \lambda \int \left( g''(x) \right)^2 \, dx = \lambda \boldsymbol{\beta}' B \boldsymbol{\beta} \]

Where: \(\lambda\) is the penalty parameter that controls the strength of the penalty (larger \(\lambda\) leads to smoother splines), \(b_m(x)\) is the \(m\)-th basis function of the spline, and \(B\) is the penalty matrix with entries \(B_{jk} = \int b_j''(x) \, b_k''(x) \, dx\). The second derivatives measure curvature, so the penalty term encourages less curvature, meaning smoother splines.

The total objective function combines both the fit and the penalty:

\[ \text{Objective Function:} \quad \hat{\boldsymbol{\beta}} = \underset{\boldsymbol{\beta}}{\arg\min} \; (Y-X\beta)' (Y-X\beta) + \lambda \beta' B \beta \]

Where:

  • The first term is the sum of squared residuals, which measures the fit of the model to the data.
  • The second term is the L2-penalty, which penalizes excessive curvature in the spline.
  • \(\lambda\) controls the trade-off between fitting the data well and keeping the spline smooth.

b. Derive the closed form solution for \(\hat{\beta}\) and describe the impact of the penalty term.

solution:

Given the objective function for the cubic spline with an L2-penalty, we aim to derive the closed-form solution for \(\hat{\beta}\) and describe the effect of the penalty term on the model.

The total objective function becomes:

\[ \text{Objective Function} = \|Y - X\beta\|^2 + \lambda \beta^T B \beta \] To minimize the objective function with respect to \(\beta\), we take the derivative and set it equal to zero:

\[ \frac{\partial}{\partial \beta} \left( \|Y - X\beta\|^2 + \lambda \beta^T B \beta \right) = 0 \] First, compute the derivative of the least squares term:

\[ \frac{\partial}{\partial \beta} \|Y - X\beta\|^2 = -2X^T (Y - X\beta) \] Then, compute the derivative of the penalty term:

\[ \frac{\partial}{\partial \beta} \lambda \beta^T B \beta = 2\lambda B \beta \]

  1. Set the sum of these two derivatives equal to zero:

\[ -2X^T (Y - X\beta) + 2\lambda B \beta = 0 \]

  1. Simplifying, we get the closed-form solution for \(\hat{\beta}\):

\[ (X^T X + \lambda B) \hat{\beta} = X^T Y \]

  1. Thus, the closed-form solution for the coefficients \(\hat{\beta}\) is:

\[ \hat{\beta} = (X^T X + \lambda B)^{-1} X^T Y \]
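
A minimal sketch of this closed-form solution, assuming a design matrix X, response Y, penalty matrix B, and penalty parameter lambda are already available:

# Penalized least squares: beta-hat = (X'X + lambda * B)^(-1) X'Y
penalized_beta <- function(X, Y, B, lambda) {
  solve(t(X) %*% X + lambda * B, t(X) %*% Y)
}

As \(\lambda \to 0\) this reduces to ordinary least squares; larger \(\lambda\) shrinks the fit toward a smoother function.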

The Generalized Cross-Validation Error (GCV) is defined as

\[ GCV = \frac{1}{n} \frac{\sum_i (y_i - \hat{y}_i)^2}{\left[1-\frac{1}{n} \text{tr}(S)\right]^2} \]

  1. Fitting GAM with \(mgcv\): Fit two models where (1) default \(k=10\), and (2) \(k=30\). Compare GCV values and determine the model with best fit.

solution: There is little difference between GCV values when using \(k=10\) vs \(k=30\). In the first model (k = 10), the GCV is 95836; in the second model (k = 30), the GCV is 95835. Therefore, adding more basis functions does not greatly improve model fit.

library(mgcv)
Loading required package: nlme

Attaching package: 'nlme'
The following object is masked from 'package:lme4':

    lmList
This is mgcv 1.8-42. For overview type 'help("mgcv-package")'.
fit_spline10 <- gam(bw ~ s(age), data = GAbirth)
fit_spline30 <- gam(bw ~ s(age, k=30), data = GAbirth)
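
The GCV scores quoted in the solution can be read off the fitted objects (the gcv.ubre component of an mgcv gam fit holds the GCV/UBRE score):

# GCV scores for the two smooths
fit_spline10$gcv.ubre
fit_spline30$gcv.ubre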
  1. Check the diagnostic plots using \(plot()\) or \(gam.check()\) for the optimal model from (6). Interpret the findings from the diagnostic output.

solution: the \(gam.check()\) output below shows that smoothing parameter selection converged, the k-index is about 1 with a large p-value (0.68), and the effective degrees of freedom (3.5) are far below \(k' = 29\); there is therefore no evidence that the basis dimension is too small. The residual plots produced alongside this output (not reproduced here) should also be checked for roughly normal, constant-variance residuals.

gam.check(fit_spline30)


Method: GCV   Optimizer: magic
Smoothing parameter selection converged after 7 iterations.
The RMS GCV score gradient at convergence was 0.08232834 .
The Hessian was positive definite.
Model rank =  30 / 30 

Basis dimension (k) checking results. Low p-value (k-index<1) may
indicate that k is too low, especially if edf is close to k'.

         k'  edf k-index p-value
s(age) 29.0  3.5    1.01    0.68
  1. Using the optimal model chosen from (6), interpret the parametric coefficients.
  • Solution:
  • The estimated intercept is 3288.82. Because the smooth term \(s(age)\) in mgcv is centered to average zero over the data, the intercept represents the expected birth weight at the average of the smooth, i.e., roughly the mean birth weight in the sample, not the birth weight at age zero.
  • The standard error of 4.38 indicates the precision of the estimate. Since the standard error is small relative to the coefficient, we can be confident in the intercept estimate.
  • The t-value of 752 and the p-value of < 2e-16 indicate that the intercept is highly statistically significant, meaning the mean birth weight is significantly different from zero (which is expected and not of substantive interest on its own).
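The numbers quoted above can be read off the parametric part of the model summary; in mgcv this table is stored in the p.table component of the summary.gam() object:

# Parametric (non-smooth) coefficient table: estimate, SE, t value, p-value
summary(fit_spline30)$p.table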

Generalized Additive Models for Non-Normal Data.

  • Generalized Additive Models (GAMs): Extend GLMs by allowing additive non-linear relationships between predictors and the response.

\[ g(E(y)) = \beta_0 + \sum_{j=1}^p s_j(X_j) \]

  • Bivariate Splines: Used for modeling interactions between two continuous variables.
  • Interpreting Non-Linear Effects: GAMs provide smooth terms (\(s(x)\)) that capture non-linear patterns in the data.
  1. Use the gam() function from the mgcv package to fit a model predicting binary outcomes (e.g., infection status) from age and time of visit.
library(mgcv)

# Simulate binary data
set.seed(123)
age <- runif(100, 20, 60)
time <- runif(100, 0, 10)
outcome <- rbinom(100, 1, plogis(0.5 + 0.03 * age - 0.1 * time))

# Fit GAM
fit_gam <- gam(outcome ~ s(age) + s(time), family = binomial, method = "REML")

# Summarize and plot
summary(fit_gam)

Family: binomial 
Link function: logit 

Formula:
outcome ~ s(age) + s(time)

Parametric coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)   1.1390     0.2461   4.629 3.68e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
          edf Ref.df Chi.sq p-value
s(age)  1.895  2.373  4.705   0.146
s(time) 2.107  2.637  2.002   0.416

R-sq.(adj) =  0.053   Deviance explained =  8.7%
-REML = 56.127  Scale est. = 1         n = 100
plot(fit_gam, pages = 1)

  1. Extend the GAM model to include a bivariate spline for age and time.
fit_gam_bivariate <- gam(outcome ~ te(age, time), family = binomial, method = "REML")

# Summarize and plot
summary(fit_gam_bivariate)

Family: binomial 
Link function: logit 

Formula:
outcome ~ te(age, time)

Parametric coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)   1.1874     0.2571   4.618 3.87e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
               edf Ref.df Chi.sq p-value
te(age,time) 6.026  7.806  7.497   0.469

R-sq.(adj) =  0.0472   Deviance explained = 10.8%
-REML = 50.367  Scale est. = 1         n = 100
plot(fit_gam_bivariate, scheme = 1)

  1. What does the bivariate spline capture that separate univariate smooths do not?

solution: The bivariate spline te(age, time) captures the interaction between age and time: it fits a smooth surface over both variables jointly, allowing the effect of age on the outcome to change with time (and vice versa). Separate univariate smooths for age and time capture each variable's nonlinear effect individually, but they assume the two effects are additive, so they cannot represent how changes in age and time together influence the outcome. An informal comparison of the two fits is sketched below.
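One informal way to check whether the interaction surface adds anything here is to compare the additive and bivariate fits, for example by AIC (both models were fit above; the comparison assumes they are still in the workspace):

# Informal comparison of the additive fit and the tensor-product (interaction) fit
AIC(fit_gam, fit_gam_bivariate)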

  1. Visualize the interaction between age and time using the plot() function.
# Visualize the interaction between age and time using the plot function
plot(fit_gam_bivariate, pages = 1, scale = 0)

Module 6: Principal Component Analysis

PCA is a method for dimensionality reduction that identifies principal components (PCs), which are uncorrelated linear combinations of the original variables. These PCs are ordered by the amount of variance they explain in the data.

Steps in PCA

  1. Standardize the data if variables are measured on different scales.
  2. Compute the covariance matrix \(\Sigma\) or correlation matrix.
  3. Perform eigenvalue decomposition (EVD) or singular value decomposition (SVD).
    • Eigenvalues (\(\lambda_i\)) represent the variance explained by each PC.
    • Eigenvectors (\(v_i\)) define the weights for each variable in the PCs.
  4. Transform the data into the principal component space.

Variance Explained by PCs

The proportion of variance explained (PVE) by the \(k\)th PC is given by: \[ \text{PVE}_k = \frac{\lambda_k}{\sum_{i=1}^p \lambda_i}, \] where \(\lambda_k\) is the eigenvalue corresponding to the \(k\)th PC.
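A minimal sketch of these steps, done by hand on the mtcars variables (used again below) rather than through prcomp(), with illustrative object names:

# 1. Standardize the variables
Z <- scale(mtcars)

# 2. Correlation matrix (the covariance matrix of the standardized data)
R <- cov(Z)

# 3. Eigen-decomposition: eigenvalues = variances of the PCs, eigenvectors = loadings
evd <- eigen(R)

# 4. Project the standardized data onto the eigenvectors to obtain PC scores
pc_scores <- Z %*% evd$vectors

# Proportion of variance explained by each PC
round(evd$values / sum(evd$values), 3)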

Principal Component Regression (PCR)

PCR uses PCs as predictors in regression models instead of the original variables. This approach is particularly useful when predictors are highly correlated.

Consider a dataset with three variables \(X_1\), \(X_2\), and \(X_3\). The covariance matrix \(\Sigma\) is:

\[ \Sigma = \begin{bmatrix} 2 & 1 & 0.5 \\ 1 & 2 & 0.3 \\ 0.5 & 0.3 & 1 \end{bmatrix}. \]

  1. Compute the eigenvalues and eigenvectors of \(\Sigma\) (use R if needed).

solution:

# Define the covariance matrix
Sigma <- matrix(c(2, 1, 0.5, 
                  1, 2, 0.3, 
                  0.5, 0.3, 1), 
                nrow = 3, byrow = TRUE)

# Compute eigenvalues and eigenvectors
eig <- eigen(Sigma)

# Display eigenvalues and eigenvectors
eig$values
[1] 3.1495178 1.0807973 0.7696849
eig$vectors
           [,1]       [,2]       [,3]
[1,] -0.6954201  0.5052059  0.5110361
[2,] -0.6716472 -0.7098118 -0.2122666
[3,] -0.2555011  0.4908505 -0.8329376
  1. Write the first principal component as a linear combination of \(X_1\), \(X_2\), and \(X_3\).

solution:

Given the eigenvalues and eigenvectors of the covariance matrix \(\Sigma\), we can write the first principal component (PC1) as a linear combination of the original variables \(X_1\), \(X_2\), and \(X_3\). The first eigenvector corresponding to the largest eigenvalue (\(\lambda_1 = 3.15\)) represents the weights for the first principal component.

From the output of eig$vectors, the first eigenvector is:

\[ \mathbf{v_1} = \begin{bmatrix} -0.695 \\ -0.672 \\ -0.256 \end{bmatrix} \]

Thus, the first principal component (PC1) can be written as a linear combination of \(X_1\), \(X_2\), and \(X_3\):

\[ \text{PC1} = -0.695 \cdot X_1 - 0.672 \cdot X_2 - 0.256 \cdot X_3 \]

This expression shows that the first principal component is a weighted sum of the original variables, where the coefficients represent the contribution of each variable to the overall component.

Interpretation of the First Principal Component:

  • \(X_1\) (coefficient -0.695) has the largest weight in absolute value, so it contributes most strongly to PC1.
  • \(X_2\) (coefficient -0.672) contributes almost as strongly.
  • \(X_3\) (coefficient -0.256) has the smallest weight, indicating a smaller contribution to PC1.
  • Note that the sign of an eigenvector is arbitrary (multiplying \(\mathbf{v_1}\) by -1 gives an equally valid PC1), so what matters is the relative magnitude and relative signs of the loadings: all three variables load with the same sign, so PC1 is essentially a weighted average of \(X_1\), \(X_2\), and \(X_3\), dominated by \(X_1\) and \(X_2\).
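As a quick sanity check that this combination really carries the largest variance, one can simulate data with covariance \(\Sigma\) (here via MASS::mvrnorm, assuming the MASS package is available) and compare the sample variance of the projected scores to \(\lambda_1 \approx 3.15\):

set.seed(1)
sim <- MASS::mvrnorm(n = 10000, mu = rep(0, 3), Sigma = Sigma)

# Project the simulated observations onto the first eigenvector
pc1_scores <- sim %*% eig$vectors[, 1]

# Sample variance of the PC1 scores should be close to the largest eigenvalue (about 3.15)
var(pc1_scores)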

  1. Perform PCA on the numerical variables in the mtcars dataset:
# Load mtcars dataset
data(mtcars)

# Perform PCA
pca_mtcars <- prcomp(mtcars, scale. = TRUE)

# Summary of PCA
summary(pca_mtcars)
Importance of components:
                          PC1    PC2     PC3     PC4     PC5     PC6    PC7
Standard deviation     2.5707 1.6280 0.79196 0.51923 0.47271 0.46000 0.3678
Proportion of Variance 0.6008 0.2409 0.05702 0.02451 0.02031 0.01924 0.0123
Cumulative Proportion  0.6008 0.8417 0.89873 0.92324 0.94356 0.96279 0.9751
                           PC8    PC9    PC10   PC11
Standard deviation     0.35057 0.2776 0.22811 0.1485
Proportion of Variance 0.01117 0.0070 0.00473 0.0020
Cumulative Proportion  0.98626 0.9933 0.99800 1.0000
# Plot PCA
biplot(pca_mtcars, scale = 0)

  1. How many PCs explain at least 90% of the variance?

solution: To find how many principal components (PCs) explain at least 90% of the variance, check the cumulative proportion of variance explained. The first three PCs explain about 89.9% of the variance, just short of 90%; four PCs are needed to exceed the 90% threshold (cumulative PVE 92.3%).

# Calculate cumulative variance explained
cumsum(pca_mtcars$sdev^2) / sum(pca_mtcars$sdev^2)
 [1] 0.6007637 0.8417153 0.8987332 0.9232421 0.9435558 0.9627918 0.9750884
 [8] 0.9862612 0.9932655 0.9979960 1.0000000
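The same answer can be read off programmatically by finding the first component at which the cumulative PVE reaches 0.90:

# Smallest number of PCs whose cumulative PVE is at least 90%
which(cumsum(pca_mtcars$sdev^2) / sum(pca_mtcars$sdev^2) >= 0.90)[1]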
  1. Interpret the first two PCs using the loadings (pca_mtcars$rotation).

solution:

# Get the loadings for the first two PCs
pca_mtcars$rotation[, 1:2]
            PC1         PC2
mpg  -0.3625305  0.01612440
cyl   0.3739160  0.04374371
disp  0.3681852 -0.04932413
hp    0.3300569  0.24878402
drat -0.2941514  0.27469408
wt    0.3461033 -0.14303825
qsec -0.2004563 -0.46337482
vs   -0.3065113 -0.23164699
am   -0.2349429  0.42941765
gear -0.2069162  0.46234863
carb  0.2140177  0.41357106
  • Variables with positive loadings (such as cyl, disp, hp, wt) have a positive relationship with PC1, while variables with negative loadings (such as mpg, drat, qsec, vs, am, gear) have a negative relationship with PC1.
  • PC1 seems to capture a general measure of vehicle performance or size, as variables like cyl, disp, and hp (which are related to engine size and power) have positive loadings, and variables like mpg (miles per gallon), drat (rear axle ratio), and qsec (quarter-mile time) have negative loadings, which are inversely related to performance.
  • PC2 appears to capture a different aspect of the data, possibly related to the vehicle’s weight, performance, and transmission type.
  • Variables such as am (transmission type), gear, and carb (number of carburetors) have positive loadings on PC2; am and gear in particular suggest that this component is related to transmission type and gear configuration.
  • Variables like qsec and wt have negative loadings on PC2, indicating a negative relationship with this component.
  • The relatively small loadings for mpg, cyl, disp, and hp suggest that PC2 is less sensitive to these variables compared to PC1.
  1. Calculate the proportion of variance explained (PVE) for the first three PCs.

solution: The proportion of variance explained (PVE) can be computed as the square of the standard deviations of the principal components divided by the total variance:

# Calculate the proportion of variance explained
pve <- (pca_mtcars$sdev)^2 / sum((pca_mtcars$sdev)^2)

# Display the PVE for the first three PCs
pve[1:3]
[1] 0.60076366 0.24095163 0.05701793
  1. Create a scree plot to visualize the PVE for all components.
pve <- (pca_mtcars$sdev)^2 / sum((pca_mtcars$sdev)^2)
plot(pve, type = "b", xlab = "Principal Component", ylab = "Proportion of Variance Explained")

  1. Write out the model for a PCA regression of mpg (miles per gallon) on the first 3 PCs from the mtcars dataset.

Solution:

\[ mpg_i = \beta_0 + \beta_1 PC_{1i} + \beta_2 PC_{2i} + \beta_3 PC_{3i} + \epsilon_i \]

(The intercept \(\beta_0\) is included because the fitted model below estimates one.)

  1. Perform this regression in R
# PCA for regression
pca_data <- data.frame(pca_mtcars$x[, 1:3], mpg = mtcars$mpg)

# Regression on PCs
fit_pcr <- lm(mpg ~ PC1 + PC2 + PC3, data = pca_data)
summary(fit_pcr)

Call:
lm(formula = mpg ~ PC1 + PC2 + PC3, data = pca_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.5769 -1.1683 -0.1571  1.0958  4.1058 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 20.09062    0.35240  57.010  < 2e-16 ***
PC1         -2.18495    0.13928 -15.688 2.12e-15 ***
PC2          0.09718    0.21992   0.442  0.66197    
PC3         -1.36055    0.45210  -3.009  0.00549 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.993 on 28 degrees of freedom
Multiple R-squared:  0.9012,    Adjusted R-squared:  0.8906 
F-statistic: 85.12 on 3 and 28 DF,  p-value: 3.496e-14
  1. Compare the results to a regression using all original predictors.

solution:

fit_reg <- lm(mpg ~ ., data = mtcars)  # regression of mpg on all original predictors
summary(fit_reg)
  • PCA Regression:
    • PC1 and PC3 are both significant predictors of mpg, while PC2 is not significant. This suggests that the first and third principal components capture most of the meaningful variation in mpg, and the second principal component does not contribute much.
  • Regular Regression:
    • The original predictors such as cyl, disp, hp, drat, and others show little to no significance, with wt being the only variable that is close to significant with a p-value of 0.063.
  1. Interpret the coefficients of the PCs and discuss why they differ from coefficients in a regression on the original variables.

solution:

  • PCA Regression:
    • PC1: The coefficient for PC1 is -2.1850, which is statistically significant (p-value = 2.1e-15), indicating that the first principal component has a strong negative effect on mpg.
    • PC2: The coefficient for PC2 is 0.0972, but it is not significant (p-value = 0.6620), suggesting that PC2 does not contribute significantly to explaining mpg.
    • PC3: The coefficient for PC3 is -1.3605, and it is statistically significant (p-value = 0.0055), suggesting that PC3 has a significant negative effect on mpg.
  • Regular Regression:
    • The coefficient for cyl is -0.1114, indicating a small negative relationship between cyl and mpg, though it is not statistically significant (p-value = 0.916).

    • The coefficient for disp is 0.0133, suggesting a small positive relationship with mpg, but it is also not statistically significant (p-value = 0.463).

    • The coefficient for hp is -0.0215, indicating a slight negative relationship with mpg, though it is not significant (p-value = 0.335).

    • The coefficient for wt is -3.7153, showing a sizeable negative association with mpg (p-value = 0.063), though this is slightly above the conventional 0.05 significance threshold.

    • am, gear, and carb also show positive relationships with mpg, but none are statistically significant (p-values > 0.05).

Interpretation of Coefficients:

  • PCA Regression:
    • The coefficients in the PCA regression represent how the principal components (combinations of the original variables) contribute to explaining mpg. These components capture the most variance in the data and summarize the relationships between the variables.
    • PC1 and PC3 have significant coefficients, indicating that they explain important aspects of the variance in mpg. PC2, however, does not contribute significantly, suggesting that it represents noise or less important variation.
  • Regular Regression:
    • The coefficients in the regular regression are the direct effects of the original variables on mpg. Some variables (e.g., wt) show a meaningful negative effect on mpg, but many of the other predictors are not statistically significant. This could be due to multicollinearity between the original predictors, which makes it harder to isolate the individual effects.
    • The regular regression model’s coefficients are directly tied to the individual variables, while in PCA, the coefficients correspond to the principal components, which are uncorrelated and can better capture the underlying structure in the data.
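To see this multicollinearity directly, one can look at the pairwise correlations among the predictors and compute variance inflation factors by hand (base R only; \(VIF_j = 1/(1 - R_j^2)\), where \(R_j^2\) comes from regressing predictor \(j\) on the remaining predictors):

# Pairwise correlations among engine/size variables: values near 1 signal multicollinearity
round(cor(mtcars[, c("cyl", "disp", "hp", "wt")]), 2)

# Manual variance inflation factors for the full mpg regression
predictors <- mtcars[, setdiff(names(mtcars), "mpg")]
vif <- sapply(names(predictors), function(v) {
  r2 <- summary(lm(reformulate(setdiff(names(predictors), v), response = v),
                   data = predictors))$r.squared
  1 / (1 - r2)
})
round(vif, 1)

If the VIFs come out large (well above 5 or 10), that is consistent with the coefficients in the full regression being unstable and individually non-significant even though the model as a whole fits well.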