5.30 Exercises

Suppose the variable “age category” has levels <18 years, 18 to 25 years, 26 to 39 years, 40 to 55 years, 56 to 64 years, and 65 years and older. If you include this variable in a regression model, how many age category regression coefficients will be estimated?
True or false? The unstandardized residual for an individual case is the difference between the observed and predicted outcome for that case.
True or false? An assumption of the linear regression model is that the outcome is normally distributed.
Explain the difference between a confounder and a mediator. Which should be adjusted for in a regression model and which should not?
What are two ways of incorporating effect modification into a regression analysis?
Which is wider, a confidence interval for the mean outcome or a prediction interval for an individual observation? Why?
Name the assumptions of the linear regression model.
The natural log, inverse, and square root transformations are all special cases of what family of transformations?
What kind of additional analysis can be done to determine the impact of outliers and influential observations on the regression coefficients and the conclusions of the regression analysis?
Explain when a multiple testing adjustment would be necessary.

For Exercises 11 to 13, use the 2018 Natality subsample teaching dataset (natality2018_rmph.Rdata, see Appendix A.3).

Create a complete case analysis dataset that includes all cases with no missing data for the following variables: DBWT (birthweight (g)), CIG_REC (smoked during pregnancy), risks (risk factors reported), MAGER (mother’s age), MRACEHISP (mother’s race/ethnicity), DMAR (marital status), MEDUC (mother’s education), PRIORLIVE (prior births now living), PRIORDEAD (prior births now dead), and BMI (mother’s body mass index, kg/m²). How many rows were in the original dataset? How many rows are in the complete case analysis dataset?
Examine the complete case analysis dataset you created in Exercise 11. Create a summary() and histogram for each continuous variable. For each categorical variable, check to see if it is coded as a factor and create a frequency table. If you are not sure whether a numeric (non-factor) variable should be treated as continuous or categorical, use table() to see how many unique values there are – if there are only a few, then consider it categorical.
Using the complete case analysis dataset you created in Exercise 11, treat PRIORLIVE and PRIORDEAD as factor variables and collapse them so they have fewer levels, since many of the levels are sparse. Create a new variable called PRIORLIVE_cat that is a factor with levels 0, 1, 2, 3, and 4+. Create a new variable called PRIORDEAD_cat that is a factor with levels 0 and 1+. Hint: Use factor() to turn them into factor variables first, then use fct_collapse() to collapse the levels. Finally, use table() to compare the variables before and after the derivation to make sure it was done correctly.

For Exercises 14 to 21, use the modified complete case analysis 2018 Natality subsample teaching dataset you created in Exercises 11 and 13. If you did not complete Exercises 11 and 13, you can still do these exercises using PRIORLIVE instead of PRIORLIVE_cat and PRIORDEAD instead of PRIORDEAD_cat (treat them as continuous variables). In these exercises, you will assess the association between birthweight (the outcome) and smoking during pregnancy (the primary predictor), adjusted for risk factors reported, mother’s age, mother’s race/ethnicity, marital status, mother’s education, prior births now living (use the collapsed factor version if you completed Exercise 13), prior births now dead (use the collapsed factor version if you completed Exercise 13), and mother’s body mass index.

Visualize the unadjusted relationship between the outcome and each predictor.
Fit both the unadjusted and adjusted regression models. Assess the unadjusted and adjusted association between birthweight and cigarette use during pregnancy. How do the effect estimate, its 95% confidence interval, and its statistical significance change after adjusting for other variables?
Create a regression table displaying the results of the adjusted model, including the regression coefficients, their 95% CIs and p-values, and multiple degree of freedom p-values for categorical variables with more than two levels. Include an appropriate brief label for each variable.
Interpret the regression coefficient(s) for each predictor in the adjusted model. For each, state the regression coefficient, 95% CI, whether or not the predictor is statistically significant (along with its p-value), and provide an interpretation in terms of how differences in the predictor are associated with differences in the mean outcome.
Visualize the adjusted relationships. How are the solid lines in these plots related to the estimated regression coefficients?
Does the association between cigarette smoking during pregnancy and birthweight depend on mother’s education? Modify the adjusted model appropriately to be able to answer this question. Regardless of statistical significance, create a plot illustrating the smoking effect at each level of mother’s education and estimate the cigarette smoking effect at each level of mother’s education.
Does the association between mother’s BMI and birthweight differ between mothers of different race/ethnicity? Modify the adjusted model appropriately to be able to answer this question. Regardless of statistical significance, estimate the BMI effect at each level of mother’s race/ethnicity and make a plot illustrating the BMI effect at each level of mother’s race/ethnicity.
Does the association between mother’s BMI and birthweight depend on mother’s age? Modify the adjusted model appropriately to be able to answer this question. Regardless of statistical significance, estimate the BMI effect for mother’s age 20, 30, and 40 years and make a plot illustrating the BMI effect at each of these ages.

For Exercises 22 and 23, use the CAMP teaching dataset camp_0_48_rmph.rData (see Appendix A.6).

In Exercise 26 in Chapter 4, we tested to see if post-bronchodilator Forced Expiratory Volume at 1 second (FEV1) (POSFEV0) differed significantly between treatment groups (Budesonide, Nedocromil, Placebo) (TG). Does the difference in POSFEV0 between treatment groups significantly depend on whether or not the child has a dehumidifier in the home (dehumid)?
In Exercise 27 in Chapter 4, we estimated the two treatment effects (Budesonide vs. Placebo, Nedocromil vs. Placebo) along with their 95% CIs and p-values. Using the model you fit in the previous exercise, estimate these two treatment effects among those with a dehumidifier and among those without. Interpret these results.

For Exercises 24 and 25, the following code creates the NHANES 2017-2018 (Appendix A.1) teaching dataset needed:

load("Data/nhanes1718_adult_fast_sub_rmph.Rdata")
nhanesf <- nhanes_adult_fast_sub
rm(nhanes_adult_fast_sub)

nhanesf.complete <- nhanesf %>% 
  select(LBDGLUSI, BMXWAIST, smoker, RIDAGEYR,
         RIAGENDR, RIDRETH3, income) %>% 
  drop_na()

In Section 5.10.9.2, Example 5.3, we fit the following model

fit.ex5.3.int <- lm(LBDGLUSI ~ RIAGENDR + income + RIAGENDR:income,
                    data = nhanesf.complete)

and plotted fasting glucose vs. income by gender. In this exercise, plot fasting glucose vs. gender by income to visualize how the income effects differ between genders.

In Section 5.10.9.3, Example 5.4, we fit the following model

fit.ex5.4.int <- lm(LBDGLUSI ~ RIDAGEYR + BMXWAIST + RIDAGEYR:BMXWAIST,
                    data = nhanesf.complete)

and plotted fasting glucose vs. waist circumference by age. In this exercise, plot fasting glucose vs. age at waist circumference values of 80, 100, and 120 cm to visualize how the age effect differs between those with different levels of waist circumference.

In each of Exercises 26 to 28, draw a figure similar to those in Section 5.9 that illustrates the described hypothetical relationships between variables, and state whether this illustrates a confounder, mediator, or moderator.

Substance use affects risk of suicide in part through substance use disorder.
The effect of substance use on suicide differs between those with different gender identity.
The effect of substance use on suicide must be adjusted for gender identity because gender identity is associated with both variables but is not in the causal pathway.

For Exercises 29 to 35, use the UN Human Development Data (unhdd2020.rmph.rData, see Appendix A.1).

What is the association between the outcome life expectancy at birth (years) (life) and expected years of schooling (years) (educ_expected), adjusted for urban population (%) (urban), annual growth in per capita GDP (%) (gdp_percap_growth), and depth of food deficit (average dietary energy supply adequacy, 2017/2019, %) (food_deficit)? Estimate the adjusted regression coefficient, 95% CI, and p-value. Also, how does the adjusted association compare to the unadjusted and, based on this comparison, what would you conclude about the extent to which the three variables you adjusted for confound the association? A 10% difference in effects is often used as a cutoff for deciding if confounding is meaningful (of course, this cutoff is arbitrary). Make sure you fit the unadjusted model to the same sample as your adjusted model.
Using the adjusted model from the previous exercise, what is the estimated mean life expectancy (and its 95% CI) among countries with 10 years of expected schooling? What about among countries with 15 years of expected schooling? Assume 50% urban population (urban = 50), no change in annual per capita GDP (gdp_percap_growth = 0), and a depth of food deficit of 120% (food_deficit = 120).
For each of the two sets of predictor values in the previous exercise, within what bounds do we expect 95% of countries’ life expectancy to fall?
Does the Gender Inequality Index (gii) differ between HDI groups (hdi_group), adjusted for urban population (%) (urban), median age (years) (age_median), and current health expenditure (2017, % of GDP) (cur_health_expen)? Estimate the adjusted differences between groups and the reference level (“Low”), their 95% CIs, and p-values. Also, how do the adjusted differences compare to the unadjusted and, based on these comparisons, what would you conclude about the extent to which the three variables you adjusted for confound the association? A 10% difference in effects is often used as a cutoff for deciding if confounding is meaningful (of course, this cutoff is arbitrary). Make sure you fit the unadjusted model to the same sample as your adjusted model.
Using the adjusted model from the previous exercise, what is the estimated mean GII (and its 95% CI) among countries in the “Low” HDI group? What about among countries in the “Very high” HDI group? For the other variables in the model (urban population, median age, and current health expenditure), choose any plausible value. You can use summary() to summarize their distributions to help you decide what values are plausible.
For each of the two sets of predictor values in the previous exercise, within what bounds do we expect 95% of countries’ GII to fall?
Is there an association between the predictor tuberculosis incidence (2018, per 100,000 people) (tb) and the outcome Human Development Index (hdi)? Plot the unadjusted association. Then fit the model and check the assumptions. Then transform the predictor (try each of log, inverse, and square root) and re-fit the model to answer the research question.

For Exercises 36-43, use our COVID-19 county-level data (covid_20210908_rmph.rData, see Appendix A.4).

Is the (predictor) county-level number of ICU beds per 100,000 population (icu_beds_per_100k) associated with the (outcome) number of deaths (deaths.usafacts.20210908)? Give a possible interpretation for the direction of association.
Is the predictor county-level number of ICU beds per 100,000 population (icu_beds_per_100k) associated with the outcome number of deaths (deaths.usafacts.20210908) after adjusting for population size (PopulationEstimate2018)?
Assess the regression model assumptions (normality, linearity, and constant variance) for the model you fit in the previous exercise.
In the model from the previous exercise, log-transform the outcome and both predictors, re-fit the model, and re-check the assumptions. Hint: Check for zeros for each variable, and if there are any then add a small number to the variable prior to the log-transformation.
Instead of a log-transformation of the outcome in the previous exercise, try a Box-Cox transformation, re-fit the model, and re-check the assumptions. If there were zeros and you added a small number prior to transformation in the previous exercise, do the same here prior to the Box-Cox transformation. Hint: Fit the model with log-transformed predictors but with the UN-transformed outcome (after adding a small number, if needed), then apply the MASS::boxcox() function to that model.
Fit a regression model for the outcome “estimated age-adjusted percentage of diagnosed diabetes in 2016 among adults age 20 and over” (DiabetesPercentage) that includes the following predictors:

X..Adults.with.Obesity: percentage of the adult population (age 20 and older) that reports a body mass index (BMI) greater than or equal to 30 kg/m² (2016)
Rural.UrbanContinuumCode2013: rural-urban continuum code
MedianAge2010: median age of county in 2010
X..Fair.or.Poor.Health: percentage of adults reporting fair or poor health (age-adjusted) (2017)
SVIPercentile: the county’s overall percentile ranking indicating the CDC’s Social Vulnerability Index (SVI); higher ranking indicates greater social vulnerability
X..Uninsured: percentage of population under age 65 without health insurance (2017)
X..Unemployed: percentage of population ages 16 and older unemployed but seeking work (2018)
X..Children.in.Poverty: percentage of people under age 18 in poverty (2018)

Check for collinearity. Resolve any collinearity issues you find to arrive at a model where all VIFs are below 2.5. Compare the regression coefficients and their standard errors in the full model and the final model. Did any change by a meaningful amount?

Using the final model from the previous exercise, determine if there are any outliers using an outlier test, and then visualize the outliers. Based on the results of the outlier test, examine the significant outliers by looking at their values of X..Adults.with.Obesity, DiabetesPercentage, and population (pop.usafacts). Compare their obesity % and diabetes % to the distribution of these variables in all counties in this dataset. What is unusual about these counties with respect to their obesity % and % of adults with diabetes (what makes them outliers in our model)?
Using the final model from the previous exercise, look for influential observations.

For Exercises 44 to 46, use our 2018 Natality teaching dataset (natality2018_rmph.Rdata, see Appendix A.3).

Using our natality teaching dataset, fit multiple models, each one testing the association between the outcome birthweight (g) (DBWT) and a single risk factor, and each adjusted for Mother’s Age (MAGER), Mother’s Education (MEDUC), Marital Status (DMAR), and Total Birth Order (TBO_REC). Fit an adjusted model for each of the following risk factors, and then adjust for multiple testing using the Bonferroni and Hommel methods to obtain adjusted p-values. Which were statistically significant before adjusting for multiple testing? After adjusting for multiple testing? Was the Bonferroni adjustment too conservative for any of the risk factors? NOTE: The need for multiple testing here assumes that the researcher is screening for “any risk factor” and will conclude they have found an association if any of them are significant.

RF_GDIAB: Gestational Diabetes
RF_PHYPE: Pre-pregnancy Hypertension
RF_GHYPE: Gestational Hypertension
RF_PPTERM: Previous Preterm Birth
RF_INFTR: Infertility Treatment Used
RF_CESAR: Previous Cesarean

Regress the outcome birthweight (g) (DBWT) on the following predictors: MRACEHISP (mother’s race/ethnicity), risks (risk factors reported), MAGER (mother’s age), PRIORLIVE (prior births now living), PRIORDEAD (prior births now dead), DMAR (marital status), MEDUC (mother’s education), and BMI (mother’s body mass index, kg/m²). Compute the p-values for each of the six pairwise comparisons of estimated birthweight between the four levels of MRACEHISP and use the Hommel method to adjust for multiple comparisons. Which comparisons were significant before adjusting for multiple comparisons? After? Hint: You will have to re-level the predictor in two different ways to get all the comparisons.
Referring back to the model in the previous exercise, if mother’s race/ethnicity is the primary predictor of interest, does it make sense to adjust for all the other variables in the model? Think about each other predictor and evaluate whether it is a confounder or a mediator of racial disparities in birthweight. Remove the mediators and re-fit the model. Compare the magnitude of the estimated racial disparities before and after removing the mediators. Explain the difference.

Exercises 47 and 48 refer back to Exercises 41-43. The following code creates the final model that was fit in Exercise 41.

fit1 <- lm(DiabetesPercentage ~ X..Adults.with.Obesity +
             Rural.UrbanContinuumCode2013 +
            MedianAge2010 + X..Fair.or.Poor.Health + X..Uninsured + 
            X..Unemployed, data = covid)

Do a sensitivity analysis for fit1 to assess the robustness of model conclusions to the presence of outliers (which you looked for in Exercise 42).
Do a sensitivity analysis for fit1 to assess the robustness of model conclusions to the presence of influential observations (which you looked for in Exercise 43).
Suppose you have a dataset with $n = 200$ observations. In order to avoid overfitting, what is the maximum number of predictors you should include in a regression model?
You are designing a study for which you wish to fit a regression model with eight predictors. What is the minimum sample size you need to avoid overfitting?
Using the NHANES fasting subset teaching dataset (nhanes1718_adult_fast_sub_rmph.Rdata, see Appendix A.1), regress log systolic blood pressure (log(sbp)) on log body mass index (log(BMXBMI) and age (RIDAGEYR).

Suppose you want to predict the log SBP for an individual with a BMI of 70 kg/m². Would that be a valid prediction? Why or why not?
Suppose you want to predict the log SBP for an individual who is age 25 years. Would that be a valid prediction? Why or why not?
Suppose you want to predict the log SBP for an individual with a BMI of 55 kg/m² and an age of 70 years. Would that be a valid prediction? Why or why not?