Visualizing statistical models

Sources: Original material; Wickham (2010)

1 Data

  • Here we will work with the ‘famous’ swiss data set (?swiss)
kable(head(swiss))
Fertility Agriculture Examination Education Catholic Infant.Mortality
Courtelary 80.2 17.0 15 12 9.96 22.2
Delemont 83.1 45.1 6 9 84.84 22.2
Franches-Mnt 92.5 39.7 5 5 93.40 20.2
Moutier 85.8 36.5 12 7 33.77 20.3
Neuveville 76.9 43.5 17 15 5.16 20.6
Porrentruy 76.1 35.3 9 7 90.57 26.6

2 Packages & functions

  • Objective: Summarize results in graphs instead of unreadable tables
  • Modelling results: Stored in model objects (Extract with modelname$...)
  • Approach 1: Estimate models, extract estimates/predictions from the model objects using the broom package, and visualize them (I prefer this one! A short preview sketch follows this list.)
    • broom::tidy(...): Use the tidy function to extract estimates etc.
    • conf.int = TRUE, conf.level = 0.95: Add arguments to calculate confidence intervals
    • broom::augment(...): Generate & extract predictions
  • Approach 2: Use plotting functions that directly plot estimates from model objects
  • Additional functions
    • geom_linerange(): Plot a line for an interval
    • coord_flip(): Flip a plot
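
As a quick preview of Approach 1, here is a minimal sketch using a smaller toy model than the ones below (fit_preview is just an illustrative name; the same functions and data are used throughout this page):

library(broom)
fit_preview <- lm(Fertility ~ Education, data = swiss)  # toy model, see ?swiss
tidy(fit_preview, conf.int = TRUE, conf.level = 0.95)   # estimates + confidence intervals
head(augment(fit_preview))                              # original data plus predictions (.fitted), residuals etc.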

3 Extracting coefficients

library(broom)
# Simple linear regression
fit <- lm(Fertility ~ Catholic + Agriculture + Education, data = swiss) # see ?swiss
# fit$... 
coef(fit)
(Intercept)    Catholic Agriculture   Education 
 86.2250198   0.1452013  -0.2030377  -1.0721468 
kable(tidy(fit, conf.int = TRUE, conf.level = 0.95))
term estimate std.error statistic p.value conf.low conf.high
(Intercept) 86.2250198 4.7347179 18.211226 0.0000000 76.6765511 95.7734885
Catholic 0.1452013 0.0301466 4.816504 0.0000184 0.0844049 0.2059978
Agriculture -0.2030377 0.0711516 -2.853591 0.0066235 -0.3465286 -0.0595467
Education -1.0721468 0.1558018 -6.881477 0.0000000 -1.3863512 -0.7579425
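
The commented-out fit$... above hints that all of these quantities also live inside the model object itself. A few equivalent base-R extractions (an optional aside; tidy() is usually the more convenient route):

names(fit)                  # components stored in the lm object
fit$coefficients            # same as coef(fit)
summary(fit)$coefficients   # estimates, SEs, t-statistics and p-values as a matrix
confint(fit, level = 0.95)  # confidence intervals, cf. tidy(fit, conf.int = TRUE)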

4 Coefficient plots

  • Here we will work with the ‘famous’ swiss data set (?swiss)
  • Figure 1 below provides a simple coefficient plot.
  • Questions:
    • What does the graph show? What are the underlying variables (and data)?
    • How many scales/mappings does it use? Could we reduce them?
    • What do you like, what do you dislike about the figure? What is good, what is bad?
    • What kind of information could we add to the graph (if any)?
    • How would you approach a replication of the graph?
Figure 1: Coefficient plot

4.1 Lab: Data & code

  • Learning objectives
    • How to plot the results of statistical models
    • How to use the broom package/tidy function to extract coefficients
    • How to plot different confidence intervals

We start by preparing the data, i.e., estimating the model and extracting the coefficients of interest. Importantly, the table you see below is all you need as a basis for the ggplot functions.

library(broom)
fit <- lm(Fertility ~ Catholic + Agriculture + Education, data = swiss) # see ?swiss
data_coefs <- tidy(fit, conf.int = TRUE, conf.level = 0.95)

data_coefs <- data_coefs %>%
           rename(Variable = term,
                  Coefficient = estimate,
                  SE = std.error) %>%
           filter(Variable != "(Intercept)")

data_coefs <- data_coefs %>% select(-SE, 
                              -statistic,
                              -p.value)

data_coefs %>% kable("html") %>% kable_styling(font_size = 10)
Variable Coefficient conf.low conf.high
Catholic 0.1452013 0.0844049 0.2059978
Agriculture -0.2030377 -0.3465286 -0.0595467
Education -1.0721468 -1.3863512 -0.7579425



Subsequently, we feed those estimates into the ggplot function.

  • geom_point(), geom_linerange(): Plot estimates and intervals
  • geom_hline(): Plot 0 line
  • coord_flip(): Flip the plot
ggplot(data_coefs, aes(x = Variable, y = Coefficient)) +
        geom_hline(yintercept = 0, colour = gray(1/2), lty = 2) +
        geom_point(aes(x = Variable, 
                    y = Coefficient)) + 
        geom_linerange(aes(x = Variable, 
                     ymin = conf.low,
                     ymax = conf.high),
                   lwd = 0.5) +
        ggtitle("Outcome: Fertility") +
        coord_flip()

4.2 Exercise

  1. Use the code from above but change the model for which you generate the coefficient plot.
    • Visualize a coefficient plot in which the outcome variable is Infant.Mortality instead of Fertility.
    • And add an additional explanatory variable, namely Examination.
  2. Try to reduce the ink that is used in the plot above.
  3. Change the confidence level of the confidence intervals to 80%.
Exercise solution
library(broom)
fit <- lm(Infant.Mortality ~ Catholic + Agriculture + Education + Examination, data = swiss) # see ?swiss
data_coefs <- tidy(fit, conf.int = TRUE, conf.level = 0.80)

data_coefs <- data_coefs %>%
           rename(Variable = term,
                  Coefficient = estimate,
                  SE = std.error) %>%
           filter(Variable != "(Intercept)")

ggplot(data_coefs, aes(x = Variable, y = Coefficient)) +
        geom_hline(yintercept = 0, colour = gray(1/2), lty = 2) +
        geom_point(aes(x = Variable, 
                    y = Coefficient)) + 
        geom_linerange(aes(x = Variable, 
                     ymin = conf.low,
                     ymax = conf.high),
                   lwd = 0.5) +
        ggtitle("Outcome: Fertility") +
        coord_flip() +
                theme_light()
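
One way to go further with point 2 of the exercise (an optional variation, not part of the original solution) is to use a minimal theme and drop the minor grid lines:

ggplot(data_coefs, aes(x = Variable, y = Coefficient)) +
        geom_hline(yintercept = 0, colour = gray(1/2), lty = 2) +
        geom_point() +
        geom_linerange(aes(ymin = conf.low, ymax = conf.high), lwd = 0.5) +
        ggtitle("Outcome: Infant.Mortality") +
        coord_flip() +
        theme_minimal() +
        theme(panel.grid.minor = element_blank())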

5 Coefficient plots with facetting & nesting

  • Figure 2 below provides a coefficient plot for subsets/facets of the data.
  • Questions:
    • What does the graph show? What are the underlying variables (and data)?
    • How many scales/mappings does it use? Could we reduce them?
    • What do you like, what do you dislike about the figure? What is good, what is bad?
    • What kind of information could we add to the graph (if any)?
    • How would you approach a replication of the graph?
Figure 2: Coefficient plot: Facetting

5.1 Lab: Data & code

  • Learning objectives
    • How to estimate models in data subsets using nest() and map() and work with nested dataframes
    • How to plot the results of statistical models across data subsets (using faceting)
    • How to use the broom package/tidy function to extract coefficients

Data preparations are somewhat more complicated when we want to show facets. We work with nested dataframes, as well as model results that are nested in a dataframe.

We proceed in several steps:

  1. We split the dataset into subsets according to values of Examination_cat using the nest() function.
  2. We estimate the linear models in those subsets (see map(data, ~ lm(...))).
  3. We tidy the estimates so as to obtain a nice dataframe per subset, requesting confidence intervals at the same time (see map(fit, ~ tidy(.x, conf.int = TRUE, conf.level = 0.99))).
  4. We unnest() the data, obtaining one dataframe that contains the estimates and intervals across all subsets.
  5. We rename the variables (see rename(...)).
  6. Finally, we filter out the intercepts and keep only the columns needed for plotting; the result is shown in Table 1.
# Create categorical examination variable
swiss <- swiss %>%
  mutate(Examination_cat = cut(Examination,
    breaks = quantile(Examination, probs = seq(0, 1, 0.25)),
    labels = c("lowest", "lower", "higher", "highest")
  ))
# table(swiss$Examination, swiss$Examination_cat)
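
A quick aside on the cut() call above: with breaks taken from quantile(), the minimum value falls outside the first interval (intervals are open on the left) and becomes NA, which is why we filter(!is.na(Examination_cat)) below. A minimal illustration with a toy vector (not the swiss data):

x <- c(1, 2, 3, 4)
cut(x, breaks = quantile(x))                         # first element becomes NA
cut(x, breaks = quantile(x), include.lowest = TRUE)  # include.lowest keeps it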



data_plot <- swiss %>%
  filter(!is.na(Examination_cat)) %>%
  nest(data = c(Fertility, Agriculture, Examination, Education, Catholic, Infant.Mortality)) %>%
  mutate(
    fit = map(data, ~ lm(Fertility ~ Catholic + Agriculture + Education, data = .))) %>%
    mutate(data_coefs = map(fit, ~tidy(.x, conf.int = TRUE, conf.level = 0.99))) %>%
  unnest(c(data_coefs)) %>%
  rename(
    Variable = term,
    Coefficient = estimate,
    SE = std.error
  ) %>%
  filter(Variable != "(Intercept)") %>%
  select(
    Examination_cat,
    Variable,
    Coefficient,
    conf.low:conf.high
  )

data_plot %>% 
    kable("html") %>%
    kable_styling(font_size = 11)
Table 1: Dataframe with coefficients
Examination_cat Variable Coefficient conf.low conf.high
lower Catholic 0.2533255 0.1141128 0.3925381
lower Agriculture -0.3935888 -0.8181912 0.0310136
lower Education -1.6441445 -2.5703935 -0.7178955
lowest Catholic 0.0880331 -0.0728588 0.2489250
lowest Agriculture -0.4003087 -0.7160134 -0.0846040
lowest Education -1.5787587 -4.1066775 0.9491602
higher Catholic -0.3684091 -1.0056639 0.2688456
higher Agriculture -0.1691309 -0.6650066 0.3267448
higher Education -0.3300658 -2.0818956 1.4217641
highest Catholic -0.0508162 -1.8561544 1.7545221
highest Agriculture 0.2351947 -1.0774665 1.5478558
highest Education -0.5220767 -1.9539828 0.9098295



Plotting the data is straightforward again. We use the same code as above but now specify the facetting with facet_grid(Examination_cat ~ .) to get Figure 2.

# GGPLOT
ggplot(
  data_plot,
  aes(x = Variable, y = Coefficient)
) +
  geom_hline(yintercept = 0, colour = gray(1 / 2), lty = 2) +
  geom_point(aes(
    x = Variable,
    y = Coefficient
  )) +
  geom_linerange(
    aes(
      x = Variable,
      ymin = conf.low,
      ymax = conf.high
    ),
    lwd = 0.5
  ) +
  ggtitle("Outcome: Fertility (Subsets: Examination)") +
  coord_flip() +
  facet_grid(Examination_cat ~ .)

5.2 Exercise

This exercise only concerns creating the dataframe of coefficients relying on the nesting approach. Please use the code below but adapt as follows.

  • Instead of estimating the model across subsets of Examination_cat, generate an equivalent grouping variable called Education_cat (use copy/replace).
  • Afterwards, the estimated models should only include the independent variables Catholic and Agriculture, and they should be estimated for subsets of Education_cat.
  • Finally, please estimate 95% confidence intervals (conf.level = 0.95).
    • Tip: Keep only the necessary variables before nesting: select(Education_cat, Fertility, Agriculture, Catholic)
# Create categorical examination variable
swiss <- swiss %>%
  mutate(Examination_cat = cut(Examination,
    breaks = quantile(Examination, probs = seq(0, 1, 0.25)),
    labels = c("lowest", "lower", "higher", "highest")
  ))
# table(swiss$Examination, swiss$Examination_cat)



data_plot <- swiss %>%
  filter(!is.na(Examination_cat)) %>%
  nest(data = c(Fertility, Agriculture, Examination, Education, Catholic, Infant.Mortality)) %>%
  mutate(
    fit = map(data, ~ lm(Fertility ~ Catholic + Agriculture + Education, data = .))) %>%
    mutate(data_coefs = map(fit, ~tidy(.x, conf.int = TRUE, conf.level = 0.99))) %>%
  unnest(c(data_coefs)) %>%
  rename(
    Variable = term,
    Coefficient = estimate,
    SE = std.error
  ) %>%
  filter(Variable != "(Intercept)") %>%
  select(
    Examination_cat,
    Variable,
    Coefficient,
    conf.low:conf.high
  )

data_plot %>% 
    kable("html") %>%
    kable_styling(font_size = 11)
Table 2: Dataframe with coefficients
Examination_cat Variable Coefficient conf.low conf.high
lower Catholic 0.2533255 0.1141128 0.3925381
lower Agriculture -0.3935888 -0.8181912 0.0310136
lower Education -1.6441445 -2.5703935 -0.7178955
lowest Catholic 0.0880331 -0.0728588 0.2489250
lowest Agriculture -0.4003087 -0.7160134 -0.0846040
lowest Education -1.5787587 -4.1066775 0.9491602
higher Catholic -0.3684091 -1.0056639 0.2688456
higher Agriculture -0.1691309 -0.6650066 0.3267448
higher Education -0.3300658 -2.0818956 1.4217641
highest Catholic -0.0508162 -1.8561544 1.7545221
highest Agriculture 0.2351947 -1.0774665 1.5478558
highest Education -0.5220767 -1.9539828 0.9098295
Exercise solution
# Create categorical Education variable
swiss <- swiss %>%
  mutate(Education_cat = cut(Education,
    breaks = quantile(Education, probs = seq(0, 1, 0.25)),
    labels = c("lowest", "lower", "higher", "highest")
  ))
# table(swiss$Education, swiss$Education_cat)



data_plot <- swiss %>%
  filter(!is.na(Education_cat)) %>%
    select(Education_cat, Fertility, Agriculture, Catholic) %>%
  nest(data = c(Fertility, Agriculture, Catholic)) %>%
  mutate(
    fit = map(data, ~ lm(Fertility ~ Catholic + Agriculture, data = .))) %>%
    mutate(data_coefs = map(fit, ~tidy(.x, conf.int = TRUE, conf.level = 0.95))) %>%
  unnest(c(data_coefs)) %>%
  rename(
    Variable = term,
    Coefficient = estimate,
    SE = std.error
  ) %>%
  filter(Variable != "(Intercept)") %>%
  select(
    Education_cat,
    Variable,
    Coefficient,
    conf.low:conf.high
  )

data_plot %>% 
    kable("html") %>%
    kable_styling(font_size = 11)
Table 3: Dataframe with coefficients
Education_cat Variable Coefficient conf.low conf.high
higher Catholic 0.0976509 -0.0561241 0.2514258
higher Agriculture -0.1807685 -0.4235505 0.0620134
lowest Catholic 0.2061996 0.0904318 0.3219674
lowest Agriculture -0.2749726 -0.6365829 0.0866378
lower Catholic 0.1385127 0.0218421 0.2551832
lower Agriculture 0.0426792 -0.2608973 0.3462556
highest Catholic -0.1181184 -0.5143699 0.2781332
highest Agriculture 0.6355401 -0.1043334 1.3754137

6 Coefficient plots: Coloring

  • Figure 3 below provides a coefficient plot and colors estimations for subsets.
Figure 3: Coefficient plot: Coloring

6.1 Lab: Data & code

The code to estimate the models and extract the coefficients is the same as in Section 5.

Plotting the data to obtain Figure 3 is straightforward again. We use the same code as above but now instead of facetting we simply specify colour = Examination_cat within aes().

  # GGPLOT
   ggplot(data_plot, aes(x = Variable, y = Coefficient, colour = Examination_cat)) +
        geom_hline(yintercept = 0, colour = gray(1/2), lty = 2) +
        geom_point(aes(x = Variable, 
                    y = Coefficient), position=position_dodge(width=0.3)) + 
        geom_linerange(aes(x = Variable, 
                     ymin = conf.low,
                     ymax = conf.high),
                   lwd = 1, position=position_dodge(width=0.3)) + 
        ggtitle("Outcome: Fertility (Subsets: Examination)") +
        coord_flip()
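
An optional cosmetic tweak (not part of the original figure): give the colour legend a nicer title than the raw variable name.

last_plot() +   # reuses the plot created just above
  labs(colour = "Examination (quartiles)")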

7 Coefficient plots with facetting and coloring (skip!)

  • Figure 4 below provides a combination of facetting and coloring since models have been estimated in subsets of the data defined by two variables: Infant.Mortality_cat and Examination_cat.
  • Questions:
    • What does the graph show? What are the underlying variables (and data)?
    • How many scales/mappings does it use? Could we reduce them?
    • What do you like, what do you dislike about the figure? What is good, what is bad?
    • What kind of information could we add to the graph (if any)?
    • How would you approach a replication of the graph?
Figure 4: Coefficient plot: Facetting and Coloring

7.1 Lab: Data & code

Data preparations are somewhat more complicated when we want to show facets and colors, i.e., in other words we want to produce coefficient plots for subsets defined by two variables. Again we work with nested dataframes, as well as model results that are nested in a dataframe.

We proceed in several steps:

  1. We split the dataset into subsets defined by Examination_cat AND Infant.Mortality_cat using the nest() function.
  2. We estimate the linear models in those subsets (see map(data, ~ lm(...))).
  3. We tidy the estimates so as to obtain a nice dataframe per subset, requesting 95% confidence intervals at the same time (see map(fit, ~ tidy(.x, conf.int = TRUE, conf.level = 0.95))).
  4. We unnest() the data, obtaining one dataframe that contains the estimates and intervals across all subsets defined by Examination_cat AND Infant.Mortality_cat.
    • If some subsets contain too few observations, their results would have to be dropped at this point, e.g., with filter(!(Examination_cat == "higher" & Infant.Mortality_cat == "high")).
  5. Finally, we filter out the intercepts and keep only the columns needed for plotting.
swiss <- swiss %>% 
         mutate(Infant.Mortality_cat = cut(Infant.Mortality, 
                                  breaks = quantile(Infant.Mortality, probs = seq(0, 1, 0.5)),
                                  labels = c("low", "high")))

swiss <- swiss %>% 
         mutate(Examination_cat = cut(Examination, 
                                  breaks = quantile(Examination, 
                                                    probs = seq(0, 1, 0.5)),
                                  labels = c("low", "high")))


data_plot <- swiss %>% 
  filter(!is.na(Infant.Mortality_cat), !is.na(Examination_cat)) %>%
  select(Examination_cat, Infant.Mortality_cat, Fertility, Agriculture, Education, Catholic) %>%
  nest(data = c(Fertility, Agriculture, Education, Catholic)) %>% 
  mutate(fit = map(data, ~ lm(Fertility ~ Catholic + Agriculture + Education, data = .)),
         data_coefs = map(fit, ~tidy(.x, conf.int = TRUE, conf.level = 0.95))) %>%
  unnest(c(data_coefs)) %>%
  rename(Variable = term,
                  Coefficient = estimate,
                  SE = std.error) %>%
  filter(Variable != "(Intercept)")  %>%
  select(
    Examination_cat,
    Infant.Mortality_cat,
    Variable,
    Coefficient,
    conf.low:conf.high
  )

data_plot %>% 
    kable("html") %>%
    kable_styling(font_size = 11)
Examination_cat Infant.Mortality_cat Variable Coefficient conf.low conf.high
low high Catholic 0.1393931 0.0482192 0.2305670
low high Agriculture -0.0540658 -0.3582447 0.2501132
low high Education 0.2850385 -1.1783226 1.7483997
high high Catholic -0.4467226 -1.3331392 0.4396939
high high Agriculture 0.0710211 -0.3069120 0.4489542
high high Education -0.2726475 -1.1437719 0.5984770
high low Catholic -0.3046756 -0.6845060 0.0751548
high low Agriculture -0.1800822 -0.3958094 0.0356451
high low Education -0.5728250 -1.0978119 -0.0478381
low low Catholic 0.1562665 0.0172678 0.2952652
low low Agriculture -0.5300158 -1.0041751 -0.0558564
low low Education -1.6249107 -2.3813772 -0.8684442



Plotting the data is straightforward again. We use the same code as above but now we combine facetting and coloring:

  • facet_grid(Examination_cat ~ .)
  • colour = Infant.Mortality_cat within aes()

 ggplot(data_plot,
        aes(x = Variable, y = Coefficient, colour = Infant.Mortality_cat)) +
        geom_hline(yintercept = 0, colour = gray(1/2), lty = 2) +
        geom_point(aes(x = Variable, 
                    y = Coefficient), position=position_dodge(width=0.3)) + 
        geom_linerange(aes(x = Variable, 
                     ymin = conf.low,
                     ymax = conf.high),
                   lwd = 0.5, position=position_dodge(width=0.3)) +
        ggtitle("Outcome: Fertility (Subsets: Infant.Mortality, Examination)") +
        coord_flip() +
  facet_grid(Examination_cat ~ .) 

8 Predicted values

  • Figure 5 plots the variable Fertility (the actual values) against the values predicted by the model (predicted values).
  • Questions:
    • What does the graph show? What are the underlying variables (and data)?
    • How many scales/mappings does it use? Could we reduce them?
    • What do you like, what do you dislike about the figure? What is good, what is bad?
    • What kind of information could we add to the graph (if any)?
    • How would you approach a replication of the graph?
Figure 5: Predicted values vs. real values plot

8.1 Lab: Data & code

The underlying functions are very simple. First, we fit the model:

fit <- lm(Fertility ~ Catholic + Agriculture + Education, data = swiss)



Then we augment the original data with the predicted values for each observed unit (they are called .fitted in the dataframe below).

data_predicted <- augment(fit)
data_predicted %>% kable("html") %>% kable_styling(font_size = 10)
.rownames Fertility Catholic Agriculture Education .fitted .resid .hat .sigma .cooksd .std.resid
Courtelary 80.2 9.96 17.0 12 71.35382 8.8461774 0.0950817 7.686424 0.0380384 1.2033653
Delemont 83.1 84.84 45.1 9 79.73758 3.3624191 0.0660897 7.800760 0.0035864 0.4502418
Franches-Mnt 92.5 93.40 39.7 5 86.36550 6.1345049 0.1266332 7.753333 0.0261545 0.8494302
Moutier 85.8 33.77 36.5 7 76.21257 9.5874340 0.0552259 7.669656 0.0238081 1.2763945
Neuveville 76.9 5.16 43.5 15 62.05992 14.8400828 0.0410200 7.461386 0.0411227 1.9610021
Porrentruy 76.1 90.57 35.3 7 84.70365 -8.6036475 0.1256133 7.689243 0.0509128 -1.1906315
Broye 83.8 92.85 70.2 7 77.94869 5.8513084 0.0581356 7.763671 0.0093930 0.7801989
Glane 92.4 97.16 67.8 8 77.98965 14.4103471 0.0617821 7.474642 0.0610151 1.9251701
Gruyere 82.4 97.67 53.3 7 82.07990 0.3201012 0.0753228 7.819045 0.0000378 0.0430763
Sarine 82.9 91.38 45.2 13 76.37831 6.5216935 0.0667781 7.749513 0.0136527 0.8736037
Veveyse 87.1 98.61 64.5 6 81.01451 6.0854871 0.0653090 7.758656 0.0115894 0.8145314
Aigle 64.1 8.52 62.0 12 62.00804 2.0919628 0.0628997 7.812100 0.0013123 0.2796453
Aubonne 66.9 2.27 67.5 7 65.34456 1.5554442 0.0740646 7.815234 0.0008750 0.2091754
Avenches 68.9 4.43 60.7 12 61.67811 7.2218873 0.0646019 7.733856 0.0161208 0.9662711
Cossonay 61.7 2.82 69.3 5 67.20324 -5.5032423 0.0763968 7.769129 0.0113547 -0.7410072
Echallens 68.3 24.20 72.6 2 72.85406 -4.5540632 0.0596883 7.785561 0.0058611 -0.6077286
Grandson 71.7 3.30 34.0 8 71.22373 0.4762715 0.0613355 7.818845 0.0000661 0.0636130
Lausanne 55.7 12.11 19.4 28 54.02437 1.6756342 0.0939560 7.814494 0.0013453 0.2277986
La Vallee 54.3 2.15 15.2 20 62.00809 -7.7080933 0.0767596 7.720612 0.0223991 -1.0380926
Lavaux 65.1 2.84 73.0 9 62.16632 2.9336804 0.0997879 7.804644 0.0044366 0.4001169
Morges 65.5 5.23 59.8 10 64.12130 1.3786987 0.0548483 7.816151 0.0004886 0.1835123
Moudon 65.0 4.52 55.1 3 72.47751 -7.4775133 0.0587956 7.728238 0.0155355 -0.9973825
Nyone 56.6 15.14 50.9 12 65.22299 -8.6229883 0.0328423 7.701273 0.0109292 -1.1346337
Orbe 57.4 4.20 54.1 6 69.41765 -12.0176461 0.0484880 7.584603 0.0323800 -1.5942587
Oron 72.5 2.40 71.2 1 71.04507 1.4549265 0.0857012 7.815688 0.0009085 0.1968990
Payerne 74.2 5.23 58.1 8 66.61076 7.5892409 0.0491488 7.726439 0.0131074 1.0071371
Paysd'enhaut 72.0 2.56 63.5 3 70.48740 1.5125978 0.0666287 7.815480 0.0007325 0.2026016
Rolle 60.5 7.72 60.8 10 64.27981 -3.7798150 0.0539845 7.796186 0.0036078 -0.5028841
Vevey 58.3 18.46 26.8 19 63.09324 -4.7932370 0.0468407 7.782428 0.0049589 -0.6353202
Yverdon 65.4 6.10 49.5 8 68.48321 -3.0832083 0.0409484 7.804108 0.0017717 -0.4074069
Conthey 75.5 99.71 85.9 2 81.11782 -5.6178154 0.0895095 7.766260 0.0142655 -0.7618619
Entremont 69.3 99.68 84.9 6 77.02791 -7.7279097 0.0917917 7.718450 0.0278221 -1.0493391
Herens 77.3 100.00 89.7 2 80.38838 -3.0883806 0.0992351 7.803075 0.0048836 -0.4210868
Martigwy 70.5 98.96 78.2 6 78.28372 -7.7837172 0.0750444 7.718843 0.0222476 -1.0473049
Monthey 79.4 98.22 64.9 3 84.09311 -4.6931098 0.0741677 7.782909 0.0079782 -0.6311623
St Maurice 65.0 99.06 75.9 9 75.54878 -10.5487835 0.0768672 7.633481 0.0420195 -1.4207472
Sierre 92.2 99.46 84.6 3 80.27332 11.9266828 0.0863920 7.578458 0.0616348 1.6146792
Sion 79.3 96.83 63.1 13 73.53528 5.7647206 0.0667967 7.764807 0.0106707 0.7722122
Boudry 70.4 5.62 38.4 12 66.37864 4.0213575 0.0390443 7.793550 0.0028624 0.5308447
La Chauxdfnd 65.7 13.79 7.7 11 74.87034 -9.1703410 0.1552742 7.666145 0.0766076 -1.2911424
Le Locle 72.7 11.22 16.7 13 70.52554 2.1744592 0.0903590 7.811295 0.0021616 0.2950277
Neuchatel 64.4 16.92 17.6 32 50.79966 13.6003353 0.1265242 7.489869 0.1284116 1.8830884
Val de Ruz 77.6 4.97 37.6 7 71.80743 5.7925740 0.0575751 7.764817 0.0091058 0.7721377
ValdeTravers 67.6 8.65 18.7 7 76.17918 -8.5791790 0.1278138 7.689659 0.0517707 -1.1887421
V. De Geneve 35.0 42.34 1.2 53 35.30542 -0.3054172 0.4543956 7.818953 0.0005961 -0.0535058
Rive Droite 44.7 50.43 46.6 29 52.99371 -8.2937095 0.1358999 7.697062 0.0524109 -1.1545515
Rive Gauche 42.8 58.33 27.7 29 57.97821 -15.1782122 0.1086014 7.415297 0.1318153 -2.0803249



Then we can plot the actual values of the outcome variable Fertility against the predictions made by our model (.fitted) in Figure 6. Importantly, the gray line in the figure below is not the regression line!

ggplot(data_predicted, aes(x = Fertility, y = .fitted)) +
    geom_abline(intercept = 0, slope = 1, alpha = .2) +  # Line of perfect fit
    geom_point() +
    labs(y = "Predicted Value") +
    theme_bw()
Figure 6: Predicted vs. actual outcome values
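
If you want a single number to accompany the figure: for an OLS model with an intercept, the squared correlation between actual and fitted values equals the model's R-squared, so a quick check (assuming the fit and data_predicted objects from above) is:

cor(data_predicted$Fertility, data_predicted$.fitted)^2  # squared correlation
summary(fit)$r.squared                                   # identical for OLS with intercept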

8.2 Exercise

Below you find the code that we used above to obtain and visualize predicted values for a model that predicts Fertility using the variables Catholic + Agriculture + Education.

  • Please adapt the code and predict the outcome Catholic using the variables Agriculture and Fertility.
fit <- lm(Fertility ~ Catholic + Agriculture + Education, data = swiss)

data_predicted <- augment(fit)

ggplot(data_predicted, aes(x = Fertility, y = .fitted)) +
    geom_abline(intercept = 0, slope = 1, alpha = .2) +  # Line of perfect fit
    geom_point() +
    labs(y = "Predicted Value") +
    theme_bw()
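
One possible adaptation (a sketch, not an official solution):

fit <- lm(Catholic ~ Agriculture + Fertility, data = swiss)

data_predicted <- augment(fit)

ggplot(data_predicted, aes(x = Catholic, y = .fitted)) +
    geom_abline(intercept = 0, slope = 1, alpha = .2) +  # Line of perfect fit
    geom_point() +
    labs(y = "Predicted Value") +
    theme_bw()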

9 Marginal effects

9.1 What are marginal effects?

  • ‘There is no common language across fields regarding a unique meaning of “marginal effects”. Thus, the wording throughout this package may vary.’ (Lüdecke on ggeffects package)
  • Marginal effect of a given variable: “the slope of the regression surface with respect to a given covariate […] the rate at which \(y\) changes at a given point in covariate space, with respect to one covariate dimension and holding all covariate values constant” (Leeper 2021, 7)
  • Linear regression: marginal effects correspond to values of the regression coefficients (\(\beta\) values)
  • Nonlinear regression models: marginal effects are not constant so different average effect indicators are used (MEMs, AMEs, MERs)
  • Marginal effects at representative values (MERs)
    • “calculate the marginal effect of each variable at a particular combination of \(X\) values that is theoretically interesting” (Leeper 2021, 7)
    • Pick particular, representative covariate values (e.g., John aged 40, middle-income etc.) and calculate effect for those values
  • Marginal Effects at Means (MEMs)
    • “calculate the marginal effects of each variable at the means of the covariates” (Leeper 2021, 7)
    • …the means may also correspond to particular individuals (that happen to have the mean value on all covariates)
  • Average Marginal Effects (AMEs) (cf. Leeper 2021):
    • “calculate marginal effects at every observed value of X and average across the resulting effect estimates” (Leeper 2021, 7) (see the small sketch after this list)
  • See Leeper’s margins package and the paper (Leeper 2021) as well as a blog post by Andrew Heiss
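
To make the linear vs. nonlinear contrast concrete, here is a minimal sketch with simulated toy data (not the COMPAS data used below): in a logit model the marginal effect of a continuous x at a given point is beta * p * (1 - p), and the AME simply averages this quantity over the observed data.

set.seed(1)
d <- data.frame(x = rnorm(500))
d$y <- rbinom(500, 1, plogis(-1 + 0.8 * d$x))  # simulated binary outcome
m <- glm(y ~ x, data = d, family = binomial)
b <- coef(m)["x"]                    # coefficient on the log-odds scale
p <- predict(m, type = "response")   # predicted probabilities
mean(b * p * (1 - p))                # Average Marginal Effect on the probability scale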

9.2 Marginal effects in R

9.3 Lab: Data & Code

Our lab is based on Lee, Du, and Guerzhoy (2020) and on James et al. (2013, chap. 4.6.2) with various modifications. We will be using the dataset at LINK that is described by Angwin et al. (2016). It is data based on the COMPAS risk assessment tool (RAT). RATs are increasingly being used to assess a criminal defendant’s probability of re-offending. While COMPAS seemingly uses a larger number of features/variables for the prediction, Dressel and Farid (2018) showed that a model that includes only a defendant’s sex, age, and number of priors (prior offences) can be used to arrive at predictions of equivalent quality.

We first import the data into R:

data <- read_csv(sprintf("https://docs.google.com/uc?id=%s&export=download",
                         "1plUxvMIoieEcCZXkBpj4Bxw1rDwa27tj"))
nrow(data)

We select a subset of variables and also order our variables differently.

data <- data %>% select(
  id, name,
  compas_screening_date,
  is_recid,
  age, priors_count,
  sex, race
)
kable(head(data))
id name compas_screening_date is_recid age priors_count sex race
1 miguel hernandez 2013-08-14 0 69 0 Male Other
3 kevon dixon 2013-01-27 1 34 0 Male African-American
4 ed philo 2013-04-14 1 24 4 Male African-American
5 marcu brown 2013-01-13 0 23 1 Male African-American
6 bouthy pierrelouis 2013-03-26 0 43 2 Male Other
7 marsha miles 2013-11-30 0 44 0 Male Other

Install and load the marginaleffects package:

# install.packages("marginaleffects")
library(marginaleffects)

Estimate a logistic regression model in which we predict recidivism.

fit <- glm(is_recid ~ priors_count + sex + race + age,
  data = data,
  family = binomial
)
# summary(fit)
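
A short aside on the scale (optional): the coefficients of a logistic regression are on the log-odds scale; exponentiating them gives odds ratios.

exp(coef(fit))                   # odds ratios
# tidy(fit, exponentiate = TRUE) # alternatively via broom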

We use comparisons() to obtain an estimate for each observation/individual/row in the dataset that we used to fit the model. Here we get 7214 estimates of the difference in the predicted probability of recidivating/reoffending between males and females.

cmp <- comparisons(fit,
  variables = list(sex = c("Male", "Female"))
)
kable(head(cmp %>% select(1:5)))
rowid term contrast estimate std.error
1 sex Female - Male -0.0234380 0.0048817
2 sex Female - Male -0.0800209 0.0143704
3 sex Female - Male -0.0825776 0.0156807
4 sex Female - Male -0.0876982 0.0162264
5 sex Female - Male -0.0687888 0.0125481
6 sex Female - Male -0.0565665 0.0105231

If we are interested in the MER we can proceed as follows:

cmp <- comparisons(fit,
  variables = list(sex = c("Male", "Female")),
  newdata = datagrid(age = 20, 
                                     race = "African-American")
)
kable(head(cmp %>% select(1:5)))
rowid term contrast estimate std.error
1 sex Female - Male -0.0802949 0.0153414

And we may be interested in the AME. By default, comparisons() estimates quantities for all the actually observed units in our dataset. If we want to marginalize (average) over those unit-level estimates in order to obtain an ‘average contrast’, we can use the code below:

av_comp <- avg_comparisons(fit, 
                                                     variables = list(sex = c("Male", "Female")))
kable(av_comp)
term contrast estimate std.error statistic p.value s.value conf.low conf.high
sex Female - Male -0.0751052 0.0137965 -5.443794 1e-07 24.19255 -0.1021458 -0.0480646
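
As a cross-check (a sketch using only predict(), assuming the data and fit objects from above and that dplyr/tidyverse is loaded as elsewhere on this page): the same average contrast can be computed by predicting everyone's probability once as Male and once as Female and averaging the differences.

p_male   <- predict(fit, newdata = mutate(data, sex = "Male"),   type = "response")
p_female <- predict(fit, newdata = mutate(data, sex = "Female"), type = "response")
mean(p_female - p_male)  # should closely match the avg_comparisons() estimate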

To visualize such comparisons we can use plot_comparisons(), which plots the effects on the y-axis (variables =) against values of one or more predictors on the x-axis (condition =). Below we see how the effect of age differs for males and females (the effect is more negative for males).

plot_comparisons(fit,
  variables = "age", # Y-axis (effect)
  condition = "sex" # groups
)

Check out the Titanic example in the vignette for more.

10 Useful graphs & resources

Below are some interesting graphs/resources with visualizations that you might want to check out:

11 Appendix: Several confidence intervals

11.1 Coefficient plots (several CIs)

  • Here we will work with the ‘famous’ swiss data set (?swiss)
  • Figure 7 below provides a simple coefficient plot.
  • Questions:
    • What does the graph show? What are the underlying variables (and data)?
    • How many scales/mappings does it use? Could we reduce them?
    • What do you like, what do you dislike about the figure? What is good, what is bad?
    • What kind of information could we add to the graph (if any)?
    • How would you approach a replication of the graph?
Figure 7: Coefficient plot

11.1.1 Lab: Data & code

  • Learning objectives
    • How to plot the results of statistical models
    • How to use the broom package/tidy function to extract coefficients
    • How to plot different confidence intervals

We start by preparing the data, i.e., estimating the model and extracting the coefficients of interest. Importantly, the table you see below is all you need as a basis for the ggplot functions.

library(broom)
fit <- lm(Fertility ~ Catholic + Agriculture + Education, data = swiss) # see ?swiss


data_coefs <- tidy(fit)



fit_cis_95 <- confint(fit, level = 0.95) %>% 
  data.frame() %>%
  rename("conf.low_95" = "X2.5..",
         "conf.high_95" = "X97.5..")
fit_cis_90 <- confint(fit, level = 0.90) %>% 
  data.frame() %>%
  rename("conf.low_90" = "X5..",
         "conf.high_90" = "X95..")

data_coefs <- bind_cols(data_coefs, 
                     fit_cis_95, 
                     fit_cis_90) %>%
           rename(Variable = term,
                  Coefficient = estimate,
                  SE = std.error) %>%
           filter(Variable != "(Intercept)")
data_coefs <- data_coefs %>% select(-SE, 
                              -statistic,
                              -p.value)

data_coefs %>% kable("html") %>% kable_styling(font_size = 10)
Variable Coefficient conf.low_95 conf.high_95 conf.low_90 conf.high_90
Catholic 0.1452013 0.0844049 0.2059978 0.0945227 0.1958800
Agriculture -0.2030377 -0.3465286 -0.0595467 -0.3226486 -0.0834268
Education -1.0721468 -1.3863512 -0.7579425 -1.3340607 -0.8102329
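
As an aside, the same intervals could also be obtained directly with the tidy() call used in Section 3; confint() is used here mainly because it makes it easy to collect two interval levels side by side.

tidy(fit, conf.int = TRUE, conf.level = 0.90)  # equivalent to confint(fit, level = 0.90)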



Subsequently, we feed those estimates into the ggplot function.

  • geom_point(), geom_linerange(): Plot estimates and intervals
  • geom_hline(): Plot 0 line
  • coord_flip(): Flip the plot
ggplot(data_coefs, 
       aes(x = Variable, y = Coefficient)) +
        geom_hline(yintercept = 0, 
                   colour = gray(1/2), lty = 2) +
        geom_point(aes(x = Variable, 
                    y = Coefficient)) + 
        geom_linerange(aes(x = Variable, 
                     ymin = conf.low_90,
                     ymax = conf.high_90),
                   lwd = 1) +
        geom_linerange(aes(x = Variable, 
                     ymin = conf.low_95,
                     ymax = conf.high_95),
                   lwd = 1/2) + 
        ggtitle("Outcome: Fertility") +
        coord_flip()

We can also add text (e.g., the coefficient estimates) to the graph. See below:

# With labels
data_coefs <- data_coefs %>%
    mutate(Model = paste0("Var ", 1:nrow(.))) # Q: ?

ggplot(data_coefs, 
       aes(x = Variable, y = Coefficient)) +
        geom_hline(yintercept = 0, 
                   colour = gray(1/2), lty = 2) +
        geom_point(aes(x = Variable, 
                    y = Coefficient)) + 
        geom_linerange(aes(x = Variable, 
                     ymin = conf.low_90,
                     ymax = conf.high_90),
                   lwd = 1) +
        geom_linerange(aes(x = Variable, 
                     ymin = conf.low_95,
                     ymax = conf.high_95),
                   lwd = 1/2) + 
  geom_text(aes(x = Variable, 
                y = Coefficient,
                label = round(Coefficient,2)),
            vjust = 2) +
  geom_text(aes(x = Variable, 
                y = conf.high_95,
                label = Model),
            vjust = 0,
            hjust =0) +
        ggtitle("Outcome: Fertility") +
        coord_flip()

# Without flipping

# ggplot(data_coefs, aes(y = Variable, x = Coefficient)) +
#         geom_hline(yintercept = 0, colour = gray(1/2), lty = 2) +
#         geom_point(aes(y = Variable, 
#                     x = Coefficient)) + 
#         geom_linerange(aes(y = Variable, 
#                      xmin = conf.low_90,
#                      xmax = conf.high_90),
#                    lwd = 1) +
#         geom_linerange(aes(y = Variable, 
#                      xmin = conf.low_95,
#                      xmax = conf.high_95),
#                    lwd = 1/2) + 
#         ggtitle("Outcome: Fertility") +
#         coord_flip()

11.1.2 Exercise

  1. Use the code from above but change the model for which you generate the coefficient plot.
    • Visualize a coefficient plot in which the outcome variable is Infant.Mortality instead of Fertility.
    • And add an additional explanatory variable, namely Examination.
  2. Try to reduce the ink that is used in the plot above.
  3. Try to add a third set of confidence intervals (80%) to the plot.

11.2 Graph: Coefficient plots with facetting & nesting (several CIs)

  • Figure 8 below provides a coefficient plot for subsets/facets of the data.
  • Questions:
    • What does the graph show? What are the underlying variables (and data)?
    • How many scales/mappings does it use? Could we reduce them?
    • What do you like, what do you dislike about the figure? What is good, what is bad?
    • What kind of information could we add to the graph (if any)?
    • How would you approach a replication of the graph?
Figure 8: Coefficient plot: Facetting

11.2.1 Lab: Data & code (several CIs)

  • Learning objectives
    • How to plot the results of statistical models across data subsets (using faceting)
    • How to categorize a continuous variable
    • Work with nested dataframes: How to estimate models in subsets using nest() and map()
    • How to use the broom package/tidy function to extract coefficients
    • How to plot different confidence intervals

Data preparations are somewhat more complicated when we want to show facets. We work with nested dataframes, as well as model results that are nested in a dataframe.

We proceed in several steps:

  1. We split the dataset into subsets according to values of Examination_cat using the nest() function.
  2. We estimate the linear models in those subsets (see map(data, ~ lm(...))).
  3. We tidy the estimates so as to obtain a nice dataframe per subset (see map(fit, tidy)).
  4. We estimate 90% and 95% confidence intervals, also obtaining dataframes (see map(fit, level = 0.90, confint) and the subsequent conversion with data.frame()).
  5. We unnest() the data, obtaining one dataframe that contains the estimates and intervals across all subsets.
  6. We rename the interval variables (see rename(...)).
  7. Finally, we filter out the intercepts; the result is shown in the table below.
# Create categorical examination variable
swiss <- swiss %>% 
         mutate(Examination_cat = cut(Examination, 
                                  breaks = quantile(Examination, 
                                                    probs = seq(0, 1, 0.25)),
                                  labels = c("lowest", "lower", 
                                             "higher", "highest")))



#table(swiss$Examination, swiss$Examination_cat)
data_plot <- swiss %>% 
  filter(!is.na(Examination_cat)) %>%
  nest(data = c(Fertility, Agriculture, Examination, Education, Catholic, Infant.Mortality)) %>% 
  mutate(fit = map(data, ~ lm(Fertility ~ Catholic + Agriculture + Education, data = .)),
         data_coefs = map(fit, tidy),
         data_coefs_90 = map(fit, level = 0.90, confint),
         data_coefs_95 = map(fit, level = 0.95, confint)) %>%
       mutate(data_coefs_90 = map(data_coefs_90, ~ data.frame(.)),
            data_coefs_95 = map(data_coefs_95, ~ data.frame(.))) %>%
  unnest(c(data_coefs, data_coefs_90, data_coefs_95)) %>%
  rename(Variable = term,
                  Coefficient = estimate,
                  SE = std.error) %>%
  filter(Variable != "(Intercept)") %>%
  rename("conf.low_95" = "X2.5..",
         "conf.high_95" = "X97.5..",
         "conf.low_90" = "X5..",
         "conf.high_90" = "X95..")

data_plot %>% 
    select(-data, -fit, -statistic, -p.value, -SE) %>% 
    kable("html") %>%
    kable_styling(font_size = 11)
Examination_cat Education_cat Infant.Mortality_cat Variable Coefficient conf.low_90 conf.high_90 conf.low_95 conf.high_95
lower higher high Catholic NA NA NA NA NA
lower higher high Agriculture NA NA NA NA NA
lower higher high Education NA NA NA NA NA
lowest higher high Catholic NA NA NA NA NA
lowest higher high Agriculture NA NA NA NA NA
lowest higher high Education NA NA NA NA NA
lowest lowest high Catholic -2.7178423 NaN NaN NaN NaN
lowest lowest high Agriculture NA NA NA NA NA
lowest lowest high Education NA NA NA NA NA
lowest lower high Catholic -0.1620300 NaN NaN NaN NaN
lowest lower high Agriculture 0.4139118 NaN NaN NaN NaN
lowest lower high Education NA NA NA NA NA
higher highest high Catholic NA NA NA NA NA
higher highest high Agriculture NA NA NA NA NA
higher highest high Education NA NA NA NA NA
lower lower high Catholic 0.0890572 NaN NaN NaN NaN
lower lower high Agriculture 1.0322651 NaN NaN NaN NaN
lower lower high Education 10.6935997 NaN NaN NaN NaN
lower highest high Catholic NA NA NA NA NA
lower highest high Agriculture NA NA NA NA NA
lower highest high Education NA NA NA NA NA
lower lowest high Catholic 0.2348815 NaN NaN NaN NaN
lower lowest high Agriculture NA NA NA NA NA
lower lowest high Education NA NA NA NA NA
higher higher low Catholic -1.3826949 NaN NaN NaN NaN
higher higher low Agriculture -0.1489586 NaN NaN NaN NaN
higher higher low Education 1.7383876 NaN NaN NaN NaN
lower lower low Catholic 3.9629630 NaN NaN NaN NaN
lower lower low Agriculture NA NA NA NA NA
lower lower low Education NA NA NA NA NA
higher higher high Catholic NA NA NA NA NA
higher higher high Agriculture NA NA NA NA NA
higher higher high Education NA NA NA NA NA
higher lowest low Catholic -3.1159420 NaN NaN NaN NaN
higher lowest low Agriculture NA NA NA NA NA
higher lowest low Education NA NA NA NA NA
higher lowest high Catholic NA NA NA NA NA
higher lowest high Agriculture NA NA NA NA NA
higher lowest high Education NA NA NA NA NA
higher lower low Catholic NA NA NA NA NA
higher lower low Agriculture NA NA NA NA NA
higher lower low Education NA NA NA NA NA
highest highest high Catholic 1.4686118 NaN NaN NaN NaN
highest highest high Agriculture -0.9088763 NaN NaN NaN NaN
highest highest high Education NA NA NA NA NA
highest highest NA Catholic NA NA NA NA NA
highest highest NA Agriculture NA NA NA NA NA
highest highest NA Education NA NA NA NA NA
lowest NA high Catholic NA NA NA NA NA
lowest NA high Agriculture NA NA NA NA NA
lowest NA high Education NA NA NA NA NA
lowest lowest low Catholic 0.0821520 NaN NaN NaN NaN
lowest lowest low Agriculture -0.1879328 NaN NaN NaN NaN
lowest lowest low Education -2.2189471 NaN NaN NaN NaN
lower higher low Catholic NA NA NA NA NA
lower higher low Agriculture NA NA NA NA NA
lower higher low Education NA NA NA NA NA
lowest higher low Catholic NA NA NA NA NA
lowest higher low Agriculture NA NA NA NA NA
lowest higher low Education NA NA NA NA NA
lower highest low Catholic 0.7456897 NaN NaN NaN NaN
lower highest low Agriculture NA NA NA NA NA
lower highest low Education NA NA NA NA NA
highest higher high Catholic -0.5752754 NaN NaN NaN NaN
highest higher high Agriculture NA NA NA NA NA
highest higher high Education NA NA NA NA NA
higher highest low Catholic -0.6346848 NaN NaN NaN NaN
higher highest low Agriculture NA NA NA NA NA
higher highest low Education NA NA NA NA NA
highest lower low Catholic NA NA NA NA NA
highest lower low Agriculture NA NA NA NA NA
highest lower low Education NA NA NA NA NA
highest highest low Catholic NA NA NA NA NA
highest highest low Agriculture NA NA NA NA NA
highest highest low Education NA NA NA NA NA



Plotting the data is straightforward again. We use the same code as above but now specify the facetting with facet_grid(Examination_cat ~ .).

  # GGPLOT
  ggplot(data_plot, aes(x = Variable, y = Coefficient)) +
        geom_hline(yintercept = 0, colour = gray(1/2), lty = 2) +
        geom_point(aes(x = Variable, 
                    y = Coefficient)) + 
        geom_linerange(aes(x = Variable, 
                     ymin = conf.low_90,
                     ymax = conf.high_90),
                   lwd = 1) +
        geom_linerange(aes(x = Variable, 
                     ymin = conf.low_95,
                     ymax = conf.high_95),
                   lwd = 1/2) + 
        ggtitle("Outcome: Fertility (Subsets: Examination)") +
        coord_flip() +
  facet_grid(Examination_cat ~ .) 

References

Angwin, Julia, Jeff Larson, Lauren Kirchner, and Surya Mattu. 2016. “Machine Bias.” https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing.
Dressel, Julia, and Hany Farid. 2018. “The Accuracy, Fairness, and Limits of Predicting Recidivism.” Sci Adv 4 (1): eaao5580.
James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. An Introduction to Statistical Learning: With Applications in R. Springer Texts in Statistics. Springer.
Lee, Claire S, Jeremy Du, and Michael Guerzhoy. 2020. “Auditing the COMPAS Recidivism Risk Assessment Tool: Predictive Modelling and Algorithmic Fairness in CS1.” In Proceedings of the 2020 ACM Conference on Innovation and Technology in Computer Science Education, 535–36. ITiCSE ’20. New York, NY, USA: Association for Computing Machinery.
Leeper, Thomas J. 2021. “Interpreting Regression Results Using Average Marginal Effects with R’s margins.” https://cran.hafro.is/web/packages/margins/vignettes/TechnicalDetails.pdf.
Wickham, Hadley. 2010. “A Layered Grammar of Graphics.” J. Comput. Graph. Stat. 19 (1): 3–28.