# Visualizing statistical models

• Learning outcomes: Learn how to…
• …efficiently extract model parameters, e.g., coefficients.
• …use coefficient plots.
• …use nested dataframes.
• (Update: Marginal/interaction plot)

Sources: Original material; Wickham (2010)

# 1 Data

• Here we will work with the ‘famous’ swiss data set (?swiss)
kable(head(swiss))
Fertility Agriculture Examination Education Catholic Infant.Mortality
Courtelary 80.2 17.0 15 12 9.96 22.2
Delemont 83.1 45.1 6 9 84.84 22.2
Franches-Mnt 92.5 39.7 5 5 93.40 20.2
Moutier 85.8 36.5 12 7 33.77 20.3
Neuveville 76.9 43.5 17 15 5.16 20.6
Porrentruy 76.1 35.3 9 7 90.57 26.6

• Modelling results: Stored in model objects (Extract with modelname$...) • Approach 1: Estimate models and extract estimates/predictions from model objects using broom package and visualize (I prefer this one!) • broom::tidy(...): Use the tidy function to extract estimates etc. • conf.int = TRUE, conf.level = 0.95: Add arguments to calculate confidence intervals • broom::augment(...): Generate & extract predictions • Approach 2: Use plotting functions that directly plot estimates from model objects • Additional functions • geom_linerange(): Plot a line for an interval • coord_flip(): Flip a plot # 3 Extracting coefficients library(broom) # Simple linear regression fit <- lm(Fertility ~ Catholic + Agriculture + Education, data = swiss) # see ?swiss # fit$...
coef(fit)
(Intercept)    Catholic Agriculture   Education
86.2250198   0.1452013  -0.2030377  -1.0721468 
library(broom)
# Simple linear regression
fit <- lm(Fertility ~ Catholic + Agriculture + Education, data = swiss) # see ?swiss
# fit$... coef(fit) (Intercept) Catholic Agriculture Education 86.2250198 0.1452013 -0.2030377 -1.0721468  kable(tidy(fit, conf.int = TRUE, conf.level = 0.95)) term estimate std.error statistic p.value conf.low conf.high (Intercept) 86.2250198 4.7347179 18.211226 0.0000000 76.6765511 95.7734885 Catholic 0.1452013 0.0301466 4.816504 0.0000184 0.0844049 0.2059978 Agriculture -0.2030377 0.0711516 -2.853591 0.0066235 -0.3465286 -0.0595467 Education -1.0721468 0.1558018 -6.881477 0.0000000 -1.3863512 -0.7579425 # 4 Coefficient plots • Here we will work with the ‘famous’ swiss data set (?swiss) • Figure 1 below provides a simple coefficient plot. • Questions: • What does the graph show? What are the underlying variables (and data)? • How many scales/mappings does it use? Could we reduce them? • What do you like, what do you dislike about the figure? What is good, what is bad? • What kind of information could we add to the graph (if any)? • How would you approach a replication of the graph? ## 4.1 Lab: Data & code • Learning objectives • How to plot the results of statistical models • How to use the broom package/tidy function to extract coefficients • How to plot different confidence intervals We start by preparing the data, i.e., estimating the model and extracting the coefficients of interest. Importantly, the table you see below is all you need as a basis for the ggplot functions. library(broom) fit <- lm(Fertility ~ Catholic + Agriculture + Education, data = swiss) # see ?swiss data_coefs <- tidy(fit, conf.int = TRUE, conf.level = 0.95) data_coefs <- data_coefs %>% rename(Variable = term, Coefficient = estimate, SE = std.error) %>% filter(Variable != "(Intercept)") data_coefs <- data_coefs %>% select(-SE, -statistic, -p.value) data_coefs %>% kable("html") %>% kable_styling(font_size = 10) Variable Coefficient conf.low conf.high Catholic 0.1452013 0.0844049 0.2059978 Agriculture -0.2030377 -0.3465286 -0.0595467 Education -1.0721468 -1.3863512 -0.7579425 Subsequently, we feed those estimates into the ggplot function. • geom_point(), geom_linerange(): Plot estimates and intervals • geom_hline(): Plot 0 line • coord_flip(): Flip the plot ggplot(data_coefs, aes(x = Variable, y = Coefficient)) + geom_hline(yintercept = 0, colour = gray(1/2), lty = 2) + geom_point(aes(x = Variable, y = Coefficient)) + geom_linerange(aes(x = Variable, ymin = conf.low, ymax = conf.high), lwd = 0.5) + ggtitle("Outcome: Fertility") + coord_flip() ## 4.2 Exercise 1. Use the code from above but change the model for which you generate the coefficient plot. • Visualize a coefficient plot in which the outcome variable is Infant.Mortality instead of Fertility. • And add an additional explanatory variable, namely Examination. 2. Try to reduce the ink that is used in the plot above. 3. Change the confidence level of the confidence intervals to 80%. Exercise solution library(broom) fit <- lm(Infant.Mortality ~ Catholic + Agriculture + Education + Examination, data = swiss) # see ?swiss data_coefs <- tidy(fit, conf.int = TRUE, conf.level = 0.80) data_coefs <- data_coefs %>% rename(Variable = term, Coefficient = estimate, SE = std.error) %>% filter(Variable != "(Intercept)") ggplot(data_coefs, aes(x = Variable, y = Coefficient)) + geom_hline(yintercept = 0, colour = gray(1/2), lty = 2) + geom_point(aes(x = Variable, y = Coefficient)) + geom_linerange(aes(x = Variable, ymin = conf.low, ymax = conf.high), lwd = 0.5) + ggtitle("Outcome: Infant Mortality") + coord_flip() + theme_light() # 5 Coefficient plots with facetting & nesting • Figure 2 below provides a coefficient plot for subsets/facets of the data. • Questions: • What does the graph show? What are the underlying variables (and data)? • How many scales/mappings does it use? Could we reduce them? • What do you like, what do you dislike about the figure? What is good, what is bad? • What kind of information could we add to the graph (if any)? • How would you approach a replication of the graph? ## 5.1 Lab: Data & code • Learning objectives • How to estimate models in data subsets using nest() and map() and work with nested dataframes • How to plot the results of statistical models accross data subsets (using faceting) • How to use the broom package/tidy function to extract coefficients Data preparations are somewhat more complicated when we want to show facets. We work with nested dataframes, as well as model results that are nested in a dataframe. We proceed in several steps: 1. We split the dataset into subsets according to values of Examination_cat using the nest() function. 2. We estimate the linear models in those subsets (see map(..lm(...))). 3. We tidy the estimations as to obtain a nice dataframe. 4. We estimate confidence intervals also obtaining nice dataframes (see map(fit, conf.level = 0.90, confint_tidy)). 5. We rename the vars in the confidence intervals dataframes (see rename_all(...)). 6. We unnest() the data obtaining one dataframe that contains the estimates and intervals across all subsets. 7. Finally, we filter the intercepts in those estimations and the result is shown in Table 1. # Create categorical examination variable swiss <- swiss %>% mutate(Examination_cat = cut(Examination, breaks = quantile(Examination, probs = seq(0, 1, 0.25)), labels = c("lowest", "lower", "higher", "highest") )) # table(swiss$Examination, swiss$Examination_cat) data_plot <- swiss %>% filter(!is.na(Examination_cat)) %>% nest(data = c(Fertility, Agriculture, Examination, Education, Catholic, Infant.Mortality)) %>% mutate( fit = map(data, ~ lm(Fertility ~ Catholic + Agriculture + Education, data = .))) %>% mutate(data_coefs = map(fit, ~tidy(.x, conf.int = TRUE, conf.level = 0.99))) %>% unnest(c(data_coefs)) %>% rename( Variable = term, Coefficient = estimate, SE = std.error ) %>% filter(Variable != "(Intercept)") %>% select( Examination_cat, Variable, Coefficient, conf.low:conf.high ) data_plot %>% kable("html") %>% kable_styling(font_size = 11) Plotting the data is straightforward again. We use the same code as above but now specify the facetting with facet_grid(Examination_cat ~ .) to get Figure 2. # GGPLOT ggplot( data_plot, aes(x = Variable, y = Coefficient) ) + geom_hline(yintercept = 0, colour = gray(1 / 2), lty = 2) + geom_point(aes( x = Variable, y = Coefficient )) + geom_linerange( aes( x = Variable, ymin = conf.low, ymax = conf.high ), lwd = 0.5 ) + ggtitle("Outcome: Fertility (Subsets: Examination)") + coord_flip() + facet_grid(Examination_cat ~ .) ## 5.2 Exercise This exercise only concerns creating the dataframe of coefficients relying on the nesting approach. Please use the code below but adapt as follows. • Instead of estimating the model across subsets of Examination_cat generate an equivalent grouping variable called Education_cat (use copy/replace). • Afterwards the estimated models should only include the independent variables Catholic and Agriculture and they should be estimated for subsets of Education_cat. • Finally, please estimate 0.95% confidence intervals. • Tip: Keep only the necessary variables before nesting: select(Education_cat, Fertility, Agriculture, Catholic) # Create categorical examination variable swiss <- swiss %>% mutate(Examination_cat = cut(Examination, breaks = quantile(Examination, probs = seq(0, 1, 0.25)), labels = c("lowest", "lower", "higher", "highest") )) # table(swiss$Examination, swiss$Examination_cat) data_plot <- swiss %>% filter(!is.na(Examination_cat)) %>% nest(data = c(Fertility, Agriculture, Examination, Education, Catholic, Infant.Mortality)) %>% mutate( fit = map(data, ~ lm(Fertility ~ Catholic + Agriculture + Education, data = .))) %>% mutate(data_coefs = map(fit, ~tidy(.x, conf.int = TRUE, conf.level = 0.99))) %>% unnest(c(data_coefs)) %>% rename( Variable = term, Coefficient = estimate, SE = std.error ) %>% filter(Variable != "(Intercept)") %>% select( Examination_cat, Variable, Coefficient, conf.low:conf.high ) data_plot %>% kable("html") %>% kable_styling(font_size = 11) Exercise solution # Create categorical Education variable swiss <- swiss %>% mutate(Education_cat = cut(Education, breaks = quantile(Education, probs = seq(0, 1, 0.25)), labels = c("lowest", "lower", "higher", "highest") )) # table(swiss$Education, swiss$Education_cat) data_plot <- swiss %>% filter(!is.na(Education_cat)) %>% select(Education_cat, Fertility, Agriculture, Catholic) %>% nest(data = c(Fertility, Agriculture, Catholic)) %>% mutate( fit = map(data, ~ lm(Fertility ~ Catholic + Agriculture, data = .))) %>% mutate(data_coefs = map(fit, ~tidy(.x, conf.int = TRUE, conf.level = 0.95))) %>% unnest(c(data_coefs)) %>% rename( Variable = term, Coefficient = estimate, SE = std.error ) %>% filter(Variable != "(Intercept)") %>% select( Education_cat, Variable, Coefficient, conf.low:conf.high ) data_plot %>% kable("html") %>% kable_styling(font_size = 11) # 6 Coefficient plots: Coloring • Figure 3 below provides a coefficient plot and colors estimations for subsets. ## 6.1 Lab: Data & code The code to estimate the models and extracting the coefficients is the same as in Section 5. Plotting the data to obtain Figure 3 is straightforward again. We use the same code as above but now instead of facetting we simply specify colour = Examination_cat within aes().  # GGPLOT ggplot(data_plot, aes(x = Variable, y = Coefficient, colour = Examination_cat)) + geom_hline(yintercept = 0, colour = gray(1/2), lty = 2) + geom_point(aes(x = Variable, y = Coefficient), position=position_dodge(width=0.3)) + geom_linerange(aes(x = Variable, ymin = conf.low, ymax = conf.high), lwd = 1, position=position_dodge(width=0.3)) + ggtitle("Outcome: Fertility (Subsets: Examination)") + coord_flip() # 7 Coefficient plots with facetting and coloring (skip!) • Figure 4 below provides a combination of facetting and coloring since models have been estimated in subsets of the data defined by two variables: Infant.Mortality_cat and Examination_cat. • Questions: • What does the graph show? What are the underlying variables (and data)? • How many scales/mappings does it use? Could we reduce them? • What do you like, what do you dislike about the figure? What is good, what is bad? • What kind of information could we add to the graph (if any)? • How would you approach a replication of the graph? ## 7.1 Lab: Data & code Data preparations are somewhat more complicated when we want to show facets and colors, i.e., in other words we want to produce coefficient plots for subsets defined by two variables. Again we work with nested dataframes, as well as model results that are nested in a dataframe. We proceed in several steps: 1. We split the dataset into subsets according to Examination_cat AND Infant.Mortality_cat. 2. We estimate the linear models in those subsets (see map(..lm(...))). 3. We tidy the estimations as to obtain a nice dataframe. 4. We estimate confidence intervals also obtaining nice dataframes (see map(fit, conf.level = 0.95, confint_tidy)). 5. We rename the vars in the confidence intervals dataframes (see rename_all(...)). 6. We unnest() the data obtaining one dataframe that contains the estimates and intervals across all subsets defined by Examination_cat AND Infant.Mortality_cat. • Here we add to delete the results for two subsets because they didn’t contain enough observations • see e.g., filter(!(Examination_cat == "higher" & Infant.Mortality_cat == "high")) 7. Finally, we filter the intercepts in those estimations. swiss <- swiss %>% mutate(Infant.Mortality_cat = cut(Infant.Mortality, breaks = quantile(Infant.Mortality, probs = seq(0, 1, 0.5)), labels = c("low", "high"))) swiss <- swiss %>% mutate(Examination_cat = cut(Examination, breaks = quantile(Examination, probs = seq(0, 1, 0.5)), labels = c("low", "high"))) data_plot <- swiss %>% filter(!is.na(Infant.Mortality_cat), !is.na(Examination_cat)) %>% select(Examination_cat, Infant.Mortality_cat, Fertility, Agriculture, Education, Catholic) %>% nest(data = c(Fertility, Agriculture, Education, Catholic)) %>% mutate(fit = map(data, ~ lm(Fertility ~ Catholic + Agriculture + Education, data = .)), data_coefs = map(fit, ~tidy(.x, conf.int = TRUE, conf.level = 0.95))) %>% unnest(c(data_coefs)) %>% rename(Variable = term, Coefficient = estimate, SE = std.error) %>% filter(Variable != "(Intercept)") %>% select( Examination_cat, Infant.Mortality_cat, Variable, Coefficient, conf.low:conf.high ) data_plot %>% kable("html") %>% kable_styling(font_size = 11) Examination_cat Infant.Mortality_cat Variable Coefficient conf.low conf.high low high Catholic 0.1393931 0.0482192 0.2305670 low high Agriculture -0.0540658 -0.3582447 0.2501132 low high Education 0.2850385 -1.1783226 1.7483997 high high Catholic -0.4467226 -1.3331392 0.4396939 high high Agriculture 0.0710211 -0.3069120 0.4489542 high high Education -0.2726475 -1.1437719 0.5984770 high low Catholic -0.3046756 -0.6845060 0.0751548 high low Agriculture -0.1800822 -0.3958094 0.0356451 high low Education -0.5728250 -1.0978119 -0.0478381 low low Catholic 0.1562665 0.0172678 0.2952652 low low Agriculture -0.5300158 -1.0041751 -0.0558564 low low Education -1.6249107 -2.3813772 -0.8684442 Plotting the data is straightforward again. We use the same code as above but now we combine facetting and coloring: * facet_grid(Examination_cat ~ .) * colour = Infant.Mortality_cat within aes()  ggplot(data_plot, aes(x = Variable, y = Coefficient, colour = Infant.Mortality_cat)) + geom_hline(yintercept = 0, colour = gray(1/2), lty = 2) + geom_point(aes(x = Variable, y = Coefficient), position=position_dodge(width=0.3)) + geom_linerange(aes(x = Variable, ymin = conf.low, ymax = conf.high), lwd = 0.5, position=position_dodge(width=0.3)) + ggtitle("Outcome: Fertility (Subsets: Infant.Mortality, Examination)") + coord_flip() + facet_grid(Examination_cat ~ .)  # 8 Predicted values • Figure 5 plots the variable Fertility (the actual values) against those predictd by the model (predicted values). • Questions: • What does the graph show? What are the underlying variables (and data)? • How many scales/mappings does it use? Could we reduce them? • What do you like, what do you dislike about the figure? What is good, what is bad? • What kind of information could we add to the graph (if any)? • How would you approach a replication of the graph? ## 8.1 Lab: Data & code The underlying functions are very simple. First we fit the model fit <- lm(Fertility ~ Catholic + Agriculture + Education, data = swiss) Then we augment the original data with the predicted values for each observed unit (they are called .fitted in the dataframe below). data_predicted <- augment(fit) # New data: augment(fit, newdata = ...) data_predicted %>% head() %>% kable("html") %>% kable_styling(font_size = 10) .rownames Fertility Catholic Agriculture Education .fitted .resid .hat .sigma .cooksd .std.resid Courtelary 80.2 9.96 17.0 12 71.35382 8.846177 0.0950817 7.686424 0.0380384 1.2033653 Delemont 83.1 84.84 45.1 9 79.73758 3.362419 0.0660897 7.800760 0.0035864 0.4502418 Franches-Mnt 92.5 93.40 39.7 5 86.36550 6.134505 0.1266332 7.753333 0.0261545 0.8494302 Moutier 85.8 33.77 36.5 7 76.21257 9.587434 0.0552259 7.669656 0.0238081 1.2763945 Neuveville 76.9 5.16 43.5 15 62.05992 14.840083 0.0410200 7.461386 0.0411227 1.9610021 Porrentruy 76.1 90.57 35.3 7 84.70365 -8.603647 0.1256133 7.689243 0.0509128 -1.1906315 Then we can plot the real outcome values of the variable Fertility against the predictions made by our model .fitted in Figure 6. Importantly, the gray line in the figure below is no the regression line! ggplot(data_predicted, aes(x = Fertility, y = .fitted)) + geom_abline(intercept = 0, slope = 1, alpha = .2) + # Line of perfect fit geom_point() + labs(y = "Predicted Value") + theme_bw() ## 8.2 Exercise Below you find the code that we used above to obtain and visualize predicted values for a model that predicts Fertility using the variables Catholic + Agriculture + Education. • Please adapt the code and predict the outcome Catholic using the variables Agriculture and Fertility. fit <- lm(Fertility ~ Catholic + Agriculture + Education, data = swiss) data_predicted <- augment(fit) ggplot(data_predicted, aes(x = Fertility, y = .fitted)) + geom_abline(intercept = 0, slope = 1, alpha = .2) + # Line of perfect fit geom_point() + labs(y = "Predicted Value") + theme_bw() # 9 Marginal effects ## 9.1 What are marginal effects? • ‘There is no common language across fields regarding a unique meaning of “marginal effects”. Thus, the wording throughout this package may vary.’ (Lüdecke on ggeffects package) • Marginal effect of a given variable: “the slope of the regression surface with respect to a given covariate […] the rate at which $$y$$ changes at a given point in covariate space, with respect to one covariate dimension and holding all covariate values constant” • Linear regression: marginal effects correspond to values of the regression coefficients ($$\beta$$ values) • Nonlinear regression models: marginal effects are not constant so different average effect indicators are used (MEMs, AMEs, MERs) • Marginal effects at representative values (MERs) • “calculate the marginal effect of each variable at a particular combination of $$X$$ values that is theoretically interesting” • Pick particular, representative covariate values (e.g., John aged 40, middle-income etc.) and calculate effect for those values • Marginal Effects at Means (MEMs) • “calculate the marginal effects of each variable at the means of the covariates” • …the means may also correspond to particular individuals (that happen to have the mean value on all covariates) • Average Marginal Effects (AMEs) (cf. Leeper 2021): • “calculate marginal effects at every observed value of X and average across the resulting effect estimates” • See Leeper’s margins package and the paper as well as a blog post by Andew Heiss ## 9.2 Marginal effects in R ## 9.3 Lab: Data & Code Our lab is based on Lee, Du, and Guerzhoy (2020) and on James et al. (2013, chap. 4.6.2) with various modifications. We will be using the dataset at LINK that is described by Angwin et al. (2016). + It’s data based on the COMPAS risk assessment tools (RAT). RATs are increasingly being used to assess a criminal defendant’s probability of re-offending. While COMPAS seemingly uses a larger number of features/variables for the prediction, Dressel and Farid (2018) showed that a model that includes only a defendant’s sex, age, and number of priors (prior offences) can be used to arrive at predictions of equivalent quality. We first import the data into R: data <- read_csv(sprintf("https://docs.google.com/uc?id=%s&export=download", "1plUxvMIoieEcCZXkBpj4Bxw1rDwa27tj")) nrow(data) We select a subset of variables and also order our variables differently. data <- data %>% select( id, name, compas_screening_date, is_recid, age, priors_count, sex, race ) kable(head(data)) id name compas_screening_date is_recid age priors_count sex race 1 miguel hernandez 2013-08-14 0 69 0 Male Other 3 kevon dixon 2013-01-27 1 34 0 Male African-American 4 ed philo 2013-04-14 1 24 4 Male African-American 5 marcu brown 2013-01-13 0 23 1 Male African-American 6 bouthy pierrelouis 2013-03-26 0 43 2 Male Other 7 marsha miles 2013-11-30 0 44 0 Male Other Install and load the marginaleffects package: # install.packages("marginaleffects") library(marginaleffects) Estimate a logistic regression model in which we predict recidivism. fit <- glm(is_recid ~ priors_count + sex + race + age, data = data, family = binomial ) # summary(fit) Use comparisons() to get predicted values for each observation/individual/row in the dataset that we used to fit the model. Here we get 7214 estimates of the difference between the probability of recidivating/reoffending between males and females. cmp <- comparisons(fit, variables = list(sex = c("Male", "Female")) ) kable(head(cmp %>% select(1:5))) rowid term contrast estimate std.error 1 sex Female - Male -0.0234380 0.0048817 2 sex Female - Male -0.0800209 0.0143704 3 sex Female - Male -0.0825776 0.0156807 4 sex Female - Male -0.0876982 0.0162264 5 sex Female - Male -0.0687888 0.0125481 6 sex Female - Male -0.0565665 0.0105231 If we are interested in the MER we can proceed as follows: cmp <- comparisons(fit, variables = list(sex = c("Male", "Female")), newdata = datagrid(age = 20, race = "African-American") ) kable(head(cmp %>% select(1:5))) rowid term contrast estimate std.error 1 sex Female - Male -0.0802949 0.0153414 And we may be interested in the AME. By default comparisons() estimates quantities for all the actually observed units in our dataset. If we want to marginalize those conditional estimates, in order to obtain an ‘average contrast’ we can use the code below: av_comp <- avg_comparisons(fit, variables = list(sex = c("Male", "Female"))) kable(av_comp) term contrast estimate std.error statistic p.value s.value conf.low conf.high sex Female - Male -0.0751052 0.0137965 -5.443794 1e-07 24.19255 -0.1021458 -0.0480646 To visualize such comparisons we can use plot_comparisons() that plots the effects on the y-axis (variables =) against values of one or more predictors on the x-axis (condition =). Below we see how the effect of age differs for males and female (effect is more negative for males). plot_comparisons(fit, variables = "age", # Y-axis (effect) condition = "sex" # groups ) Check out the Titanic example in the vignette for more. # 10 Useful graphs & resources Below some interesting graphs/resources that contain visualizations that you might want to check out: # 11 Appendix: Several confidence intervals ## 11.1 Coefficient plots (several CIs) • Here we will work with the ‘famous’ swiss data set (?swiss) • Figure 7 below provides a simple coefficient plot. • Questions: • What does the graph show? What are the underlying variables (and data)? • How many scales/mappings does it use? Could we reduce them? • What do you like, what do you dislike about the figure? What is good, what is bad? • What kind of information could we add to the graph (if any)? • How would you approach a replication of the graph? ### 11.1.1 Lab: Data & code • Learning objectives • How to plot the results of statistical models • How to use the broom package/tidy function to extract coefficients • How to plot different confidence intervals We start by preparing the data, i.e., estimating the model and extracting the coefficients of interest. Importantly, the table you see below is all you need as a basis for the ggplot functions. library(broom) fit <- lm(Fertility ~ Catholic + Agriculture + Education, data = swiss) # see ?swiss data_coefs <- tidy(fit) fit_cis_95 <- confint(fit, level = 0.95) %>% data.frame() %>% rename("conf.low_95" = "X2.5..", "conf.high_95" = "X97.5..") fit_cis_90 <- confint(fit, level = 0.90) %>% data.frame() %>% rename("conf.low_90" = "X5..", "conf.high_90" = "X95..") data_coefs <- bind_cols(data_coefs, fit_cis_95, fit_cis_90) %>% rename(Variable = term, Coefficient = estimate, SE = std.error) %>% filter(Variable != "(Intercept)") data_coefs <- data_coefs %>% select(-SE, -statistic, -p.value) data_coefs %>% kable("html") %>% kable_styling(font_size = 10) Variable Coefficient conf.low_95 conf.high_95 conf.low_90 conf.high_90 Catholic 0.1452013 0.0844049 0.2059978 0.0945227 0.1958800 Agriculture -0.2030377 -0.3465286 -0.0595467 -0.3226486 -0.0834268 Education -1.0721468 -1.3863512 -0.7579425 -1.3340607 -0.8102329 Subsequently, we feed those estimates into the ggplot function. • geom_point(), geom_linerange(): Plot estimates and intervals • geom_hline(): Plot 0 line • coord_flip(): Flip the plot ggplot(data_coefs, aes(x = Variable, y = Coefficient)) + geom_hline(yintercept = 0, colour = gray(1/2), lty = 2) + geom_point(aes(x = Variable, y = Coefficient)) + geom_linerange(aes(x = Variable, ymin = conf.low_90, ymax = conf.high_90), lwd = 1) + geom_linerange(aes(x = Variable, ymin = conf.low_95, ymax = conf.high_95), lwd = 1/2) + ggtitle("Outcome: Fertility") + coord_flip() We can also add text (e.g., the coefficient estimates) in the graph. See below: # With labels data_coefs <- data_coefs %>% mutate(Model = paste0("Var ", 1:nrow(.))) # Q: ? ggplot(data_coefs, aes(x = Variable, y = Coefficient)) + geom_hline(yintercept = 0, colour = gray(1/2), lty = 2) + geom_point(aes(x = Variable, y = Coefficient)) + geom_linerange(aes(x = Variable, ymin = conf.low_90, ymax = conf.high_90), lwd = 1) + geom_linerange(aes(x = Variable, ymin = conf.low_95, ymax = conf.high_95), lwd = 1/2) + geom_text(aes(x = Variable, y = Coefficient, label = round(Coefficient,2)), vjust = 2) + geom_text(aes(x = Variable, y = conf.high_95, label = Model), vjust = 0, hjust =0) + ggtitle("Outcome: Fertility") + coord_flip() # Without flipping # ggplot(data_coefs, aes(y = Variable, x = Coefficient)) + # geom_hline(yintercept = 0, colour = gray(1/2), lty = 2) + # geom_point(aes(y = Variable, # x = Coefficient)) + # geom_linerange(aes(y = Variable, # xmin = conf.low_90, # xmax = conf.high_90), # lwd = 1) + # geom_linerange(aes(y = Variable, # xmin = conf.low_95, # xmax = conf.high_95), # lwd = 1/2) + # ggtitle("Outcome: Fertility") + # coord_flip() ### 11.1.2 Exercise 1. Use the code from above but change the model for which you generate the coefficient plot. • Visualize a coefficient plot in which the outcome variable is Infant.Mortality instead of Fertility. • And add an additional explanatory variable, namely Examination. 2. Try to reduce the ink that is used in the plot above. 3. Try to add a third set of confidence intervals (80%) to the plot. ## 11.2 Graph: Coefficient plots with facetting & nesting (several CIs) • Figure 8 below provides a coefficient plot for subsets/facets of the data. • Questions: • What does the graph show? What are the underlying variables (and data)? • How many scales/mappings does it use? Could we reduce them? • What do you like, what do you dislike about the figure? What is good, what is bad? • What kind of information could we add to the graph (if any)? • How would you approach a replication of the graph? ### 11.2.1 Lab: Data & code (several CIs) • Learning objectives • How to plot the results of statistical models accross datasubsets (using faceting) • How to categorize a continuous variable • Work with nested dataframes: How to estimate models in subsets using nest() and map() • How to use the broom package/tidy function to extract coefficients • How to plot different confidence intervals Data preparations are somewhat more complicated when we want to show facets. We work with nested dataframes, as well as model results that are nested in a dataframe. We proceed in several steps: 1. We split the dataset into subsets according to values of Examination_cat using the nest() function. 2. We estimate the linear models in those subsets (see map(..lm(...))). 3. We tidy the estimations as to obtain a nice dataframe. 4. We estimate confidence intervals also obtaining nice dataframes (see map(fit, conf.level = 0.90, confint_tidy)). 5. We rename the vars in the confidence intervals dataframes (see rename_all(...)). 6. We unnest() the data obtaining one dataframe that contains the estimates and intervals across all subsets. 7. Finally, we filter the intercepts in those estimations and the result is shown in Table @ref(tab:code-coefficient-plot-facetting1). # Create categorical examination variable swiss <- swiss %>% mutate(Examination_cat = cut(Examination, breaks = quantile(Examination, probs = seq(0, 1, 0.25)), labels = c("lowest", "lower", "higher", "highest"))) #table(swiss$Examination, swiss\$Examination_cat)
data_plot <- swiss %>%
filter(!is.na(Examination_cat)) %>%
nest(data = c(Fertility, Agriculture, Examination, Education, Catholic, Infant.Mortality)) %>%
mutate(fit = map(data, ~ lm(Fertility ~ Catholic + Agriculture + Education, data = .)),
data_coefs = map(fit, tidy),
data_coefs_90 = map(fit, level = 0.90, confint),
data_coefs_95 = map(fit, level = 0.95, confint)) %>%
mutate(data_coefs_90 = map(data_coefs_90, ~ data.frame(.)),
data_coefs_95 = map(data_coefs_95, ~ data.frame(.))) %>%
unnest(c(data_coefs, data_coefs_90, data_coefs_95)) %>%
rename(Variable = term,
Coefficient = estimate,
SE = std.error) %>%
filter(Variable != "(Intercept)") %>%
rename("conf.low_95" = "X2.5..",
"conf.high_95" = "X97.5..",
"conf.low_90" = "X5..",
"conf.high_90" = "X95..")

data_plot %>%
select(-data, -fit, -statistic, -p.value, -SE) %>%
kable("html") %>%
kable_styling(font_size = 11)
Examination_cat Education_cat Infant.Mortality_cat Variable Coefficient conf.low_90 conf.high_90 conf.low_95 conf.high_95
lower higher high Catholic NA NA NA NA NA
lower higher high Agriculture NA NA NA NA NA
lower higher high Education NA NA NA NA NA
lowest higher high Catholic NA NA NA NA NA
lowest higher high Agriculture NA NA NA NA NA
lowest higher high Education NA NA NA NA NA
lowest lowest high Catholic -2.7178423 NaN NaN NaN NaN
lowest lowest high Agriculture NA NA NA NA NA
lowest lowest high Education NA NA NA NA NA
lowest lower high Catholic -0.1620300 NaN NaN NaN NaN
lowest lower high Agriculture 0.4139118 NaN NaN NaN NaN
lowest lower high Education NA NA NA NA NA
higher highest high Catholic NA NA NA NA NA
higher highest high Agriculture NA NA NA NA NA
higher highest high Education NA NA NA NA NA
lower lower high Catholic 0.0890572 NaN NaN NaN NaN
lower lower high Agriculture 1.0322651 NaN NaN NaN NaN
lower lower high Education 10.6935997 NaN NaN NaN NaN
lower highest high Catholic NA NA NA NA NA
lower highest high Agriculture NA NA NA NA NA
lower highest high Education NA NA NA NA NA
lower lowest high Catholic 0.2348815 NaN NaN NaN NaN
lower lowest high Agriculture NA NA NA NA NA
lower lowest high Education NA NA NA NA NA
higher higher low Catholic -1.3826949 NaN NaN NaN NaN
higher higher low Agriculture -0.1489586 NaN NaN NaN NaN
higher higher low Education 1.7383876 NaN NaN NaN NaN
lower lower low Catholic 3.9629630 NaN NaN NaN NaN
lower lower low Agriculture NA NA NA NA NA
lower lower low Education NA NA NA NA NA
higher higher high Catholic NA NA NA NA NA
higher higher high Agriculture NA NA NA NA NA
higher higher high Education NA NA NA NA NA
higher lowest low Catholic -3.1159420 NaN NaN NaN NaN
higher lowest low Agriculture NA NA NA NA NA
higher lowest low Education NA NA NA NA NA
higher lowest high Catholic NA NA NA NA NA
higher lowest high Agriculture NA NA NA NA NA
higher lowest high Education NA NA NA NA NA
higher lower low Catholic NA NA NA NA NA
higher lower low Agriculture NA NA NA NA NA
higher lower low Education NA NA NA NA NA
highest highest high Catholic 1.4686118 NaN NaN NaN NaN
highest highest high Agriculture -0.9088763 NaN NaN NaN NaN
highest highest high Education NA NA NA NA NA
highest highest NA Catholic NA NA NA NA NA
highest highest NA Agriculture NA NA NA NA NA
highest highest NA Education NA NA NA NA NA
lowest NA high Catholic NA NA NA NA NA
lowest NA high Agriculture NA NA NA NA NA
lowest NA high Education NA NA NA NA NA
lowest lowest low Catholic 0.0821520 NaN NaN NaN NaN
lowest lowest low Agriculture -0.1879328 NaN NaN NaN NaN
lowest lowest low Education -2.2189471 NaN NaN NaN NaN
lower higher low Catholic NA NA NA NA NA
lower higher low Agriculture NA NA NA NA NA
lower higher low Education NA NA NA NA NA
lowest higher low Catholic NA NA NA NA NA
lowest higher low Agriculture NA NA NA NA NA
lowest higher low Education NA NA NA NA NA
lower highest low Catholic 0.7456897 NaN NaN NaN NaN
lower highest low Agriculture NA NA NA NA NA
lower highest low Education NA NA NA NA NA
highest higher high Catholic -0.5752754 NaN NaN NaN NaN
highest higher high Agriculture NA NA NA NA NA
highest higher high Education NA NA NA NA NA
higher highest low Catholic -0.6346848 NaN NaN NaN NaN
higher highest low Agriculture NA NA NA NA NA
higher highest low Education NA NA NA NA NA
highest lower low Catholic NA NA NA NA NA
highest lower low Agriculture NA NA NA NA NA
highest lower low Education NA NA NA NA NA
highest highest low Catholic NA NA NA NA NA
highest highest low Agriculture NA NA NA NA NA
highest highest low Education NA NA NA NA NA

Plotting the data is straightforward again. We use the same code as above but now specify the facetting with facet_grid(Examination_cat ~ .).

  # GGPLOT
ggplot(data_plot, aes(x = Variable, y = Coefficient)) +
geom_hline(yintercept = 0, colour = gray(1/2), lty = 2) +
geom_point(aes(x = Variable,
y = Coefficient)) +
geom_linerange(aes(x = Variable,
ymin = conf.low_90,
ymax = conf.high_90),
lwd = 1) +
geom_linerange(aes(x = Variable,
ymin = conf.low_95,
ymax = conf.high_95),
lwd = 1/2) +
ggtitle("Outcome: Fertility (Subsets: Examination)") +
coord_flip() +
facet_grid(Examination_cat ~ .) 

## References

Angwin, Julia, Jeff Larson, Lauren Kirchner, and Surya Mattu. 2016. “Machine Bias.” https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing.
Dressel, Julia, and Hany Farid. 2018. “The Accuracy, Fairness, and Limits of Predicting Recidivism.” Sci Adv 4 (1): eaao5580.
James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. An Introduction to Statistical Learning: With Applications in R. Springer Texts in Statistics. Springer.
Lee, Claire S, Jeremy Du, and Michael Guerzhoy. 2020. “Auditing the COMPAS Recidivism Risk Assessment Tool: Predictive Modelling and Algorithmic Fairness in CS1.” In Proceedings of the 2020 ACM Conference on Innovation and Technology in Computer Science Education, 535–36. ITiCSE ’20. New York, NY, USA: Association for Computing Machinery.
Leeper, Thomas J. 2021. “Interpreting Regression Results Using Average Marginal Effects with r’s Margins.” https://cran.hafro.is/web/packages/margins/vignettes/TechnicalDetails.pdf; cran.hafro.is.
Wickham, Hadley. 2010. “A Layered Grammar of Graphics.” J. Comput. Graph. Stat. 19 (1): 3–28.