7.9 Lab
7.9.1 Study
- Negative Experiences and Trust: A Causal Analysis of the Effects of Victimization on Generalized Trust (Bauer 2015): What is the causal effect of victimization on social trust?
- Generalized trust is praised by many researchers as the foundation of functioning social systems. An ongoing debate concerns the question whether and to what extent experiences impact individuals’ generalized trust, as measured with the standard trust survey question. So far, reliable empirical evidence regarding the causal effect of experiences on generalized trust is scarce. Studies either do not directly measure the quality of experiences or use designs that are prone to selection bias. In the present study, we investigate a unique panel data set from Switzerland that contains measures of trust and measures of negative experiences, i.e. victimization. We use change score analysis and ‘genetic matching’ to investigate the causal effect of victimization on generalized trust and find no substantially strong effect that is consistent across panel data waves. (Bauer 2015)
- We use this data because we can discuss several identification strategies relying on the same dataset (+ I know it)
7.9.2 Data
- Data and files can be directly loaded with the command given below or downloaded from the data folder.
data-selection-on-observables.csv contains a subset of the Bauer (2015) data that we will use for our exercise (Reproduction files). We’ll use this data to discuss selection-on-observables and also panel data later on.
The individual-level dataset covers victimization (experiencing threats), trust (generalized trust), and various covariates for the period from 2005 to 2008. Below is a description, where * is replaced with the corresponding year. Analogous to our theoretical sessions, treatment variables are generally named d_..., outcome variables y_..., and covariates x_....

- y_trust*: Generalized trust (0-10) at t (Outcome Y)
- d_threat*: Experiencing a threat (0,1) in the year before t (Treatment D)
- x_age*: Age measured at t
- x_male*: Gender at t (Male = 1, Female = 0)
- x_education*: Level of education (0-10) at t
- x_income*: Income, categorical (0-3), at t

Q: The data is in wide format. What does that look like?
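To make the wide-format question concrete, here is a toy sketch (hypothetical mini data; the variable names only mimic the real ones) that converts wide to long with base R’s reshape():

```r
# Wide format: one row per person, one column per year (hypothetical values)
wide <- data.frame(idpers = 1:2,
                   y_trust2006 = c(6, 8),
                   y_trust2007 = c(7, 5))

# Long format: one row per person-year
long <- stats::reshape(wide, direction = "long",
                       varying = c("y_trust2006", "y_trust2007"),
                       v.names = "y_trust", timevar = "year",
                       times = c(2006, 2007), idvar = "idpers")

# Wide has 2 rows (one per person); long has 4 rows (2 persons x 2 years)
nrow(wide)
nrow(long)
```

In our dataset the year suffix on each column plays the role of the time index; the panel-data methods we discuss later typically want the long version.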
7.9.3 Summary Statistics
Below are summary statistics of the data in Table 7.1.
# Load the stargazer package for summary tables
library(stargazer)

# Directly import data from shared google folder into R
data <- readr::read_csv(sprintf("https://docs.google.com/uc?id=%s&export=download", "1WotERFf-yhnKzo48VlCOiJMfutxXyvfT"))
# Or download and import with: data <- readr::read_csv("data-selection-on-observables.csv")

# Create a summary table
stargazer(data.frame(data),
          type = "html",
          summary = TRUE,
          title = "(#tab:selectiononobservables4)Summary statistics")
Statistic | N | Mean | St. Dev. | Min | Pctl(25) | Pctl(75) | Max |
y_trust2005 | 6,515 | 6.073 | 2.378 | 0.000 | 5.000 | 8.000 | 10.000 |
y_trust2006 | 6,638 | 6.131 | 2.302 | 0.000 | 5.000 | 8.000 | 10.000 |
y_trust2007 | 6,959 | 6.145 | 2.313 | 0.000 | 5.000 | 8.000 | 10.000 |
y_trust2008 | 6,875 | 6.215 | 2.286 | 0.000 | 5.000 | 8.000 | 10.000 |
x_age2005 | 11,159 | 38.002 | 21.593 | 0.000 | 19.000 | 54.000 | 95.000 |
x_age2006 | 10,859 | 38.522 | 21.765 | 0.000 | 19.000 | 55.000 | 96.000 |
x_age2007 | 11,000 | 39.609 | 21.941 | 0.000 | 20.000 | 56.000 | 95.000 |
x_age2008 | 10,879 | 40.042 | 21.976 | 0.000 | 20.000 | 57.000 | 96.000 |
x_male2005 | 11,164 | 0.489 | 0.500 | 0.000 | 0.000 | 1.000 | 1.000 |
x_male2006 | 10,863 | 0.487 | 0.500 | 0.000 | 0.000 | 1.000 | 1.000 |
x_male2007 | 11,002 | 0.486 | 0.500 | 0.000 | 0.000 | 1.000 | 1.000 |
x_male2008 | 10,889 | 0.484 | 0.500 | 0.000 | 0.000 | 1.000 | 1.000 |
d_threat2005 | 6,547 | 0.089 | 0.285 | 0.000 | 0.000 | 0.000 | 1.000 |
d_threat2006 | 6,666 | 0.100 | 0.301 | 0.000 | 0.000 | 0.000 | 1.000 |
d_threat2007 | 6,988 | 0.097 | 0.297 | 0.000 | 0.000 | 0.000 | 1.000 |
d_threat2008 | 6,903 | 0.105 | 0.307 | 0.000 | 0.000 | 0.000 | 1.000 |
x_education2005 | 10,271 | 4.306 | 3.172 | 0.000 | 1.000 | 6.000 | 10.000 |
x_education2006 | 10,038 | 4.302 | 3.212 | 0.000 | 1.000 | 6.000 | 10.000 |
x_education2007 | 10,181 | 4.358 | 3.223 | 0.000 | 1.000 | 7.000 | 10.000 |
x_education2008 | 10,155 | 4.379 | 3.245 | 0.000 | 1.000 | 7.000 | 10.000 |
x_income2005 | 4,363 | 1.410 | 1.139 | 0.000 | 0.000 | 3.000 | 3.000 |
x_income2006 | 4,406 | 1.409 | 1.143 | 0.000 | 0.000 | 3.000 | 3.000 |
x_income2007 | 4,681 | 1.394 | 1.158 | 0.000 | 0.000 | 3.000 | 3.000 |
x_income2008 | 4,645 | 1.439 | 1.167 | 0.000 | 0.000 | 3.000 | 3.000 |
idpers | 23,243 | 11,622.000 | 6,709.820 | 1 | 5,811.5 | 17,432.5 | 23,243 |
7.9.4 Descriptive exploration
For now we focus on data from a single year, namely 2006.
summary(data$y_trust2006)

Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. | NA’s |
---|---|---|---|---|---|---|
0 | 5 | 7 | 6.131365 | 8 | 10 | 16605 |
- Q: What are the median and the mean? What does the distribution look like when the median lies to the right of the mean?
summary(data$d_threat2006)

Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. | NA’s |
---|---|---|---|---|---|---|
0 | 0 | 0 | 0.10036 | 0 | 1 | 16577 |
# Relative frequencies of trust values
round(prop.table(table(data$y_trust2006)), 2)

0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
---|---|---|---|---|---|---|---|---|---|---|
0.05 | 0.01 | 0.03 | 0.04 | 0.06 | 0.19 | 0.13 | 0.2 | 0.2 | 0.05 | 0.05 |
# Contingency: Absolute frequencies
table(data$d_threat2006, data$y_trust2006)

| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 259 | 36 | 135 | 214 | 320 | 1142 | 782 | 1228 | 1193 | 326 | 331 |
1 | 44 | 6 | 37 | 56 | 48 | 139 | 70 | 114 | 101 | 27 | 25 |
# Contingency: Relative frequencies (shares)
round(prop.table(table(data$d_threat2006, data$y_trust2006)), 2)
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.04 | 0.01 | 0.02 | 0.03 | 0.05 | 0.17 | 0.12 | 0.19 | 0.18 | 0.05 | 0.05 |
1 | 0.01 | 0.00 | 0.01 | 0.01 | 0.01 | 0.02 | 0.01 | 0.02 | 0.02 | 0.00 | 0.00 |
7.9.5 Naive estimate of the ATE
First, let us simply compare the mean trust of non-victims with that of victims. The difference is the naive estimate of the ATE.
# Mean trust among non-victims (d = 0) and victims (d = 1)
mean(data$y_trust2006[data$d_threat2006 == 0], na.rm = TRUE)
## [1] 6.204157
mean(data$y_trust2006[data$d_threat2006 == 1], na.rm = TRUE)
## [1] 5.478261
We can also plot this. In Figure 7.1 red points are the means, transparent points are observations.
# Requires the plotly package
library(plotly)

# Add jitter so overlapping observations become visible
data$d_threat2006.jitter <- jitter(data$d_threat2006, factor = 0.3)
data$y_trust2006.jitter <- jitter(data$y_trust2006)

plot_ly(data = data, x = ~d_threat2006.jitter, y = ~y_trust2006.jitter,
        type = "scatter", mode = "markers", hoverinfo = "none",
        marker = list(color = '#000000',
                      size = 2,
                      opacity = 0.05,
                      symbol = 'circle',
                      line = list(color = '#000000',
                                  width = 4)),
        name = "Observations") %>%
  layout(xaxis = list(range = list(-1,2),
                      tickvals = c(0,1),
                      zeroline = FALSE,
                      title = "Victimization 2006"),
         yaxis = list(range = list(-1,11),
                      tickvals = c(0,1,2,3,4,5,6,7,8,9,10),
                      zeroline = FALSE,
                      title = "Trust 2006")) %>%
add_markers(x = c(0,1),
y = c(mean(data$y_trust2006[data$d_threat2006==0], na.rm = TRUE),
mean(data$y_trust2006[data$d_threat2006==1], na.rm = TRUE)),
marker = list(color = 'red',
size = 4,
opacity = 1,
symbol = 'circle',
line = list(color = 'red',
width = 4)),
name = "Means",
hoverinfo = "text"
)
- Q: What is jitter? Why did I add some to the visualization?
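To see what jitter() does, here is a minimal sketch on a toy vector (not the real data): it adds a small amount of uniform random noise, so points piled up on the values 0 and 1 spread out slightly instead of overplotting each other.

```r
set.seed(1)  # make the random noise reproducible
x <- rep(c(0, 1), each = 5)        # ten observations piled up on two values
x.jitter <- jitter(x, factor = 0.3)

# The jittered values deviate only slightly from 0 and 1, but no two
# observations sit on exactly the same spot anymore
round(x.jitter, 3)
```

In the plot above this is purely cosmetic: the jittered copies are used only for plotting, while all calculations use the original variables.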
Instead of comparing means we can also run a simple regression with the function lm(), using d_threat2006 as the sole explanatory variable: \(\text{y\_trust}_{i} = \beta_{0} + \beta_{1} \text{d\_threat}_{i} + \epsilon_{i}\).
fit <- lm(y_trust2006 ~ d_threat2006, data = data)
summary(fit)
##
## Call:
## lm(formula = y_trust2006 ~ d_threat2006, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.2042 -1.2042 0.7958 1.7958 4.5217
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.20416 0.02968 209.021 < 2e-16 ***
## d_threat2006 -0.72590 0.09360 -7.755 1.02e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.293 on 6631 degrees of freedom
## (16610 observations deleted due to missingness)
## Multiple R-squared: 0.008988, Adjusted R-squared: 0.008839
## F-statistic: 60.14 on 1 and 6631 DF, p-value: 1.015e-14
# Access coefficients with fit$coeff[[1]]
# For other components see ?lm
# Nice output: stargazer(fit, type = "html")
- Q: What is the intercept 6.2041569 in this regression?
- Q: What is the coefficient -0.725896 and how does it relate to the intercept?
- Tip: Think of the two means we calculated before.
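To see the relation, here is a self-contained sketch with made-up numbers: regressing y on a binary d reproduces the two group means, with the intercept equal to the d = 0 mean and the coefficient equal to the difference in means.

```r
# Toy data (hypothetical): binary treatment d, outcome y
d <- c(0, 0, 0, 1, 1, 1)
y <- c(5, 6, 7, 3, 4, 5)

fit <- lm(y ~ d)
b0 <- coef(fit)[[1]]   # intercept
b1 <- coef(fit)[[2]]   # coefficient of d

# Intercept = mean of the d = 0 group: 6
# Intercept + coefficient = mean of the d = 1 group: 4
c(b0, b0 + b1)
```

The same logic applies above: the intercept is the mean trust of non-victims, and adding the coefficient gives the mean trust of victims.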
7.9.6 Controlling/Conditioning
Let’s assume selection on observables: we observe the set of covariates such that treatment assignment is random conditional on these covariates. [Don’t forget that we assume that D and any Xs occur before Y. Q: Why does this ordering matter?]
Then we can identify the causal effect by controlling for those covariates.
For the model below we assume that assignment of victimization (d_threat2006) is random (conditionally independent) controlling for x_age2006 and x_male2006. We rerun the regression, adding those covariates to the model in Table 7.2.
fit <- lm(y_trust2006 ~ d_threat2006 + x_age2006 + x_male2006, data = data)
# stargazer(fit, type="html")
Dependent variable: | |
y_trust2006 | |
d_threat2006 | -0.715*** |
(0.095) | |
x_age2006 | 0.001 |
(0.002) | |
x_male2006 | -0.036 |
(0.057) | |
Constant | 6.187*** |
(0.084) | |
Observations | 6,633 |
R2 | 0.009 |
Adjusted R2 | 0.009 |
Residual Std. Error | 2.293 (df = 6629) |
F Statistic | 20.251*** (df = 3; 6629) |
Note: | *p<0.1; **p<0.05; ***p<0.01 |
Nice, we found something. And given our selection-on-observables assumption is valid, we would interpret our estimate as the causal effect of victimization on trust.
Q: Can we stop here? Do you think the assumption is valid?
Q: Imagine we had found no effect. Should we stop our investigation? Why or why not?
7.9.7 Controlling: Conceptually
Conceptually: When we control, we divide the sample into subsets. Namely subsets pertaining to values of one (or several) covariates. Remember our discussion of variables/joint distributions/cells at the beginning and the notion of “filtering”! (See Section Data: (Empirical) Joint distributions).
For instance, the variable x_income2006 divides the sample into 4 different income groups. Let’s re-estimate the model in those subgroups and check the results in Table 7.3.
fit <- lm(y_trust2006 ~ d_threat2006 + x_income2006, data = data)
fit1 <- lm(y_trust2006 ~ d_threat2006, data = data[data$x_income2006==0,])
fit2 <- lm(y_trust2006 ~ d_threat2006, data = data[data$x_income2006==1,])
fit3 <- lm(y_trust2006 ~ d_threat2006, data = data[data$x_income2006==2,])
fit4 <- lm(y_trust2006 ~ d_threat2006, data = data[data$x_income2006==3,])
stargazer(fit, fit1, fit2, fit3, fit4, type = "html",
column.labels = c("All", "Income = 0", "Income = 1",
"Income = 2", "Income = 3"),
title = "(#tab:selectiononobservables12)Results: Subsets of income groups",
omit.stat=c("f", "ser"))
Dependent variable: | |||||
y_trust2006 | |||||
All | Income = 0 | Income = 1 | Income = 2 | Income = 3 | |
(1) | (2) | (3) | (4) | (5) | |
d_threat2006 | -0.788*** | -1.007*** | -0.554*** | -0.963*** | -0.637*** |
(0.103) | (0.190) | (0.209) | (0.241) | (0.189) | |
x_income2006 | 0.112*** | ||||
(0.029) | |||||
Constant | 6.194*** | 6.324*** | 6.184*** | 6.277*** | 6.628*** |
(0.054) | (0.070) | (0.070) | (0.079) | (0.058) | |
Observations | 4,394 | 1,229 | 1,246 | 808 | 1,111 |
R2 | 0.017 | 0.022 | 0.006 | 0.019 | 0.010 |
Adjusted R2 | 0.017 | 0.022 | 0.005 | 0.018 | 0.009 |
Note: | *p<0.1; **p<0.05; ***p<0.01 |
Take the estimates from the subgroups (-1.0065519, -0.5536807, -0.9630548, -0.6372947) and calculate their mean: -0.7901455. That’s similar to the estimate we got pooling all income groups (Column 1): -0.7881753.
We can also visualize those subgroups.
for (i in 0:3){
d.temp <- data[data$x_income2006==i,]
means.temp <- data.frame(x = c(0,1),
y = c(mean(d.temp$y_trust2006[d.temp$d_threat2006==0], na.rm = TRUE),
mean(d.temp$y_trust2006[d.temp$d_threat2006==1], na.rm = TRUE)))
d.temp$d_threat2006.jitter <- jitter(d.temp$d_threat2006, factor = 0.3)
d.temp$y_trust2006.jitter <- jitter(d.temp$y_trust2006)
assign(paste("p", i, sep = ""),
plot_ly(data = d.temp,
x = d.temp$d_threat2006.jitter,
y = d.temp$y_trust2006.jitter,
type = "scatter", mode = "markers",
marker = list(color = '#000000',
size = 2,
opacity = 0.05,
symbol = 'circle',
line = list(color = '#000000',
width = 4)),
name = "Observations",
hoverinfo = "none") %>%
layout(xaxis = list(range = list(-1,2),
tickvals = c(0,1),
zeroline = FALSE,
title = paste("Threat 2006 (jitt.)", " [income = ", i+1 ,"]", sep="")),
yaxis = list(range = list(-1,11),
tickvals = c(0,1,2,3,4,5,6,7,8,9,10),
zeroline = FALSE,
title = "Trust 2006 (jitt.)")) %>%
add_markers(data = means.temp,
x = ~x,
y = ~y,
marker = list(color = 'red',
size = 4,
opacity = 1,
symbol = 'circle',
line = list(color = 'red',
width = 4)),
name = "Means",
text = paste(round(means.temp$y, 2)),
hoverinfo = "text"
)
)
}
p.sm <- subplot(p0, p1, p2, p3, nrows=2, shareX = FALSE, shareY = FALSE,
titleX = TRUE, titleY = TRUE, margin = c(0.1,0.1,0.1,0.1)) %>%
layout(showlegend = FALSE)
p.sm %>% config(p = ., displayModeBar = FALSE)
We don’t discuss in detail how a multiple regression coefficient is estimated/calculated (it’s a weighted average across cells). But keep in mind the idea of subsets: values of variables slice the joint distribution into groups.
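The “weighted average across cells” claim can be illustrated with simulated data (hypothetical numbers; this is the standard result for a binary treatment with saturated group dummies): the pooled coefficient equals a weighted average of the within-cell difference-in-means, with weights proportional to n_g * p_g * (1 - p_g), where p_g is the treatment share in cell g.

```r
set.seed(42)
# Three cells (think: income groups) with different sizes and treatment shares
g <- rep(1:3, times = c(100, 200, 300))
d <- rbinom(length(g), 1, prob = c(0.2, 0.5, 0.8)[g])
y <- 2 * d + g + rnorm(length(g))   # true effect of d is 2

# Pooled regression controlling for cell membership via dummies
b_pooled <- coef(lm(y ~ d + factor(g)))[["d"]]

# Within-cell difference-in-means and the regression weights
tau <- sapply(1:3, function(k)
  mean(y[g == k & d == 1]) - mean(y[g == k & d == 0]))
w <- sapply(1:3, function(k) {
  p <- mean(d[g == k])
  sum(g == k) * p * (1 - p)
})

# The pooled coefficient equals the weighted average of the cell effects
all.equal(b_pooled, sum(w * tau) / sum(w))
```

Note that these weights differ from the equal weights we used when averaging the four subgroup estimates above, which is why the two numbers were close but not identical.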
References
Bauer, Paul C. 2015. “Negative Experiences and Trust: A Causal Analysis of the Effects of Victimization on Generalized Trust.” European Sociological Review 31 (4): 397–417.