7.9 Lab
7.9.1 Study
- Negative Experiences and Trust: A Causal Analysis of the Effects of Victimization on Generalized Trust (Bauer 2015): What is the causal effect of victimization on social trust?
- Generalized trust is praised by many researchers as the foundation of functioning social systems. An ongoing debate concerns the question whether and to what extent experiences impact individuals’ generalized trust, as measured with the standard trust survey question. So far, reliable empirical evidence regarding the causal effect of experiences on generalized trust is scarce. Studies either do not directly measure the quality of experiences or use designs that are prone to selection bias. In the present study, we investigate a unique panel data set from Switzerland that contains measures of trust and measures of negative experiences, i.e. victimization. We use change score analysis and ‘genetic matching’ to investigate the causal effect of victimization on generalized trust and find no substantially strong effect that is consistent across panel data waves. (Bauer 2015)
- We use this data because we can discuss several identification strategies relying on the same dataset (+ I know it)
7.9.2 Data
- Data and files can be directly loaded with the command given below or downloaded from the data folder.
data-selection-on-observables.csv contains a subset of the Bauer (2015) data that we will use for our exercise (Reproduction files). We’ll use this data to discuss selection-on-observables and also panel data later on.
The individual-level dataset covers victimization (experiencing threats), trust (generalized trust), and various covariates for the period from 2005 to 2008. Below is a description, where * is replaced with the corresponding year. Analogous to our theoretical sessions, treatment variables are generally named d_..., outcome variables y_..., and covariates x_....

- y_trust*: Generalized trust (0-10) at t (Outcome Y)
- d_threat*: Experiencing a threat (0,1) in the year before t (Treatment D)
- x_age*: Age measured at t
- x_male*: Gender at t (Male = 1, Female = 0)
- x_education*: Level of education (0-10) at t
- x_income*: Income, categorical (0-3), at t

Q: The data is in wide format. What does that look like?
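To make the wide-format question concrete, here is a toy sketch (hypothetical mini data; the variable names only mimic the real ones) that converts wide to long with base R’s reshape():

```r
# Wide format: one row per person, one column per year (hypothetical values)
wide <- data.frame(idpers = 1:2,
                   y_trust2006 = c(6, 8),
                   y_trust2007 = c(7, 5))

# Long format: one row per person-year
long <- stats::reshape(wide, direction = "long",
                       varying = c("y_trust2006", "y_trust2007"),
                       v.names = "y_trust", timevar = "year",
                       times = c(2006, 2007), idvar = "idpers")

# Wide has 2 rows (one per person); long has 4 rows (2 persons x 2 years)
nrow(wide)
nrow(long)
```

In our dataset the year suffix on each column plays the role of the time index; the panel-data methods we discuss later typically want the long version.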
7.9.3 Summary Statistics
Below are summary statistics of the data in Table 7.1.
# Load the stargazer package for summary tables
library(stargazer)

# Directly import data from shared google folder into R
data <- readr::read_csv(sprintf("https://docs.google.com/uc?id=%s&export=download", "1WotERFf-yhnKzo48VlCOiJMfutxXyvfT"))
# Or download and import with: data <- readr::read_csv("data-selection-on-observables.csv")

# Create a summary table
stargazer(data.frame(data),
          type = "html",
          summary = TRUE,
          title = "(#tab:selectiononobservables4)Summary statistics")
Statistic | N | Mean | St. Dev. | Min | Pctl(25) | Pctl(75) | Max |
y_trust2005 | 6,515 | 6.073 | 2.378 | 0.000 | 5.000 | 8.000 | 10.000 |
y_trust2006 | 6,638 | 6.131 | 2.302 | 0.000 | 5.000 | 8.000 | 10.000 |
y_trust2007 | 6,959 | 6.145 | 2.313 | 0.000 | 5.000 | 8.000 | 10.000 |
y_trust2008 | 6,875 | 6.215 | 2.286 | 0.000 | 5.000 | 8.000 | 10.000 |
x_age2005 | 11,159 | 38.002 | 21.593 | 0.000 | 19.000 | 54.000 | 95.000 |
x_age2006 | 10,859 | 38.522 | 21.765 | 0.000 | 19.000 | 55.000 | 96.000 |
x_age2007 | 11,000 | 39.609 | 21.941 | 0.000 | 20.000 | 56.000 | 95.000 |
x_age2008 | 10,879 | 40.042 | 21.976 | 0.000 | 20.000 | 57.000 | 96.000 |
x_male2005 | 11,164 | 0.489 | 0.500 | 0.000 | 0.000 | 1.000 | 1.000 |
x_male2006 | 10,863 | 0.487 | 0.500 | 0.000 | 0.000 | 1.000 | 1.000 |
x_male2007 | 11,002 | 0.486 | 0.500 | 0.000 | 0.000 | 1.000 | 1.000 |
x_male2008 | 10,889 | 0.484 | 0.500 | 0.000 | 0.000 | 1.000 | 1.000 |
d_threat2005 | 6,547 | 0.089 | 0.285 | 0.000 | 0.000 | 0.000 | 1.000 |
d_threat2006 | 6,666 | 0.100 | 0.301 | 0.000 | 0.000 | 0.000 | 1.000 |
d_threat2007 | 6,988 | 0.097 | 0.297 | 0.000 | 0.000 | 0.000 | 1.000 |
d_threat2008 | 6,903 | 0.105 | 0.307 | 0.000 | 0.000 | 0.000 | 1.000 |
x_education2005 | 10,271 | 4.306 | 3.172 | 0.000 | 1.000 | 6.000 | 10.000 |
x_education2006 | 10,038 | 4.302 | 3.212 | 0.000 | 1.000 | 6.000 | 10.000 |
x_education2007 | 10,181 | 4.358 | 3.223 | 0.000 | 1.000 | 7.000 | 10.000 |
x_education2008 | 10,155 | 4.379 | 3.245 | 0.000 | 1.000 | 7.000 | 10.000 |
x_income2005 | 4,363 | 1.410 | 1.139 | 0.000 | 0.000 | 3.000 | 3.000 |
x_income2006 | 4,406 | 1.409 | 1.143 | 0.000 | 0.000 | 3.000 | 3.000 |
x_income2007 | 4,681 | 1.394 | 1.158 | 0.000 | 0.000 | 3.000 | 3.000 |
x_income2008 | 4,645 | 1.439 | 1.167 | 0.000 | 0.000 | 3.000 | 3.000 |
idpers | 23,243 | 11,622.000 | 6,709.820 | 1 | 5,811.5 | 17,432.5 | 23,243 |
7.9.4 Descriptive exploration
For now we focus on data from a single year, namely 2006.
summary(data$y_trust2006)

Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. | NA’s |
---|---|---|---|---|---|---|
0 | 5 | 7 | 6.131365 | 8 | 10 | 16605 |
- Q: What are the median and the mean? What does the distribution look like when the median lies to the right of the mean?
summary(data$d_threat2006)

Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. | NA’s |
---|---|---|---|---|---|---|
0 | 0 | 0 | 0.10036 | 0 | 1 | 16577 |
# Relative frequencies of trust values
round(prop.table(table(data$y_trust2006)), 2)

0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
---|---|---|---|---|---|---|---|---|---|---|
0.05 | 0.01 | 0.03 | 0.04 | 0.06 | 0.19 | 0.13 | 0.2 | 0.2 | 0.05 | 0.05 |
# Contingency: Absolute frequencies
table(data$d_threat2006, data$y_trust2006)

| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 259 | 36 | 135 | 214 | 320 | 1142 | 782 | 1228 | 1193 | 326 | 331 |
1 | 44 | 6 | 37 | 56 | 48 | 139 | 70 | 114 | 101 | 27 | 25 |
# Contingency: Relative frequencies (shares)
round(prop.table(table(data$d_threat2006, data$y_trust2006)), 2)
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.04 | 0.01 | 0.02 | 0.03 | 0.05 | 0.17 | 0.12 | 0.19 | 0.18 | 0.05 | 0.05 |
1 | 0.01 | 0.00 | 0.01 | 0.01 | 0.01 | 0.02 | 0.01 | 0.02 | 0.02 | 0.00 | 0.00 |
7.9.5 Naive estimate of the ATE
First, let us simply compare the mean trust of non-victims with that of victims. The difference is the naive estimate of the ATE.
# Mean trust among non-victims (d = 0) and victims (d = 1)
mean(data$y_trust2006[data$d_threat2006 == 0], na.rm = TRUE)
## [1] 6.204157
mean(data$y_trust2006[data$d_threat2006 == 1], na.rm = TRUE)
## [1] 5.478261
We can also plot this. In Figure 7.1 red points are the means, transparent points are observations.
# Requires the plotly package
library(plotly)

# Add jitter so overlapping observations become visible
data$d_threat2006.jitter <- jitter(data$d_threat2006, factor = 0.3)
data$y_trust2006.jitter <- jitter(data$y_trust2006)

plot_ly(data = data, x = ~d_threat2006.jitter, y = ~y_trust2006.jitter,
        type = "scatter", mode = "markers", hoverinfo = "none",
        marker = list(color = '#000000',
                      size = 2,
                      opacity = 0.05,
                      symbol = 'circle',
                      line = list(color = '#000000',
                                  width = 4)),
        name = "Observations") %>%
  layout(xaxis = list(range = list(-1,2),
                      tickvals = c(0,1),
                      zeroline = FALSE,
                      title = "Victimization 2006"),
         yaxis = list(range = list(-1,11),
                      tickvals = c(0,1,2,3,4,5,6,7,8,9,10),
                      zeroline = FALSE,
                      title = "Trust 2006")) %>%
add_markers(x = c(0,1),
y = c(mean(data$y_trust2006[data$d_threat2006==0], na.rm = TRUE),
mean(data$y_trust2006[data$d_threat2006==1], na.rm = TRUE)),
marker = list(color = 'red',
size = 4,
opacity = 1,
symbol = 'circle',
line = list(color = 'red',
width = 4)),
name = "Means",
hoverinfo = "text"
)
- Q: What is jitter? Why did I add some to the visualization?
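To see what jitter() does, here is a minimal sketch on a toy vector (not the real data): it adds a small amount of uniform random noise, so points piled up on the values 0 and 1 spread out slightly instead of overplotting each other.

```r
set.seed(1)  # make the random noise reproducible
x <- rep(c(0, 1), each = 5)        # ten observations piled up on two values
x.jitter <- jitter(x, factor = 0.3)

# The jittered values deviate only slightly from 0 and 1, but no two
# observations sit on exactly the same spot anymore
round(x.jitter, 3)
```

In the plot above this is purely cosmetic: the jittered copies are used only for plotting, while all calculations use the original variables.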
Instead of comparing means we can also run a simple regression with the function lm(), using d_threat2006 as the sole explanatory variable: \(\text{y\_trust}_{i} = \beta_{0} + \beta_{1} \text{d\_threat}_{i} + \epsilon_{i}\).
fit <- lm(y_trust2006 ~ d_threat2006, data = data)
summary(fit)
##
## Call:
## lm(formula = y_trust2006 ~ d_threat2006, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.2042 -1.2042 0.7958 1.7958 4.5217
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.20416 0.02968 209.021 < 2e-16 ***
## d_threat2006 -0.72590 0.09360 -7.755 1.02e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.293 on 6631 degrees of freedom
## (16610 observations deleted due to missingness)
## Multiple R-squared: 0.008988, Adjusted R-squared: 0.008839
## F-statistic: 60.14 on 1 and 6631 DF, p-value: 1.015e-14
# Access coefficients with fit$coeff[[1]]
# For other components see ?lm
# Nice output: stargazer(fit, type = "html")
- Q: What is the intercept 6.2041569 in this regression?
- Q: What is the coefficient -0.725896 and how does it relate to the intercept?
- Tip: Think of the two means we calculated before.
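To see the relation, here is a self-contained sketch with made-up numbers: regressing y on a binary d reproduces the two group means, with the intercept equal to the d = 0 mean and the coefficient equal to the difference in means.

```r
# Toy data (hypothetical): binary treatment d, outcome y
d <- c(0, 0, 0, 1, 1, 1)
y <- c(5, 6, 7, 3, 4, 5)

fit <- lm(y ~ d)
b0 <- coef(fit)[[1]]   # intercept
b1 <- coef(fit)[[2]]   # coefficient of d

# Intercept = mean of the d = 0 group: 6
# Intercept + coefficient = mean of the d = 1 group: 4
c(b0, b0 + b1)
```

The same logic applies above: the intercept is the mean trust of non-victims, and adding the coefficient gives the mean trust of victims.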
7.9.6 Controlling/Conditioning
Let’s assume selection on observables: we observe the set of covariates such that treatment assignment is random conditional on these covariates. [Don’t forget that we assume that D and any Xs occur before Y. Q: Why does this ordering matter?]
Then we can identify the causal effect by controlling for those covariates.
For the model below we assume that assignment of victimization (d_threat2006) is random (conditionally independent) controlling for x_age2006 and x_male2006. We rerun the regression, adding those covariates to the model in Table 7.2.
fit <- lm(y_trust2006 ~ d_threat2006 + x_age2006 + x_male2006, data = data)
# stargazer(fit, type="html")
Dependent variable: | |
y_trust2006 | |
d_threat2006 | -0.715*** |
(0.095) | |
x_age2006 | 0.001 |
(0.002) | |
x_male2006 | -0.036 |
(0.057) | |
Constant | 6.187*** |
(0.084) | |
Observations | 6,633 |
R2 | 0.009 |
Adjusted R2 | 0.009 |
Residual Std. Error | 2.293 (df = 6629) |
F Statistic | 20.251*** (df = 3; 6629) |
Note: | *p<0.1; **p<0.05; ***p<0.01 |
Nice, we found something. And given our selection-on-observables assumption is valid, we would interpret our estimate as the causal effect of victimization on trust.
Q: Can we stop here? Do you think the assumption is valid?
Q: Imagine we had found no effect. Should we stop our investigation? Why or why not?
7.9.7 Controlling: Conceptually
Conceptually: When we control, we divide the sample into subsets. Namely subsets pertaining to values of one (or several) covariates. Remember our discussion of variables/joint distributions/cells at the beginning and the notion of “filtering”! (See Section Data: (Empirical) Joint distributions).
For instance, the variable x_income2006 divides the sample into 4 different income groups. Let’s re-estimate the model in those subgroups and check the results in Table 7.3.
fit <- lm(y_trust2006 ~ d_threat2006 + x_income2006, data = data)
fit1 <- lm(y_trust2006 ~ d_threat2006, data = data[data$x_income2006==0,])
fit2 <- lm(y_trust2006 ~ d_threat2006, data = data[data$x_income2006==1,])
fit3 <- lm(y_trust2006 ~ d_threat2006, data = data[data$x_income2006==2,])
fit4 <- lm(y_trust2006 ~ d_threat2006, data = data[data$x_income2006==3,])
stargazer(fit, fit1, fit2, fit3, fit4, type = "html",
column.labels = c("All", "Income = 0", "Income = 1",
"Income = 2", "Income = 3"),
title = "(#tab:selectiononobservables12)Results: Subsets of income groups",
omit.stat=c("f", "ser"))
Dependent variable: | |||||
y_trust2006 | |||||
All | Income = 0 | Income = 1 | Income = 2 | Income = 3 | |
(1) | (2) | (3) | (4) | (5) | |
d_threat2006 | -0.788*** | -1.007*** | -0.554*** | -0.963*** | -0.637*** |
(0.103) | (0.190) | (0.209) | (0.241) | (0.189) | |
x_income2006 | 0.112*** | ||||
(0.029) | |||||
Constant | 6.194*** | 6.324*** | 6.184*** | 6.277*** | 6.628*** |
(0.054) | (0.070) | (0.070) | (0.079) | (0.058) | |
Observations | 4,394 | 1,229 | 1,246 | 808 | 1,111 |
R2 | 0.017 | 0.022 | 0.006 | 0.019 | 0.010 |
Adjusted R2 | 0.017 | 0.022 | 0.005 | 0.018 | 0.009 |
Note: | *p<0.1; **p<0.05; ***p<0.01 |
Take the estimates from the subgroups (-1.0065519, -0.5536807, -0.9630548, -0.6372947) and calculate their mean: -0.7901455. That’s similar to the estimate we got pooling all income groups (Column 1): -0.7881753.
We can also visualize those subgroups.
for (i in 0:3){
d.temp <- data[data$x_income2006==i,]
means.temp <- data.frame(x = c(0,1),
y = c(mean(d.temp$y_trust2006[d.temp$d_threat2006==0], na.rm = TRUE),
mean(d.temp$y_trust2006[d.temp$d_threat2006==1], na.rm = TRUE)))
d.temp$d_threat2006.jitter <- jitter(d.temp$d_threat2006, factor = 0.3)
d.temp$y_trust2006.jitter <- jitter(d.temp$y_trust2006)
assign(paste("p", i, sep = ""),
plot_ly(data = d.temp,
x = d.temp$d_threat2006.jitter,
y = d.temp$y_trust2006.jitter,
type = "scatter", mode = "markers",
marker = list(color = '#000000',
size = 2,
opacity = 0.05,
symbol = 'circle',
line = list(color = '#000000',
width = 4)),
name = "Observations",
hoverinfo = "none") %>%
layout(xaxis = list(range = list(-1,2),
tickvals = c(0,1),
zeroline = FALSE,
title = paste("Threat 2006 (jitt.)", " [income = ", i+1 ,"]", sep="")),
yaxis = list(range = list(-1,11),
tickvals = c(0,1,2,3,4,5,6,7,8,9,10),
zeroline = FALSE,
title = "Trust 2006 (jitt.)")) %>%
add_markers(data = means.temp,
x = ~x,
y = ~y,
marker = list(color = 'red',
size = 4,
opacity = 1,
symbol = 'circle',
line = list(color = 'red',
width = 4)),
name = "Means",
text = paste(round(means.temp$y, 2)),
hoverinfo = "text"
)
)
}
p.sm <- subplot(p0, p1, p2, p3, nrows=2, shareX = FALSE, shareY = FALSE,
titleX = TRUE, titleY = TRUE, margin = c(0.1,0.1,0.1,0.1)) %>%
layout(showlegend = FALSE)
p.sm %>% config(p = ., displayModeBar = FALSE)
We don’t discuss in detail how a multiple regression coefficient is estimated/calculated (it’s a weighted average across cells). But keep in mind the idea of subsets: values of variables slice the joint distribution into groups.
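The “weighted average across cells” claim can be illustrated with simulated data (hypothetical numbers; this is the standard result for a binary treatment with saturated group dummies): the pooled coefficient equals a weighted average of the within-cell difference-in-means, with weights proportional to n_g * p_g * (1 - p_g), where p_g is the treatment share in cell g.

```r
set.seed(42)
# Three cells (think: income groups) with different sizes and treatment shares
g <- rep(1:3, times = c(100, 200, 300))
d <- rbinom(length(g), 1, prob = c(0.2, 0.5, 0.8)[g])
y <- 2 * d + g + rnorm(length(g))   # true effect of d is 2

# Pooled regression controlling for cell membership via dummies
b_pooled <- coef(lm(y ~ d + factor(g)))[["d"]]

# Within-cell difference-in-means and the regression weights
tau <- sapply(1:3, function(k)
  mean(y[g == k & d == 1]) - mean(y[g == k & d == 0]))
w <- sapply(1:3, function(k) {
  p <- mean(d[g == k])
  sum(g == k) * p * (1 - p)
})

# The pooled coefficient equals the weighted average of the cell effects
all.equal(b_pooled, sum(w * tau) / sum(w))
```

Note that these weights differ from the equal weights we used when averaging the four subgroup estimates above, which is why the two numbers were close but not identical.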
References
Bauer, Paul C. 2015. “Negative Experiences and Trust: A Causal Analysis of the Effects of Victimization on Generalized Trust.” European Sociological Review 31 (4): 397–417.