0. Recommendation

This is totally a personal opinion, but I recommend using R Markdown when you write up solutions that contain R code. R Markdown is an authoring framework that allows you to easily make reproducible documents in various formats (e.g., PDF (via LaTeX), HTML, Word, PowerPoint), and a whole bunch of free official resources are available online.

There are also many useful extensions (e.g., bookdown, rticles, blogdown). BUT, you shouldn't feel forced to use R Markdown for the submission. It's true that the initial learning or switching cost is not small, and I really want you to focus on the materials rather than be distracted by the oft-frustrating learning process. Any format is fine (as long as I can tell what's written). But if you are going to do serious research using R in the future, R Markdown is definitely one of the best tools available, if not the best.

1. Data for this session


We will use a smaller dataset resampled from the original dataset for Problem Set 2. Do NOT use this smaller dataset for your actual solutions.


CLICK HERE TO DOWNLOAD THE SAMPLE DATA




Import the data. This sample data is a bit different from the original one; you might need a small cleaning step for the original.

mydata <- read.csv("sample.csv")
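What cleaning the original file needs depends on how it is organized. Purely as a hypothetical sketch (the column name treatment, the file name, and the labels below are assumptions, not the actual ones), turning a single treatment factor into the four dummies used in this session might look like:

orig <- read.csv("original.csv") # hypothetical filename
# convert an assumed single factor column into 0/1 dummies
orig$civic     <- as.numeric(trimws(orig$treatment) == "Civic Duty")
orig$hawthorne <- as.numeric(trimws(orig$treatment) == "Hawthorne")
orig$self      <- as.numeric(trimws(orig$treatment) == "Self")
orig$neighbor  <- as.numeric(trimws(orig$treatment) == "Neighbors")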




2. (a) Balance check


Basically, there are two ways:

  1. Calculate the means and SDs of pre-treatment covariates by control and treatment groups.
  2. Regress the treatment status on pre-treatment covariates.

Here, let me show you how to do the latter. It's simple: just use the lm function. (A sketch of the former appears at the end of this subsection.)

fit1 <- lm(civic ~ female + yob, mydata)
fit2 <- lm(hawthorne ~ female + yob, mydata)
fit3 <- lm(self ~ female + yob, mydata)
fit4 <- lm(neighbor ~ female + yob, mydata)

stargazer::stargazer(fit1, fit2, fit3, fit4,
                     type = "html", # change to "latex" or "text"
                     title = "Table 1: Balance Check",
                     keep.stat = "N")
Table 1: Balance Check
=========================================================
                        Dependent variable:
              -------------------------------------------
               civic    hawthorne    self       neighbor
                (1)        (2)        (3)         (4)
---------------------------------------------------------
female        -0.015      0.014     -0.004      -0.011
              (0.011)    (0.011)    (0.011)     (0.011)
yob           -0.0003    -0.001*    -0.00001    -0.0003
              (0.0004)   (0.0004)   (0.0004)    (0.0004)
Constant       0.740      1.424*     0.123       0.657
              (0.776)    (0.773)    (0.769)     (0.755)
---------------------------------------------------------
Observations   3,000      3,000      3,000       3,000
=========================================================
Note:                    *p<0.1; **p<0.05; ***p<0.01



Be careful when you interpret the results. Usually we "stargaze" (ah-oh!), but not when we check balance: significance stars on pre-treatment covariates are evidence of imbalance, so here no stars is good news.[2]
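If you prefer the first approach instead, a minimal sketch with base R (aggregate; a tidyverse group_by/summarise does the same job):

# means of pre-treatment covariates by civic-duty status
aggregate(cbind(female, yob) ~ civic, data = mydata, FUN = mean)
# and the corresponding SDs
aggregate(cbind(female, yob) ~ civic, data = mydata, FUN = sd)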


3. (b) Observed difference in means


Just calculate. R is really good at calculation, way better than me. Just in case, you might find the following functions useful. Of course, if you prefer a more "modern" way to calculate (e.g., using the tidyverse), that's totally fine. But please don't do crazy things such as unnecessarily wrapping C/C++ or Python to show off your computing ability ;-).

mean(mydata$voted)  # average
subset(mydata, civic == 1) # subset data: `subset(original data, conditions)`
subset(mydata, civic == 1 | hawthorne == 1) # OR (logical disjunction)
subset(mydata, civic == 1 & hawthorne == 1) # AND (logical conjunction)
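For example, a sketch of the observed difference in means for the civic-duty mailing (what counts as the "control" group is for you to decide from the problem; below I assume it is the households that received none of the four mailings):

treated <- subset(mydata, civic == 1)
control <- subset(mydata, civic == 0 & hawthorne == 0 & self == 0 & neighbor == 0)
mean(treated$voted) - mean(control$voted) # observed difference in means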




4. (c) Standard errors


Recall that \[ \widehat{\text{var}(\tilde{\tau} \mid \mathcal{O})} = \frac{S_{1}^2}{n_1} + \frac{S_{0}^2}{n_0}. \] Calculate it following the definition. Don’t forget to take…?
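A sketch following the definition, reusing the treated and control subsets from (b) (var uses the n - 1 denominator, matching the sample variances in the formula):

n1 <- nrow(treated)
n0 <- nrow(control)
v_hat <- var(treated$voted) / n1 + var(control$voted) / n0 # estimated variance
sqrt(v_hat) # ...and don't forget this step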



5. (d) Regression 1


WE ALWAYS REGRESS TO REGRESSION! Regression is great. It saves our lives. It saves the whole world.



For the standard errors, we can use the vcovHC function in the sandwich package, which automatically estimates the heteroskedasticity-consistent variance-covariance matrix from regression results. Astute attendees must have noticed that vcovHC stands for Heteroskedasticity-Consistent Variance-COVariance matrix estimation.

result <- lm(voted ~ civic, mydata)

library(sandwich)
vcm <- vcovHC(result, type = "HC2") # estimates the variance-covariance matrix
v <- diag(vcm) # we only need the diagonal elements
seHC <- sqrt(v) # take the square root

stargazer::stargazer(result, type = "html",
                     se = list(seHC),
                     keep.stat = c("N"),
                     title = "Table 2: Linear Regression Result",
                     digits = 5)
Table 2: Linear Regression Result
============================================
                     Dependent variable:
                 ---------------------------
                           voted
--------------------------------------------
civic                    -0.01869
                         (0.02682)
Constant                  0.32408***
                         (0.00907)
--------------------------------------------
Observations               3,000
============================================
Note:            *p<0.1; **p<0.05; ***p<0.01



6. (e) Regression 2


Just the same as (d) with more independent variables.
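For instance, a sketch (which covariates to include depends on the problem; the ones below are just a guess, not the required specification):

result2 <- lm(voted ~ civic + female + yob, mydata)
seHC2 <- sqrt(diag(vcovHC(result2, type = "HC2"))) # HC2 SEs, as in (d)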

7. (f) What is the “baseline”?


Let me give you another example. Say you have data on individuals' annual wages and education. Assume, for the sake of simplicity, that the education variable only tells us whether individuals stopped after college or also went on to graduate school. Namely, it has three categories: high school diploma or lower, college degree, and graduate degree. If you are interested in the wage returns to graduate education, which of the following models is more appropriate (here, assume there is no concern about endogeneity)? Why?

  • \(\log wage = \beta_0 + \beta_1 graduate + \varepsilon\)
  • \(\log wage = \beta_0 + \beta_1 college + \beta_2 graduate + \varepsilon\)
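In R, by the way, you rarely construct such dummies by hand: give lm a factor and it drops one level as the baseline, and relevel lets you choose which. A sketch with a hypothetical data frame wages whose educ column has levels "HS", "College", and "Graduate":

wages$educ <- relevel(factor(wages$educ), ref = "HS") # "HS" becomes the baseline
lm(log(wage) ~ educ, data = wages)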

8. (g) What is the formal “baseline”?


Don’t confuse the two different baselines.

9. (h) Sandwich again


My hypothesis is that many people attending this session will have a sandwich for lunch. Again, we can use the sandwich package; this time we use the vcovCL function. For comparison, let us line up three columns with different SEs.

se <- sqrt(diag(vcov(result))) # usual SEs
seCL <- sqrt(diag(vcovCL(result, cluster = ~ hh_id))) # cluster-robust SEs by household

stargazer::stargazer(result, result, result,
                     type = "html",
                     se = list(se, seHC, seCL),
                     keep.stat = c("N"),
                     title = "Table 3: Different SEs",
                     add.lines = list(c("SE type", "Normal", "HC2", "CL")),
                     digits = 5)
Table 3: Different SEs
===========================================================
                         Dependent variable:
               --------------------------------------------
                               voted
                  (1)            (2)            (3)
-----------------------------------------------------------
civic          -0.01869       -0.01869       -0.01869
               (0.02713)      (0.02682)      (0.02695)
Constant        0.32408***     0.32408***     0.32408***
               (0.00905)      (0.00907)      (0.00908)
-----------------------------------------------------------
SE type         Normal          HC2            CL
Observations    3,000          3,000          3,000
===========================================================
Note:                     *p<0.1; **p<0.05; ***p<0.01



Observe that \(\text{"Standard" SEs} < \text{HC SEs} < \text{CL SEs}\) in this case. How can this possibly affect the conclusions you draw from statistical tests? How does it actually turn out?
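To see for yourself, the coeftest function from the lmtest package reruns the t-tests under any variance estimator you hand it:

library(lmtest)
coeftest(result)                                           # classical SEs
coeftest(result, vcov = vcovHC(result, type = "HC2"))      # HC2 SEs
coeftest(result, vcov = vcovCL(result, cluster = ~ hh_id)) # clustered SEs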

10. Desserts


Sometimes we see a different spelling for "heteroskedasticity" or "homoskedasticity": "heteroscedasticity" or "homoscedasticity". There used to be a controversy over which spelling is correct, until McCulloch (1985) (in Econometrica!) pointed out that "-skedasticity" is correct in terms of etymology. The "-skedastic" part comes from the Greek word for "scatter" (\(\sigma\kappa\varepsilon\delta\acute{\alpha}\nu\nu\upsilon\mu\iota\)), and in the English language the counterpart of the Greek letter "\(\kappa\)" is "k", so "-skedasticity" should be the right one. There is also a fascinating paper that studies how the different spellings have been used in the academic literature by counting their frequency with the Google Books Ngram Viewer (see Paloyo (2011)).


1. Graduate School of Economics, Waseda University.

2. Here, we use the stargazer function from the stargazer package. The package is actually a little obsolete, but we use it anyway.