0. Recommendation

This is purely my personal opinion, but I recommend using R Markdown when you write up solutions that contain R code. R Markdown is an authoring framework that lets you easily make reproducible documents in various formats (e.g., PDF (via LaTeX), HTML, Word, PowerPoint). Plenty of free official resources are available online.

There are also many useful extensions (e.g., bookdown, rticles, blogdown). BUT you don’t have to feel forced to use R Markdown for your submission. It’s true that the initial learning or switching cost is not small, and I really want you to focus on the material rather than be distracted by the oft-frustrating learning process. Any format is fine (as long as I can tell what’s written). But if you are going to do serious research using R in the future, R Markdown is definitely one of the best tools to use, if not the best.

1. Data for this session

We will use a smaller dataset resampled from the original dataset for Problem Set 2. Do NOT use this smaller dataset for your actual solutions.


Import the data. This sample data is a bit different from the original, so you might need to do a little cleaning on the original.

mydata <- read.csv("sample.csv")

2. (a) Balance check

Basically, there are two ways:

  1. Calculate the means and SDs of pre-treatment covariates by control and treatment groups.
  2. Regress the treatment status on pre-treatment covariates.

Here, let me show you how to do the latter. It’s simple. Just use the lm function.

fit1 <- lm(civic ~ female + yob, mydata)
fit2 <- lm(hawthorne ~ female + yob, mydata)
fit3 <- lm(self ~ female + yob, mydata)
fit4 <- lm(neighbor ~ female + yob, mydata)

stargazer::stargazer(fit1, fit2, fit3, fit4,
                     type = "html", # change to "latex" or "text"
                     title = "Table 1: Balance Check",
                     keep.stat = "N")
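Table 1: Balance Check
==========================================================
                          Dependent variable:
              --------------------------------------------
              civic      hawthorne  self       neighbor
              (1)        (2)        (3)        (4)
----------------------------------------------------------
female        -0.015     0.014      -0.004     -0.011
              (0.011)    (0.011)    (0.011)    (0.011)
yob           -0.0003    -0.001*    -0.00001   -0.0003
              (0.0004)   (0.0004)   (0.0004)   (0.0004)
Constant      0.740      1.424*     0.123      0.657
              (0.776)    (0.773)    (0.769)    (0.755)
----------------------------------------------------------
Observations  3,000      3,000      3,000      3,000
==========================================================
Note: *p<0.1; **p<0.05; ***p<0.01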

Be careful when you interpret the results. Usually, we “stargaze” (ah-oh!), but we don’t when we want to check the balance.2
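For the first approach listed above (means and SDs by group), here is a minimal sketch on made-up data. The data frame `toy` and the single indicator `treat` are hypothetical; the actual data has several treatment columns, so you would need to adapt the grouping yourself.

```r
# Toy data for illustration only (NOT the problem-set data).
# `toy` and `treat` are hypothetical names.
set.seed(1)
toy <- data.frame(
  treat  = rep(c(0, 1), each = 50),                  # 0 = control, 1 = treated
  female = rbinom(100, 1, 0.5),                      # pre-treatment covariate
  yob    = sample(1930:1980, 100, replace = TRUE)    # pre-treatment covariate
)

# Mean and SD of each covariate within each group
bal_mean <- aggregate(cbind(female, yob) ~ treat, data = toy, FUN = mean)
bal_sd   <- aggregate(cbind(female, yob) ~ treat, data = toy, FUN = sd)

bal_mean
bal_sd
```

Each result has one row per group, so you can compare the control and treated rows side by side.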

3. (b) Observed difference in means

Just calculate. R is really good at calculation. Way better than me. Just in case, you might find the following functions useful. Of course, if you prefer a more “modern” way to calculate (e.g., using tidyverse), that’s totally fine. But please don’t do crazy things such as unnecessarily wrapping C/C++ or Python to show off your computing ability ;-).

mean(mydata$voted)  # average
subset(mydata, civic == 1) # keep rows meeting a condition: `subset(data, condition)`
subset(mydata, civic == 1 | hawthorne == 1) # OR (logical disjunction)
subset(mydata, civic == 1 & hawthorne == 1) # AND (logical conjunction)
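To see how these functions fit together, here is a small sketch on made-up numbers. The data frame `toy` and the columns `treat` and `y` are hypothetical, and it assumes a single treatment indicator, which is simpler than the actual data.

```r
# Toy data for illustration only (NOT the problem-set data).
# `toy`, `treat`, and `y` are hypothetical names.
toy <- data.frame(
  treat = c(1, 1, 1, 0, 0, 0),   # 1 = treated, 0 = control
  y     = c(1, 0, 1, 0, 0, 1)    # binary outcome
)

treated <- subset(toy, treat == 1)
control <- subset(toy, treat == 0)

# Observed difference in means: treated average minus control average
diff_means <- mean(treated$y) - mean(control$y)
diff_means
```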

4. (c) Standard errors

Recall that \[ \widehat{\text{var}(\tilde{\tau} \mid \mathcal{O})} = \frac{S_{1}^2}{n_1} + \frac{S_{0}^2}{n_0}. \] Calculate it following the definition. Don’t forget to take…?
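As a rough sketch, the formula maps onto R like this. The vectors `y1` and `y0` are made-up outcomes, and the sketch assumes that \(S_1^2\) and \(S_0^2\) are the usual sample variances with an \(n - 1\) denominator, which is what `var()` computes.

```r
# Made-up outcomes for illustration only; `y1` and `y0` are hypothetical.
y1 <- c(1, 0, 1, 1, 0)   # outcomes in the treated group
y0 <- c(0, 0, 1, 0)      # outcomes in the control group

n1 <- length(y1)
n0 <- length(y0)

# Assumption: S^2 is the sample variance (n - 1 denominator), as in var()
var_hat <- var(y1) / n1 + var(y0) / n0
var_hat
```

This gives the estimated variance only; the remaining step that the text hints at still applies.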

5. (d) Regression 1

WE ALWAYS REGRESS TO REGRESSION! Regression is great. It saves our lives. It saves the whole world.