0. Recommendation

This is purely my personal opinion, but I recommend using R Markdown when you write up solutions that contain R code. R Markdown is an authoring framework that lets you easily produce reproducible documents in various formats (e.g., PDF (via LaTeX), HTML, Word, PowerPoint). Plenty of free official resources are available online.

There are also many useful extensions (e.g., bookdown, rticles, blogdown). BUT you don’t have to feel forced to use R Markdown for your submission. The initial learning or switching cost is not small, and I really want you to focus on the material rather than be distracted by an often frustrating learning process. Any format is fine (as long as I can tell what’s written). But if you are going to do serious research using R in the future, R Markdown is definitely one of the best tools to use, if not the best.

1. Data for this session


We will use a smaller dataset resampled from the original dataset for Problem Set 2. Do NOT use this smaller dataset for your actual solutions.


CLICK HERE TO DOWNLOAD THE SAMPLE DATA




Import the data. This sample data is a bit different from the original one; the original may need a little cleaning first.

mydata <- read.csv("sample.csv")




2. (a) Balance check


Basically, there are two ways:

  1. Calculate the means and SDs of pre-treatment covariates by control and treatment groups.
  2. Regress the treatment status on pre-treatment covariates.

Here, let me show you how to do the latter. It’s simple. Just use the lm function.

fit1 <- lm(civic ~ female + yob, mydata)
fit2 <- lm(hawthorne ~ female + yob, mydata)
fit3 <- lm(self ~ female + yob, mydata)
fit4 <- lm(neighbor ~ female + yob, mydata)

stargazer::stargazer(fit1, fit2, fit3, fit4,
                     type = "html", # change to "latex" or "text"
                     title = "Table 1: Balance Check",
                     keep.stat = "N")
Table 1: Balance Check

                       Dependent variable:
               civic     hawthorne    self       neighbor
               (1)       (2)          (3)        (4)
female         -0.015    0.014        -0.004     -0.011
               (0.011)   (0.011)      (0.011)    (0.011)
yob            -0.0003   -0.001*      -0.00001   -0.0003
               (0.0004)  (0.0004)     (0.0004)   (0.0004)
Constant       0.740     1.424*       0.123      0.657
               (0.776)   (0.773)      (0.769)    (0.755)
Observations   3,000     3,000        3,000      3,000

Note: *p<0.1; **p<0.05; ***p<0.01



Be careful when you interpret the results. Usually, we “stargaze” (uh-oh!), but we don’t when we want to check the balance.2
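For the first approach (means and SDs by group), here is a minimal sketch. It runs on a simulated toy data frame, not the course data: `toy` and its values are made up, and only the column names `civic`, `female`, and `yob` are borrowed from the regressions above. In the real data you would also need to decide which rows count as the control group.

```r
# Simulated stand-in for the real data (an assumption for illustration only).
set.seed(1)
n <- 200
toy <- data.frame(
  civic  = rbinom(n, 1, 0.5),                       # hypothetical treatment indicator
  female = rbinom(n, 1, 0.5),                       # hypothetical covariate
  yob    = sample(1940:1985, n, replace = TRUE)     # hypothetical year of birth
)

# Mean and SD of each pre-treatment covariate, by treatment status.
balance <- aggregate(
  cbind(female, yob) ~ civic,
  data = toy,
  FUN  = function(x) c(mean = mean(x), sd = sd(x))
)
print(balance)
```

Each covariate comes back as a small matrix with a `mean` and an `sd` column, one row per group, so you can compare the two rows side by side.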


3. (b) Observed difference in means


Just calculate. R is really good at calculation. Way better than me. Just in case, you might find the following functions useful. Of course, if you prefer a more “modern” way to calculate (e.g., using tidyverse), that’s totally fine. But please don’t do crazy things such as unnecessarily wrapping C/C++ or Python to show off your computing ability ;-).

mean(mydata$voted)  # average
subset(mydata, civic == 1) # subset rows: subset(original data, conditions)
subset(mydata, civic == 1 | hawthorne == 1) # OR (logical disjunction)
subset(mydata, civic == 1 & hawthorne == 1) # AND (logical conjunction)
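Putting these pieces together, a difference in means might look like the sketch below. It uses simulated data, so the numbers mean nothing: `toy` is a made-up data frame, and treating `civic == 0` as the control group is an assumption for illustration. In the real data, define the control group as the problem set asks.

```r
# Simulated stand-in for the real data (an assumption for illustration only).
set.seed(2)
n <- 200
toy <- data.frame(civic = rbinom(n, 1, 0.5))            # hypothetical treatment indicator
toy$voted <- rbinom(n, 1, 0.3 + 0.05 * toy$civic)       # hypothetical binary outcome

# Split into groups with subset(), then compare the group averages.
treated <- subset(toy, civic == 1)
control <- subset(toy, civic == 0)

diff_means <- mean(treated$voted) - mean(control$voted)
diff_means
```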




4. (c) Standard errors


Recall that \[ \widehat{\text{var}(\tilde{\tau} \mid \mathcal{O})} = \frac{S_{1}^2}{n_1} + \frac{S_{0}^2}{n_0}. \] Calculate it following the definition. Don’t forget to take…?
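As a rough sketch of the formula on simulated data: the code below assumes that \(S_1^2\) and \(S_0^2\) are the usual sample variances (denominator \(n - 1\)), which is what R's `var()` computes. The data frame `toy` is made up. The sketch stops at the variance estimate, so the last step hinted at above is still yours to do.

```r
# Simulated stand-in for the real data (an assumption for illustration only).
set.seed(3)
n <- 200
toy <- data.frame(civic = rbinom(n, 1, 0.5))            # hypothetical treatment indicator
toy$voted <- rbinom(n, 1, 0.3 + 0.05 * toy$civic)       # hypothetical binary outcome

y1 <- toy$voted[toy$civic == 1]   # outcomes in the treatment group
y0 <- toy$voted[toy$civic == 0]   # outcomes in the control group

# S_1^2 / n_1 + S_0^2 / n_0, with var() giving the sample variance.
var_hat <- var(y1) / length(y1) + var(y0) / length(y0)
var_hat
```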



5. (d) Regression 1


WE ALWAYS REGRESS TO REGRESSION! Regression is great. It saves our lives. It saves the whole world.