7 Sample Script
By now, you have all the tools you need to conduct a simple analysis in R from start to finish. This section gives an example that combines everything from start to finish.
7.1 Script Hygiene
I always do four things at the top of my scripts to set up my environment before doing anything else.
- Write a brief comment about the purpose of the script.
- Write
rm(list = ls())
, which deletes any objects that you’ve created in the past. It’s good to start with an empty workspace, since then you’ll know your script is not relying on any objects that were created with code that isn’t in the script. - Set your working directory.
- Load all libraries that you’ll be using.
7.2 Saving output
So far, we haven’t talked about storing the output of our scripts. Saving is straightforward. Each type of object to save has a different function.
- Datasets:
write.csv(dataset, file)
- Figures:
ggsave(figure, file)
- Tables: Add an option to your
etable()
line.- For saving tables as PNG files
I include code for saving in my sample script below.
7.3 A simple recipe for a good R script
Most scripts follow the following sequence.
- Introductory block
- Load data
- Execute tasks (e.g, clean data, create figures, run analyses)
- Save output (e.g, new datasets, figures, tables)
It’s a good idea to visually separate your code into sections using header comments. I do so in the example. Use Ctrl + Shift + R
to create these headers in RStudio.
7.4 A sample R script
The following script loads, cleans, and analyzes the Opportunity Insights colleges data. This script should run start-to-finish with no errors as long as you have all the packages installed, data downloaded in a datasets/
subdirectory, and an output/
subdirectory to hold the saved files.
# Author: Matt Brown
# Date: 09/01/22
# Purpose:
# Setup -------------------------------------------------------------------
rm(list = ls())
setwd('/Users/mattbrown/Documents/Teaching/EnvEconClass/bookdown-Rguide')
library(dplyr)
library(fixest)
library(ggplot2)
# Load data ---------------------------------------------------------------
df_main = read.csv('datasets/mrc_table2.csv')
df_IPEDS_Scorecard = read.csv('datasets/mrc_table10.csv')
# Clean data --------------------------------------------------------------
### Select key variables
df_main = df_main %>%
select(super_opeid, name, tier, mr_kq5_pq1, par_median, par_q1)
df_IPEDS_Scorecard = df_IPEDS_Scorecard %>%
select(super_opeid, public, hbcu, flagship, scorecard_netprice_2013, state)
### Add supplemental variables to main data
df_main = df_main %>% left_join(df_IPEDS_Scorecard, by = 'super_opeid')
### Drop NAs
df_main = na.omit(df_main)
### Sample restriction (public institutions only)
df_main = df_main %>% filter(public == 1)
### Create standardized price varaible
netprice_mn = mean(df_main$scorecard_netprice_2013)
netprice_sd = sd(df_main$scorecard_netprice_2013)
df_main = df_main %>%
mutate(netprice_std = (scorecard_netprice_2013 - netprice_mn )/netprice_sd)
# Plots --------------------------------------------------------------------
### Histograms for key continuous variables
hist_mr_kq5_pq1 = ggplot(data = df_main, aes(x = mr_kq5_pq1)) +
geom_histogram()
hist_par_median = ggplot(data = df_main, aes(x = par_median)) +
geom_histogram()
hist_price = ggplot(data = df_main, aes(x = scorecard_netprice_2013)) +
geom_histogram()
### Scatter plots
uncond_scatter_mobility_price =
ggplot(data = df_main, aes(y = mr_kq5_pq1, x = scorecard_netprice_2013)) +
geom_point()
# It's a good idea to view these plots as you work interactively.
# This can help you decide how to define variables, handle outliers, etc.
# Regressions -------------------------------------------------------------
# What variables are associated with mobility index?
dict = c('mr_kq5_pq1' = 'Mobility Rate',
'netprice_std' = 'Standardized cost of attendance',
'flagship' = 'Flagship',
'hbcu' = 'HBCU',
'par_q1' = 'Frac w/ parents in bottom income quintile')
reg_nocontrols = feols(
data = df_main,
fml = mr_kq5_pq1 ~ netprice_std + flagship + hbcu,
vcov = 'HC1'
)
reg_statefe = feols(
data = df_main,
fml = mr_kq5_pq1 ~ netprice_std + flagship + hbcu | state
)
reg_statefe_parq1= feols(
data = df_main,
fml = mr_kq5_pq1 ~ netprice_std + flagship + hbcu + par_q1| state
)
result_list =
list(reg_nocontrols, reg_statefe, reg_statefe_parq1)
# Save output -------------------------------------------------------------
write.csv(df_main, file = 'output/analysis_data.csv')
ggsave(hist_mr_kq5_pq1, file = 'output/hist_mr_kq5_pq1.png')
ggsave(hist_par_median, file = 'output/hist_par_median.png')
ggsave(hist_price, file = 'output/hist_price.png')
ggsave(uncond_scatter_mobility_price,
file = 'output/uncond_scatter_mobility_price.png')
etable(result_list,
dict = dict,
style.tex = style.tex('aer'),
file = 'output/reg_table.tex')
etable(result_list,
dict = dict,
style.tex = style.tex('aer'),
export = 'output/reg_table.png')
## model 1 model 2 model 3
## Dependent Var.: Mobility Rate Mobility Rate Mobility Rate
##
## Constant 0.0188*** (0.0004)
## Standardized cost of attendance -0.0013*** (0.0004) -0.0004 (0.0005) 0.0014*** (0.0004)
## Flagship -8.07e-5 (0.0016) 0.0007 (0.0011) 0.0053** (0.0019)
## HBCU 0.0110*** (0.0016) 0.0122*** (0.0019) 0.0045 (0.0037)
## Frac w/ parents in bottom income quintile 0.0727** (0.0216)
## Fixed-Effects: ------------------- ------------------ ------------------
## state No Yes Yes
## ________________________________________ ___________________ __________________ __________________
## S.E. type Heteroskedast.-rob. by: state by: state
## Observations 1,189 1,189 1,189
## R2 0.02720 0.33576 0.46679
## Within R2 -- 0.03314 0.22387
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1