Chapter 4 Estimating models
Now that we know how to deal with data, we can think about fitting some models! There are two main ways to do this in R: using routines in packages, or programming the mathematical operations yourself using matrices. Both have their upsides and downsides. R is very good at the first, and still very decent - though not the best language out there - for the second.
The first is typically a lot easier. The main upside of R is its amazing array of different packages containing almost every model you would like to use, with lots of options and community support. The downside is that you have less control. Someone else has decided what summary statistics to report, which optimiser to use, how to deal with NAs and so on. These are not always what you would like. We should note, however, that even the package functions in R are typically much more flexible than alternatives like Stata. It is much easier to work out, and change, what is going on under the hood. This makes R a great choice for applied statistics, econometrics, and data science, and is part of why R is so popular in these communities.
The second is typically a bit harder. You have to program the models, all your desired summary statistics, and so on yourself. The upside is that you can choose exactly what you want to do and how to do it. The downside is that R is less optimised for matrix operations than dedicated high-performance computing languages like Julia, or Python 3 with Numba. Thus, if you only want to program your own estimators and never use packages, maybe consider these instead (see https://julialang.org/ or https://numba.pydata.org/).
We are going to show you how to fit estimators using functions first. We will introduce this using the lm() function for linear regression that comes built into R, and the sandwich package for computing different covariance matrices. Along the way, we will pick up some useful tips and tricks for using functions from packages. Afterwards, we will deal with programming your own estimators using matrix programming.
4.1 Modelling with functions
4.1.1 Linear regression with lm()
The lm() function is how we perform OLS in R. Most of the models you will encounter in R packages are built off of this base function. Through it, we are going to encounter some useful tips and tricks for using inbuilt functions. Thus, there are high returns to knowing how it works and reports summary statistics. In the workflow section, we will see how to automatically make nice LaTeX regression tables from lm objects and sandwich standard errors with stargazer().
# lets read in a new dataframe
# read.csv allows us to read a csv file into R. The `as.data.frame' wrapper
# tells R that we want the object to be represented as a dataframe
# For illustration, we are going to use some of the data from Dell and Querubin (2017)
# `Nation building through foreign interventions'. This estimates the causal effect of
# military firepower on insurgent support on the intensive margin, using
# discontinuities in US strategies across regions in South Vietnam during their
# occupation. We have observations by hamlet.
# To download the original data, go to
# https://scholar.harvard.edu/files/dell/files/nationbuilding.pdf
# The file we will use is there as `firstclose_post.dta'
df <- as.data.frame(read.csv("vietnam_war.csv", stringsAsFactors = F))
lm() is the command, included in base R, that allows us to perform linear regression. To carry out a regression, we take a dataframe object including all of our dependent and independent variables.

The first argument to lm() is a ‘formula.’ This is the formula for our regression model. We specify it with the following syntax: ‘dependent_variable ~ independent_variable_1 + independent_variable_2 + … + independent_variable_n.’ The variable names have to be the same as the names of the variables in the dataframe; otherwise, lm() will not recognise the name of the variable. We do not have to pass the formula or variable names as strings - we just write them out in text, and lm() finds the variable with that name in our dataframe. lm() automatically includes an intercept. If we want to exclude this, we write 0 as the first term in our formula.
The second argument to lm() is ‘data.’ We need to pass the name of the dataframe containing our data. It is important to know what type of vector each of your variables is within the dataframe. If your variable is one of the numeric vector types, lm() will treat it as a continuous variable. If it is a factor, it will treat it as a categorical variable, with an ordering given by the ordering of the underlying factor. Thus, it will include a series of dummy variables or ‘fixed effects’ for each level of the factor, omitting the one corresponding to the lowest level to avoid perfect multicollinearity.
This is basically it! There are other additional arguments that allow you to specify vectors of weights for weighted least squares, different fitting methods, and so on.
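Since lm() decides how to treat each regressor based on its type, it is worth checking how R has stored your variables before fitting. A minimal sketch, using column names that appear in the regressions below:
# check how R is storing the variables before passing them to lm()
class(df[["fr_strikes_mean"]])  # a numeric type will be treated as continuous
class(df[["yr"]])               # wrap in as.factor() in the formula if you want dummies
str(df[, c("fr_strikes_mean", "below", "yr", "md_ab")])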
# Lets run a few different linear regressions with lm
# regressing mean airstrikes on whether the hamlet was above or below the algorithm
# scoring threshold US planners used to assign planned airstrikes
lm(fr_strikes_mean ~ below, data=df)
##
## Call:
## lm(formula = fr_strikes_mean ~ below, data = df)
##
## Coefficients:
## (Intercept) below
## 0.257591 0.009856
# lets drop the intercept
lm(fr_strikes_mean ~ 0 + below, data=df)
##
## Call:
## lm(formula = fr_strikes_mean ~ 0 + below, data = df)
##
## Coefficients:
## below
## 0.2674
# regressing mean airstrikes on distance from the threshold with a second-order
# polynomial trend
# we will save this as an object m2 so we can take a look at it later
df[["md_ab_square"]] <- df[["md_ab"]]^2
m2 <- lm(fr_strikes_mean ~ md_ab + md_ab_square, data=df)
# now lets use factors to add year fixed effects
m_time <- lm(fr_strikes_mean ~ as.factor(yr) + md_ab + md_ab_square, data=df)
As we can see, the output from the lm() command by itself is quite ugly. It is better to save it as a variable; then we can look at it more nicely and do things with it. The first thing is to make it look a (little) nicer. We can do this using the base R summary() function. Running summary(model_name) will output a regression table containing OLS standard errors, p-values, the R-squared, and the F statistic for joint significance of all coefficients.
summary(m2)
##
## Call:
## lm(formula = fr_strikes_mean ~ md_ab + md_ab_square, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.3079 -0.2412 -0.1059 0.1609 1.0993
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.305918 0.002982 102.59 < 2e-16 ***
## md_ab -0.303735 0.043391 -7.00 2.69e-12 ***
## md_ab_square -11.375115 0.345114 -32.96 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2946 on 12204 degrees of freedom
## Multiple R-squared: 0.08179, Adjusted R-squared: 0.08164
## F-statistic: 543.5 on 2 and 12204 DF, p-value: < 2.2e-16
# lets save it as an object so we can inspect it later too
s2 <- summary(m2)
Both of these objects are structured as lists of lists. The whole object itself is a list, and each of the relevant outputs - e.g. the coefficients - is a list within that list. We can eyeball what is in there by saving our summary output as an object and clicking on the object name in our environment. This brings up the following.
We can see here that our object is a list, and that it contains a set of other lists. Clicking on the lists with arrows on their left reveals the content of the list, and shows us the names attached to each entry. In general, this shows us what we have in our object and thus what we can slice out and present.
The objects also store things like the variable names attached to each coefficient. The list structure and attached metadata mean that we can slice the object using similar slicing techniques to the ones we learned earlier to extract what we want. This can be useful if, for example, we want to do some plotting, get a single test statistic, or use the coefficient values to predict out of sample.
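If you prefer the console to the environment pane, names() and str() give the same overview of what a model or summary object contains; a quick sketch:
# list the components stored in the model and summary objects
names(m2)
names(s2)
# str() prints the nested structure, including the type and size of each component
str(s2, max.level = 1)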
# lets get some results from our regressions
# imagine for example that we want to plot the residuals to look for some
# heteroskedasticity
resids <- s2[["residuals"]]
plot(df[["md_ab"]], resids, col="red", xlab="Distance from threshold", ylab="Residual",
     main = "Regression residuals against distance from score threshold")
# we want to look at the 35th residual from this model for some reason
s2[["residuals"]][35]
## 35
## 0.04256734
# or we just want to look at the coefficient on md_ab
m2[["coefficients"]][["md_ab"]]
## [1] -0.303735
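Base R also ships accessor functions - coef(), residuals(), fitted(), and predict() - that pull out the same pieces as the slicing above. The sketch below uses made-up predictor values purely for illustration.
# accessor functions do the same job as slicing the list directly
coef(m2)
head(residuals(m2))

# predicting out of sample from the fitted model
# (the value 0.1 for md_ab is made up for illustration)
new_obs <- data.frame(md_ab = 0.1, md_ab_square = 0.1^2)
predict(m2, newdata = new_obs)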
Most other regression routines in R either use a version of this function and output structure, or something deliberately very similar (e.g. plm() for linear panel data models, pgmm() for panel data GMM models, dynlm() for time series). Thus, if you fit one of these models you get an lm()-type object out that you can inspect, and then slice to get what you need, in a similar way.
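For instance, a fixed-effects panel regression with plm() looks almost identical to an lm() call. The sketch below assumes that usid and yr together identify the panel structure - that is an assumption for illustration, not something established above.
# A hedged sketch of a plm() fixed-effects ('within') regression
# NOTE: treating usid and yr as the panel indices is an assumption for illustration
library(plm)
m_panel <- plm(fr_strikes_mean ~ md_ab + md_ab_square,
               data = df, index = c("usid", "yr"), model = "within")
summary(m_panel)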
# An example of other lm-type objects in R - IV models
# IV regression in R
library(AER)
## Loading required package: car
## Loading required package: carData
## Loading required package: lmtest
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
## Loading required package: sandwich
## Loading required package: survival
## Warning: package 'survival' was built under R version 4.0.5
# lets do a fuzzy-rdd type IV regression to demonstrate
# We look at the causal effect of airstrikes on local insurgent infiltration
# using distance from the threshold as an instrument for airstrikes (and including
# a quadratic trend)
# syntax is: second-stage regression | all exogenous variables (here the excluded
# instrument md_ab plus the included exogenous regressor md_ab_square)
m_iv <- ivreg(guer_squad ~ fr_strikes_mean + md_ab_square | md_ab + md_ab_square, data=df)
s_iv <- summary(m_iv)
# see how this package uses the base lm object to build up the more complex
# econometric model, so that we can slice the resulting object in the same
# way as before
lm() can also evaluate a model formula passed as a string. Thus, a neat trick we can use to fit large models, or many similar models, is to construct the model formula as a string using paste(), and then pass that string to the formula argument of lm().
# Imagine we want to include the variables in columns 4:50 of our dataframe into
# our model as regressors
# this syntax may look intimidating, so lets look at it piece by piece
# As we learned when we looked at slicing earlier, df[,c(4:50)] will select
# all rows (as the left hand side of the comma is empty) of columns 4:50 (as we
# pass the vector of numbers 4:50 on the right hand side of the column) from our
# dataframe df.
# names() returns the column name of a given dataframe object. Thus, names(df[,c(4:50)])
# returns a vector of names of all of the columns in our slice as strings
# paste() takes in vectors of strings, or multiple strings manually, and returns
# a single string. If we specify the 'collapse' argument, it takes any vector
# of strings and returns a single string made up of each of the elements of the
# vector pasted together, and separated by the thing we pass to 'collapse'. In
# this case, it takes all of our names(df[,c(4:50)]) and puts them in a single
# string, each separated by a '+' sign
dep_vars <- paste(names(df[,c(4:50)]), collapse="+")
# next we need to add our dependent variable. We can do this with 'paste' again.
# If we specify 'sep' and pass strings as individual arguments (instead of a vector),
# paste takes each string we pass and puts them together in
# one string separated by the symbol we pass to 'sep'.
formula_string <- paste("fr_strikes_mean", dep_vars, sep="~")
# now we have our formula, we can run the regression model
m_string <- lm(formula_string, data=df)
summary(m_string)
##
## Call:
## lm(formula = formula_string, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.92720 -0.11849 -0.02094 0.08878 0.92147
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.460e-02 2.955e-02 1.509 0.131249
## fr_forces_mean 1.651e-01 7.819e-03 21.110 < 2e-16 ***
## pop_g -1.426e-01 1.472e-02 -9.687 < 2e-16 ***
## naval_attack 6.203e-02 8.221e-02 0.755 0.450564
## sh_pf_presence -1.430e-02 6.744e-03 -2.120 0.033994 *
## sh_rf_presence -3.333e-02 1.163e-02 -2.866 0.004162 **
## fr_init -8.907e-02 5.815e-02 -1.532 0.125634
## fw_init -1.415e+00 4.552e-01 -3.108 0.001890 **
## en_d 3.473e-05 1.881e-04 0.185 0.853526
## fw_d -3.587e-02 1.168e-02 -3.073 0.002128 **
## fr_d 7.137e-04 1.154e-03 0.618 0.536382
## fr_opday_dummy 1.011e-01 5.724e-02 1.765 0.077516 .
## fw_opday_dummy 1.306e+00 4.589e-01 2.846 0.004434 **
## vc_infr_vilg -7.916e-03 9.467e-03 -0.836 0.403053
## part_vc_cont 8.758e-03 3.903e-02 0.224 0.822450
## guer_squad 8.530e-02 7.902e-03 10.795 < 2e-16 ***
## mainforce_squad 1.286e-01 8.226e-03 15.629 < 2e-16 ***
## en_base 3.031e-03 8.903e-03 0.340 0.733517
## entax_vilg 5.570e-02 8.042e-03 6.926 4.59e-12 ***
## phh_psdf -1.480e-02 1.160e-02 -1.276 0.201842
## psdf_dummy -4.234e-02 1.197e-02 -3.538 0.000405 ***
## chief_visit 5.704e-02 1.352e-02 4.219 2.48e-05 ***
## village_comm -4.277e-02 9.169e-03 -4.665 3.13e-06 ***
## gvn_taxes -4.322e-02 7.493e-03 -5.767 8.30e-09 ***
## rdc_active -5.534e-02 6.824e-03 -8.110 5.68e-16 ***
## civic_org_part -6.545e-02 1.134e-02 -5.772 8.08e-09 ***
## vilg_council_meet 3.150e-02 6.653e-03 4.736 2.22e-06 ***
## youth_act 3.903e-03 8.058e-03 0.484 0.628098
## p_own_vehic -5.310e-02 1.635e-02 -3.248 0.001166 **
## nonrice_food 1.841e-02 9.894e-03 1.861 0.062810 .
## manuf_avail -1.264e-02 9.302e-03 -1.359 0.174230
## surplus_goods 3.832e-02 6.667e-03 5.748 9.28e-09 ***
## econ_train 5.062e-02 1.235e-02 4.100 4.17e-05 ***
## self_dev_part -2.065e-02 1.055e-02 -1.957 0.050347 .
## selfdev_vilg -6.659e-02 1.504e-02 -4.428 9.60e-06 ***
## pworks_under_constr 2.016e-02 7.014e-03 2.875 0.004049 **
## prim_access 3.258e-02 1.087e-02 2.996 0.002741 **
## sec_school_vilg 3.092e-02 6.321e-03 4.891 1.02e-06 ***
## p_require_assist 4.766e-02 2.051e-02 2.323 0.020192 *
## nofarm_sec 4.624e-02 7.037e-03 6.571 5.24e-11 ***
## urban -4.235e-02 1.035e-02 -4.091 4.33e-05 ***
## buddhist 1.146e-02 5.236e-03 2.189 0.028649 *
## farming -1.872e-02 7.110e-03 -2.633 0.008482 **
## all_atk 2.945e-01 1.794e-02 16.418 < 2e-16 ***
## en_pres 1.711e-01 1.821e-02 9.399 < 2e-16 ***
## en_prop -1.060e-01 2.338e-02 -4.533 5.89e-06 ***
## admin_p1 1.017e-01 2.800e-02 3.634 0.000281 ***
## health_p1 -4.880e-02 8.736e-03 -5.586 2.39e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2027 on 9761 degrees of freedom
## (2398 observations deleted due to missingness)
## Multiple R-squared: 0.5178, Adjusted R-squared: 0.5155
## F-statistic: 223 on 47 and 9761 DF, p-value: < 2.2e-16
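As an aside, base R's reformulate() builds the same kind of formula object from a character vector of regressor names, which avoids pasting the '~' and '+' signs by hand; a short alternative sketch of the trick above:
# an alternative to manual string pasting: reformulate() builds the formula object
formula_obj <- reformulate(termlabels = names(df[, c(4:50)]), response = "fr_strikes_mean")
m_reform <- lm(formula_obj, data = df)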
4.1.2 Computing covariance matrices with sandwich
lm() only computes standard errors based on the standard OLS variance-covariance matrix. We often want to do something different, for example computing Newey-West standard errors. The sandwich package allows us to easily compute these kinds of ‘sandwich’ covariance matrices.
There are multiple different sandwich commands, corresponding to different types of covariance matrix. Generally, we need to pass these functions a fitted model object like our lm() model. Next, we can specify whether we need a finite sample adjustment (if we want to make our output consistent with Stata, which often does this automatically), whether we want to pass a weights matrix, and so on. If we want to specify the outsides and insides of the covariance matrix structure manually, we can do so using bread() and meat().
# Lets compute some different types of covariance
# matrices with sandwich
library(sandwich)
# heteroskedasticity and autocorrelation robust covariance matrix
m2_HAC_cov_mat <- vcovHAC(m2)
# Panel-corrected covariance matrix
m2_pan_cov_mat <- vcovPC(m2, cluster=df[["usid"]])
# now if we want to get the standard errors back, we need to take the square root
# of the diagonal of the matrix
m2_HAC_ses <- sqrt(diag(m2_HAC_cov_mat))
Once we have these, we might want to look quickly at which variables are significant. We can do this using the coeftest() function from the package lmtest. The lmtest package contains functions for lots of specification tests for different linear models (to name three examples: likelihood ratio tests, RESET tests, and tests for Granger causality). coeftest() is a counterpart to summary() for lm()-type objects. We pass it an lm() object and a method for computing a sandwich covariance matrix. The function then prints out the coefficient summary with the appropriate significance levels and p-values from our covariance matrix.
library(lmtest)
coeftest(m2, vcov=vcovHAC)
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.3059175 0.0098534 31.0469 < 2.2e-16 ***
## md_ab -0.3037350 0.0645894 -4.7026 2.597e-06 ***
## md_ab_square -11.3751147 0.6366604 -17.8668 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
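lmtest also provides coefci(), which computes confidence intervals from the same kind of covariance-matrix argument; a quick sketch:
# confidence intervals based on the HAC covariance matrix
coefci(m2, vcov. = vcovHAC)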
4.1.3 Useful packages for econometrics
A (not exhaustive) list of some useful packages for econometrics with R is:
- base - the inbuilt set of functions that come with R;
- ctest for classical statistical tests;
- MASS, lmtest, and AER for cross-sectional models with OLS;
- nls for non-linear regression;
- MatchIt and RCT for causal inference;
- plm for panel data analysis;
- ts, timeseries, vars, forecast, dynlm, lubridate, and zoo for time series analysis; and
- igraph for network analysis.
4.2 Matrix algebra
If we want to program our own estimators, we should use matrices. This means we have to know how to do matrix operations in R. Here, we go through some of the basic matrix operations in R, and use them to program a least-squares estimator directly from the data we loaded above.
# Common matrix operations
# first way we can create a matrix is by taking the entries of a dataframe,
# and converting them to a matrix with as.matrix
X_mat <- as.matrix(df[, c("md_ab", "md_ab_square")])
y_mat <- as.matrix(df[["fr_strikes_mean"]])
# create a matrix using matrix, and specifying the rows (nrow) and columns (ncol)
# Let i,j denote the entry on the ith row and jth column
# we pass a vector of numbers to get the entries
# by default, matrix() fills the entries column by column, so the vector should
# have the form c(1,1; 2,1; ..... n,1; 1,2; 2,2; ..... n,n); set byrow=TRUE to fill by rows
A_mat <- matrix(c(1,4,1,3,2,2,5,4,3), nrow=3, ncol=3)
B_mat <- matrix(c(1,1,1,2,2,2,3,3,3), nrow=3, ncol=3)
# elementwise multiplication
elementwise_mat <- A_mat * B_mat
# matrix multiplication
mult_mat <- A_mat %*% B_mat
# outer product (for two matrices this returns a 4-dimensional array)
out_mat <- A_mat %o% B_mat
# transpose
A_t <- t(A_mat)
# cross product i.e A'B - two ways of doing it
cross_mat_1 <- t(A_mat) %*% B_mat
cross_mat_2 <- crossprod(A_mat, B_mat)
# Matrix of squares A'A - two ways of doing it
sq_mat_1 <- t(A_mat) %*% A_mat
sq_mat_2 <- crossprod(A_mat, A_mat)
# creating diagonal matrices
# passing a matrix to diag extracts the entries on its principal diagonal
# and returns them as a vector
A_diag <- diag(A_mat)
# passing a vector to diag gives a matrix with that vector on the principal diagonal
# If we pass only a scalar a, it returns an a by a identity matrix instead (no idea
# why - this is just how they set it up)
A_diag <- diag(c(1,2,3))
I_mat <- diag(3)
# Inverses - solve() inverts the matrix, ginv() computes the Moore-Penrose
# generalised pseudo-inverse (need the MASS package for this)
A_inv <- solve(A_mat)
library(MASS)
A_inv_mp <- ginv(A_mat)
# Eigenvalues and eigenvectors - eigen() computes eigenvectors [["vectors"]] and
# corresponding eigenvalues [["values"]]
A_eigens <- eigen(A_mat)
A_eigens[["vectors"]]
## [,1] [,2] [,3]
## [1,] -0.5994073 -0.4695498 -0.01833495
## [2,] -0.6888071 0.8410090 -0.85517793
## [3,] -0.4077447 -0.2687506 0.51801017
A_eigens[["values"]]
## [1] 7.8486741 -1.5114986 -0.3371754
# singular value decomposition
y <- svd(A_mat)
# finally, lets compute one of our least squares estimators
beta_vec <- solve((t(X_mat) %*% X_mat)) %*% t(X_mat) %*% y_mat
# now computing variance-covariance matrix
resid_vec <- y_mat - X_mat %*% beta_vec
vcov <- solve((t(X_mat) %*% X_mat)) %*% t(X_mat) %*% resid_vec %*% t(resid_vec) %*%
  X_mat %*% solve((t(X_mat) %*% X_mat))