Chapter 6 Advanced programming techniques
By this point, we have covered the basics of programming in R. We will now present some more advanced material that can make your code better. The examples are based around programming your own estimators. First, we will show you how to simulate data using Monte-Carlo simulation, making use of the vectorising tricks we learnt in the first section. Then, we will use this data to demonstrate optimisation in R by programming a maximum-likelihood estimator. Finally, we will demonstrate how to speed up code by profiling it.
6.1 Advanced iteration and Monte-Carlo simulation
First, we will generate some data of our own by running Monte-Carlo simulations. This is a useful skill in itself. We will also use it to reinforce some of the principles of R we learned earlier, and to introduce some more advanced features you might want to use. Note that we do not simulate our data in the most efficient way possible here; for more advanced approaches, see this great article by Grant McDermott.
We are going to generate random numbers, so we should start by setting a seed for our random number generator. Seeding sets the starting value for the random number generators in your program. As random number generators are deterministic, setting a seed ensures that you get the same random numbers each time you run your code. This helps with debugging, and ensures others can reproduce our results. We can do this in R with set.seed().
# lets set a seed for reproducibility
rm(list=ls())
set.seed(854)
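To see that seeding works as described, the following quick check (a sketch using rnorm, which we introduce properly below) resets the same seed and draws the same numbers twice; the particular seed value is arbitrary.
# resetting the same seed reproduces exactly the same draws
set.seed(854)
draws_a <- rnorm(3)
set.seed(854)
draws_b <- rnorm(3)
identical(draws_a, draws_b) # TRUE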
Next, we need to generate some random numbers. Base R has a lot of nice inbuilt commands for sampling from different distributions. The syntax is generally as follows. In the first argument, you specify the number of draws you want to make. In the next arguments, you specify the parameters of the distribution - e.g. the mean and standard deviation for a normal distribution. R then outputs a numeric vector composed of draws from the distribution.
# Some examples of drawing random numbers in R

# the command 'runif' gives draws from a uniform [0,1] distribution if you do
# not specify a support
unif_1 <- runif(35)

# but you can specify a support, using the 'min' and 'max' parameters
unif_2 <- runif(35, min = 20, max = 40)

# the command 'rnorm' gives draws from a normal distribution with parameters 'mean', 'sd'
# lets do 20 draws from a standard normal distribution
norm_1 <- rnorm(20, mean = 0, sd = 1)

# the command rchisq gives draws from a chi-square distribution with degrees of
# freedom 'df'
chi_sq_1 <- rchisq(34, df = 100)

# finally, rpois gives draws from a poisson distribution with rate parameter
# 'lambda'
pois_1 <- rpois(22, 6)
In general, the syntax for distribution functions in R is composed of a root name that we precede with one of the prefixes p, q, d, or r. The root name is a shorthand for the name of the distribution. p gives us the cumulative probability of a given value from the c.d.f., q gives us the inverse of the c.d.f. (i.e. quantiles), d gives us the value of the density (or probability mass) function, and r draws random numbers from the specified distribution. For a list of the roots for common distributions in R, see here.
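For instance, applying the four prefixes to the root norm for the normal distribution looks as follows (the approximate values quoted in the comments are for the standard normal).
# the four prefixes applied to the root 'norm'
dnorm(0)     # density at 0, roughly 0.399
pnorm(1.96)  # P(Z <= 1.96), roughly 0.975
qnorm(0.975) # the 97.5% quantile, roughly 1.96
rnorm(5)     # five random draws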
Now that we can generate random numbers, we need to generate lots of them efficiently. For illustration, let's simulate a linear regression model in three variables. Each of the variables is itself normally distributed, and we have a normally distributed error term. We will also only do small simulations of 100 data points each for ease.
Our first instinct when doing something like this might be to just use a for or while loop. We might also think that to store our values, we should initialise an empty vector and then fill it with values.
# How not to do simulation in R
vec_x1 <- c()
vec_x2 <- c()
vec_x3 <- c()

i <- 0
while (i < 100){
  vec_x1 <- c(vec_x1, rnorm(1, 2, 1))
  i <- i + 1
}

j <- 0
while (j < 100){
  vec_x2 <- c(vec_x2, rnorm(1, 2, 1))
  j <- j + 1
}

k <- 0
while (k < 100){
  vec_x3 <- c(vec_x3, rnorm(1, 2, 1))
  k <- k + 1
}
This is not a good way to go about this for two reasons.
Firstly, we do not want to do our simulation using loops because R is a vectorised language. Practically, this means that operations on whole vectors are much faster than element-by-element loops, since the vectorised functions are implemented in fast compiled code. R is different in this way from compiled languages like C++, where explicit loops are efficient and can sometimes even be faster than operations on vectors. As the amount of data we want to simulate here is small, this does not matter so much. It will begin to matter more if we try to simulate a lot more data, however.
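As a rough illustration of this point (a sketch only; the exact timings depend on your machine), compare drawing one million normal variates one at a time in a loop with a single vectorised call.
# timing a loop against a single vectorised call
n <- 1e6

# one draw at a time, filling a pre-allocated vector
system.time({
  loop_draws <- numeric(n)
  for (i in 1:n) loop_draws[i] <- rnorm(1, 2, 1)
})

# all draws in one vectorised call - typically far faster
system.time(vec_draws <- rnorm(n, 2, 1))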
Secondly, we do not want to store our data by repeatedly growing a vector. The way R stores objects in memory makes this very memory intensive and inefficient, as the vector is copied each time it is extended. It is much better to create an object of the right length up front, and then replace its values. We can create a vector of a given length using the rep function. The first argument is some value to repeat; a good choice is NA, so we can tell if our filling steps later fail. The second argument is the length of the vector you want to produce.
Let's address the second point first. We now reproduce one of the simulations above, but instead of growing a vector we create a vector of NAs and then fill it up.
# Some more efficient, vectorised, simulation code
# creating empty vectors with rep
vec_x1 <- rep(NA, 100)

i <- 0
while (i < 100){
  # note we have to use i+1 here as we start at 0, but R starts counting lengths
  # at 1
  vec_x1[i+1] <- rnorm(1, 2, 1)
  i <- i + 1
}
Next, let's replace the loop. We already know that we can draw vectors of random variables directly, although we might not be able to do this if our use case is more complicated. Thus, we should also look at applying a function to generate the data. We will do this by generating 100 sets of observations, storing these in a dataframe, and then applying a function that fits a regression model to each set of observations.
#--------------------------------------------------------------------------------
# Some more efficient Monte-Carlo
# Say we want to find the distribution of the estimates of \beta_{1} in a
# linear regression model of the form y_{i}= \alpha + \sum_{j}\beta_{j}x_{i}^{j} + e_{i}
# up to the third order
#--------------------------------------------------------------------------------
# Parameters -------------------------------------------------------------------
# Number of simulations
mc <- 100

# Number of data points within each simulation
length_within_sim <- 100

# 'True parameters' of the regression model
alpha = 2.2
beta_1 = 0.5
beta_2 = 0.3
beta_3 = 0.2

# Functions -------------------------------------------------------------------
# here we are going to group our dataframe by a group indicator, and
# then fit the linear regression to subsets of the data by the group
# indicator
fit_reg <- function(group, group_col, df, reg_formula){
  # note that it is actually more efficient here to use the lm.fit
  # method - if you would like to do something more advanced take
  # a look at that with ?lm.fit
  mod <- lm(reg_formula, data=df[df[[group_col]]==group,])
  return(mod[["coefficients"]][2])
}

# Running code -----------------------------------------------------------------
# Generating our variables
int <- rep(mc*length_within_sim, alpha)
vec_x1 <- rnorm(mc*length_within_sim, 2, 1)
vec_err <- rnorm(mc*length_within_sim, 0, 1)

# Here we are fitting a cubic function. Thus, we simulate the new
# variables by operating on the old ones. This is more efficient
# (Thanks to Katya Ugulava for pointing this out to me)
vec_x2 <- vec_x1^2
vec_x3 <- vec_x1^3

# now creating a group indicator for each of our simulations using the 'each'
# argument in rep
group <- rep(c(1:100), each=100)

# putting these in a dataframe to then fit the regression
mc_df <- as.data.frame(cbind(group, int, vec_x1, vec_x2, vec_x3))

# lets just change the column names now for ease of use
names(mc_df) <- c("group", "int", "x1", "x2", "x3")

# notice we are using fast vectorised multiplication here!
mc_df[["y"]] <- mc_df[["int"]] + beta_1 * mc_df[["x1"]] + beta_2 * mc_df[["x2"]] +
  beta_3 * mc_df[["x3"]] + vec_err

# finally, lets apply our fitting function above to the dataframe
# to get a set of estimates of \beta_{1}
beta_vec <- unlist(lapply(1:100, FUN=fit_reg, group_col="group", df=mc_df,
                          reg_formula = "y~x1+x2+x3"))

hist(beta_vec, breaks=25, col="red", main="Distribution of the estimator for beta_{1}")
If we want to do this repeatedly, we should write the whole Monte-Carlo operation as a function to which we pass the parameter values as arguments.
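A minimal sketch of such a wrapper is below. The function name run_mc and its argument list are our own choices for illustration: it simply repackages the simulation steps above (building the intercept column directly from alpha) and reuses the fit_reg function defined earlier.
# a sketch of wrapping the whole Monte-Carlo routine in a single function
run_mc <- function(mc, length_within_sim, alpha, beta_1, beta_2, beta_3){
  n_obs <- mc * length_within_sim
  mc_df <- data.frame(group = rep(1:mc, each = length_within_sim),
                      int   = rep(alpha, n_obs),
                      x1    = rnorm(n_obs, 2, 1))
  mc_df[["x2"]] <- mc_df[["x1"]]^2
  mc_df[["x3"]] <- mc_df[["x1"]]^3
  mc_df[["y"]]  <- mc_df[["int"]] + beta_1 * mc_df[["x1"]] + beta_2 * mc_df[["x2"]] +
    beta_3 * mc_df[["x3"]] + rnorm(n_obs, 0, 1)
  unlist(lapply(1:mc, FUN = fit_reg, group_col = "group", df = mc_df,
                reg_formula = "y~x1+x2+x3"))
}

# e.g. hist(run_mc(100, 100, 2.2, 0.5, 0.3, 0.2), breaks = 25)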
We can still make this more efficient, though, by replacing our dataframe with an object called a data.table or a tibble, and then doing something called ‘nesting’ the operation within the table. Here we cover how to do this with a data.table - for tibbles see https://r4ds.had.co.nz/many-models.html.
A data.table is effectively an improved version of the base R data.frame. It comes in its own package of the same name. data.tables keep the same functionality as data.frames, but add the ability to perform operations on the rows and columns as part of the object. We will demonstrate these functionalities while performing our simulations. The most important of these is the by argument, which is an inbuilt argument that groups the data by the values in a column of the data.table.
# Demonstrating some data.table functionality
# before performing our Monte-Carlo
library(data.table)

# lets generate some data to use:
# one random variable and two different groups
col_1 <- rnorm(10000, 0, 1)
col_2 <- rep(c("A", "B"), each=5000)

# the command as.data.table allows us to convert an object into a data.table
# if we already have a dataframe or list object, we can use setDT() instead
dt <- as.data.table(cbind(col_1, col_2))
names(dt) <- c("r_v", "grp")

# this code ensures our r_v column is a numeric variable
dt[, r_v := as.numeric(r_v)]

# In addition to selecting columns by passing strings, in data.table we can now
# just select columns based on their names
# data.table also recognises automatically which of our conditions refer to rows
# and which to columns, so we do not have to repeat the data.table name to carry
# out operations as we would with a data.frame
dt[r_v < -1 & grp == "A"]
## r_v grp
## 1: -1.703836 A
## 2: -1.674758 A
## 3: -1.641446 A
## 4: -1.029205 A
## 5: -1.831224 A
## ---
## 766: -1.729828 A
## 767: -1.713903 A
## 768: -1.821262 A
## 769: -1.930295 A
## 770: -1.403510 A
# One inbuilt functionality is ordering!
# ordering by the value of the random variable
dt <- dt[order(r_v)]

# Some useful inbuilt functions
# .N counts the number of observations satisfying the preceding condition
dt[r_v <-1 & grp=="A", .N]
## [1] 5000
# by allows us to start doing some aggregation
# lets compute the number of observations in each group
dt[, .N, by=grp]
## grp N
## 1: A 5000
## 2: B 5000
# to apply a function to all observations that satisfy the condition in the
# preceding argument, we wrap the expression in .( ) in the j argument
dt[, .(mean(r_v)), by=grp]
## grp V1
## 1: A 0.009828461
## 2: B -0.012719875
dt[r_v < -1, .(mean(r_v)), by=grp]
## grp V1
## 1: A -1.521316
## 2: B -1.551410
# we can chain these operations on the rows by adding [] containing a new
# operation after the end of our code above
dt[r_v < -1, .(mean(r_v)), by=grp][order(grp)]
## grp V1
## 1: A -1.521316
## 2: B -1.551410
Above, we have only provided a brief overview of what data.tables can do. For more detail, see https://cran.r-project.org/web/packages/data.table/vignettes/datatable-intro.html.
Now let's use data.table to nest our Monte-Carlo routine. The idea of nesting is the following. Typically, when we create an object like a data.frame or data.table, we think of each entry as being a single thing like a number or a word. But this does not have to be the case. We can actually fill the entries of these objects with other objects, like lists or data.tables. data.table provides us with a set of inbuilt iteration operations that are more efficient than just using the base apply functions over subsets of a given data.frame or data.table. Thus, by storing subsets of our data as entries in our data.table, we can exploit these inbuilt iteration operations to do any iteration we want more efficiently.
If that sounds abstract, let's take a look at a concrete example by making our Monte-Carlo code a bit more efficient. We will use the inbuilt .SD symbol in data.table to create a data.table containing nested subsets of our data, split by the value of a variable (in our case, the group variable we create during the simulation).
#--------------------------------------------------------------------------------
# Some even more efficient Monte-Carlo using nested operations in data.table
# Say we want to find the distribution of the estimates of \beta_{1} in a
# linear regression model of the form y_{i}= \alpha + \sum_{j}\beta_{j}x_{i}^{j} + e_{i}
# up to the third order
#--------------------------------------------------------------------------------
# Parameters -------------------------------------------------------------------
# Number of simulations
mc <- 100

# Number of data points within each simulation
length_within_sim <- 100

# 'True parameters' of the regression model
alpha = 2.2
beta_1 = 0.5
beta_2 = 0.3
beta_3 = 0.2

# Functions -------------------------------------------------------------------
# here we are going to group our dataframe by a group indicator, and
# then fit the linear regression to subsets of the data by the group
# indicator
fit_reg <- function(group, group_col, df, reg_formula){
  # note that it is actually more efficient here to use the lm.fit
  # method - if you would like to do something more advanced take
  # a look at that with ?lm.fit
  mod <- lm(reg_formula, data=df[df[[group_col]]==group,])
  return(mod[["coefficients"]][2])
}

# Running the code -------------------------------------------------------------
# Generating our variables
int <- rep(mc*length_within_sim, alpha)
vec_x1 <- rnorm(mc*length_within_sim, 2, 1)
vec_err <- rnorm(mc*length_within_sim, 0, 1)

# Here we are fitting a cubic function. Thus, we simulate the new
# variables by operating on the old ones. This is more efficient
# (Thanks to Katya Ugulava for pointing this out to me)
vec_x2 <- vec_x1^2
vec_x3 <- vec_x1^3

# now creating a group indicator for each of our simulations using the 'each'
# argument in rep
group <- rep(c(1:100), each=100)

# putting these in a data.table to then fit the regression
mc_df <- as.data.table(cbind(group, int, vec_x1, vec_x2, vec_x3))

# lets just change the column names now for ease of use
names(mc_df) <- c("group", "int", "x1", "x2", "x3")

# notice we are using fast vectorised multiplication here!
mc_df[["y"]] <- mc_df[["int"]] + beta_1 * mc_df[["x1"]] + beta_2 * mc_df[["x2"]] +
  beta_3 * mc_df[["x3"]] + vec_err

# now lets define a data.table containing a new column 'data' which holds our
# data subsetted by the value of the group variable
mc_dat <- mc_df[, list(data=list(.SD)), by=group]

# now we can use the inbuilt data.table iterators to fit all of our regression
# models
# what this does is define a new column (:=) by applying the regression fit
# to every element x of the 'data' column
# As we do it within the data.table, it is really fast!
mc_dat[, model := lapply(data, function(x) lm(y ~ x1 + x2 + x3, x)[["coefficients"]][2])]

# now mc_dat[, model[[1]], by=group] gives us our estimates back as a column
# (automatic name V1) with the group. To plot them, we need to extract them as
# a vector. Thus, we slice the column out just using standard dataframe syntax for ease.
hist(mc_dat[, model[[1]], by=group][["V1"]], breaks=25, col="red", main="Distribution of the estimator for beta_{1}")
6.2 Optimisation
If we choose to program our own estimators, we often compute the value of the estimator by maximising or minimising an objective function. Thus, we need to know how to do optimisation in R. Here, we cover the basic way to do optimisation in R using the optim function. As an example, we show how to program a maximum-likelihood estimator from scratch (as opposed to using, say, mle).
The optim function in base R is a minimiser. It takes an objective function, which we write as an R function, and minimises it using an algorithm that we specify, given a starting value that we also specify. The first argument is the starting value. The second argument is the function that we want to minimise. The third argument is the method that we want to use. The default method is the Nelder-Mead algorithm; the value “BFGS” gives the BFGS quasi-Newton algorithm. Adding hessian=T gives us the Hessian matrix as an output.
When we write the objective function, the vector of parameters that we are optimising over must be its first argument, and it must return the value of the objective that we want to minimise. Thus, if we want to maximise something, we must return its negative.
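Before the full maximum-likelihood example, here is a minimal sketch of an optim call using a toy quadratic objective of our own (it is not part of the estimator below). The minimum is at (1, 2), so the returned par should be close to that.
# a toy objective: f(par) = (par[1] - 1)^2 + (par[2] - 2)^2
toy_obj <- function(par){
  (par[1] - 1)^2 + (par[2] - 2)^2
}

# starting value c(0, 0), BFGS, and ask for the Hessian as well
toy_fit <- optim(c(0, 0), toy_obj, method = "BFGS", hessian = TRUE)
toy_fit[["par"]]     # approximately c(1, 2)
toy_fit[["hessian"]] # approximately 2 * diag(2)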
Let's try it out by programming the maximum-likelihood estimator for a linear regression with normally distributed errors, using the data from one of the runs of our Monte-Carlo simulation earlier.
# Example of optimisation in R - programming the maximum likelihood estimator
# for a linear regression with normally distributed errors
# Here I borrow from this guide https://www.ime.unicamp.br/~cnaber/optim_1.pdf

# we will minimise the negative log-likelihood (equivalent of course to
# maximising the normal likelihood function as the locations of optima are
# preserved under monotonic transformations)
# we just do it for the data from one simulation run (group 1) for ease
negative_log_likelihood <- function(param_vec, X_mat, y_vec){
  beta_vec  <- param_vec[1:4]
  sigma2    <- param_vec[5]
  n         <- nrow(y_vec)
  inner_sum <- sum((y_vec - X_mat %*% beta_vec)**2)
  log_l     <- -0.5*n*log(2*pi) - 0.5*n*log(sigma2) - (1/(2*sigma2))*inner_sum
  return(-log_l)
}

beta_vec <- optim(c(0, 0, 0, 0, 1), negative_log_likelihood, method="BFGS", hessian = T,
                  X_mat = as.matrix(mc_df[mc_df[["group"]]==1, c(2:5)]),
                  y_vec = as.matrix(mc_df[mc_df[["group"]]==1, c(6)]))
## Warning in log(sigma2): NaNs produced
beta_vec
## $par
## [1] 0.9999929 0.9724584 -0.2416691 0.3128821 3.5391276
##
## $value
## [1] 169.0368
##
## $counts
## function gradient
## 60 11
##
## $convergence
## [1] 0
##
## $message
## NULL
##
## $hessian
## [,1] [,2] [,3] [,4] [,5]
## [1,] 2.825555e+09 5.362870e+05 1.293351e+06 3.533071e+06 2.0816105462
## [2,] 5.362870e+05 1.293351e+02 3.533071e+02 1.054175e+03 0.0006779075
## [3,] 1.293351e+06 3.533071e+02 1.054175e+03 3.345331e+03 0.0022651463
## [4,] 3.533071e+06 1.054175e+03 3.345331e+03 1.111938e+04 0.0076538527
## [5,] 2.081611e+00 6.779075e-04 2.265146e-03 7.653853e-03 -1.7645778030
We see here that this outputs the parameter estimates, the value of the negative log-likelihood at the minimum, and some additional statistics. As we asked for the Hessian, we can also compute the standard errors.
# We get the standard errors by taking the square root of the principal diagonal
# of the inverse of the Hessian
sqrt(diag(solve(beta_vec[["hessian"]])))
## Warning in sqrt(diag(solve(beta_vec[["hessian"]]))): NaNs produced
## [1] 6.556477e-05 1.126158e+00 6.547648e-01 1.124175e-01 NaN
6.3 Profiling
Imagine we want to make our code more efficient so that it runs faster. This may sound unnecessary, but it can be important in applied econometrics. If we are working with large datasets with hundreds of thousands of rows and doing lots of iteration, small inefficiencies can make our code run very slowly, as we might be hitting them thousands and thousands of times.
The standard way to do this is by ‘profiling’ our code. The aim is to find the biggest bottleneck we have not dealt with yet, deal with it as best we can, and repeat until our code performs well enough. This might sound trivial, but it is not: intuition is a poor guide to where the bottlenecks actually are. Thus, programmers typically find them by iteratively testing and timing sections of the code using small, manageable inputs. Here, we take a short look at how to do this in R with the profvis package. For more details, look at the chapter on this in Hadley Wickham’s ‘Advanced R’ (we follow the first section of that chapter here).
# loading the library containing our profiler
library(profvis)
## Warning: package 'profvis' was built under R version 4.0.5
f_1 <- function() {
  pause(0.5)
  f_2
}

f_2 <- function(){
  print("hello world")
}

# we can profile by taking our function, and wrapping it in the function 'profvis()'
# we wrap all of the code we want to profile in curly brackets {}
profvis({f_1()
  f_2()})
## [1] "hello world"
The profvis function will open up an interactive profiler where we can look at the relative speed of different parts of our code. This allows us to isolate which bits of the code take more or less time. We can test different versions by wrapping them in the profvis wrapper, and seeing if they take more or less time.
# now lets make the pause shorter
# we should see that it runs a lot quicker
f_3 <- function() {
  pause(0.25)
  f_2
}

# it does!
profvis({f_3()
  f_2()})
## [1] "hello world"