Learning Microeconometrics with R
Introduction
The Intern
You have been hired as a summer intern for a right-of-center think tank in Washington DC. It is going to be a great summer! You will play softball on the Mall. Go to Nats games. Hang out with friends interning on the Hill. And melt onto the sidewalk when you commute to work.
First Day
The think tank is arguing against a federal increase in the minimum wage. You have been asked to predict what will happen if the minimum wage increases from $7.25 to $15.00.
You have a vague memory of a discussion of the minimum wage in your Econ 101 class. To refresh your memory you google “minimum wage Khan Academy.” You listen to Sol explain that the equilibrium wage is $6 per hour and workers work 22 million hours per month. Sol shows that a minimum wage of $7 leads to 2 million hours of unemployment and $1 million of output per month lost in the economy. This seems straightforward.
But what actually happens in the real world? Your supervisor suggests looking up minimum wages for each state and state level unemployment levels from the Bureau of Labor Statistics (https://bls.gov). She says that different states have changed their minimum wage over time and a number of states have minimum wages that are above $7.25, although none as high as $15.
You download the data on each state’s current minimum wage and unemployment rate. You put everything in a spreadsheet. A fellow intern shows you how to save it as a csv file. He says this will allow importing the data into R, which is the statistical language of choice at your think tank.
You then download R and RStudio (the IDE you are told, whatever that is). Your colleague shows you how to get set up. He shows you how to open up RStudio and then create a new script file. You call the file minwage.R and save it to the minimum wage folder where you have the data set. He then tells you to go to Session > Set Working Directory > To Source File Location. “Trust me. It makes coding easier,” he says.
Now you are ready to write your first line of code.
Your colleague explains that read.csv will import the data set that you created. The data set is simply called x. He explains that you must use the assign <- arrow. You ask why. He shrugs, “that was what I was told when I started.” Also, he says you should probably write as.is = TRUE because R has a habit of changing numbers to characters and other, even stranger, things.
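So you write your first line of code.

x <- read.csv("minimum wage.csv", as.is = TRUE)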
You click Run. It worked! The letter x appears in the Global Environment. You click on it. A tab with your data appears.
You want to calculate the relationship between the minimum wage and unemployment. You want to run a regression.1 You ask your cubicle neighbor how to do that. She tells you to write the following.
lm1 <- lm(x$Unemployment.Rate ~ x$Minimum.Wage)
You ask about the whole thing with <-. Your neighbor says that you must do it that way but refuses to explain why.
You write out the code and hit Run. Nothing happens. Actually, lm1 appears in the box in the upper right of the screen. Apparently it is a List of 12. You were hoping to see a table with regression results and t-statistics. But nothing. You ask for help from your neighbor. She rolls her eyes. “No. You just created an object called lm1. To look at it, use summary.”
summary(lm1)[4]
$coefficients
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.49960275 0.32494078 10.7699709 1.62017e-14
x$Minimum.Wage 0.01743224 0.03733321 0.4669365 6.42615e-01
Cool. You got what you were looking for. The minimum wage increases unemployment! It increases it by 0.01743. You wonder what that means. Another intern comes by, looks at what you did, and then types the following code on your computer. He leaves with a “you’re welcome.” Boy, is that guy annoying.
a1 <- (15-7.25)*lm1$coefficients[2]
a1/mean(x$Unemployment.Rate)

x$Minimum.Wage
    0.03710335
Your neighbor explains it all. You want to know what happens to unemployment when the minimum wage increases from $7.25 to $15. The coefficient on the minimum wage gives the change per dollar, so multiplying it by the size of the increase gives the total change. Then you can put it in percentage terms relative to the current unemployment rate.
You go back to your supervisor. You say you found that a minimum wage increase to $15 would increase the unemployment rate by four percent. “The unemployment rate would go to 8%!” she exclaims. No, no, no. You clarify that it would increase by 4 percent not 4 percentage points. From say 4% to 4.15%. “Oh. Still that is a big increase.” Then she says, “But are you sure? How can you tell what will happen in states that haven’t changed their minimum wage?” You respond accurately. “I don’t know.” As you ponder this, you notice everyone is getting dressed for softball.
Second Day
On your way into the building you run into your supervisor. You explain how you were able to beat CBO’s Dynamic Scorers by just one run. She congratulates you and says, “You should plot the relationship between the minimum wage and unemployment.”
After some coffee and some googling you find code to do what your supervisor suggested.
plot(x$Minimum.Wage, x$Unemployment.Rate)
The annoying guy from yesterday breezes by and says. “Oh no. Looks like it is censored. You will need to use a Tobit.” At this, someone a few cubicles away pops up like a meerkat. He says, “No. Don’t use a Tobit, use a Heckit. The data is probably selected.” Then he is gone.
What the heck is a Heckit? What the heck is a Tobit? What the heck is a meerkat?
The Book
The book is designed to help our fictitious intern hero survive the summer in DC.
What Does it Cover?
The book is based on a course I have taught at Johns Hopkins University as part of their Masters of Applied Economics. The book and the course aim to provide an introduction to applied microeconometrics. The goal is for the reader to have competence in using standard tools of microeconometrics including ordinary least squares (OLS), instrumental variables (IV), probits and logits. Not only should you be able to understand how these models work, but more importantly, you should be able to understand when they don’t.
In addition to these standard models, the book and the course introduce important models commonly used in microeconometrics. These include the Tobit and the Heckman selection model (Heckit). The book introduces approaches that have become common in my subfield of empirical industrial organization. These approaches center around the analysis of games. That is, situations where a small number of individuals or firms interact strategically.
Lastly, the book introduces some new techniques that have been developed to analyze panel data models and other situations where an object of interest is measured repeatedly. It discusses difference in difference as well as synthetic controls. It touches on how machine learning techniques can be applied in microeconometrics. The book also introduces to a broader economic audience the ideas of Columbia statistician, Herbert Robbins, called empirical Bayesian analysis.
What is the Approach?
The book teaches microeconometrics through R. It is not primarily aimed at teaching R. Rather, it is primarily aimed at teaching microeconometrics. This idea of using computer programming as a tool of instruction goes back to at least Seymour Papert and MIT’s AI lab in the 1970s.2 South African-born Papert helped develop a programming language called Logo. The goal of Logo was to teach mathematics by programming how a turtle moves around the screen. You may have used one of the offspring of Logo, such as Scratch or Lego Mindstorms.
I learned math through Logo. When I was a pre-teen, Logo became available on the personal computer, the Apple II. My parents taught me to program in Logo and I learned a number of geometric concepts such as Euclidean distance and the Pythagorean theorem by programming up a model of how a moth circled a light source.
The book uses Papert’s ideas to teach microeconometrics. You will learn the math of the estimator and then how to program up that estimator.3 The book makes particular use of the computer’s ability to simulate data. This allows us to compare our estimates to what we know to be the true values. In some cases these simulations illustrate that our estimator is correct; in others, the simulations help us to understand why our estimator is incorrect. Testing models on simulated data has the additional benefit of allowing you to check your programming.
The book is written in RStudio using Sweave. Sweave allows LaTeX to be integrated into R. LaTeX is a free type-setting language that is designed for writing math. Almost all the code that is used in the book is actually presented in the book. On occasion it is more practical to create a data set outside the book. In those cases, the data and the code that created the data are available here: https://sites.google.com/view/microeconometricswithr/table-of-contents. In a couple of other cases, the preferred code does not produce nice output for the book. I have highlighted those cases in the text. I also generally hide repetitive code. For the most part, the coding in the book is in base R. The book makes little use of packages. This shows you the underlying code and illustrates the econometrics. That said, there are a few packages that I really like, including stargazer and xtable, which both make nice tables in LaTeX.
What are POs, DAGs, and Do Operators?
POs, DAGs and Do Operators sound like school-yard put-downs, but they form the core of the book’s approach to econometrics. This approach is heavily influenced by Northwestern econometrician, Charles Manski. I was lucky enough to have Chuck teach me econometrics in my first year of graduate school. It was a mind-altering experience. I had taken a number of econometrics classes as an undergraduate. I thought much of it was bunk. Manski said much of what you learned in standard econometrics classes was bunk. Manski gave meat to the bones of my queasiness with econometrics.
The book focuses on the question of identification. Does the algorithm estimate the parameter we want to know? Other books spend an inordinate amount of time on the accuracy of the parameter estimate or the best procedure for calculating the estimate. This book steps back and asks whether the procedure works at all. Can the data even answer the question? This seems to be fundamental to econometrics, yet it is given short shrift in many presentations.
The book focuses on identifying the causal effect. What happens to the outcome of interest when the policy changes? What happens if college becomes free? What happens if prices are increased? What happens if the federal minimum wage is increased?
To answer these causal questions, the book uses directed acyclic graphs (DAG) and do operators. The book, particularly the early chapters, relies on ideas of Israeli and UCLA computer scientist, Judea Pearl. DAGs help us to understand whether the parameter of interest can be identified from the data we have available. These diagrams can be very useful models of the data generating process. Hopefully, it will be clear that DAGs are models, and as such, they highlight some important issues while suppressing others.
Pearl’s do operator helps illustrate the old statistical chestnut, “correlation is not causality.” Observing the unemployment rate for two different minimum wage laws in two different states is quite different from changing the minimum wage law for one state. In the first instance we observe a statistical quantity, the unemployment rate conditional on the minimum wage law. In the second case we are making a prediction, what will happen to the unemployment rate if the law is changed?
In some cases it is useful to illustrate these issues using the potential outcome (PO) model of former Harvard statistician, Donald Rubin. This model highlights the fundamental identification problem of statistics. We can never observe the difference in state unemployment rates for two different minimum wage laws. Sure, we can observe unemployment rates for two different states with different minimum wage laws. We can even observe the difference in unemployment rates for the same state before and after a change in the minimum wage law. However, we cannot observe the unemployment rate for the same state at the same time with two different minimum wage laws.
In addition, the PO model illustrates that causal effect is not single valued. A policy that encourages more people to attend college may allow many people to earn higher incomes, but it may not help all people. It is even possible that some people are made worse off by the policy. There is a distribution of causal effects.
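To make the potential outcome idea concrete, here is a small simulation sketch of my own (it is not from the book): each state has two potential unemployment rates, one under each minimum wage law, but we only ever observe the one corresponding to the law the state actually has.

# A sketch of the potential outcome model (illustrative only).
set.seed(1)
y0 <- rnorm(5, mean = 4, sd = 0.5)   # unemployment rate under the old law
y1 <- y0 + rnorm(5, mean = 0.2)      # unemployment rate under the new law
d <- rbinom(5, 1, 0.5)               # which law each state actually has
y_obs <- d*y1 + (1 - d)*y0           # the only outcome we ever observe
cbind(y0, y1, d, y_obs)

Note that the difference y1 - y0 varies across the simulated states, which is the sense in which there is a distribution of causal effects.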
What About the Real World?
The course I have taught at Hopkins is for a Masters of Applied Economics. I take the applied part of this seriously. The course and this book aim to show how to do microeconometrics. I have spent my career using data to answer actual policy questions. Did a realtor group’s policies lead to higher prices for housing transactions?4 Did Google’s changes to the search results page harm competitors or help consumers?
The book presents interesting and important questions. One of the most important is measuring “returns to schooling.” What is the causal effect on income of having one more year of school? It is easy to see that people with college degrees earn more than those with high school diplomas. It is much harder to determine if a policy that encourages someone to finish college actually leads that person to earn more money. I throw lots of data, economic theory and statistical techniques at this question. Hopefully, by the end you will see how analysis of survey data with OLS, IV, Heckman selection and GMM models helps us answer this question. You will also see how mixture models can be used to analyze comparisons of twins.
The book discusses important questions beyond returns to schooling. It discusses racism in mortgage lending. It discusses gender bias in labor market earnings. It discusses increasing the federal minimum wage. It discusses the effect of guns on crime. It even discusses punting on fourth down. I hope the book points you to new questions and new data to answer existing questions.
The book does not recommend policies. The government economist and founding Director of the Congressional Budget Office, Alice Rivlin, argued that it is extremely important to provide policy makers with objective analysis. In a memo to staff she said the following.5
We are not to be advocates. As private citizens, we are entitled to our own views on the issues of the day, but as members of CBO, we are not to make recommendations, or characterize, even by implication, particular policy questions as good or bad, wise or unwise.
Economists in government, the private sector and the academy work on important policy questions. I believe that economists are most effective when they do not advocate for policy positions, but present objective analysis of the economics and the data. I hope that this book presents objective analysis of interesting policy questions and you have no idea whether I think particular policy positions are good or bad, wise or unwise.6
The Outline
The book’s twelve chapters are broken into three parts based on the main approach to identification. The first part presents methods that rely on the existence of an experiment. This part includes chapters covering ordinary least squares (OLS), instrumental variables (IV), randomized controlled trials (RCTs) and Manski bounds. The second part presents methods that rely on economic theory to identify parameters of interest. This is often referred to as a structural approach. These chapters discuss demand models and discrete estimators such as logits and probits, censored and selection models, non-parametric auction models and generalized method of moments (GMM). The third part presents methods that rely on the existence of repeated measurement in the data. These methods include difference in difference, fixed effects, synthetic controls and factor models.
Experiments
The first four chapters rely on experiments, broadly construed, to identify the causal effects of policies.
Chapter 1 introduces the work-horse algorithm of economics, ordinary least squares (OLS). This model is simple and quick to estimate and often produces reasonable results. The chapter illustrates how the model disentangles the effect of the policy variable from other influences on the outcome of interest. OLS relies on strong assumptions. In particular, the model assumes that the policy variable of interest affects the outcome independently of any unobserved term.
Chapter 2 considers how additional observed characteristics improve our estimates. It shows when adding more control variables improves the estimation and when it produces garbage. The chapter discusses the problem of multicollinearity. It discusses an alternative to the standard approach based on the work of Judea Pearl. The chapter replicates the OLS model used in Card (1995) to estimate returns to schooling. The chapter uses a DAG and Pearl’s approach to help determine whether there exists evidence of systematic racism in mortgage lending.
Chapter 3 introduces the instrumental variables model. This model allows the independence assumption to be weakened. The model allows the policy variable to be affected by unobserved characteristics that also determine the outcome. The chapter presents IV estimates of returns to schooling by replicating Card (1995). DAGs are used to illustrate and test the assumptions. The Local Average Treatment Effect (LATE) is proposed as an estimator when the researcher is unwilling to assume the treatment affects each person identically.
Chapter 4 considers formal experiments. The ideal randomized controlled trial allows the researcher to estimate the average effect of the policy variable. It also allows the researcher to bound the distribution of effects using Kolmogorov bounds. The method is used to bound the effect of commitment savings devices on increasing or decreasing savings. The chapter presents Manski’s natural bounds and discusses inference when the data does not come from ideal randomized controlled trials. It considers the problem of estimating the causal effect of guns on crime using variations in state gun laws.
Structural Estimation
The first four chapters consider questions and issues relevant to economics, but describe standard estimation methods. Chapters 5 to 9 use economic theory directly in the estimation methods.
Chapter 5 introduces revealed preference. The chapter shows how this idea is used to infer unobserved characteristics of individual economic actors. Berkeley econometrician, Dan McFadden, pioneered the idea of using economic theory in his analysis of how people would use the new (at the time) Bay Area Rapid Transit (BART) system. This chapter introduces standard tools of demand analysis including the logit and probit models. It takes these tools to the question of whether smaller US cities should invest in urban rail infrastructure.
Chapter 6 also uses revealed preference, but this time to analyze labor markets. Chicago’s Jim Heckman shared the Nobel prize with Dan McFadden for their work on revealed preference. In McFadden’s model you, the econometrician, do not observe the outcome from any choice, just the choice that was made. In Heckman’s model you observe the outcome from the choice that was made, but not the outcome from the alternative. The chapter describes the related concepts of censoring and selection, as well as their model counterparts the Tobit and Heckit. The section uses these tools to analyze gender differences in wages and returns to schooling.
Chapter 7 returns to the question of estimating demand. This time it allows the price to be determined as the outcome of market interactions by a small number of firms. This chapter considers the modern approach to demand analysis developed by Yale economist, Steve Berry. This approach combines game theory with IV estimation. The estimator is used to determine the value of Apple Cinnamon Cheerios.
Chapter 8 uses game theory and the concept of a mixed strategy Nash equilibrium to reanalyze the work of Berkeley macroeconomist, David Romer. Romer used data on decision making in American football to argue that American football coaches are not rational. In particular, coaches may choose to punt too often on fourth down. Reanalysis finds the choice to punt to be generally in line with the predictions of economic theory. The chapter introduces the generalized method of moments (GMM) estimator developed by the University of Chicago’s Lars Peter Hansen.
Chapter 9 considers the application of game theory to auction models. The book considers the GPV model of first price (sealed-bid) auctions and the Athey-Haile model of second price (English) auctions. GPV refers to the paper by Emmanuel Guerre, Isabelle Perrigne and Quang Vuong, Optimal Nonparametric Estimation of First-Price Auctions published in 2000. The paper promoted the idea that auctions, and structural models more generally, can be estimated in two steps. In the first step, standard statistical methods are used to estimate statistical parameters of the auction. In the second step, economic theory is used to back out the underlying policy parameters. For second-price auctions, the chapter presents Athey and Haile (2002). Stanford’s Susan Athey and Yale’s Phil Haile provide a method for analyzing auctions when only some of the information is available. In particular, they assume that the econometrician only knows the price and the number of bidders. These methods are used to analyze timber auctions and determine whether the US Forestry Service had legitimate concerns about collusion in the 1970s logging industry.
Repeated Measurement
Chapters 10 to 12 consider data with repeated measurement. Repeated measurement has two advantages. First, it allows the same individual to be observed facing two different policies. This suggests that we can measure the effect of the policy as the difference in observed outcomes. Second, repeated measurement allows the econometrician to infer unobserved differences between individuals. We can measure the value of a policy that affects different individuals differently.
Chapter 10 considers panel data models. Over the last 25 years, the difference in difference estimator has become one of the most used techniques in microeconometrics. The chapter covers difference in difference and the standard fixed effects model. These methods are used to analyze the impact of increasing the minimum wage. The chapter replicates David Card and Alan Krueger’s famous work on the impact of increasing the minimum wage in New Jersey on restaurant employment. The chapter also measures the impact of the federal increase in the minimum wage that occurred in the late 2000s. The chapter follows Currie and Fallick (1996) and uses fixed effects and panel data from the National Longitudinal Survey of Youth 1997.
Chapter 11 considers a more modern approach to panel data analysis. Instead of assuming that time has the same effect on everyone, the chapter considers various methods for creating synthetic controls. It introduces the approach of Abadie, Diamond, and Hainmueller (2010) as well as alternative approaches based on regression regularization and convex factor models. It discusses the benefits and costs of these approaches and compares them using NLSY97 to measure the impact of the federal increase in the minimum wage in the late 2000s.
Chapter 12 introduces mixture models. These models are used throughout microeconometrics, but they are particularly popular as a way to solve measurement error issues. The chapter explains how these models work. It shows that they can be identified when the econometrician observes at least two signals of the underlying data process of interest. The idea is illustrated estimating returns to schooling for twins. The chapter returns to the question of the effect of New Jersey’s minimum wage increase on restaurant employment. The mixture model is used to suggest that the minimum wage increase reduced employment for small restaurants, consistent with economic theory.
Technical Appendices
The book has two technical appendices designed to help the reader to go into more depth on some issues that are not the focus of the book.
Appendix A presents statistical issues, including assessing the value of estimators using measures of bias, consistency and accuracy. It presents a discussion of the two main approaches to finding estimators, the classical method and the Bayesian method. It discusses standard classical ideas based on the Central Limit Theorem and a more recent innovation known as bootstrapping. The Bayesian discussion includes both standard ideas and Herbert Robbins’ empirical Bayesian approach. Like the rest of the book, this appendix shows how you can use these ideas but also gives the reader some insight on why you would want to use them. The appendix uses the various approaches to ask whether John Paciorek was better than Babe Ruth.
Appendix B provides more discussion of R and various programming techniques. The appendix discusses how R is optimized for analysis of vectors, and the implications for using loops and optimization. The chapter discusses various objects that are used in R, basic syntax and commands as well as basic programming ideas including if () else, for () and while () loops. The appendix discusses how matrices are handled in R. It also provides a brief introduction to optimization in R.
Notation
As you have seen above, the book uses particular fonts and symbols for various important things. It uses the symbol R to refer to the scripting language. It uses typewriter font to represent code in R. Initial mentions of an important term are in bold face font. In discussing the data analysis, it uses standard mathematical notation for variables and parameters, which is introduced as it is needed.
Hello R World
To use this book you need to download R and RStudio on your computer. Both are free.
Download R and RStudio
First, download the appropriate version of RStudio here: https://www.rstudio.com/products/rstudio/download/#download. Then you can download the appropriate version of R here: https://cran.rstudio.com/.
Once you have the two programs downloaded and installed, open up RStudio. To open up a script go to File > New File > R Script. You should have four windows: a script window, a console window, a global environment window, and a window with help, plots and other things.
Using the Console
Go to the console window and click on the >. Then type print("Hello R World") and hit enter. Remember to use the quotes. In general, R functions have the same basic syntax, functionname with parentheses, and some input inside the parentheses. Inputs in quotes are treated as text while inputs without quotes are treated as variables.
print("Hello R World")
[1] "Hello R World"
Try something a little more complicated.
a <- "Chris" # or write your own name
print(paste("Welcome",a,"to R World",sep=" "))
[1] "Welcome Chris to R World"
Here we are creating a variable called a. To define this variable we use the <- symbol, which means “assign.” It is possible to use = but that is generally frowned upon. I really don’t know why it is done this way. However, when writing this out it is important to include the appropriate spaces. It should be a <- "Chris" rather than a<-"Chris". Not having the correct spacing can lead to errors in your code. Note that # is used in R to “comment out” code. R does not read anything on a line after the hash.
In R we can place one function inside another function. The function paste is used to join text and variables together. The input sep = " " is used to place a space between the elements that are being joined together. When placing one function inside another make sure to keep track of all of the parentheses. A common error is to have more or fewer closing parentheses than opening parentheses.
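For instance, a slightly more nested version of the greeting (my example, not the book’s) follows the same pattern; count the parentheses to make sure each opening one is matched by a closing one.

print(paste("Welcome", toupper(a), "to R World", sep = " "))
# toupper() is nested inside paste(), which is nested inside print()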
A Basic Script
In the script window name your script. I usually name the file something obvious like Chris.R. You can use your own name unless it is also Chris.
# Chris.R
Note that this line is commented out, so it does nothing. To actually name your file you need to go to File > Save As and save it to a folder. When I work with data, I save the file to the same folder as the data. I then go to Session > Set Working Directory > To Source File Location. This sets the working directory to be the same as your data. It means that you can read and write to the folder without complex path names.
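If you want to confirm where R will look for files, you can run the following line in the console (a quick check of my own; it is not required).

getwd() # prints the current working directory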
Now you have the script set up. You can write into it.
# Chris.R
# Import data
x <- read.csv("minimum wage.csv", as.is = TRUE)
# the data can be imported from here:
# https://sites.google.com/view/microeconometricswithr/
# table-of-contents
# Summarize the data
summary(x)
State Minimum.Wage Unemployment.Rate
Length:51 Min. : 0.000 Min. :2.100
Class :character 1st Qu.: 7.250 1st Qu.:3.100
Mode :character Median : 8.500 Median :3.500
Mean : 8.121 Mean :3.641
3rd Qu.:10.100 3rd Qu.:4.100
Max. :14.000 Max. :6.400
# Run OLS
lm1 <- lm(Unemployment.Rate ~ Minimum.Wage, data = x)
summary(lm1)[4]
$coefficients
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.49960275 0.32494078 10.7699709 1.62017e-14
Minimum.Wage 0.01743224 0.03733321 0.4669365 6.42615e-01
# the 4th element provides a nice table.
To run this script, you can go to Run > Run All. The first line imports the data. You can find the data at the associated website for the book. The data has three variables: the State, their June 2019 minimum wage, and their June 2019 unemployment rate. To see these variables, you can run summary(x).
To run a standard regression use the lm() function. I call the object lm1. On the left-hand side of the tilde is the Unemployment.Rate variable, and on the right-hand side is the Minimum.Wage variable. If we use the option data = x, we can just use the variable names in the formula for the lm() function.
The object, lm1, contains all the information about the regression. You can run summary(lm1) to get a standard regression table.7
Discussion and Further Reading
The book is a practical guide to microeconometrics. It is not a replacement for a good textbook such as Cameron and Trivedi (2005) or classics like Goldberger (1991) or Greene (2000). The book uses R to teach microeconometrics. It is a complement to other books on teaching R, particularly in the context of econometrics, such as Kleiber and Zeileis (2008). For more discussion of DAGs and do operators, I highly recommend Judea Pearl’s Book of Why (Pearl and Mackenzie 2018). To learn about programming in R, I highly recommend Kabacoff (2011). That book is written by a statistician working in a computer science department. Paarsch and Golyaev (2016) is an excellent companion to anyone starting out doing computer intensive economic research.
Ordinary Least Squares
Introduction
Ordinary least squares (OLS) is the work-horse model of microeconometrics. It is quite simple to estimate. It is straightforward to understand. It presents reasonable results in a wide variety of circumstances.
The chapter uses OLS to disentangle the effect of the policy variable from other unobserved characteristics that may be affecting the outcome of interest. The chapter describes how the model works and presents the two standard algorithms, one based on matrix algebra and the other based on least squares. It uses OLS on actual data to estimate the value of an additional year of high school or college.
Estimating the Causal Effect
You have been asked to evaluate a policy which would make public colleges free in the state of Vermont. The Green Mountain state is considering offering free college education to residents of the state. Your task is to determine whether more Vermonters will attend college and whether those additional attendees will be better off. Your boss narrows things down to the following question. Does an additional year of schooling cause people to earn more money?
The section uses simulated data to illustrate how averaging is used to disentangle the effect of the policy variable from the effect of the unobserved characteristics.
Graphing the Causal Effect
Your problem can be represented by the causal graph in Figure 1. The graph shows an arrow from the policy variable, years of schooling, to the outcome, income, and a separate arrow from the unobserved characteristics to income.
If the available data is represented by the causal graph in Figure 1, can it be used to estimate the causal effect of schooling on income?
A Linear Causal Model
Consider a simple model of our problem. Individual $i$’s outcome is determined by the linear equation

$$y_i = a + b x_i + u_i$$

where $y_i$ is the outcome of interest (income), $x_i$ is the policy variable (years of schooling), $u_i$ is an unobserved characteristic, and $a$ and $b$ are the parameters of the model. Our goal is to estimate these parameters from the data we have.
Simulation of the Causal Effect
In the simulated data, we have a true linear relationship between $x$ and $y$: the intercept is $a = 2$, the slope is $b = 3$, and the unobserved characteristic $u$ is drawn from a standard normal distribution.
# Create a simulated data set
set.seed(123456789)
# use to get the exact same answer each time the code is run.
N <- 100
# Set N to 100, to represent the number of observations.
a <- 2
b <- 3
# model parameters of interest
# Note the use of <- to mean "assign".
x <- runif(N)
# create a vector where the observed characteristic, x,
# is drawn from a uniform distribution.
u <- rnorm(N)
# create a vector where the unobserved characteristic,
# u, is drawn from a standard normal distribution.
y <- a + b*x + u
# create a vector y
# * allows a single number to be multiplied through
# the whole vector
# + allows a single number to be added to the whole vector
# or for two vectors of the same length to be added together.
A computer does not actually generate random numbers. It generates pseudo-random numbers. These numbers are derived from a deterministic function. If you know the function and the current number, then you know exactly what the next number will be. The book takes advantage of this process by using the R function set.seed() to generate exactly the same numbers every time.
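You can see this for yourself with a quick side check (my addition, not part of the chapter’s script): resetting the seed reproduces exactly the same draws.

set.seed(123456789)
runif(3)
set.seed(123456789)
runif(3) # the same three numbers appear again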
plot(x,y) # creates a simple plot
abline(a = 2,b = 3) # adds a linear function to the plot.
# a - intercept, b - slope.
Figure 2 presents the true relationship between $x$ and $y$. The points are scattered around the line $y = 2 + 3x$ because of the unobserved term.
Averaging to Estimate the Causal Effect
Figure 2 suggests that we can use averaging to disentangle the effect of $x$ from the effect of the unobserved term. If we compare the average value of $y$ for observations where $x$ is close to 1 with the average value for observations where $x$ is close to 0, the difference should be approximately the slope $b$.
mean(y[x > 0.95]) - mean(y[x < 0.05])
[1] 3.626229
# mean takes an average
# the logical expression inside the square brackets
# creates an index for the elements of y where the logical
# expression in x holds.
We can do this with the simulated data. In both cases, we are “close” to but not exactly at 0 or 1, and only a handful of observations fall in each tail. The result is that we find a value of 3.63 rather than the true slope of 3.
We disentangle the effect of $x$ from the effect of the unobserved term by averaging, because averaging pushes the unobserved term toward its mean of 0.
Assumptions of the OLS Model
The major assumptions of the OLS model are that unobserved characteristics enter independently and additively. The first assumption states that, conditional on observed characteristics (the $x$’s), the unobserved characteristics are determined independently of the policy variable.
To understand the independence assumption, it is helpful to go back to the original problem. We are interested in determining the economic value of attending college. Our estimated model of the effect of schooling on income allows the unobserved characteristics to determine income. Importantly, the model does not allow unobserved characteristics to affect both schooling and income. The model does not allow students from wealthy families to be more likely to go to college and get a good job due to their family background.
The second assumption states that unobserved characteristics enter the model additively.8 The implication is that the effect of the policy cannot vary with unobserved characteristics. Attending college increases everyone’s income by the same amount. This assumption allows the effect of $x$ on $y$ to be separated from the effect of the unobserved characteristics by averaging.
The simulated data set satisfies these two assumptions. The unobserved characteristic, $u$, is drawn independently of $x$ and enters the equation for $y$ additively.
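As a quick illustration of what goes wrong when the independence assumption fails (my own sketch, not code from the book), let the unobserved term also shift the observed characteristic and re-run OLS. The names u2, x2 and y2 are mine; N, a and b come from the simulation above.

u2 <- rnorm(N)
x2 <- runif(N) + 0.5*u2 # the observed characteristic now depends on u2
y2 <- a + b*x2 + u2
lm(y2 ~ x2)$coefficients # the slope estimate is typically well above the true value of 3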
Matrix Algebra of the OLS Model
The chapter presents two algorithms for solving the model and revealing the parameter estimates: the algebraic algorithm and the least squares algorithm.9 This section uses matrix algebra to derive and program up the OLS estimator in R.10
Standard Algebra of the OLS Model
Let’s simplify the problem. Consider the model above and suppose the intercept is known; in the simulation, $a = 2$.

$$y_i = 2 + b x_i + u_i$$

The parameter of interest ($b$) is then the only unknown parameter. Rearranging the equation highlights two problems. The first problem is that the observed terms ($y_i$ and $x_i$) vary across individuals, while there is only one parameter $b$. The second problem is that the unobserved term ($u_i$) also varies across individuals and, by definition, is never observed.

Luckily, we can “kill two birds with one stone.” As the equation must hold for each individual $i$, it must also hold when we sum across individuals.

$$\sum_{i=1}^{N} y_i = \sum_{i=1}^{N} \left(2 + b x_i + u_i\right)$$

We use the notation $\sum_{i=1}^{N}$ to denote the sum over the $N$ individuals in the data. Summing through on the right-hand side, we have the slightly simplified equation below.

$$\sum_{i=1}^{N} y_i = 2N + b \sum_{i=1}^{N} x_i + \sum_{i=1}^{N} u_i$$

where $N$ is the number of individuals observed in the data. Dividing by $N$ gives an equation in averages, with $\bar{y} = \frac{1}{N}\sum_{i=1}^{N} y_i$ and similarly for $\bar{x}$ and $\bar{u}$.

$$\bar{y} = 2 + b \bar{x} + \bar{u}$$

Unfortunately, we cannot actually solve this equation and determine $b$ because we do not observe $\bar{u}$, the average of the unobserved terms.

However, we can estimate $b$. If the unobserved terms average out to something close to 0, we can drop $\bar{u}$ and solve for $b$.

The estimate of $b$ is

$$\hat{b} = \frac{\bar{y} - 2}{\bar{x}}$$

How close is our estimate to the true value of interest? How close is $\hat{b}$ to $b$?
Algebraic OLS Estimator in R
We can use the algebra presented above as pseudo-code for our first estimator in R.
b_hat <- (mean(y) - 2)/mean(x)
b_hat
[1] 3.065963
Why does this method not give a value closer to the true value? The method gives an estimate of 3.07, but the true value is 3. One reason may be that the sample size is not very large. You can test this by running the simulation again with a larger value of $N$.
Using Matrices
In general, of course, we do not know the intercept, so we need to estimate both $a$ and $b$. Writing out the model for each observation represents the data as a system of 100 linear equations, one for each individual:

$$y_i = a + b x_i + u_i, \quad i = 1, \dots, 100$$

Using matrix algebra, we can rewrite the system of equations. Notice how matrix multiplication is done. Standard matrix multiplication follows the rule below: the $(i,j)$ element of the product $AB$ is formed by multiplying row $i$ of $A$ element by element with column $j$ of $B$ and summing,

$$(AB)_{ij} = \sum_{k} A_{ik} B_{kj}$$

so the number of columns of $A$ must equal the number of rows of $B$. Check what happens if you rearrange the order of the matrices. Do you get the same answer?14
Multiplying Matrices in R
We now illustrate matrix multiplication with R.
x1 = x[1:5]
# only include the first 5 elements
X1 = cbind(1,x1)
# create a matrix with a column of 1s
# cbind means column bind -
# it joins columns of the same length together.
# It returns a "matrix-like" object.
# Predict value of y using the model
X1%*%c(2,3)
[,1]
[1,] 3.513248
[2,] 2.845124
[3,] 4.375409
[4,] 3.268176
[5,] 3.811467
# See how we can add and multiply vectors and numbers in R.
# In R %*% represents standard matrix multiplication.
# Note that R automatically assumes c(2,3) is a column vector
# Compare to the true values
y[1:5]
[1] 4.423434 2.597724 4.274064 2.856203 4.273605
In the simulated data we see the relationship between the predicted values and the observed values of $y$: the two differ only because of the unobserved term $u$.
Matrix Estimator of OLS
It is a lot more compact to represent the system of equations above with matrix notation.

$$y = X\beta + u$$

where $y$ is the $N \times 1$ vector of outcomes, $X$ is the $N \times 2$ matrix with a column of 1’s and a column containing $x$, $\beta = (a, b)'$ is the vector of parameters, and $u$ is the $N \times 1$ vector of unobserved terms.

To solve the system we can use the same “division” idea that we used for standard algebra. In matrix algebra, we do division by multiplying by the inverse of the matrix ($X^{-1}$).

The problem is that our matrix is not invertible. Only full-rank square matrices are invertible and our matrix is not even square.16 The solution is to create a generalized matrix inverse.

We can make our matrix ($X$) square by pre-multiplying it by its transpose ($X'$). The matrix $X'X$ is a $2 \times 2$ square matrix and, in our simulation, it is full rank, so it can be inverted.

To solve for the parameters of the model ($\beta$), pre-multiply both sides of the equation by $(X'X)^{-1}X'$.

$$(X'X)^{-1}X'y = (X'X)^{-1}X'X\beta + (X'X)^{-1}X'u$$

Simplifying and rearranging we have the following linear algebra derivation of the model.

$$\beta = (X'X)^{-1}X'y - (X'X)^{-1}X'u$$

From this we have the matrix algebra based estimate of our model.

$$\hat{\beta} = (X'X)^{-1}X'y$$

Remember that we never observe $u$, so the estimator leaves out the last term. We didn’t just drop the unobserved term, we averaged over it. If the assumptions of OLS hold, then the summation $(X'X)^{-1}X'u$ will be close to zero.
Matrix Estimator of OLS in R
We can follow a similar procedure for inverting a matrix using R. The matrix of explanatory variables includes a first column of 1’s which accounts for the intercept term.
X <- cbind(1,x)
# remember the column of 1's
Next we need to transpose the matrix. To illustrate a matrix transpose, consider a small 3 x 2 matrix A.
A <- matrix(c(1:6),nrow=3)
# creates a 3 x 2 matrix.
A
[,1] [,2]
[1,] 1 4
[2,] 2 5
[3,] 3 6
# See how R numbers elements of the matrix.
t(A)
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 4 5 6
# transpose of matrix A
t(A)%*%A
[,1] [,2]
[1,] 14 32
[2,] 32 77
# matrix multiplication of the transpose by itself
In our problem the transpose multiplied by itself, t(X)%*%X, gives the following 2 x 2 matrix.
t(X)%*%X
x
100.00000 46.20309
x 46.20309 28.68459
In R, the matrix inverse can be found using the solve() function.
solve(t(X)%*%X)
x
0.03909400 -0.06296981
x -0.06296981 0.13628919
# solves for the matrix inverse.
The matrix algebra presented above is pseudo-code for the operation in R.
beta_hat <- solve(t(X)%*%X)%*%t(X)%*%y
beta_hat
[,1]
1.939289
x 3.197364
Our estimates are $\hat{a} = 1.94$ and $\hat{b} = 3.20$, compared to the true values of 2 and 3.
We can check that we averaged over the unobserved term to get something close to 0.
solve(t(X)%*%X)%*%t(X)%*%u
[,1]
-0.06071137
x 0.19736442
Least Squares Method for OLS
An alternative algorithm is least squares. This section presents the algebra of least squares and compares the linear model (lm()) estimator to the matrix algebra version derived above.
Moment Estimation
Least squares requires that we assume the unobserved characteristic has a mean of 0. We say that the first moment of the unobserved characteristic is 0. A moment refers to the expectation of a random variable taken to some power. The first moment is the expectation of the variable taken to the power of 1. The second moment is the expectation of the variable taken to the power of 2, etc.
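As a small illustration (my addition, not from the text), the sample analogs of the first two moments of the simulated unobserved characteristic can be computed directly.

mean(u)   # sample analog of the first moment; close to 0
mean(u^2) # sample analog of the second moment; close to 1 for a standard normal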
From the model we can rearrange and substitute into this moment condition. The assumption states that the expected difference, or mean difference, between $y$ and the predicted value of $y$ ($a + b x$) is 0.

$$E(y - a - b x) = 0$$

The sample analog of the left-hand side is the following average.

$$\frac{1}{N}\sum_{i=1}^{N} (y_i - a - b x_i)$$

The sample equivalent of the mean is the average. This approach to estimation is called analog estimation (Charles F. Manski 1988). We can make this number as close to zero as possible by minimizing the sum of squares, that is, finding the “least squares.”
Algebra of Least Squares
Again it helps to illustrate the algorithm by solving the simpler problem. Consider a version of the problem in which we are just trying to estimate the slope $b$, with the intercept known to be 2. We want to find the value of $b$ that makes the sum of squared differences as small as possible.

$$\min_b \sum_{i=1}^{N} (y_i - 2 - b x_i)^2$$

We can solve for the minimizing value of $b$ by taking the derivative with respect to $b$ and setting it equal to 0.

$$-2 \sum_{i=1}^{N} x_i (y_i - 2 - b x_i) = 0$$

Note that we can divide both sides by -2, and by $N$, giving the following rearrangement.

$$\frac{1}{N}\sum_{i=1}^{N} x_i y_i - 2\,\frac{1}{N}\sum_{i=1}^{N} x_i - b\,\frac{1}{N}\sum_{i=1}^{N} x_i^2 = 0$$

Our estimate of the relationship is then

$$\hat{b} = \frac{\frac{1}{N}\sum_{i=1}^{N} x_i y_i - 2\,\frac{1}{N}\sum_{i=1}^{N} x_i}{\frac{1}{N}\sum_{i=1}^{N} x_i^2}$$

which is what the code below computes.
Estimating Least Squares in R
We can program the least squares estimator in two ways. First, we can minimize the sum of squared differences directly. Second, we can use the solution to the first order condition derived above.
optimize(function(b) sum((y - 2 - b*x)^2), c(-10,10))$minimum
[1] 3.099575
# optimize() is used when there is one variable.
# note that the function can be defined on the fly
# $minimum presents one of the outcomes from optimize()
We can use optimize() to find a single optimal value of a function. We can use function() to create the sum of squared difference function, which is the first argument of optimize(). The procedure searches over the interval given by the second argument; here it searches from -10 to 10 (c(-10,10)).22 It looks for the value that minimizes the sum of the squared differences. The result is 3.10. Why do you think this is not exactly the true value of 3?23
Alternatively, we can use the first order condition.
(mean(x*y) - 2*mean(x))/mean(x*x)
[1] 3.099575
Solving out the first order condition in R gives an estimate of 3.10. Why do these two approaches give identical answers?
The lm() Function
The standard method for estimating OLS in R is to use the lm() function. This function creates an object in R. This object keeps track of various useful things such as the vector of parameter estimates.
data1 <- as.data.frame(cbind(y,x))
# creates a data.frame() object which will be used in the
# next section.
lm1 <- lm(y ~ x)
# lm creates a linear model object
length(lm1) # reports the number of elements of the list object
[1] 12
names(lm1) # reports the names of the elements
[1] "coefficients" "residuals" "effects" "rank"
[5] "fitted.values" "assign" "qr" "df.residual"
[9] "xlevels" "call" "terms" "model"
We can compare the answers from the two algorithms, the lm() procedure and the matrix algebra approach.
lm1$coefficients
(Intercept) x
1.939289 3.197364
# reports the coefficient estimates
# $ can be used to call a particular element from the list.
# lm1[1] reports the same thing.
t(beta_hat) # results from the matrix algebra.
x
[1,] 1.939289 3.197364
Comparing the two different approaches in R, we can see they give identical estimates. This is not coincidental. The solve() function is based on a least squares procedure. It turns out there is no real computational difference between an “algebra” approach and a “least squares” approach in R.
Measuring Uncertainty
How do we give readers a sense of how accurate our estimate is? The previous section points out that the estimated parameter is not equal to the true value and may vary quite a bit from the true value. Of course, in the simulations we know exactly what the true value is. In real world econometric problems, we do not.
This section provides a brief introduction to some of the issues that arise when thinking about reporting the uncertainty around our estimate. Appendix A discusses these issues in more detail.
Data Simulations
Standard statistical theory assumes that the data we observe is one of many possible samples. Luckily, with a computer we can actually see what happens if we observe many possible samples. The simulation is run 1,000 times. In each case a sample of 100 is drawn using the same parameters as above. In each case, OLS is used to estimate the two parameters, $a$ and $b$.
set.seed(123456789)
K <- 1000
a <- 2
b <- 3
sim_res <- matrix(NA,K,2)
# creates a 1000 x 2 matrix filled with NAs.
# NA denotes a missing number in R.
# I prefer using NA rather than an actual number like 0.
# It is better to create the object to be filled
# prior to running the loop.
# This means that R has the object in memory.
# This simple step makes loops in R run a lot faster.
for (k in 1:K) { # the "for loop" starts at 1 and moves to K
  x <- runif(N)
  u <- rnorm(N, mean = 0, sd = 2)
  y <- a + b*x + u
  sim_res[k,] <- lm(y ~ x)$coefficients
  # inputs the coefficients from simulated data into the
  # kth row of the matrix.
  # print(k)
  # remove the hash to keep track of the loop
}
colnames(sim_res) <- c("Est of a", "Est of b")
# labels the columns of the matrix.
require(knitr)
Loading required package: knitr
# summary produces a standard summary of the matrix.
sum_tab <- summary(sim_res)
rownames(sum_tab) <- NULL # no row names.
# NULL creates an empty object in R.
kable(sum_tab)
Est of a | Est of b |
---|---|
Min. :0.7771 | Min. :0.5079 |
1st Qu.:1.7496 | 1st Qu.:2.5155 |
Median :2.0295 | Median :2.9584 |
Mean :2.0260 | Mean :2.9666 |
3rd Qu.:2.3113 | 3rd Qu.:3.4122 |
Max. :3.5442 | Max. :5.9322 |
Table 1 summarizes the results of the simulation. On average the estimated parameters are quite close to the true values, differing by 0.026 and 0.033 respectively. These results suggest the estimator is unbiased. An estimator is unbiased if it is equal to the true value on average. The OLS estimator is unbiased in this simulation because the unobserved term is drawn from a distribution with a mean of 0.
The table also shows that our estimates could differ substantially from the true value. At worst, the estimated value of the slope is as low as 0.51 or as high as 5.93, even though the true value is 3.
Introduction to the Bootstrap
It would be great to present the analysis above for any estimate. Of course we cannot do that because we do not know the true values of the parameters. The Stanford statistician, Brad Efron, argues we can use the analogy principle. We don’t know the true distribution but we do know its sample analog, which is the sample itself.
Efron’s idea is called bootstrapping. And no, I have no idea what a bootstrap is. It refers to the English saying, “to pull oneself up by one’s bootstraps.” In the context of statistics, it means to use the statistical sample itself to create an estimate of how accurate our estimate is. Efron’s idea is to repeatedly draw pseudo-samples from the actual sample, randomly and with replacement, and then for each pseudo-sample re-estimate the model.
If our sample is pretty large then it is a pretty good representation of the true distribution. We can create a sample analog version of the thought exercise above. We can recreate an imaginary sample and re-estimate the model on that imaginary sample. If we do this a lot of times we get a distribution of pseudo-estimates. This distribution of pseudo-estimates provides us with information on how uncertain our original estimate is.
Bootstrap in R
We can create a bootstrap estimator in R. First, we create a simulated sample data set. The bootstrap code draws repeatedly from the simulated sample to create 1000 pseudo-samples. The code then creates summary statistics of the estimated results from each of the pseudo-samples.
set.seed(123456789)
K <- 1000
bs_mat <- matrix(NA,K,2)
for (k in 1:K) {
  index_k <- round(runif(N,min=1,max=N))
  # creates a pseudo-random sample.
  # draws N elements uniformly between 1 and N.
  # rounds all the elements to the nearest integer.
  data_k <- data1[index_k,]
  bs_mat[k,] <- lm(y ~ x, data=data_k)$coefficients
  # print(k)
}
tab_res <- matrix(NA,2,4)
tab_res[,1] <- colMeans(bs_mat)
# calculates the mean for each column of the matrix.
# inputs into the first column of the results matrix.
tab_res[,2] <- apply(bs_mat, 2, sd)
# a method for having the function sd act on each
# column of the matrix. Dimension 2 is the columns.
# sd calculates the standard deviation.
tab_res[1,3:4] <- quantile(bs_mat[,1],c(0.025,0.975))
# calculates the quantiles of each column at 2.5% and 97.5%
# and places them in that parameter's row of the table.
tab_res[2,3:4] <- quantile(bs_mat[,2],c(0.025,0.975))
colnames(tab_res) <- c("Mean", "SD", "2.5%", "97.5%")
rownames(tab_res) <- c("Est of a","Est of b")
# labels the rows of the matrix.
kable(tab_res)
Table 2 presents the bootstrapped estimates of the OLS model on the simulated data. The table presents the mean of the estimates from the pseudo-samples, the standard deviation of the estimates and the range which includes 95% of the cases. It is good to see that the true values do lie in the 95% range.
Standard Errors
kable(summary(lm1)[4])
Table 3 presents the standard results that come out of the lm() procedure in R. Note that these are not the same as the bootstrap estimates.24 The second column provides information on how uncertain the estimates are. It gives the standard deviation of the imaginary estimates assuming that the imaginary estimates are normally distributed. The last two columns present information which may be useful for hypothesis testing. This is discussed in detail in Appendix A.
The lm() procedure assumes the difference between the true value and the estimated value is distributed normally. That turns out to be the case in this simulation. In the real world there may be cases where it may be less reasonable to make this assumption.
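One way to see how the two approaches line up (my own comparison, not from the text) is to put the lm() standard errors next to the bootstrap standard deviations from the previous section.

cbind(lm = summary(lm1)$coefficients[,2],
      bootstrap = apply(bs_mat, 2, sd))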
Returns to Schooling
Now that we have the basics of OLS, we can start on a real problem. One of the most studied areas of microeconometrics is returns to schooling (Card 2001). Do policies that encourage people to get more education improve economic outcomes? One way to answer this question is to use survey data to determine how much education a person received and how much income they earned.
Berkeley labor economist David Card analyzed this question using survey data on men aged between 14 and 24 in 1966 (Card 1995). The data set provides information on each man’s years of schooling and income. These are measured ten years later, in 1976.
A Linear Model of Returns to Schooling
Card posits that income in 1976 is determined by the individual’s years of schooling.

$$\log(wage_i) = a + b \cdot ed_i + u_i$$

In this equation, log income in 1976 for individual $i$ depends on his years of schooling, an unobserved characteristic $u_i$, and the parameters $a$ and $b$.
NLSM Data
The National Longitudinal Survey of Older and Younger Men (NLSM) is a survey data set used by David Card. Card uses the young cohort who were in their late teens and early twenties in 1966. Card’s version can be downloaded from his website, http://davidcard.berkeley.edu/data_sets.html.25 The original data is a .dat file. The easiest way to wrangle the data is to import it into excel as a “fixed width” file and then copy the variable names from the codebook file. You can then add the variable names to the first line of the excel file and save it as a .csv file. In the code below I named the file nls.csv.26
x <- read.csv("nls.csv", as.is=TRUE)
# I follow the convention of defining any data set as x
# (or possibly y or z).
# I set the working directory to the one where my main
# script is saved, which is the same place as the data.
# Sessions -> Set Working Directory -> Source File Location
# It is important to add "as.is = TRUE",
# otherwise R may change your variables into "factors"
# which is confusing and leads to errors.
# factors can be useful when you want to instantly create
# a large number of dummy variables from a single variable.
# However, it can make things confusing if you don't realize
# what R is doing.
x$wage76 <- as.numeric(x$wage76)

Warning: NAs introduced by coercion

x$lwage76 <- as.numeric(x$lwage76)
Warning: NAs introduced by coercion
# changes format from string to number.
# "el wage 76" where "el" is for "log"
# Logging helps make OLS work better. This is because wages
# have a skewed distribution, and log of wages do not.
x1 <- x[is.na(x$lwage76)==0,]
# creates a new data set
# with missing values removed
# is.na() determines missing value elements ("NA"s)
After reading in the data, the code changes the format of the data. Because some observations have missing values, the variables will import as strings. Changing the string to a number creates a missing value code, NA. The code then drops the observations with missing values.
Plotting Returns to Schooling
lm1 <- lm(lwage76 ~ ed76, data=x1)
plot(x1$ed76,x1$lwage76, xlab="Years of Education",
     ylab="Log Wages (1976)")
# plot allows us to label the charts
abline(a=lm1$coefficients[1],b=lm1$coefficients[2],lwd=3)
Figure 3 presents a simple plot of the relationship between log wages in 1976 and the years of schooling. Again we see a distinct positive relationship even though the observations are spread out in a cloud. We can see that people who do not graduate from high school (finish with less than 12 years of education) earn less on average than those who attend college (have more than 12 years of education). There is a lot of overlap between the distributions.
Estimating Returns to Schooling
kable(summary(lm1)[4])
We can use OLS to estimate the average effect of schooling on income. Table 4 gives the OLS estimate. The coefficient estimate of the relationship between years of schooling and log wages is 0.052. The coefficient is traditionally interpreted as the percentage increase in wages associated with a 1 year increase in schooling (Card 1995). This isn’t quite right, but it is pretty close. The calculation below shows the predicted percentage change in wages, measured at the mean of wages, is about 5.3%. Compare this to reading the coefficient as 5.2%.
exp(log(mean(x1$wage76)) + lm1$coefficients[2])/mean(x1$wage76)
ed76
1.053475
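Equivalently (this line is my addition, not part of the original analysis), the exact percentage change implied by a coefficient on log wages is 100*(exp(coefficient) - 1).

100*(exp(lm1$coefficients[2]) - 1) # about 5.3 percent, the same number as above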
This estimate suggests high returns to schooling. However, the model makes a number of important assumptions about how the data is generated. The following chapters discuss the implications of those assumptions in detail.
Discussion and Further Reading
OLS is the “go to” method for estimation in microeconometrics. The method makes two strong assumptions. First, the unobserved characteristics must be independent of the policy variable. Second, the unobserved characteristics must affect the outcome variable additively. These two assumptions allow averaging to disentangle the causal effect from the effect of unobserved characteristics. We can implement the averaging using one of two algorithms, a matrix algebra-based algorithm or the least squares algorithm. In R, the two algorithms are computationally equivalent. The next two chapters consider weakening the assumptions. Chapter 2 takes the first step by allowing the OLS model to include more variables. Chapter 3 considers cases where neither assumption holds.
Measuring and reporting uncertainty has been the traditional focus of statistics and econometrics. This book takes a more “modern” approach to focus attention on issues around measuring and determining causal effects. This chapter takes a detour into the traditional areas. It introduces the bootstrap method and discusses traditional standard errors. These topics are discussed in more detail in the Appendix.
The chapter introduces causal graphs. Pearl and Mackenzie (2018) provide an excellent introduction to the power of this modeling approach.
To understand more about the returns to schooling literature, I recommend Card (2001). Chapters 2 and 3 replicate much of the analysis presented in Card (1995). The book returns to the question of measuring returns to schooling in Chapters 2, 3, 6, 8 and 11.
Multiple Regression
Introduction
In Chapter 1 there is only one explanatory variable. However, in many questions we expect multiple explanations. In determining a person’s income, the education level is very important. There is clear evidence that other factors are also important, including experience, gender and race. Do we need to account for these factors in determining the effect of education on income?
Yes. In general, we do. Goldberger (1991) characterizes the problem as one of running a short regression when the data is properly explained by a long regression. This chapter discusses when we should and should not run a long regression. It also discusses an alternative approach. Imagine a policy that can affect the outcome either directly or indirectly through another variable. Can we estimate both the direct effect and the indirect effect? The chapter combines OLS with directed acyclic graphs (DAGs) to determine how a policy variable affects the outcome. It then illustrates the approach using actual data on mortgage lending to determine whether bankers are racist or greedy.
Long and Short Regression
Goldberger (1991) characterized the problem of omitting explanatory variables as the choice between long and short regression. The section considers the relative accuracy of long regression in two cases: when the explanatory variables are independent of each other, and when the explanatory variables are correlated.
Using Short Regression
Consider an example where the true effect is given by the following long regression. The dependent variable
In our running example, think of
We are interested in estimating
By doing this we have a different “unobserved characteristic.” The unobserved characteristic,
Does it matter? Does it matter if we just leave out important explanatory variables? Yes. And No. Maybe. It depends. What was the question?
Independent Explanatory Variables
Figure 4 presents the independence case. There are two variables that determine the value of
Consider the simulation below and the results of the various regressions presented in Table 5. Models (1) and (2) show that it makes little difference if we run the short or long regression. Neither of the estimates is that close, but that is mostly due to the small sample size. It does impact the estimate of the constant; can you guess why?
Dependent Explanatory Variables
Short regressions are much less trustworthy when there is some sort of dependence between the two variables. Figure 5 shows the causal relationship when
Models (3) and (4) of Table 5 present the short and long estimators for the case where there is dependence. In this case we see a big difference between the two estimators. The long regression gives estimates of
Simulation with Multiple Explanatory Variables
The first simulation assumes that
set.seed(123456789)
N <- 1000
a <- 2
b <- 3
c <- 4
u_x <- rnorm(N)
alpha <- 0
x <- x1 <- (1 - alpha)*runif(N) + alpha*u_x
w <- w1 <- (1 - alpha)*runif(N) + alpha*u_x
u <- rnorm(N)
y <- a + b*x + c*w + u
lm1 <- lm(y ~ x)
lm2 <- lm(y ~ x + w)
The second simulation allows for dependence between
alpha <- 0.5
x <- x2 <- (1 - alpha)*runif(N) + alpha*u_x
w <- w2 <- (1 - alpha)*runif(N) + alpha*u_x
y <- a + b*x + c*w + u
lm3 <- lm(y ~ x)
lm4 <- lm(y ~ x + w)
The last simulation suggests that we need to take care not to overly rely on long regressions. If
alpha <- 0.95
x <- x3 <- (1 - alpha)*runif(N) + alpha*u_x
w <- w3 <- (1 - alpha)*runif(N) + alpha*u_x
y <- a + b*x + c*w + u
lm5 <- lm(y ~ x)
lm6 <- lm(y ~ x + w)
| | (1) | (2) | (3) | (4) | (5) | (6) |
|---|---|---|---|---|---|---|
(Intercept) | 3.983 | 2.149 | 2.142 | 2.071 | 2.075 | 2.077 |
(0.099) | (0.082) | (0.046) | (0.036) | (0.033) | (0.032) | |
x | 3.138 | 2.806 | 6.842 | 2.857 | 7.014 | 0.668 |
(0.168) | (0.111) | (0.081) | (0.166) | (0.034) | (1.588) | |
w | 4.054 | 4.159 | 6.346 | |||
(0.111) | (0.160) | (1.588) | ||||
Num.Obs. | 1000 | 1000 | 1000 | 1000 | 1000 | 1000 |
R2 | 0.258 | 0.682 | 0.877 | 0.927 | 0.977 | 0.978 |
R2 Adj. | 0.257 | 0.681 | 0.877 | 0.927 | 0.977 | 0.978 |
AIC | 3728.3 | 2882.6 | 3399.4 | 2884.8 | 2897.3 | 2883.4 |
BIC | 3743.0 | 2902.2 | 3414.2 | 2904.4 | 2912.0 | 2903.1 |
Log.Lik. | -1861.141 | -1437.292 | -1696.717 | -1438.393 | -1445.661 | -1437.712 |
RMSE | 1.56 | 1.02 | 1.32 | 1.02 | 1.03 | 1.02 |
Table 5 shows what happens when you run a short regression with dependence between the variables.28 When there is no dependence the short regression does fine, actually a little better in this example. However, when there is dependence the short regression is capturing both the effect of
Matrix Algebra of Short Regression
To illustrate the potential problem with running a short regression consider the matrix algebra.
Equation 24 gives the true relationship between the outcome vector
Equation 25 shows that the short regression gives the same answer if
cov(x1,w1) # calculates the covariance between x1 and w1
[1] 0.007019082
cov(x2,w2)
[1] 0.2557656
t(x1)%*%w1/N
[,1]
[1,] 0.2581774
# this corresponds to the linear algebra above
t(x2)%*%w2/N
[,1]
[1,] 0.3188261
# it measures the correlation between the Xs and Ws.
In our simulations we see that in the first case the covariance between
Collinearity and Multicollinearity
As we saw in Chapter 1, the true parameter vector can be written as follows.
where here
Chapter 1 states that in order to interpret the parameter estimates as measuring the true effects of
In statistics, the problem that our matrix of observables is not full-column rank is called “collinearity.” Two, or more, columns are “co-linear.” Determining whether a matrix is full-column rank is not difficult. If the matrix
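A quick way to check this in R is the qr() decomposition, which reports the numerical rank of a matrix. The snippet below is my own illustration, not code from the text, and the objects are made up for the example.

x_demo <- runif(100)
X_ok <- cbind(1, x_demo) # two distinct columns
X_bad <- cbind(1, x_demo, 2*x_demo) # third column is a multiple of the second
qr(X_ok)$rank # 2: full column rank
qr(X_bad)$rank # 2, not 3: the columns are collinear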
Econometrics textbooks are generally not a lot of laughs. One prominent exception is Art Goldberger’s A Course in Econometrics and its chapter on Multicollinearity. Goldberger points out that multicollinearity has many syllables but in the end it is just a problem of not having enough data (Goldberger 1991). More accurately, a problem of not having enough variation in the data. He then proceeds by discussing the analogous problem of micronumerosity.30 Notwithstanding Goldberger’s jokes, multicollinearity is no joke.
Models (5) and (6) in Table 5 show what happens when the explanatory variables
Matrix Algebra of Multicollinearity
In the problem we make two standard assumptions. First, the average value of the unobserved characteristic is 0. Second, the
Understanding Multicollinearity with R
Given the magic of our simulated data, we can look into what is causing the problem with our estimates.
X2 <- cbind(1,x3,w3)
solve(t(X2)%*%X2)%*%t(X2)%*%u
[,1]
0.0766164
x3 -2.3321265
w3 2.3462160
First, we can look at the difference between
mean(u)
[1] 0.07635957
cov(x3,u) # calculates the covariance between two variables
[1] 0.01316642
cov(w3,u)
[1] 0.01413717
t(X2)%*%u/N
[,1]
0.07635957
x3 0.01593319
w3 0.01687792
Again we can look at the main OLS assumptions, that the mean of the unobserved term is zero and the covariance between the unobserved term and the observed terms is small.
The operation above shows that the average of the unobserved characteristic does differ somewhat from 0. Still, it is not large enough to explain the problem. We can also look at the independence assumption, which implies that the covariance between the observed terms and the unobserved term will be zero (for large samples). Here, they are not quite zero, but still small. Again, not enough to explain the huge difference.
The problem is that we are dividing by a very small number. The inverse of a
where the rearranged matrix is divided by the determinant of the matrix. We see above that the reciprocal of the determinant of the matrix is very small. In this case, the very “small” inverse overwhelms everything else.
1/det(t(X2)%*%X2)
[1] 2.610265e-06
# calculates the reciprocal of the determinant of the matrix.
Returns to Schooling
Now that we have a better idea of the value and risk of multiple regression, we can return to the question of returns to schooling. Card (1995) posits that income in 1976 is determined by a number of factors including schooling.
Multiple Regression of Returns to Schooling
We are interested in the effect of schooling on income. However, we want to account for how other variables may also affect income. Standard characteristics that are known to determine income are work experience, race, the region of the country where the individual grew up and the region where the individual currently lives.
In Equation 28, income in 1976 for individual
NLSM Data
<- read.csv("nls.csv",as.is=TRUE)
x $wage76 <- as.numeric(x$wage76) x
Warning: NAs introduced by coercion
$lwage76 <- as.numeric(x$lwage76) x
Warning: NAs introduced by coercion
<- x[is.na(x$lwage76)==0,]
x1 $exp <- x1$age76 - x1$ed76 - 6 # working years after school
x1$exp2 <- (x1$exp^2)/100
x1# experienced squared divided by 100
The chapter uses the same data as Chapter 1. This time we create measures of experience. Each individual is assumed to have “potential” work experience equal to their age, less years of schooling, less six (the age at which schooling starts). The code also creates a squared term for experience. This allows the estimator to capture the fact that wages tend to increase with experience but at a decreasing rate.
OLS Estimates of Returns to Schooling
lm1 <- lm(lwage76 ~ ed76, data=x1)
lm2 <- lm(lwage76 ~ ed76 + exp + exp2, data=x1)
lm3 <- lm(lwage76 ~ ed76 + exp + exp2 + black + reg76r,
          data=x1)
lm4 <- lm(lwage76 ~ ed76 + exp + exp2 + black + reg76r +
          smsa76r + smsa66r + reg662 + reg663 + reg664 +
          reg665 + reg666 + reg667 + reg668 + reg669,
          data=x1)
# reg76r refers to living in the south in 1976
# smsa refers to whether they are urban or rural in 1976.
# reg refers to region of the US - North, South, West etc.
# 66 refers to 1966.
| | (1) | (2) | (3) | (4) |
|---|---|---|---|---|
(Intercept) | 5.571 | 4.469 | 4.796 | 4.621 |
(0.039) | (0.069) | (0.069) | (0.074) | |
ed76 | 0.052 | 0.093 | 0.078 | 0.075 |
(0.003) | (0.004) | (0.004) | (0.003) | |
exp | 0.090 | 0.085 | 0.085 | |
(0.007) | (0.007) | (0.007) | ||
exp2 | -0.249 | -0.234 | -0.229 | |
(0.034) | (0.032) | (0.032) | ||
black | -0.178 | -0.199 | ||
(0.018) | (0.018) | |||
reg76r | -0.150 | -0.148 | ||
(0.015) | (0.026) | |||
smsa76r | 0.136 | |||
(0.020) | ||||
smsa66r | 0.026 | |||
(0.019) | ||||
reg662 | 0.096 | |||
(0.036) | ||||
reg663 | 0.145 | |||
(0.035) | ||||
reg664 | 0.055 | |||
(0.042) | ||||
reg665 | 0.128 | |||
(0.042) | ||||
reg666 | 0.141 | |||
(0.045) | ||||
reg667 | 0.118 | |||
(0.045) | ||||
reg668 | -0.056 | |||
(0.051) | ||||
reg669 | 0.119 | |||
(0.039) | ||||
Num.Obs. | 3010 | 3010 | 3010 | 3010 |
R2 | 0.099 | 0.196 | 0.265 | 0.300 |
R2 Adj. | 0.098 | 0.195 | 0.264 | 0.296 |
AIC | 3343.5 | 3004.5 | 2737.2 | 2611.6 |
BIC | 3361.6 | 3034.5 | 2779.3 | 2713.7 |
Log.Lik. | -1668.765 | -1497.238 | -1361.606 | -1288.777 |
RMSE | 0.42 | 0.40 | 0.38 | 0.37 |
Table 6 gives the OLS estimates of the returns to schooling. The estimates of the coefficient of years of schooling on log wages vary from 0.052 to 0.093. The table presents the results in a traditional way. It presents models with longer and longer regressions. This presentation style gives the reader a sense of how much the estimates vary with the exact specification of the model. Model (4) in Table 6 replicates Model (2) from Table 2 of Card (1995).31
The longer regressions suggest that the effect of education on income is larger than for the shortest regression, although the relationship is not increasing in the number of explanatory variables. The effect seems to stabilize at around 0.075.
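As with the Chapter 1 estimate, the coefficient can be translated into an exact percentage change. The quick check below is my own and is not part of the original text.

exp(0.075) - 1 # a coefficient of 0.075 implies roughly a 7.8% increase in wages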
Under the standard assumptions of the OLS model, we estimate that an additional year of schooling causes a person’s income to increase (by about) 7.5%.32 Are the assumptions of OLS reasonable? Do you think that unobserved characteristics of the individual do not affect both their decision to attend college and their income? Do you think family connections matter for both of these decisions? The next chapter takes on these questions.
Causal Pathways
Long regressions are not always the answer. The next two sections present cases where people rely on long regressions even though the long regressions can lead them astray. This section suggests using directed acyclic graphs (DAGs) as an alternative to long regression.33
Consider the case against Harvard University for discrimination against Asian-Americans in undergraduate admissions. The Duke labor economist, Peter Arcidiacono, shows that Asian-Americans have a lower probability of being admitted to Harvard than white applicants.34 Let's assume that the effect of race on Harvard admissions is causal. The question is then how this causal relationship works. What is the causal pathway?
One possibility is that there is a direct causal relationship between race and admissions. That is, Harvard admissions staff use the race of the applicant when deciding whether to make them an offer. The second possibility is that the causal relationship is indirect. Race affects admissions, but indirectly. Race is mediated by some other observed characteristics of the applicants such as their SAT scores, grades or extracurricular activities.
This is not some academic question. If the causal effect of race on admissions is direct, then Harvard may be legally liable. If the causal effect of race on admissions is indirect, then Harvard may not be legally liable.
Arcidiacono uses long regression in an attempt to show Harvard is discriminating. This section suggests his approach is problematic. The section presents an alternative way to disentangle the direct causal effect from the indirect causal effect.
Dual Path Model
Figure 6 illustrates the problem. The figure shows there exist two distinct causal pathways for
In algebra, we have the following relationship between
and
Substituting Equation 30 into Equation 29 we have the full effect of
The full relationship of
Given the model described in Figure 6, it is straightforward to estimate
For the remainder of the chapter we will make the problem go away by assuming that
Simulation of Dual Path Model
Consider the simulated data generated below. The data is generated according to the causal diagram in Figure 6.
set.seed(123456789)
N <- 50
a <- 1
b <- 0
c <- 3
d <- 4
x <- round(runif(N)) # creates a vector of 0s and 1s
w <- runif(N) + d*x
u <- rnorm(N)
y <- a + b*x + c*w + u
Table 7 presents the OLS results for the regression of
A standard solution to the omitted variable problem is to include the omitted variable in the regression (Goldberger 1991). Table 8 presents results from a standard long regression. Remember that the true value of the coefficient on
The issue with the standard long regression is multicollinearity. From Figure 6 we see that
There is a better way to do the estimation. Figure 6 shows that there is a causal pathway from
e_hat <- lm(y ~ x)$coef[2]
# element 2 is the slope coefficient of interest.
c_hat <- lm(y ~ w)$coef[2]
d_hat <- lm(w ~ x)$coef[2]
# Estimate of b
e_hat - c_hat*d_hat
x
-0.08876039
By running three regressions, we can estimate the true value of
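The three regressions can be packaged into a small helper function; this wrapper is my own convenience for the sketch, not code from the text.

dual_path <- function(y, x, w) {
  # direct effect = total effect of x on y
  #                 minus (effect of x on w) times (effect of w on y)
  unname(lm(y ~ x)$coef[2] - lm(w ~ x)$coef[2]*lm(y ~ w)$coef[2])
}
dual_path(y, x, w) # matches the estimate computed above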
Dual Path Estimator Versus Long Regression
set.seed(123456789)
b_mat <- matrix(NA,100,3)
for (i in 1:100) {
  x <- round(runif(N))
  w <- runif(N) + d*x
  u <- rnorm(N)
  y <- a + b*x + c*w + u
  lm2_temp <- summary(lm(y ~ x + w))
  # summary provides more useful information about the object
  # the coefficients object (item 4) provides additional
  # information about the coefficient estimates.
  b_mat[i,1] <- lm2_temp[[4]][2]
  # The 4th item in the list is the coefficient matrix.
  # Its second element is the coefficient on x.
  b_mat[i,2] <- lm2_temp[[4]][8]
  # the 8th element is the t-statistic on the coefficient on x.
  b_mat[i,3] <-
    lm(y ~ x)$coef[2] - lm(w ~ x)$coef[2]*lm(y ~ w)$coef[2]
  # print(i)
}
colnames(b_mat) <-
  c("Standard Est","T-Stat of Standard","Proposed Est")
Standard Est | T-Stat of Standard | Proposed Est |
---|---|---|
Min. :-5.1481 | Min. :-3.02921 | Min. :-0.119608 |
1st Qu.:-1.6962 | 1st Qu.:-0.80145 | 1st Qu.:-0.031123 |
Median :-0.3789 | Median :-0.19538 | Median :-0.008578 |
Mean :-0.1291 | Mean :-0.07418 | Mean :-0.002654 |
3rd Qu.: 1.3894 | 3rd Qu.: 0.74051 | 3rd Qu.: 0.030679 |
Max. : 6.5594 | Max. : 2.68061 | Max. : 0.104119 |
We can use simulation to compare estimators. We rerun the simulation 100 times and summarize the results of the two approaches in Table 9.
The table shows that the proposed estimator gives values much closer to 0 than the standard estimator. It also shows that the standard estimator can, at times, provide misleading results. The estimate may suggest that the value of
The proposed estimator is much more accurate than the standard estimator. Figure 7 shows the large difference in the accuracy of the two estimates. The standard estimates range from about -5 to more than 6, while the proposed estimates have a much smaller variance. Remember the two approaches are estimated on exactly the same data.
Matrix Algebra of the Dual Path Estimator
We can write out the dual path causal model more generally with matrix algebra. To keep things consistent with the OLS presentation, let
where
Equation 32 presents the model of the data generating process. In the simultaneous equation system we see that the matrix
We are interested in estimating
All three vectors can be estimated following the same matrix algebra we presented for estimating OLS.
The first equation
Substituting the results of the last two regressions into the appropriate places we get our proposed estimator for the direct effect of
Equation 34 presents the estimator of the direct effect of the
Dual Path Estimator in R
Equation 34 is pseudo-code for the dual-path estimator in R.
X <- cbind(1,x)
W <- cbind(1,w)
beta_tilde_hat <- solve(t(X)%*%X)%*%t(X)%*%y
Delta_hat <- solve(t(X)%*%X)%*%t(X)%*%W
gamma_hat <- solve(t(W)%*%W)%*%t(W)%*%y
beta_tilde_hat - Delta_hat%*%gamma_hat
[,1]
-0.01515376
x 0.02444155
The estimated value of
Are Bankers Racist or Greedy?
African Americans are substantially more likely to be denied mortgages than whites. In the data used by Munnell et al. (1996), being black is associated with a 20 percentage point reduction in the likelihood of getting a mortgage.39 The US Consumer Financial Protection Bureau states that the Fair Housing Act may make it illegal to refuse credit based on race.40
Despite these observed discrepancies, lenders may not be doing anything illegal. It may not be illegal to deny mortgages based on income or credit history. Bankers are allowed to maximize profits. They are allowed to deny mortgages to people that they believe are at high risk of defaulting. There may be observed characteristics of the applicants that are associated with a high risk of defaulting that are also associated with race.
Determining the causal pathway has implications for policy. If the effect of race is direct, then laws like the Fair Housing Act may be the correct policy response. If the effect is indirect, then such a policy will have little effect on a policy goal of increasing housing ownership among African Americans.
The section finds that there may be no direct effect of race on mortgage lending.
Boston HMDA Data
The data used here comes from Munnell et al. (1996). This version is downloaded from the data sets for Stock and Watson (2011) here: https://wps.pearsoned.com/aw_stock_ie_3/178/45691/11696965.cw/index.html. You can also download a csv version of the data from https://sites.google.com/view/microeconometricswithr/table-of-contents.
<- read.csv("hmda_aer.csv", as.is = TRUE)
x $deny <- ifelse(x$s7==3,1,NA)
x# ifelse considers the truth of the first element. If it is
# true then it does the next element, if it is false it does
# the final element.
# You should be careful and make sure that you don't
# accidentally misclassify an observation.
# For example, classifying "NA" as 0.
# Note that == is used in logical statments for equal to.
$deny <- ifelse(x$s7==1 | x$s7==2,0,x$deny)
x# In logical statements | means "or" and & means "and".
# The variable names refer to survey questions.
# See codebook.
$black <- x$s13==3 # we can also create a dummy by using
x# true/false statement.
The following table presents the raw effect of race on mortgage denials from the Munnell et al. (1996) data. It shows that being African American reduces the likelihood of a mortgage by 20 percentage points.
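The raw gap is simply the difference in denial rates between the two groups. The calculation below is my own restatement of what the table reports; the exact figure depends on how missing values are handled.

mean(x$deny[x$black==1], na.rm=TRUE) - mean(x$deny[x$black==0], na.rm=TRUE)
# difference in denial rates, roughly 0.2 (20 percentage points)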
Causal Pathways of Discrimination
Race may be affecting the probability of getting a mortgage through two different causal pathways. There may be a direct effect in which the lender is denying a mortgage because of the applicant’s race. There may be an indirect effect in which the lender denies a mortgage based on factors such as income or credit history. Race is associated with lower income and poorer credit histories.
Figure 8 represents the estimation problem. The regression in Table 10 may be picking up both the direct effect of Race on Deny (
Estimating the Direct Effect
x$lwage <- NA
x[x$s31a>0 & x$s31a<999999,]$lwage <-
  log(x[x$s31a>0 & x$s31a<999999,]$s31a)
# another way to make sure that NAs are not misclassified.
# See the codebook for missing data codes.
x$mhist <- x$s42
x$chist <- x$s43
x$phist <- x$s44
x$emp <- x$s25a
x$emp <- ifelse(x$emp>1000,NA,x$emp)
To determine the causal effect of race we can create a number of variables from the data set that may mediate race. These variables measure information about the applicant’s income, employment history and credit history.
Y1 <- x$deny
X1 <- cbind(1,x$black)
W1 <- cbind(1,x$lwage,x$chist,x$mhist,x$phist,x$emp)
index <- is.na(rowSums(cbind(Y1,X1,W1)))==0
# this removes missing values.
X2 <- X1[index,]
Y2 <- Y1[index]
W2 <- W1[index,]
beta_tilde_hat <- solve(t(X2)%*%X2)%*%t(X2)%*%Y2
Delta_hat <- solve(t(X2)%*%X2)%*%t(X2)%*%W2
gamma_hat <- solve(t(W2)%*%W2)%*%t(W2)%*%Y2
beta_tilde_hat - Delta_hat%*%gamma_hat
[,1]
[1,] -0.01797654
[2,] 0.11112066
Adding these variables reduces the possible direct effect of race on mortgage denials by almost half. Previously, the analysis suggested that being African American reduced the probability of getting a mortgage by 20 percentage points. This analysis shows that at least 8 percentage points of that is due to an indirect causal effect mediated by income, employment history, and credit history. Can adding in more such variables reduce the estimated direct causal effect of race to zero?
Adding in More Variables
x$married <- x$s23a=="M"
x$dr <- ifelse(x$s45>999999,NA,x$s45)
x$clines <- ifelse(x$s41>999999,NA,x$s41)
x$male <- x$s15==1
x$suff <- ifelse(x$s11>999999,NA,x$s11)
x$assets <- ifelse(x$s35>999999,NA,x$s35)
x$s6 <- ifelse(x$s6>999999,NA,x$s6)
x$s50 <- ifelse(x$s50>999999,NA,x$s50)
x$s33 <- ifelse(x$s33>999999,NA,x$s33)
x$lr <- x$s6/x$s50
x$pr <- x$s33/x$s50
x$coap <- x$s16==4
x$school <- ifelse(x$school>999999,NA,x$school)
x$s57 <- ifelse(x$s57>999999,NA,x$s57)
x$s48 <- ifelse(x$s48>999999,NA,x$s48)
x$s39 <- ifelse(x$s39>999999,NA,x$s39)
x$chval <- ifelse(x$chval>999999,NA,x$chval)
x$s20 <- ifelse(x$s20>999999,NA,x$s20)
x$lwage_coap <- NA
x[x$s31c>0 & x$s31c < 999999,]$lwage_coap <-
  log(x[x$s31c>0 & x$s31c < 999999,]$s31c)
x$lwage_coap2 <- ifelse(x$coap==1,x$lwage_coap,0)
x$male_coap <- x$s16==1
We can add in a large number of variables that may be reasonably associated with legitimate mortgage denials, including measures of assets, ratio of debt to assets and property value. Lenders may also plausibly deny mortgages based on whether the mortgage has a co-applicant and characteristics of the co-applicant.
W1 <- cbind(1,x$lwage,x$chist,x$mhist,x$phist,x$emp,
            x$emp^2,x$married,x$dr,x$clines,x$male,
            x$suff,x$assets,x$lr,x$pr,x$coap,x$s20,
            x$s24a,x$s27a,x$s39,x$s48,x$s53,x$s55,x$s56,
            x$s57,x$chval,x$school,x$bd,x$mi,x$old,
            x$vr,x$uria,x$netw,x$dnotown,x$dprop,
            x$lwage_coap2,x$lr^2,x$pr^2,x$clines^2,x$rtdum)
# x$rtdum measures the racial make up of the neighborhood.
index <- is.na(rowSums(cbind(Y1,X1,W1)))==0
X2 <- X1[index,]
Y2 <- Y1[index]
W2 <- W1[index,]
Bootstrap Dual Path Estimator in R
The bootstrap estimator uses the algebra as pseudo-code for an estimator in R
.
set.seed(123456789)
K <- 1000
bs_mat <- matrix(NA,K,2)
for (k in 1:K) {
  index_k <- round(runif(length(Y2),min=1,max=length(Y2)))
  Y3 <- Y2[index_k]
  X3 <- X2[index_k,]
  W3 <- W2[index_k,]
  beta_tilde_hat <- solve(t(X3)%*%X3)%*%t(X3)%*%Y3
  Delta_hat <- solve(t(X3)%*%X3)%*%t(X3)%*%W3
  gamma_hat <- solve(t(W3)%*%W3)%*%t(W3)%*%Y3
  bs_mat[k,] <- beta_tilde_hat - Delta_hat%*%gamma_hat
  # print(k)
}
tab_res <- matrix(NA,2,4)
tab_res[,1] <- colMeans(bs_mat)
tab_res[,2] <- apply(bs_mat, 2, sd)
tab_res[1,3:4] <- quantile(bs_mat[,1], c(0.025,0.975))
# first row, third and fourth column.
tab_res[2,3:4] <- quantile(bs_mat[,2], c(0.025,0.975))
colnames(tab_res) <- c("Estimate", "SD", "2.5%", "97.5%")
rownames(tab_res) <- c("intercept","direct effect")
kable(tab_res)
Adding in all these variables significantly reduces the estimate of the direct effect of race on mortgage denials. The estimated direct effect of being African American falls from a 20 percentage point reduction in the probability of getting a mortgage to a 2 percentage point reduction. A standard hypothesis test with bootstrapped standard errors states that we cannot rule out the possibility that the true direct effect of race on mortgage denials is zero.41
Policy Implications of Dual Path Estimates
The analysis shows that African Americans are much more likely to be denied mortgages in Boston during the time-period. If this is something a policy maker wants to change, then she needs to know why African Americans are being denied mortgages. Is it direct discrimination by the banks? Or is the effect indirect, because African Americans tend to have lower income and poorer credit ratings than other applicants? A policy that makes it illegal to use race directly in mortgage decisions will be more effective if bankers are in fact using race directly in mortgage decisions. Other policies may be more effective if bankers are using credit ratings and race is affecting loan rates indirectly.
Whether this analysis answers the question is left to the reader. It is not clear we should include variables such as the racial make up of the neighborhood or the gender of the applicant. The approach also relies on the assumption that there is in fact no direct effect of race on mortgage denials. In addition, it uses OLS rather than models that explicitly account for the discrete nature of the outcome variable.42
The approach presented here is quite different from the approach presented in Munnell et al. (1996). The authors are also concerned that race may have both direct and indirect causal effects on mortgage denials. Their solution is to estimate the relationship between the mediating variables (
Discussion and Further Reading
There is a simplistic idea that longer regressions must be better than shorter regressions. The belief is that it is always better to add more variables. Hopefully, this chapter showed that longer regressions can be better than shorter regressions, but they can also be worse. In particular, long regressions can create multicollinearity problems. While it is funny, Goldberger’s chapter on multicollinearity downplays the importance of the issue.
The chapter shows that if we take DAGs seriously we may be able to use an alternative to the long regression. The chapter shows that in the case where there are two paths of a causal effect, we can improve upon the long regression. I highly recommend Pearl and Mackenzie (2018) to learn more about graphs. Pearl uses the term “mediation” to refer to the issue of dual causal paths.
To find out more about the lawsuit against Harvard University, go to https://studentsforfairadmissions.org/.
Bounds Estimation
Introduction
In the first three chapters we estimated, or attempted to estimate, a single value of interest. This chapter considers situations where we are either unable or unwilling to estimate a single value for the policy parameter of interest. Instead, the chapter considers cases where we are limited to estimating a range of values. We are interested in using the data to estimate the bounds on the policy parameter of interest.
It is standard practice in econometrics to present the average treatment effect (ATE). This estimand provides the policy maker with the average impact of the policy if everyone were to receive the policy. That is, if everyone changes from not attending college to attending college, the ATE predicts what would happen. I would give an example, but I can't think of one. In general, policies do not work like this. Consider the policy of making public state colleges free. Such a policy would encourage more people to attend college, but a bunch of people were already attending college and a bunch of people will not attend college even if it is free. What does the ATE tell us will happen to those that are newly encouraged to go to college? Not that much.
If attending college has the same effect on everyone, then the ATE provides useful information. If everyone has the same treatment effect, the average must be equal to the treatment effect. The difficulty arises when different people get different value from going to college. That is, the difficulty always arises.
This chapter considers two implications. In the first case, the data allows the ATE to be estimated, but we would prefer to know the distribution of the policy effect. In general, we cannot estimate this distribution. We can, however, bound it. These bounds are based on a conjecture of the great Soviet mathematician, Andrey Kolmogorov. The chapter explains how the Kolmogorov bounds work and when they provide the policy maker with useful information. These bounds are illustrated by analyzing a randomized controlled trial on the effect of “commitment savings” devices.
In the second case, the data does not allow the ATE to be estimated. Or more accurately, we are unwilling to make the non-credible assumptions necessary to estimate the ATE. The Northwestern econometrician, Charles Manski, argues that econometricians are too willing to present estimates based on non-credible assumptions. Manski shows that weaker but more credible assumptions often lead to a range of estimates. He suggests that presenting a range of estimates is better than providing precisely estimated nonsense. The chapter presents Manski’s natural bounds and discusses how assumptions can reduce the range of estimates of the policy effect. The chapter illustrates these ideas by estimating whether more guns reduce crime.
Potential Outcomes
You have been tasked by the AOC 2028 campaign to estimate the likely impact of a proposal to make state public universities tuition free.43 Your colleague is tasked with estimating how many more people will attend college once it is made free. You are to work out what happens to incomes of those that choose to go to college, now that is free. You need to estimate the treatment effect of college.
Model of Potential Outcomes
Consider a simple version of the problem. There are two possible outcomes. There is the income the individual receives if they attend college (
where
We are interested in determining the treatment effect for each individual
This is the difference between the two possible outcomes for each individual.
Simulation of Impossible Data
Imagine you have access to the impossibly good data set created below (actually, just an impossible data set). The data provides information on the simulated individual’s outcome (
set.seed(123456789)
N <- 200
a <- 2
b <- rnorm(N,mean=2,sd=3)
# this creates variation in the slope with an average
# effect of 2.
x0 <- rep(0,N) # creates a vector of zeros
x1 <- rep(1,N)
u <- rnorm(N)
y <- a + b*cbind(x0,x1) + u
# y is a matrix, [a + u, a + b + u]
# rep creates a vector by repeating the first number by the
# amount of the second number.
par(mfrow=c(2,1)) # creates a simple panel plot
par(mar=c(2,4,0.5,0.5)) # adjusts margins between plots.
plot(density(y[,1]),type="l",lwd=4,xlim=range(y),
ylab="density",main="")
lines(density(y[,2]),lwd=2)
abline(v=colMeans(y),lwd=c(4,2))
legend("topright",c("No College","College"),lwd=3,lty=c(1,2))
plot(ecdf(y[,1]), xlim=range(y),main="",do.points=FALSE,
lwd=4,xlab="y")
lines(ecdf(y[,2]),lwd=2,do.points=FALSE)
# ecdf empirical cumulative distribution function.
Figure 9 presents the density functions, means and cumulative distribution functions of the two potential outcomes for the simulated data. The figure suggests that individuals generally have better outcomes when
A concern is that the two distributions overlap. Moreover, the cumulative distribution functions cross. There may be individuals in the data who are actually better off if
Distribution of the Treatment Effect
Equation 36 states that the treatment effect may vary across individuals. If it does, then it has a distribution. Figure 10 presents the density and cumulative distribution function for the difference in outcome if the individual attended college and if the individual did not. The distribution shows that the treatment effect varies across individuals. It is heterogeneous. Moreover, the effect of college may either increase or decrease income, depending on the individual.
par(mfrow=c(2,1)) # creates a simple panel plot
par(mar=c(2,4,0.5,0.5)) # adjusts margins between plots.
plot(density(y[,2]-y[,1]),type="l",lwd=3,main="")
abline(v=0,lty=2,lwd=3)
plot(ecdf(y[,2]-y[,1]),main="",do.points=FALSE,lwd=3,
xlab="y")
abline(v=0,lty=2,lwd=3)
Average Treatment Effect
The average treatment effect (ATE) holds a special position in econometrics and statistics. A possible reason is that it measures the average difference in potential outcomes. That’s actually pretty neat given that outside of our impossible data we cannot observe the difference in potential outcomes. How can we measure the average of something we cannot observe?
ATE and Its Derivation
# mean of the difference vs difference of the means
mean(y[,2]-y[,1]) == mean(y[,2]) - mean(y[,1])
[1] TRUE
The mean of the difference is equal to the difference of the means. We cannot observe the difference in the treatment outcomes. But, we can observe the outcomes of each treatment separately. We can observe the mean outcomes for each treatment. This neat bit of mathematics is possible because averages are linear operators.
We can write out the expected difference in potential outcomes by the Law of Total Expectations.
where
The rest follows from manipulating the conditional expectations.
Rubin (1974) presents the derivation in Equation 38. He points out that if we can estimate each of the average potential outcomes then we have an estimate of the average treatment effect.
But can we estimate the average potential outcome?
ATE and Do Operators
To answer this question, it is clearer to switch notation. At the risk of upsetting the Gods of Statistics, I will mix notations from two different causal models. The expected potential outcome if
By “do” we mean that this is the expected outcome if individuals in the data faced a policy which forced the treatment
The notation highlights the fact that the expected potential outcome of a treatment may not be equal to expected outcomes in a particular treatment. In general,
To see why these numbers are not the same, consider the following derivation. We can write down the expected outcome conditional on the do operator by the Law of Total Expectations. We can write out the average outcome conditional on the policy as the sum of the average outcomes of the policy conditional on the observed treatments weighted by the observed probabilities of the treatments.
The expected outcome under a policy in which individuals go to college is a weighted sum of the effect of the policy on individuals who currently go to college and the effect of the policy on individuals who currently do not go to college.
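Written out in symbols (my own reconstruction of the decomposition, using $y_1$ for the college potential outcome), the statement is:

$$E[y \mid do(X=1)] = E[y_1 \mid X=1]\,P(X=1) + E[y_1 \mid X=0]\,P(X=0)$$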
We are generally able to observe three of the four numbers on the right-hand side of Equation 40. We observe the probability individuals are allocated to the current treatments. In addition, we assume that
ATE and Unconfoundedness
We can estimate the average treatment effect if we are willing to make the following assumption.
Unconfoundedness
Assumption 1 states that the expected outcome of the policy does not vary with the treatment observed in the data. Under the assumption, there is no information content in the fact that one group attends college and one group does not. This assumption may be reasonable if we have data from an ideal randomized controlled trial. For most other data, including many randomized controlled trials, the assumption may not be credible.
The assumption implies we can substitute the unknown expected value with the known expected value.
The implication is that we can estimate the average of the potential outcomes for each treatment. Thus we can estimate the average difference in potential outcomes. Said differently, unconfoundedness allows us to estimate the average treatment effect.
ATE and Simulated Data
X <- runif(N) < 0.3 # treatment assignment
Y <- (1-X)*y[,1] + X*y[,2] # outcome conditional on treatment
Consider a change to our simulated data to make it look more like an actual data set. In the new data we only see one outcome and one treatment for each individual. However, if we can make the unconfoundedness assumption then we can estimate the average treatment effect. Our new data satisfies the assumption because the assignment to treatment is random.
mean(Y[X==1]) - mean(Y[X==0])
[1] 2.432335
In the data the true average treatment effect is 2. Our estimate is 2.43.
What changes could you make to the simulated data that would increase the accuracy of the estimate?44
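One possible answer, sketched below with my own illustrative names: a larger sample and a balanced treatment share both reduce the sampling error, pulling the estimate toward the true value of 2.

set.seed(123456789)
N_big <- 20000
b_big <- rnorm(N_big, mean=2, sd=3)
u_big <- rnorm(N_big)
y_big <- cbind(2 + u_big, 2 + b_big + u_big) # [no college, college]
X_big <- runif(N_big) < 0.5 # balanced random assignment
Y_big <- (1-X_big)*y_big[,1] + X_big*y_big[,2]
mean(Y_big[X_big==1]) - mean(Y_big[X_big==0]) # close to the true ATE of 2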
Kolmogorov Bounds
There are policy questions where the ATE provides a useful answer, but it is often provided as a statistic of convenience. In the data generated above, many simulated individuals are better off under treatment
We do not have access to the impossible data generated above. We cannot estimate the joint distribution of potential outcomes or the distribution of the treatment effect. However, we can **bound** these distributions.
Kolmogorov’s Conjecture
The Russian mathematician, Andrey Kolmogorov, conjectured that the difference of two random variables with known marginals could be bounded in the following way. Note that I have written this out in a simplified form that more closely resembles the way it is implemented in R
.46
Theorem 1 (Kolmogorov’s Conjecture) Let
and
Theorem 1 states that we can bound the distribution of the treatment effect even though we only observe the distributions of outcomes for each of the treatments. You may be surprised to learn how easy these bounds are to implement and how much information they provide about the distribution of the treatment effect.
Kolmogorov Bounds in R
We can use Theorem 1 as pseudo-code for the functions that bound the treatment effect distribution.
FL <- function(b, y1, y0) {
  f <- function(x) -(mean(y1 < x) - mean(y0 < x - b))
  # note the negative sign as we are maximizing
  # (Remember to put it back!)
  a <- optimize(f, c(min(y1,y0),max(y1,y0)))
  return(max(-a$objective,0))
}
FU <- function(b, y1, y0) {
  f <- function(x) mean(y1 < x) - mean(y0 < x - b)
  a <- optimize(f, c(min(y1,y0), max(y1,y0)))
  return(1 + min(a$objective,0))
}
K <- 50
min_diff <- min(y[,1]) - max(y[,2])
max_diff <- max(y[,1]) - min(y[,2])
del_diff <- (max_diff - min_diff)/K
y_K <- min_diff + c(1:K)*del_diff
plot(ecdf(y[,2] - y[,1]), do.points=FALSE,lwd=3,main="")
lines(y_K,sapply(y_K, function(x) FL(x,y[,2],y[,1])),
lty=2,lwd=3)
lines(y_K,sapply(y_K, function(x) FU(x,y[,2],y[,1])),
lty=3,lwd=3)
abline(v=0,lty=2,lwd=3)
Figure 11 presents the distribution of the treatment effect for the simulated data as well as the lower and upper bounds. Remember that in normal data we cannot observe the treatment effect, but thanks to math we can determine its bounds. If you look closely, you will notice that some simulated individuals must be harmed by the treatment. At 0, the lower bound is strictly positive. Of course, we know that in our impossible data some simulated individuals are in fact worse off.
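To see the point at zero directly, we can evaluate the lower bound at b = 0 and, since this is simulated data, compare it with the truth; this check is my own addition.

FL(0, y[,2], y[,1]) # lower bound on the fraction with a non-positive effect
mean(y[,2] - y[,1] <= 0) # the truth, observable only in the impossible data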
Do “Nudges” Increase Savings?
Researchers in economics and psychology have found that individuals often make poor decisions. They make decisions that are against the individual’s own interest. Given this, can policies or products be provided that “nudge” individuals to make better decisions?
Ashraf, Karlan, and Yin (2006) describe an experiment conducted with a bank in the Philippines. In the experiment some customers were offered “commitment” savings accounts. In these accounts the customer decides upon a goal, such as a target amount or a target date, and can deposit but not withdraw until the goal is reached. Such products may help people with issues controlling personal finances or interacting with household members on financial matters. People offered accounts did not actually have to open an account and many did not.
Ashraf, Karlan, and Yin (2006) use a field experiment to determine the effectiveness of a commitment savings account.47 In the experiment there are three treatment groups; the first group is offered the commitment savings account at no extra cost or savings, the second group is provided information on the value of savings, and the third is a control. Here we will compare the commitment group to the control.
The section uses the data to illustrate the value of Kolmogorov bounds.
Field Experiment Data
We first replicate the findings in Ashraf, Karlan, and Yin (2006). The data is available at https://doi.org/10.7910/DVN/27854.
require(readstata13)
Loading required package: readstata13
# this data set was saved with Stata version 13.
x <- read.dta13("seedanalysis_011204_080404.dta")
Warning in read.dta13("seedanalysis_011204_080404.dta"):
Factor codes of type double or float detected in variables
amount, paid, spaying, value, value1,
value2, value3, value4, value5, lexpen1,
etypica2, fargreen, cosgreen, frabrao1,
costro1, frabrao2, costrvo2, frabrao3,
hsave, pgloan, hgowe, gtime, e4, e10,
dates_month, marketbuy, expensivebuy,
numchild, familyplan, assistfam,
personaluse, recreation, familypurchase,
workout, initiatepeace, schkids
No labels have been assigned.
Set option 'nonint.factors = TRUE' to assign labels anyway.
Warning in read.dta13("seedanalysis_011204_080404.dta"):
Missing factor labels for variables
repay5
No labels have been assigned.
Set option 'generate.factors=TRUE' to generate labels.
index_na <- is.na(rowSums(cbind(x$treatment,
                                x$balchange,x$marketing)))==0
x1 <- x[index_na,]
bal_0 <- x1[x1$treatment==0 & x1$marketing==0,]$balchange
bal_1 <- x1[x1$treatment==1 & x1$marketing==0,]$balchange
# we are just going to look at the people who did not receive
# the marketing information.
# These people are split between those that were offered
# the account
# (treatment = 1), and those that were not (treatment = 0).
# balchange - measures how their balance changed in a year.
lbal_0 <- log(bal_0 + 2169)
lbal_1 <- log(bal_1 + 2169)
# the distribution of balances is very skewed.
mean(bal_1) - mean(bal_0)
[1] 411.4664
The average treatment effect is a 411 peso increase (about $200) in savings after 12 months for those offered the commitment accounts. This result suggests that commitment accounts have a significant impact on savings rates. However, it is not clear if everyone benefits and how much benefit these accounts provide.
Bounds on the Distribution of Balance Changes
Figure 12 presents the bounds on the distribution of the treatment effect. The figure shows that there is a small portion of the population that ends up saving a large amount, over 10,000 pesos, due to the commitment savings device. It also shows that for a large part of the population the commitment savings may or may not increase savings.
There may even be people who actually end up saving less. Unlike the example above, we cannot show that the fraction must be greater than 0.
Intent To Treat Discussion
One issue with the analysis presented above, and with the main results of Ashraf, Karlan, and Yin (2006), is that they are the intent to treat estimates. We have estimated the treatment effect of being “assigned” to a commitment account. People are not lab rats. They have free will. In this case, people assigned to the commitment accounts had the choice of whether to open the account or not. Many did not.
Can you calculate the average treatment effect using the instrumental variable approach? Hint: it is much higher. Did you calculate the ATE or the LATE?
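A sketch of that calculation (my own, using an illustrative take-up rate rather than the actual figure from the data): the instrumental variable (Wald) estimate scales the intent-to-treat effect by the share of offered clients who actually opened an account.

itt <- 411.47 # intent-to-treat effect estimated above
take_up <- 0.28 # hypothetical take-up rate, for illustration only
itt / take_up # implied effect on those who actually opened an account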
More generally, the concern is that we do not know what would happen to the savings of people who were assigned the commitment account but chose not to open it. Did these people know something that we do not?
Manski Bounds
In his seminal paper, Non-Parametric Bounds on Treatment Effects, Chuck Manski introduced the idea of set estimation to economics (Charles F. Manski 1990). Manski argues that many of the assumptions underlying standard econometrics are ad hoc and unjustified. Rather than making such assumptions, Manski suggests presenting results based on assumptions that can be well justified. In many cases, such assumptions do not provide precise estimates.
Manski also points out that the econometrician and the policy maker may have different views on the reasonableness of assumptions. Therefore, the econometrician should present the results ordered from those based on the most reasonable assumptions to those results based on the least reasonable assumptions. This approach to presenting research gives the policy maker a better understanding of the relationship between the assumptions and the policy predictions (Charles F. Manski and Pepper 2013).
The section presents the bounds approach and illustrates it with simulated data.
Confounded Model
Consider the following confounded version of the model presented above.
where
This time, the value of the policy variable is also determined by the unobserved characteristic that determines income.
where
Simulation of Manski Bounds
Consider simulated data from a confounded data set.
c <- 2
d <- 4
f <- -1
Z <- round(runif(N))
u_2 <- rnorm(N)
xstar <- f + c*u + d*Z + u_2
X <- xstar > 0 # treatment assignment
Y <- (1-X)*y[,1] + X*y[,2] # outcome conditional on treatment
mean(Y[X==1]) - mean(Y[X==0])
[1] 3.506577
The simulated data illustrates the problem. If we assume unconfoundedness, we can estimate the average treatment effect. Our estimate is not close to the true value of 2. Try running OLS of
In economics we call this a selection problem. One solution is to use an instrumental variable estimate to determine
Bounding the Average Treatment Effect
The average treatment effect of college is the difference in the expected outcome given a policy of going to college and a policy of not going to college.
From above we know it can be written as the difference in expected income when the policy forces everyone to go to college and the expected income when the policy forces everyone not to go to college.
We can write this out via the Law of Total Expectation.
Each expectation can be split into the group that attends college and the group that does not attend college. We observe the outcome of the policy that sends the individuals to college for the group that actually goes to college. If we assume that their outcome from the policy is the same as we observe, then we can substitute the observed values into the equation.
We don’t know the outcome of the policy that sends individuals to college for the group that actually does not go to college. Note that I rearranged the equation a little.
We cannot determine the ATE. But we can bound the ATE by replacing the values we cannot observe with values we can observe. Importantly, we know these observed values must be larger (smaller) than the values we cannot observe.
Natural Bounds of the Average Treatment Effect
What is the weakest assumption we could make? An expectation is bounded by the smallest possible value and the largest possible value. An average cannot be smaller than the smallest possible value in the set being averaged. Similarly, the average cannot be larger than the largest possible value in the set being averaged.
The bounds are created by replacing the unknown values with the smallest (largest) values they could be. Let
Given these values, we can calculate the bounds on the average treatment effect.
Note how the bounds on the ATE are calculated. The maximum on the ATE is denoted by the overline. It is when the first expected outcome is as large as possible and the second expected outcome is as small as possible. Similarly, the minimum on the ATE is when the first outcome is as small as possible and the second outcome is as large as possible. The minimum on the ATE is denoted by the underline.
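In symbols (my own reconstruction, consistent with the code below), with $\bar{y}$ and $\underline{y}$ the largest and smallest possible outcomes:

$$\overline{ATE} = \left(E[y \mid X=1] - \underline{y}\right)P(X=1) + \left(\bar{y} - E[y \mid X=0]\right)P(X=0)$$

$$\underline{ATE} = \left(E[y \mid X=1] - \bar{y}\right)P(X=1) + \left(\underline{y} - E[y \mid X=0]\right)P(X=0)$$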
Natural Bounds with Simulated Data
In the simulated data we can use the observed minimum and maximum.
PX1 = mean(X==1)
PX0 = mean(X==0)
EY_X1 = mean(Y[X==1])
EY_X0 = mean(Y[X==0])
minY = min(Y)
maxY = max(Y)
The bounds are calculated by replacing the unknown outcome with the minimum possible value of the outcome and, alternatively, the maximum possible value for the outcome.
# ATE upper bound
(EY_X1 - minY)*PX1 + (maxY - EY_X0)*PX0
[1] 7.975223
# ATE lower bound
(EY_X1 - maxY)*PX1 + (minY - EY_X0)*PX0
[1] -5.010368
These bounds are wide. The average treatment effect of
Are Natural Bounds Useless?
The bounds presented above are wide and don’t even predict the correct sign for the ATE. What can we take away from this information?
First, if we are unwilling to make stronger assumptions, then the data may simply not help us answer the policy question of interest. Manski calls the willingness to make incredible assumptions in order to get more certain results, the “lure of incredible certitude” (Charles F. Manski 2018). He argues that this practice reduces the public and the policy maker’s willingness to rely on science and accept new knowledge.
Second, it is not that we don’t learn anything from the data. In this case we learn that the effect of a policy
Third, there may be assumptions and data that are reasonable and allow tighter bounds. Those are discussed more in the following sections.
Bounds with Exogenous Variation
We may have tighter bounds through variation in the data. In particular, we need variation such that the effect of the policy doesn’t change across different subsets of the data, but the bounds do.
Level Set
Assumption 2 is like an instrumental variables assumption. Manski calls it a level-set assumption.49 It states that there exists some observable characteristic such that the average treatment effect does not change with changes in the observable characteristic. Given this property it is possible to get tighter bounds by estimating the bounds on the average treatment effect for various subsets of the data. Under the assumption, the average treatment effect must lie in the intersection of these bounds. Thus the new bounds are the intersection of these estimated bounds.
These are the bounds when the instrument-like variable has two values (
Exogenous Variation in Simulated Data
We haven’t used it yet, but there is a variable
EY_X1Z1 = mean(Y[X==1 & Z==1])
EY_X1Z0 = mean(Y[X==1 & Z==0])
EY_X0Z1 = mean(Y[X==0 & Z==1])
EY_X0Z0 = mean(Y[X==0 & Z==0])
PX1_Z1 = mean(X[Z==1]==1)
PX1_Z0 = mean(X[Z==0]==1)
PX0_Z1 = mean(X[Z==1]==0)
PX0_Z0 = mean(X[Z==0]==0)
# ATE upper bound
min((EY_X1Z1 - minY)*PX1_Z1 + (maxY - EY_X0Z1)*PX0_Z1,
    (EY_X1Z0 - minY)*PX1_Z0 + (maxY - EY_X0Z0)*PX0_Z0)
[1] 7.049019
# ATE lower bound
max((EY_X1Z1 - maxY)*PX1_Z1 + (minY - EY_X0Z1)*PX0_Z1,
    (EY_X1Z0 - maxY)*PX1_Z0 + (minY - EY_X0Z0)*PX0_Z0)
[1] -4.00698
We see that using the level set restriction we do get tighter bounds, but the change is not very large. What changes could you make in the simulated data to get a larger effect of using the level set restriction?
Bounds with Monotonicity
Can the bounds be tighter with some economics? Remember that we observe the cases where
In the simulated data a higher unobserved term is associated with a greater likelihood of choosing treatment
Monotonicity
Assumption 3 states that observing someone receive treatment
The monotonicity assumption implies that forcing everyone into treatment
Bounds with Monotonicity in the Simulated Data
# ATE upper bound
(maxY - EY_X0)*PX0
[1] 3.76668
# ATE lower bound
(EY_X1 - maxY)*PX1
[1] -3.64774
Imposing Assumption 3 on the simulated data allows us to tighten the bounds. They reduce down to
Note that the impact of these assumptions is presented in the order that Charles F. Manski and Pepper (2013) prefer. We started with the most credible assumption, the natural bounds. Then we moved to make a level-set restriction because we had a variable that satisfied the assumption. Finally, we made the monotonicity assumption.
More Guns, Less Crime?
One of the most controversial areas in microeconometrics is estimating the effect of various gun laws on crime and gun related deaths. To study these effects, economists and social scientists look at how these laws vary across the United States and how those changes in laws are related to changes in crime statistics (Charles F. Manski and Pepper 2018).
Justice Louis Brandeis said that a “state may, if its citizens choose, serve as a laboratory; and try novel social and economic experiments without risk to the rest of the country.”51 The US states are a “laboratory of democracy.” As such, we can potentially use variation in state laws to estimate the effects of those laws. The problem is that US states are very different from each other. In the current terminology, the states with strong pro-gun laws tend to be “red” states or at least “purple” states. They also tend to have large rural populations.
Between 1980 and 1990, twelve states adopted Right to Carry (RTC) laws. We are interested in seeing how crime fared in those states relative to states that did not adopt those laws. To do this we can look at crime rates from the 1980s and 1990s. A potential problem is that the crack epidemic hit the United States at exactly this time, rising through the 80s and 90s before tailing off. The crack cocaine epidemic was associated with large increases in crime rates in urban areas (Aneja, III, and Zhang 2011).
This section uses publicly available crime data to illustrate the value of the bounds approach.
Crime Data
The data is downloaded from John Donohue's website.52 While there is quite a lot of variation in gun laws, the definition of RTC used in the data set is “shall issue.” For crime, we use the per-population rate of aggravated assaults in each state, averaged over the post-1990 years. The code also calculates the physical size of the state, which is a variable that will be used later.
library(foreign)
# the data is standard Stata format, the library foreign
# allows this data to be imported.
x <- read.dta("UpdatedStateLevelData-2010.dta")
Y <- X <- Z <- NULL
# the loop will create variables by adding to the vectors
for (i in 2:length(unique(x$state))) {
  # length measures the number of elements in the object.
  # note the first state is "NA"
  state = sort(unique(x$state))[i]
  X <- c(X,sum(x[x$state==state,]$shalll, na.rm = TRUE) > 0)
  # determines if a state has an RTC law at some point in time.
  # na.rm tells the function to ignore NAs
  Y <- c(Y,mean(x[x$state==state & x$year > 1990,]$rataga,
                na.rm = TRUE))
  # determines the average rate of aggravated assault for the
  # state post 1990.
  Z <- c(Z,mean(x[x$state==state & x$year > 1990,]$area,
                na.rm = TRUE) > 53960)
  # determines the physical area of the state
  # Small state = 0, large state = 1
  # print(i)
}
Figure 13 shows the histogram of the average aggravated assault rate per state in the post-1990 years. It shows that the rate per 100,000 is, for the most part, between 0 and 600.
ATE of RTC Laws under Unconfoundedness
If we assume unconfoundedness, then RTC laws lower aggravated assault. Comparing the average rate of aggravated assault in states with RTC laws to states without RTC laws, we see that the average is lower with RTC laws.
EY_X1 <- mean(Y[X==1])
EY_X0 <- mean(Y[X==0])
EY_X1 - EY_X0
[1] -80.65852
Unconfoundedness is not a reasonable assumption. We are interested in estimating the average effect of implementing an RTC law. We are not interested in the average rate of assaults conditional on the state having an RTC law.
Natural Bounds on ATE of RTC Laws
We cannot observe the effect of RTC laws for states that do not have RTC laws. We could assume that the assault rate lies between 0 and 100,000 (which it does).
PX0 <- mean(X==0)
PX1 <- mean(X==1)
minY <- 0
maxY <- 100000
# ATE upper bound
(EY_X1 - minY)*PX1 + (maxY - EY_X0)*PX0
[1] 23666.01
# ATE lower bound
(EY_X1 - maxY)*PX1 + (minY - EY_X0)*PX0
[1] -76333.99
The natural bounds are very wide. An RTC policy may lead to assault rates decreasing by 75,000 or increasing by 24,000 per 100,000 people.
We can make these bounds tighter by assuming that assault rates of the policy cannot lie outside the rates observed in the data.
minY <- min(Y)
maxY <- max(Y)
# ATE upper bound
(EY_X1 - minY)*PX1 + (maxY - EY_X0)*PX0
[1] 334.1969
# ATE lower bound
(EY_X1 - maxY)*PX1 + (minY - EY_X0)*PX0
[1] -624.7655
These bounds are a lot tighter. A policy that introduces RTC for the average state could decrease the assault rate by 625 or increase the assault rate by 334. Given that range, it could be that RTC laws substantially reduce aggravated assaults or it could be that they have little or no effect. They may even cause an increase in aggravated assaults.
Bounds on ATE of RTC Laws with Exogenous Variation
PX1_Z1 <- mean(X[Z==1]==1)
PX1_Z0 <- mean(X[Z==0]==1)
PX0_Z1 <- mean(X[Z==1]==0)
PX0_Z0 <- mean(X[Z==0]==0)
EY_X1Z1 <- mean(Y[X==1 & Z==1])
EY_X1Z0 <- mean(Y[X==1 & Z==0])
EY_X0Z1 <- mean(Y[X==0 & Z==1])
EY_X0Z0 <- mean(Y[X==0 & Z==0])
# ATE upper bound
min((EY_X1Z1 - minY)*PX1_Z1 + (maxY - EY_X0Z1)*PX0_Z1,
    (EY_X1Z0 - minY)*PX1_Z0 + (maxY - EY_X0Z0)*PX0_Z0)
[1] 323.2504
# ATE lower bound
max((EY_X1Z1 - maxY)*PX1_Z1 + (minY - EY_X0Z1)*PX0_Z1,
    (EY_X1Z0 - maxY)*PX1_Z0 + (minY - EY_X0Z0)*PX0_Z0)
[1] -613.3812
We can make a level set assumption. Assume that the instrument-like variable is the physical size of the state. The assumption is that the average treatment effect of implementing an RTC law must be the same irrespective of the physical size of the state. Note that observable outcomes like the assault rate and the proportion of states with RTC laws may vary with the physical size. The assumption is on the average treatment effect which is unobserved.
The bounds are tighter, although not by much. RTC laws could reduce aggravated assaults by up to 613 per 100,000 or increase them by up to 323.
Bounds on ATE of RTC Laws with Monotonicity
Would it be reasonable to use the monotonicity assumption above (Assumption 3)?
Let's assume that states that currently have RTC laws also tend to have lower levels of aggravated assault. Moreover, forcing states that don't currently have RTC laws to adopt them will not reduce expected aggravated assaults below that level. This is the "negative" of the monotonicity assumption used with the simulated data.
Monotonicity (version 2)
We can summarize this with Assumption 4.
Assumption 4 implies the following change to the bounds on the unobserved expectations.
Plugging these into the bounds on the ATE we have the following bounds on the effect of the RTC laws.
# ATE upper bound
(EY_X1 - minY)*PX1
[1] 184.2203
# ATE lower bound
(minY - EY_X0)*PX0
[1] -75.66166
These bounds are substantially tighter. They suggest that the estimate of the ATE under unconfoundedness is actually at the high end of the possible effect of RTC laws. This is evidence that the unconfoundedness assumption cannot hold. At least, it is inconsistent with the weaker monotonicity assumption.
The results in this section suggest the slogan may be more accurately stated as “more guns, more or less crime.”
Discussion and Further Reading
This chapter argues that it may be better to provide less precise estimates than precise predictions of little value to policy makers.
I strongly believe that the average treatment effect is given way too much prominence in economics and econometrics. ATE can be informative, but it can also badly mislead policy makers and decision makers. If we know the joint distribution of potential outcomes, then we may be able to better calibrate the policy. I hope that Kolmogorov bounds will become a part of the modern econometrician’s toolkit. A good place to learn more about this approach is Fan and Park (2010). Mullahy (2018) explores this approach in the context of health outcomes.
Chuck Manski revolutionized econometrics with the introduction of set identification. He probably does not think so, but Chuck has changed the way many economists and most econometricians think about problems. We think much harder about the assumptions we are making. Are the assumptions credible? We are much more willing to present bounds on estimates, rather than make non-credible assumptions to get point estimates.
Manski’s natural bounds allow the researcher to estimate the potential effect of the policy with minimal assumptions. These bounds may not be informative, but that in and of itself is informative. Stronger assumptions may lead to more informative results but at the risk that the assumptions, not the data, determine the results.
I highly recommend any book by Chuck Manski. However, Charles F. Manski (1995) is the standard on non-parametric bounds. To understand more about potential outcomes see Rubin (1974). To understand more about do operators see Pearl and Mackenzie (2018).
Charles F. Manski and Pepper (2018) use the bounds approach to analyze the relationship between right-to-carry laws and crime.
Estimating Demand
Introduction
In the early 1970s, San Francisco was completing a huge new infrastructure project, the Bay Area Rapid Transit (BART) system. The project initially cost $1.6 billion and included tunneling under the San Francisco Bay. Policy makers were obviously interested in determining how many people would use the new system once it was built. But that is a problem. How do you predict the demand for a product that does not exist?
One solution is to ask people. A survey was conducted of people who were likely to use the new transport system. The survey asked detailed questions about their current mode of transport and whether they would use the new system. The concern is that it is hard for people to predict how they would use something that does not exist. Berkeley econometrician Dan McFadden suggested an alternative approach. Instead of asking people to predict what they would do, McFadden suggested using information on what people actually do, and then using economic theory to predict what they would do.
McFadden argued that combining survey data with economic theory would produce more accurate estimates than the survey data alone (McFadden 1974). In the case of the BART survey, McFadden was correct. According to the survey data, 15% of respondents said that they would use BART. McFadden estimated that 6% of respondents would use BART. In fact, 6% of respondents actually did use BART.53 Survey data is valuable, but people give more accurate answers to some questions than others.
The first part of the book discussed how exogenous variation is needed to use observed data to predict policy outcomes. Chapters 1 and 2 assume that observed variation in exposure to a policy is determined independently of unobserved characteristics. Chapters 3 and 4 relaxed this assumption but allowed economic theory to be used in estimating the impact of the policy. This part of the book extends the idea of using economic theory. This chapter introduces the idea of using revealed preference.
Today, the ideas that McFadden developed for analyzing the value of BART are used across economics, antitrust, marketing, statistics and machine learning. At the Federal Trade Commission and the Department of Justice, economists use these techniques to determine whether a merger between ice cream manufacturers, or cigarette manufacturers, or supermarkets, or hospitals, will lead to higher prices.
When Google changed the way it displayed search results, user traffic moved away from Google’s competitors. Such actions by a dominant firm like Google could lead to antitrust actions unless the changes also made users better off. By combining economic theory and data on the behavior of Google’s users, we can determine whether Google’s changes were pro or anti-competitive. According to the FTC’s statement on the Google investigation, analysis of Google’s click through data by staff economists showed that consumers benefited from the changes that Google made. This, and other evidence, led the FTC to end its investigation of Google’s “search bias” practice with a 5-0 vote.54
The chapter begins with the basic economic assumption of demand estimation: revealed preference. It takes a detour to discuss the maximum likelihood algorithm. It returns with Daniel McFadden's model of demand. The chapter introduces the logit and probit estimators and uses them to determine whether consumers in small US cities value rail as much as their big-city neighbors.
Revealed Preference
McFadden’s analysis, and demand analysis more generally, relies on the following assumption.
(Revealed Preference) If there are two choices,
Assumption 5 states that if we observe someone choose product
The section uses simulated data to illustrate how revealed preference is used to estimate consumer preferences.
Modeling Demand
Consider a data set where a large number of individuals are observed purchasing either product
Person
We usually make a transformation to normalize everything relative to one of the available products. That is, all prices and demand are made relative to one of the available products. Here we will normalize to product
In addition, we often observe the data at the market level rather than the individual level. That is, we see the fraction of individuals that purchase
Simulating Demand
The simulated data illustrates the power of the revealed preference assumption. Consider the following distribution of an unobserved term. The unobserved term is drawn from a normal distribution with a mean of 1 and a variance of 9 (
set.seed(123456789)
N <- 1000
u <- sort(rnorm(N, mean=1, sd=3))
Can we uncover this distribution from observed behavior of the individuals in our simulated data? Can we use the revealed preference assumption to uncover the unobserved term (
p <- 2
mean(u - p > 0) # estimated probability
[1] 0.386
1 - pnorm(p, mean=1, sd=3) # true probability
[1] 0.3694413
If
Revealing Demand
If we are able to observe a large number of prices, then we can use revealed preference to estimate the whole distribution of the unobserved utility. At each price, the share of individuals purchasing product
p <- runif(9, min=-10, max=10)
# 9 points between -10 and 10.
s <- matrix(NA, length(p), 1) # share of market buying A.
for (i in 1:length(p)) {
  s[i,1] <- mean(u - p[i] > 0)
  #print(i)
}
plot(1-ecdf(u)(u),u, type="l",lwd=3,lty=1,col=1,
xlab="s", ylab="p", xlim=c(0,1))
# ecdf(a)(a) presents the estimated probabilities of a.
lines(sort(s),p[order(s)], type="l", lwd=3,lty=2)
legend("bottomleft",c("True","Est"),lwd=3,lty=c(1:2))
Figure 14 presents the estimated demand curve.
Discrete Choice
Demand estimation often involves outcomes with discrete values. In McFadden’s original problem, we observe one of three choices, car, bus, or train. OLS tends not to work very well when the outcome of interest is discrete or limited in some manner. Given this, it may be preferable to use a discrete choice model such as a logit or probit.56
The section uses simulated data to illustrate issues with estimating the discrete choice model.
Simple Discrete Choice Model
Consider the following discrete model. There is some latent (hidden) value of the outcome (
We can think of
Simulating Discrete Choice
In the simulated data there is a latent variable (
set.seed(123456789)
N <- 100
a <- 2
b <- -3
u <- rnorm(N)
x <- runif(N)
ystar <- a + b*x + u
y <- ystar > 0
lm1 <- lm(y ~ x)
Figure 15 shows that the estimated relationship differs substantially from the true relationship. The figure illustrates how OLS fails to accurately estimate the parameters of the model. To estimate the relationship correctly we need to know the distribution of the unobserved term.
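Figure 15 is not reproduced here. A minimal sketch of one way to draw the comparison from the simulated objects above is the following; plotting the true choice probability implied by the normal error is an assumption about what the book's figure shows.

plot(x, as.numeric(y), pch = 16, col = "gray",
     xlab = "x", ylab = "y")
abline(lm1, lwd = 3, lty = 2)  # OLS fit
curve(pnorm(2 - 3*x), add = TRUE, lwd = 3, lty = 1)  # true Pr(y = 1 | x)
legend("topright", c("True Pr(y=1|x)", "OLS"), lwd = 3, lty = c(1, 2))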
Modeling Discrete Choice
We can write out the model using matrix notation.
where
The probability of observing one of the outcomes (
where
If we know
The standard solution is to assume we know the true distribution of the unobserved term (
where
Maximum Likelihood
The standard algorithm for estimating discrete choice models is maximum likelihood. The maximum likelihood algorithm generally requires some assumption about the distribution of the error term. However, as seen above, we are generally making such an assumption anyway.
This section takes a detour to illustrate how the maximum likelihood algorithm works.
Binomial Likelihood
Consider the problem of determining whether a coin is "fair," that is, whether the coin has an equal probability of Heads or Tails when tossed. If the coin is weighted then it may not be fair: it may have a greater probability of landing on Tails than on Heads. The code simulates an unfair coin. The observed frequency of Heads is 34 out of 100.
set.seed(123456789)
N <- 100
p <- 0.367 # the true probability of Heads.
Head <- runif(N) < p
mean(Head) # the observed frequency of Heads.
[1] 0.34
What is the likelihood that this data was generated by a fair coin? It is the probability of observing 34 Heads and 66 Tails given that the true probability of a Head is 0.5.
What is the probability of observing 1 Head given the true probability is 0.5? It is just the probability of Heads, which is 0.5.
What is the probability of observing three Heads and zero Tails? If the coin tosses are independent of each other, then it is the probability of each Head, all multiplied together.58
How about three Heads and two Tails?
Actually, it isn't. This is the probability of observing 3 Heads and *then* 2 Tails. But it could have been 1 Head, 2 Tails, 2 Heads or 1 Tail, 1 Head, 1 Tail, 2 Heads etc., etc. There are a number of different orderings of results that have 3 Heads and 2 Tails. In this case there are 10 such orderings.
In R we can use factorial() to do the calculation.
factorial(5)/(factorial(3)*factorial(2))
[1] 10
What is the likelihood of observing 34 Heads and 66 Tails? If the true probability is 0.5, the likelihood is given by the binomial function.
What is the likelihood of observing
In R we can use the choose() function to calculate the coefficient for the binomial function.
choose(100, 34)*(0.5^100)
[1] 0.0004581053
The likelihood that the coin is fair seems small.
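As a quick check on the arithmetic (not in the original text), R's built-in binomial density gives the same number.

# probability of exactly 34 Heads in 100 tosses of a fair coin
dbinom(34, size = 100, prob = 0.5)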
What is the most likely true probability? One method uses the analogy principle. If we want to know the true probability then we use the analogy in the sample. The best estimate of the true probability is the observed frequency of Heads in the sample (Charles F. Manski 1990).59 It is 34/100. Note that this is not equal to the true probability of 0.367, but it is pretty close.
Alternatively, find the probability that maximizes the likelihood. What is the true probability
It is not a great idea to ask a computer to solve the problem as written. The issue is that these numbers can be very very small. Computers have a tendency to change very small numbers into other, totally different, small numbers. This can lead to errors.
Find the probability that maximizes the log likelihood.
The solution to this problem is identical to the solution to the original problem.60
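In this simple case the maximization can also be done by hand. A short sketch of the derivation (not in the original text), with h Heads out of N tosses:

$$\frac{d}{dp}\Big[h\log p + (N-h)\log(1-p)\Big] = \frac{h}{p} - \frac{N-h}{1-p} = 0 \quad\Longrightarrow\quad \hat{p} = \frac{h}{N}.$$

So the maximum likelihood estimate is the observed frequency, 34/100, which is exactly the analog estimate.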
Binomial Likelihood in R
Equation 65 serves as pseudo-code for a simple optimizer in R. We can use the optimize() function to find the minimum value on an interval.61 Note also that the function being optimized drops the coefficient of the binomial function. Again, this is fine because the optimum does not change.
ll_function <- function(p, N, p_hat) {
  return(-((p_hat*N)*log(p) + (1 - p_hat)*N*log(1-p)))
  # Note the negative sign as optimize is a minimizer.
}
optimize(f = ll_function, interval=c(0,1), N = 100, p_hat=0.34)
$minimum
[1] 0.3399919
$objective
[1] 64.10355
The maximum likelihood estimate is 0.339, which is fairly close to the analog estimate of 0.34. Figure 16 shows that the likelihood function is relatively flat around the true value. The implication is that the difference in the likelihood between the true value and the estimated value is quite small.
p <- c(1:1000)/1000
log_lik <- sapply(p, function(x) lchoose(100,34) + 34*log(x) +
                    66*log(1-x))
# sapply is a quick way for R to apply a function to a set
# note that we have defined the function on the fly.
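Figure 16 plots this log likelihood against p. A minimal sketch of such a plot (the styling is an assumption, not the book's code) is:

plot(p, log_lik, type = "l", lwd = 3,
     xlab = "p", ylab = "log likelihood")
abline(v = 0.34, lty = 2)  # observed frequency of Heads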
OLS with Maximum Likelihood
We can use maximum likelihood to estimate OLS. Chapter 1 presented the standard algorithms for estimating OLS. It points out that with an additional assumption, the maximum likelihood algorithm could also be used instead.
Assume we have data generated by the following linear model.
where
We can use maximum likelihood to estimate parameters of this model. However, we must assume that
We can determine the likelihood of observing the data by first rearranging Equation 66.
The probability of observing the outcome is as follows.
where
and
Note that in R it is necessary to use the standard normal distribution function.62 To use this function we need to normalize the random variable by subtracting the mean of the unobserved term, which is zero, and dividing by the standard deviation of the unobserved term (
Therefore, the likelihood of observing the data is given by the following product.
The sample size is
We can find the maximum likelihood estimates of
Compare this to the estimator in Chapter 1.64
Maximum Likelihood OLS in R
We can create a maximum likelihood estimator of the OLS model using Equation 71 as pseudo-code.
f_ols <- function(para, y, X) {
  X <- cbind(1,X)
  N <- length(y)
  J <- dim(X)[2]
  sig <- exp(para[1])
  # Note that sig must be positive.
  # The exponential function maps
  # from any real number to positive numbers.
  # It allows the optimizer to choose any value and
  # transforms that number into a positive value.
  beta <- para[2:(J+1)]
  z <- (y - X%*%beta)/sig
  log_lik <- -sum(log(dnorm(z)) - log(sig))
  return(log_lik)
  # remember we are minimizing.
}
The standard optimizer in R is the function optim(). By default it uses the Nelder-Mead method, which is a fairly robust algorithm.
a1 <- optim(par=c(0,2,-3), fn=f_ols, y=ystar, X=x)
# optim takes in starting values with par, then the function
# used, and then values that the function needs.
# we cheat by having it start at the true values.
# sig
exp(a1$par[1])
[1] 0.9519634
# beta
a1$par[2:3]
[1] 1.832333 -2.643751
Our estimates are reasonably close to the true values.
Probit
Circling back, consider the discrete choice problem that began the chapter. If we have information about the distribution of the unobserved term, which is generally assumed, then we can find the parameters that maximize the likelihood of the model predicting the data we observe.
Consider the problem described by Equation 56. Assume the unobserved characteristic is distributed standard normal, then the likelihood of observing the data is given by the following function.
Equation 72 shows the likelihood of observing the data we actually observe (
The parameter of interest,
Note that I made a slight change going from Equation 72 to Equation 73. I took advantage of the fact that the normal distribution is symmetric. This version is better for computational reasons.65
Probit in R
f_probit <- function(beta, y, X) {
  X <- cbind(1,X)
  Xb <- X%*%beta
  log_lik <- (1 - y)*log(pnorm(-Xb)) + y*log(pnorm(Xb))
  return(-sum(log_lik))
}
optim(par=lm1$coefficients,fn=f_probit,y=y,X=x)$par
(Intercept) x
2.014153 -2.835234
We can use Equation 73 as the basis for our own probit estimator. The probit estimates are closer to the true values of 2 and -3, although they are still not particularly close. Why aren't the estimates closer to the true values?66
Generalized Linear Model
The probit is an example of a generalized linear model. The outcome vector is,
In R these types of models can often be estimated with the glm() function. Like the lm() function, glm() creates an object that includes numerous results, including the coefficients. The nice thing about the glm() function is that it includes a variety of different models. Unfortunately, that also makes it somewhat unwieldy to use.

We can compare our probit estimates to those from the built-in R probit model using glm().
glm(y ~ x, family = binomial(link="probit"))$coefficients
(Intercept) x
2.014345 -2.835369
The results are about the same. The two models are solved using different algorithms: glm() uses iteratively reweighted least squares rather than a general-purpose numerical optimizer like optim().
McFadden’s Random Utility Model
In order to estimate the impact of the BART rail system, McFadden needed a model that captured current choices and predicted demand for a product that didn’t exist.
The section presents McFadden's model, the probit and logit estimators, and simulation results.
Model of Demand
In McFadden’s model, person
Person
To predict demand for BART from observed demand for cars and buses we need two assumptions. First, the preference weights (
We can use revealed preference and observed choices to make inferences about person
Equation 75 shows that if there is enough variation in the observed characteristics of the choices (
Probit and Logit Estimators
If we assume that
McFadden’s original paper estimates a logit. It assumes that the unobserved characteristics are distributed extreme value type 1. This mouthful-of-a-distribution is also called Gumbel or log Weibull. The advantage of this distribution is that the difference in the unobserved terms is a logistic distribution and a logit model can be used.
The logit has some very nice properties. In particular, it is very easy to compute. This made it a valuable model in the 1970s. Even today the logit is commonly used in machine learning because of its computational properties.68 The logit assumption allows the probability of interest to have the following form.
This function is very useful. Whatever value you give it, it returns a number between 0 and 1, which can be interpreted as a probability.69 It is often used in optimization problems for this reason.
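As a small illustration (not from the text), the logit transformation can be written as a one-line R function and checked against R's built-in plogis().

logistic <- function(z) exp(z)/(1 + exp(z))
# maps any real number into (0,1)
logistic(c(-100, 0, 2))  # roughly 0, exactly 0.5, about 0.88
plogis(c(-100, 0, 2))    # built-in equivalent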
Simulation with Probit and Logit Estimators
Consider a simulation of the McFadden model with both the probit and logit assumption on the unobserved characteristics.
set.seed(123456789)
N <- 5000
XA <- cbind(1,matrix(runif(2*N),nrow=N))
XB <- cbind(1,matrix(runif(2*N),nrow=N))
# creates two product characteristic matrices
beta <- c(1,-2,3)
In the simulation there are 5,000 individuals choosing between two products with two observable characteristics. Note that these characteristics vary across the individuals, but the preferences of the individuals do not.
# Probit
uA <- rnorm(N)
y <- XA%*%beta - XB%*%beta + uA > 0
glm1 <- glm(y ~ I(XA - XB), family = binomial(link="probit"))
# note that I() does math inside the glm() function.
The probit model assumes that the unobserved characteristic (more precisely, the relative unobserved characteristic) is distributed standard normal. The logit assumes that the unobserved characteristics are distributed extreme value type 1. Note that the function I() allows mathematical operations within the glm() or lm() formula. Here it simply takes the difference between the two matrices of observed characteristics.
# Logit
uA <- log(rweibull(N, shape = 1))
# gumbel or extreme value type 1
uB <- log(rweibull(N, shape = 1))
y <- (XA%*%beta - XB%*%beta) + (uA - uB) > 0
glm2 <- glm(y ~ I(XA - XB), family = binomial(link="logit"))
What happens if you run OLS? Do you get the right sign? What about magnitude?
Table 12 presents the probit and logit estimates. The table shows that the probit gives estimates that are very close to the true values of -2 and 3. Why is the intercept term 0 rather than the true value of 1? The estimates from the logit are also relatively close to the true values. The different error assumptions of the logit may lead to wider variation in the estimates.
Multinomial Choice
In the analysis above, individuals choose between two options. In many problems individuals have many choices. This section looks at commuters choosing between car, bus, and train. It is in these multinomial choice problems that the logit really comes into its own.
The section presents a multinomial probit, actually a bivariate probit which can be used to model the choice between three options. This model is relatively general, but the computational burden increases exponentially (possibly geometrically) in the number of choices. This is called the curse of dimensionality. One solution to this computational issue is to use an ordered probit. The book doesn’t discuss this model but the model can be very useful for certain problems.
Instead, the section considers a multinomial logit and the independence of irrelevant alternatives assumption. The assumption implies that the unobserved characteristics associated with one choice are independent of the unobserved characteristics associated with any other choice. This assumption allows the logit model to handle very large choice problems. It also allows the model to handle predictions about new choices that are not in the observed data, such as the BART rail system. However, it is a strong restriction on preferences.
The section illustrates these methods with simulated data.
Multinomial Choice Model
Consider the full model of choice over three potential modes of transportation. In the model each person
where
In the case where rail is also a choice, person
As discussed above, the standard approach to estimation of choice problems in economics is to have a “left out” choice or reference category. In this case, that transport mode is bus. The observed characteristics of car and train are created in reference to bus. Let
From Equation 81, the probability of observing the choice of car is as follows.
Figure 17 depicts the three choices as a function of the unobserved characteristics. If the relative unobserved characteristics of both choices are low enough, then neither is chosen. That is, the individual chooses Bus. The individual chooses Car if the relative unobserved characteristic is higher for Car than it is for Train.
Multinomial Probit
When there are three choices, the multinomial normal is a bivariate normal. The bivariate probit model assumes
and
The simplest case is the probability of observing Bus (
Simplest, not simple. A word to the wise, writing out multivariate probits is a good way to go insane.
There is a lot to unpack in Equation 84. First, it is written out in a way that allows an "easy" transition to R code. The probability in question is a joint probability over the two relative unobserved characteristics. This joint probability can be written as a mixture distribution of the probability of not choosing Train conditional on the unobserved characteristic for the choice of Car. Next, the probability of not choosing Train is written out using a standard normal distribution. This is useful for programming it up in R. Lastly, because the unobserved characteristics may be correlated, the variable needs to be normalized, where the distribution of the conditional unobserved characteristic has a mean of
The probability of observing the choice of Car (
Now the choice of Car is determined by both the relative value of Car to Bus and the relative value of Car to Train. If the unobserved characteristic associated with the Car choice is high, then the unobserved characteristic associated with Train does not have to be that low for Car to be chosen.
Multinomial Probit in R
Equation 84 and Equation 85 have been written in a way that suggests pseudo-code for the bivariate probit. Note how the integral is computed: the program takes a set of draws from a standard normal distribution and, for each draw, calculates the appropriate probabilities. After the loop has finished, the probabilities are averaged. This approach gives relatively accurate estimates even with a relatively small number of draws. Even so, the computational burden of this estimator is large. Moreover, this burden increases dramatically with each choice that is added to the problem.
f_biprobit <- function(para, Y, W1, W2, K = 10) {
  # K is a counter for the number of draws from the normal
  # distribution - more draws gives greater accuracy
  # but slower computational times.
  # function setup
  set.seed(123456789) # helps the optimizer work better.
  eps <- 1e-20 # used below to make sure the logs work.
  W1 <- cbind(1,W1)
  W2 <- cbind(1,W2)
  N <- dim(Y)[1]
  J <- dim(W1)[2]
  rho <- exp(para[1])/(1 + exp(para[1]))
  # the "sigmoid" function that keeps the value between 0 and 1.
  # It assumes that the correlation is positive.
  beta <- para[2:(J+1)]
  # integration to find the probabilities
  u <- rnorm(K)
  p_00 <- p_10 <- rep(0, N)
  for (uk in u) {
    uk0 <- uk < -W1%*%beta
    p_00 <- p_00 +
      uk0*pnorm((-W2%*%beta - rho*uk)/((1 - rho^2)^(.5)))
    p_10 <- p_10 + (1 - uk0)*pnorm(((W1 - W2)%*%beta +
      (1 - rho)*uk)/((1 - rho^2)^(.5)))
  }
  # determine the likelihood
  log_lik <- (Y[,1]==0 & Y[,2]==0)*log(p_00/K + eps) +
    (Y[,1]==1 & Y[,2]==0)*log(p_10/K + eps) +
    (Y[,1]==0 & Y[,2]==1)*log(1 - p_00/K - p_10/K + eps)
  return(-sum(log_lik))
}
Multinomial Logit
Given the computational burden of the multivariate normal, particularly as the number of choices increase, it is more common to assume a multinomial logit. This model assumes that the unobserved characteristic affecting the choice between Car and Bus is independent of the unobserved characteristic affecting the choice between Car and Train.71
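In the notation of the code below, with Bus as the left-out choice and W1 and W2 the characteristics of Car and Train relative to Bus, the choice probabilities take a simple closed form (these formulas are read off the f_logit function rather than reproduced from the book's Equation 86):

$$\Pr(\text{Car}) = \frac{e^{W_{1}\beta}}{1 + e^{W_{1}\beta} + e^{W_{2}\beta}}, \qquad \Pr(\text{Train}) = \frac{e^{W_{2}\beta}}{1 + e^{W_{1}\beta} + e^{W_{2}\beta}}, \qquad \Pr(\text{Bus}) = 1 - \Pr(\text{Car}) - \Pr(\text{Train}).$$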
As above, a maximum log-likelihood algorithm is used to estimate the multinomial logit.
Multinomial Logit in R
Starting with the bivariate probit and Equation 86, we can write out the multinomial logit with three choices.
f_logit <- function(beta, Y, W1, W2) {
  eps <- 1e-20 # so we don't take log of zero.
  Y1 <- as.matrix(Y)
  W1 <- as.matrix(cbind(1,W1))
  W2 <- as.matrix(cbind(1,W2))
  p_10 <-
    exp(W1%*%beta)/(1 + exp(W1%*%beta) + exp(W2%*%beta))
  p_01 <-
    exp(W2%*%beta)/(1 + exp(W1%*%beta) + exp(W2%*%beta))
  p_00 <- 1 - p_10 - p_01
  log_lik <- (Y[,1]==0 & Y[,2]==0)*log(p_00 + eps) +
    (Y[,1]==1 & Y[,2]==0)*log(p_10 + eps) +
    (Y[,1]==0 & Y[,2]==1)*log(p_01 + eps)
  return(-sum(log_lik))
}
Notice that, unlike the multinomial probit, we do not need to take random number draws. Calculating the probabilities is much simpler and computationally less intensive.
Simulating Multinomial Choice
Now we can compare the two methods. The simulation uses a bivariate normal distribution to simulate individuals choosing between three options. The simulation also assumes that the unobserved characteristics are correlated between choices.
require(mvtnorm)
# this package creates multivariate normal distributions
set.seed(123456789)
N <- 1000
mu <- c(0,0)
rho <- 0.1 # correlation parameter
Sig <- cbind(c(1,rho),c(rho,1))
u <- rmvnorm(N, mean = mu, sigma = Sig)
# relative unobserved characteristics of two choices
x1 <- matrix(runif(N*2), nrow=N)
x2 <- matrix(runif(N*2), nrow=N)
# relative observed characteristics of two choices.
a <- -1
b <- -3
c <- 4
U <- a + b*x1 + c*x2 + u
Y <- matrix(0, N, 2)
Y[,1] <- U[,1] > 0 & U[,1] > U[,2]
Y[,2] <- U[,2] > 0 & U[,2] > U[,1]
par1 <- c(log(rho),a,b,c)
a1 <- optim(par=par1, fn=f_biprobit, Y=Y,
            W1=cbind(x1[,1],x2[,1]),
            W2=cbind(x1[,2],x2[,2]), K=100,
            control = list(trace=0,maxit=1000))
par2 <- c(a,b,c)
b1 <- optim(par=par2, fn=f_logit, Y=Y, W1=cbind(x1[,1],x2[,1]),
            W2=cbind(x1[,2],x2[,2]),
            control = list(trace=0,maxit=1000))
a1$par[2:4]
[1] -1.039715 -2.819453 3.887579
b1$par[1:3]
[1] -1.591007 -4.976470 6.687181
At 100 draws, the bivariate probit estimator does pretty well, with
Demand for Rail
In McFadden’s analysis, policy makers were interested in how many people would use the new BART rail system. Would there be enough users to make such a large infrastructure project worthwhile? This question remains relevant for cities across the US. Many large US cities like New York, Chicago and Boston, have major rail infrastructure, while many smaller US cities do not. For these smaller cities, the question is whether building a substantial rail system will lead to an increase in public transportation use.
To analyze this question we can use data from the National Household Travel Survey,72 in particular the publicly available household component of that survey. The data provides information on which mode of transport the household uses most days: car, bus, or train. It contains demographic information such as home ownership and income, and geographic information such as rural or urban residence. Most importantly for our question, the data provides a measure of how dense the rail network is in the household's location.
The section uses logit and probit models to estimate the demand for cars in “rail” cities and “non-rail” cities. It estimates the multinomial version of these models for “rail” cities. It then takes those parameter estimates to predict demand for rail in “non-rail” cities.
National Household Travel Survey
The following code brings in the data and creates variables. It creates the "choice" variables for transportation use. Note that households may report using more than one mode, but the variables are defined to be mutually exclusive. The code also adjusts variables for missing values. The data uses a missing code of -9, which would be treated as an actual value if it were not replaced with NA.
x <- read.csv("hhpub.csv", as.is = TRUE)
x$choice <- NA
x$choice <- ifelse(x$CAR==1,"car",x$choice)
x$choice <- ifelse(x$BUS==1,"bus",x$choice)
x$choice <- ifelse(x$TRAIN==1,"train",x$choice)
# Note that this overrules the previous choice.
x$car1 <- x$choice=="car"
x$train1 <- x$choice=="train"
# adjusting variables to account for missing.
x$home <- ifelse(x$HOMEOWN==1,1,NA)
x$home <- ifelse(x$HOMEOWN>1,0,x$home)
# home ownership
x$income <- ifelse(x$HHFAMINC > 0, x$HHFAMINC, NA)
# household income
x$density <- ifelse(x$HTPPOPDN==-9,NA,x$HTPPOPDN)/1000
# missing is -9.
# population density
# dividing by 1000 makes the reported results look nicer.
x$urban1 <- x$URBAN==1 # urban versus rural
y <- x[x$WRKCOUNT>0 & (x$MSACAT == 1 | x$MSACAT == 2),]
# limit to households that may commute and those that
# live in some type of city.
y$rail <- y$RAIL == 1
# an MSA with rail
index_na <- is.na(rowSums(cbind(y$car1,y$train1,
                                y$home,y$HHSIZE,y$income,
                                y$urban1,y$density,y$MSACAT,
                                y$rail)))==0
y <- y[index_na,] # drop missing
How different are “rail” cities from “non-rail” cities? The plan is to use demand estimates for cars, buses, and rail in cities with rail networks to predict demand for rail in other cities. However, cities with and without rail may differ in a variety of ways which may lead to different demand for rail between the two types of cities.
vars <- c("car1","train1","home","HHSIZE","income",
          "urban1","density")
summ_tab <- matrix(NA,length(vars),2)
for (i in 1:length(vars)) {
  summ_tab[i,1] <- mean(y[y$rail==1,colnames(y)==vars[i]])
  summ_tab[i,2] <- mean(y[y$rail==0,colnames(y)==vars[i]])
}
row.names(summ_tab) <- vars
colnames(summ_tab) <- c("Rail","No Rail")
Table 13 presents summary statistics for each type of city. It shows that in cities with rail networks about 10% of the population uses trains most days, compared with only about 1% in cities without dense rail networks. The two types of cities also differ in other ways, including income and population density.
What would happen to the demand for rail if a city without one built it? Would demand increase to 10% as it is in cities with rail networks? Or would demand be different due to other differences between the cities?
Demand for Cars
We can also look at how the demand for cars varies between the two types of cities.
# Without Rail
y_nr <- y[y$rail == 0,]
glm_nr <- glm(car1 ~ home + HHSIZE + income + urban1 +
                density, data = y_nr,
              family=binomial(link=logit))
# With Rail
y_r <- y[y$rail == 1,]
glm_r <- glm(car1 ~ home + HHSIZE + income + urban1 +
               density, data = y_r,
             family=binomial(link=logit))
|             | (1)       | (2)       |
|-------------|-----------|-----------|
| (Intercept) | 3.775     | 3.506     |
|             | (0.270)   | (0.211)   |
| home        | 0.633     | 0.498     |
|             | (0.095)   | (0.071)   |
| HHSIZE      | -0.018    | 0.050     |
|             | (0.033)   | (0.026)   |
| income      | 0.133     | -0.068    |
|             | (0.019)   | (0.013)   |
| urban1TRUE  | -1.160    | -0.312    |
|             | (0.245)   | (0.187)   |
| density     | -0.055    | -0.109    |
|             | (0.008)   | (0.003)   |
| Num.Obs.    | 22419     | 11624     |
| AIC         | 5129.3    | 7005.1    |
| BIC         | 5177.5    | 7049.2    |
| Log.Lik.    | -2558.673 | -3496.533 |
| RMSE        | 0.16      | 0.29      |
Table 14 shows that the demand for cars varies with access to rail, but it is not clear how to interpret the coefficient estimates. There are many differences between the two types of cities, including the choices available to commuters. Nevertheless, we see demand increasing in home ownership and decreasing in population density.
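One rough way to put the coefficients on an interpretable scale (not done in the text) is to compare the average predicted probability of mainly commuting by car across the two samples.

# average predicted probability of commuting by car
mean(predict(glm_nr, type = "response"))  # cities without rail
mean(predict(glm_r, type = "response"))   # cities with rail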
Estimating Demand for Rail
We can set up the McFadden demand model for cars, buses and trains. The utility of car and train is relative to bus. Note that the value of train is assumed to be a function of density. The assumption is that trains have fixed station locations and in more dense cities, these locations are likely to be more easily accessible to the average person.
X_r_car <- cbind(y_r$home, y_r$HHSIZE, y_r$income,
                 y_r$urban1, 0)
X_r_train <- cbind(0,0,0,0,y_r$density)
# train value is assumed to be determined by
# population density.
Y1_r <- cbind(y_r$choice=="car", y_r$choice=="train")
The following presents the optimization procedure for the two multinomial choice models. Each uses the initial probit or logit estimates as starting values for the coefficients.
par1 <- c(0,glm_r$coefficients)
a1 <- optim(par=par1, fn=f_biprobit, Y=Y1_r, W1=X_r_car,
            W2=X_r_train, K=100, control=list(trace=0,
                                              maxit=10000))
par2 <- glm_r$coefficients
a2 <- optim(par=par2, fn=f_logit, Y=Y1_r, W1=X_r_car,
            W2=X_r_train, control=list(trace=0,maxit=10000))
Predicting Demand for Rail
Once we estimate demand for car, bus, and rail in cities with rail networks, we can use the estimated parameters to predict demand for rail in a non-rail city. To do this, we combine the parameter estimates from rail cities with the observed characteristics of households in non-rail cities. This is done for both the multinomial probit and the multinomial logit.
X_nr_car <- cbind(1, y_nr$home, y_nr$HHSIZE, y_nr$income,
                  y_nr$urban1, 0)
X_nr_train <- cbind(1,0,0,0,0,y_nr$density)
# Probit estimate
set.seed(123456789)
rho <- exp(a1$par[1])/(1 + exp(a1$par[1]))
beta <- a1$par[2:length(a1$par)]
W1 <- X_nr_car
W2 <- X_nr_train
K1 = 100
u1 <- rnorm(K1)
p_00 <- p_10 <- rep(0, dim(W1)[1])
for (uk in u1) {
  uk0 <- uk < -W1%*%beta
  p_00 <- p_00 +
    uk0*pnorm((-W2%*%beta - rho*uk)/((1 - rho^2)^(.5)))
  p_10 <- p_10 +
    (1 - uk0)*pnorm(((W1 - W2)%*%beta +
    (1 - rho)*uk)/((1 - rho^2)^(.5)))
}
p_train_pr <- (K1 - p_00 - p_10)/K1
mean(p_train_pr)
[1] 0.08872575
# Logit estimate
p_train_lg_top <- exp(X_nr_train%*%a2$par)
p_train_lg_bottom <- 1 + exp(X_nr_car%*%a2$par) +
  exp(X_nr_train%*%a2$par)
p_train_lg <- p_train_lg_top/p_train_lg_bottom
mean(p_train_lg)
[1] 0.08313867
If a city without rail built a rail network, the demand for rail would increase substantially: from 1% to either 8.3% or 8.9%, depending on the estimator used. That is a substantial increase, almost to the 10% we see for cities with rail. Do you think this evidence is enough to warrant investment in rail infrastructure by smaller US cities?
Discussion and Further Reading
The first part of the book presented methods that focused on the use of experimentation, or lack thereof. This part focuses on methods based on economic theory. This chapter discusses how the economic assumption of revealed preference can be used to identify the policy parameters of interest. The Nobel Prize winning economist, Daniel McFadden, showed how combining economic theory and survey data could provide better predictions of demand for new products.
This chapter presented the standard discrete choice models of the logit and probit. The chapter introduces the maximum likelihood algorithm. It shows how this algorithm can be used to estimate OLS, logit and probit models.
The application of the logit and probit is to demand estimation. While these are standard methods, the reader should note that they are not really the way modern demand analysis is done. The chapter assumes that prices (or product characteristics) are exogenous. Modern methods use an IV approach to account for the endogeneity of prices. In particular, the ideas of S. Berry, Levinsohn, and Pakes (1995) have become standard in my field of Industrial Organization. Chapter 7 discusses this approach in more detail.
Estimating Selection Models
Introduction
Chapter 5 introduced the idea of using the economic assumption of revealed preference for estimating policy effects. Berkeley econometrician, Daniel McFadden, won the Nobel prize in economics for his work using revealed preference to estimate demand. McFadden was joined in the Nobel prize by University of Chicago econometrician, James Heckman. Heckman won for his work advocating the use of revealed preference to a broader range of problems.
Chapter 6 considers two related problems, censoring and selection. Censoring occurs when the value of a variable is limited due to some constraint. For example, we tend not to see wages below the federal minimum wage. The chapter shows our estimates can be biased when our statistical models expect the variable to go below the censored level. A standard method to account for censoring is to combine a probit with OLS. This combined model is called a Tobit. The chapter estimates a wage regression similar to Chapter 2’s analysis of returns to schooling. The difference is that here, the regression accounts for censoring of wages at the minimum wage.
The selection problem is a generalization of the censoring problem. The data is censored due to some sort of "choice."73 While McFadden considered problems where the choice was observed but the outcomes were not, Heckman examined questions where both the choice and the outcome of that choice are observed.
The chapter uses Heckman's model to analyze the gender wage gap. The concern is that the observed difference in wages by gender may underestimate the actual difference. Traditionally, many women did not have paid work because they had childcare or other uncompensated responsibilities. Whether or not a woman works full-time depends in part on the wage she is offered. We only observe the offers that were accepted, which means the observed offers are "selected." The Heckman model allows us to account for the "choice" of these women to work.
In addition to analyzing the gender wage gap, the chapter returns to the question of measuring returns to schooling. The chapter uses a version of the Heckman selection model to estimate the joint distribution of potential wages for attending college and not attending college.
Modeling Censored Data
Censoring refers to a variable being set to an arbitrary value, such as 0. Say, for example, that a variable can never be negative. When we look at hours worked, the values are all non-negative: the minimum number of hours a person can work is zero. Such restrictions on the values can make it difficult to use OLS and the other methods described in the previous chapters.
The section presents the latent value model and the Tobit estimator.
A Model of Censored Data
Consider a model eerily similar to the model presented in the previous chapter. There is some latent outcome (
We can think of
Simulation of Censored Data
Consider a simulated version of the model presented above.
set.seed(123456789)
N <- 500
a <- 2
b <- -3
x <- runif(N)
u <- rnorm(N)
ystar <- a + b*x + u
y <- ifelse(ystar > 0, ystar, 0)
lm1 <- lm(y ~ x)
lm1$coefficients[2]
x
-2.026422
Figure 18 shows that the relationship estimated with OLS is quite different from the true relationship. The true relationship has a slope of -3, while the estimated relationship is much flatter, with a slope of about -2. Can you see the problem?
OLS does not provide a correct estimate of the relationship because the data is censored. Only positive values of
The implication is that our method of averaging, discussed in Chapter 1, no longer works. One solution is to limit the data so that it is not censored. Figure 18 suggests that for values of
lm(y[x < 0.6] ~ x[x < 0.6])$coef[2]
x[x < 0.6]
-2.993644
length(y[x < 0.6])
[1] 291
In this case, limiting the sample to the data that is not censored leads to an estimate close to the true value of -3.
Another solution is to ignore the exact values of the positive observations and estimate a probit. If we simplify the problem by setting all positive values to 1, we can use a standard probit. Again, our estimate is not efficient. We have thrown away a bunch of information about the value of
glm(y > 0 ~ x, family = binomial(link = "probit"))$coefficients
(Intercept) x
2.015881 -3.013131
Again, despite throwing away information, the probit gives results that are pretty close to the true values of 2 and -3.
The solution presented below is to use a probit to account for the censoring and estimate OLS on the non-censored data. In particular, the Tobit is a maximum likelihood estimator that allows the two methods to be combined in a natural way. The estimator also uses all the information and so is more efficient than the solutions presented above.
Latent Value Model
One way to correct our estimate is to determine what the censored values of
However, we may be willing to make an assumption about the shape of the distribution. In that case, it may be possible to estimate the distribution of the missing data using information from the data that is not censored.
The latent value model is very similar to the demand model presented in the previous chapter. In both cases, there is some latent value that we are interested in measuring. We can write out the data generating process.
where
If the unobserved characteristic of the individual is high enough, then the outcome is not censored. In that case, we have the OLS model. As shown in the previous chapter, if the unobserved term is distributed normally,
where
where the standard normal density is denoted by
For the alternative case, we can use the probit model.
The censored model with the normality assumption is called a Tobit. The great econometrician, Art Goldberger, named it for the great economist, James Tobin, and the great limited dependent variable model, the probit (Enami and Mullahy 2009).75
Tobit Estimator
We can write out the Tobit estimator by combining the ideas from the maximum likelihood estimator of OLS and the probit presented in Chapter 5.
That is, find the parameters (
The notation translates fairly directly into R.
Unlike the probit model we have an extra term for the distribution of the unobserved term (
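The displayed equation does not survive in this text, but the log likelihood that the R function below implements can be sketched as follows, with $\sigma$ the standard deviation of the unobserved term and $\Phi$ and $\phi$ the standard normal distribution and density functions:

$$\log L(\beta,\sigma) = \sum_{i:\, y_i = 0} \log \Phi\!\left(\frac{-X_i\beta}{\sigma}\right) + \sum_{i:\, y_i > 0} \left[\log \phi\!\left(\frac{y_i - X_i\beta}{\sigma}\right) - \log\sigma\right].$$

The code returns the negative of this sum because optim() minimizes.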
Tobit Estimator in R
We can use Equation 92 as pseudo-code for the estimator in R.
f_tobit <- function(para, y, X) {
  X <- cbind(1,X)
  sig <- exp(para[1]) # use exp() to keep value positive.
  beta <- para[2:length(para)]
  is0 <- y == 0
  # indicator function for y = 0.
  z <- (y - X%*%beta)/sig
  log_lik <- -sum(is0*log(pnorm(z)) +
    (1 - is0)*(log(dnorm(z)) - log(sig)))
  # note the negative because we are minimizing.
  return(log_lik)
}
par1 <- c(0,lm1$coefficients)
a1 <- optim(par=par1, fn=f_tobit, y=y, X=x)
exp(a1$par[1])
0.9980347
a1$par[2:length(a1$par)]
(Intercept) x
2.010674 -3.012778
The model estimates the three parameters pretty accurately, the shape parameter,
Censoring Due To Minimum Wages
One of the standard questions in labor economics is determining the effect on earnings of an individual’s characteristics, like education, age, race, and gender. For example, we may be interested in whether women are paid less than men for the same work. A concern is that our wage data may be censored. For example, in July 2009 the federal government increased the federal minimum wage to $7.25. That is, it was generally illegal to pay people less than $7.25.76
The section uses information on wages from 2010 and compares OLS to Tobit estimates.
National Longitudinal Survey of Youth 1997
The National Longitudinal Survey of Youth 1997 (NLSY97) is a popular data set for applied microeconometrics and labor economics. The data follows about 8,000 individuals across 18 years beginning in 1997. At the start of the data collection, the individuals are teenagers or in their early 20s.77
x <- read.csv("NLSY97_min.csv", as.is = TRUE)
x$wage <-
  ifelse(x$CVC_HOURS_WK_YR_ALL.10 > 0 & x$YINC.1700 > 0,
         x$YINC.1700/x$CVC_HOURS_WK_YR_ALL.10, NA)
x$wage <- as.numeric(x$wage)
x$wage <- ifelse(x$wage < quantile(x$wage, 0.90,
                                   na.rm = TRUE), x$wage, NA)
# topcode at the 90th percentile
# this is done to remove unreasonably high measures of wages.
x$lwage <- ifelse(x$wage > 7.25, log(x$wage), 0)
# note the 0s are used as an indicator in the Tobit function.
x$ed <- ifelse(x$CV_HGC_EVER_EDT > 0 &
                 x$CV_HGC_EVER_EDT < 25, x$CV_HGC_EVER_EDT, NA)
# removes very high values of education.
x$exp <- 2010 - x$KEY.BDATE_Y - x$ed - 6
x$exp2 <- (x$exp^2)/100
# division makes the reported results nicer.
x$female <- x$KEY.SEX==2
x$black <- x$KEY.RACE_ETHNICITY==1
index_na <- is.na(rowSums(cbind(x$lwage,x$wage,x$ed,x$exp,
                                x$black,x$female)))==0
x1 <- x[index_na,]
To illustrate censoring we can look at wage rates for the individuals in the NLSY97. Their average wage rate is calculated as their total income divided by the total number of hours worked for the year. In this case, income and wages are measured in 2010. The code uses a censored variable in which log wage is set to 0 unless the wage rate is above $7.25 per hour.
The histogram in Figure 20 gives some indication of the issue. We see that the distribution of wages is not symmetric and there seems to be a higher than expected frequency just above the federal minimum wage. What is perhaps surprising is that a relatively large number of individuals report average wages below the minimum. It is unclear why, but it may be due to reporting errors or cases where the individual is not subject to the law.
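Figure 20 is not reproduced here; a minimal sketch of a histogram of the average wage rate, with a line marking the federal minimum wage, might look like the following (the bin count and labels are assumptions).

hist(x1$wage, breaks = 50,
     xlab = "Average wage rate ($ per hour)", main = "")
abline(v = 7.25, lwd = 2, lty = 2)  # 2009 federal minimum wage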
Tobit Estimates
Even though it may provide inaccurate results, it is always useful to run OLS. Here we use it as a comparison to see if the censoring affects our measurement of how education, gender, and race affect wages. Following the argument presented above, it is useful to also see the probit estimates. The probit accounts for some of the impact of censoring although it throws away a lot of information.
lm1 <- lm(lwage ~ ed + exp + exp2 + female + black, data=x1)
glm1 <- glm(lwage > 0 ~ ed + exp + exp2 + female + black,
            data=x1, family=binomial(link="probit"))
Comparing OLS to probit and Tobit estimates we can see how the censoring affects standard estimates of returns to experience, schooling, race and gender.
par1 <- c(0,glm1$coefficients)
a1 <- optim(par=par1, fn=f_tobit, y=x1$lwage,
            X=cbind(x1$ed,x1$exp,x1$exp2,x1$female,x1$black),
            control = list(trace=0,maxit=10000))
res_tab <- cbind(lm1$coefficients,glm1$coefficients,
                 a1$par[2:length(a1$par)])
res_tab <- rbind(res_tab,c(1,1,exp(a1$par[1])))
colnames(res_tab) <- c("OLS Est","Probit Est","Tobit Est")
rownames(res_tab) <- c("intercept","ed","exp","exp sq",
                       "female","black","sigma")
Table 15 presents a comparison of wage rate regressions using OLS, the probit, and the Tobit. Note that the values for
|           | OLS Est    | Probit Est | Tobit Est  |
|-----------|------------|------------|------------|
| intercept | -0.1285217 | -1.5149073 | -0.6399135 |
| ed        | 0.1359129  | 0.1448531  | 0.1633185  |
| exp       | 0.1173622  | 0.1223768  | 0.1209332  |
| exp sq    | -0.4039422 | -0.4198480 | -0.3505662 |
| female    | -0.1681297 | -0.1956350 | -0.1920443 |
| black     | -0.2836382 | -0.3134971 | -0.3339733 |
| sigma     | 1.0000000  | 1.0000000  | 1.2218445  |
Modeling Selected Data
The Tobit model is about censoring. A close cousin of censoring is selection. In both cases we can think of the problem as having missing data. The difference is the reason for the missingness. In censoring, the data is missing because the outcome variable of interest is above or below some threshold. With selection, the data is missing because the individuals in the data have made a choice or have had some choice made for them.
Consider the problem of estimating returns to schooling for women. Compared to males, a large share of the female population does not earn wages. We have a selection problem if this choice is determined by how much these women would have earned had they chosen to work. The observed distribution of wages for women may be systematically different from the unobserved distribution of wage offers. This difference may lead us to underestimate the gender wage gap.
A Selection Model
Consider a model similar to the one presented above. There is some latent outcome (
We can think of
Note if
Simulation of a Selection Model
Consider a simulated data set that is similar to the data created for the Tobit model. One difference is the
require(mvtnorm)
set.seed(123456789)
N <- 100
a <- 6
b <- -3
c <- 4
d <- -5
x <- runif(N)
z <- runif(N)
mu <- c(0,0)
sig <- rbind(c(1,0.5),c(0.5,1))
u <- rmvnorm(N, mean=mu, sigma=sig)
# creates a matrix with two correlated random variables.
y <- ifelse(c + d*z + u[,1] > 0, a + b*x + u[,2], 0)
x1 <- x
y1 <- y
lm1 <- lm(y1 ~ x1)
x1 <- x[z < 0.6]
y1 <- y[z < 0.6]
lm2 <- lm(y1 ~ x1)
y1 <- y > 0
glm1 <- glm(y1 ~ z, family = binomial(link = "probit"))
|             | (1)      | (2)     | (3)     |
|-------------|----------|---------|---------|
| (Intercept) | 4.397    | 6.039   | 4.025   |
|             | (0.426)  | (0.301) | (0.869) |
| x1          | -1.314   | -3.358  |         |
|             | (0.741)  | (0.517) |         |
| z           |          |         | -4.905  |
|             |          |         | (1.150) |
| Num.Obs.    | 100      | 64      | 100     |
| R2          | 0.031    | 0.405   |         |
| R2 Adj.     | 0.021    | 0.395   |         |
| AIC         | 440.9    | 205.6   | 58.2    |
| BIC         | 448.7    | 212.0   | 63.4    |
| Log.Lik.    | -217.458 | -99.777 | -27.112 |
| RMSE        | 2.13     | 1.15    | 0.30    |
Table 16 presents OLS and probit regression results for the simulated data. The outcome variable
Heckman Model
In algebra, the selection model is similar to the Tobit model.
where
In the Heckman model, the "decision" to work or not depends on a different set of observed and unobserved characteristics represented by
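The displayed equations do not survive in this text; a sketch of the selection structure, in the notation of the simulation above and the estimator below, is

$$y_i = \begin{cases} X_i\beta + u_{2i} & \text{if } Z_i\gamma + u_{1i} > 0 \\ 0 & \text{otherwise,} \end{cases}$$

where the pair $(u_{1i}, u_{2i})$ is assumed to be bivariate standard normal with correlation $\rho$.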
Heckman Estimator
As with the Tobit model, we can use maximum likelihood to estimate the model. The likelihood of observing the censored value
The likelihood of observing a censored value of
The likelihood of a strictly positive value of
Unfortunately, it is rather gruesome to write down the likelihood in this way. Therefore, we take advantage of Bayes’s rule and write down the likelihood of
where
We want to find the parameters (
Note that I have taken advantage of the fact that normals are symmetric.
To make this presentation a little less messy, I assumed that the distribution of unobserved terms is a bivariate standard normal. That is, the variance terms are 1. As with the probit, the variance of
Heckman Estimator in R
The code for the Heckman estimator is very similar to the code for the Tobit estimator. The difference is that this estimator allows for a separate set of characteristics that determine whether or not the outcome variable is censored.
f_heckman <- function(par, y, X_in, Z_in = X_in) {
  # defaults to Z_in = X_in
  X <- cbind(1,X_in)
  Z <- cbind(1,Z_in)
  is0 <- y == 0 # indicator function
  rho <- exp(par[1])/(1 + exp(par[1]))
  # this is the sigmoid function
  # Note that in actual fact rho is between -1 and 1.
  beta <- par[2:(1+dim(X)[2])]
  gamma <- par[(2 + dim(X)[2]):length(par)]
  Xb <- X%*%beta
  Zg <- Z%*%gamma
  Zg_adj <- (Zg + rho*(y - Xb))/((1 - rho^2)^(.5))
  log_lik <- is0*log(pnorm(-Zg)) +
    (1 - is0)*(log(pnorm(Zg_adj)) +
                 log(dnorm(y - Xb)))
  return(-sum(log_lik))
}
par1 <- c(0,lm1$coefficients,glm1$coefficients)
a2 <- optim(par=par1, fn=f_heckman, y=y, X=x, Z=z)
# rho
exp(a2$par[1])/(1 + exp(a2$par[1]))
0.5357194
# beta
a2$par[2:3]
(Intercept) x1
5.831804 -2.656908
# gamma
a2$par[4:5]
(Intercept) z
4.406743 -5.488539
The Heckman estimator does a pretty good job of estimating the true parameters. The true
Analyzing the Gender Wage Gap
We can analyze the difference in wages between men and women using the NLSY97. Here we use the data from 2007 in order to minimize issues due to censoring. The analysis is also limited to “full-time” workers, those working more than an average of 35 hours per week.81 A Heckman model is used to adjust for selection.
NLSY97 Data
The data is from NLSY97 with hours and income from 2007.82 The analysis is limited to individuals working more than 1750 hours per year.
x <- read.csv("NLSY97_gender_book.csv")
x$wage <-
  ifelse(x$CVC_HOURS_WK_YR_ALL.07_XRND > 0,
         x$YINC.1700_2007/x$CVC_HOURS_WK_YR_ALL.07_XRND, 0)
x$lwage <- ifelse(x$wage > 1, log(x$wage), 0)
x$fulltime <- x$CVC_HOURS_WK_YR_ALL.07_XRND > 1750
x$lftwage <- ifelse(x$lwage > 0 & x$fulltime, x$lwage, 0)
x$female <- x$KEY_SEX_1997==2
x$black <- x$KEY_RACE_ETHNICITY_1997==1
x$age <- 2007 - x$KEY_BDATE_Y_1997
x$age2 <- x$age^2
x$college <- x$CV_HIGHEST_DEGREE_0708_2007 >= 3
x$south <- x$CV_CENSUS_REGION_2007==3
x$urban <- x$CV_URBAN.RURAL_2007==1
x$msa <- x$CV_MSA_2007 > 1 & x$CV_MSA_2007 < 5
x$married <- x$CV_MARSTAT_COLLAPSED_2007==2
x$children <- x$CV_BIO_CHILD_HH_2007 > 0
index_na <- is.na(rowSums(cbind(x$black,x$lftwage,x$age,
                                x$msa,x$urban,x$south,
                                x$college,x$female,
                                x$married,x$children)))==0
x1 <- x[index_na,]
x1_f <- x1[x1$female,]
x1_m <- x1[!x1$female,]
# split by gender
Figure 21 presents the log densities of wage rates for male and female full-time workers. It shows that female wages are shifted down, particularly at the low and high ends. The question from the analysis above is whether the true difference is even larger: is the estimated distribution of wages for females biased due to selection? Asked another way, is the distribution of observed female wages shifted up relative to the distribution of female wage offers?
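Figure 21 is not reproduced here; a minimal sketch of how the two densities might be drawn from the split samples created above is the following (restricting to positive log wages is an assumption).

plot(density(x1_m$lftwage[x1_m$lftwage > 0]), lwd = 3, lty = 1,
     xlab = "log wage (full-time workers)", main = "")
lines(density(x1_f$lftwage[x1_f$lftwage > 0]), lwd = 3, lty = 2)
legend("topright", c("Male", "Female"), lwd = 3, lty = c(1, 2))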
Part of the explanation for the difference may be differences in education level, experience or location. We can include these additional variables in the analysis.
The regressions presented in Table 17 suggest that there is a substantial female wage gap. This is shown in two ways. First, compare Model (1) to Model (2): these are identical regressions except that Model (1) uses only male wage earners and Model (2) uses only female wage earners. The regressions are similar except for the intercept term; Model (2) is substantially shifted down relative to Model (1). Second, Model (3) simply adds a dummy variable for female. The negative coefficient on the dummy also suggests a substantial gender wage gap, even accounting for education and experience.
lm1 <- lm(lftwage ~ age + age2 + black + college +
            south + msa, data=x1_m)
lm2 <- lm(lftwage ~ age + age2 + black + college +
            south + msa, data=x1_f)
lm3 <- lm(lftwage ~ female + age + age2 + black + college +
            south + msa, data=x1)
|             | (1)       | (2)       | (3)       |
|-------------|-----------|-----------|-----------|
| (Intercept) | -12.470   | -5.341    | -3.297    |
|             | (14.382)  | (10.084)  | (8.412)   |
| age         | 0.990     | 0.430     | 0.292     |
|             | (1.148)   | (0.807)   | (0.672)   |
| age2        | -0.017    | -0.008    | -0.004    |
|             | (0.023)   | (0.016)   | (0.013)   |
| blackTRUE   | -0.660    | 0.020     | -0.255    |
|             | (0.083)   | (0.057)   | (0.048)   |
| collegeTRUE | 0.547     | 0.620     | 0.596     |
|             | (0.133)   | (0.082)   | (0.072)   |
| southTRUE   | 0.053     | -0.025    | -0.012    |
|             | (0.080)   | (0.055)   | (0.047)   |
| msaTRUE     | 0.030     | 0.038     | 0.048     |
|             | (0.143)   | (0.120)   | (0.093)   |
| femaleTRUE  |           |           | -0.660    |
|             |           |           | (0.045)   |
| Num.Obs.    | 1076      | 1555      | 2631      |
| R2          | 0.097     | 0.042     | 0.115     |
| R2 Adj.     | 0.092     | 0.038     | 0.113     |
| AIC         | 3497.8    | 4563.6    | 8134.8    |
| BIC         | 3537.6    | 4606.4    | 8187.7    |
| Log.Lik.    | -1740.898 | -2273.810 | -4058.398 |
| RMSE        | 1.22      | 1.04      | 1.13      |
OLS estimates of log wages for full-time workers in 2007 from NLSY97. Model (1) is for male workers. Model (2) is for female workers. Model (3) includes both genders but a dummy variable for female.
Choosing To Work
While the previous analysis suggests a substantial gender wage gap, that gap may be underestimated. This would occur if the women observed in the work force were the ones more likely to have received higher wage offers. The first step to estimate the distribution of wage offers is to estimate the “choice to work.” I put choice in quotations because I am not assuming that all women are actually making a choice. I am assuming that whether a woman works or not depends in part on what the woman expects to earn.
glm1 <- glm(lftwage > 0 ~ college, data = x1_f)
glm2 <- glm(lftwage > 0 ~ college + south, data = x1_f)
glm3 <- glm(lftwage > 0 ~ college + south + married +
              children, data = x1_f)
|              | (1)      | (2)      | (3)      |
|--------------|----------|----------|----------|
| (Intercept)  | 0.284    | 0.278    | 0.157    |
|              | (0.012)  | (0.016)  | (0.062)  |
| collegeTRUE  | 0.219    | 0.218    | 0.215    |
|              | (0.036)  | (0.036)  | (0.036)  |
| southTRUE    |          | 0.014    | 0.015    |
|              |          | (0.023)  | (0.023)  |
| marriedTRUE  |          |          | -0.037   |
|              |          |          | (0.089)  |
| childrenTRUE |          |          | 0.126    |
|              |          |          | (0.062)  |
| Num.Obs.     | 1555     | 1555     | 1555     |
| R2           | -0.098   | -0.099   | -0.097   |
| AIC          | 1984.9   | 1986.5   | 1986.2   |
| BIC          | 2000.9   | 2007.9   | 2018.3   |
| Log.Lik.     | -989.444 | -989.270 | -987.115 |
| RMSE         | 0.46     | 0.46     | 0.46     |
Probit estimates of the “choice” to work for females in NLSY97.
Table 18 shows that having a college education substantially increases the likelihood that a woman works full-time. The analysis also suggests that being married and having children affect the likelihood of working, although the coefficient on married is not statistically different from zero and the coefficient on children is positive.
Heckman Estimates of Gender Gap
In order to use the Heckman model, the outcome variable is normalized so that it has a standard normal distribution. The model assumes that income for men and women is determined by experience, education and location. For men it is assumed that if we don’t observe a man working full-time, it is something idiosyncratic about the man. In contrast, we assume that women are selected to work full-time based on education, location, number of children and whether they are married or not.
x1$lftwage_norm <-
  (x1$lftwage - mean(x1$lftwage))/sd(x1$lftwage)
x1$lftwage_norm3 <- ifelse(x1$lftwage==0, 0, x1$lftwage_norm)
y1 <- x1$lftwage_norm3
X1 <- cbind(x1$age, x1$age2, x1$black, x1$college,
            x1$south, x1$msa)
Z1 <- cbind(x1$college, x1$south, x1$married, x1$children)
y1f <- y1[x1$female]
X1f <- X1[x1$female,]
Z1f <- Z1[x1$female,]
par1 <- c(0, lm2$coefficients, glm3$coefficients)
a2 <- optim(par=par1, fn=f_heckman, y=y1f, X=X1f, Z=Z1f,
            control = list(trace=0, maxit=10000))
y_adj <- cbind(1,X1f)%*%a2$par[2:8] + rnorm(dim(X1f)[1])
X1m <- X1[!x1$female,]
y_adj_m <- cbind(1,X1m)%*%a2$par[2:8] + rnorm(dim(X1m)[1])
Figure 22 presents density estimates after accounting for selection into full-time work. It shows that the distribution of female wage offers is shifted down much further than the standard estimate suggests. To account for observed differences between men and women, the chart also presents a wage density for women with their observed characteristics set to the same values as men's. Accounting for these observed differences has little effect.
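A minimal sketch of a Figure 22-style comparison, using the adjusted wage offers y_adj and y_adj_m simulated in the chunk above (again, illustrative code rather than the book's):
# Sketch: densities of simulated (normalized) wage offers after the selection adjustment.
plot(density(y_adj_m), lwd = 2, lty = 1,
     main = "Normalized log wage offers", xlab = "normalized log wage")
lines(density(y_adj), lwd = 2, lty = 2)
legend("topright", c("Men", "Women (selection adjusted)"), lwd = 2, lty = c(1, 2))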
Back to School Returns
This section returns to the question of whether an additional year of schooling increases income. This approach is similar to the IV approach presented in Chapter 3. It shows that the effect of college is heterogeneous and not always positive.
NLSM Data
We can use the NLSM data used by Card (1995) to compare the Heckman model with the IV approach.83 In this analysis, a person is assumed to have gone to college if they have more than 12 years of education.
<- read.csv("nls.csv",as.is=TRUE)
x $lwage76 <- as.numeric(x$lwage76) x
Warning: NAs introduced by coercion
<- x[is.na(x$lwage76)==0,]
x1 $lwage76_norm <-
x1$lwage76 - mean(x1$lwage76))/sd(x1$lwage76)
(x1# norm log wages for Heckman model
$exp <- x1$age76 - x1$ed76 - 6 # working years after school
x1$exp2 <- (x1$exp^2)/100
x1$college <- x1$ed76 > 12 x1
College vs. No College
We can compare the differences in income for people who attended college and those who did not.
lm1c <- lm(lwage76_norm ~ exp + exp2 + black + reg76r +
             smsa76r, data=x1[x1$college,])
lm1nc <- lm(lwage76_norm ~ exp + exp2 + black + reg76r +
              smsa76r, data=x1[!x1$college,])
|             | (1)       | (2)       |
|-------------|-----------|-----------|
| (Intercept) | -0.785    | -1.231    |
|             | (0.108)   | (0.209)   |
| exp         | 0.175     | 0.182     |
|             | (0.028)   | (0.036)   |
| exp2        | -0.603    | -0.628    |
|             | (0.184)   | (0.149)   |
| black       | -0.372    | -0.565    |
|             | (0.067)   | (0.051)   |
| reg76r      | -0.219    | -0.469    |
|             | (0.050)   | (0.050)   |
| smsa76r     | 0.379     | 0.396     |
|             | (0.056)   | (0.048)   |
| Num.Obs.    | 1521      | 1489      |
| R2          | 0.152     | 0.248     |
| R2 Adj.     | 0.150     | 0.245     |
| AIC         | 3970.8    | 3773.8    |
| BIC         | 4008.1    | 3811.0    |
| Log.Lik.    | -1978.380 | -1879.916 |
| RMSE        | 0.89      | 0.86      |
Table 19 presents OLS estimates for two groups: males who attend college and males who do not. Note that this model is a little different from the results presented in earlier chapters because I have normalized log wages for use in the Heckman model. Looking at the intercept term, we see that the distribution of wages for males who attend college is shifted up relative to the distribution of wages for males who do not.
Choosing College
Again we don’t mean that a person is literally choosing whether or not to attend college. We mean that there are unobserved characteristics of the individual that are related to both attending college and earning income once graduated.
# Probit Estimate
glm1 <- glm(college ~ nearc4,
            family = binomial(link = "probit"), data=x1)
glm2 <- glm(college ~ nearc4 + momdad14,
            family = binomial(link = "probit"), data=x1)
glm3 <- glm(college ~ nearc4 + momdad14 + black + smsa66r,
            family = binomial(link = "probit"), data=x1)
glm4 <- glm(college ~ nearc4 + momdad14 + black + smsa66r
            + reg662 + reg663 + reg664 + reg665 + reg666
            + reg667 + reg668 + reg669,
            family = binomial(link = "probit"), data=x1)
|             | (1)       | (2)       | (3)       | (4)       |
|-------------|-----------|-----------|-----------|-----------|
| (Intercept) | -0.196    | -0.640    | -0.442    | -0.745    |
|             | (0.041)   | (0.062)   | (0.072)   | (0.130)   |
| nearc4      | 0.307     | 0.316     | 0.223     | 0.237     |
|             | (0.049)   | (0.050)   | (0.056)   | (0.058)   |
| momdad14    |           | 0.552     | 0.411     | 0.420     |
|             |           | (0.058)   | (0.060)   | (0.061)   |
| black       |           |           | -0.502    | -0.466    |
|             |           |           | (0.059)   | (0.064)   |
| smsa66r     |           |           | 0.139     | 0.130     |
|             |           |           | (0.055)   | (0.057)   |
| reg662      |           |           |           | 0.233     |
|             |           |           |           | (0.122)   |
| reg663      |           |           |           | 0.277     |
|             |           |           |           | (0.120)   |
| reg664      |           |           |           | 0.355     |
|             |           |           |           | (0.142)   |
| reg665      |           |           |           | 0.198     |
|             |           |           |           | (0.122)   |
| reg666      |           |           |           | 0.327     |
|             |           |           |           | (0.137)   |
| reg667      |           |           |           | 0.266     |
|             |           |           |           | (0.131)   |
| reg668      |           |           |           | 0.767     |
|             |           |           |           | (0.182)   |
| reg669      |           |           |           | 0.533     |
|             |           |           |           | (0.133)   |
| Num.Obs.    | 3010      | 3010      | 3010      | 3010      |
| AIC         | 4137.5    | 4046.4    | 3965.1    | 3948.4    |
| BIC         | 4149.5    | 4064.4    | 3995.2    | 4026.5    |
| Log.Lik.    | -2066.737 | -2020.206 | -1977.571 | -1961.176 |
| RMSE        | 0.50      | 0.49      | 0.48      | 0.48      |
Table 20 reiterates results we have seen in earlier chapters. Growing up near a 4-year college is associated with a higher probability of attending college; so is growing up with both parents. Attendance is also related to race and to where the person was living in 1966. Men living in cities were more likely to attend college.
Heckman Estimates of Returns to Schooling
The Heckman estimator is very similar to the IV estimator presented in Chapter 3.84 I have highlighted the similarity by using the same variables and instruments as in that chapter.
A probit is used to determine whether the person attends college or not.85 The probit uses some of the instrumental variables discussed in Chapter 3 such as distance to college and whether the person lived with their parents at 14. Notice that these variables are not included in the variables determining income. Also note that the college function does not include variables that will affect income in 1976. The assumption is that the college decision had to do with factors that were true for the person in 1966, while income in 1976 has to do with factors that are true for the person in 1976.
X1 <- cbind(x1$exp, x1$exp2, x1$black, x1$reg76r, x1$smsa76r)
Z1 <- cbind(x1$nearc4, x1$momdad14, x1$black, x1$smsa66r,
            x1$reg662, x1$reg663, x1$reg664, x1$reg665,
            x1$reg666, x1$reg667, x1$reg668, x1$reg669)

# College
y1c <- x1[x1$college,]$lwage76_norm
X1c <- X1[x1$college,]
Z1c <- Z1[x1$college,]

par1 <- c(0, lm1c$coefficients, glm4$coefficients)
a_coll <- optim(par=par1, fn=f_heckman, y=y1c, X=X1c, Z=Z1c,
                control = list(trace=0, maxit=100000))
# No college
y1nc <- -x1[!x1$college,]$lwage76_norm
# negative in order to account for correlation in income
# and no college.
X1nc <- X1[!x1$college,]
Z1nc <- Z1[!x1$college,]

par1 <- c(0, lm1nc$coefficients, -glm4$coefficients)
a_nocoll <- optim(par=par1, fn=f_heckman, y=y1nc, X=X1nc, Z=Z1nc,
                  control = list(trace=0, maxit=100000))
# Predicted wages
mu1 <- c(0, 0)
rho1 <- exp(a_coll$par[1])/(1 + exp(a_coll$par[1]))
sig1 <- cbind(c(1, rho1), c(rho1, 1))
u1 <- rmvnorm(dim(x1)[1], mu1, sig1) # rmvnorm() is from the mvtnorm package
beta_c <- a_coll$par[2:length(lm1$coefficients+1)]
beta_nc <- a_nocoll$par[2:length(lm1$coefficients+1)]
x1$coll_y <- cbind(1, X1)%*%beta_c + u1[,1]
x1$nocoll_y <- -cbind(1, X1)%*%beta_nc + u1[,2]
This is not a “full” Heckman selection model. I have estimated two separate models, one on choosing college and the other on not choosing college. Note that I have used negatives to account for the “not” decision. This is done in order to simplify the exposition. But there are some costs, including the fact that there are two different estimates of the correlation term. In the results presented in Figure 23, the correlation coefficient from the first model is used. Can you write down and estimate a full model?
Effect of College
The Heckman model suggests that the effect of college is heterogeneous. While most people benefit from college, not all do. Figure 23 presents a plot of the predicted wage for each person in the sample, for both the case where they went to college and the case where they did not go to college. For those sitting above the 45 degree line, their predicted wage is higher going to college than not going to college. For those below the 45 degree line, it is the opposite. Their predicted wages are higher from not attending college. Given all the assumptions, it is not clear exactly what we should take from this, but it is interesting that so much weight is above the 45 degree line. Remember that only 50% of the individuals in the sample actually went to college. The model assumes that people are "choosing" not to go to college based on beliefs about what they would earn if they did. The model is estimated with two Heckman regressions rather than a full "switching" model, which would allow for a richer error structure.
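A minimal sketch of a Figure 23-style plot, using the coll_y and nocoll_y columns constructed above (illustrative, not the book's plotting code):
# Sketch: predicted wage with college vs. without, with the 45 degree line.
plot(x1$nocoll_y, x1$coll_y, pch = 19, col = "gray",
     xlab = "predicted wage: no college", ylab = "predicted wage: college")
abline(a = 0, b = 1, lwd = 2) # 45 degree line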
We can compare this approach to the estimates of the ATE and LATE using OLS and IV. The estimated average treatment effect is 0.14, which is substantially higher than the OLS estimate of 0.075. It is similar to the IV and LATE estimates. This should not be too surprising given how closely related the Heckman model is to the IV model.
mean(x1$coll_y - x1$nocoll_y)/4 # divide by 4 (years of college) for a per-year effect
[1] 0.1276973
Discussion and Further Reading
The chapter introduces the Tobit model, which can be used to estimate models with censored data. It also introduces the Heckman selection model, which is used to estimate the gender wage gap and returns to schooling. When using limited dependent variable models, including the ones discussed in this chapter and the previous chapter, I often go back to the appropriate chapters of Greene (2000).
This chapter showed how the economic assumption of revealed preference can be used for a broader range of problems than just demand estimation. In particular, in labor economics we often have data in which we observe both a decision and the outcome of the decision. The Roy model considers the situation where an individual is choosing between two employment sectors. We observe the choice and the wages in the chosen sector. Heckman and Honore (1990) show that with enough exogenous variation in wages and revealed preference, the joint distribution of wages across both sectors is identified. Recent work considers this model under even weaker assumptions (Henry, Mourifie, and Meango 2020).
The chapter re-analyzes the returns to schooling data used in Card (1995). It uses a version of the Heckman model to find average returns similar to the IV estimates. The model also estimates the joint distribution of returns and shows that for some individuals the expected income from attending college may be less than the expected income from not attending college.
Demand Estimation with IV
Introduction
When I began teaching microeconometrics a few years ago I read up on the textbook treatment of demand estimation. There was a lot of discussion about estimating logits and about McFadden’s model. It looked a lot like how I was taught demand estimation 25 years before. It looked nothing like the demand estimation that I have been doing for the last 20 years. Demand estimation is integral to antitrust analysis. It is an important part of marketing and business strategy. But little of it actually involves estimating a logit. Modern demand estimation combines the insights of instrumental variable estimation and game theory.
My field, industrial organization, changed dramatically in the 1970s and 1980s as game theory became the major tool of analysis. A field that had been dominated by industry studies quickly became the center of economic theory and the development of game theoretic analysis in economics. By the time I got to grad school in the mid 1990s, the field was changing again. New people were looking to combine game theory with empirical analysis. People like Susan Athey, Harry Paarsch and Phil Haile started using game theory to analyze timber auctions and oil auctions. Others like Steve Berry and Ariel Pakes were taking game theory to demand estimation.
Game theory allows the econometrician to make inferences from the data by theoretically accounting for the way individuals make decisions and interact with each other. As with discrete choice models and selection models, we can use economic theory to help uncover unobserved characteristics. Again, we assume that economic agents are optimizing. The difference here is that we allow economic agents to explicitly interact and we attempt to model that interaction. Accounting for such interactions may be important when the number of agents is small and it is reasonable to believe that these individuals do in fact account for each others’ actions.
This chapter presents a standard model of competition, the Hotelling model. It presents two IV methods for estimating demand from simulated data generated by the Hotelling model. It introduces the idea of using both cost shifters and demand shifters as instruments. The chapter takes these tools to the question of estimating the value of Apple Cinnamon Cheerios.
Modeling Competition
In the late 1920s, the economist and statistician Harold Hotelling developed a model of how firms compete. Hotelling wasn't interested in "perfect competition" and its assumption of many firms competing in a market with homogeneous products. Hotelling was interested in what happens when the number of firms is small and the products are similar but not the same. Hotelling (1929) was responding to an analysis written by the French mathematician Joseph Bertrand some 80 years earlier. Bertrand, in turn, was responding to another French mathematician, Antoine Cournot, whose initial analysis was published in the 1830s.
All three were interested in what happens when two firms compete. In Cournot's model, the two firms make homogeneous products. They choose how much to produce and then the market determines the price. Cournot showed that this model leads to much higher prices than were predicted by the standard (at the time) model of competition. Bertrand wasn't convinced. Bertrand considered the same case but had the firms choose prices instead.
Imagine two hotdog stands next to each other. They both charge $2.00 a hotdog. Then one day, the left stand decides to charge $1.90 a hotdog. What do you think will happen? When people see that the left stand is charging $1.90 and the right one is charging $2.00, they are likely to buy from the left one. Seeing this, the right hotdog stand reacts and sets her price at $1.80 a hotdog. Seeing the change, everyone switches to the right stand. Bertrand argued that this process will lead prices to be bid down to marginal cost. That is, with two firms the model predicts that prices will be the same as the standard model.
Hotelling agreed that modeling the firms as choosing price seemed reasonable, but was unconvinced by Bertrand's argument. Hotelling suggested a slight change to the model. Instead of the two hotdog stands being next to each other, he placed them at each end of the street. Hotelling pointed out that in this case even if the left stand was 10 cents cheaper, not everyone would switch away from the right stand. Some people have their office closer to the right stand and are unwilling to walk to the end of the street just to save a few cents on their hotdog. Hotelling showed that in this model prices were again much higher than for the standard model.
When I think about competition, it is Hotelling’s model that I have in my head.
Competition is a Game
Hotelling, Cournot and Bertrand all modeled competition as a game. A game is a formal mathematical object which has three parts: players, strategies and payoffs. In the game considered here, the players are the two firms. The strategies are the actions that the players can take given the information available to them. The payoffs are the profits that the firms make. Note that in Cournot's game, the strategy is the quantity that the firm chooses to sell. In Bertrand's game, it is the price that the firm chooses to sell at. Cournot's model is a reasonable representation of an exchange or auction. Firms decide how much to put on the exchange and prices are determined by the exchange's mechanism. In Bertrand's game, the firms post prices and customers decide how much to purchase.
Consider the following pricing game represented in Table 21. There are two firms, Firm 1 and Firm 2. Each firm chooses a price, either a high price or a low price.
What do you think will be the outcome of the game? At which prices do both firms make the most money? If both firms choose the high price, they earn the largest profits. Will that be the outcome of the game?
No. At least it won’t be the outcome if the outcome is a Nash equilibrium. The outcomes of the games described by Hotelling, Bertrand and Cournot are all Nash equilibrium. Interestingly, John Nash, didn’t describe the equilibrium until many years later, over 100 years later in the case of Bertrand and Cournot. Even the definition of a game didn’t come into existence until the work of mathematicians like John von Neumann in the early 20th century.
A Nash equilibrium is where each player's strategy is optimal given the strategies chosen by the other players. Here, a Nash equilibrium is where Firm 1's price is optimal given Firm 2's price, and Firm 2's price is optimal given Firm 1's price. It is not a Nash equilibrium for both firms to choose the high price, because each firm would then prefer to undercut its rival.
The Nash equilibrium is for both firms to choose the low price.
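Since Table 21's exact payoff numbers are not shown here, the following sketch uses a hypothetical symmetric payoff matrix just to illustrate how a pure-strategy Nash equilibrium of a 2x2 pricing game can be found by checking best responses.
# Hypothetical profits for Firm 1; rows are Firm 1's price, columns are Firm 2's price.
pi1 <- matrix(c(3, 4, 1, 2), 2, 2,
              dimnames = list(c("High", "Low"), c("High", "Low")))
pi2 <- t(pi1) # symmetric game, so Firm 2's profits are the transpose
for (i in 1:2) {
  for (j in 1:2) {
    br1 <- pi1[i, j] == max(pi1[, j]) # is row i a best response to column j?
    br2 <- pi2[i, j] == max(pi2[i, ]) # is column j a best response to row i?
    if (br1 & br2) cat("Nash equilibrium:", rownames(pi1)[i], "/", colnames(pi1)[j], "\n")
  }
}
With these hypothetical numbers, the only profile that survives both checks is Low/Low, matching the undercutting logic above.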
Hotelling’s Line

Figure 24 represents Hotelling's game. There are two firms, L and R, located at opposite ends of a line.
Let Firm
where
where
Everyone located between 0 and the location of the indifferent consumer purchases from Firm L; everyone located beyond that point purchases from Firm R.
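As a reference point, the demand for Firm L implied by this setup, in the form used by the simulation code below, is
$$x_L = \frac{v + 1 - \beta p_L + \beta p_R}{2},$$
where $v$ is the quality (vertical) measure for product L, $\beta$ is the consumers' price sensitivity, and $p_L$ and $p_R$ are the two prices.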
Nash Equilibrium
Given all this, what will be the price in the market? We assume that the price is determined by the Nash equilibrium. Each firm is assumed to know the strategy of the other firm. That is, each firm knows the price of its competitor. The Nash equilibrium is the pair of prices such that Firm L's price maximizes its profit given Firm R's price, and Firm R's price maximizes its profit given Firm L's price.
Firm L’s problem is as follows.
where
The solution to the optimization problem is the solution to the first order condition.
Firm R’s problem is similar.
The first order condition is as follows.
Given these first order conditions we can write down a system of equations.
Solving the system we have the Nash equilibrium prices in the market.
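As a reference point, the equilibrium prices in the form used by the simulation code below are
$$p_L = \frac{v + 1 + \beta c_R + 2\beta c_L}{3\beta}, \qquad
p_R = \frac{-v - 1 + \beta c_L + 2\beta c_R}{3\beta},$$
where $c_L$ and $c_R$ are the two firms' marginal costs.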
In equilibrium, prices are determined by the relative value of the products, the costs of the two firms, and the price sensitivity of consumers.
Estimating Demand in Hotelling’s Model
We can illustrate the modern approach to demand estimation using Hotelling's model. The section creates a simulated demand system based on the model and estimates the parameters using the IV approach.
Simulation of Hotelling Model
The simulation uses the model above to create market outcomes including prices and market shares. There are 1,000 markets. These may represent the two firms competing at different times or different places. The data is adjusted to keep market outcomes where prices are positive and shares are between 0 and 1.
set.seed(123456789)
N <- 1000 # number of markets
beta = 0.25 # demand parameter
v <- 3 + rnorm(N) # quality (vertical) measure
cL <- 3 + rnorm(N)
cR <- 9 + rnorm(N)
# costs for both firms.
pL <- (v + 1 + beta*cR + 2*beta*cL)/(3*beta)
pR <- (-v - 1 + beta*cL + 2*beta*cR)/(3*beta)
# price function for each firm.
xL <- (v + 1 - beta*pL + beta*pR)/2
# demand for firm L
index <- pL > 0 & pR > 0 & xL > 0 & xL < 1
cL1 <- cL[index]
cR1 <- cR[index]
pL1 <- pL[index]
pR1 <- pR[index]
xL1 <- xL[index]
v1 <- v[index]
# adjusting values to make things nice.
Figure 25 plots demand for product L against the relative price of the two products.
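A minimal sketch of a Figure 25-style plot using the simulated market outcomes above (illustrative code, not the book's):
# Sketch: equilibrium demand for L plotted against the relative price, with an OLS fit.
plot(pL1 - pR1, xL1, pch = 19, col = "gray",
     xlab = "relative price (pL - pR)", ylab = "demand for product L")
abline(lm(xL1 ~ I(pL1 - pR1)), lwd = 2)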
Prices are Endogenous
Probably every economist in industrial organization has run a regression like what is depicted in Figure 25. Each one has looked at the results and has felt their heart sink because everything that they knew about economics was wrong. Then they have taken a deep breath and remembered prices are endogenous.
We are interested in estimating how prices affect demand for the product. We know they do. Equation 100 explicitly states that prices cause demand to fall. The problem is that we did not plot the demand curve. We plotted out a bunch of outcomes from the market. We plotted out a thousand equilibrium prices and equilibrium demand levels. Back to Econ 101, think of a thousand demand and supply crosses going through each of the points in Figure 25.
The problem in the simulation is that prices are determined endogenously. The observed market prices are the outcomes of the Nash equilibria, that is, of the choices of Firm L and Firm R.
Cost Shifters
The standard instrumental variable approach is to use cost shifters. That is, we need an instrument related to costs. For example, if we observed the firms' costs, we could use them as instruments for prices, as in the code below.
# Intent to Treat
lm1 <- lm(xL1 ~ I(cL1 - cR1))
# First stage
lm2 <- lm(I(pL1 - pR1) ~ I(cL1 - cR1))
# IV estimate
lm1$coefficients[2]/lm2$coefficients[2]
I(cL1 - cR1)
 -0.04755425
Recall from Chapter 3 that we can use graph algebra to find a simple estimator. This is the intent-to-treat regression (demand on costs) divided by the first-stage regression (price on costs). Note that the price of interest is the difference in the prices of the two firms. The instrument for this price is the difference in the costs of the two firms. The estimate is -0.048. The true value is -0.125, which you can see from Equation 100, where the coefficient on the relative price is $-\beta/2 = -0.25/2$.
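In generic notation, the ratio computed above is
$$\frac{\widehat{\mathrm{cov}}(x_L,\; c_L - c_R)}{\widehat{\mathrm{cov}}(p_L - p_R,\; c_L - c_R)},$$
that is, the reduced-form (intent-to-treat) slope divided by the first-stage slope, since both regressions share the same regressor $c_L - c_R$.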
X1 <- cbind(pL1, pR1)
Z1 <- cbind(cL1, cR1)
Y1 <- as.matrix(xL1)
tab_ols <- lm_iv(Y1, X1, Reps=500)
tab_iv <- lm_iv(Y1, X1, Z1, Reps = 500)
# using the function defined in chapter 3.
row.names(tab_iv) <- row.names(tab_ols) <-
  c("intercept", colnames(X1))
tab_res <- cbind(tab_ols[,1:2], tab_iv[,1:2])
colnames(tab_res) <-
  c("OLS coef","OLS sd","IV coef","IV sd")
library(knitr)
kable(tab_res)
We can also use the matrix algebra method described in Chapter 3. Here we separate out the two prices and don't use the information that they have the same coefficient. We can confirm the results of Figure 25; the OLS relationship between price and demand has the wrong sign. The IV estimate is again on the low side at around -0.04.
Demand Shifters
The analysis above is the standard approach to IV in demand analysis. However, we are not limited to using cost shifters. In the simulation there is a third exogenous variable, the quality measure $v$.
Looking at the Nash equilibrium (Equation 106), we see that the coefficient from regressing Firm L's price on $v$ identifies $1/(3\beta)$, so we can back out $\beta$ and therefore the slope of demand, $-\beta/2$.
lm1 <- lm(pL1 ~ v1)
b1 <- lm1$coefficients[2]
beta <- 1/(3*b1)
-beta/2
        v1
-0.1162436
The idea of using the model in this way is at the heart of much of modern industrial organization. The idea is to use standard statistical techniques to estimate parameters of the data, and then use the model to relate those data parameters to the model parameters of interest. The idea was promoted by Guerre, Perrigne, and Vuong (2000) who use an auction model to back out the underlying valuations from observed bids.86
Berry Model of Demand
Yale industrial organization economist, Steve Berry, is probably the person most responsible for the modern approach to demand analysis. S. Berry, Levinsohn, and Pakes (1995) may be the most often used method in IO. We are not going to unpack everything in that paper. Rather, we will concentrate on the two ideas of demand inversion and taking the Nash equilibrium seriously.
We are interested in estimating demand. That is, we are interested in estimating the causal effect of price on the quantity demanded.
This section presents a general IV approach to demand estimation.
Choosing Prices
To see how marginal costs affect price, consider a simple profit-maximizing firm choosing its price.
Equation 107 shows a firm choosing prices to maximize profits, which are quantity times margin. The solution to the maximization problem can be represented as the solution to a first order condition.
where
where
A Problem with Cost Shifters
Equation 109 shows that as marginal cost changes, the price set by the firm changes, but in general the relationship between price and marginal cost is not linear.
where
From Equation 110, this assumption gives a linear relationship between price and marginal cost.
Empirical Model of Demand
Consider a case similar to that discussed in Chapter 5. We observe a large number of “markets”
For simplicity assume that there are just two products. The demand for product
where the share (
This model is similar to the logit and probit models presented in Chapter 5. More formally, the model assumes that utility is quasi-linear. In this case, the assumption allows a neat trick. If we can invert the share equation, the model becomes linear in the parameters and we can apply standard methods.
Inverting Demand
Instead of writing demand as a function of price, we can write price as a function of demand.
The inversion provides a nice linear relationship between price, the index over product characteristics and the inverse of the market share. Now that things are linear we can use standard IV methods. Unfortunately, things get a lot more complicated with more choices. It is not even clear that this “inversion” is always possible (S. T. Berry, Gandhi, and Haile 2013).
In the special case of the logit demand, things are relatively straightforward (S. Berry 1994).
We can write the share of demand for product
The log of share is a linear function of the utility index less information about all the other products.
Notice that it is possible to get rid of all the other characteristics. We can do this by inverting demand for the “outside” good. Remember in the logit this value is set to 1 and
From this we see the following representation. The log of the relative share is a linear function of the product's own characteristics and price.
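In the conventional notation of S. Berry (1994), which may differ slightly from the notation used elsewhere in this chapter, the inversion for the logit case is
$$\log(s_{j}) - \log(s_{0}) = x_{j}\beta - \alpha p_{j} + \xi_{j},$$
where $s_0$ is the share of the outside good, $x_j$ are observed characteristics, $p_j$ is the price and $\xi_j$ is the unobserved product characteristic.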
In this model, the confounding is due to the relationship between the price and the unobserved product characteristic.
Demand Shifters to Estimate Supply
If we have instruments for demand then we can rearrange the equation above.
We can write this out as an IV model presented in Chapter 3.
where
Demand Estimation from Supply Estimates
If all the assumptions hold, then the IV procedure above provides an estimate of the effect that changes in demand have on price. That is, the procedure estimates the slope of the supply function. But that is not what we are interested in. We want to estimate the demand function. We want to know how changes in price affect demand.
Can we use what we know about how prices are set by the firm to back out demand? Can we use game theory to back out the policy parameters of interest from the estimated parameters? Yes. This is the two-step estimation approach exemplified by Guerre, Perrigne, and Vuong (2000). In the first step, we estimate a standard empirical model. In the second step, we use economic theory to back out the policy parameters of interest from the estimated parameters. Here, we estimate the slope of the supply curve and use game theory to back out the slope of the demand curve.
In order to simplify things substantially, assume that there is one product per profit-maximizing firm and that the demand curve is approximately linear around the optimal price.
The left-hand side is the observed relationship between price and demand from the data. This we can estimate with the IV procedure above. The right-hand side shows the slope of the demand function.
This result has the following implications for the relationship between the estimated values and the parameter values of interest,
where
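The transformation actually used in the code later in this chapter can be summarized as follows: if the IV regression of price on the (inverted) demand index gives a slope of $\hat{b}$, then under the linear-demand, single-product-firm assumptions the implied slope of demand is
$$\beta = -\frac{1}{2\hat{b}}.$$
This is the relationship implemented below as beta <- -1/(2*tab_iv[2]).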
The Introduction of Apple Cinnamon Cheerios
Over the last thirty years we have seen some amazing new products. The Apple iPod, the Apple iPhone, the Apple iPad, but years before any of these, General Mills introduced Apple Cinnamon Cheerios. This product may be subject to the oddest debate in microeconometrics: what is the true value of Apple Cinnamon Cheerios? MIT econometrician, Jerry Hausman, found that the introduction of Apple Cinnamon Cheerios substantially increased consumer welfare. Stanford IO economist, Tim Bresnahan, claimed Hausman was mistaken.
I’m sure you are thinking, who cares? And you would be correct. I, myself, have never eaten Apple Cinnamon Cheerios. I am reliably informed that they are similar to Apple Jacks, but I have not eaten those either.
However, the debate did raise important issues regarding how the assumptions presented above are used to value new products like BART or Apple Cinnamon Cheerios (Bresnahan 1997). McFadden's approach requires that products are a sum of their attributes and that preferences for those attributes are fixed across products. We will continue to use these assumptions in order to determine the value of Apple Cinnamon Cheerios.
This section uses cereal price and sales data from a Chicagoland supermarket chain in the 1980s and 1990s.
Dominick’s Data for Cereal
Data on the demand for cereal is available from the Kilts School of Marketing at the University of Chicago. The data was collected from the Dominick’s supermarket chain and stores throughout Chicagoland. We have information on 490 UPCs (products) sold in 93 stores over 367 weeks from the late 80s to the late 90s. As in Chapter 5 we want to map the products into characteristics. As there is no characteristic information other than name and product size, the Dominick’s data is merged with nutritional information for 80 cereal products from James Eagan.88 To estimate the model we need to have one product that is the “outside good.” In this case, we assume that it is the product with the largest share of the products analyzed. Prices, characteristics and shares are created relative to the outside good.89 A more standard assumption is to classify the outside good based on a definition of the market, say “all breakfast foods.” The assumption makes the exposition a lot simpler but at the cost of very strong assumptions on how individuals substitute between breakfast foods.90
<- read.csv("dominicks.csv", as.is = TRUE)
x <- x$ozprice
p $fat <- x$fat/100
x$oz <- x$oz/100
x$sodium <- x$sodium/1000
x$carbo <- x$carbo/100
x# changes the scale of the variables for presentation
<- x[,colnames(x) %in% c("sig","fat","carbo","sodium",
W "fiber", "oz","quaker","post",
"kellogg","age9", "hhlarge")]
# sig (sigma) refers to the adjusted measure of market share
# discussed above
# fat, carbo, sodium and fiber refer to cereal incredients
# oz is the size of the package (ounces)
# quaker, post and kellogg are dummies for major cereal brands
# age9 is a measure of children in the household
# hhlarge is a measure household size.
Instrument for Price of Cereal
S. Berry (1994) suggests that we need to instrument for price.
Think about variation in prices in this data. Prices vary across products, as determined by the manufacturer. Prices vary across stores, as determined by the retailer (Dominick’s). Prices vary across time due to sales and discounts. The last can be determined by the manufacturer or the retailer or both. The concern here is that we have variation across stores. Stores with higher demand for certain cereal products will also get higher prices.
Berry suggests that we need two types of instruments. We need instruments that exogenously vary and determine price through changes in costs. These are called cost shifters. They may be wages or input prices. Above it is pointed out that in theory these instruments are generally not linearly related to price. We also need instruments that vary exogenously and determine price through changes in demand. These are called demand shifters. They may be determined by demographic differences or by differences in product characteristics. The analysis here uses variation in income across stores. The assumptions are that income varies exogenously across stores and affects prices only through its effect on demand.
Z <- cbind(x$income, x$fat, x$sodium, x$fiber, x$carbo, x$oz,
           x$age9, x$hhlarge, x$quaker, x$post, x$kellogg)
colnames(Z) <- colnames(W)
tab_ols <- lm_iv(p, W, Reps=300)
tab_iv <- lm_iv(p, W, Z, Reps = 300)
# using the IV function from Chapter 3
row.names(tab_iv) <- row.names(tab_ols) <-
  c("intercept", colnames(W))
tab_res <- cbind(tab_ols[,1:2], tab_iv[,1:2])
colnames(tab_res) <-
  c("OLS coef","OLS sd","IV coef", "IV sd")
|           | OLS coef   | OLS sd    | IV coef    | IV sd      |
|-----------|------------|-----------|------------|------------|
| intercept | -0.0112440 | 0.0012072 | 0.5936581  | 2.2635711  |
| sig       | 0.0254831  | 0.0001892 | 0.3010258  | 1.0055785  |
| fat       | -0.8947611 | 0.0342263 | 3.9357655  | 20.2435595 |
| sodium    | -0.1022678 | 0.0051230 | -1.0862754 | 3.7329640  |
| fiber     | -0.0089701 | 0.0001210 | 0.0174915  | 0.0960446  |
| carbo     | -0.0702999 | 0.0084609 | 2.8229701  | 10.8086889 |
| oz        | -0.4796423 | 0.0045994 | -0.4891653 | 0.0595272  |
| age9      | 0.1625589  | 0.0106795 | -0.2610798 | 1.6172912  |
| hhlarge   | -0.0358485 | 0.0091715 | -0.2633392 | 1.0674555  |
| quaker    | -0.0291338 | 0.0009273 | -0.1722694 | 0.5237855  |
| post      | -0.0599129 | 0.0008755 | -0.0306216 | 0.0904887  |
| kellogg   | -0.0204212 | 0.0005213 | -0.2046944 | 0.6582104  |
Table 23 presents the OLS and IV estimates. The OLS estimates present the non-intuitive result that price and demand are positively correlated. The IV model assumes that changes in income are exogenous and that they determine price through changes in demand. Under standard IV assumptions, the coefficient of 0.3 on sig measures the effect of changes in demand on price, although it is not precisely estimated. As expected, it is positive, meaning that an exogenous increase in demand is associated with higher prices. This is great, but we are interested in estimating demand, not supply.
Demand for Apple Cinnamon Cheerios
The discussion above suggests that we can transform the estimates from the IV model to give the parameters of interest. That is, we can use assumptions about firm behavior to back out the slope of the demand function from our estimate of the slope of the supply function. See Equation 121.
beta <- -1/(2*tab_iv[2]) # transformation into "demand"
gamma <- -tab_iv[,1]/beta # transform gammas back.
gamma[2] <- beta # puts in the causal effect of price on
# demand.
names(gamma)[2] <- "price"
Given this transformation we can estimate the demand curve for family size Apple Cinnamon Cheerios. The following loop determines the share of each product for different relative prices of the Apple Cinnamon Cheerios.
W <- as.matrix(W)
Ts <- length(unique(x$store))
Tw <- length(unique(x$WEEK))
J <- length(unique(x$UPC))
exp_delta <- matrix(NA, Ts*Tw, J)
t <- 1
for (ts in 1:Ts) {
  store <- unique(x$store)[ts]
  for (tw in 1:Tw) {
    week <- unique(x$WEEK)[tw]
    W_temp <- W[x$WEEK==week & x$store==store,]
    exp_delta[t,] <- exp(cbind(1,W_temp)%*%gamma)
    t <- t + 1
    #print(t)
  }
}
share_est <- exp_delta/(1 + rowSums(exp_delta, na.rm = TRUE))
summary(colMeans(share_est, na.rm = TRUE))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
0.01366 0.01727 0.01885 0.01873 0.02010 0.02435
The loop above calculates the predicted market shares for each of the products given the estimated parameters.
<- "1600062760"
upc_acc # Apple cinnamon cheerios "family size"
<-
share_acc mean(x[x$UPC=="1600062760",]$share, na.rm = TRUE)
# this is calculated to determine the relative prices.
<- 20
K <- -6
min_k <- -2
max_k # range of relative prices
<- (max_k - min_k)/K
diff_k <- matrix(NA,K,2)
acc_demand <- min_k
min_t for (k in 1:K) {
<- min_t + diff_k
pr <- exp_delta
exp_delta2 $UPC==upc_acc] <-
exp_delta2[xexp(as.matrix(cbind(1,pr,W[x$UPC==upc_acc,-1]))%*%gamma)
<- matrix(NA,length(unique(x$UPC)),2)
ave_share for (i in 1:length(unique(x$UPC))) {
<- sort(unique(x$UPC))[i]
upc 1] <- upc
ave_share[i,2] <-
ave_share[i,mean(exp_delta2[x$UPC==upc],na.rm = TRUE)
#print(i)
}2] <-
ave_share[,2]/(1 + sum(ave_share[,2], na.rm = TRUE))
ave_share[,1] <- pr
acc_demand[k,2] <- ave_share[ave_share[,1]==upc_acc,2]
acc_demand[k,<- min_t + diff_k
min_t #print("k")
#print(k)
}
Figure 26 presents the demand curve for a family-size box of Apple Cinnamon Cheerios.
Value of Apple Cinnamon Cheerios
To determine the value of a family-size box of Apple Cinnamon Cheerios we can calculate the area under the demand curve (Hausman 1997).
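The triangle approximation used in the calculation below amounts to
$$CS \approx \tfrac{1}{2}\,(\bar{p} - p)\, q(p),$$
where $\bar{p}$ is the (relative) price at which demand falls to zero and $q(p)$ is demand at the observed relative price.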
If we approximate the area with a triangle, we get an annual contribution to consumer welfare of around $271,800 per year for all Dominick’s customers in Chicagoland.91 Assuming that Dominick’s had a market share of 25% and that Chicagoland accounts for about
(0.5*(-3 - acc_demand[5,1])*acc_demand[5,2]*267888011*(52/367))
[1] 271798.6
Discussion and Further Reading
My field, industrial organization, is dominated by what is called structural econometrics. That is, using game theory to estimate the parameters of the model, then using the parameter estimates to make policy predictions. Industrial organization economists believe that these ideas should be used more broadly across economics.
The chapter reconsiders the problem of demand estimation. It allows that prices are determined endogenously in the model. The chapter assumes that prices are actually the result of a game played between rival firms that sell similar products. It shows that we can use a standard IV estimator to estimate the slope of the supply function, then we can use the Nash equilibrium to determine the slope of demand.
It has become standard practice in empirical industrial organization to split the estimation problem into these two steps. The first step involves standard statistical techniques, while the second step relies on the equilibrium assumptions to back out the policy parameters of interest. Chapter 9 uses this idea to estimate the parameters of interest from auction data (Guerre, Perrigne, and Vuong 2000).
S. Berry, Levinsohn, and Pakes (1995) is perhaps the most important paper in empirical industrial organization. It presents three important ideas. It presents the idea discussed above that logit demand can be inverted which allows for use of the standard instrumental variables approach. This idea is combined with an assumption that prices are determined via equilibrium of a Nash pricing game allowing the parameters of interest to be identified. If this wasn’t enough, it adds the idea of using a flexible model called a mixed logit. All this in one paper used to estimate the demand for cars! However, without Nevo (2000) few would have understood the contributions of S. Berry, Levinsohn, and Pakes (1995). More recently S. T. Berry, Gandhi, and Haile (2013) dug into the assumptions of S. Berry, Levinsohn, and Pakes (1995) to help us understand which are important for identification and which simply simplify the model. MacKay and Miller (2019) have an excellent exposition of the various assumptions needed to estimate demand. What if firms are colluding? Fabinger and Weyl (2013) and Jaffe and Weyl (2013) are good starting places for thinking about estimating demand in that case.
Estimating Games
Introduction
In Chapter 7 we used game theory to revisit demand estimation. This chapter takes game theory to a broader set of problems in microeconometrics. Chapter 7 considered games with pure strategies, and this chapter introduces the idea of mixed strategies. This chapter also introduces a standard estimator for game theory based models, the generalized method of moments (GMM). The chapter uses the GMM estimator to revisit returns to schooling and whether or not NFL coaches punt too often on 4th down.
Mixed Strategy Nash Equilibrium
In his 2006 paper, Berkeley macroeconomist David Romer argues that coaches in the NFL do not behave rationally. In particular, they choose to punt too often on fourth down. The goal in American football is to score points. The primary method for scoring is to run the ball across your opponent's goal-line. The ball can be moved down toward the goal-line either by running with it or by throwing it to a teammate. In American football the offense has 4 tries ("downs") to move the ball ten yards.
After the third try, the coach has a decision to make. They can “go for it.” That is, try a standard play which involves running with the ball or throwing the ball. If they make the distance, the number of tries resets, which is good. If they don’t make the distance, the other team gets the ball, which is bad. Most importantly, the other team gets the ball right at the same spot on the field, which is very bad. This means the other team may have a particularly easy chance to score. As I said, very bad. Punting the ball also gives the ball to the other team. But, and this is the kicker, pun intended, the other team gets the ball where the punt lands. This means that the other team may have a particularly hard chance to score, which is not that bad.
This idea that coaches punt too often has become pervasive; there is even an episode of the CBS TV show Young Sheldon discussing the issue. But is it correct? We do not observe the information required to evaluate the coaches’ decision. The fact that coaches almost always punt means that we don’t actually observe what happens when they don’t punt. We don’t observe the counter-factual outcome. However, we can use game theory to predict it.
Coaches’ Decision Problem
The decision to punt or not on fourth down can be thought of as maximizing expected points rather than maximizing the probability of winning. Romer (2006) argues that, at least early on in the game, this is a reasonable way to model the coaches’ problem.
Here the expected points are calculated by Ron Yurko. For more information go here: https://github.com/ryurko/nflscrapR-data. The data is scraped from NFL play-by-play information (https://www.nfl.com/scores).92
Figure 27 presents the average expected points for the first quarter. Expected points are calculated using dynamic programming. Each yard line and possession combination is a "state." We can then calculate the probability of moving from any state to any other state. For some states one of the teams scores. Thus we can determine the expected point differential for each yard line. Note that when a team is inside its own 20, the expected points are actually negative. Even though that team has the ball, it is more likely that the other team will eventually get it and score.
Consider a fourth down with 5 yards to go. The ball is currently on the offense's own 45 yard line (55 yards to the goal line). There are two important numbers, the "to-go" distance of 5 yards and the position on the field, which is 55 yards to the goal line. The longer the to-go distance, the harder it is to get the distance. The further from the goal line, the easier it is for the other team to score if they get the ball.
Using Figure 27 we can calculate the expected points for the defense from each position. For illustrative purposes, assume that we expect the defense to earn 0 points if the offense punts the ball. That is, the team punts the ball from their own 45 to the other team's 20, so punting earns the offense 0 in expected points. Alternatively, the offense could try for the 5 yards. If they get it, they are at the 50 and they will earn 2 points in expectation. But what if they don't get it? If they turn the ball over on downs, the defense starts at the offense's 45. They have 45 yards until the goal line. This gives the offense about -2.3 in expected points. So whether to punt depends on the probability of getting the 5 yards.
The offense should go for it if the expected points from trying exceed the expected points from punting, that is, if $p \times 2 + (1 - p)\times(-2.3) > 0$, where $p$ is the probability of gaining the 5 yards.
If $p$ is bigger than roughly $2.3/4.3 \approx 0.53$, going for it is the better choice.
This seems straightforward. We just calculate the probability of getting 5 yards from the data. Just take all the fourth down attempts to go 5 yards and calculate the probabilities. The problem is that teams don’t actually go for it on fourth down. We can’t calculate the probability of outcomes for events that do not occur.
Romer’s solution is to use the estimated probability of going 5 yards on third down. A concern with Romer’s approach is third down and fourth down are different. Instead, the solution here is to use the third down data to estimate the game played on third down. Then take the estimated parameters and use them to predict what would occur in the game played on fourth down.
Zero-Sum Game
We can model the third-down game as a **zero-sum game**. These games date to the origins of game theory in the 1920s, when the mathematician John von Neumann was interested in modeling parlor games like cards or chess. In these games, there is a winner and a loser. What one person wins, another person loses.
Consider a relatively simple zero-sum game presented in Table 24. Remember a game has three elements; players, strategies and payoffs. The players are called Offense and Defense. Each has the choice between two strategies, Run and Pass. Payoffs are expected points (for simplicity). In the game, a pass play will get more yards and thus more expected points, but is generally higher risk. For the Defense, to play “Pass,” means to implement a pass oriented defense such as one in which the line-backers drop back into zones. This is a zero-sum game, so the payoff to one player is the negative of the payoff to the other player.
| {Off, Def} | Run     | Pass |
|------------|---------|------|
| Run        | {0, 0}  |      |
| Pass       | {2, -2} |      |
The game represented in Table 24 has “low” payoffs on the diagonal and “high” payoffs on the off-diagonal. That is, the Offense prefers to choose Run when the Defense chooses Pass. Similarly, they prefer Pass when the Defense chooses Run. Note also, that the payoff is higher from a successful pass than a successful run. This captures the fact that a pass tends to gain more yardage.
What is the solution to this game?
What is the Nash equilibrium? One possibility is that both teams play
Nash Equilibrium
The Nobel prize winning mathematician, John Nash, proved that for relatively general games, there always exists at least one Nash equilibrium. However, there may not exist a pure strategy Nash equilibrium as was discussed above. The only Nash equilibrium may be in mixed strategies.
A mixed strategy is one in which the player places probability weights on the available actions. Instead of choosing Run, the player places some probability of Run and some probability of Pass. It may not be that the player literally uses a coin or some other randomization device. Rather, the assumption is that the other players in the game do not know exactly what the player will do. The other players know only the probability weights that are placed on Pass and Run.
In our problem neither team should telegraph its play choice to the other team. They should keep the other team guessing. Think about the child’s game, Rock/Paper/Scissors. You want to keep switching between the three options. It is not optimal to always choose Rock. If the other player knows you will always choose Rock, they will always choose Paper. In football, both teams try to have fake plays and formations in order to keep the other team guessing.
Determining the mixed strategy Nash equilibrium is tricky. You must show that the player is indifferent between her actions. In a mixed strategy Nash equilibrium, each player is indifferent between Run and Pass. If this is not true, say a player prefers Run, then mixing cannot be optimal. It is optimal to choose Run.
The Nash equilibrium in mixed strategies is where the Offense chooses a probability of Pass that leaves the Defense indifferent between its two actions, and the Defense chooses a probability of Pass that leaves the Offense indifferent between its two actions.
The equilibrium is the solution to the two equations in Equation 123, that is, the pair of mixing probabilities that satisfies both indifference conditions.
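Table 24's full payoff matrix is not reproduced above, but the indifference calculation itself can be illustrated with a hypothetical zero-sum payoff matrix of the same low-diagonal, high-off-diagonal form:
# Hypothetical payoffs to the Offense (rows: Offense's play; columns: Defense's play).
A <- matrix(c(0, 2, 1, 0), 2, 2,
            dimnames = list(c("Run", "Pass"), c("Run", "Pass")))
# Probability the Offense plays Pass, chosen to make the Defense indifferent:
p_off_pass <- (A[1,1] - A[1,2])/((A[1,1] - A[1,2]) + (A[2,2] - A[2,1]))
# Probability the Defense plays Pass, chosen to make the Offense indifferent:
q_def_pass <- (A[1,1] - A[2,1])/((A[1,1] - A[2,1]) + (A[2,2] - A[1,2]))
p_off_pass # 1/3 with these hypothetical payoffs
q_def_pass # 2/3 with these hypothetical payoffs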
Third and Fourth Down Game
| {Off, Def} | Run | Pass |
|------------|-----|------|
| Run        |     |      |
| Pass       |     |      |
Consider the Third Down game represented by Table 25. This is a zero-sum game, so the payoff presented is the payoff to the Offense. It is assumed that if the Offense does not make first down, then they punt. The expected points is dependent on both the current location,
The Third Down Game and the Fourth Down game are similar. In particular, we assume that the probabilities of success conditional on the strategies of the Offense and Defense remain the same. What changes between downs is the expected points associated with the failure. In the Third Down if the Offense fails, it gets
The objective is to estimate the parameters of the game that will be used to model the policy of going for it on fourth down. Those parameters are four conditional probabilities. These are the probability of successfully getting first down conditional on the type of play, Run or Pass, chosen by both the Offense and the Defense. This set of parameters is denoted
Equilibrium Strategies
If we could observe the actions of both the Offense and the Defense, then it would be straightforward to estimate the conditional probabilities of interest. While we have very good data on the action chosen by the Offense, we don’t know what action the Defense chose. The available play by play data doesn’t provide a lot of information on the type of defense that was used in each play.
To identify the conditional probabilities, we combine the observed probabilities with the constraints from the Nash equilibrium.
In equilibrium the following equalities must hold. The probability that the Defense plays Pass (
where the value functions are as follows.
Assuming these equalities hold in the data allows us to estimate the parameters.
Rearranging we have the equilibrium strategies for the Offense and Defense.
In general, the expected points are larger when there is a mis-match in the actions of the two teams. This suggests that the strategies will always lie between 0 and 1. Can you provide an intuitive explanation of these probabilities? It is not at all obvious. For example if the value to the Offense of playing Pass versus Run increases then the probability the Offense plays Pass decreases.
Equilibrium Strategies in R
The estimator involves solving for the equilibrium strategies of the Offense and Defense. The function takes in the parameters (the conditional probabilities) and the expected points for each situation. It then calculates the equilibrium probability of playing Pass for the Offense and the Defense based on Equation 127. Note that the expected points associated with failure depend on which down. For third down it is assumed that it is determined by the expected points from a punt, while for fourth down it is the expected points from turning the ball over to the other team at that position.
# a function for determining the equilibrium strategies.
q_fun <- function(theta, Ep) {
  Ep_run <- Ep[,1]
  Ep_pass <- Ep[,2]
  Ep_fail <- Ep[,3]
  V_rr <- - (1 - theta[1])*Ep_fail + theta[1]*Ep_run
  V_rp <- - (1 - theta[2])*Ep_fail + theta[2]*Ep_run
  V_pr <- - (1 - theta[3])*Ep_fail + theta[3]*Ep_pass
  V_pp <- - (1 - theta[4])*Ep_fail + theta[4]*Ep_pass
  qo <- (V_rp - V_rr)/((V_rp - V_rr) + (V_pr - V_pp))
  qd <- (V_pr - V_rr)/((V_pr - V_rr) + (V_rp - V_pp))
  for (i in 1:length(qo)) { qo[i] <- min(max(qo[i],0),1) }
  for (i in 1:length(qd)) { qd[i] <- min(max(qd[i],0),1) }
  # forcing results to be probabilities.
  return(list(qo=qo, qd=qd))
}
Simulation of Third and Fourth Down Game
To illustrate the estimator, consider a simulation of the Third Down game presented above. We observe a similar game played 2,000 times.
set.seed(123456789)
N <- 2000 # number of plays
theta <- c(0.2, 0.8, 0.5, 0.1)
# parameters of the model
# conditional probabilities
Y <- 50 + 20*runif(N)
# current yards to go.
Yk <- Y - rnorm(N, mean=35)
Yk <- ifelse(Yk > 0, Yk, 20)
# yards to go after punt.
Yp <- Y - rnorm(N, mean=15)
Yp <- ifelse(Yp > 0, Yp, 0)
# yards to go after pass
Yr <- Y - rnorm(N, mean=3)
Yr <- ifelse(Yr > 0, Yr, 0)
# yards to go after run
EP <- function(x) 5 - 5*x/100
# an expected points function to approximate the figure above.
# equilibrium strategies
q3 = q_fun(theta, cbind(EP(Yr), EP(Yp), EP(100 - Yk)))
q4 = q_fun(theta, cbind(EP(Yr), EP(Yp), EP(100 - Y)))
| 3rd O Pass     | 3rd D Pass     | 4th O Pass     | 4th D Pass     |
|----------------|----------------|----------------|----------------|
| Min.   :0.5444 | Min.   :0.3399 | Min.   :0.5629 | Min.   :0.3269 |
| 1st Qu.:0.5572 | 1st Qu.:0.3577 | 1st Qu.:0.5711 | 1st Qu.:0.3390 |
| Median :0.5604 | Median :0.3627 | Median :0.5733 | Median :0.3423 |
| Mean   :0.5604 | Mean   :0.3627 | Mean   :0.5733 | Mean   :0.3423 |
| 3rd Qu.:0.5636 | 3rd Qu.:0.3678 | 3rd Qu.:0.5754 | 3rd Qu.:0.3457 |
| Max.   :0.5748 | Max.   :0.3881 | Max.   :0.5830 | Max.   :0.3588 |
Note the difference between third down and fourth down. Table 26 shows that the change in payoffs leads to a subtle change in the strategies for the simulated teams. For the Offense there is a 1 percentage point increase in the probability of Pass, while for the Defense there is a 2 to 3 percentage point decrease in the probability of Pass. Do you see why that would be? Why does the Defense key on Run when moving to 4th down?
Pass <- runif(N) < q3$qo
success3 <-
  runif(N) < Pass*(theta[3]*(1 - q3$qd) + theta[4]*q3$qd) +
  (!Pass)*(theta[1]*(1 - q3$qd) + theta[2]*q3$qd)
play = ifelse(Pass, "pass", "run")
# this is created to simulate the estimator below.
# fourth down
Pass <- runif(N) < q4$qo
success4 <-
  runif(N) < Pass*(theta[3]*(1 - q4$qd) + theta[4]*q4$qd) +
  (!Pass)*(theta[1]*(1 - q4$qd) + theta[2]*q4$qd)
The Defense moves to become more focused on stopping the Run. This change in the strategies leads to a change in the success rates. In particular, the probability of the Offense successfully completing a pass or run falls by more than a percentage point. These simulated results suggest that using third down success rates leads to biased estimates and makes “going for it” look more valuable than it actually is.
mean(success3)
[1] 0.3865
mean(success4)
[1] 0.369
Generalized Method of Moments
This section introduces a new estimation algorithm, called generalized method of moments (GMM). This algorithm was developed by Nobel prize winning economist, Lars Peter Hansen. While its initial applications were in macroeconomics, GMM has become standard in microeconometric estimation problems involving game theory. However, in order to understand the algorithm, we will take a detour and return to the question of estimating OLS.
Moments of OLS
The algorithm is a generalization of the least squares algorithm presented in Chapter 1. OLS can be estimated by noting that the mean of the error term is equal to zero. That is, the first **moment** of the error term's distribution is zero.
In the model introduced in Chapter 1, we have some outcome that is a function of observables and unobservables,
$$y_i = a + b x_i + u_i,$$
where $u_i$ is the unobserved term and the first moment condition is $E[u_i] = 0$.
Let the unobserved term be normally distributed, $u_i \sim N(0, \sigma^2)$.
The second moment is the square of the unobserved term, which is equal to the variance ($E[u_i^2] = \sigma^2$).
Note that in this case
In Chapter 1 we saw that we can use least squares to estimate the parameters of this model.
Hansen noted that in various problems there may be more than one **moment** that must equal zero. Having multiple moment conditions seems great. But it may be too much of a good thing. Multiple moments imply multiple solutions. While in theory there may exist only one set of parameters that satisfies all the moment conditions, in data there may exist many different parameter values that satisfy the conditions. Below, we discuss Hansen's solution to this problem.
Simulated Moments OLS
Consider the data we simulated in Chapter 1. This time there is a slight difference; in this data set the unobserved term is drawn with a standard deviation of 2 (see the call to rnorm below).
set.seed(123456789)
N <- 500
a <- 2
b <- 3
x <- runif(N)
u <- rnorm(N, mean=0, sd=2)
y <- a + b*x + u
f_1mom <- function(beta, y, X) {
  y <- as.matrix(y)
  X <- as.matrix(cbind(1,X))
  sos <- mean((y - X%*%beta)^2)
  return(sos)
}

f_2mom <- function(par, y, X) {
  y <- as.matrix(y)
  X <- as.matrix(cbind(1,X))
  sigma <- exp(par[1]) # use to keep positive
  beta <- par[-1]
  sos <- mean(((y - X%*%beta)^2 - sigma^2))^2
  return(sos)
}
a <- optim(par=c(2,3), fn = f_1mom, y=y, X=x)
# beta
a$par
[1] 1.993527 3.027305
b <- optim(par=c(log(2),2,3), fn = f_2mom, y=y, X=x)
# sigma
exp(b$par[1])
[1] 2.032743
# beta
b$par[-1]
[1] 1.556888 3.385584
We can estimate OLS using two different moment conditions, but that also gives us two different answers, and neither answer is particularly accurate. The first moment estimator gives estimates of the intercept and slope that are pretty close to the true values of 2 and 3. The second moment estimator also estimates the standard deviation of the unobserved term, $\sigma$. That estimate is close to the true value of 2, but the estimator does a poor job with the intercept and slope.
Can we improve on these estimates by combining the two moment estimators? The most obvious way to do this is to add them together.
f_gmm_simple <- function(par, y, X) {
  y <- as.matrix(y)
  X <- as.matrix(cbind(1,X))
  sigma <- exp(par[1])
  beta <- par[-1]
  sos <- mean((y - X%*%beta)^2) +
    mean(((y - X%*%beta)^2 - sigma^2)^2)
  return(sos)
}
c <- optim(par=c(log(2),2,3), fn = f_gmm_simple, y=y, X=x)
# sigma
exp(c$par[1])
[1] 2.014251
# beta
c$par[-1]
[1] 1.902530 3.151387
That is, we weight the two conditions equally. This gives an estimate of the standard deviation and estimates of the intercept and slope that lie between the previous two results. The standard deviation estimate is pretty good, but the intercept and slope estimates are not particularly close to the true values.
Why use equal weights? Why not use some other weights? Which weights should we use?
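To see that the choice of weights matters, here is a minimal sketch (my addition, not from the text) that attaches an arbitrary weight w to the second moment condition in the simple estimator above. The function name f_wgt_mom and the weights tried are my own; different values of w pull the estimates toward one moment condition or the other.
f_wgt_mom <- function(par, y, X, w) {
  y <- as.matrix(y)
  X <- as.matrix(cbind(1,X))
  sigma <- exp(par[1])
  beta <- par[-1]
  # weight w on the second moment condition
  sos <- mean((y - X%*%beta)^2) + w*mean(((y - X%*%beta)^2 - sigma^2)^2)
  return(sos)
}
optim(par=c(log(2),2,3), fn = f_wgt_mom, y=y, X=x, w=0.1)$par[-1]
optim(par=c(log(2),2,3), fn = f_wgt_mom, y=y, X=x, w=10)$par[-1]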
GMM of OLS
Let $g(\beta)$ denote the vector of moment conditions; at the true parameter values, each element of this vector equals zero. The analog estimator is then one that finds the smallest value for the (weighted) vector of moment conditions. The estimator minimizes the following analog objective: $\hat{g}(\beta)'\,W\,\hat{g}(\beta)$, where $\hat{g}(\beta)$ is the vector of sample averages of the moment conditions and $W$ is a weighting matrix. Below, the estimation procedure will determine the weighting matrix $W$ from the data.
Note that the notation here is pretty confusing. In particular, it is hard to keep track of the different summations and what exactly is going on with the vectors. Part of the confusion is that the ordering is different between Equation 132 and Equation 133. In Equation 132 we take the mean first so we have a vector of means; then we multiply those together. In Equation 133 we multiply vectors together at the observation level and take the mean of those.
It is easier to see the difference by re-writing Equation 133 using matrices. In the code below, the inverse of the weighting matrix is computed as $G G'/N$, where $G$ is the $K \times N$ matrix that stacks the $K$ moment conditions evaluated at each of the $N$ observations.
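A tiny illustration of the two orderings (my addition, using a made-up matrix G_toy with K = 2 moment conditions and N = 4 observations):
G_toy <- rbind(c(1,-1,2,0), c(0,1,1,-2))
rowMeans(G_toy)              # take the mean first: the vector used in the objective
G_toy%*%t(G_toy)/ncol(G_toy) # multiply at the observation level, then average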
GMM OLS Estimator in R
The GMM estimator has two parts. There is a general GMM function that takes in the matrix of moment conditions, $G$, and the number of conditions, $K$. The second part is a problem-specific function that constructs the moment conditions and passes them to the general function.
f_gmm <- function(G, K) {
  G <- as.matrix(G)
  N <- dim(G)[2]
  if (K==dim(G)[1]) {
    # a check that the matrix G has K rows
    g <- rowMeans(G, na.rm = TRUE)
    W <- try(solve(G%*%t(G)/N), silent = TRUE)
    # try() lets the function work even if there is an error
    if (is.matrix(W)) {
      # if there is no error, W is a matrix.
      return(t(g)%*%W%*%g)
    } else {
      # allow estimation assuming W is identity matrix
      return(t(g)%*%g)
    }
  } else {
    return("ERROR: incorrect dimension")
  }
}
f_ols_gmm <- function(par, y, X) {
  y <- as.matrix(y)
  X <- as.matrix(cbind(1,X))
  sigma <- exp(par[1])
  beta <- par[2:length(par)]
  g1 <- y - X%*%beta
  g2 <- (y - X%*%beta)^2 - sigma^2
  return(f_gmm(t(cbind(g1,g2)),K=2))
}
d <- optim(par=c(log(2),2,3), fn = f_ols_gmm, y=y, X=x)
exp(d$par[1])
[1] 2.013754
d$par[2:3]
[1] 1.969362 3.075517
The GMM estimator does a pretty good job. It estimates the standard deviation parameter relatively well and does a better job at estimating the intercept and slope than the equally weighted estimator. It is not quite as good as least squares for the intercept and slope in this simple example.
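As a benchmark (my addition, not from the text), the closed-form least squares fit on the same simulated data can be computed directly with lm().
lm(y ~ x)$coefficients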
GMM of Returns to Schooling
A standard use of GMM is as an instrumental variable estimator. In particular, GMM can be used when we have multiple instruments for the same variable. We saw this in Chapter 3. We have two potential instruments for the level of education, distance to college and parents at home. In Chapter 3 we used these to conduct an over-identification test. Which they passed! More accurately, which they didn’t fail!
Above we used the first and second moment of the distribution of the unobserved term to create our GMM estimator. Here, we use a moment of the joint distribution between the instrument and the unobserved term. Recall an important assumption of an instrument. It is independent of the unobserved term. In the graph, there is no arrow from the unobserved term to the instrument.
One implication of this assumption is that the unobserved term and the instrument are not correlated: $E(z_i u_i) = 0$, where $z_i$ is the value of the instrument for individual $i$ and $u_i$ is the unobserved term. Further, if we replace the unobserved term with the difference between the outcome and the model prediction, we get a moment condition that can be estimated from the data; with two instruments we get two such moment conditions, one for each instrument.
To see how it works we can return to returns to schooling and the data we used in the first three chapters from Card (1995). The code is identical to the code in Chapter 3. The difference is that instead of using one instrument for level of education, we can use two. Note that for simplicity I don’t instrument for experience.
Warning: NAs introduced by coercion
y <- x1$lwage76
X <- cbind(x1$ed76, x1$exp, x1$exp2, x1$black, x1$reg76r,
           x1$smsa76r, x1$smsa66r, x1$reg662, x1$reg663,
           x1$reg664, x1$reg665, x1$reg666, x1$reg667,
           x1$reg668, x1$reg669)
x1$age2 <- x1$age76^2
f_iv_gmm <- function(beta, y, X, Z) {
  y <- as.matrix(y)
  X <- as.matrix(cbind(1,X))
  Z <- as.matrix(Z) # matrix of instruments of schooling
  g1 <- Z[,1]*(y - X%*%beta)
  g2 <- Z[,2]*(y - X%*%beta)
  return(f_gmm(t(cbind(g1,g2)),K=2))
}
As a reminder, we can compare the IV estimator to the OLS estimator presented in Chapters 2 and 3. The estimated parameters from the OLS estimator are used as starting values for the GMM estimator.
X1 <- cbind(1,X)
beta <- solve(t(X1)%*%X1)%*%t(X1)%*%y
beta[2]
[1] 0.07469326
a <- optim(par=beta, fn = f_iv_gmm, y=y, X=X,
           Z=cbind(x1$nearc4,x1$momdad14))
a$par[2]
[1] 0.07350604
The GMM estimator allows both distance to college and parents at home to instrument for education. Interestingly, this estimator gives *lower* values for returns to schooling than OLS. This is in contrast to the IV results presented in Chapter 3 and the Heckman estimates presented in Chapter 6.
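If more instruments are available, the same pattern extends, with one moment condition per instrument. Below is a minimal sketch (my addition; the function name f_iv_gmmK is hypothetical and not from the text) that builds one moment for each column of Z and lets f_gmm do the rest.
f_iv_gmmK <- function(beta, y, X, Z) {
  y <- as.matrix(y)
  X <- as.matrix(cbind(1,X))
  Z <- as.matrix(Z)
  # one moment condition per instrument: each column of Z times the residual
  G <- t(Z*as.vector(y - X%*%beta))
  return(f_gmm(G, K=dim(Z)[2]))
}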
Estimating the Third Down Game
After a detour to learn more about GMM, we can now use the estimator for estimating our third down game. Remember, we are using the mixed strategy Nash equilibrium to generate moments that we can use to estimate the parameters.
Moment Conditions
Equation 124 and Equation 125 suggest a method for estimating the parameters of interest. The mixed strategy Nash equilibrium provides the Offense and Defense strategies conditional on the expected points for each option and the conditional success probabilities, where the conditional success probabilities are the parameters we want to estimate.
The first condition states that conditional on the Offense playing Pass, the predicted success rate must be the same as the observed rate in expectation. The second condition is similar but for when the Offense plays Run. The third condition states that the observed probability of the Offense playing pass on third down must be equal to the predicted probability from the third down game, on average.
Equation 138 presents the sample analogs of the moment conditions in Equation 137. The GMM estimator finds the vector of conditional success probabilities that makes these sample moments as close to zero as possible.
Third Down GMM Estimator in R
For all
Assumption 6 allows the parameters to be estimated using variation in the situation while holding the parameters constant. It states that for small changes in the yardage, the success probabilities are unchanged conditional on the yardage, the actions of the Offense and Defense, and the observed characteristics of the situation.
The GMM estimator has three parts. The first part of the estimator assumes that Equation 125 holds. It uses the Nash equilibrium to determine mixed strategies of the Offense and Defense given the parameters and the expected outcomes from each of the choices. The second part is the analog of the moment condition. The last part determines the estimated weighting matrix conditional on the observed probabilities, the expected points and the estimated parameter values.
# the GMM estimator, which calls the general GMM function above.
p3_fun <- function(par,Ep,s,play) {
  theta <- exp(par)/(1 + exp(par))
  # using sigmoid function to keep values between 0 and 1
  q3 <- q_fun(theta,Ep)
  # determine the equilibrium strategies.
  # moments
  Pass <- play=="pass"
  g1 <- Pass*(s - theta[3]*(1 - q3$qd) - theta[4]*q3$qd)
  g2 <- (!Pass)*(s - theta[1]*(1 - q3$qd) - theta[2]*q3$qd)
  g3 <- Pass - q3$qo
  G <- t(cbind(g1,g2,g3))
  # note the transpose.
  return(f_gmm(G,3))
}
EP1 <- cbind(EP(Yr), EP(Yp), EP(100 - Yk))
a1 <- optim(par=log(2*theta),fn=p3_fun,Ep=EP1,
            s=success3,play=play,control = list(maxit=10000))
exp(a1$par)/(1 + exp(a1$par))
[1] 0.2740818 0.6630758 0.4570891 0.2010548
The estimator does an OK job. The true values are
Are NFL Coaches Rational?
The assumption of decision maker rationality underpins many of the models in macroeconomics and microeconomics, including estimators presented in this book. Romer (2006) argues that NFL coaches are not behaving rationally, and that this has implications for the foundations of economics.
The difficulty with testing the rationality of these decisions is that NFL coaches do not actually go for it on fourth down. Therefore, we cannot actually measure what happens. We use game theory to model third and fourth downs. In the model, the success rates depend upon success rates conditional on the strategies of the Offense and Defense. We used the observed third down information to estimate these conditional success rates. We then use these estimates and the model to determine what the success rates would have been if the coach had decided to go for it on fourth down.
We can determine the “rationality” of NFL coaches by comparing the predicted rate of going for it against the actual rate. If coaches are rational, then the predicted rate of going for it should not be terribly different from the actual rate of going for it. Of course, this is based on the enormous assumption that the econometrician knows more about NFL than an NFL coach! Or at least, the model used here is a reasonable facsimile of third down and fourth down situations.
NFL Data
In order to estimate the parameters of the game we need to estimate the success rates, the Offense’s strategy, and the value functions. The next play listed in the data is assumed to be the next play that occurs in the game. We will use the next play to determine the “result” of the play.
<- read.csv("NFLQ1.csv", as.is = TRUE)
x $id <- as.numeric(row.names(x))
x$res_pos <- c(x$posteam[2:dim(x)[1]],NA)
x$res_ep <- c(x$ep[2:dim(x)[1]],NA)
x$res_ep <- ifelse(x$posteam==x$res_pos,x$res_ep,-x$res_ep)
x$res_game <- c(x$game_id[2:dim(x)[1]],NA)
x$res_down <- c(x$down[2:dim(x)[1]],NA)
x$diff_ep <- x$res_ep - x$ep
x$diff_ep <- ifelse(x$game_id==x$res_game,x$diff_ep,NA)
x$year <- sub("-.*","",x$game_date)
x# this uses "real expressions."
# it subs out everything after the "-".
$succ <- NA
x$succ <- ifelse(x$res_down==1 & x$posteam==x$res_pos &
x$down==3,1,x$succ)
x$succ <- ifelse(x$res_down==4,0,x$succ)
x$pct_field <- x$yardline_100/100
x$year <- as.numeric(x$year) x
Estimating Third Down Game in R
These expected points estimators are used by the Nash equilibrium condition (Equation 125) to determine the Defense’s strategy.
lm_run <- lm(diff_ep ~ ydstogo + pct_field + year,
             data=x[x$play_type=="run" &
                      x$down==3 & x$succ==1,])
lm_pass <- lm(diff_ep ~ ydstogo + pct_field + year,
              data=x[x$play_type=="pass" &
                       x$down==3 & x$succ==1,])
lm_punt <- lm(diff_ep ~ ydstogo + pct_field + year,
              data=x[x$play_type=="punt",])
The next step is to determine the expected points after each action that the Offense could take: pass, run or punt.94 The first regression is on the change in expected points after a third down run that gets a first down. Table 27 presents the results from the three OLS regressions on the difference in expected points. The results show that successful pass plays get much larger increases in expected points than successful run plays. As expected, these effects are larger the further the Offense has to go for a first down and the further it is from the goal.
 | (1) Run | (2) Pass | (3) Punt |
---|---|---|---|
(Intercept) | -8.146 | 10.331 | -63.722 |
(10.095) | (7.675) | (10.314) | |
ydstogo | 0.220 | 0.153 | 0.019 |
(0.005) | (0.003) | (0.003) | |
pct_field | 0.493 | 1.808 | 1.874 |
(0.069) | (0.053) | (0.102) | |
year | 0.004 | -0.005 | 0.031 |
(0.005) | (0.004) | (0.005) | |
Num.Obs. | 1411 | 3896 | 5860 |
R2 | 0.604 | 0.503 | 0.076 |
R2 Adj. | 0.603 | 0.503 | 0.075 |
AIC | 2240.5 | 7783.5 | 17861.1 |
BIC | 2266.8 | 7814.9 | 17894.4 |
Log.Lik. | -1115.269 | -3886.762 | -8925.532 |
RMSE | 0.53 | 0.66 | 1.11 |
Given these results we can determine the predicted expected points for a run, a pass, and a punt. Note that these are calculated for every play in the data. Later, we will only use this information on the relevant subset.
X <- as.matrix(cbind(1,x$ydstogo,x$pct_field,x$year))
x$pass_ep <- x$ep + X%*%lm_pass$coefficients
x$run_ep <- x$ep + X%*%lm_run$coefficients
x$punt_ep <- x$ep + X%*%lm_punt$coefficients
We can use GMM to estimate the parameters from the Third Down game. This is done for “to go” distances of 1 to 4 yards and positions from the team’s own 30 yard line to the opponent’s 40 yard line (30 yards). Given the sparsity of data we include data from ten yards either side of the position of interest and 1 yard either side of the to-go distance.95
theta_hat <- matrix(NA,4*30,4)
k <- 1
for (i in 1:4) {
  tg <- i # to go distance.
  for (j in 1:30) {
    yl <- 29 + j # yardline
    # create subset for analysis
    index <- x$down==3 &
      x$yardline_100 > yl & x$yardline_100 < (yl+20) &
      x$ydstogo > tg - 2 & x$ydstogo < tg + 2 &
      (x$play_type=="pass" | x$play_type=="run")
    y <- x[index,]
    index_na <-
      is.na(rowSums(cbind(y$run_ep,y$pass_ep,y$punt_ep,
                          y$succ,
                          (y$play_type=="pass" |
                             y$play_type=="run"))))==0
    # GMM to determine parameters (conditional probabilities)
    y_na <- y[index_na,]
    a1 <- optim(par=rnorm(4,mean=0,sd=2),fn=p3_fun,
                Ep=cbind(y_na$run_ep,y_na$pass_ep,
                         y_na$punt_ep),
                s=y_na$succ,play=y_na$play_type,
                control = list(maxit=10000))
    theta_hat[k,] <- exp(a1$par)/(1 + exp(a1$par))
    k <- k + 1
    #print(k)
  }
}
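As a quick diagnostic (my addition), the estimated conditional success probabilities can be averaged over all of the field position and to-go cells.
colMeans(theta_hat, na.rm = TRUE)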
Predicting the Fourth Down Game
We can use the parameter estimates from the third down game to model the fourth down game. The purpose is to model the policy of going for it at various positions on the field and to go distances. The model assumes that both Offense and Defense play a mixed strategy Nash equilibrium.
The game is almost identical to the third down game (Table 25); the difference is the payoffs from the unsuccessful play. In the third down game, the payoff is the expected points from punting the ball away. Here it is the expected points from turning the ball over on downs. That is, the expected points if the ball goes to the other team and that team gets to start where the ball is currently located. The question is whether this one change to the payoffs substantially changes the Nash equilibrium and the predicted outcome of the game. Note that other expected point calculations and the conditional probabilities are all the same as for third down, with the conditional probabilities determined from the GMM procedure.
lm_over <- lm(ep ~ pct_field + I(pct_field^2),
              data=x[x$down==1,])
x$over_ep <- -cbind(1,(1-x$pct_field),
                    (1-x$pct_field)^2)%*%lm_over$coefficients
To estimate the expected points from a turnover on downs, we use the expected points at first down from various distances to goal. The OLS regression assumes that there is a non-linear relationship.
p4_fun <- function(theta,Ep) {
  q4 <- q_fun(theta,Ep)
  p4p <- q4$qo*(theta[3]*(1 - q4$qd) + theta[4]*q4$qd)
  p4r <- (1 - q4$qo)*(theta[1]*(1 - q4$qd) + theta[2]*q4$qd)
  p4 <- p4r + p4p
  Epgfi <- p4r*Ep[,1] + p4p*Ep[,2] + (1 - p4)*Ep[,3]
  # expected points going for it.
  return(list(p4=p4,Epgfi=Epgfi))
}
tab_res <- prob_actual <- prob_pred <- matrix(NA,4,30)
k <- 1
for (i in 1:4) {
  tg <- i # to go distance.
  for (j in 1:30) {
    yl <- 29 + j # yardline
    # create subset for analysis
    index3 <- x$down==3 &
      x$yardline_100 > yl & x$yardline_100 < (yl+20) &
      x$ydstogo > tg - 2 & x$ydstogo < tg + 2 &
      (x$play_type=="pass" | x$play_type=="run")
    y <- x[index3,]
    # determine predicted success on 4th down.
    succ4 <- p4_fun(theta_hat[k,],
                    cbind(y$run_ep,y$pass_ep,y$over_ep))
    # Actual frequency of going for it on 4th down.
    index4 <- x$down==4 &
      x$yardline_100 > yl & x$yardline_100 < (yl+20) &
      x$ydstogo > tg - 2 & x$ydstogo < tg + 2
    z <- x[index4,]
    z$go <- ifelse(z$play_type=="run" |
                     z$play_type=="pass",1,NA)
    z$go <- ifelse(z$play_type=="punt",0,z$go)
    # relative value of punting
    tab_res[i,j] <- mean(y$punt_ep - succ4$Epgfi,
                         na.rm = TRUE)
    # predicted probability of going for it
    prob_pred[i,j] <- mean(y$punt_ep - succ4$Epgfi < 0,
                           na.rm = TRUE)
    # actual probability of going for it.
    prob_actual[i,j] <- mean(z$go,na.rm = TRUE)
    k <- k + 1
    #print(k)
  }
}
We can estimate the model at different distances to goal and yards to go to determine whether it is better to punt or “go for it” on fourth down.
1 To Go | 2 To Go | 3 To Go | 4 To Go |
---|---|---|---|
Min. :0.1991 | Min. :0.1726 | Min. :0.4313 | Min. :1.475 |
1st Qu.:0.5708 | 1st Qu.:0.5028 | 1st Qu.:1.0918 | 1st Qu.:1.569 |
Median :0.7642 | Median :1.0102 | Median :1.3094 | Median :1.638 |
Mean :1.0129 | Mean :1.1260 | Mean :1.7323 | Mean :1.972 |
3rd Qu.:1.0491 | 3rd Qu.:1.4337 | 3rd Qu.:1.6666 | 3rd Qu.:1.740 |
Max. :3.0715 | Max. :3.4444 | Max. :4.4630 | Max. :4.065 |
The table suggests that it is almost always better to punt the ball away on fourth down. Table 28 presents summary statistics on the difference between expected points from punting over going for it, at each to go distance from 1 yard to 4 yards.
Testing Rationality of NFL Coaches
The Figure 28 presents a histogram of the difference between the actual and predicted probability of going for it at each yard line and to go distance. It does not provide strong evidence of irrationality. In most cases the probabilities are the same or similar. However, there are cases where the model makes a strong prediction to go for it, but no NFL coach does. There are also cases where many NFL coaches do in fact go for it but the model does not predict that they should.
Discussion and Further Reading
The chapter uses game theory to solve the problem of estimating the value of punting the ball on 4th Down in American football. The estimation problem follows from the fact that coaches rarely go for it on 4th Down, and so there is no data. Using game theory and the GMM estimator, we can estimate the policy parameters using data from third downs. We can then use these estimates and the mixed strategy Nash equilibrium to simulate 4th Down. Our analysis suggests that NFL coaches punt the ball the appropriate number of times on 4th Down.
While GMM has become a standard technique, recent work suggests using moment inequalities generated from decision problems or games (Pakes et al. 2015).
Estimating Auction Models
Introduction
According to the travel presenter, Rick Steves, the Aalsmeer auction house is one of the largest commercial buildings in the world. Royal Flora Holland, the owner of Aalsmeer, sold 12.5 billion plants and flowers in 2016 through its auction houses. But with $5.2 billion in auction sales, Royal Flora Holland is nowhere near the biggest auction house in the world.96 That honor goes to Google. Google sold $47.6 billion in search ads using what the economist, Hal Varian, called the biggest auction in the world (Varian 2007).97 But while that is impressive, a single auction in 2015 almost beat Google’s annual number. The US Federal Communication Commission’s auction number 97 (AWS-3) raised $44.9 billion dollars.98
Auctions are used to sell and buy a large number of products. Governments use auctions to purchase everything from paper to police body cameras. The US Federal government uses auctions to sell oil drilling rights, FCC spectrum, 10 year bonds and timber access. You can sell and buy items from eBay.com using auctions.
Economists use auction data to determine the underlying valuations for the items. We may be interested in modeling what would happen if a different auction method is used. Economists working in antitrust want to determine the effect of a merger that reduces the number of independent bidders or determine if there is evidence of collusion.
The auctions at Aalsmeer are unique. The auction runs for a short amount of time with a “clock” clicking the price down as the auction continues. As the price falls, the first bidder to hit the button wins at whatever price the clock shows. A spokeswoman for Aalsmeer stated that because the price falls, it is called a Dutch auction. But actually, she got the causality backwards. Because the Dutch popularized these types of auctions for selling flowers, we call them Dutch auctions.
The auction style you may be most familiar with is called an English auction. In this auction, there is an auctioneer who often speaks very very fast and does a lot of pointing while bidders hold up paddles or make hand gestures. In English auctions, the last bidder wins and pays the price at which the bidding stops.
Economic analysis of auctions began with William Vickrey’s seminal 1961 paper, Counterspeculation, Auctions, and Competitive Sealed Tenders. Vickrey pointed out that Dutch auctions and sealed bid auctions are strategically equivalent. In a standard sealed bid auction each bidder submits a secret written bid. The auctioneer chooses the highest bid, and the winning bidder pays the number written down in her bid.
Vickrey characterized what a bidder should optimally bid in such an auction. He then showed that the same bidder should bid exactly the same amount in a Dutch auction. That is, in a Dutch auction, the bidder should wait until the price falls to the number written down, and then hit the button. Vickrey showed that these two auctions formats are strategically equivalent.99 However, they are not strategically equivalent to an English auction.
Vickrey invented a new auction. In a Vickrey auction, each bidder writes down a bid like in a standard sealed bid auction and the winner is the person who writes down the highest bid. However, the winner pays the amount written down by the second highest bidder. Vickrey showed that his auction was strategically equivalent to an English auction.
Econometric analysis of auctions can be split into two steps. In the first step, we use standard statistical methods to estimate the statistical parameters. In the second step, we use the game theory to determine the structural parameters. We previously used this approach to estimate demand.
This chapter discusses two of the most important auction formats, sealed bid auctions and English auctions. It presents estimators for both. The sealed bid auction estimation is based on Guerre, Perrigne, and Vuong (2000). The English auction analysis uses the order statistic approach of Athey and Haile (2002). In both cases it presents results for simulated data and analysis of timber auctions. The chapter tests whether loggers are bidding rationally in sealed bid auctions and whether loggers colluded in English auctions.
Sealed Bid Auctions
Sealed bid auctions are one of the most commonly used auction formats. These auctions are very prominent in procurement, both in government and the private sector. In a sealed bid auction, each bidder writes down her bid and submits it to the auctioneer. The auctioneer sorts the bids from highest to lowest (or lowest to highest if they are buying instead of selling). The winner is the highest bidder and she pays the amount she wrote down. This is called a first price auction, because the price is determined by the highest bid or first price.
Vickrey pointed out that sealed bid auctions are strategically complicated. To see this, assume that a bidder’s utility for an item is equal to their intrinsic value for the item less the price they pay for the item. For example, a logger bidding in a timber auction will earn profits from the logs less the price paid to the US Forestry service for access to the trees. If a logger bids an amount equal to her expected profits, then if she wins she will earn nothing from the logging. It is optimal for the logger to shade her bid down. The problem is that the more she shades down, the lower her chance of winning the auction. The bidder must calculate the trade off between the probability of winning the auction and the value of winning the auction.
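As a toy illustration of this trade-off (my addition, not from the text), suppose a bidder values the item at 0.8 and faces a single rival whose bid is uniformly distributed between 0 and 0.5. Expected profit is the mark-up times the probability of winning, and the profit-maximizing bid shades well below the value.
v <- 0.8
b <- seq(0, v, by=0.01)
# expected profit: (value - bid) times the probability the rival bids less
eprofit <- (v - b)*punif(b, min=0, max=0.5)
b[which.max(eprofit)]
In this example the best bid turns out to be half the value, which previews the bidding formula used in the simulation below.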
Sealed Bid Model
(Independent Private Values (IPV)) Let each bidder’s valuation of the item be an independent draw from the same distribution, $F$.
Assumption 7 makes the exposition a lot simpler. It also seems to be a reasonable approximation for the problems considered. It states that a bidder’s value for the item is unrelated to the values of the other bidders in the auction, except that they draw their valuation from the same distribution. It is a standard simplifying assumption in the auction literature. A contrasting assumption is called “common values.” In a common values auction the item has the exact same value for everyone. Often it is assumed that while the bidders know that they all have the same value, they don’t know exactly what that value is. This leads to an over-bidding problem called the “winner’s curse.”
The bidder maximizes her expected returns from the auction. Assume that the bidder gets 0 if she loses. If she wins, assume she gets her intrinsic value for the item less her bid, so her expected return is the mark-up multiplied by the probability of winning (Equation 140), where the probability of winning depends on the bid she submits.
If we take first order conditions of Equation 140 with respect to the bid, then we get the following expression: the probability of winning equals the mark-up multiplied by the derivative of that probability with respect to the bid. Let $G(b)$ denote the probability that a bid of $b$ wins the auction and $g(b)$ its derivative. We can rearrange this formula to show how much the bidder should shade her bid: $b = v - \frac{G(b)}{g(b)}$ (Equation 142).
The formula states that the bidder should bid her value, less a shading factor which is determined by how much a decrease in her bid reduces her probability of winning the auction.
It will be useful for our code to write the probability of winning the auction as a function of the bid distribution. Let $H$ denote the distribution of bids and $h$ its density. Under independent private values with $N$ bidders, the probability that a bid $b$ wins the auction is $G(b) = H(b)^{N-1}$. If there are two bidders in the auction, then the probability of winning is simply the probability that your bid is higher than the other bidder's. If there are more than two bidders, it is the probability that your bid is higher than *all* the other bidders' bids. The independent private values assumption implies it is the probability that each of the other bidders makes a bid less than yours, all multiplied together. We can also determine the derivative of this function in terms of the bid, $g(b) = (N-1)\,h(b)\,H(b)^{N-2}$, where $h$ is the density of the bid distribution.
Sealed Bid Simulation
In the simulated data, each bidder draws their value from a uniform distribution. Vickrey shows that the optimal bid in this auction is calculated using the following formula: $b_i = \frac{N-1}{N} v_i$ (Equation 145), where $N$ is the number of bidders in the auction and $v_i$ is bidder $i$'s valuation.
Vickrey assumes that each bidder knows his own valuation, but only knows the distribution of valuations for other bidders in the auction. In game theory, this is called a game of incomplete information. We generally assume that the outcome of such games is a Bayes Nash equilibrium.100
Remember a game has three parts, a set of players, a set of strategies and a set of payoffs. In the case of auctions, the players are the bidders in the auction. The strategies are the bids. Actually, that is not quite correct. In games of incomplete information, the players do not observe the actions of the other players. When bidding, a bidder does not know what the other bidders are bidding. Instead, it is assumed that bidders know the function that maps from valuations to bids. In Vickrey’s game, bidders know the function represented by Equation 145. They also know their own valuation. The payoffs are the expected value of the auction accounting for the probability of winning, the intrinsic value of the item and the amount bid.
The uniform distribution simplifies the problem, which is why it is used. In each simulated auction, there are different numbers of simulated bidders.
set.seed(123456789)
M <- 1000 # number of simulated auctions.
data1 <- matrix(NA,M,12)
for (i in 1:M) {
  N <- round(runif(1, min=2,max=10)) # number of bidders.
  v <- runif(N) # valuations, uniform distribution.
  b <- (N - 1)*v/N # bid function
  p <- max(b) # auction price
  x <- rep(NA,10)
  x[1:N] <- b # bid data
  data1[i,1] <- N
  data1[i,2] <- p
  data1[i,3:12] <- x
}
colnames(data1) <- c("Num","Price","Bid1",
                     "Bid2","Bid3","Bid4",
                     "Bid5","Bid6","Bid7",
                     "Bid8","Bid9","Bid10")
data1 <- as.data.frame(data1)
The simulation creates a data set with 1,000 auctions. In each auction, there are between 2 and 10 bidders. Note that the bidders are not listed in order.
Sealed Bid Estimator
The estimator uses Equation 142 to back out values from observed bids. To do this, we calculate the probability of winning the auction conditional on the number of bidders. It should be straightforward to determine from this data. Once we have this function, we use the formula to determine the bidder’s valuation from their bid.
The first step is to estimate the bid distribution. The non-parametric estimate of the distribution function is the empirical distribution of the observed bids, $\hat{H}(b) = \frac{1}{n}\sum_{j=1}^{n} \mathbf{1}(b_j < b)$, where $n$ is the number of observed bids. The second step is to estimate the derivative of the bid distribution. This can be calculated numerically for some given “small” number, $\epsilon$: $\hat{h}(b) = \frac{\hat{H}(b + \epsilon) - \hat{H}(b - \epsilon)}{2\epsilon}$.
If there are two bidders, Equation 142 determines the valuation for each bidder as $\hat{v}_i = b_i + \frac{\hat{H}(b_i)}{\hat{h}(b_i)}$, where $\hat{H}$ is the estimated bid distribution and $\hat{h}$ is the estimated bid density.
Sealed Bid Estimator in R
The estimator backs out the valuation distribution from the distribution of bids. It limits the data to only those auctions with two bidders. In this special case, the probability of winning is just given by the distribution of bids.103 In the code the term “eps” stands for the Greek letter epsilon ($\epsilon$).
sealed_2bid <- function(bids, eps=0.5) {
  # eps for "small" number for finite difference method
  # of taking numerical derivatives.
  values <- rep(NA,length(bids))
  for (i in 1:length(bids)) {
    H_hat <- mean(bids < bids[i])
    # bid probability distribution
    h_hat <- (mean(bids < bids[i] + eps) -
                mean(bids < bids[i] - eps))/(2*eps)
    # bid density
    values[i] <- bids[i] + H_hat/h_hat
  }
  return(values)
}
bids <- data1[data1$Num==2,3:4]
bids <- as.vector(as.matrix(bids)) # all the bids
values <- sealed_2bid(bids)
It is straightforward to calculate the probability of winning, as this is the probability the other bidder bids less. Given IPV, this is just the cumulative probability for a particular bid. Calculating the density is slightly more complicated. However, we can approximate this derivative numerically by looking at the change in the probability for a “small” change in the bids.104 The value is calculated using Equation 142.
The Figure 29 shows that the bids are significantly shaded from the true values, particularly for very high valuations. The figure presents the density functions for bids and derived valuations from the two-person auctions. The true density of valuations lies at 0.5 and goes from 0 to 1. Here the estimated density is a little higher and goes over its bounds. However, part of the reason may be the method we are using to represent the density in the figure.105
English Auctions
The auction format that people are most familiar with is the English auction. These auctions are used to sell cattle, antiques, collector stamps and houses (in Australia). In the 1970s they were also the standard format used by the US Forestry Service to sell timber access (Aryal et al. 2018).
Vickrey showed that English auctions are strategically very simple. Imagine a bidder hires an expert auction consultant to help them bid in an English auction.
Expert: “What is your value for the item?”
Bidder: “$2,300”
Expert: “Bid up to $2,300 and then stop.”
In sealed bid auctions there is an optimal trade-off between winning and profiting from the auction. In English auctions there is no such trade-off. In econ-speak the high bidder is pivotal in determining the price in sealed bid auctions, but is not pivotal in English auctions.
English auctions are second price auctions. The price in an English auction is determined by the highest losing bid. That is, it is determined by the second price. In English auctions the bidding continues until the second to the last bidder drops out. Once there is only one bidder, everything stops.
The optimal bid for bidder $i$ is simply her value, $b_i = v_i$ (Equation 149). Equation 149 suggests that empirical analysis of English auctions is a lot simpler than for sealed bid auctions. If only that were so! To be clear, the “bid” in Equation 149 means the strategy described by the expert. In the data we do not necessarily observe this strategy.
If we could observe all the bid strategies in the auction, then we would have an estimate of the value distribution. But that tends to be the problem. Depending on the context, not all active bidders in the auction may actually be observed making a bid. In addition, if the price jumps during the auction we may not have a good idea when bidders stopped bidding (P. A. Haile and Tamer 2003).
Athey and Haile (2002) provide a solution. They point out that the price in an English auction has a straightforward interpretation as the second highest bid in the auction when valuations follow Assumption 7. The price is the second highest valuation of the people who bid in the auction. Consider if the price is lower than the second highest valuation. How could that be? Why did one of the bidders exit the auction at a price lower than her valuation? Consider if the price is higher than the second highest valuation. How could that be? Why would a bidder bid more than her valuation?
If the price is equal to the second highest valuation, then it is a particular order statistic of the value distribution. Athey and Haile (2002) show how the observed distributions of an order statistic uniquely determine the value distribution.
Order Statistics
To understand how order statistics work, consider the problem of determining the distribution of heights of players in the WNBA. The obvious way to do it is to take a data set on player heights and calculate the distribution. A less obvious way is to use order statistics.
In this method, data is taken from a random sample of teams, where for each team, the height of the tallest player is measured. Assume each team has 10 players on the roster and you know the height of the tallest, say the center. This is enough information to estimate the distribution of heights in the WNBA. We can use the math of order statistics and the facts that we know the height of the tallest and that 9 other players are shorter. In this case we are using the tallest, but you can do the same method with the shortest or the second tallest, etc.
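A small simulation (my addition, using made-up normally distributed heights) shows the idea: if only the tallest of 10 players is observed, the distribution of the maximum is $F(h)^{10}$, so the underlying distribution can be recovered by taking the tenth root.
set.seed(123456789)
teams <- matrix(rnorm(1000*10, mean=72, sd=3), nrow=1000) # heights in inches
tallest <- apply(teams, 1, max)
mean(tallest < 78)^(1/10)  # estimate of F(78) using only the tallest players
pnorm(78, mean=72, sd=3)   # the true value, for comparison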
The price is more or less equal to the second highest valuation of the bidders in the auction.106 The probability of the second highest of $N$ valuations being equal to $v$ is $N(N-1)\,f(v)\,F(v)^{N-2}\,(1 - F(v))$: one bidder has a value above $v$, one has a value of $v$, the remaining $N-2$ have values below $v$, and there are $N(N-1)$ ways of assigning the top two roles. The order statistic notation for the second highest bid of $N$ bids is $b^{(N-1:N)}$. The question raised by Athey and Haile (2002) is whether we can use this formula to determine the underlying value distribution, $F$.
Identifying the Value Distribution
Let’s say we observe a two bidder auction with a price equal to the lowest possible valuation for the item; call that lowest value $\underline{v}$. Because the price is the lower of the two valuations, observing a price of $\underline{v}$ tells us that one bidder’s valuation is $\underline{v}$ while the other bidder’s valuation could lie anywhere above it. Now take another value, $v_1$, a little above $\underline{v}$, and consider the probability that the price falls between $\underline{v}$ and $v_1$. It is the probability of seeing one bidder with a value between $\underline{v}$ and $v_1$ and the other bidder with a value above that, and either of the two bidders could be the low one. We can solve this expression for the probability that a valuation lies between $\underline{v}$ and $v_1$, given the probability already determined at $\underline{v}$.
For a finite subset of the valuations we can use this iterative method to calculate the whole distribution. For this to work, each bidder’s valuation is assumed to be independent of the other bidders and comes from the same distribution of valuations (Assumption 7).
English Auction Estimator
The non-parametric estimator of the distribution follows the logic above.
The initial step determines the probability at the minimum value, $\hat{F}(\underline{v}) = \hat{g}(\underline{v})/2$, where there are two bidders in the auction and $\hat{g}(\underline{v})$ is the estimated probability that the price falls in the first cell of the grid. To this initial condition, we can add an iteration equation: $\hat{F}(v_k) = \frac{\hat{g}(v_k)}{2\,(1 - \hat{F}(v_{k-1}))} + \hat{F}(v_{k-1})$.
These equations are then used to determine the distribution of the valuations.
English Auction Estimator in R
We can estimate the distribution function non-parametrically by approximating it at $K$ points on a grid running from the lowest to the highest observed price.
english_2bid <- function(price, K=100, eps=1e-8) {
  # K number of finite values.
  # eps small number for getting the probabilities
  # calculated correctly.
  min1 <- min(price)
  max1 <- max(price)
  diff1 <- (max1 - min1)/K
  Fv <- matrix(NA,K,2)
  min_temp <- min1 - eps
  max_temp <- min_temp + diff1
  # determines the boundaries of the cell.
  Fv[1,1] <- (min_temp + max_temp)/2
  gp <- mean(price > min_temp & price < max_temp)
  # price probability
  Fv[1,2] <- gp/2 # initial probability
  for (k in 2:K) {
    min_temp <- max_temp - eps
    max_temp <- min_temp + diff1
    Fv[k,1] <- (min_temp + max_temp)/2
    gp <- mean(price > min_temp & price < max_temp)
    Fv[k,2] <- gp/(2*(1 - Fv[k-1,2])) + Fv[k-1,2]
    # cumulative probability
  }
  return(Fv)
}
Consider simulated data from an English auction. The data set provides the price and the number of bidders in each auction.
M <- 10000
data2 <- matrix(NA,M,2)
for (i in 1:M) {
  N <- round(runif(1, min=2,max=10))
  v <- rnorm(N) # normally distributed values
  b <- v # bid
  p <- -sort(-b)[2] # auction price
  data2[i,1] <- N
  data2[i,2] <- p
}
colnames(data2) <- c("Num","Price")
data2 <- as.data.frame(data2)
Given this data we can determine the value distribution for the two-bidder auctions.
price <- data2$Price[data2$Num==2]
# restricts the data to auctions w/ 2 bidders.
Fv <- english_2bid(price)
The Figure 30 shows that the method does a pretty good job at estimating the underlying distribution. The true distribution has a median of 0, which means that the curve should go through the point where the two dotted lines cross. However, the estimate of the upper tail is not good. The true distribution has a much thicker upper tail than the estimated distribution. How does the estimate change if you use auctions with a larger number of bidders? Remember to change the formula appropriately.
Are Loggers Rational?
In the 1970s, the US Forest Service conducted an interesting experiment. It introduced sealed bid auctions in 1977. Previous to that, most US Forest Service auctions had been English auctions.107 In 1977, the service mixed between auction formats. As discussed above, bidding in sealed bid auctions is strategically a lot more complicated than bidding in English auctions. In the latter, the bidder simply bids her value. In the former, she must trade off between bidding higher and increasing the likelihood of winning against paying more if she does win.
Because of the experiment, we can test whether the loggers in the sealed bid auctions bid consistently with their actions in the English auctions. Our test involves estimating the underlying value distribution using bid data from sealed bid auctions, and comparing that to an estimate of the underlying value distribution using price data from English auctions.
Timber Data
The data used here is from the US Forest Service downloaded from Phil Haile’s website.108
In order to estimate the distributions of bids and valuations it is helpful to “normalize” them so that we are comparing apples to apples. The standard method is to use a log function of the bid amount and run OLS on various characteristics of the auction including the number of acres bid on, the estimated value of the timber, access costs and characteristics of the forest and species (P. Haile, Hong, and Shum 2006).109
<- read.csv("auctions.csv", as.is = TRUE)
x <- lm(log_amount ~ as.factor(Salvage) + Acres +
lm1 + log_value + Haul +
Sale.Size + as.factor(Species) +
Road.Construction as.factor(Region) + as.factor(Forest) +
as.factor(District), data=x)
# as.factor creates a dummy variable for each entry under the
# variable name. For example, it will have a dummy for each
# species in the data.
$norm_bid <- NA
x$norm_bid[-lm1$na.action] <- lm1$residuals
x# lm object includes "residuals" term which is the difference
# between the model estimate and the observed outcome.
# na.action accounts for the fact that lm drops
# missing variables (NAs)
In general, we are looking for a normal-like distribution.
Figure 31 presents the histogram of the normalized log bids. It is not required that the distribution be normal, but if the distribution is quite different from normal, you should think about why that may be. Does this distribution look normal?110
Sealed Bid Auctions
In order to simplify things we will limit the analysis to two-bidder auctions. In the data, sealed bid auctions are denoted “S”.
y <- x[x$num_bidders==2 & x$Method.of.Sale=="S",]
bids <- y$norm_bid
values <- sealed_2bid(bids)
summary(bids)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-4.0262 -0.7987 -0.1918 -0.2419 0.3437 5.0647
summary(values)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-4.0262 -0.0218 0.9185 4.8704 2.3108 860.0647
Using the same method that we used above, it is possible to back out an estimate of the value distribution from the bids in the data. Comparing the valuations to the bids, we see that the bids are significantly shaded, particularly for higher valuations.
English Auctions
Again consider two-bidder auctions. The English auctions are denoted “A”.
y <- x[x$num_bidders==2 & x$Method.of.Sale=="A",]
price <- y[y$Rank==2,]$norm_bid
Fv <- english_2bid(price)
We can back out the value distribution assuming that the price is the second highest bid, the second highest order statistic.
Comparing Estimates
The Figure 32 shows that there is not a whole lot of difference between the estimate of the distribution of valuations from sealed bid auctions and English auctions. The two distributions of valuations from the sealed bid auctions and English auctions lie fairly close to each other, particularly for lower values. This suggests loggers are bidding rationally. That said, at higher values, the two distributions diverge. The value distribution from the sealed bid auctions suggests that valuations are higher than the estimate from the English auctions. What else may explain this divergence?
Are Loggers Colluding?
Is there evidence that bidders in these auctions are colluding? Above, we find mixed evidence that bidders in timber auctions are behaving irrationally. Now we can ask whether they are behaving competitively.
Given the price is an order statistic, prices should be increasing in the number of bidders. Figure 33 presents a plot of the number of bidders in an English auction against the price. The figure suggests that this relationship is not clear. One explanation is that bidders are colluding. It could be that bidders in larger auctions are underbidding. Of course, there may be many other reasons for the observed relationship.
A Test of Collusion
Consider the following test of collusion. Using large English auctions we can estimate the distribution of valuations. Under the prevailing assumptions, this estimate should be the same as for two-bidder auctions. If the estimate from the large auctions suggests valuations are much lower than for two-bidder auctions, this suggests collusion.
In particular, collusion would show up if the inferred valuations in these larger auctions look much like those in auctions with fewer bidders. That is, bidders may behave “as if” there are actually fewer bidders in the auction. For example, if there is an active bid-ring, bidders may have a mechanism for determining who will win the auction and how the losers may be compensated for not bidding.111 In an English auction, it is simple to enforce a collusive agreement because members of the bid-ring can bid in the auction where their actions are observed.
We can estimate the distribution under the assumption that there are two bidders in the auction and the assumption that there are three bidders in the auction. If these estimates are consistent with the results from the actual two bidder auctions, then we have an estimate of the bid-ring size. Consider an auction with six bidders. If three of them are members of a bid ring, then those three will agree on who should bid from the ring. Only one of the members of the bid ring will bid their value. In the data, this will look like there are actually four bidders in the auction.112
“Large” English Auctions
Consider the case where there are six bidders in the auction. From above, the order statistic formula for this case is as follows: the probability that the price equals $v$ is $N(N-1)\,f(v)\,F(v)^{N-2}\,(1 - F(v))$ with $N = 6$. As above, order statistics are used to determine the underlying value distribution ($F$).
Think about the situation where the price in a 6 bidder auction is observed at the minimum valuation. What do we know? As before, one bidder may have a value equal to the minimum or a value above the minimum. That is, their value could be anything. The probability of a valuation lying between the minimum and maximum value is 1. We also know that the five other bidders had valuations at the minimum. If not, one of them would have bid more and the price would have been higher. As there are six bidders, there are six different bidders that could have had the highest valuation. This reasoning gives the following formula for the starting value.
Rearranging, we have $F(\underline{v}) = \left(\frac{g(\underline{v})}{N}\right)^{1/(N-1)}$.
Given this formula we can use the same method as above to solve for the distribution of valuations.
Large English Auction Estimator
Again we can estimate the value distribution by using an iterative process. In this case we have the following estimators: the initial condition $\hat{F}(v_1) = \left(\frac{\hat{g}(v_1)}{N}\right)^{1/(N-1)}$ and the iteration equation $\hat{F}(v_k) = \frac{\hat{g}(v_k)}{N(N-1)\,\hat{F}(v_{k-1})^{N-2}\,(1 - \hat{F}(v_{k-1}))} + \hat{F}(v_{k-1})$.
The other functions are as defined in the previous section.
We can also solve for the implied distribution under the assumption that there are three bidders and under the assumption that there are two bidders.113 Note in each auction there are at least six bidders.114
Large English Auction Estimator in R
We can adjust the estimator above to allow for any number of bidders, $N$.
english_Nbid <- function(price, N, K=100, eps=1e-8) {
  min1 <- min(price)
  max1 <- max(price)
  diff1 <- (max1 - min1)/K
  Fv <- matrix(NA,K,2)
  min_temp <- min1 - eps
  max_temp <- min_temp + diff1
  Fv[1,1] <- (min_temp + max_temp)/2
  gp <- mean(price > min_temp & price < max_temp)
  Fv[1,2] <- (gp/N)^(1/(N-1))
  for (k in 2:K) {
    min_temp <- max_temp - eps
    max_temp <- min_temp + diff1
    Fv[k,1] <- (min_temp + max_temp)/2
    gp <- mean(price > min_temp & price < max_temp)
    Fv[k,2] <-
      gp/(N*(N-1)*(Fv[k-1,2]^(N-2))*(1 - Fv[k-1,2])) +
      Fv[k-1,2]
  }
  return(Fv)
}
y2 <- x[x$num_bidders > 5 & x$Method.of.Sale=="A",]
# English auctions with at least 6 bidders.
price2 <- y2$norm_bid[y2$Rank==2]
Fv6 <- english_Nbid(price2, N=6)
Fv3 <- english_Nbid(price2, N=3)
Fv2 <- english_2bid(price2)
Evidence of Collusion
The Figure 34 suggests that there is in fact collusion in these auctions! Assuming there are six bidders in the auction implies that valuations are much lower than we estimated for two bidder auctions from both English auctions and sealed-bid auctions. In the chart, the distribution function is shifted to the left, meaning there is greater probability of lower valuations.
If we assume the estimates from the two bidder auctions are the truth, then we can determine the size of the bid-ring. Estimates assuming there are three bidders and two bidders lie above and below the true value, respectively. This suggests that bidders are behaving as if there are between two and three bidders in the auction. This implies that the bid ring has between three and four bidders in each auction.
These results are suggestive of an active bid ring in these auctions in 1977. It turns out that this was of real concern. In 1977, the United States Senate conducted hearings into collusion in these auctions. In fact, this may be why the US Forestry Service looked into changing to sealed bid auctions. The US Department of Justice also brought cases against loggers and millers (Baldwin, Marshall, and Richard 1997). Alternative empirical approaches have also found evidence of collusion in these auctions, including Baldwin, Marshall, and Richard (1997) and Athey, Levin, and Seira (2011).
Large Sealed Bid Auction Estimator
One concern with the previous analysis is that there may be something special about auctions with a relatively large number of bidders. It may be that there are unobservable characteristics of the auctions that are leading to the observed differences. To check this possibility we can compare our estimate of the value distribution from large sealed bid auctions to the estimates above.
For this case the probability of winning the auction is slightly different from the two-bidder case above: $\hat{G}(b) = \hat{H}(b)^{N-1}$ and $\hat{g}(b) = (N-1)\,\hat{h}(b)\,\hat{H}(b)^{N-2}$, where $\hat{H}$ and $\hat{h}$ are the estimated bid distribution and bid density.
Large Sealed Bid Auction Estimator in R
sealed_Nbid <- function(bids, N, eps=0.5) {
  values <- rep(NA,length(bids))
  for (i in 1:length(bids)) {
    H_hat <- mean(bids < bids[i])
    h_hat <- (mean(bids < bids[i] + eps) -
                mean(bids < bids[i] - eps))/(2*eps)
    G_hat <- H_hat^(N-1)
    g_hat <- (N-1)*h_hat*H_hat^(N-2)
    values[i] <- bids[i] + G_hat/g_hat
  }
  return(values)
}
y_6 <- x[x$num_bidders > 5 & x$Method.of.Sale=="S",]
bids_6 <- y_6$norm_bid
values_6 <- sealed_Nbid(bids_6, N=6, eps=3)
As for two person auctions, the first step is to estimate each bidder’s valuation.
The results presented in Figure 35 provide additional evidence that there is collusion in the large English auctions. The bidding behavior in the larger sealed bid auctions is consistent with both bidding behavior in small sealed bid auctions and small English auctions.
Discussion and Further Reading
Economic analysis of auctions began with Vickrey’s 1961 paper. Vickrey used game theory to analyze sealed bid auctions, Dutch auctions and English auctions. Vickrey also derived a new auction, the sealed bid second price auction. Empirical analysis of auctions starts with the game theory.
The chapter considers two of the most important auction mechanisms, sealed bid auctions and English auctions. The sealed bid auctions are analyzed using the two-step procedure of Guerre, Perrigne, and Vuong (2000). The first step uses non-parametric methods to estimate the bid distribution. The second step uses the Nash equilibrium to back out the value distribution.
While we observe all the bids in the sealed-bid auctions, we generally only observe the high bids in English auctions. The chapter uses the order statistic approach of Athey and Haile (2002) to estimate the value distribution from these auctions.
Empirical analysis of auctions has changed dramatically over the last twenty years. This section just grazed the surface of the different issues that arise in auctions. One big issue not discussed here is unobserved auction heterogeneity. There may be systematic differences between the auctions that are observed by the bidders but not observed by the econometrician. The most important paper on this question was written by a former office mate of mine, Elena Krasnokutskaya (Krasnokutskaya 2011). A recent review by Phil Haile and Yuichi Kitamura is an excellent starting point (P. A. Haile and Kitamura 2019).
Baldwin, Marshall, and Richard (1997) and Athey, Levin, and Seira (2011) analyze collusion in US timber auctions. Aryal et al. (2018) uses US timber auctions to measure how decision makers account for risk and uncertainty.
Panel Data
Introduction
The first part of the book considers the value of exogenous changes in variables to help predict the effect of a policy change. The second part of the book illustrates how economic theory can be used explicitly to predict the effect. This part argues that repeated measurement provides valuable information for predicting the effect of a policy change.
Repeated measurement of the same individual allows us to infer the individual treatment effect and uncover unobserved characteristics of the individual. This is of particular interest with panel data sets. In microeconomic panel data sets, such as the National Longitudinal Survey of Youth, we observe a large number of individuals across many years. This chapter works through the standard approaches of difference in difference and fixed effects. These methods are useful when a subset of individuals observed in the data face differing levels of exposure to the policy over time.
First Differences
Consider data generated according to Figure 36. We are interested in the causal effect of education on an outcome of interest such as income. However, if we observe the same individual with two different education levels then we observe the same individual at two different times. Unobserved characteristics of the individual may also change between the two time periods. There is both an effect of education and of Time on the outcome. The Figure 36 shows that the relationship between education and the outcome is entangled with the effect of time.
This section estimates the policy effect by taking first differences in the outcome and the policy variables.
First Difference Model
Consider that we observe an outcome of interest for individual $i$ in period $t$, $y_{it} = a + b x_{it} + d u_t + \epsilon_{it}$, where the outcome of interest $y_{it}$ depends on an observed policy variable $x_{it}$, an unobserved term $u_t$ that is common across individuals but changes over time, and an idiosyncratic unobserved term $\epsilon_{it}$.
In Chapter 2 we accounted for this additional variable by including it in the regression. Here we can account for the unobserved time effect by taking first differences: $y_{i2} - y_{i1} = b\,(x_{i2} - x_{i1}) + d\,(u_2 - u_1) + (\epsilon_{i2} - \epsilon_{i1})$.
If we assume that the change in the unobserved term between the two periods is 1, so that $u_2 - u_1 = 1$, then we can add an intercept to measure the time effect, $d$.
Simulated Panel Data
Consider the simulated panel data set created below. There are 1,000 simulated individuals observed over two time periods. Note that both the observed characteristic and the unobserved characteristic change between the two periods, and the unobserved characteristic is the same for every individual within a period.
set.seed(123456789)
N <- 1000 # 1000 individuals
T <- 2 # 2 time periods
a <- -2
b <- 3
d <- -4
u <- c(5,6) # unobserved characteristic in the two periods.
x1 <- runif(N)
x2 <- x1 + runif(N) # change in X over time.
x <- as.matrix(rbind(x1,x2)) # creates a matrix T x N.
e <- matrix(rnorm(N*T),nrow = T)
# additional unobserved characteristic
y <- a + b*x + d*u + e
# matrix with T rows and N columns.
OLS Estimation of First Differences
The simulated data allows us to observe the same individual in both time periods, so we can take first differences of the outcome and the policy variable.
diffy <- y[2,] - y[1,]
diffx <- x[2,] - x[1,]
lm1 <- lm(diffy ~ diffx)
Table 29 presents the results of the first difference approach. It does a pretty good job at getting at the true value of the slope, $b = 3$; the estimated intercept picks up the common time effect, $d = -4$.
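For comparison (my addition, not from the text), pooled OLS on the levels ignores the fact that the unobserved term changes between the two periods and moves together with x, so the slope estimate is pulled well away from 3.
lm(c(y[1,],y[2,]) ~ c(x[1,],x[2,]))$coefficients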
Difference in Difference
Consider a case where we observe two types of individuals in the data. For example, let it be the case that for all individuals in the first time period we observe them with only a high school education. Then for the second time period, say ten years later, we observe two different groups. The first group has a college education. We call this the treated group, while the second group remains with a high school education. We call these people the non-treated group. For this second group, we can estimate the impact of time on the outcome of interest. For this group we can estimate the impact of the ten years on income. Therefore, if the effect of time is the same for both the treated group and the non-treated group we can disentangle the effect of time and the policy variable. We can disentangle the effect of the ten years and attending college on income. Of course, it is not at all obvious that the effect of time would be the same for both groups.
The section presents the difference in difference estimator and illustrates the estimator with simulated data.
Difference in Difference Estimator
In algebra, we have the following derivation of the difference in difference estimator. Assume that the observed outcome of interest for individual $i$ in period $t$ depends on the treatment, a time effect that is common across individuals, and an unobserved term, where the treatment changes only for the treated group and only in the second period. Consider observing outcomes for the treated group in both time periods. Differencing across the two periods removes everything that is constant over time for that group, where what remains is the sum of the treatment effect and the time effect.
For an individual in the non-treated group, we have the following equation.
For this group, the change in the time period does not change the treatment that they receive. For this group, the only change in outcomes is due to time.
Now we can compare the two differences. Equation 163 considers two conditional expectations. The first conditions on the treatment level equal to 1. This only occurs to individuals that are in the treated group for the time period in which they are treated. This is the group that attends college. Equation 164 gives the difference in the conditional expectation for the non-treated group, conditional on the two time periods. Thus, we can estimate the treatment effect by subtracting the second difference from the first.
The analog estimator is as follows: take the difference in average outcomes across the two periods for the treated group and subtract the difference in average outcomes across the two periods for the non-treated group, where the averages are taken over the individuals observed in each group and period.
Difference in Difference Estimator in R
The difference in difference estimator is used to separate out the effect of time ($d$) from the effect of the policy ($b$).
f_did <- function(Y,treat) {
  y1 <- Y[1,]
  y2 <- Y[2,]
  did <- matrix(NA,3,3)
  # creates the difference in difference matrix
  did[1,1] <- mean(y2[treat==1])
  # calculates the average outcomes for each
  did[2,1] <- mean(y1[treat==1])
  did[1,2] <- mean(y2[treat==0])
  did[2,2] <- mean(y1[treat==0])
  did[3,1] <- did[1,1] - did[2,1]
  # calculates the differences.
  did[3,2] <- did[1,2] - did[2,2]
  did[1,3] <- did[1,1] - did[1,2]
  did[2,3] <- did[2,1] - did[2,2]
  did[3,3] <- did[3,1] - did[3,2]
  row.names(did) <- c("Period 2", "Period 1", "Diff")
  colnames(did) <- c("Treated", "Not Treated", "Diff")
  return(did)
}
In the simulated data there are two groups, a treated group and a non-treated group. For the treated group, the outcome changes between the two periods because of both the time effect and the treatment; the policy variable increases by 1 in the second period for the treated group only.
set.seed(123456789)
treat <- runif(N) < 0.5
x2 <- x1 + treat
# this time the change is 1 for the treated group.
x <- rbind(x1,x2)
y <- a + b*x + d*u + e
did1 <- f_did(y,treat)
The difference in difference estimator gives an estimate of 2.99, where the true value is 3. Table 30 presents the estimator, where the four cells in the top left are the means of the outcome for each group in each time period.
Effect of a Minimum Wage Increase in New Jersey
What happens if the minimum wage is increased? Economic theory gives a standard prediction. If the minimum wage is increased above the equilibrium level wage, the demand for workers will fall and the supply of workers will increase. In fact, a minimum wage increase may actually harm more low-wage workers than it helps. While per-hour pay may increase, the number of hours may decrease. At least in theory.
What actually happens if the minimum wage is increased? The section uses the difference in difference estimator with data from Card and Krueger (1994) to determine what actually happened in New Jersey.
Data from Card and Krueger (1994)
Card and Krueger (1994) surveyed restaurants before and after the state of New Jersey increased the minimum wage. In 1991, the US federal minimum wage was $4.25/hour. In April of 1992, the state of New Jersey increased its minimum wage above the federal level to $5.05/hour. In order to see how much a minimum wage increase led to a decrease in labor demand, Card and Krueger (1994) surveyed restaurants in New Jersey before and after the change. The following code imports the Card and Krueger (1994) data. Note that there is a labeling issue with two restaurants receiving the same label.
<- read.csv("cardkrueger.csv", as.is = TRUE)
x <- cbind(x$SHEET,x$EMPFT,x$EMPPT,x$STATE,1)
x1 # SHEET is the firm ID.
# this includes the initial employment levels.
# column of 1s added to represent the initial time period.
colnames(x1) <- c("SHEET","FT","PT","STATE","TIME")
# FT - fulltime, PT - parttime.
<- as.data.frame(x1)
x1 $SHEET <- as.numeric(as.character(x1$SHEET))
x1$SHEET==407,]$SHEET <- c(4071,4072)
x1[x1# there is an issue with the labeling in the data.
# two firms with the same label.
<- cbind(x$SHEET,x$EMPFT2,x$EMPPT2,x$STATE,2)
x2 # the second period of data.
colnames(x2) <- c("SHEET","FT","PT","STATE","TIME")
<- as.data.frame(x2)
x2 $SHEET <- as.numeric(as.character(x2$SHEET))
x2# a number of variables are changed into "factors"
# as.numeric(as.character()) changes them back into numbers.
$SHEET==407,]$SHEET <- c(4071,4072)
x2[x2<- rbind(x1,x2) # putting both periods together.
x3 colnames(x3) <- c("SHEET","FT","PT","STATE","TIME")
$FT <- as.numeric(as.character(x3$FT)) x3
Warning: NAs introduced by coercion
x3$PT <- as.numeric(as.character(x3$PT))
Warning: NAs introduced by coercion
The Figure 37 shows the minimum wage change will have some bite. A fairly large number of firms pay exactly the minimum of $4.25 per hour. Most of the firms pay less than the proposed minimum of $5.05.
Difference in Difference Estimates
The concern with just comparing employment in New Jersey before and after the minimum wage change is that other factors may have also changed between the two time periods. Card and Krueger (1994) use difference in difference to account for the time effects. The authors propose using restaurants in the neighboring state of Pennsylvania as the non-treated group. They argue that restaurants in Pennsylvania are not impacted by the New Jersey law change. In addition, these restaurants are similar enough to the New Jersey restaurants that other changes in the economy will be the same between the two states. The restaurants in New Jersey and Pennsylvania will have the same “time effect” on average.
To see what happens we can follow the procedure presented above.
Y <- rbind(x3$FT[x3$TIME==1] + x3$PT[x3$TIME==1],
           x3$FT[x3$TIME==2] + x3$PT[x3$TIME==2])
index_na <- is.na(colSums(Y))==0
Y1 <- Y[,index_na]
treat <- x3$STATE[x3$TIME==1]==1
treat1 <- treat[index_na]
did2 <- f_did(Y1,treat1)
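One quick way to inspect the estimates at the console (a usage sketch) is to print the rounded matrix that f_did returns:

round(did2, 2)
# rows: Period 2, Period 1, Diff; columns: Treated, Not Treated, Diff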
The results suggest that the minimum wage increase has no impact on employment in New Jersey. Table 31 presents the difference in difference estimates on the total count of employees, both full time and part time.116 In the table we actually see a very small increase in New Jersey employment before and after the law change. Meanwhile, in Pennsylvania there is a slight decrease in employment over the same time period. The net result is a slight increase in employment associated with the change in the minimum wage law.
The result seems counter-intuitive, at least counter to standard economic theory. The result is heavily reliant on the assumption that the change in employment in Pennsylvania is a good proxy for the time effect in New Jersey. Is that assumption reasonable?
Fixed Effects
In data sets with more time periods we can use the fixed effects model. As above, it is assumed the time effect is the same for each individual, irrespective of the treatment. Some subset of the sample is treated and they receive the treatment in some subset of the time periods. Usually, there is a pre-treatment and a post-treatment period.
The section presents the general model with individual and time fixed effects.
Fixed Effects Estimator
As with the difference in difference, in the fixed effect model the time effect is assumed to be the same for everyone. This model allows individuals to have different outcomes that are persistent through time. In the restaurant example, we expect some restaurants to have higher employment in both periods, relative to other restaurants. However, with only two time-periods we cannot account for these differences. With more pre-treatment time periods we can attempt to measure differences between restaurants. Accounting for such differences may be particularly important if Pennsylvania and New Jersey tend to have different types of restaurants.117
The general model has the observed outcome as a function of two fixed effects. The first is an individual effect that is allowed to vary across individuals but is persistent through time. The second is a time effect that varies across time but is persistent through the cross section of individuals. There is also a treatment effect that is assumed to affect a subset of individuals in a subset of time periods and an unobserved characteristic that varies across individuals and time.
where
where the treatment occurs in period
Consider our restaurant example but this time we observe employment levels for the restaurants for a number of years prior to the minimum wage increase in New Jersey. We can allow different restaurants to be big or small leading to a different number of employees on average. We can also allow all restaurants to be hit by changes to general economic conditions. Lastly, we assume that only restaurants in New Jersey are impacted by the increase in the minimum wage in New Jersey.
Nuisance Parameter
A nice feature of the fixed effects model is that it accounts for differences between the treated and non-treated groups that are not due to the treatment itself. The
A simple solution is to not include the individual dummy in the estimator. The problem with this approach is that it makes the estimator much noisier and less likely to be correct. This is because all the individual dummy variables have been added to the error term. It may also lead to a biased estimate if there is some systematic difference between the treated group and the non-treated group.
Adjusted Fixed Effects Estimator
One nice solution is to do adjusted estimation. In the first step, the pre-treatment data is used to estimate the
If we substitute that estimate back into Equation 166 we get the following equation. Note that the nuisance parameter has been removed (it has been replaced with an estimate).
where
A potential issue is that we have added extra noise to the estimator (
Two Step Fixed Effects Estimator
Another approach is a two step estimator. In the first step, the individual fixed effects are estimated for each individual separately in the pre-treatment period. The individual fixed effects estimator is simply the mean of the residual of the regression of the outcome variable on the time-dummies.
The second step regresses the outcome less the fixed effect on the policy variables of interest.
The Equation 171 forms the basis for the estimator with the outcome and the “error” term adjusted by netting out the estimated fixed effect.
This estimator highlights the estimation problem. If the pre-treatment period is not very long then each of the
Also note the similarity and difference between the two step estimator and the adjusted estimator.121
Fixed Effects Estimator in R
The following fixed effects estimator takes data in the form of panel data matrices and converts it for use with lm(). The function takes advantage of R’s ability to quickly create a large number of dummy variables using as.factor(). Note the function has the option to create fixed effects only in the time dimension. It also does not need to include a treatment variable.
f_fe <- function(Y, X=NULL, cross=TRUE) {
  Y <- as.matrix(Y)
  T <- dim(Y)[1]
  N <- dim(Y)[2]
  XT <- matrix(rep(c(1:T),N), nrow=T)
  # creates a T x N matrix with the numbers 1 to T in each column (the time index)
  y <- as.vector(Y)
  t <- as.vector(XT)
  # set up for different cases
  if (cross) { # create cross-section dummies
    XC <- t(matrix(rep(c(1:N),T), nrow=N))
    # creates a T x N matrix with 1 to N in each row (the individual index)
    c <- as.vector(XC)
  }
  if (is.null(X)==0) { # create treatment variable
    X <- as.matrix(X)
    treat <- as.vector(X)
  }
  # estimator
  if (cross & is.null(X)==0) { # standard case
    lm1 <- lm(y ~ treat + as.factor(t) + as.factor(c))
  } else {
    if (is.null(X)==0) { # no cross-section
      lm1 <- lm(y ~ treat + as.factor(t))
    } else { # no treatment
      lm1 <- lm(y ~ as.factor(t))
    }
  }
  return(lm1)
}
The simulated panel data set has 100 individuals observed over 10 time periods. About half of the individuals are treated and the treatment occurs in the last time period.
set.seed(123456789)
N <- 100
T <- 10
alpha <- runif(N)
gamma <- runif(T)
beta <- 3
epsilon <- matrix(rnorm(N*T),nrow=T)
treat <- runif(N) < 0.5
y <- t(matrix(rep(alpha,T),nrow = N)) + gamma + epsilon
y[1,] <- y[1,] + beta*treat
treat1 <- matrix(0,T,N)
treat1[1,] <- treat
We can compare the different estimators on the simulated data.
# standard estimator
lm1 <- f_fe(y, treat1)
# No individual fixed effects estimator
lm2 <- f_fe(y, treat1, cross = FALSE)
# Adjusted estimator
y0 <- y[2:T,] # pre-treatment outcomes.
alphahat <- colMeans(y0) # calculate alpha.
y2 <- y - t(matrix(rep(alphahat,T),nrow=N))
# adjust outcome.
lm3 <- f_fe(y2, treat1, cross = FALSE)
# Two step estimator
lm4 <- f_fe(y0, cross = FALSE) # adjust for time effects.
y0_res <- matrix(lm4$residuals, nrow=T-1)
alpha_hat <- colMeans(y0_res) # calculate alpha.
y3 <- y - t(matrix(rep(alpha_hat,T),nrow=N)) # adjust outcome
lm5 <- f_fe(y3, treat1, cross=FALSE)
| | (1) | (2) | (3) | (4) |
|---|---|---|---|---|
| (Intercept) | 1.005 | 1.056 | 0.078 | 1.089 |
| | (0.352) | (0.157) | (0.144) | (0.144) |
| treat | 2.998 | 3.060 | 2.998 | 2.998 |
| | (0.215) | (0.214) | (0.195) | (0.195) |
| Num.Obs. | 1000 | 1000 | 1000 | 1000 |
| R2 | 0.458 | 0.340 | 0.378 | 0.378 |
| R2 Adj. | 0.391 | 0.333 | 0.372 | 0.372 |
| AIC | 2980.6 | 2978.9 | 2796.9 | 2796.9 |
| BIC | 3525.4 | 3037.8 | 2855.8 | 2855.8 |
| Log.Lik. | -1379.315 | -1477.443 | -1386.465 | -1386.465 |
| RMSE | 0.96 | 1.06 | 0.97 | 0.97 |
The Table 32 provides a nice comparison between the four approaches. The true parameter is 3 and the standard model gives a good estimate. The model in which the individual dummies are dropped gives a relatively poor estimate. The other models give a good estimate.
Effect of a Federal Minimum Wage Increase
In 2007, the federal government legislated an increase in the federal minimum wage to $7.25, phased in between 2007 and 2009. The results from the difference in difference analysis of Card and Krueger (1994) suggest that this change will have little or no effect on employment. However, the approach makes quite strong assumptions and looks at the impact on restaurants rather than on individuals.
The NLSY97 data seems well suited to analyzing the impact of this federal minimum wage increase. The individuals in the data are in their late 20s and early 30s when the changes occur. At least some proportion of these individuals are likely to work in jobs that pay minimum wage or would be impacted by the changes. The analysis here follows Currie and Fallick (1996), who use NLSY79 to analyze the impact of minimum wage changes in 1979 and 1980.
NLSY97
The following code imports a data set that I have created from NLSY97.122 It then creates two matrix panels, one for income and one for hours worked. For each, the rows are time periods and the columns are individuals.
<- read.csv("NLSY97Panel.csv",as.is=TRUE)
x <- read.csv("NLSY97Panel_names.csv",
x_names header=FALSE,as.is = TRUE)
colnames(x) <- as.vector(x_names[,1])
# create two matrices, with 18 (years)
# rows 8984 (individuals) columns
# one for income and one for hours worked.
<- c(1997:2014)
year <- c("97","98","99","00","01","02","03",
year1 "04","05","06","07","08",
"09","10","11","12","13","14")
# below we need both versions of year.
<- Y <- matrix(NA,18,8984)
W for (i in 1:18) {
<- paste("CVC_HOURS_WK_YR_ALL_",
hrs_name "_XRND",sep="")
year1[i],# paste() is used to concatenate strings.
<- paste("YINC_1700_",year[i],sep="")
inc_name if (hrs_name %in% colnames(x)) {
<- ifelse(x[,colnames(x)==hrs_name] >= 0,
Y[i,] colnames(x)==hrs_name],NA)
x[,
}# %in% asks whether something is an element of a set.
if (inc_name %in% colnames(x)) {
<- ifelse(x[,colnames(x)==inc_name] >= 0,
W[i,] colnames(x)==inc_name],NA)
x[,
}
}<- W[11,]/Y[11,]
rate_07 # calculates the wage rate for each person.
$treat <- ifelse(rate_07<7.26 | W[11,]==0 | Y[11,]==0,1,0)
x# treated if earn less than 7.26/hour or no earnings in 2007.
The data includes information on almost 9,000 individuals who are tracked across 18 years. For each individual we know their income and hours worked for the year. From this we can calculate their average hourly wage rate. We can then identify the individuals in the data who earned less than $7.26 an hour in 2007. We will call this group the treated group. Individuals earning more than this are assumed to be unaffected by the policy change. These are the non-treated group. How reasonable is this assignment? Can you re-do this analysis with a different assignment to treated and non-treated?
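As a starting point for that exercise, here is one possible alternative assignment (the $8.50 cutoff is a hypothetical choice, not one used in the text):

treat_alt <- ifelse(rate_07 < 8.50 | W[11,]==0 | Y[11,]==0, 1, 0)
# hypothetical: treat anyone earning under $8.50/hour (or with no earnings) in 2007
table(x$treat, treat_alt)
# compare the alternative assignment to the one used above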
Fixed Effects Estimators of the Minimum Wage
We can use the standard fixed effects estimator to determine the effect of the minimum wage change on hours worked. The code creates a treatment variable. Note that the pre-treatment period is the first ten years, while the post-treatment period is the last eight years.
N <- 600 # reduce the size for computational reasons.
T <- 18
y1 <- Y[,1:N]
treat1 <- matrix(0,T,N)
for (i in 11:T) {
  treat1[i,] <- x$treat[1:N]
}
lm1 <- f_fe(y1, treat1)
As discussed above, the individual dummy variables are a nuisance to estimate. You can see this by changing the number of individuals used in the estimation. As the number increases the computation takes longer and longer.
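A rough way to see the computational burden (a sketch; the sample sizes are arbitrary choices) is to time the estimator at a few values of N:

# timing sketch: the dummy-variable regression slows down quickly as N grows
for (n in c(100, 300, 600)) {
  y_n <- Y[, 1:n]
  treat_n <- matrix(0, T, n)
  for (i in 11:T) treat_n[i,] <- x$treat[1:n]
  print(system.time(f_fe(y_n, treat_n)))
}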
It is a lot less computationally burdensome to estimate the “adjusted” fixed effects estimator. The outcome is “adjusted” by differencing out the average hours worked for each individual in the pre-treatment period.
N <- 8984 # full data set
y0 <- Y[1:10,1:N] # pre-treatment
alpha_hat <- colMeans(y0, na.rm = TRUE)
y1 <- Y - t(matrix(rep(alpha_hat,T), nrow = N))
treat1 <- matrix(0,T,N)
for (i in 11:T) {
  treat1[i,] <- x$treat[1:N]
}
lm2 <- f_fe(y1, treat1, cross=FALSE)
We can also estimate the two step fixed effects estimator in order to compare the results.
lm3 <- f_fe(y0, cross=FALSE)
y0_res <- y0 - matrix(rep(lm3$coefficients,N),nrow=10)
alpha_hat <- colMeans(y0_res, na.rm = TRUE) # calculate alpha.
y3 <- Y - t(matrix(rep(alpha_hat,T),nrow=N)) # adjust outcome
lm4 <- f_fe(y3, treat1, cross=FALSE)
We can compare results across the three estimators. The estimates are similar. They show a fairly substantial reduction in hours associated with the increase in the minimum wage.
| | (1) | (2) | (3) |
|---|---|---|---|
| (Intercept) | 821.538 | -722.012 | 25.935 |
| | (185.236) | (8.313) | (8.287) |
| treat | -296.386 | -271.147 | -274.277 |
| | (40.853) | (7.821) | (7.797) |
| Num.Obs. | 8297 | 127107 | 127107 |
| R2 | 0.542 | 0.275 | 0.283 |
| R2 Adj. | 0.505 | 0.274 | 0.283 |
| AIC | 134051.8 | 2054608.9 | 2053828.7 |
| BIC | 138392.5 | 2054804.0 | 2054023.8 |
| Log.Lik. | -66407.919 | -1027284.467 | -1026894.360 |
| RMSE | 724.08 | 782.98 | 780.58 |
The Table 33 presents the estimates with the standard estimator and a subset of the individuals, as well as the adjusted estimators with all individuals. The estimates imply a reduction of roughly 270 to 300 hours per year, which is more than 10% of a full-time schedule of 2,000 hours and a much larger share of the hours of an average individual in the sample. That said, it is not clear that these workers are worse off because the wage increase was substantial.
Are Workers Better Off?
a <- mean(x[x$treat==1,]$YINC_1700_2007, na.rm = TRUE)
b <- mean(x[x$treat==1,]$CVC_HOURS_WK_YR_ALL_07_XRND,
          na.rm = TRUE)
c <- a/b # 2007 wage rate
d <- mean(x[x$treat==1,]$YINC_1700_2010, na.rm = TRUE)
# 2010 income
e <- mean(x[x$treat==1,]$CVC_HOURS_WK_YR_ALL_10_XRND,
          na.rm = TRUE)
# 2010 hours
d - c*(e + 270)
[1] 4206.07
# actual less counter-factual
The minimum wage increase leads to lower hours but higher incomes. To see what happens to the average treated person, compare their actual income in 2010 to their estimated counterfactual income in 2010. Their actual average income in 2010 is $12,548 and their actual hours worked are 1,034. According to our estimates, if the minimum wage had not increased, they would have worked 1,304 hours. However, they would have earned less. On average their wage prior to the minimum wage increase was $6.40 per hour. Assuming that is the counterfactual wage, the minimum wage change increased their income by about $4,200 per year. According to this analysis, the policy increased income for the treated group despite lowering their hours worked. What criticisms would you have of this analysis?
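A back-of-the-envelope check using the rounded numbers quoted above (a sketch, not part of the original code):

# actual 2010 income less counterfactual income at the pre-increase wage of $6.40
12548 - 6.40*(1034 + 270)
# roughly $4,200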
Discussion and Further Reading
The last part of the book presents the third major approach to estimating the policy variables of interest, repeated measurement. This chapter considers panel data. In this data, we observe outcomes for a large number of the same individuals over a number of time periods.
This data allows us to estimate treatment effects by comparing outcomes before and after the treatment. The problem with this comparison is that time may have caused other changes to occur. The observed difference is affected by both the treatment and time. The panel structure suggests a solution. If there are some individuals who were unaffected by the treatment then only time affected the difference in outcomes. We can use these non-treated individuals to measure the impact of time and difference it out. The classic paper is Card and Krueger (1994).
If we have enough time periods we can account for heterogeneity in the observed outcomes. We can use fixed effects to account for unobserved differences in outcomes between individuals. The next chapter considers panel data with some of the assumptions relaxed. In particular, it considers the synthetic control approach of Abadie, Diamond, and Hainmueller (2010). Chapter 11 accounts for heterogeneity with a mixture model. The chapter revisits Card and Krueger (1994).
Synthetic Controls
Introduction
By observing the same individual with different levels of exposure to a policy we can learn about how the policy affects that individual. However, we have a problem that “time” is also affecting the individual. Chapter 10 showed how panel data could be used to “control for time” using fixed effects.
In the standard fixed effects model, it is assumed that time has the same effect on all individuals. In particular, time has the same effect for both treated and non-treated groups (or control group). This assumption is sometimes called parallel trends. Abadie, Diamond, and Hainmueller (2010) argue that this assumption can be weakened. The authors argue for choosing the control group more carefully. The authors suggest using the pre-treatment period to estimate the relationship between members of the treated group and members of the control group. Instead of placing equal weights on every member of the control group, members of the treated group should place more weight on members of the control group that are similar.
This chapter introduces the synthetic control estimator. It presents the general model, discusses the identification problem and presents the solution provided by Abadie, Diamond, and Hainmueller (2010). The chapter presents two alternative approaches, LASSO and factor models. The estimators are used to determine the effect of the federal minimum wage increase discussed above.
Beyond “Parallel Trends”
The fixed effects model presented in Chapter 10 makes the very strong parallel trends assumption. The impact of time is identical for everyone. This does not seem reasonable. We would like to estimate a model that allows the impact of time to vary, in particular to vary across treated and non-treated groups.
A General Fixed Effects Model
One idea is to allow the fixed effects model to be more general. Below we make two changes to the model.
where, as before,
These two relatively simple changes to the model have a fundamental impact on our ability to estimate the treatment effect. Under this model, the individual treatment effect estimation is given by the following equation.
Note that we simplified things and assumed we can accurately estimate the individual and time fixed effects. The value of interest,
A Slightly Less General Model
Unfortunately, the generalization of the fixed effects model made above means that we can no longer estimate the model. At least we cannot estimate the policy variable of interest. Consider the following slightly less general model.
where
Synthetic Synthetic Control Data
The synthetic data is generated according to a factor model.123 All simulated individuals have their outcome determined by three factor values. However, individuals weight these factor values differently. There are 35 time periods and 200 individuals.
set.seed(123456789)
N <- 200 # individuals
T <- 35 # time periods
K <- 3 # factors
sig <- 0.1 # noise
F <- matrix(3*runif(T*K), nrow=T) # factors
Lambda <- matrix(exp(rnorm(N*K)), nrow=K)
# weights are positive
Lambda <- t(t(Lambda)/rowSums(t(Lambda)))
# sum weights to 1
E <- matrix(rnorm(T*N,mean=0,sd=sig),nrow=T) # unobserved
Ybar <- F%*%Lambda # average outcome
Ys <- Ybar + E # actual outcome
Note there is no policy effect in the simulated data. Each individual’s outcome is determined by their individual weights on three factors, plus an unobserved term.
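In symbols, each simulated outcome is generated as (a sketch of the data generating process coded above, in generic notation)

$$y_{it} = \sum_{k=1}^{3}\lambda_{ik}F_{tk} + \varepsilon_{it}, \qquad \lambda_{ik} \ge 0,\ \ \sum_{k=1}^{3}\lambda_{ik} = 1,\ \ \varepsilon_{it} \sim N(0, 0.1^2).$$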
The Figure 38 presents a plot of outcomes for the 200 individuals in periods 1 and 2. Each individual in the data places a weight on the three factor values, which are represented by the triangles. The implication is that while each individual is unique there are other individuals in the data that have similar outcomes because they weight the factor values in the same way.
Constructing Synthetic Controls with OLS
The goal is to use the non-treated individuals to construct the “time effect” for the treated individuals. As in Chapter 10, once we have accounted for the time effect, we can back out the treatment effect. But how do we construct this synthetic control?
Perhaps the most straightforward way is to use OLS. We discussed in Chapter 1 that OLS captures the correlation between variables. Let’s say we want to construct a “synthetic” version of individual 1 using data from individuals 2 to 5. We can think of the outcomes for individual 1 as the variable we want to explain, while the outcomes for individuals 2 to 5 are the explanatory variables.
The Equation 175 assumes that the observed outcome for synthetic individual 1 in period
We can then write this out in matrix form.
where
Following the same matrix algebra we used in Chapters 1 and 2, OLS determines the weights we should use for our synthetic control.
With these estimated weights the synthetic individual 1’s outcome in some future period is a weighted average of the outcomes for the other synthetic individuals.
OLS Weights in R
We can use the first 30 time periods in the synthetic data constructed above to estimate the synthetic control for individual 1. We can use the last 5 time periods to compare our synthetic prediction to the actual outcomes.
lm1 <- lm(Ys[1:30,1] ~ Ys[1:30,2:5])
omega_hat <- lm1$coefficients
y1_hat_ols <- cbind(1,Ys[31:35,2:5])%*%omega_hat
The Figure 43 plots the actual outcomes and predicted outcomes for synthetic individual 1 in periods 31 to 35. The 45 degree line represents perfect prediction. The chart shows that our OLS based synthetic control tracks the actual outcomes pretty well.
The issue with this method is that our choice of using the next four individuals is completely arbitrary. We can construct more accurate controls using more individuals.
A “Wide” Data Problem
To understand the issue consider what happens when we use the next 28 simulated individuals. The loop finds the root mean squared error as each individual up to 28 is added to the regression.
tab_res <- matrix(NA,28,2)
for (i in 2:29) {
  lm1 <- lm(Ys[1:30,1] ~ Ys[1:30,2:i])
  omega_hat <- lm1$coefficients
  res <- Ys[31:35,1] - cbind(1,Ys[31:35,2:i])%*%omega_hat
  tab_res[i-1,1] <- i
  tab_res[i-1,2] <- mean(res^2)^(0.5)
}
# creates a table of the rmse for different numbers of
# explanatory variables.
The Figure 39 presents a measure of predictive ability of the synthetic control. The chart uses root mean squared error of the difference between the actual outcome and the predicted outcome. The chart suggests that the prediction is optimized when 5 to 10 other individuals are used. The prediction gets slowly worse as more individuals are used. Why would that be? Then around 25, the prediction starts to go off the rails. Why is that?
We are running into a problem brought up in Chapter 2. We have collinearity and multicollinearity problems. Try running the regression hashed out below. It should produce some sort of error, probably something related to not being able to find the inverse of the matrix. There is a collinearity problem when the number of “explanatory variables” (simulated individual outcomes) equals the number of “observations” (time periods). As we add in more individuals the correlation between individuals in the data increases until the regression starts producing garbage and then finally stops working altogether.
y1 <- Ys[1:30,1]
Y_n1 <- Ys[1:30,-1]
#omega_hat <- solve(t(Y_n1)%*%Y_n1)%*%t(Y_n1)%*%y1
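One way to see the problem directly (a diagnostic sketch) is to compare the rank of the matrix of control outcomes to its number of columns:

dim(Y_n1)
# 30 rows (time periods) and 199 columns (individuals)
qr(Y_n1)$rank
# the rank is at most 30, so t(Y_n1) %*% Y_n1 cannot be inverted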
The problem of how to choose between potentially collinear explanatory variables is called a “wide” data problem. Here “wide” refers to the fact that our data matrix is very wide relative to its length (the number of observations or time periods). In our simulated data we have 199 explanatory variables and 30 observations. The rest of the chapter discusses three solutions to the wide data problem: the Abadie, Diamond, and Hainmueller (2010) estimator, the LASSO, and convex factor models.
Abadie Estimator
Abadie, Diamond, and Hainmueller (2010) provide one popular solution to the wide data problem presented above. Their idea is to restrict the weights. In particular, the authors restrict the weights to be positive and sum to one. The treated unit is assumed to be a convex combination of the non-treated units.
Restricting Weights
Abadie, Diamond, and Hainmueller (2010) present an OLS estimator like the one presented in the previous section. However, the weights are restricted such that each treated individual is a convex combination of the non-treated individuals. We assume the outcomes of individual 1 are a convex combination of the outcomes for the other 199 simulated individuals.
The Figure 40 shows that individual 1 sits in the cloud of simulated outcomes for period 1 and period 2. This means that it is possible to represent individual 1’s outcomes as a convex combination of the other simulated individuals.
The authors assume that
The Abadie, Diamond, and Hainmueller (2010) estimator solves the following constrained least squares problem.
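In generic notation (a sketch, simplified to the case without covariates), the problem for the treated unit is

$$\hat{\omega} = \arg\min_{\omega}\ \sum_{t \le T_0}\Big(y_{1t} - \sum_{j=2}^{N}\omega_j\, y_{jt}\Big)^2 \quad \text{subject to}\quad \omega_j \ge 0,\ \ \sum_{j=2}^{N}\omega_j = 1,$$

where $T_0$ is the last pre-treatment period.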
The authors suggest that the restriction can be justified by assuming a factor model is generating the observed data. Note however, that there is no relationship between the assumptions presented here and the assumptions on the factor model presented below.
Synthetic Control Estimator in R
The Equation 179 provides the pseudo-code for the estimator in R.
f_sc <- function(par,yt,ynt) {
  yt <- as.matrix(yt)
  ynt <- as.matrix(ynt)
  omega <- exp(par)/(sum(exp(par)))
  # the restriction on the weights.
  sos <- mean((yt - ynt%*%omega)^2)
  # sum of squares
  return(sos)
}
J <- dim(Y_n1)[2]
a <- optim(par=rnorm(J), fn=f_sc, yt=y1, ynt=Y_n1,
           control = list(trace=0,maxit=1000000))
omega_hat <- exp(a$par)/sum(exp(a$par))
y1_hat_sc <- Ys[31:35,-1]%*%omega_hat
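The trick inside f_sc is to re-parameterize the weights with a logistic (softmax) transform, so that optim can search over unconstrained parameters while the implied weights stay positive and sum to one. A tiny illustration (the particular numbers are arbitrary):

par1 <- c(-1, 0, 2)
# any real numbers
omega1 <- exp(par1)/sum(exp(par1))
# positive weights
sum(omega1)
# equals 1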
The Figure 43 shows that the Abadie, Diamond, and Hainmueller (2010) estimator is able to predict pretty well. Like the OLS estimator presented above, the prediction/actual pairs lie along the 45 degree line. The big difference is that the Abadie, Diamond, and Hainmueller (2010) estimator does not need to be restricted to a handful of control individuals, at least not in this case.124
Regularization
The wide data problem is difficult for standard estimators, but it is the type of problem machine learning estimators excel at. This section presents a popular machine learning estimator called the LASSO. This is a regularized estimator. This means that some “cost” is added to the least squares problem. There is an explicit cost or penalty to adding an additional explanatory variable.
Turducken Estimation
A turducken is a dish in which a chicken is stuffed inside of a duck, which is then stuffed inside a turkey. Many machine learning algorithms have this property of estimation within an estimation. The “chicken” layer is the standard estimation problem in which the tuning parameters are taken as given and not estimated. The chicken layer uses some standard criterion such as least squares or maximum likelihood. The “duck” layer takes the results from the chicken layer and determines the appropriate tuning parameter values. The duck layer uses a measure like “root mean squared error” of the model on the test data to evaluate the tuning parameters. There may even be a turkey layer in which different machine learning models are evaluated. This layer may add a criterion such as speed of estimation or accuracy on new test data withheld from the original problem. Here we limit things to “ducken” estimation.
LASSO
The Stanford statistician, Robert Tibshirani, introduced the idea of a regularized OLS estimator he called LASSO for least absolute shrinkage and selection operator (Tibshirani 1996). In the LASSO, the parameters are regularized through a penalty term which is the sum of their absolute values.125 The problem involves solving a constrained optimization problem.
The chicken layer is a standard least squares estimator. Note that the chicken layer is estimated on a subset of time periods (
where
The duck layer uses the held out sample and chooses
where
One argument for using root mean squared error as the appropriate criteria is that it provides a trade-off between two things we may care about in our prediction. It weights the bias, that is the extent our estimate is wrong in expectation, and the variance, that is the extent that our estimate is wrong in fact.
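Putting the two layers together in generic notation (a sketch; the penalty runs over every coefficient, including the intercept, to match the f_reg function below): for a given tuning parameter $\theta$ the chicken layer solves

$$\hat{\omega}(\theta) = \arg\min_{\omega}\ \frac{1}{T_0}\sum_{t \le T_0}\Big(y_{1t} - \omega_0 - \sum_{j}\omega_j\, y_{jt}\Big)^2 + \theta\sum_{j \ge 0}|\omega_j|,$$

and the duck layer picks the $\theta$ whose $\hat{\omega}(\theta)$ has the smallest root mean squared error on the held-out periods.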
LASSO in R
Using Equation 180 as pseudo-code we have a simple regularized OLS estimator. The estimator defaults to a penalty of zero, which is just least squares.
f_reg <- function(omega,y_t,Y_nt,th1 = 0) {
  y_t <- as.matrix(y_t)
  Y_nt <- as.matrix(Y_nt)
  sos <- (y_t - cbind(1,Y_nt)%*%omega)^2 + th1*sum(abs(omega))
  return(mean(sos, na.rm = TRUE))
}
The code splits the data into a training data set, which is the first 25 periods, and a test data set which is the next 5 periods. We will compare predictions using the last 5 periods.
y1_r <- y1[1:25] # training data
y1_t <- y1[26:30] # test data
Y_n1_r <- Y_n1[1:25,]
Y_n1_t <- Y_n1[26:30,]
The following loop goes through the set of tuning parameters and finds the one that minimizes the root mean squared error.
f_lasso <- function(y_tr,y_tt,Y_ntr,Y_ntt,K=3) {
  y_tr <- as.matrix(y_tr)
  y_tt <- as.matrix(y_tt)
  Y_ntr <- as.matrix(Y_ntr)
  Y_ntt <- as.matrix(Y_ntt)
  set.seed(123456789)
  theta_values <- c(1:K)-1
  J <- dim(Y_ntr)[2]
  omega_th <- NULL
  rmse_old <- 10000 # some large number.
  for (k in 1:K) {
    ai <- optim(par=rep(0,J+1), fn=f_reg, y_t=y_tr,
                Y_nt=Y_ntr, th1 = theta_values[k],
                control = list(maxit=10000))
    omega <- ai$par
    rmse_new <- mean((y_tt - cbind(1,Y_ntt)%*%omega)^2)^(0.5)
    if (rmse_new < rmse_old) {
      omega_th <- omega
      rmse_old <- rmse_new
      #print(rmse_new)
    }
  }
  return(omega_th)
}

omega_hat <- f_lasso(y1_r,y1_t,Y_n1_r,Y_n1_t)
y1_hat_lasso <- cbind(1,Ys[31:35,-1])%*%omega_hat
Testing the LASSO on the synthetic data shows that the estimator does okay. Figure 43 shows that in this example it may have the best predictions of the models tested.
Factor Models
We started off the chapter by saying that we could make the fixed effects model more general by adding a factor model. We call this an interactive fixed effects model (Bai 2009). However, we then presented three other methods for estimating the synthetic control weights. We seemed to forget that we had a factor model.
Why didn’t we simply estimate the factor model? The answer is easy. It is hard.
As a reminder, the problem with estimating the synthetic control is that we don’t have enough observations relative to the number of potential variables. This is called a “wide data” problem. Factor models have been suggested as a solution to this problem for close to 100 years.
Factors models date back to at least Hotelling (1933), although he preferred the term “component models.” Hotelling didn’t like the fact that he was using matrix factorization to factor a factor model, the result of which is that today, economists use “principal components to factor factor models.” I guess we didn’t read the memo the whole way through.
Hotelling’s contribution was to formalize a method in use for solving a wide-data problem in psychometrics. Researchers had individual subjects take a battery of tests, so they had data with a small number of individuals and a large number of test results.
Variations on the model presented below are popular in machine learning, under the names “topic models” or “non-negative matrix factorization.” In machine learning these models are categorized as “unsupervised learning” models, whereas the LASSO is a “supervised learning” model (Blei 2012; Huang 2014).
So why is it hard? Hotelling (1933) pointed out that the matrix factorization problem at the heart of the approach is not identified. There are many non-trivial ways to factor one matrix into two matrices. You may remember that you can factor the number 12 into a number of different factor pairs: 1 and 12, 2 and 6, and 3 and 4. It is similar for matrix factorization. There are many pairs of matrices that can be multiplied together to get the same observed matrix. In trying to solve one identification problem we have run headlong into a new (actually very old) identification problem.
Fu et al. (2016) suggests a solution to the new identification problem. The paper suggests restricting the factor weights to be positive and sum to 1. This is called a convex matrix factorization problem. Fu et al. (2016) and Adams (2018) suggest that this restriction significantly reduces the identification problem. Under certain restrictions, it may even provide a unique solution to the matrix factorization problem.
Matrix Factorization
Consider we observe some matrix
The simplest case is where there is no error term. This is called an **exact factor model**.
where
One way to show that it is *not* possible to find a unique solution to Equation 182 is to insert a square full-rank matrix and its inverse between the two factors. The resulting equation holds for any full-rank matrix of the appropriate dimension.
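Concretely, in generic notation (the symbols here are illustrative): if $Y = F\Lambda$, then for any full-rank $K \times K$ matrix $A$,

$$Y = F\Lambda = (FA)(A^{-1}\Lambda),$$

so the pair $(FA, A^{-1}\Lambda)$ fits the data exactly as well as $(F, \Lambda)$.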
Convex Matrix Factorization
Are factor models useless?
Possibly. One solution is to assume that the weighting matrix is convex. That is, assume that all the values in the matrix are positive and rows sum to 1. Each individual in the data places weights on the factor values that are like probabilities. Note there is no compelling reason to make this restriction other than it is intuitive.
Convex
Note that Assumption 8 is different from the convexity assumption proposed by Abadie, Diamond, and Hainmueller (2010). Here it is that each individual’s weights on the factors sum to 1. In Abadie, Diamond, and Hainmueller (2010) it is that the treated individual places convex weights on the other individuals in the data. One does not imply the other.
The result of this restriction is that we can now uniquely find the matrix factors. Moreover, we may be able to determine exactly how many factors there are. The proof is a picture.
The Figure 41 presents the outcomes (Ys) for the simulated data presented at the start of the chapter in two of the time periods. The black dots represent the results from an exact factor model. In the simulation this is
Unfortunately, in our problem things are not as simple. Our data is represented by following approximate matrix factorization problem.
In the figure, the observed outcomes are represented by the gray circles. These make the triangle more rounded and make it difficult to factorize the matrix. Adams (2018) considers a restriction that there exists some independence of unobservables across time. Given this restriction, the paper shows that a mixture model can be used to identify the model.127
Synthetic Controls using Factors
If we can do the matrix factorization, we can create our synthetic controls. To do this, first note that we can write out the individual as a weighted average of the factor values.
In addition, we can write the potential controls as a different weighted average.
where
Note that both equations have the same matrix of factor values. This means that we can substitute the second into the first. However, first we need to solve for
Substituting into Equation 185.
Thus, we can write the outcomes of individual 1 as a weighted average of the outcomes of the other individuals. This is our synthetic control.
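As a sketch of the algebra implemented in the code below (generic notation): if $\lambda_1$ denotes individual 1's estimated factor weights and $\Lambda_{nt}$ collects the weights of the control individuals, the synthetic control weights and prediction are

$$\omega = \Lambda_{nt}'\left(\Lambda_{nt}\Lambda_{nt}'\right)^{-1}\lambda_1, \qquad \hat{y}_{1t} = \sum_{j \neq 1}\omega_j\, y_{jt}.$$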
Estimating the Weights
The Equation 188 shows that in order to create our synthetic control we need to estimate the factor weight matrix (
To make things doable we will assume that the unobserved characteristic in Equation 174 is normally distributed,
Step 2. Find
Similar to the Abadie, Diamond, and Hainmueller (2010) estimator, this is calculated for each individual
Step 1. Solve for
where
Convex Factor Model Estimator in R
Step 1. Estimate the factor value matrix,
f_cmf <- function(par,Y,K,Reps=10) {
  eps <- 1e-20
  set.seed(123456789)
  Y <- as.matrix(Y)
  T <- dim(Y)[1]
  N <- dim(Y)[2]
  sig <- exp(par[1])
  F <- matrix(exp(par[2:(K*T+1)]),nrow=T)
  p <- matrix(0,T,N)
  for (r in 1:Reps) {
    L <- matrix(runif(N*K),nrow=K)
    L <- as.matrix(t(t(L)/rowSums(t(L))))
    p <- p + dnorm((Y - F%*%L)/sig)
  }
  p <- p/Reps
  # small number added because of logging in the next step.
  log_lik <- -(mean(log(p+eps)) - log(sig))
  return(log_lik)
}
We can test the estimator on the synthetic data described in the introduction. Note we will cheat and set the initial parameter values for the standard deviation equal to its true value. Again it is assumed we know the true number of factors. Note that this is estimated on just the first five time periods.
part <- c(log(.1),rnorm(15,mean=0,sd=1))
a1 <- optim(par=part,fn=f_cmf,Y=Ys[1:5,],K=3,
            control = list(trace = 0,maxit=1000000))
sighat <- exp(a1$par[1])
sighat
[1] 0.2138843
Fhat <- matrix(exp(a1$par[2:length(a1$par)]), nrow=5)
The estimated value for the standard deviation of the error is not too far from its true value of 0.1. The Table 34 presents the estimated factor values for the synthetic data. The true factor values lie between 0 and 3, so these estimates are a little off.
Step 2. Estimate the weighting matrix,
f_lambda <- function(par, Y, F, sig) {
  eps <- 1e-20
  Y <- as.matrix(Y)
  F <- as.matrix(F)
  L <- as.matrix(exp(par)/sum(exp(par)))
  log_lik <- -sum(log(dnorm((Y - F%*%L)/sig) + eps))
  return(log_lik)
}
Lhat <- matrix(NA,3,200)
for (i in 1:200) {
  parti <- rnorm(3,mean=0,sd=20)
  ai <- optim(par=parti, fn=f_lambda, Y=Ys[1:5,i],
              F=Fhat, sig=sighat)
  Lhat[,i] <- exp(ai$par)/sum(exp(ai$par))
  #print(i)
}
The Figure 42 presents a plot of the factor weights. The true values are more spread out than the estimates here.
Step 3. Once the factor model is estimated, the component parts can be used to determine the counter-factual outcome and the treatment effect.
Ln1 <- Lhat[,2:200]
L1 <- Lhat[,1]
Yn1 <- Ys[31:35,2:200]
y1_hat_fm <- Yn1%*%t(Ln1)%*%solve(Ln1%*%t(Ln1))%*%L1
Again, we can compare the predicted values to actual outcomes for individual 1. Figure 43 presents the actual and predicted outcomes with perfect prediction on the 45 degree line. Perhaps unsurprisingly, the factor model seems to do the best. It certainly had the advantage that the data actually came from a factor model.
Returning to Minimum Wage Effects
In Chapter 10 we used a standard fixed effects model to analyze the impact of the 2007-2009 increase in the federal minimum wage on hours worked. The analysis follows the approach of Currie and Fallick (1996) but uses data from NLSY97.
NLSY97 Data
We can now return to the problem of estimating the effect of the federal minimum wage increase. In order to reduce the computational burden, the size of the treated and non-treated groups is substantially restricted.
# using the same data as Chapter 9
N2 <- 600 # restrict for computational reasons.
T2 <- 18
Y2 <- Y[,1:N2]
W2 <- W[,1:N2]
rate_07 <- W2[11,]/Y2[11,]
# calculates the wage rate for each person.
treat1 <- ifelse(rate_07<7.26 | W2[11,]==0 | Y2[11,]==0,1,0)
treat <- matrix(0,T2,N2)
for (i in 11:T2) {
  treat[i,] <- treat1[1:N2]
}
lm1 <- f_fe(Y2, treat)
# fixed effects regression.
Y2 <- matrix(NA,T2,N2)
Y2[-lm1$na.action] <- lm1$residuals
# na.action is an index of the NAs created in lm()
# remember we can index a matrix as if it is a vector.
Y_t <- Y2[,treat1==1]
Y_t <- Y_t[,is.na(colSums(Y_t))==0]
Y_t <- Y_t[,1:5] # restrict for computational reasons.
Y_nt <- Y2[,treat1==0]
Y_nt <- Y_nt[,is.na(colSums(Y_nt))==0]
# split into treated and non-treated groups
# keep only those with data observed in all time-periods.
Synthetic Control Estimates of the Minimum Wage Effect
N_t <- dim(Y_t)[2]
N_nt <- dim(Y_nt)[2]
mat_treat <- matrix(NA,8,N_t)
for (i in 1:N_t) {
  pari <- rnorm(N_nt)
  ai <- optim(par=pari, fn=f_sc, yt=Y_t[1:10,i],
              ynt=Y_nt[1:10,],
              control = list(trace=0,maxit=100000000))
  omega <- exp(ai$par)/sum(exp(ai$par))
  mat_treat[,i] <- Y_t[11:18,i] - Y_nt[11:18,]%*%omega
  #print(i)
}
mat_treat <- mat_treat + lm1$coefficients[2]
# add back in the results of the fixed effects.
plot_res_sc <- matrix(NA,8,3)
for (i in 1:8) {
  plot_res_sc[i,] <- quantile(mat_treat[i,],c(0.2,0.5,0.8),
                              na.rm = TRUE)
}
# quantiles of results at 0.2, 0.5, and 0.8.
The weights are estimated separately for each individual. Then the treatment effect is measured for each individual.
The Figure 44 presents the distribution of the treatment effect for the five individuals analyzed. At least for these five people, the effect of the minimum wage increase is not large.
LASSO Estimates of the Minimum Wage Effect
The estimator goes through a number of different tuning parameters and then presents the results for the one that minimizes root mean squared error on the test data set.
Y_tr <- Y_t[1:8,] # training data
Y_tt <- Y_t[9:10,] # test data
Y_ntr <- Y_nt[1:8,]
Y_ntt <- Y_nt[9:10,]
treat3 <- matrix(NA,8,dim(Y_t)[2])
for (i in 1:dim(Y_t)[2]) {
  omega <- f_lasso(y_tr=Y_tr[,i],y_tt=Y_tt[,i],
                   Y_ntr=Y_ntr,Y_ntt=Y_ntt,K=10)
  treat3[,i] <- Y_t[11:18,i] - cbind(1,Y_nt[11:18,])%*%omega
  #print(i)
}
treat3 <- treat3 + lm1$coefficients[2]

plot_res_ls <- matrix(NA,8,3)
for (i in 1:8) {
  plot_res_ls[i,] <- quantile(treat3[i,],c(0.2,0.5,0.8),
                              na.rm = TRUE)
}
The results with this estimator are similar to those presented above. See Figure 44. You should check to see what happens when the set of tuning parameters is changed. The number of treated and non-treated individuals has also been restricted to reduce computational time.
Factor Model Estimates of the Minimum Wage Effect
The convex factor model approach is slightly different from that above.
Step 0. Set the data up. Note that the panel is aggregated up to half-decades. This is done for two reasons. First, it reduces the computational time for the problem. Second, it accounts for the possibility that there is auto-correlation in the annual data.129 Ruling out auto-correlation is important for the identification argument presented in Adams (2018). It also allows more factors to be observed.130 However, the estimator uses most of the individual observations. We don’t need to restrict it for computational reasons.
Y_1t <- colMeans(Y[1:5,treat0==1], na.rm = TRUE)
Y_2t <- colMeans(Y[6:10,treat0==1], na.rm = TRUE)
Y_3t <- colMeans(Y[11:15,treat0==1], na.rm = TRUE)
Y_1nt <- colMeans(Y[1:5,treat0==0], na.rm = TRUE)
Y_2nt <- colMeans(Y[6:10,treat0==0], na.rm = TRUE)
Y_3nt <- colMeans(Y[11:15,treat0==0], na.rm = TRUE)
Y_t <- rbind(Y_1t,Y_2t,Y_3t)
index_t_na <- is.na(colMeans(Y_t))==0
Y_t <- Y_t[,index_t_na]
Y_nt <- rbind(Y_1nt,Y_2nt,Y_3nt)
Y_nt <- Y_nt[,is.na(colMeans(Y_nt))==0]
Y_fm <- cbind(Y_t,Y_nt)
N_t <- dim(Y_t)[2]
N_nt <- dim(Y_nt)[2]
Step 1. Estimate
part <- c(log(400),rnorm(6,mean=0,sd=10))
a <- optim(par=part, fn=f_cmf,
           Y=Y_fm[1:2,], K=3, Reps=100,
           control = list(trace = 0,maxit=1000000))
# sigma
sighat <- exp(a$par[1])
sighat
[1] 411.6859
Fhat <- matrix(exp(a$par[2:length(a$par)]), nrow=2)
The Table 35 shows that there is a fair amount of variation in the hours.131 The factors make points of a triangle at 0 hours for both time periods, 0 hours for period 1 and 4,000 hours for period 2, and 1700 hours for period 1 and 0 hours for period 2.
Step 2. Determine the weights.
Lhat <- matrix(NA,3,dim(Y_fm)[2])
set.seed(123456789)
for (i in 1:dim(Y_fm)[2]) {
  ai <- optim(par=rnorm(3,mean=0,sd=1), fn=f_lambda,
              Y=Y_fm[1:2,i], F=Fhat, sig=sighat)
  Lhat[,i] <- exp(ai$par)/sum(exp(ai$par))
  #print(i)
}
The Figure 45 shows that the weights are spread out, perhaps unsurprising given that there are over 5,000 individuals.
Lhat_t <- Lhat[,1:N_t]
Lhat_nt <- Lhat[,(N_t+1):dim(Lhat)[2]]
Omega <- t(Lhat_nt)%*%solve(Lhat_nt%*%t(Lhat_nt))%*%Lhat_t
itt <- Y_t[3,] - Y_nt[3,]%*%Omega
summary(as.vector(itt))
Min. 1st Qu. Median Mean 3rd Qu. Max.
-2314.8 -1263.9 -648.6 -529.5 124.8 3371.4
The Figure 46 presents the individual treatment effect for the treated. It shows that a substantial proportion of individuals work fewer hours than they would have if they had not been treated by the increase in the federal minimum wage. That said, it is not clear that these people are worse off. In fact, the median treated individual had their average income increase over $7,000 in the 5 years following the minimum wage increase.
W_2t <- colMeans(W[6:10,treat0==1], na.rm = TRUE)
W_2t <- W_2t[index_t_na]
W_3t <- colMeans(W[11:15,treat0==1], na.rm = TRUE)
W_3t <- W_3t[index_t_na]
inc <- W_3t - (Y_3t[index_t_na] + itt)*W_2t/Y_2t[index_t_na]
summary(as.vector(inc[is.na(inc)==0 & is.infinite(inc)==0]))
Min. 1st Qu. Median Mean 3rd Qu. Max.
-4411593 1760 7630 55703 15793 53364583
Discussion and Further Reading
Panel data can be used to estimate the treatment effect by comparing outcomes before and after the treatment occurs. The problem is that time also affects the outcomes. In Chapter 10 this is solved by assuming that time has the same effect on the treated group as it does on the non-treated group. This chapter considers relaxing the assumption.
A synthetic control aims to construct a prediction of the counter-factual outcome. For the treated units, we observe their outcome after they have been treated, but not the outcome they would have had if they had not been treated. One way to construct the prediction is to use a weighted average of the outcomes for the non-treated units in the treated period. This is, in fact, what was done in Chapter 10. The difference here, is that pre-treatment period is used to find the appropriate weights for each treated unit.
The simplest and most obvious weights are the OLS weights. However, with only a few time periods and a large number of units, these weights are not identified. Abadie, Diamond, and Hainmueller (2010) suggests two restrictions, that the intercept is zero and the weights add to 1.
An alternative approach to restricting the model is to use a machine learning tool such as regularization. The chapter goes through a popular algorithm called LASSO. This is a least squares algorithm with an extra constraint that restricts the sum of the absolute value of the coefficients. The LASSO model tends to set coefficient values to zero. This leads to biased estimates, but potentially better predictions.
The chapter also shows that instead of making restrictions inferred by a particular factor model, we could simply use a factor model. Hotelling (1933) showed that the factor model is also not identified. The set of matrix factors that are consistent with the data is not unique. However, Fu et al. (2016) shows that restricting the factor weights to be positive and add to 1 (convex), it is possible to find unique matrix factors. The chapter presents a maximum-likelihood algorithm for finding the synthetic controls based on the convex factor model.
An excellent review of the various synthetic control techniques is Doudchenko and Imbens (2016). Guido Imbens, Susan Athey and co-authors have some review papers on using machine learning techniques in economics including in panel data models (Athey et al. 2019).
Mixture Models
Introduction
I am a colon cancer survivor. When I was diagnosed, I was told that I had about a 75% probability of surviving 5 years (that was ten years ago!). My high survival probability is due to a combination of surgery and chemotherapy. Prior to the introduction of chemotherapy, the prognosis for someone with my diagnosis was 50%. The original chemos increased the 5-year survival to 70% and then newer chemos increased it to 75%. Am I alive because I had chemo?
I don’t know.
The chemo may have done nothing. Prior to chemo, half the people with my diagnosis survived. I could have survived without chemo.
The example suggests that the treatment effect of chemotherapy is heterogeneous. For some, it has no effect, while for others it saves their life. There may even be a group of people that chemo harms. Imagine if we could determine which group each patient is in? Many fewer people would be exposed to the horrors of chemotherapy. Mixture models may be the answer.
Mixture models have become an important part of the econometrician’s tool kit. However, in many cases they are not identified without strong and non-credible assumptions. Moreover, many researchers seem to be unaware that their results are relying on such non-credible assumptions.
This part of the book highlights how repeated measurement can be used to predict the effect of policies. This chapter considers more general cases than the panel data used in Chapters 10 and 11. The chapter uses the mixture model to solve the problem of measurement error. The chapter considers the problem of estimating returns to schooling using data from twins. It also returns to the question of whether an increase in the minimum wage will cause restaurants to decrease employment. The chapter highlights the great promise of mixture models, as well as the identification problems that arise.
Two-Type Mixture Models
Consider the simplest mixture problem. There are two hidden types which we will denote
The density of
The question is whether we can determine the component parts of the mixture by observing the outcomes (
In my example above, the hidden types may represent how a particular colon cancer patient is affected by chemotherapy. Can we uncover the true effects (the
Simulation of Two-Type Mixture
The simulation is based on the model presented above. Note that there are two observed outcomes
set.seed(123456789)
require(mvtnorm)
N <- 10000
mu <- c(0,10,-1,3)
Sig <- diag(1,4)*c(2,1,3,1)
# diagonal matrix with different variances.
x <- rmvnorm(N,mean=mu,sigma=Sig)
alpha <- 0.4
ranN <- runif(N)
y1 <- ifelse(ranN < alpha,x[,1],x[,2])
y2 <- ifelse(ranN < alpha,x[,3],x[,4])
Looking at Figure 47 and the distribution of
Simply observing the outcome may not be enough to discern the underlying distributions. If it was, the chapter would be very short.
Knowing the Component Distributions
What if we knew the underlying type distributions? Could we back out the mixture weights? The answer is yes. To see this, consider the second mixture distribution broken into its component parts. In the simulated data we have the following relationship.
where
As we have a linear equation with one unknown (
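In generic notation (a sketch consistent with the calculation below): if the mixture density is $f_{y_2}(y) = \alpha f_1(y) + (1-\alpha) f_2(y)$, where $f_1$ is the first component density ($N(-1,3)$ in the simulation) and $f_2$ the second ($N(3,1)$), then at any point $y$

$$\alpha = \frac{f_{y_2}(y) - f_2(y)}{f_1(y) - f_2(y)}.$$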
At the median, we can calculate the density of the mixed distributions and use the formula presented in Equation 195. We get 0.43 where the true value is 0.4.
q_y2 <- quantile(y2,0.5)
a1 <- density(y2)$y[315] # about the median
b1 <- dnorm(q_y2,mean=3)
c1 <- dnorm(q_y2,mean=-1,sd=3^(0.5))
(a1 - b1)/(c1 - b1)
50%
0.425599
In some cases it is reasonable to assume the component distributions are known. In Appendix A, we analyze a statistical approach called empirical Bayesian estimation. This approach relies on the fact that we know the component distributions under random sampling.
Observing Multiple Signals
What if we don’t know the component distributions? In general, we are back to square one.
We can make progress if we observe more than one outcome being generated by the same process. Consider the simulated data generated above. I didn’t highlight it, but both
We can back out the underlying mixtures by looking carefully at the distribution of one outcome conditional on the second outcome. To see this, write out the probability of observing a certain value of
The simulated data satisfies an important assumption. Conditional on knowing the state, the value of
The conditional probability is a mixture over the two component distributions. Importantly, the mixture weights can vary. The value of
Rearranging we have the conditional probability as a function of objects that vary with
If there is a case where
To illustrate, the next loop creates a matrix of conditional probabilities for our two simulated outcomes (
Fu et al. (2016) point out that if these probabilities are precisely estimated then they will lie on a line between the two true underlying states. Moreover, the observed end points will correspond to the two true states. Identifying the model is simply a matter of plotting the data.
min1 <- min(y1,y2)
max1 <- max(y1,y2)
K1 <- 10
K2 <- 7
eps <- 1e-5
del1 <- (max1 - min1)/K1
del2 <- (max1 - min1)/K2
prob_mat <- matrix(NA,K2+1,K1+1)
# conditional probability matrix
prob_mat_x <- matrix(NA,K2+1,3)
# actual hidden type conditional probability matrix
low1 <- min1 - eps
high1 <- low1 + eps + del1
for (k1 in 1:K1) {
  low2 <- min1 - eps
  high2 <- low2 + eps + del2
  prob_mat[1,1+k1] <- (low1 + high1)/2
  y1_c <- y1[y2 > low1 & y2 < high1] # conditioning
  for (k2 in 1:K2) {
    prob_mat[1+k2,1] <- (low2 + high2)/2
    prob_mat[1+k2,1+k1] <- mean(y1_c > low2 & y1_c < high2)
    # probability conditional on 2nd signal
    prob_mat_x[1+k2,2] <- mean(x[,1] > low2 & x[,1] < high2)
    prob_mat_x[1+k2,3] <- mean(x[,2] > low2 & x[,2] < high2)
    # probabilities conditional on actual hidden type.
    low2 <- high2 - eps
    high2 <- low2 + eps + del2
    #print(k2)
  }
  low1 <- high1 - eps
  high1 <- low1 + eps + del1
  #print(k1)
}
prob_mat_x[,1] <- prob_mat[,1]
The Figure 48 plots the probabilities of two outcomes, one in the left tail and one in the right. These probabilities are denoted by the circles. Each probability is calculated given different values for
The Figure 48 shows that we can use the result of Fu et al. (2016) to identify the mixture model. If the conditional probabilities are estimated accurately then they must lie on the line between the two true types. If the conditional probabilities vary enough then the extremes will correspond to the case where only one type is true.
Two Signal Mixture Models
Below, we consider the measurement error problem associated with self-reported education levels. We are interested in comparing the difference in earnings for twins with different education levels. The problem is that most twins get the same level of education. This means that our analysis may be impacted by measurement error. It may be that all the differences we observe are due to reporting mistakes.
The proposed solution is to use two signals of the education level for each person in the study. Luckily, in the data set the twins were asked about both their own education level and the education level of their sibling. This means that for each person we have two signals of their true education level. One signal is the person’s own report of their own education level. The second signal is the person’s sibling’s report of their education level.
Model of Two Signal Data
Consider a problem in which there are
For simplicity assume that the type is represented by
where
Simulation of Two Signal Data
In the simulated data there are four unknown types, say four education levels.
set.seed(123456789)
N <- 10000
u <- c(1:4)
p <- c(0.2,0.5,.25,0.05)
ranN <- runif(N)
theta <- ifelse(ranN < p[1], u[1],
         ifelse(ranN < p[2], u[2],
         ifelse(ranN < p[3], u[3], u[4])))
e1 <- rnorm(N,mean=0,sd=0.3)
e2 <- rnorm(N,mean=0,sd=0.3)
x1 <- 1 + .25*theta + e1
x2 <- 0.5 + 0.1*theta + e2
summary(x1)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.2179 1.3847 1.7037 1.6980 2.0207 3.2098
summary(x2)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.3676 0.5651 0.7897 0.7848 1.0046 1.9819
The two simulated signals have overlapping distributions. We can extend the ideas above to use these two signals to determine the unobserved types.
Conditional Independence of Signals
We can write down the joint density by the Law of Total Probability.
We are interested in estimating the distribution of the hidden types (
However, if we are willing to assume that
It is a subtle but important change. In the first case we cannot determine
We have assumed that we have two signals and that these signals are independent of each other conditional on the true underlying type. I think this assumption is reasonable for our twins' education reporting example. It implies that once we know the true education level for a person, observing the sibling's report provides no additional information about their own report.
Two Signal Estimator
The algorithm for solving the mixture model is suggested by Benaglia, Chauveau, and Hunter (2009). The authors suggest simultaneously solving three equations.
The first is the posterior (
The second is the prior distribution which is calculated as the posteriors weighted by the probability of the two outcomes
The third is the likelihood function for the outcomes
Note that there may be multiple priors that satisfy these equations. There is no guarantee that there will be a unique solution. However, depending on the data, bounds around the solution set may be tight.
Two Signal Estimator Algorithm
Benaglia, Chauveau, and Hunter (2009) suggest the following algorithm for solving the system.
Step 0. Let
Step 1. Solve for the likelihood functions given the categorization.
Step 2. Solve for the prior.
Step 3. Solve for the posterior.
Step 4. Determine types.
Go to Step 1.
Stop when the estimate of the prior (
Mixture Model Estimator in R
Mixture models can be estimated using npEM() from the package mixtools. The name stands for non-parametric EM estimator, even though it is not actually an EM estimator.135 The estimator below is based on the npEM() code. The original uses calls to C; this version uses R's built-in density() function.
mixmod_fun <- function(X,centers=2,tol=1e-9,
                       maxiter=500,verb=FALSE) {
  X <- as.matrix(X)

  # find the initial set of types using kmeans().
  a1 <- kmeans(X,centers)

  # the function accepts either a matrix or just the number
  # of hidden types.
  if (length(centers) > 1) {
    K <- dim(centers)[1]
  } else {
    K <- centers
  }
  t1 <- a1$cluster
  T1 <- matrix(NA,dim(X)[1],K)
  for (k in 1:K) {
    T1[,k] <- ifelse(t1==k,1,0)
  }

  # determine the initial prior estimate
  g1 <- colMeans(T1)

  # loop to calculate the prior and posterior
  diff <- sum(abs(1 - g1))
  i <- 0
  while (diff > tol & i < maxiter) {
    # calculate the posterior
    gamma <- t(matrix(rep(g1,dim(X)[1]), nrow=K))

    # calculate the likelihood functions
    for (k in 1:K) {
      for (j in 1:dim(X)[2]) {
        h1 <- density(X[t1==k,j],from=min(X),to=max(X))
        h1$y <- h1$y/sum(h1$y)
        gamma[,k] <-
          gamma[,k]*sapply(c(1:dim(X)[1]), function(x)
            h1$y[which.min((h1$x - X[x,j])^2)])
      }
    }
    gamma <- gamma/rowSums(gamma)

    # update the categorization of types
    t1 <- max.col(gamma)

    # update the prior estimate
    g2 <- colMeans(gamma)

    # check for convergence
    diff <- sum(abs(g1 - g2))
    g1 <- g2
    i <- i + 1
    if (verb) {
      print(diff)
      print(i)
    }
  }
  return(list(g = g2, gamma=gamma, t = t1))
}
centers <- cbind(1 + 0.25*u, 0.5 + 0.1*u)
# columns are number of signals (2 in this case)
# rows are number of types (4 in this case.)
a <- mixmod_fun(X=cbind(x1,x2),centers=centers)
We can use this model to determine the underlying types from simulated data. Figure 49 shows that the mixture model does a pretty good job of discerning the underlying types.
mean(x1[a$t==1])
[1] 1.190989
mean(x1[a$t==2])
[1] 1.471282
mean(x1[a$t==3])
[1] 1.876143
mean(x1[a$t==4])
[1] 2.135286
The estimator finds that the hidden types are
Twins Reports and Returns to Schooling
The book returns to more or less where it began, with the question of measuring returns to schooling. The bane of the econometrician’s efforts to estimate returns to schooling is “unmeasured ability.” Ashenfelter and Krueger (1994) and Ashenfelter and Rouse (1998) suggest a unique solution - find a set of twins, ask them their education level and their wage and measure the differences in wages associated with the differences in education levels. This first-difference approach assumes that twins have the same unmeasured ability and differences it out.
The problem is measurement error. Most twins have the same education level, so any observed difference in education level may not be real. Observed differences may be due to one or the other of the twins misreporting his education level. Such measurement error is likely to lead to estimates of returns to education that tend toward 0.
The section shows how a mixture model estimator can be used to estimate each person’s true education level conditional on his own report and the report of his sibling. These true education levels are then used to estimate returns to schooling.
Mixture Model of Twin Reports
The mixture model approach solves the measurement error problem in two steps. In the first step, each individual’s education level is assumed to be measured by two different signals. The first signal is the individual’s own report of his education level. The second signal is the individual’s sibling’s report of his education level. Note that a sibling’s report is likely to have different characteristics than an individual’s own report.
To use the approach above we need to assume that the individual’s own report is independent of their sibling’s report of their education level, conditional on the true education level. This assumption seems quite reasonable. It implies that the reports only differ from the truth due to errors and mistakes.
Here, these two signals are used to estimate four “types” of individuals. Each type has a conditional distribution over education levels, where each is labeled for the mode of the conditional distribution of education level - 12, 14, 16 and 18.136 These levels line up with the certificate-granting years - high school diploma, associate's degree, bachelor's degree and master's degree. To be clear, while the number of latent types is imposed on the estimated model, the conditional distributions (and their modes) are determined by the data.
The second step uses the results to estimate returns to schooling. It takes estimates from the mixture model to calculate the posterior probability of being a member of each education level type. These probabilities are used to weight the average education level for each type to give each person’s estimated education level - the “true” education level. Each twin is then matched with his sibling to calculate the difference in estimated true education levels. These differences are used in the place of the naive and IV estimated differences in the first difference regression models presented in Ashenfelter and Rouse (1998).
Twin Reports Data
The analysis uses the same data as Ashenfelter and Rouse (1998).137
require(foreign) # package for reading older Stata data sets.
x <- read.dta("pubtwins.dta")
Warning in read.dta("pubtwins.dta"): cannot read factor labels from Stata 5
files
xf <- x[is.na(x$first)==0,] # first twin
xs <- x[is.na(x$first)==1,] # second twin
# educ "Twin1 Education";
# educt "Twin1 Report of Twin2 Education";
# educ_t "Twin2 Education";
# educt_t "Twin2 Report of Twin1 Education";
# the data is written out in pairs for each set of twins.
lm1 <- lm(dlwage ~ deduc, data=xf)
lm2 <- lm(dlwage ~ deduc + duncov + dmaried + dtenure,
          data=xf)
Figure 50 shows two important takeaways from the data. First, for many twins there is no difference in the education levels. Second, there is a distinct positive relationship between differences in education level and income for the twins in the sample.
Ashenfelter and Krueger (1994) and Ashenfelter and Rouse (1998) present evidence that first differencing gives biased estimates. This bias is due to measurement error, which is exacerbated by the first differencing. The authors' solution is to use an IV approach in which alternative reports are used as an instrument for the individual's own reports. The naive approach suggests an effect of 7% while the IV approach gives an estimate of 8.8%. Can you replicate these IV estimates?
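As a hedged sketch of that replication (not the authors' exact code), the difference in self-reports can be instrumented with the difference in cross-reports using ivreg() from the AER package; the derived variable name deduc_cross is ours, not a variable in the data set.
require(AER) # provides ivreg() for instrumental variables regression
# cross-report difference: twin 2's report of twin 1 minus
# twin 1's report of twin 2 (constructed here, not in the data set)
xf$deduc_cross <- xf$educt_t - xf$educt
iv1 <- ivreg(dlwage ~ deduc | deduc_cross, data = xf)
summary(iv1)$coefficients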
Mixture Model Approach to Measurement Error
The IV approach suggested by Ashenfelter and Krueger (1994) makes very strong assumptions about the structure of the measurement error.138 The mixture model approach suggests that the measurement error issue can be addressed without the need for the strong assumptions of IV. Note that both the IV method and the mixture model method assume that reports are conditionally independent.
The first step is to determine the education level associated with each type.
We need to split the data between “first” and “second” twins so that we can make the correct comparison.
set.seed(123456789)
yf <- cbind(xf$educ,xf$educt_t)
ys <- cbind(xs$educ,xs$educt_t)
y1 <- rbind(yf,ys)
The data is recombined to estimate the mixture model.
centers <- matrix(rep(c(12, 14, 16, 18),2), nrow = 4)
b1 <- mixmod_fun(y1,centers,tol=1e-25)
Figure 51 presents the estimates of the likelihood of observing the twin's own report conditional on their true education level being 12, 14, 16 or 18. The plots show that for some people who report 13 years, the true education level is 12 years, while for others it is 14 years.
The plots suggest that the model is not point identified. Adams (2016) shows that point identification depends on the existence of cases where the likelihood uniquely determines the type. The plot shows that the likelihoods have tails that always overlap. This means that no likelihood is able to uniquely determine the type. That said, the existence of high likelihood ratios suggests that the identified set is tightly bound.
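A minimal sketch of how plots in the spirit of Figure 51 can be produced from the estimates (the panel layout and labels are ours):
# density of the own report conditional on each estimated type
par(mfrow = c(2,2))
for (k in 1:4) {
  plot(density(y1[b1$t == k, 1]),
       main = paste("estimated type", k),
       xlab = "own report of education")
}
par(mfrow = c(1,1))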
Estimating Returns to Schooling from Twins
Using the posterior estimate of each twin’s “type” we can assign them and their sibling an education level. Then we can calculate the first-difference.
ed <- c(12,14,16,18)
xf$deduc <- b1$gamma[1:340,]%*%ed - b1$gamma[341:680,]%*%ed
# difference in "true" education levels
# for each twin
lm3 <- lm(dlwage ~ deduc, data = xf)
lm4 <- lm(dlwage ~ deduc + duncov + dmaried + dtenure, data=xf)
|             | (1)      | (2)      | (3)      | (4)      |
|-------------|----------|----------|----------|----------|
| (Intercept) | 0.030    | 0.026    | 0.032    | 0.030    |
|             | (0.028)  | (0.026)  | (0.027)  | (0.026)  |
| deduc       | 0.061    | 0.073    | 0.068    | 0.082    |
|             | (0.019)  | (0.018)  | (0.019)  | (0.018)  |
| duncov      |          | 0.086    |          | 0.085    |
|             |          | (0.056)  |          | (0.055)  |
| dmaried     |          | 0.024    |          | 0.049    |
|             |          | (0.073)  |          | (0.073)  |
| dtenure     |          | 0.023    |          | 0.024    |
|             |          | (0.003)  |          | (0.003)  |
| Num.Obs.    | 340      | 333      | 340      | 333      |
| R2          | 0.031    | 0.170    | 0.036    | 0.178    |
| R2 Adj.     | 0.028    | 0.160    | 0.033    | 0.168    |
| AIC         | 507.5    | 455.5    | 505.6    | 452.2    |
| BIC         | 518.9    | 478.3    | 517.1    | 475.0    |
| Log.Lik.    | -250.727 | -221.740 | -249.803 | -220.085 |
| RMSE        | 0.51     | 0.47     | 0.50     | 0.47     |
Table 36 presents the OLS results for log wages on the first difference in education levels. The standard errors are reported in parentheses. Note that these standard errors do not account for the mixture model estimation used in calculating the first difference in education level. These results can be compared to the first difference estimates in Ashenfelter and Rouse (1998). The mixture model estimates are somewhat higher, suggesting that measurement error is in fact biasing down the first difference estimates. However, they are lower than the IV estimates (not presented).
Revisiting Minimum Wage Effects
I began the chapter by claiming that mixture models could be used to estimate heterogeneous treatment effects. This section tests that idea on Card and Krueger’s controversial minimum wage paper.139 The original paper was controversial for two reasons. First, it purported to show that a bedrock result of economics wasn’t true. Economic theory predicts that a minimum wage increase will lead to a decrease in the demand for labor. The difference in difference results showed that the minimum wage increase in New Jersey caused a (small) increase in the demand for labor. Second, the paper was used to justify an increase in the federal minimum wage, a controversial policy.
A possible explanation for the result is that the change in the minimum wage in New Jersey had heterogeneous effects. Economic theory is clear that an increase in the cost of labor hours will lead to a decrease in demand for labor hours. It is less clear on how that decrease will be achieved. Will restaurants decrease the number of full-time staff or the number of part-time staff or substitute from full-time to part-time or the other way around?
Restaurant Employment
The analysis uses the same data as Chapter 10, but with the mixture model method described above. The analysis uses the pre-treatment period (period 1) to estimate the hidden types. With these in hand, it uses the post-treatment period (period 2) to estimate the difference in difference for each hidden type. Note that the section assumes that the types do not change from period 1 to period 2. The two signals of the restaurant’s hidden type are the number of full-time employees and the number of part-time employees.
Y <- rbind(x3$FT[x3$TIME==1],x3$PT[x3$TIME==1],
           x3$FT[x3$TIME==2],x3$PT[x3$TIME==2])
Y1 <- Y[1:2,is.na(colSums(Y))==0] # period 1
Y2 <- Y[3:4,is.na(colSums(Y))==0] # period 2
treat <- x3$STATE[x3$TIME==1]==1
treat1 <- treat[is.na(colSums(Y))==0]
Table 37 presents summary statistics for the number of full-time and part-time employees in the surveyed restaurants. Note that these firms tend to have many fewer full-time employees than part-time employees.
Mixture Model Estimation of Restaurant Type
In the education example above there was a clear measure of the “hidden type,” the years of education, particularly the diploma granting years. Here there is no clear indication. I proceed by guessing that there are firms that hire small, medium and large numbers of full-time employees and part-time employees. For full-time these are the groups centered at 2, 6 and 11. For part-time they are the groups centered at 11, 17 and 25. Moreover, there are some firms that hire a small number of both, and some that hire a small number of one and a large number of the other. Lastly, through trial and error I pared down the initial groups to 5.140
p1 <- c(2,6,2,2,6)
p2 <- c(11,17,17,25,11)
a <- mixmod_fun(X=t(Y1),centers=cbind(p1,p2),tol=1e-15)
a$g
[1] 0.30724141 0.07859604 0.26042828 0.17348538 0.18024888
Figure 52 shows that there are more or less five types of restaurants. There are small restaurants that hire a small number of both types of employees. There are small restaurants that tend to hire part-time employees and ones that tend to hire full-time employees. Then there are large restaurants that either have more part-time employees or more full-time employees.
Heterogeneous Minimum Wage Effect
The model determines the hidden type for each restaurant. With this information, we can run the difference in difference estimator presented in Chapter 10 for each type.
Figure 53 suggests that there was an effect of the increase in the minimum wage on small restaurants. There seems to be a shift down in the number of full-time and part-time employees that these firms hire after the minimum wage increases. The cloud disperses, with some firms increasing employment, but relative to Pennsylvania firms there seem to be more firms with lower or zero levels of employment of part-time workers.
To see whether there is an effect for the small type we can estimate the difference in difference model for full-time workers. The estimate suggests that the minimum wage increase reduced full-time employment by 17% for small restaurants.
y_11 <- c(Y1[1,][!treat1 & a$t==1],Y1[1,][treat1 & a$t==1])
# full-time employment for type 1 restaurants
# prior to the minimum wage increase
y_12 <- c(Y2[1,][!treat1 & a$t==1],Y2[1,][treat1 & a$t==1])
# full-time employment for type 1 restaurants after
# the minimum wage increase.
treat = c(rep(0,length(Y1[1,][!treat1 & a$t==1])),
          rep(1,length(Y1[1,][treat1 & a$t==1])))
did1 <- f_did(rbind(y_11,y_12),treat)
did1[3,3]/did1[1,1]
[1] -0.1710775
We can do the same analysis for part-time workers. Here the effect of the minimum wage increase is smaller, an 8% decrease in part-time workers for small restaurants.
y_13 <- c(Y1[2,][!treat1 & a$t==1],Y1[2,][treat1 & a$t==1])
y_14 <- c(Y2[2,][!treat1 & a$t==1],Y2[2,][treat1 & a$t==1])
treat = c(rep(0,length(Y1[2,][!treat1 & a$t==1])),
          rep(1,length(Y1[2,][treat1 & a$t==1])))
did2 <- f_did(rbind(y_13,y_14),treat)
did2[3,3]/did2[1,1]
[1] -0.08023572
The mixture model analysis finds that for type 1 restaurants (small restaurants) the minimum wage increase is associated with a large decrease in the number of full-time employees and a smaller but significant decrease in the number of part-time employees.
Figure 54 presents the distribution of part-time and full-time employees for each of the other four hidden types. The figure shows that it is difficult to determine if there is any measurable effect of the minimum wage increase. In each case, the data is spread out in a cloud with no discernible decrease in employment. Can you replicate the difference in difference estimates for these types?
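A sketch of one way to replicate them, assuming f_did() from Chapter 10 returns a matrix whose [3,3] and [1,1] entries give the difference in difference and the base level, as used above:
# proportional difference in difference for full-time employment,
# repeated for each of the remaining hidden types
for (k in 2:5) {
  treat_k <- c(rep(0, sum(!treat1 & a$t==k)),
               rep(1, sum(treat1 & a$t==k)))
  y_1 <- c(Y1[1,][!treat1 & a$t==k], Y1[1,][treat1 & a$t==k])
  y_2 <- c(Y2[1,][!treat1 & a$t==k], Y2[1,][treat1 & a$t==k])
  did_k <- f_did(rbind(y_1, y_2), treat_k)
  print(did_k[3,3]/did_k[1,1])
}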
Discussion and Further Reading
Mixture models play an important role in microeconometrics. Unfortunately, mixture models are generally not identified without reliance on hard-to-justify parametric restrictions. The chapter considers the case where the econometrician observes multiple signals of the same underlying process. An example is reports of educational attainment. In this case we have two reports, one self-reported and the other, a report from the person’s sibling. These two signals can be used to bound the distribution of the prior. In some cases the bounds can be tight.
Hu (2017) presents a comprehensive review of the three signal identification result and its applications. P. A. Haile and Kitamura (2019) limits the application to auctions with unobserved heterogeneity, but the paper provides an excellent overview of the various problems that occur in econometrics and a discussion of their solutions. The estimator presented here is based on Benaglia, Chauveau, and Hunter (2009). While very nice to use, it is not guaranteed to converge. Levine, Hunter, and Chauveau (2011) suggest a similar algorithm which does have nice convergence properties. This is a majorization-minimization algorithm.
The question of returns to schooling bookends this book. Chapters 1 and 2 presented OLS estimates using data from the NLSM 1966. The analysis suggested that one additional year of schooling increased wages by 7.8%. Card (1995) pointed out that this estimate is biased because there are unobserved characteristics that affect both the amount of schooling an individual receives and their income. The paper uses distance to college as an instrument and estimates that the return to schooling is closer to 14%. Chapter 3 also presents LATE estimates that vary from 19% to 32%, although those estimates don't include other explanatory variables. These results suggest that there is a fair amount of heterogeneity in school returns. Chapter 6 returns to the question using the Heckman selection model and gets an average effect of 14%. The model estimates that there is variation in school returns and in fact returns are not positive for everyone. Chapter 9 uses GMM to re-estimate Card's IV approach with two instruments. It finds that returns are slightly smaller than suggested by OLS. This chapter uses reports from twins to estimate returns to schooling. After accounting for measurement error with a mixture model, the twins analysis suggests an average return of 8% for each additional year of schooling.
The chapter also revisits Card and Krueger (1994) and their analysis of the minimum wage increase in New Jersey. The result suggests that the minimum wage increase resulted in a substantial reduction in employment for small restaurants.
Measuring Uncertainty
Introduction
In microeconometrics it is important to provide information to the reader about how uncertain we are about the estimates. Standard microeconomic models of decision making suggest that people are risk averse and care about the uncertainty surrounding estimates (Kreps 1990). Less standard, but increasingly common models suggest that decision makers care explicitly about the extent of the uncertainty (Klibanoff, Marinacci, and Mukerji 2005). How do we measure and report this uncertainty?
This chapter introduces the two main approaches used in statistics: classical statistics and Bayesian statistics. It may surprise you that statistics does not have a standard approach. Rather, statistics has two approaches that differ in important ways. The chapter discusses the problems and benefits of each. It also introduces a third way, called **empirical Bayesian** statistics. The third method marries classical and Bayesian statistics. The three approaches are illustrated by considering the question of whether a baseball player who played one game of major league baseball, John Paciorek, was better than the great Babe Ruth.
Classical Statistics
The approach you are probably most familiar with from economics is called classical statistics. Classical statistics uses the analogy principle. This principle states that if the econometrician is interested in some characteristic of the population, then she should use the analogy in the sample. Consider that we have a sample of data on player statistics for the WNBA. If we are interested in the average height of WNBA players then we should use the average height of players in the sample. This is the sample analogy.
This section shows how the analogy principle can be used to provide information about the uncertainty surrounding our estimate.
A Model of a Sample
Consider a case where we observe a sample,
In a typical problem we are interested in estimating the mean,
Simulation of a Sample
Consider the simulation of a sample of size 10 drawn from a normal distribution with mean -3 and standard deviation of 5.
set.seed(123456789)
N <- 10
mu <- -3
sigma <- 5
x <- rnorm(N, mean = mu, sd = sigma)
dens_x <- density(x)
Figure 55 shows that the sample distribution differs substantially from the true distribution. However, we are generally not interested in determining the true underlying distribution. Rather, we are interested in some aggregate characteristic of the distribution such as the mean.
mean(x)
[1] -3.484872
sd(x)
[1] 5.097328
The sample mean is -3.49, which is also substantially different from the true value of -3. The sample standard deviation is 5.10 which is not that different from the true value of 5.
How do we convey to the reader or the policy maker that our estimate of various statistics may not be accurate?
Many Imaginary Samples
One idea is to think about how our estimate would vary if we had a large number of different samples, all drawn from the same true distribution. For each imaginary sample that we draw from the true distribution, we will get a different estimate. As the number of samples gets large, we will have a distribution of the estimates. This distribution provides information about the uncertainty of our sample estimate. Luckily, with computers we can actually conduct this thought experiment.
M <- 500
sample_est <- matrix(NA,M,N)
for (m in 1:M) {
  sample_est[m,] <- rnorm(N, mean = -3, sd = 5)
}
Consider the simulated data generated above. Figure 56 presents the histogram of the distribution of sample means. The distribution is centered around the true mean of -3, but varies by more than 3 on each side. This illustrates that if we took a large number of samples our estimate would be correct on average.
We say that our estimate of the mean is unbiased. I generally don’t find this particularly interesting. “Your estimate is almost certainly wrong but correct in expectation.” Of course, it is also good to know if your estimate will have a tendency to under-estimate or over-estimate the true value.
More importantly, the weight of the distribution is around the true value. However, we cannot rule out the possibility of our mean estimate being as low as -6 or as high as 0. The extent of the dispersion is determined by the size of the sample. Try running the same experiment with a larger sample, say
If we were able to observe a large number of imaginary samples, we would be able to show how accurate our estimate is. But we don’t observe any imaginary samples.
Law of Large Numbers
Another idea is to consider what happens if we have a large number of imaginary observations. If we had a very large sample our estimate of the mean would be very close to the true value. With 100,000 observations the sample mean is -3.001, which is quite close to -3. An estimator with this property is called consistent.
set.seed(123456789)
N <- 100000
x <- rnorm(N, mean = -3, sd = 5)
mean(x)
[1] -3.001334
Theorem 2 (Law of Large Numbers). If $x_1, x_2, \dots, x_N$ are independent draws from a distribution with mean $\mu$, then the sample mean $\bar{x}_N = \frac{1}{N}\sum_{i=1}^{N} x_i$ converges (in probability) to $\mu$ as $N \to \infty$.
Various versions of Theorem 2 state that as the sample size gets large, the sample estimate converges (in some sense) to the true value. The Law of Large Numbers suggests that if our sample is “large enough” then our estimate may be close to the true value. Seems nice, but may not be that useful if our sample size is 10. It also does not provide us with any information about how uncertain our estimate actually is.
Central Limit Theorem
A central result to much of standard classical statistics is the Central Limit Theorem. The theorem states that as the number of observations gets large, then the estimated sample mean is distributed normally with a mean equal to the true mean and a standard deviation equal to the true standard deviation divided by the square root of the number of observations.
Theorem 3 (Central Limit Theorem). If $x_1, x_2, \dots, x_N$ are independent draws from a distribution with mean $\mu$ and variance $\sigma^2$, then as $N$ gets large the distribution of the sample mean $\bar{x}_N$ is approximately $N\!\left(\mu, \frac{\sigma^2}{N}\right)$.
The following is a simulation of the Central Limit Theorem. It presents the density of the distribution of estimated sample means as the size of the sample increases. Note that the distributions are normalized so that they will be standard normal (if the sample size is large enough).
M <- 1000
sample_means <- matrix(NA,M,4)
for (m in 1:M) {
  sample_means[m,1] <- mean(rnorm(10, mean = -3, sd = 5))
  sample_means[m,2] <- mean(rnorm(100, mean = -3, sd = 5))
  sample_means[m,3] <- mean(rnorm(1000, mean = -3, sd = 5))
  sample_means[m,4] <- mean(rnorm(10000, mean = -3, sd = 5))
}
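The normalization itself is not shown in the loop; a minimal sketch of it for the N = 10 column (centering at the true mean and scaling by the true standard error) is below.
# normalize the N = 10 sample means so they are approximately standard normal
z_10 <- sqrt(10)*(sample_means[,1] - (-3))/5
plot(density(z_10), main = "normalized sample means, N = 10")
curve(dnorm(x), add = TRUE, lty = 2) # standard normal for comparison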
The interesting part of Figure 57 is that even for small samples the density is close to a standard normal distribution.141
The Central Limit Theorem suggests a way of determining the uncertainty associated with the estimate. It gives a distribution of the sample mean. Moreover, the variance of the distribution, the uncertainty, is determined by the true variance and the sample size. The uncertainty associated with the estimate is smaller if the variance is smaller and if the sample size is larger.
The problem is that we don’t know the variance. We don’t know the component parts of the distribution of the sample mean.
Approximation of the Limiting Distribution
The standard solution to not knowing the true values is to use an approximation based on the analogy principle (Charles F. Manski 1990). Thus, we simply replace the unknown true mean with its known analogy, the sample mean. We replace the unknown true variance with the known sample analogy, the sample variance.
Approximate Distribution
Assumption 9 states that the estimated mean is distributed normally with a mean equal to itself and a variance equal to the sample variance divided by the sample size. This approximation gives us a measure of uncertainty that we can actually use.
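As a small sketch of the approximation for one sample of size 10 (the sample x_s and the plotting range are ours):
# plug the sample mean and sample standard deviation into the
# normal approximation for the distribution of the sample mean
x_s <- rnorm(10, mean = -3, sd = 5)
mu_hat <- mean(x_s)
se_hat <- sd(x_s)/sqrt(10)
curve(dnorm(x, mean = mu_hat, sd = se_hat),
      from = mu_hat - 4*se_hat, to = mu_hat + 4*se_hat,
      xlab = "sample mean", ylab = "approximate density")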
There are two problems with this approach. First, there is no particularly good reason to believe that the sample mean is normally distributed. Sure, we know that it must be if the sample size is large, but our sample size is 10. Second, the measure of uncertainty is also estimated. Worse, it may be estimated poorly. Our measure of uncertainty is itself uncertain. The more uncertainty there is, the less certain we can be about how much uncertainty there is.
Simulation of Approximate Distributions
Consider five different samples and the corresponding approximate distribution of the sample mean from each.
M <- 5
N <- 10
sample_means <- matrix(NA,M,2)
x2 <- NULL
for (m in 1:M) {
  x1 <- rnorm(N,mean=mu,sd=sigma)
  sample_means[m,] <- c(mean(x1),sd(x1)/(N)^(.5))
  x2 <- c(x2,x1)
}
x2 <- sort(x2)
Figure 58 does not paint as rosy a picture as Figure 57. The estimates of the distribution of the sample mean are all over the place. Our estimate of uncertainty is affected by how uncertain our estimates are.
Bootstrap
An alternative approach to representing uncertainty is called bootstrapping. This idea was developed by the Stanford statistician Brad Efron. The idea takes the analogy principle to its natural conclusion. If we need the true distribution we should use its sample analog, that is, the sample itself. Using a computer we can create counterfactual pseudo samples by re-drawing from the sample with replacement.
Consider the following example using sample
N <- 10
x1 <- rnorm(N, mean=mu, sd=sigma)
M <- 1000
bs_mean <- matrix(NA,M,1)
for (m in 1:M) {
  index_bs <- round(runif(N,min=1,max=N)) # resample indices with replacement
  bs_mean[m,1] <- mean(x1[index_bs])
}
dens_bs <- density(bs_mean)
Figure 59 shows that our estimate of uncertainty is still wrong. The bootstrap measure of uncertainty no longer makes the arbitrary assumption that the sample mean is distributed normally. However, the small sample implies that we still have a poor estimate of the uncertainty.
Hypothesis Testing
A standard method for representing uncertainty is via a hypothesis test. Above, the sample average is -3.49. Let's say it is important to know the probability that the true mean is in fact 0 or is in fact less than 0. Can we rule out the possibility that the true mean is greater than 0?
Equation 212 presents a standard representation of the hypothesis that the true mean is greater than or equal to 0. We call this the null hypothesis.
Note that we are not testing whether the true mean is -3.49. Rather we are testing whether it is not greater than 0. We are ruling out a possibility.
To test the hypothesis, we begin by assuming the hypothesis is true. That is, we assume that the true mean is 0. Given that assumption, we can ask whether it is reasonable that we would observe a sample average of -3.49. How do we determine what is reasonable?
What is the probability that we could see a sample with a sample average of -3.49 if the true mean is 0? To answer this question we need to know the distribution of sample means. Above, we assumed that the distribution of sample means is approximately normal with a mean equal to the true mean and a variance equal to the true variance divided by the sample size. By assumption, the true mean is 0. We don’t know the true variance, but we can use the analogy principle and replace it with the sample variance, which is 5.10.
This will produce a small probability but that is partly because we are asking about a point rather than a region. A better way is to ask at what sample mean do we no longer think the hypothesis is true?
Let
alpha = 0.05
sd(x)*qnorm(alpha)/sqrt(10)
[1] -2.596642
The critical region is less than -2.60. In Figure 60, this is represented as the area under the curve to the left of the solid vertical line. Given the assumptions, we can reject the hypothesis that the true mean is 0 at the 5% level if the sample mean is less than -2.60, which it is.
An alternative is to calculate the p-value. This is the probability of observing a sample mean less than or equal to -3.49 given the hypothesis is true.
pnorm(sqrt(10)*((-3.49)/sd(x)))
[1] 0.01352641
This is 0.014 or 1.4%.
The nice property of the hypothesis test is that it allows the econometrician to assume the true mean, which is an unknown object. In addition, there are occasions where the null hypothesis is a value of interest. For example, Chapter 2 discusses the hypothesis that the direct effect of race on mortgage lending is 0. This particular hypothesis may have legal implications for bankers.
However, it is unclear why a hypothesis of 0 makes sense when describing the uncertainty around our estimate of -3.49. It may be preferable to present the 5th and 95th percentile of the approximate or bootstrap distribution. This is referred to as the **confidence interval**.
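A small sketch of both versions of that interval for the bootstrapped sample x1 created above:
# 90 percent interval from the normal approximation
mean(x1) + qnorm(c(0.05,0.95))*sd(x1)/sqrt(length(x1))
# 90 percent interval from the bootstrap distribution
quantile(bs_mean, c(0.05,0.95))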
Bayesian Statistics
The major competitor to the classical approach is given the appellation “Bayesian” statistics (Berger 1985). One should be careful about the names because “Bayesian” is also used to refer to the standard approach to decision making under uncertainty used in microeconomic theory. It may be best to think of them as two different approaches to decision making under uncertainty with the same name (Berger 1985; Kreps 1990).
The section discusses how to calculate the posterior.
Bayes’ Rule
The Bayesian approach also provides us with a measure of the uncertainty but it uses Bayes’ rule to do so. If we knew the set of possible true parameters,
where
Determining the Posterior
To calculate the posterior, we need to know the prior and the likelihood functions. Assume we know the likelihood functions. Let’s say we know the possible set of distributions that generated the data. This may not be as crazy as it sounds. Consider the problem of determining a true probability. Say the probability that you will become King or Queen of England. Unless your name is Prince William, this is a pretty small number. Nevertheless it is a number between 0 and 1. I don’t know the probability that you will become King or Queen of England but I do know that it is a number between 0 and 1. Here the problem is more complicated, but let’s say that I know that the true mean must lie between -5 and 10. Also for simplicity, assume that I know that the true standard deviation is 5. So
If I know the possible set of true means, the likelihood of observing the sample conditional on the true mean is just the likelihood function presented for OLS in Chapter 5. So we have the sample, the set of parameters and the likelihood function. All we need to use Equation 216 is the prior distribution.
What is the prior distribution? IDK. We could make an assumption that the prior distribution is uniform. In some cases, the uniform prior gives a sense of uncertainty around what the estimate could be.
Determining the Posterior in R
log_norm <- function(x, mu, sigma) {
  z = (x - mu)/sigma
  return(sum(log(dnorm(z)) - log(sigma)))
}
Theta <- 15*c(1:100)/100 - 10
g <- rep(1/100,100) # uniform approximation
h <- sapply(Theta, function(theta) log_norm(x1, theta, 5))
gamma <- exp(h + log(g))/sum(exp(h + log(g)))
Figure 61 presents the Bayesian posterior of the mean given the sample and a uniform prior. This posterior suggests that the estimate is quite uncertain, with the weight of the distribution running between -7 and 0.
Given the assumptions of the Bayesian approach, the Bayesian approach is the “correct” and rational approach to presenting statistical results. It is even consistent with Bayesian decision theory.
However, the assumptions are where the rubber hits the road. The Bayesian approach makes very strong assumptions about what information the econometrician has available. Alternatively, it makes strong assumptions about things that the econometrician does not know. Moreover, these assumptions matter.
Neither the classical nor the Bayesian approach makes a lot of sense when data sets are relatively small. One uses non-credible information to make non-credible claims, while the other uses non-credible assumptions to make non-credible claims.
Empirical Bayesian Estimation
In the middle of the last century, the Columbia statistician Herb Robbins suggested that there may exist a third way. He proposed an approach that used the logic of Bayesian statistics but replaced the strong assumptions with the classical idea of estimating the prior distribution using the analogy principle. He called this approach empirical Bayesian.
A Large Number of Samples
The empirical Bayesian approach may make sense when the econometrician has access to a large number of related samples. In panel data models, we can think of each time series for each individual as a separate but related sample. Similarly, we can think of each auction as a sample of bids from a large number of related samples.
Robbins pointed out that with a large number of related samples it may be possible to estimate the prior. That is, instead of assuming the prior, the data would provide an estimate of the prior. Given that the prior can be estimated, the rest of the procedure is Bayesian.
Robbins (1956) states that the observed distribution of sample means can be written as a mixture distribution, where a likelihood function is weighted by the true prior distribution.
where
Robbins states that if the equation can be uniquely solved then we have an estimate of the prior. In a series of papers, Brad Efron and coauthors present assumptions which allow the equation to be uniquely solved.
Solving for the Prior and Posterior
Given both
Here we use an idea developed by statisticians at Penn State. It is similar to the mixture model estimator used in Chapter 11 (Benaglia, Chauveau, and Hunter 2009).
where
The posteriors can be aggregated up to provide an estimate of the prior.
where
We can say that the system converges if the following inequality holds.
for some small number
I am unaware of any proof showing that this method will actually lead to the solution, but if it does, then if conditions presented in Efron (2014) hold, it must be the unique solution.143
Solving for the Prior in R
Previously, we generated 500 imaginary samples in order to illustrate the idea that estimates of means have distributions. Now imagine we actually have a data set of these 500 samples. We don't know all the possible true means, but it seems reasonable to assume that the set of true means spans the observed data. We can approximate that set with 300 “true means” evenly spaced across the range. To make things simpler, assume that we know the data is distributed normally with a standard deviation of 5. What we don't know is the true mean.
Note that we have 500 samples and we will calculate the posterior for each. We use a matrix with 500 rows and 300 columns.
Theta = (max(sample_est) - min(sample_est))*c(1:300)/300 +
  min(sample_est)
H <- matrix(NA,500,300)
for (i in 1:500) {
  x <- sample_est[i,]
  H[i,] <- sapply(c(1:300), function(k)
    log_norm(x, mean(Theta[k]), 5))
  #print(i)
}
Given the assumptions described above, we can calculate the likelihood of observing each sample given all the possible “true distributions.” This provides an estimate of
The simplest place to start is to assume that the initial prior is a uniform distribution over the possible parameter values. This is similar to the assumption made for the Bayesian estimator. Here, it is only the starting point for estimating the prior.
g0 <- rep(1/300,300)
log_G0 <- log(t(matrix(rep(g0, 500),nrow=300)))
gamma0 <- exp(H + log_G0)/sum(exp(H + log_G0))
g1 <- colSums(gamma0)/sum(gamma0) # new estimate of prior
Given this assumption and the likelihood calculated above, we have our first iteration of the estimate of the posterior. Integrating over the posterior estimates provides the next iteration of the prior.
We repeat the process to get a second iteration of the estimate of the prior. With these two we can then create a while() loop to determine when the process converges.
log_G1 <- log(t(matrix(rep(g1, 500), nrow=300)))
gamma1 <- exp(H + log_G1)/sum(exp(H + log_G1))
g2 <- colSums(gamma1)/sum(gamma1)
Figure 62 presents the results of the first two iterations of the estimated priors. Note we start with a uniform or “flat” prior. The first iteration looks like a normal distribution. The second iteration is a normal distribution with thinner tails.
eps <- 1e-3 # small number
g_old <- g1
g_new <- g2
diff <- sum(abs(g_new - g_old))
while (diff > eps) {
  g_old <- g_new
  log_G <- log(t(matrix(rep(g_old,500),nrow=300)))
  gamma <- exp(H + log_G)/sum(exp(H + log_G))
  g_new <- colSums(gamma)/sum(gamma)
  diff <- sum(abs(g_new - g_old))
  #print(diff)
}
The following is a method for creating a representation of the prior distribution. It creates a long vector in which each candidate mean is repeated in proportion to its estimated prior probability.
J <- 100000
g_dist <- unlist(lapply(c(1:300), function(k)
  rep(mean(Theta[k]),round(J*g_new[k]))))
Figure 63 presents the estimated density of the prior distribution for the simulated data generated above. For comparison, it presents the density of the observed sample means. Note that the empirical Bayesian prior is a lot tighter than the observed distribution of sample means. This is a standard result and is the reason that the eBayesian estimator is called a shrinkage estimator. It “shrinks” the estimates in toward the true mean.
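A quick check of the shrinkage property using the objects created above:
# the estimated prior should be tighter than the raw sample means
sd(rowMeans(sample_est)) # spread of the observed sample means
sd(g_dist)               # spread of draws from the estimated prior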
Estimating the Posterior of the Mean
Once we have the prior estimated, we can use Bayes’ rule to determine the posterior distribution of the mean conditional on the observed sample. Consider the sample of 10 observations generated above.
h_x1 <-
  sapply(c(1:300), function(k) log_norm(x1, Theta[k], 5))
gamma_x1 <- exp(h_x1 + log(g_new))/sum(exp(h_x1 + log(g_new)))
Given this calculation of the posterior distribution, we can use the same procedure as above to create a density that can be plotted.
J <- 100000
g_dist_x1 <- unlist(lapply(c(1:300), function(k)
  rep(Theta[k],round(J*gamma_x1[k]))))
Figure 64 presents the posterior distribution using the eBayesian procedure, along with the analog estimate, which is close to -4; the true value, which is -3; and the posterior estimate of the mean, which is -2.88. The fact that the posterior estimate is close to the true value is not really that interesting. It occurs because the sample has only 10 observations, so the posterior is mostly determined by the prior. The variance is a lot smaller than suggested by the classical procedures and even the standard Bayesian procedure.
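The posterior mean quoted above can be checked directly from the posterior weights:
sum(gamma_x1*Theta) # eBayesian posterior mean for the sample x1
mean(x1)            # the analog (sample mean) estimate for comparison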
The Sultan of the Small Sample Size
In baseball, the Sultan of Swing refers to Babe Ruth. Over his lifetime in major league baseball, the Bambino had a batting average of 0.341. That is, for every 1,000 “at bats,” he hit the ball and got on base or scored a home run, 341 times.144
In 2013, the New York Times ran an article about John Paciorek (Hoffman 2013). You have probably never heard of Paciorek, but if you have, you know that he has a lifetime batting average in major league baseball of 1.000. Paciorek had a “hit” with every “at bat,” which totaled 3. Paciorek was a rookie who came up to the major leagues with the Houston Colt .45s for the last game of the 1963 season. He and his teammates had a great game against the hapless Mets. In the off-season, Paciorek was injured and never played another game in the majors.
We may ask whether Paciorek is “better” than Ruth because he has a higher lifetime batting average. You know the answer. Paciorek had only 3 at bats, while Ruth had thousands. If we think that each player has a “true” lifetime batting average, then the Law of Large Numbers suggests that Ruth’s is probably pretty close to it. How close is Paciorek’s? How uncertain is our estimate of Paciorek’s lifetime batting average?
The section presents the standard classical and Bayesian approaches to representing the uncertainty around the estimate of John Paciorek’s lifetime batting average. It shows that neither provides a satisfactory answer. It shows how to determine the empirical Bayesian posterior and shows that it does provide a satisfactory answer.
Classical or Bayesian?
Classical statistics gives a surprising answer. The analogy principle states that if we want to compare Paciorek to Ruth on their true lifetime batting average then we should use the lifetime batting average we observe. It says that Paciorek is better than Ruth, because 1.000 > 0.341. Moreover there is little uncertainty. Using both the approximation method and the bootstrap method we get the same answer; there is no variance around the estimate of 1.000. Remember our sample data is
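A tiny illustration of that classical answer, coding each of Paciorek's three at bats as a hit:
jp <- c(1, 1, 1) # three at bats, three hits
mean(jp)         # analog estimate of the lifetime batting average: 1
sd(jp)/sqrt(3)   # approximate standard error: 0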
Bayesian statistics does slightly better. With only 3 at bats, the posterior distribution of Paciorek’s lifetime batting average is completely dominated by the prior. What is the prior? Batting averages are numbers between 0 and 1, so one guess is a uniform distribution. However, anyone who knows anything about baseball will tell you that a uniform prior is ridiculous.
Uniform Prior
In addition to being a ridiculous assumption about baseball, the analysis in this section shows that it gives a ridiculous result.
Nevertheless, it is worth going through the exercise. As above, we calculate the likelihood function. Let all the true possible probabilities be the 999 numbers from 0.001 to 0.999.
log_bin <- function(N, p_hat, p) {
  K = round(p_hat*N)
  return(lchoose(N,K) + K*log(p) + (N-K)*log(1-p))
}
# lchoose gives the log of the binomial coefficient.
Theta <- c(1:999)/1000
log_g <- log(rep(1/1000,999))
h_jp <- sapply(Theta, function(theta) log_bin(3,1,theta))
gamma_jp <- exp(h_jp + log_g)/sum(exp(h_jp + log_g))
J <- 100000
gamma_dist_jp <- unlist(lapply(c(1:999), function(j)
  rep(j,round(J*gamma_jp[j]))))
# mean
sum(gamma_jp*Theta)
[1] 0.7995997
Figure 65 suggests that Paciorek's lifetime batting average is very likely to be greater than Ruth's. In fact, the posterior probability is 0.986. However, this result is all driven by the assumption that the prior is uniform.
sum(gamma_jp[342:999])
[1] 0.9863721
Estimating the Prior
To estimate the prior we need to observe a large number of lifetime batting averages. Luckily, thanks to people like Sean Lahman, we have access to data on almost every player to play Major League Baseball or Negro League Baseball between 1871 and 2018.146
M <- 1000
x <- read.csv("bat_ave.csv", as.is = TRUE,nrows=M)
# data set created from Lahman's data.
# limited the observations for computational reasons.
H <- matrix(NA,M,999)
for (m in 1:M) {
  H[m,] <- sapply(Theta, function(theta)
    log_bin(round(x$AB[m]),x$AVE[m],theta))
  #print(m)
  # sapply uses apply on a vector.
}
log_G <- t(matrix(rep(log_g,M),nrow = 999))
gamma0 <- exp(H + log_G)/sum(exp(H + log_G))
g1 <- colSums(gamma0)/sum(gamma0)
This uses the same method for estimating the prior as above.
eps <- 0.001
g_old <- exp(log_g)
g_new <- g1
diff <- sum(abs(g_new - g_old))
while (diff > eps) {
  g_old <- g_new
  Log_G = log(t(matrix(rep(g_old,M),nrow=999)))
  Gamma <- exp(H + Log_G)
  Gamma <- Gamma/rowSums(Gamma)
  g_new <- colSums(Gamma)/sum(Gamma)
  diff <- sum(abs(g_new - g_old))
  #print(diff)
}
Paciorek’s Posterior
Using the likelihood for John Paciorek calculated in the previous section and the prior estimated above we can create vectors for graphing.
gamma_jp2 <-
  exp(h_jp + log(g_new))/sum(exp(h_jp + log(g_new)))
J <- 100000
g_dist_mlb <- unlist(lapply(c(1:999), function(j)
  rep(Theta[j],round(J*g_new[j]))))
# lapply uses apply on a list.
# unlist turns a list into a vector.
gamma_dist_jp2 <- unlist(lapply(c(1:999), function(j)
  rep(Theta[j],round(J*gamma_jp2[j]))))
# mean
sum(gamma_jp2*Theta)
[1] 0.2506879
The mean of the posterior is 0.251. We could take this as our estimate of John Paciorek's true lifetime batting average. John's brothers Tom and Jim finished their MLB careers with lifetime batting averages of 0.282 and 0.228 respectively.
Figure 66 presents the density functions of the estimated prior and the new posterior distribution for John Paciorek. The prior, estimated using data from almost 150 years of baseball, suggests that the probability of any player having a lifetime batting average of 0.341 or greater is remote. The probability that John Paciorek's innate lifetime batting average is greater than 0.341 is 0.005, which is a lot smaller than both 1.000 and 0.986.
sum(gamma_jp2[342:999])
[1] 0.004715468
Decision Theory
Statistics is fundamentally about making decisions under uncertainty. In some cases the statistician herself has a decision problem. Should I collect more data given data collection is costly and the data I have provides some information? In other cases, the statistician acts as a go-between. They sit between the data and the ultimate decision maker. The statistician provides a summary of data to the decision maker. But how should she summarize the data? What information should the decision maker get?
Consider the problem above. If you are limited to providing one number about John Paciorek’s batting average, which number would you choose? Consider a more realistic example. You work for a baseball team and you are considering a trade of one first baseman for another. Assume, unrealistically, that the team-manager only cares about batting average. If you are to provide a single number to the team-manager that summarizes the information about the two players, which number does that?
Decision Making Under Uncertainty
Economists have been thinking about how to make decisions under uncertainty for a very long time. The standard paradigm is called expected utility theory. In this model there is some outcome
If there are
The problem with this standard economic approach for our purposes is that we don’t know what
Compound Lotteries
One response to the fact that we don’t know
If we don’t know
where there are
For most economists, this is generally where they stop. In fact, some argue that a rational decision maker must use Equation 222 as a basis for decision making under uncertainty. The problem with the representation in Equation 222 is that it does not really account for uncertainty over what
It is easiest to see the implication if we assume that
This is pretty strange given we have the whole field of statistics dedicated to determining the best way to measure uncertainty over
What in the World Does Wald Think?
In the 1940s, the Hungarian-American mathematician Abraham Wald took decision theory directly to the statistician's problem. In Wald's setup, the decision maker has sample data and must make a choice given their utility. Interestingly, Wald considered the problem as a game with two players, Nature and the Statistician. Nature chooses the true state of the world, although Nature could have mixed strategies (a distribution over the true state of the world). The Statistician does not know Nature's choice, although she may know the mixed strategies (Wald 1949).147
Wald states that the Statistician should optimize against Nature by considering expectations over a utility function of the true state of the world and the outcome the Statistician receives.
where
Importantly, if the “loss” function
Two Types of Uncertainty?
In the discussion above there are two probability distributions,
Yes. The first type is fixed, albeit unknown. It is Nature’s choice. The second type is not fixed. The second type of uncertainty varies with data and information. It may even be under control of the Statistician. For example, the Statistician may have the ability to create more data.
If we are to give just one number to summarize the data it should explicitly account for uncertainty associated with the sample data. For example the utility representation of Klibanoff, Marinacci, and Mukerji (2005),
where
Discussion and Further Reading
Given that much of the focus of statistics is providing information about uncertainty, there is surprisingly little agreement about the best approach. Moreover, the two leading candidates, classical and Bayesian statistics, do a surprisingly poor job at representing the uncertainty.
I find the empirical Bayesian approach suggested in Robbins (1956) and developed by Brad Efron and coauthors to be a compelling third way. Admittedly, the applicability of the approach may be limited, but I come across more and more applications every day. The approach also seems to be consistent with modern ideas on rational decision making under uncertainty. See Klibanoff, Marinacci, and Mukerji (2005) and Gilboa (2013).
Here, the eBayesian approach is illustrated with strong assumptions on the distribution functions. Efron (2014) points out that if the set of outcomes is finite then under random sampling the likelihood function is known; it is the multinomial function. Note, in the baseball example the likelihood function is a binomial function under random sampling.
Statistical Programming in R
Introduction
R is based on the language S, which was developed for statistical analysis by the same people who brought you C. When coding in any language it is important to understand how the language works. How it thinks.
R thinks that everything is a vector. Vectors are one-dimensional “list-like” objects. They are the heart and soul of R. R can analyze vectors with amazing speed. A vector is not a list and not a matrix, and it can only contain one type of value (numbers or strings, etc.). Lists can generally contain multiple types of objects, including other lists. Matrices are generally two-dimensional objects that contain numbers.
In contrast to R, Stata thinks everything is a flat data set. It is optimized for operations involving columns of data. Matlab is a matrix-based language, while Python and Mathematica are based on LisP (List Processing).
The chapter discusses the characteristics of objects including vectors, matrices and lists. It discusses basic control functions, statistics and standard optimization.
Objects in R
This section describes how R uses basic objects like vectors, matrices, lists and data frames. It also discusses manipulation of objects, including mathematical operations.
Vectors
Consider the following operations. Create vectors of numbers and strings using the c() function. This function concatenates, or joins, vectors together. Note what happens to the numbers when you do this.
a <- c(1,2,3)
# "<-" means assign which has a different meaning than "="
b <- c("one","two","three")
a
[1] 1 2 3
b
[1] "one" "two" "three"
# a + b
# you cannot use a numerical operator like "+" on a
# non-numeric object.
d <- c(a,b)
d
[1] "1" "2" "3" "one" "two" "three"
# note that d is a vector of strings.
There are a couple of R-centric coding features. It is important to use <- when assigning a variable name. It is possible to use = but it doesn't always mean the same thing. Also note that c(1,2) is a vector with two elements, both numbers, while c("1","2") is a vector with two elements, both characters.
R can manipulate vectors very quickly. When coding it is important to remember this.
b <- c(4, 5, 6)
a + b
[1] 5 7 9
a*b
[1] 4 10 18
a/b
[1] 0.25 0.40 0.50
a + 2
[1] 3 4 5
a*2
[1] 2 4 6
a/2
[1] 0.5 1.0 1.5
Check out the operations above. Look carefully at what happens in each case. In particular, note that each operation is cell by cell. Also note what happens when you have a mathematical operation involving a single number and a vector. The single number operates on each element of the vector.
b <- c(4,5)
a + b
Warning in a + b: longer object length is not a multiple of shorter object
length
[1] 5 7 7
a*b
Warning in a * b: longer object length is not a multiple of shorter object
length
[1] 4 10 12
Note that things get a little strange when the two objects are not the same length. Importantly, R may do the operation regardless! It may or may not give you a warning!! Can you work out what it actually did?148
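For reference, here is a sketch of what R actually did: it recycled the shorter vector, so the third element of a was paired with the first element of b again.
a + c(b, b[1]) # same result as a + b above: c(1+4, 2+5, 3+4)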
Matrices
R allows researchers to use a fairly large array of objects, which is very nice, but it leads to some issues if you are not careful. Matrices are one such object and they can be very useful. In creating matrices we can use “cbind”, which joins columns together, or “rbind”, which joins rows together. Note that the objects being joined must be the same length.
a <- c(1,2,3)
b <- c(4,5,6)
A <- cbind(a,b)
B <- rbind(a,b)
A
a b
[1,] 1 4
[2,] 2 5
[3,] 3 6
B
[,1] [,2] [,3]
a 1 2 3
b 4 5 6
is.matrix(A)
[1] TRUE
t(A)
[,1] [,2] [,3]
a 1 2 3
b 4 5 6
The transpose operation in R is t().
As with vectors, arithmetic operations on matrices are cell by cell. However, the matrices must be the same dimension in order to do cell by cell operations. Try the operations that are commented out below (remove the #).
C <- A + 2
A + C
a b
[1,] 4 10
[2,] 6 12
[3,] 8 14
D <- B*2
B*D
[,1] [,2] [,3]
a 2 8 18
b 32 50 72
# A*B
# A + B
A^2
a b
[1,] 1 16
[2,] 4 25
[3,] 9 36
A + t(B)
a b
[1,] 2 8
[2,] 4 10
[3,] 6 12
You can also do standard matrix multiplication using the operator %*% to distinguish it from cell by cell multiplication. This operation follows the mathematical rules of matrix multiplication. For this operation to work, the two “inside” dimensions must be the same. Below, we are multiplying a 3 × 2 matrix (A) by a 2 × 3 matrix (B).
A%*%B
[,1] [,2] [,3]
[1,] 17 22 27
[2,] 22 29 36
[3,] 27 36 45
# B%*%A
# t(B)%*%A
Lists
The first computer language I really learned was called Logo. The language was developed by Seymour Papert in MIT’s Artificial Intelligence Lab. Like its descendant, Scratch, Logo was designed to help children learn mathematics and programming. Logo is based on LisP. My father, who was a computer scientist, would get very excited about the list processing ability of Logo. As a ten year old, I didn’t quite understand the fuss. Today, using the list processing features of R, I am wistful for Logo’s abilities. R is not a list based language, but it can process lists.
a_list <- list(a,b)
a_list
[[1]]
[1] 1 2 3
[[2]]
[1] 4 5 6
b_list <- list(a_list,A)
b_list
[[1]]
[[1]][[1]]
[1] 1 2 3
[[1]][[2]]
[1] 4 5 6
[[2]]
a b
[1,] 1 4
[2,] 2 5
[3,] 3 6
c_list <- list(A,B)
c_list
[[1]]
a b
[1,] 1 4
[2,] 2 5
[3,] 3 6
[[2]]
[,1] [,2] [,3]
a 1 2 3
b 4 5 6
c_list <- c(c("one","two"),c_list)
c_list
[[1]]
[1] "one"
[[2]]
[1] "two"
[[3]]
a b
[1,] 1 4
[2,] 2 5
[3,] 3 6
[[4]]
[,1] [,2] [,3]
a 1 2 3
b 4 5 6
d_list <- list(c("one","two"),c_list)
d_list
[[1]]
[1] "one" "two"
[[2]]
[[2]][[1]]
[1] "one"
[[2]][[2]]
[1] "two"
[[2]][[3]]
a b
[1,] 1 4
[2,] 2 5
[3,] 3 6
[[2]][[4]]
[,1] [,2] [,3]
a 1 2 3
b 4 5 6
Lists can be very useful for storing things, particularly different types of objects. Lists don’t require the different elements to be of the same type. This feature makes them very useful and you will find them in the background for a number of R statistical functions. They can also store objects within objects.
One thing to notice above is that the operations c() and list() look similar but don’t actually do the same thing. Look carefully at the difference.
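As a minimal sketch of the difference (the toy lists here are purely illustrative), c() merges its arguments into one flat list, while list() keeps each argument as a separate element:
length(c(list(1,2), list(3)))    # 3: one flat list
length(list(list(1,2), list(3))) # 2: a list containing two lists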
Data Frames
Data frames are a very important object for statistical analysis. The data frame is like a matrix, but it allows different types of elements. The type must be the same for each column. Data frames also behave a lot like lists. Importantly, you can call a column using the $ symbol. See discussion in the next section.
x <- read.csv("minimum wage.csv", as.is = TRUE)
# as.is = TRUE keeps the data in the same format
# as it was originally.
# note R prefers to change types to factor.
# another option is stringsAsFactors = FALSE.
is.character(x$State)
[1] TRUE
is.numeric(x$Minimum.Wage)
[1] TRUE
The object x is a data frame. It has one column of strings and two numeric columns.
Interacting with Objects
This section discusses how we can transform objects and retrieve information from them.
Transforming Objects
We can transform objects of various types into various other types with various degrees of success. Transformations, or coercions, of an object type usually involve as.someobjecttype(). Note that matrix() is often what I use to transform a vector into a matrix. Note that you need to state how many rows or columns it has.
as.vector(B)
[1] 1 4 2 5 3 6
matrix(c(a,b),nrow=3)
[,1] [,2]
[1,] 1 4
[2,] 2 5
[3,] 3 6
as.matrix(cbind(a,b))
a b
[1,] 1 4
[2,] 2 5
[3,] 3 6
as.list(B)
[[1]]
[1] 1
[[2]]
[1] 4
[[3]]
[1] 2
[[4]]
[1] 5
[[5]]
[1] 3
[[6]]
[1] 6
unlist(a_list)
[1] 1 2 3 4 5 6
as.character(A)
[1] "1" "2" "3" "4" "5" "6"
as.factor(B)
[1] 1 4 2 5 3 6
Levels: 1 2 3 4 5 6
as.vector(as.numeric(as.character(as.factor(a)))) == a
[1] TRUE TRUE TRUE
The oddest, and possibly the most frustrating, type in R is the factor type. R likes to store character vectors as factors because it is computationally efficient. But it is easy to confuse a factor object with a numeric object. They are not the same. I often have errors in my code due to this mistake.
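As a minimal illustration of the pitfall (the vector f is made up for this example), converting a factor directly to numeric returns the internal level codes rather than the original values:
f <- as.factor(c(10, 20, 30))
as.numeric(f)               # 1 2 3 -- the level codes, not the data
as.numeric(as.character(f)) # 10 20 30 -- the original values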
Logical Expressions
Like other operations, logical operators are cell by cell in R. The == is used to determine whether something is true or false. We can also use order operations, >, <, >= and <=. The ! is used for not. The symbols & and | are used for “and” and “or” when combining logical arguments. You can also use && and ||; unlike & and |, these work on single TRUE/FALSE values (recent versions of R throw an error if the operands are longer than one), so they are mainly used inside if() statements.
a==b
[1] FALSE FALSE FALSE
A==t(B)
a b
[1,] TRUE TRUE
[2,] TRUE TRUE
[3,] TRUE TRUE
a_list[[1]]==a_list[[2]]
[1] FALSE FALSE FALSE
a > b
[1] FALSE FALSE FALSE
b > 5
[1] FALSE FALSE TRUE
b >= 5
[1] FALSE TRUE TRUE
b <= 4
[1] TRUE FALSE FALSE
a != b
[1] TRUE TRUE TRUE
(b > 4) & a == 3
[1] FALSE FALSE TRUE
# (b > 4) && a == 3 (This may have been deprecated)
(b > 4) | a == 3
[1] FALSE TRUE TRUE
# (b > 4) || a == 3
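Since the && line above is commented out, here is a minimal sketch of where the scalar operators are typically used (the value of x1 is purely illustrative): inside an if() statement, which requires a single TRUE or FALSE.
x1 <- 3
if (x1 > 0 && x1 < 10) {
  print("x1 is between 0 and 10")
}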
Retrieving Information from a Position
There are a number of ways to retrieve information from objects. The simplest is to request it by its position in the object. Objects, such as vectors, have an index which gives the position of every object in the vector. Note that R thinks of matrices as just rearranged vectors, so the index can also be used for matrices. As matrices are two-dimensional objects, information can also be retrieved using the matrix coordinates.
a[1]
[1] 1
b[3]
[1] 6
A[5]
[1] 5
A[2,2]
b
5
We are not limited to a single index number. We can retrieve a subset of the object using various index notation, including - to mean “not.” Note that there is no “end” index notation in R. In its stead I use length() or dim(). Note that dim() only works for objects with more than 1 dimension, like matrices.
a[1:2]
[1] 1 2
a[-3]
[1] 1 2
A[1,]
a b
1 4
A[c(1,5)]
[1] 1 5
a[2:length(a)]
[1] 2 3
D <- cbind(A,2*A)
D
a b a b
[1,] 1 4 2 8
[2,] 2 5 4 10
[3,] 3 6 6 12
D[, 3:dim(D)[2]]
a b
[1,] 2 8
[2,] 4 10
[3,] 6 12
Retrieving information from lists can be frustrating. Consider the three different methods below. Note that the first two look very similar and seem to produce similar results, but they are actually quite different: the single bracket returns a sub-list (a smaller list), while the double bracket returns the element itself. Suffice it to say, the double bracket is probably what you want.
a_list[2]
[[1]]
[1] 4 5 6
a_list[2][2]
[[1]]
NULL
a_list[[2]]
[1] 4 5 6
a_list[[2]][2]
[1] 5
As well as the index, positions in an object may be named. In that case, the name can be used to retrieve the information. This is particularly useful for lists. To retrieve a named item in a list you can use the $. RStudio has a nice feature of popping up the options when you type the $, so you don’t need to remember all the names.
names(a_list) <- c("first", "second")
a_list$first
[1] 1 2 3
names(a_list)
[1] "first" "second"
names(a_list)[2]
[1] "second"
b_list <- list(a_list,B=B)
b_list
[[1]]
[[1]]$first
[1] 1 2 3
[[1]]$second
[1] 4 5 6
$B
[,1] [,2] [,3]
a 1 2 3
b 4 5 6
b_list$B
[,1] [,2] [,3]
a 1 2 3
b 4 5 6
Retrieving the Position from the Information
Often we are trying to determine where certain information lies in the object. We can use logical expressions and the function which() to find the index positions of vectors.
which(a > 2)
[1] 3
which(A > 2)
[1] 3 4 5 6
which(a_list[[2]] > 2)
[1] 1 2 3
If we are going to use the index to subset another object, we don’t need to use which().
b[a > 2]
[1] 6
B[A > 2]
[1] 2 5 3 6
A[, colnames(A)=="b"]
[1] 4 5 6
We can also use match() or %in% to determine which information belongs to a set. Note that %in% returns TRUE or FALSE, while match() returns the index, or NA if there is no match. We can use match() to find objects in a list. Note that we need to be careful that we are looking for the correct type of object in the list. But match() can also find things that are not in exactly the same format. See the example below.
d <- c("one", "two", "three", "four")
c("one", "four", "five") %in% d
[1] TRUE TRUE FALSE
match(a,c(1,2))
[1] 1 2 NA
match(a_list,c(4,5,6))
[1] NA NA
match(a_list,list(c(4,5,6)))
[1] NA 1
match("1",c(3,5,8,1,99))
[1] 4
Statistics
This section discusses basic statistical operations in R.
Data
set.seed(123456789)
a <- c(1:1000)
b <- c("one","two",NA,4,5:1000)
e <- rnorm(1000)
c <- 2 - 3*a + e
x <- as.data.frame(cbind(a,b,c))
Consider the simulated data. The data frame object has a nice property that allows you to have objects in which some variables are numeric while others are strings. That said, you need to be careful and keep track of the variable types. Note that all the variables in x are factors. R has transformed them.
x <- as.data.frame(cbind(a,b,c), stringsAsFactors = FALSE)
What type are the variables in x now?
x$a <- as.numeric(x$a)
x$c <- as.numeric(x$c)
write.csv(x,"x.csv")
Some useful functions for creating simulated data are rnorm() and runif(). These are random number generators. You can generate pseudo-random numbers from various R functions. Use related functions to calculate probability distributions and densities. Note that these numbers are generated according to a particular function. They are generated such that if they start with a particular number, a seed, then they will always give the same sequence. So if you can get the system to always start with the same number, you can have the results be exactly reproducible. In R, set.seed() does this. I always set the seed with 123456789.
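A minimal sketch of the reproducibility point (the seed and the sample size are just illustrative):
set.seed(123456789)
u1 <- runif(5)
set.seed(123456789)
u2 <- runif(5)
all.equal(u1, u2) # TRUE: the same seed gives the same sequence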
As a general rule, I like to keep my scripts in the same folder as the data. In this way there are no long paths for writing or reading files. If you want to do this in RStudio, then make sure to go to Session > Set Working Directory > To Source File Location.
Missing Values
R has a particular way of dealing with missing values. It uses NA for missing. Any operation of a value with a missing will give back a missing. This is a nice feature which forces the programmer to think about how to deal with missing information. R will also treat NA as numeric.
2 + NA
[1] NA
2*NA
[1] NA
2 + c(3,4,6,7,NA)
[1] 5 6 8 9 NA
Often in microeconometrics we want to ignore the missing values. We may be assuming that they are missing-at-random. We did this in our analysis of returns to schooling for men in Chapters 1 and 2.
We can have many R functions skip over the missing using the option na.rm = TRUE. We see this option used below. Some functions, like lm(), will do this automatically.
Summary Statistics
We can do some basic summary statistics on the data using mean() and sd(). We can also use the quantile() function to find out more about the distribution.
mean(x$a)
[1] 500.5
sd(x$a)
[1] 288.8194
quantile(x$a,c(1:5)/5)
20% 40% 60% 80% 100%
200.8 400.6 600.4 800.2 1000.0
# rowMeans(x[,c(1,3)])
colMeans(x[,c(1,3)])
a c
500.500 -1499.488
x$d <- NA
x[2:1000,]$d <- c(2:1000)/10
# mean(x$d)
mean(x$d, na.rm = TRUE)
[1] 50.1
Regression
The standard OLS function is lm(), for linear model. When this function is run it creates a list of objects which can be used for various things. For example, you can recall just the coefficient estimates or the residuals from the regression. A nice feature of these functions is that you can just use the variable names in the regression call. You can do this as long as you specify the data frame object using data =. There are various built-in non-linear regression functions that may be called using the glm() procedure.
x <- read.csv("x.csv", as.is = TRUE)
#summary(x)
lm1 <- lm(c ~ a, data=x)
summary(lm1)[[4]]
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.097396 0.0641312421 32.70475 4.934713e-160
a -3.000170 0.0001109953 -27029.69905 0.000000e+00
lm1$coefficients
(Intercept) a
2.097396 -3.000170
glm1 <- glm(c > -1500 ~ a, family = binomial(link=probit),
            data=x)
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
glm1$coefficients
(Intercept) a
3799.867658 -7.592143
Note that when R reads the data back in using the read.csv() function it creates an extra variable: the row names written out by write.csv() come back as a column.
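Two ways to avoid carrying that extra column around (a sketch; either line works with the file written above):
x <- read.csv("x.csv", as.is = TRUE, row.names = 1) # treat the first column as row names
# write.csv(x, "x.csv", row.names = FALSE)          # or don't write row names in the first place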
Control
This section discusses basic control methods used in R. There is a folklore that you cannot do loops in R. This is not quite correct. There are good and bad ways to do loops in R.
Loops
Looping is a fundamental part of programming. The basic loop is the “for-loop.” The “while-loop” is an alternative. The for() loop is better when there is a particular number of iterations, while the while() loop is better when you want it to stop once some condition is met.
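A minimal sketch of the two forms (the tolerance tol is purely illustrative):
# for-loop: the number of iterations is known in advance
total <- 0
for (i in 1:10) {
  total <- total + i
}
# while-loop: stop once a condition is met
gap <- 1
tol <- 0.001
while (gap > tol) {
  gap <- gap/2
}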
Looping in R
Looping in R has special challenges. Consider the following problem of creating a matrix with three consecutive numbers in each row.
# don't do it this way!!!
A <- NULL
start_time <- Sys.time()
for (i in 1:10000) {
  A <- rbind(A,c(i,i+1,i+2))
}
Sys.time() - start_time
Time difference of 0.2259481 secs
A[400,]
[1] 400 401 402
sum(A)
[1] 150045000
# A faster way
A <- matrix(NA,10000,3)
start_time <- Sys.time()
for (i in 1:10000) {
  A[i,] <- c(i,i+1,i+2)
}
Sys.time() - start_time
Time difference of 0.005707026 secs
A[400,]
[1] 400 401 402
sum(A)
[1] 150045000
# An even faster way! (Sometimes)
a <- c(1:10000)
start_time <- Sys.time()
A <- t(matrix(sapply(a, function(x) c(x,x+1,x+2)),nrow = 3))
Sys.time() - start_time
Time difference of 0.006199837 secs
A[400,]
[1] 400 401 402
sum(A)
[1] 150045000
There is a particular way not to loop in R. R loops are very, very slow if they involve creating and recreating objects. These types of loops also tend to be very memory hungry. In the previous example you can get a significant speed increase by creating the object at the start and then filling it in during the loop.
The example above also illustrates sapply(), one of the apply() family of functions. Using these functions is often the fastest way of looping. I don’t find them very intuitive and so I tend not to use this method, but it may be useful if you are trying to speed up some code.
If Else
R uses two different types of “if-else” commands. The command you will use most of the time is ifelse(). This function operates on vectors. The command takes three arguments. The first is the logical statement to be checked, the second is what to do if the statement is true, and the third is what to do if the statement is false. Note that, like other operations on vectors, this function works cell by cell.
a <- c(1,2,3,4,5)
b <- ifelse(a==3,82,a)
a
[1] 1 2 3 4 5
b
[1] 1 2 82 4 5
The more standard programming command is if (){} or if (){} else {}. This command checks the logical statement and then runs the command in the curly brackets. Note that you don’t have to specify the “else.”
A <- "Chris"
# A <- "Write Your Name Here"
if (A=="Chris") {
  print("Hey Chris")
} else {
  print(paste("Hey",A))
}
[1] "Hey Chris"
Optimization
Optimization is fundamental to statistical modeling. The main command in R uses functions as inputs, so the section discusses how to create and use these objects. It also discusses the basic optimizer, optim().
Functions
If you plan to have procedures that you will use repeatedly, then it may be more efficient to create a function. Functions also help keep your script clean and tidy. They also make debugging easier by narrowing down where the bug is located.
y <- x[,c(2,4)]
apply(y, 2, mean)
a c
500.500 -1499.488
colMeans(y)
a c
500.500 -1499.488
b <- c(1:dim(y)[1])
summary(sapply(b, function(x) sum(y[x,])), digits = 2)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-2000.0 -1500.0 -1000.0 -1000.0 -500.0 0.5
summary(rowSums(y), digits = 2)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-2000.0 -1500.0 -1000.0 -1000.0 -500.0 0.5
We can define functions on the fly, like when we use the apply() function.
my_mean <- function(x) {
  if (is.numeric(x)) {
    return(mean(x, na.rm = TRUE))
  } else return("Not Numeric!")
}
my_mean(x$b)
[1] "Not Numeric!"
my_mean(x$a)
[1] 500.5
my_mean(c(x$a,NA,mean(x$a)))
[1] 500.5
When writing functions it is good practice to check that the input is the correct type and provide a warning or error code if it is not the expected type.
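For example, here is a sketch of the same idea using stop() to throw an error instead of returning a string (the name my_mean2 is made up for illustration):
my_mean2 <- function(x) {
  if (!is.numeric(x)) stop("x must be numeric")
  mean(x, na.rm = TRUE)
}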
lm_iv <- function(Y_in, X_in, Z_in = X_in, Reps = 100,
                  min_in = 0.05, max_in = 0.95) {
  # takes in the y variable, x explanatory variables
  # and the z variables if available.
  # defaults: Z_in = X_in,
  # Reps = 100, min_in = 0.05, max_in = 0.95
  # Set up
  index_na <- is.na(rowSums(cbind(Y_in,X_in,Z_in)))
  Yt <- as.matrix(Y_in[index_na==0])
  Xt <- as.matrix(cbind(1,X_in))
  Xt <- Xt[index_na==0,]
  Zt <- as.matrix(cbind(1,Z_in))
  Zt <- Zt[index_na==0,]
  # turns the inputs into matrices
  # removes observations with any missing values
  # adds a column of 1s to X and Z
  N_temp <- dim(Xt)[1] # number of complete observations
  # Bootstrap
  r <- c(1:Reps)
  ibs <- sapply(r, function(x)
    round(runif(N_temp, min = 1, max = N_temp)))
  # each column of ibs is one bootstrap draw of row indices
  bs_temp <- t(sapply(r, function(x)
    solve(t(Zt[ibs[,x],])%*%Xt[ibs[,x],])%*%t(Zt[ibs[,x],])%*%Yt[ibs[,x]]))
  # each row of bs_temp is one bootstrap IV estimate
  # Present results
  res_temp <- matrix(NA,dim(Xt)[2],4)
  res_temp[,1] <- colMeans(bs_temp)
  for (j in 1:dim(Xt)[2]) {
    res_temp[j,2] <- sd(bs_temp[,j])
    res_temp[j,3] <- quantile(bs_temp[,j],min_in)
    res_temp[j,4] <- quantile(bs_temp[,j],max_in)
  }
  colnames(res_temp) <-
    c("coef","sd",as.character(min_in),as.character(max_in))
  return(res_temp)
}
Here is a version of the instrumental variable function. This function uses matrix algebra to calculate the instrumental variable regression. It presents the bootstrap results. The input variables listed at the start of the function can be set to default values. This means that the user doesn’t have to enter these values. At the start of the function you can define objects and variables. This is a good place to confirm that objects are what you think they are. This version of the function uses sapply() to do the bootstrap loop.149
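A minimal usage sketch on simulated data (the variable names y1, x1 and z1 are made up for illustration):
set.seed(123456789)
z1 <- rnorm(500)            # instrument
x1 <- z1 + rnorm(500)       # regressor driven by the instrument
y1 <- 1 + 2*x1 + rnorm(500) # outcome
lm_iv(Y_in = y1, X_in = x1, Z_in = z1, Reps = 200)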
optim()
The standard optimization procedure in R is optim(). The algorithms used in R were developed by John C. Nash (no relation to the Nobel prize winning mathematician) in the 1970s. Nash developed the optimizer in BASIC in order to run on the particular machine he was using at the time. According to Nash, there are better optimization algorithms available (Nash 2014).
f_ols <- function(par_in, Y_in, X_in) {
  beta <- par_in
  Y_in <- as.matrix(Y_in)
  X_in <- as.matrix(cbind(1,X_in))
  if (length(beta)==dim(X_in)[2]) {
    return(mean((Y_in - X_in%*%beta)^2, na.rm = TRUE))
  } else {
    return("The number of parameters doesn't match.")
  }
}
lm_ols <- optim(par=c(2,-3),fn=f_ols,Y_in=x$c,X_in=x$a)
lm_ols
$par
[1] 2.097287 -3.000170
$value
[1] 1.024609
$counts
function gradient
67 NA
$convergence
[1] 0
$message
NULL
lm(x$c ~ x$a)
Call:
lm(formula = x$c ~ x$a)
Coefficients:
(Intercept) x$a
2.097 -3.000
Above is a simple least squares function. The problem is solved using optim(). The function optim() takes initial starting values, a function to optimize and variable values. It has a large number of default settings. In particular, it uses the Nelder-Mead procedure as a default. One thing to note is that this procedure may give different results depending on the starting values used. What happens if you change the starting values to par=c(1,1)?
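One way to try it (a minimal sketch; Nelder-Mead is a local search, so the results may move slightly with the starting values):
optim(par=c(1,1), fn=f_ols, Y_in=x$c, X_in=x$a)$par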
Discussion and Further Reading
While I have been programming for a very long time, I am not a programmer. The objective of this chapter is to give you some basic insights into R. If you are interested in really learning how to program in R, then I suggest purchasing one of the many books out there or taking an online course. I highly recommend Kabacoff (2011), as it is written by a statistician/computer scientist. I also suggest Paarsch and Golyaev (2016) as an excellent introduction to data analysis in economics.
Note
The views expressed here are those of the author and do not reflect those of the Federal Trade Commission or the Congressional Budget Office.
I am grateful for the help I received from friends and colleagues who read and commented on the book. I am particularly grateful to Emek Basker, David Vanness and Greg Kirwin who read early versions of the early chapters. I am grateful to George Deltas, Peter Newberry, Joris Pinkse and Nathan Wilson, for their thoughts and insights into the “structural chapters.” Also, to Ott Toomet and Grant Farnsworth who reviewed sections at the request of the editor. Finally, I am very grateful to Brian Krauth, David Prentice and Devesh Raval who worked through the whole text making suggestions and edits.
All errors are my own.
References
Footnotes
We use the term regression to refer to a function that summarizes the relationship in the data.↩︎
An estimator is the mathematical method by which the relationship in the data is determined.↩︎
http://www.opn.ca6.uscourts.gov/opinions.pdf/11a0084p-06.pdf↩︎
https://www.cbo.gov/sites/default/files/Public_Policy_Issues_Memo_Rivlin_1976.pdf↩︎
In almost all cases, you should punt on fourth down.↩︎
Unfortunately, there are some issues with characters, so instead I just present the table of regression results, which is the 4th element of the list created by summary().↩︎
OLS can estimate models where the unobserved characteristic enters multiplicatively as this model is additive in logs.↩︎
Under additional assumptions we can use other algorithms such as maximum likelihood or the generalized method of moments.↩︎
For a matrix algebra refresher check out Khan Academy and Sol’s excellent presentation of the basic concepts. Appendix B presents additional discussion of the use of matrices in R.↩︎
This is useful to remember if you are ever hungry in Athens because it is the first letter of the famous Greek sandwich, the souvlaki (Σουβλάκι).↩︎
See discussion in Appendix A.↩︎
An estimate is said to be biased if we take a large number of imaginary samples and the average imaginary estimate differs from the true value. This issue is discussed in detail in Appendix A.↩︎
No. In matrix algebra, order matters!↩︎
They are not generally equal because we didn’t include the unobserved term.↩︎
The rank of the matrix refers to the number of linearly independent columns (or rows). A matrix is full-rank if all of its columns (rows) are linearly independent of each other.↩︎
A transpose of a matrix is the same matrix with the columns and rows swapped.↩︎
Remember that for matrix multiplication the “inside” numbers need to match.↩︎
We traditionally use the “hat” notation to indicate that a symbol is the estimate of the corresponding true parameter.↩︎
In the simulation the unobserved term has a mean of 0. As mentioned above, the average of the unobserved term gets closer to zero when the sample size is larger.↩︎
Under certain conditions the optimal value, minimum or maximum, can be determined by the first order condition. The first order condition can be found by taking the derivative and setting the equation to zero.↩︎
This is arbitrary, but I know the true values lie between these two values.↩︎
See the discussion above for the previous estimators.↩︎
The derivation of these numbers is discussed in Appendix A.↩︎
This and other similar surveys can also be found at the US Bureau of Labor Statistics (https://www.bls.gov/nls/home.htm).↩︎
This data is found here: https://sites.google.com/view/microeconometricswithr/table-of-contents Note that before running this code you need to create a script, say Card.R, and save that file with your data. You can then go to the RStudio menu and set the working directory to the location of your script (Session -> Set Working Directory -> To Source File Location).↩︎
Actually, the combined effect is captured by adding the two coefficients together. The model can’t separate the two effects.↩︎
The table uses the stargazer package (Hlavac 2018). If you are using stargazer in Sweave then start the chunk with results=tex embedded in the chunk header.↩︎
Matrix rank refers to the number of linearly independent columns or rows.↩︎
Micronumerosity refers to the problem of small sample size.↩︎
As an exercise try to replicate the whole table. Note that you will need to carefully read the discussion of how the various variables are created.↩︎
See discussion in Chapter 1 regarding interpreting the coefficient.↩︎
The UCLA computer scientist, Judea Pearl, is a proponent of using DAGs in econometrics. These diagrams are models of how the data is generated. The associated algebra helps the econometrician and the reader determine the causal relationships and whether or not they can be estimated (Pearl and Mackenzie 2018).↩︎
It is the reciprocal because we follow the arrow backwards.↩︎
You should think about the reasonableness of this assumption for the problem discussed below.↩︎
See discussion of hypothesis testing in Appendix A.↩︎
See earlier discussion.↩︎
The code book for the data set is located here: https://sites.google.com/view/microeconometricswithr/table-of-contents↩︎
See discussion of hypothesis testing in Appendix A.↩︎
It may be more appropriate to use a probit or logit. These models are discussed in Chapter 5.↩︎
Alexandria Ocasio-Cortez is often referred to as AOC.↩︎
Some of these changes are discussed in Chapter 1.↩︎
This chapter discusses the second, but the two are mathematically related.↩︎
Field experiments are randomized trials in which people or villages or schools are assigned between trial arms.↩︎
If we don’t know the possible values, we could use the observed values. This assumption may be less “natural” than we may prefer.↩︎
Would an IV estimator discussed in Chapter 3 satisfy Assumption 2?↩︎
Does this variable satisfy the assumptions of an instrumental variable?↩︎
See New State Ice Co vs. Liebmann 285 US 262 (1932).↩︎
https://www.nobelprize.org/prizes/economic-sciences/2000/mcfadden/lecture/↩︎
https://www.ftc.gov/system/files/documents/public_statements/295971/130103googlesearchstmtofcomm.pdf↩︎
This is a standard assumption in economics. It means that the unobserved utility can be thought of as money.↩︎
In machine learning, discrete choice is referred to as a classification problem.↩︎
This is actually a probit model presented in more detail below.↩︎
Independent means that if I know the first two coin tosses result in Heads, the probability of a Head in the third coin toss is the same as if I saw two Tails or any other combination. The previous coin toss provides no additional information about the results of the next coin toss, if the true probability is known.↩︎
See discussion of the analogy principle in Appendix A.↩︎
Optimums do not vary with monotonic transformations.↩︎
The function
optimize()
is used when optimizing over one variable, whileoptim()
is used for optimizing over multiple variables.↩︎This has to do with the ability of this function to quickly run through vectors. See Appendix B for a discussion of programming in
R
.↩︎This assumes that the outcomes are independent and identically distributed.↩︎
See the formula for standard normal.↩︎
Thanks to Joris Pinkse for pointing this issue out.↩︎
This issue is discussed in Chapter 1.↩︎
Note in the simulated data, the
term is dropped so that the assumption holds.↩︎It is used in neural network estimation models. In machine learning this is called a “sigmoid” function.↩︎
It is this property that makes it useful as an “activation” function in neural network models.↩︎
See Goldberger (1991) for the best exposition that I am aware of.↩︎
The logit function has the advantage that it is very easy to estimate, but the disadvantage that places a lot of restrictions on how preferences determine choices. Given these restrictions there is interest in using more flexible models such as the multi-layered logit (deep neural-network) or the mixed logit model discussed in Nevo (2000).↩︎
We will use the term “choice” to refer to assignment to treatment that is associated with some sort of economic decision. It does not mean that the observed individual actually had a “choice.”↩︎
See Appendix A for a longer discussion of what it means for an estimator to be unbiased.↩︎
According to former Tobin students, the estimator once took an army of graduate research assistants armed with calculators to estimate the model.↩︎
There are various exemptions such as for employees receiving tips.↩︎
This data and other similar data sets are available from the Bureau of Labor Statistics here: https://www.nlsinfo.org/investigator/pages/login.jsp. This version can be downloaded from here: https://sites.google.com/view/microeconometricswithr/table-of-contents↩︎
When dealing with joint distributions it is useful to remember the relationship Pr(A and B) = Pr(A | B) Pr(B) = Pr(B | A) Pr(A), where A and B represent events associated with each of two random variables. The term on the left-hand side represents the joint distribution. The middle term is the probability of observing A conditional on the value of B, multiplied by the probability of observing B. The term on the far right is the other way around.↩︎
I can never remember exactly how to do this, so I always keep a copy of Goldberger (1991) close by. In
R
you can sometimes get around remembering all this and use a package likemvtnorm
to account for multivariate normal distributions.↩︎See discussion of the bivariate probit in the previous chapter.↩︎
Assuming a 50 week work-year.↩︎
Again, this data set is available from the Bureau of Labor Statistics here: https://www.nlsinfo.org/investigator/pages/login.jsp. This version can be downloaded from here: https://sites.google.com/view/microeconometricswithr/table-of-contents↩︎
See a discussion of this data in Chapters 1 and 2.↩︎
I have simplified things somewhat in order to use the Heckman model presented above.↩︎
This is often called a control function, and the approach is called a control function approach.↩︎
The next chapter explores this approach.↩︎
See Chapter 3 for discussion of the assumptions.↩︎
https://perso.telecom-paristech.fr/eagan/class/igr204/datasets↩︎
Note in the Berry model we take the log of shares. Remember that the log of 0 is minus infinity, so the code adds a small number to all the shares.↩︎
The cleaned data is available here: https://sites.google.com/view/microeconometricswithr/↩︎
The number $267,888,011 is total revenue from the original Dominicks sales data, wcer.csv. Here we used a fraction of the original data (52/367) to get the annual amount.↩︎
The version of the data used here is available at https://sites.google.com/view/microeconometricswithr/table-of-contents↩︎
The variance of a random variable X is E[(X − μ)²], where μ = E[X].↩︎
For simplicity we won’t consider options to kick a field goal.↩︎
You should try different values.↩︎
eMarketer.com, 7/26/16↩︎
Bidders choose the same bid strategy in both auction formats.↩︎
This is a Nash equilibrium where it is assumed that players use Bayes’ rule to update their information given the equilibrium strategies. Because of the IPV assumption, there is no information provided by the other bidders. This is not the case in common values auctions.↩︎
A non-parametric estimator makes no parametric assumptions.↩︎
This is a finite difference estimator.↩︎
The probability of winning is the probability that your bid is higher than the other bidders in the auction.↩︎
This is an example of using finite differences to calculate numerical derivatives.↩︎
The kernel density method assumes the distribution can be approximated as a mixture of normal distributions.↩︎
Officially, the price may be a small increment above the bid of the second highest bidder. We will ignore this possibility.↩︎
You may think of this as just some academic question. But the US Senator for Idaho, Senator Church, was not happy with the decision. “In fact, there is a growing body of evidence that shows that serious economic dislocations may already be occurring as a result of the sealed bid requirement.” See Congressional Record September 14 1977, p. 29223.↩︎
http://www.econ.yale.edu/~pah29/timber/timber.htm. The version used here is available from here: https://sites.google.com/view/microeconometricswithr/table-of-contents.↩︎
Baldwin, Marshall, and Richard (1997) discuss the importance of various observable characteristics of timber auctions.↩︎
It is approximately normal, but it is skewed somewhat to lower values. This may be due to low bids in the English auction. How does the distribution look if only sealed bids are graphed?↩︎
Asker (2010) presents a detailed account of a bid-ring in stamp auctions.↩︎
In the bid ring mechanism discussed in Asker (2010), the collusion actually leads to higher prices in the main auction.↩︎
See Equation 150 for the other cases.↩︎
For simplicity it is assumed that all of these auctions have six bidders. Once there are a large enough number of bidders in the auction, prices do not really change with more bidders. In fact, these methods may not work as the number of bidders gets large (Deltas 2004).↩︎
You need to transform both objects from matrices to vectors in order to run the regression.↩︎
Card and Krueger (1994) present results on a measure called full-time equivalence. It is unclear how that measure is calculated.↩︎
Chapter 11 uses a mixture model to account for this heterogeneity.↩︎
See Appendix A for a discussion of this property.↩︎
It can be consistently estimated as the number of time periods gets large, but microeconometric panels tend not to have a large number of time periods.↩︎
The average will tend to zero. See Chapter 1 and Appendix A.↩︎
An alternative approach to dealing with the nuisance estimator is to assume that the nuisance parameter has a particular distribution. This approach is generally called “random effects.”↩︎
The data is available here: https://sites.google.com/view/microeconometricswithr/table-of-contents.↩︎
In particular, it is generated using a convex factor model, which is discussed in more detail below.↩︎
Eventually, the estimator also runs into collinearity and multicollinearity problems (Doudchenko and Imbens 2016).↩︎
Another standard regularization procedure is to have a penalty equal to the square of the values. This is called a ridge regression.↩︎
From Chapter 1 we know that the standard OLS model gives unbiased estimates. If the model presented above is true, then this regularized OLS model must give biased estimates of the true weights. The importance of this is up for debate.↩︎
Mixture models are discussed in Chapter 11.↩︎
Remember that with matrices, order matters.↩︎
Auto-correlation refers to correlation in the unobserved characteristics over time.↩︎
Try re-doing without aggregation. The auto-correlation in the data will tend to imply that there are only two factors.↩︎
One issue with the estimator is that it will tend to run the estimate of the standard deviation down to zero if the starting value is too small.↩︎
See Appendix A for a discussion of the use of Bayesian updating.↩︎
kmeans() is a method for grouping correlated observations.↩︎
Levine, Hunter, and Chauveau (2011) presents a similar algorithm which is guaranteed to converge.↩︎
An expectation-maximization (EM) algorithm is a widely used alternative approach involving parametric assumptions.↩︎
The fact that particular individuals may be poor at estimating both their own or their sibling’s education level will not affect the procedure used here except through sampling correlation. See Ashenfelter and Krueger (1994) for discussion of this issue in regards to the IV approach.↩︎
The data is available here: https://dataspace.princeton.edu/jspui/handle/88435/dsp01rv042t084↩︎
These assumptions are discussed in Chapter 3.↩︎
The ideas in this section are inspired by recent work by Jean-Marc Robin, an econometrician at Sciences Po in Paris.↩︎
I re-ran the mixture model a number of times. I began with all nine types and each time I removed the type which were estimated to be the smallest.↩︎
We should be a little careful because we are using a normal distribution to approximate the density.↩︎
The prior distribution is the probability distribution over the true values prior to observing the data. The posterior distribution is distribution after observing the data.↩︎
Levine, Hunter, and Chauveau (2011) shows that a similar algorithm does converge.↩︎
An “at bat” refers to each time the batter goes up to bat, but does not include the times when the batter is “walked” and moves to first base without hitting the ball.↩︎
Of course, there is an assumption. We are assuming random sampling. That is each at-bat is iid.↩︎
Go to http://www.seanlahman.com/baseball-archive/statistics/ This is almost 20,000 players.↩︎
A more modern game-theoretic approach would use the Nash equilibrium and set it up as a game of incomplete information. That is, the Statistician does not know Nature’s type. Each type plays an equilibrium strategy in choosing the true state of the world. The Statistician does not know Nature’s type, however by interacting with Nature through experiments, the Statistician can learn Nature’s type. In Bayes-Nash equilibrium, the Statistician uses Bayes’ rule to update the distribution over Nature’s type and the true state of the world.↩︎
In the case of a + b, it did the following operations: 1 + 4, 2 + 5, 3 + 4. That is, R simply started over again with the first element of b. This is referred to as recycling.↩︎
In some cases it may be faster to call a compiled language like C. It is also possible to create compiled functions in R, see Kabacoff (2011).↩︎