Chapter 8 Frisch-Waugh-Lovell Theorem
8.1 Theorem in plain English
The Frisch-Waugh-Lovell Theorem (FWL; after the initial proof by Frisch and Waugh (1933), and later generalisation by Lovell (1963)) states that:
Any predictor’s regression coefficient in a multivariate model is equivalent to the coefficient estimated from a bivariate model in which the residualised outcome is regressed on the residualised predictor, where the residuals are taken from separate regressions of the outcome and of that predictor on all other predictors in the multivariate model.
More formally, assume we have a multivariate regression model with k predictors:
$$y = \hat{\beta}_1 x_1 + ... + \hat{\beta}_k x_k + \hat{\epsilon}.$$
FWL states that every $\hat{\beta}_j$ in the multivariate model above is equal to $\hat{\beta}^*_j$ (and the residuals $\hat{\epsilon} = \epsilon^*$) in:
$$\epsilon_y = \hat{\beta}^*_j \epsilon_{x_j} + \epsilon^*$$
where:
$$\epsilon_y = y - \sum_{k \neq j} \hat{\beta}_{y_k} x_k, \qquad \epsilon_{x_j} = x_j - \sum_{k \neq j} \hat{\beta}_{x_j k} x_k,$$
where $\hat{\beta}_{y_k}$ and $\hat{\beta}_{x_j k}$ are the coefficients from two separate regressions: of the outcome on all predictors except $x_j$, and of $x_j$ on those same predictors, respectively.
In other words, FWL states that each predictor’s coefficient in a multivariate regression captures the relationship between that predictor and the part of the outcome not explained by the other $k-1$ predictors, once the predictor itself has also been stripped of its relationship with those $k-1$ predictors, i.e. the independent effect of $x_j$.
8.2 Proof
8.2.1 Primer: Projection matrices
We need two important types of projection matrices to understand the linear algebra proof of FWL. First, the prediction matrix that was introduced in Chapter 4:
$$P = X(X'X)^{-1}X'.$$
Recall that this matrix, when applied to an outcome vector ($y$), produces a set of predicted values ($\hat{y}$). Reverse engineering this, note that $\hat{y} = X\hat{\beta} = X(X'X)^{-1}X'y = Py$.
Since $Py$ produces the predicted values from a regression on $X$, we can define its complement, the residual maker:
$$M = I - X(X'X)^{-1}X',$$
since $My = y - X(X'X)^{-1}X'y \equiv y - Py \equiv y - X\hat{\beta} \equiv \hat{\epsilon}$, the residuals from regressing $y$ on $X$.
Given these definitions, note that M and P are complementary:
$$
\begin{aligned}
y &= \hat{y} + \hat{\epsilon} \equiv Py + My \\
Iy &= Py + My \\
Iy &= (P + M)y \\
I &= P + M.
\end{aligned}
$$
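To make the primer concrete, here is a minimal R sketch (the simulated data and object names are mine, not from the sources cited) that builds $P$ and $M$ for a small design matrix and checks that $P + M = I$ and that $My$ reproduces the residuals from lm():
set.seed(1)
X <- cbind(1, rnorm(100))                 # design matrix with a constant
y <- 2 + 0.5 * X[, 2] + rnorm(100)
P <- X %*% solve(t(X) %*% X) %*% t(X)     # prediction matrix
M <- diag(nrow(X)) - P                    # residual maker
all.equal(P + M, diag(nrow(X)))                                    # I = P + M
all.equal(as.vector(M %*% y), unname(residuals(lm(y ~ X[, 2]))))   # My = OLS residuals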
With these projection matrices, we can express the FWL claim (which we need to prove) as:
$$
\begin{aligned}
y &= X_1\hat{\beta}_1 + X_2\hat{\beta}_2 + \hat{\epsilon} \\
M_1y &= M_1X_2\hat{\beta}_2 + \hat{\epsilon},
\end{aligned}
$$
8.2.2 FWL Proof
Let us assume, as in the partitioned model above, that:
$$Y = X_1\hat{\beta}_1 + X_2\hat{\beta}_2 + \hat{\epsilon}.$$
First, we can pre-multiply both sides by $M_1$, the residual maker with respect to $X_1$:
$$M_1Y = M_1X_1\hat{\beta}_1 + M_1X_2\hat{\beta}_2 + M_1\hat{\epsilon},$$
which first simplifies to:
$$M_1Y = M_1X_2\hat{\beta}_2 + M_1\hat{\epsilon},$$
because $M_1X_1\hat{\beta}_1 \equiv (M_1X_1)\hat{\beta}_1 \equiv 0\hat{\beta}_1 = 0$. In plain English, all the variance in $X_1$ is, by definition, explained by $X_1$ itself, so a regression of $X_1$ on itself leaves nothing unexplained and $M_1X_1$ is a matrix of zeros.
Second, we can simplify this equation further because, by the properties of OLS regression, $X_1$ and $\hat{\epsilon}$ are orthogonal, so residualising the residuals leaves them unchanged ($M_1\hat{\epsilon} = \hat{\epsilon}$): the residual of the residuals is the residuals! Hence:
$$M_1Y = M_1X_2\hat{\beta}_2 + \hat{\epsilon}. \; \square$$
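As a quick numerical illustration of the proof (a sketch with made-up data; the object names are mine), we can build $M_1$ explicitly, confirm that it annihilates $X_1$, and check that regressing $M_1y$ on $M_1x_2$ recovers the multivariate coefficient on $x_2$:
set.seed(4)
n  <- 300
x1 <- rnorm(n)
x2 <- 0.6 * x1 + rnorm(n)                              # x2 correlated with x1
y  <- 1 + 0.5 * x1 + 2 * x2 + rnorm(n)
X1 <- cbind(1, x1)                                     # constant + x1
M1 <- diag(n) - X1 %*% solve(t(X1) %*% X1) %*% t(X1)   # residual maker for X1
max(abs(M1 %*% X1))                                    # numerically zero: M1 X1 = 0
y_r  <- drop(M1 %*% y)                                 # M1 y
x2_r <- drop(M1 %*% x2)                                # M1 x2
coef(lm(y ~ x1 + x2))["x2"]                            # beta_2 from the full regression
coef(lm(y_r ~ x2_r))["x2_r"]                           # identical beta_2 from the partialled model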
A couple of interesting features come out of the linear algebra proof:
FWL also holds for bivariate regression when you first residualise Y and X on an n×1 vector of 1’s (i.e. the constant) – which is equivalent to demeaning the outcome and predictor before regressing the two (see the short sketch after this list).
$X_1$ and $X_2$ are technically matrices of mutually exclusive predictors, i.e. $X_1$ is an $n \times k$ matrix $\{x_1,...,x_k\}$ and $X_2$ is an $n \times m$ matrix $\{x_{k+1},...,x_{k+m}\}$, where $\beta_1$ is a corresponding vector of regression coefficients $\beta_1 = \{\gamma_1,...,\gamma_k\}$, and likewise $\beta_2 = \{\delta_1,...,\delta_m\}$, such that:
$$Y = X_1\hat{\beta}_1 + X_2\hat{\beta}_2 + \hat{\epsilon} = x_1\hat{\gamma}_1 + ... + x_k\hat{\gamma}_k + x_{k+1}\hat{\delta}_1 + ... + x_{k+m}\hat{\delta}_m + \hat{\epsilon},$$
Hence the FWL theorem is exceptionally general, applying not only to arbitrarily long coefficient vectors, but also enabling you to back out estimates from any partitioning of the full regression model.
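The first point is easy to check in R. A minimal sketch (with invented data): residualising on a column of 1’s is the same as demeaning, and the demeaned bivariate regression returns the usual slope:
set.seed(5)
x <- rnorm(100)
y <- 3 + 1.5 * x + rnorm(100)
# Residualising on a constant is just demeaning
all.equal(residuals(lm(y ~ 1)), y - mean(y), check.attributes = FALSE)
coef(lm(y ~ x))["x"]                               # usual bivariate slope
coef(lm(I(y - mean(y)) ~ I(x - mean(x)) - 1))      # demeaned, no intercept: same slope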
8.3 Coded example
set.seed(89)
## Generate random data
df <- data.frame(y = rnorm(1000,2,1.5),
x1 = rnorm(1000,1,0.3),
x2 = rnorm(1000,1,4))
## Partial regressions
# Residual of y regressed on x1
y_res <- lm(y ~ x1, df)$residuals
# Residual of x2 regressed on x1
x_res <- lm(x2 ~ x1, df)$residuals
resids <- data.frame(y_res, x_res)
## Compare the beta values for x2
# Multivariate regression:
summary(lm(y~x1+x2, df))
##
## Call:
## lm(formula = y ~ x1 + x2, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.451 -1.001 -0.039 1.072 5.320
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.33629 0.16427 14.222 <2e-16 ***
## x1 -0.31093 0.15933 -1.952 0.0513 .
## x2 0.02023 0.01270 1.593 0.1116
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.535 on 997 degrees of freedom
## Multiple R-squared: 0.006252, Adjusted R-squared: 0.004258
## F-statistic: 3.136 on 2 and 997 DF, p-value: 0.04388
# Partial regression:
summary(lm(y_res ~ x_res, resids))
##
## Call:
## lm(formula = y_res ~ x_res, data = resids)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.451 -1.001 -0.039 1.072 5.320
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.986e-17 4.850e-02 0.000 1.000
## x_res 2.023e-02 1.270e-02 1.593 0.111
##
## Residual standard error: 1.534 on 998 degrees of freedom
## Multiple R-squared: 0.002538, Adjusted R-squared: 0.001538
## F-statistic: 2.539 on 1 and 998 DF, p-value: 0.1114
Note: This isn’t an exact demonstration because of a degrees-of-freedom difference. The coefficient estimates are identical, but the (correct) multivariate regression uses N−3 degrees of freedom, since three parameters (the intercept and two slopes) are estimated. The partial regression uses N−2 degrees of freedom, which does not account for the additional degree of freedom lost by partialling out x1, so the reported standard errors differ very slightly.
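If we want the partial regression to reproduce the multivariate standard error exactly, we can rescale it by hand. A small sketch (reusing df and resids from above; the adjustment factor follows because the residual sums of squares of the two models are identical):
## Rescale the partial-regression SE to use the correct N - 3 degrees of freedom
n <- nrow(df)
se_partial <- summary(lm(y_res ~ x_res, resids))$coefficients["x_res", "Std. Error"]
se_partial * sqrt((n - 2) / (n - 3))   # matches the x2 SE from the multivariate model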
8.4 Application: Sensitivity analysis
Cinelli and Hazlett (2020) develop a series of tools for researchers to conduct sensitivity analyses on regression models, using an extension of the omitted variable bias framework. To do so, they use FWL to motivate this bias. Suppose that the full regression model is specified as:
$$Y = \hat{\tau}D + X\hat{\beta} + \hat{\gamma}Z + \hat{\epsilon}_{full},$$
where $\hat{\tau}, \hat{\beta}, \hat{\gamma}$ are estimated regression coefficients, $D$ is the treatment variable, $X$ are observed covariates, and $Z$ are unobserved covariates. Since $Z$ is unobserved, researchers instead estimate:
$$Y = \hat{\tau}_{Obs.}D + X\hat{\beta}_{Obs.} + \hat{\epsilon}_{Obs.}$$
By FWL, we know that $\hat{\tau}_{Obs.}$ is equivalent to the coefficient from regressing the residualised outcome (with respect to $X$) on the residualised treatment $D$ (again with respect to $X$). Call these two residualised variables $Y_r$ and $D_r$.
Recall that the final stage of the partial regressions is a bivariate model ($Y_r \sim D_r$). Conveniently, a bivariate regression coefficient can be expressed in terms of the covariance between the left-hand and right-hand side variables:
$$\hat{\tau}_{Obs.} = \frac{cov(D_r, Y_r)}{var(D_r)}.$$
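As a quick check of this expression (reusing the y_res and x_res objects simulated in the coded example above), the ratio of the sample covariance to the variance reproduces the estimated slope:
## Bivariate slope as a ratio of covariance to variance
cov(resids$y_res, resids$x_res) / var(resids$x_res)
# identical to the x_res coefficient from lm(y_res ~ x_res, resids)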
Note that, given the full regression model above, the residualised outcome $Y_r$ is actually composed of the elements $\hat{\tau}D_r + \hat{\gamma}Z_r$, and so:
$$\hat{\tau}_{Obs.} = \frac{cov(D_r, \hat{\tau}D_r + \hat{\gamma}Z_r)}{var(D_r)}$$
Next, we can expand the covariance using the rule that $cov(A, B+C) = cov(A,B) + cov(A,C)$, and since $\hat{\tau}$ and $\hat{\gamma}$ are scalars, we can move them outside the covariance terms:
$$\hat{\tau}_{Obs.} = \frac{\hat{\tau}cov(D_r, D_r) + \hat{\gamma}cov(D_r, Z_r)}{var(D_r)}$$
Since $cov(A,A) = var(A)$, it follows that:
$$\hat{\tau}_{Obs.} = \frac{\hat{\tau}var(D_r) + \hat{\gamma}cov(D_r, Z_r)}{var(D_r)} \equiv \hat{\tau} + \hat{\gamma}\frac{cov(D_r, Z_r)}{var(D_r)} \equiv \hat{\tau} + \hat{\gamma}\hat{\delta}$$
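To see the decomposition at work, here is an illustrative simulation sketch (the data-generating values and object names are invented; this is not Cinelli and Hazlett’s own code) confirming that the observed coefficient equals $\hat{\tau} + \hat{\gamma}\hat{\delta}$:
set.seed(10)
n <- 10000
X <- rnorm(n)
Z <- 0.5 * X + rnorm(n)                    # 'unobserved' confounder
D <- 0.3 * X + 0.6 * Z + rnorm(n)          # treatment
Y <- 2 * D + 1 * X + 1.5 * Z + rnorm(n)    # true tau = 2, gamma = 1.5
tau_obs   <- coef(lm(Y ~ D + X))["D"]      # what the researcher can estimate
full      <- lm(Y ~ D + X + Z)             # infeasible full regression
tau_hat   <- coef(full)["D"]
gamma_hat <- coef(full)["Z"]
D_r <- residuals(lm(D ~ X))                # D residualised on X
Z_r <- residuals(lm(Z ~ X))                # Z residualised on X
delta_hat <- cov(D_r, Z_r) / var(D_r)
tau_obs
tau_hat + gamma_hat * delta_hat            # identical: tau_obs = tau + gamma * delta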
Frisch-Waugh is so useful here because it simplifies a multivariate equation into a bivariate one. While computationally this makes zero difference (unlike in the days of hand computation), it allows us to use a convenient expression of the bivariate coefficient to show and quantify the bias that arises when you run a regression in the presence of an unobserved confounder. Moreover, note that when we decompose $Y_r$ above, we implicitly use FWL again, since we know that the non-stochastic part of $Y$ not explained by $X$ consists of the residualised components of the full model.
8.5 Regressing the partialled-out X on the full Y
Finally, in Mostly Harmless Econometrics (MHE; Angrist and Pischke (2009)), the authors note that you also get an identical coefficient to the full regression if you regress the non-residualised outcome on the residualised predictor. Notice this is different from standard FWL, where you also residualise the outcome before estimating regression coefficients.
This feature is interesting, not least because it reduces the number of steps one has to do in order to manually estimate the coefficients! To see why this claim holds, we can use a similar strategy to the OVB example above.
Let’s consider the full regression model to be:
$$Y = \beta_1X_1 + \beta_2X_2 + \epsilon.$$
and the model where only X1 is residualised as:
$$Y = \beta^*_1M_2X_1 + \epsilon^*.$$
We want to show that $\beta_1 = \beta^*_1$.
We can start by expressing the coefficient from the residualised equation as follows:
$$\beta^*_1 = \frac{cov(Y, M_2X_1)}{var(M_2X_1)}$$
Next we can substitute $Y$ with the right-hand side of the full regression model above. We then apply covariance rules to move the scalar beta values outside the covariances, and split the fraction into simpler parts:
$$
\begin{aligned}
\beta^*_1 &= \frac{cov(\beta_1X_1 + \beta_2X_2 + \epsilon, M_2X_1)}{var(M_2X_1)} \\
&= \beta_1\frac{cov(X_1, M_2X_1)}{var(M_2X_1)} + \beta_2\frac{cov(X_2, M_2X_1)}{var(M_2X_1)} + \frac{cov(\epsilon, M_2X_1)}{var(M_2X_1)}
\end{aligned}
$$
Notice that, by definition, ϵ is the component of the outcome that is unexplained by both X1 and X2, therefore its covariance with the residualised component of X1 will be zero. Likewise, M2X1 is that part of X1 that is unrelated to X2, and so similarly cov(X2,M2X1)=0. Hence, the last two terms above drop out of the equation:
$$
\begin{aligned}
\beta^*_1 &= \beta_1\frac{cov(X_1, M_2X_1)}{var(M_2X_1)} + \beta_2\frac{0}{var(M_2X_1)} + \frac{0}{var(M_2X_1)} \\
&= \beta_1\frac{cov(X_1, M_2X_1)}{var(M_2X_1)}
\end{aligned}
$$
Now, to reach the end of the proof, the remaining fraction needs to equal 1; that is, we need $cov(X_1, M_2X_1) = var(M_2X_1)$. I don’t find this equality intuitive at all, so let’s briefly prove it as a lemma.
Let’s define the covariance of $X_1$ and $M_2X_1$ in terms of expectations, and then rearrange the terms (this is the standard rearrangement in many textbooks/online resources): $$cov(X_1, M_2X_1) = E[(X_1 - E[X_1])(M_2X_1 - E[M_2X_1])] = E[X_1'M_2X_1] - E[X_1]E[M_2X_1]$$
One feature of the residual maker $M$ is that it is symmetric and idempotent, i.e. $M'M = MM = M$ (this is also true of the projection matrix $P$). So we can substitute this equivalence into the first term and rearrange using some basic linear algebra: $$cov(X_1, M_2X_1) = E[X_1'M_2'M_2X_1] - E[X_1]E[M_2X_1] = E[(M_2X_1)'(M_2X_1)] - E[X_1]E[M_2X_1]$$
Now we get to the crux of the lemma. Since $M_2X_1$ are the residuals from regressing $X_1$ on $X_2$, we know by definition that their mean is zero (assuming $X_2$ includes a constant). Hence: $$cov(X_1, M_2X_1) = E[(M_2X_1)'(M_2X_1)] - E[X_1] \times 0 = E[(M_2X_1)'(M_2X_1)].$$
We can perform a very similar trick with respect to the variance of the residuals:
$$
\begin{aligned}
var(M_2X_1) &= E[(M_2X_1)^2] - E[(M_2X_1)]^2 \\
&= E[(M_2X_1)^2] \\
&= E[(M_2X_1)'(M_2X_1)] = cov(X_1, M_2X_1). \; \square
\end{aligned}
$$
Notice we start with the formal definition of variance, and can immediately drop the second term as we know the mean is zero so the squared mean is also zero. Finally, with a small rearrangement, we arrive at the equivalence.
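A quick numerical check of the lemma (a sketch with invented data): the covariance of a predictor with its own residualised version equals the variance of that residualised version:
set.seed(3)
x2 <- rnorm(200)
x1 <- 0.5 * x2 + rnorm(200)
x1_res <- residuals(lm(x1 ~ x2))          # M2 X1
all.equal(cov(x1, x1_res), var(x1_res))   # TRUE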
With that in hand, let’s go back to the MHE proof knowing that $cov(X_1, M_2X_1) = var(M_2X_1)$. Therefore: $$\beta^*_1 = \beta_1\frac{var(M_2X_1)}{var(M_2X_1)} = \beta_1 \times 1 = \beta_1. \; \blacksquare$$
And we’re done! We arrive at the same $\beta_1$ coefficient, despite only residualising the predictor. Notice also that this proof generalises to more independent variables. Rather than $M_2$, we can think of $M$ as the residual maker from regressing $X_1$ on $k$ other independent variables. Because the residual maker guarantees that each such covariance term (i.e. $\beta_k cov(X_k, MX_1)$) is zero, once we expand $cov(Y, MX_1)$ using the full linear equation, all of these additional terms fall away. As an exercise, it may be worth trying the above proof with four or five independent variables to see this in action.
In summary:
$$
\begin{aligned}
\hat{\beta}_1 &= \frac{cov(M_2X_1, Y)}{var(M_2X_1)} = \frac{cov(M_2X_1, \hat{\beta}_1X_1 + \hat{\beta}_2X_2)}{var(M_2X_1)} \\
&= \hat{\beta}_1\frac{cov(M_2X_1, X_1)}{var(M_2X_1)} + \hat{\beta}_2\frac{cov(M_2X_1, X_2)}{var(M_2X_1)} \\
&= \hat{\beta}_1 + \hat{\beta}_2 \times 0 = \hat{\beta}_1
\end{aligned}
$$
This follows from two features. First, $cov(M_2X_1, X_1) = var(M_2X_1)$, as proved in the lemma. Second, $cov(M_2X_1, X_2) = 0$, because $M_2X_1$ is $X_1$ stripped of any variance associated with $X_2$ and so, by definition, they do not covary. Therefore, we can recover the full-model regression coefficient using an adapted version of FWL where we do not residualise $Y$ – as stated in MHE.
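Finally, a short sketch (again with invented data and my own object names) verifying the MHE variant: regressing the non-residualised outcome on the residualised predictor recovers the full-model coefficient:
set.seed(2)
n  <- 500
x2 <- rnorm(n)
x1 <- 0.4 * x2 + rnorm(n)                  # x1 correlated with x2
y  <- 1 + 2 * x1 - 1 * x2 + rnorm(n)
b_full <- coef(lm(y ~ x1 + x2))["x1"]      # coefficient from the full regression
x1_res <- residuals(lm(x1 ~ x2))           # M2 x1
b_mhe  <- coef(lm(y ~ x1_res))["x1_res"]   # outcome NOT residualised
all.equal(unname(b_full), unname(b_mhe))   # TRUE: identical coefficients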
References
The primer on projection matrices (Section 8.2.1) is based on lecture notes from the University of Oslo’s “Econometrics – Modelling and Systems Estimation” course (author attribution unclear), and on Davidson and MacKinnon (2004).
The FWL proof (Section 8.2.2) is adapted from York University, Canada’s wiki for statistical consulting.