5.1 Ordinary Least Squares

Ordinary Least Squares (OLS) is the backbone of statistical modeling, a method so foundational that it often serves as the starting point for understanding data relationships. Whether predicting sales, estimating economic trends, or uncovering patterns in scientific research, OLS remains a critical tool. Its appeal lies in simplicity: OLS models the relationship between a dependent variable and one or more predictors by minimizing the squared differences between observed and predicted values.


Why OLS Works: Linear and Nonlinear Relationships

OLS rests on the Conditional Expectation Function (CEF), E[Y|X], which describes the expected value of Y given X. Regression shines in two key scenarios:

  1. Perfect Fit (Linear CEF):
    If E[Y_i | X_{1i}, \dots, X_{Ki}] = a + \sum_{k=1}^{K} b_k X_{ki}, the regression of Y_i on X_{1i}, \dots, X_{Ki} exactly equals the CEF. In other words, the regression gives the true average relationship between Y and X.
    If the true relationship is linear, regression delivers the exact CEF. For instance, imagine you’re estimating the relationship between advertising spend and sales revenue. If the true impact is linear, OLS will perfectly capture it.

  2. Approximation (Nonlinear CEF):
    If E[Y_i | X_{1i}, \dots, X_{Ki}] is nonlinear, OLS provides the best linear approximation to this relationship. Specifically, it minimizes the expected squared deviation between the linear regression line and the nonlinear CEF.
    For example, suppose the effect of advertising diminishes at higher spending levels. OLS still works, providing the best linear approximation to this nonlinear relationship by minimizing the squared deviations between the regression line and the true (but unknown) CEF.

In other words, regression is not just a tool for “linear” relationships—it’s a workhorse that adapts remarkably well to messy, real-world data.


5.1.1 Simple Regression (Basic) Model

The simplest form of regression is a straight line:

Y_i = \beta_0 + \beta_1 X_i + \epsilon_i

where

  • Y_i: The dependent variable or outcome we’re trying to predict (e.g., sales, temperature).
  • X_i: The independent variable or predictor (e.g., advertising spend, time).
  • \beta_0: The intercept, where the line crosses the Y-axis when X = 0.
  • \beta_1: The slope, representing the change in Y for a one-unit increase in X.
  • \epsilon_i: The error term, accounting for random factors that X cannot explain.

Assumptions About the Error Term (ϵi):

E(\epsilon_i) = 0, \quad Var(\epsilon_i) = \sigma^2, \quad Cov(\epsilon_i, \epsilon_j) = 0 \quad \text{for all } i \neq j

Since \epsilon_i is random, Y_i is also random:

E(Y_i) = E(\beta_0 + \beta_1 X_i + \epsilon_i) = \beta_0 + \beta_1 X_i

Var(Y_i) = Var(\beta_0 + \beta_1 X_i + \epsilon_i) = Var(\epsilon_i) = \sigma^2

Since Cov(\epsilon_i, \epsilon_j) = 0 for i \neq j, the errors are uncorrelated across observations. Hence, Y_i and Y_j are uncorrelated as well, conditional on the X’s.


5.1.1.1 Estimation in Ordinary Least Squares

The goal of OLS is to estimate the regression parameters (β0, β1) that best describe the relationship between the dependent variable Y and the independent variable X. To achieve this, we minimize the sum of squared deviations between observed values of Yi and their expected values predicted by the model.

The deviation of an observed value Yi from its expected value, based on the regression model, is:

Y_i - E(Y_i) = Y_i - (\beta_0 + \beta_1 X_i).

This deviation represents the error in prediction for the i-th observation.

To ensure that the errors don’t cancel each other out and to prioritize larger deviations, we consider the squared deviations. The sum of squared deviations, denoted by Q, is defined as:

Q = \sum_{i=1}^{n} (Y_i - \beta_0 - \beta_1 X_i)^2.

The goal of OLS is to find the values of β0 and β1 that minimize Q. These values are called the OLS estimators.

To minimize Q, we take partial derivatives with respect to β0 and β1, set them to zero, and solve the resulting system of equations. After simplifying, the estimators for the slope (b1) and intercept (b0) are obtained as follows:

Slope (b_1):

b_1 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2}.

Here, \bar{X} and \bar{Y} represent the means of X and Y, respectively. This formula shows that the slope is the sample covariance between X and Y divided by the sample variance of X.

Intercept (b_0):

b_0 = \frac{1}{n}\left(\sum_{i=1}^{n} Y_i - b_1 \sum_{i=1}^{n} X_i\right) = \bar{Y} - b_1 \bar{X}.

The intercept is determined by aligning the regression line with the center of the data.
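
To make the formulas concrete, here is a minimal R sketch (with simulated data; the variable names, sample size, and true coefficients are illustrative) that computes b_0 and b_1 from the closed-form expressions and checks them against lm():

# Minimal sketch: closed-form OLS estimates on simulated data
set.seed(42)
n <- 50
x <- rnorm(n, mean = 10, sd = 3)        # illustrative predictor
y <- 1.5 + 0.8 * x + rnorm(n, sd = 2)   # illustrative response

b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)  # slope
b0 <- mean(y) - b1 * mean(x)                                     # intercept

c(b0 = b0, b1 = b1)
coef(lm(y ~ x))  # same estimates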


Intuition Behind the Estimators

  • b_1 (Slope): This measures the average change in Y for a one-unit increase in X. The formula uses deviations from the mean to ensure that the relationship captures the joint variability of X and Y.

  • b_0 (Intercept): This ensures that the regression line passes through the mean of the data points (\bar{X}, \bar{Y}), anchoring the model in the center of the observed data.


Equivalently, we can also write these parameters in terms of covariances.

The covariance between two variables is defined as:

Cov(X_i, Y_i) = E[(X_i - E[X_i])(Y_i - E[Y_i])]

Properties of Covariance:

  1. Cov(X_i, X_i) = \sigma^2_X
  2. If E(X_i) = 0 or E(Y_i) = 0, then Cov(X_i, Y_i) = E[X_i Y_i]
  3. For W_i = a + bX_i and Z_i = c + dY_i,
    Cov(W_i, Z_i) = bd \, Cov(X_i, Y_i)

For a bivariate regression, the slope \beta is given by:

\beta = \frac{Cov(Y_i, X_i)}{Var(X_i)}

For the multivariate case, the slope for X_k is:

\beta_k = \frac{Cov(Y_i, \tilde{X}_{ki})}{Var(\tilde{X}_{ki})}

where \tilde{X}_{ki} represents the residual from a regression of X_{ki} on the K-1 other covariates in the model.

The intercept is:

\beta_0 = E[Y_i] - \beta_1 E(X_i)

Note:

  • OLS does not require a specific distributional assumption for the variables: the estimators are defined purely by minimizing squared errors. Distributional assumptions become relevant only when we move to inference (hypothesis tests and confidence intervals).
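
As an illustration of the covariance form of the slope, the following R sketch (simulated data; the variables x1, x2 and the coefficients are arbitrary) residualizes one predictor on the other and recovers the corresponding multiple-regression coefficient:

# Sketch: the slope on x1 equals Cov(Y, x1-residual) / Var(x1-residual)
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- 0.5 * x1 + rnorm(n)              # correlated covariate
y  <- 1 + 2 * x1 - 1 * x2 + rnorm(n)

x1_tilde <- resid(lm(x1 ~ x2))         # residual of x1 on the other covariate

cov(y, x1_tilde) / var(x1_tilde)       # slope from the covariance formula
coef(lm(y ~ x1 + x2))["x1"]            # matches the multiple-regression slope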

5.1.1.2 Properties of Least Squares Estimators

The properties of the Ordinary Least Squares estimators (b0 and b1) are derived based on their statistical behavior. These properties provide insights into the accuracy, variability, and reliability of the estimates.


5.1.1.2.1 Expectation of the OLS Estimators

The OLS estimators b0 (intercept) and b1 (slope) are unbiased. This means their expected values equal the true population parameters:

E(b_1) = \beta_1, \qquad E(b_0) = E(\bar{Y}) - \bar{X}\beta_1.

Since the expected value of the sample mean of Y, E(\bar{Y}), is:

E(\bar{Y}) = \beta_0 + \beta_1 \bar{X},

the expected value of b_0 simplifies to:

E(b_0) = \beta_0.

Thus, b_0 and b_1 are unbiased estimators of their respective population parameters \beta_0 and \beta_1.
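
A quick Monte Carlo sketch in R can illustrate this unbiasedness empirically (the true parameter values, sample size, and number of replications below are arbitrary choices):

# Average the OLS estimates over many simulated samples with a fixed design
set.seed(2024)
beta_0 <- 1; beta_1 <- 2; n <- 30
x <- runif(n, 0, 10)

est <- replicate(5000, {
    y <- beta_0 + beta_1 * x + rnorm(n, sd = 3)
    coef(lm(y ~ x))
})

rowMeans(est)  # close to the true values (1, 2)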


5.1.1.2.2 Variance of the OLS Estimators

The variability of the OLS estimators depends on the spread of the predictor variable X and the error variance σ2. The variances are given by:

Variance of b_1 (Slope):

Var(b_1) = \frac{\sigma^2}{\sum_{i=1}^{n} (X_i - \bar{X})^2}.

Variance of b_0 (Intercept):

Var(b_0) = \sigma^2 \left(\frac{1}{n} + \frac{\bar{X}^2}{\sum_{i=1}^{n} (X_i - \bar{X})^2}\right).

These formulas highlight that:

  • Var(b_1) \to 0 as the number of observations increases, provided the X_i values are distributed around their mean \bar{X}.
  • Var(b_0) \to 0 as n increases, assuming the X_i values are appropriately selected (i.e., not all clustered near the mean).

5.1.1.3 Mean Square Error (MSE)

The Mean Square Error (MSE) quantifies the average squared residual (error) in the model:

MSE = \frac{SSE}{n-2} = \frac{\sum_{i=1}^{n} e_i^2}{n-2} = \frac{\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2}{n-2},

where SSE is the Sum of Squared Errors and n-2 is the degrees of freedom for a simple linear regression model (two parameters estimated: \beta_0 and \beta_1).

The expected value of the MSE equals the error variance; that is, MSE is an unbiased estimator of \sigma^2:

E(MSE) = \sigma^2.


5.1.1.4 Estimating Variance of the OLS Coefficients

The sample-based estimates of the variances of b0 and b1 are expressed as follows:

Estimated Variance of b_1 (Slope):

s^2(b_1) = \widehat{Var}(b_1) = \frac{MSE}{\sum_{i=1}^{n} (X_i - \bar{X})^2}.

Estimated Variance of b_0 (Intercept):

s^2(b_0) = \widehat{Var}(b_0) = MSE\left(\frac{1}{n} + \frac{\bar{X}^2}{\sum_{i=1}^{n} (X_i - \bar{X})^2}\right).

These estimates rely on the MSE to approximate \sigma^2.

The variance estimates are unbiased:

E(s^2(b_1)) = Var(b_1), \qquad E(s^2(b_0)) = Var(b_0).
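
The following R sketch (simulated data, illustrative values) recomputes MSE, s(b_0), and s(b_1) from these formulas and compares them with the output of summary(lm()):

# Sketch: MSE and coefficient standard errors from first principles
set.seed(7)
n <- 60
x <- rnorm(n, mean = 5, sd = 2)
y <- 2 + 3 * x + rnorm(n, sd = 1.5)
fit <- lm(y ~ x)

mse  <- sum(resid(fit)^2) / (n - 2)                               # SSE / (n - 2)
s_b1 <- sqrt(mse / sum((x - mean(x))^2))                          # s(b1)
s_b0 <- sqrt(mse * (1 / n + mean(x)^2 / sum((x - mean(x))^2)))    # s(b0)

c(MSE = mse, s_b0 = s_b0, s_b1 = s_b1)
summary(fit)$sigma^2                            # matches MSE
summary(fit)$coefficients[, "Std. Error"]       # matches s(b0), s(b1)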


Implications of These Properties

  1. Unbiasedness: The unbiased nature of b0 and b1 ensures that, on average, the regression model accurately reflects the true relationship in the population.
  2. Decreasing Variance: As the sample size n increases or as the spread of Xi values grows, the variances of b0 and b1 decrease, leading to more precise estimates.
  3. Error Estimation with MSE: MSE provides a reliable estimate of the error variance σ2, which feeds directly into assessing the reliability of b0 and b1.

5.1.1.5 Residuals in Ordinary Least Squares

Residuals are the differences between observed values (Y_i) and their predicted counterparts (\hat{Y}_i). They play a central role in assessing model fit and ensuring the assumptions of OLS are met.

The residual for the i-th observation is defined as:

e_i = Y_i - \hat{Y}_i = Y_i - (b_0 + b_1 X_i),

where:

  • e_i: Residual for the i-th observation.
  • \hat{Y}_i: Predicted value based on the regression model.
  • Y_i: Actual observed value.

Residuals estimate the unobservable error terms \epsilon_i:

  • e_i is an estimate of \epsilon_i = Y_i - E(Y_i).
  • \epsilon_i is always unknown because we do not know the true values of \beta_0 and \beta_1.

5.1.1.5.1 Key Properties of Residuals

Residuals exhibit several mathematical properties that align with the OLS estimation process:

  1. Sum of Residuals:
    The residuals sum to zero:

    \sum_{i=1}^{n} e_i = 0.

    This ensures that the regression line passes through the centroid of the data, (\bar{X}, \bar{Y}).

  2. Orthogonality of Residuals to Predictors:
    The residuals are orthogonal (uncorrelated) to the predictor variable X:

    \sum_{i=1}^{n} X_i e_i = 0.

    This reflects the fact that OLS minimizes the squared vertical deviations of the residuals (along the Y-axis), not deviations along the X-axis.
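
These identities are easy to verify numerically; the short R sketch below (simulated data, illustrative values) checks them for a fitted model:

# Sketch: residuals sum to zero and are orthogonal to X and to fitted values
set.seed(11)
x <- runif(40, 0, 10)
y <- 4 + 0.5 * x + rnorm(40)
fit <- lm(y ~ x)
e <- resid(fit)

sum(e)                # ~ 0 up to floating-point error
sum(x * e)            # ~ 0
sum(fitted(fit) * e)  # ~ 0 as well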


5.1.1.5.2 Expected Values of Residuals

The expected values of residuals reinforce the unbiased nature of OLS:

  1. Mean of Residuals:
    The residuals have an expected value of zero:

    E[e_i] = 0.

  2. Orthogonality to Predictors and Fitted Values:
    Residuals are uncorrelated with both the predictor variables and the fitted values:

    E[X_i e_i] = 0, \quad E[\hat{Y}_i e_i] = 0.

These properties highlight that residuals do not contain systematic information about the predictors or the fitted values, reinforcing the idea that the model has captured the underlying relationship effectively.


5.1.1.5.3 Practical Importance of Residuals
  1. Model Diagnostics:
    Residuals are analyzed to check the assumptions of OLS, including linearity, homoscedasticity (constant variance), and independence of errors. Patterns in residual plots can signal issues such as nonlinearity or heteroscedasticity.

  2. Goodness-of-Fit:
    The sum of squared residuals, \sum_{i=1}^{n} e_i^2, measures the total unexplained variation in Y. A smaller sum indicates a better fit.

  3. Influence Analysis:
    Large residuals may indicate outliers or influential points that disproportionately affect the regression line.

5.1.1.6 Inference in Ordinary Least Squares

Inference allows us to make probabilistic statements about the regression parameters (β0, β1) and predictions (Yh). To perform valid inference, certain assumptions about the distribution of errors are necessary.


Normality Assumption

  • OLS estimation itself does not require the assumption of normality.
  • However, to conduct hypothesis tests or construct confidence intervals for β0, β1, and predictions, distributional assumptions are necessary.
  • Inference on β0 and β1 is robust to moderate departures from normality, especially in large samples due to the Central Limit Theorem.
  • Inference on predicted values, Ypred, is more sensitive to normality violations.

When we assume a normal error model, the response variable Yi is modeled as:

Y_i \sim N(\beta_0 + \beta_1 X_i, \sigma^2),

where:

  • \beta_0 + \beta_1 X_i: Mean response
  • \sigma^2: Variance of the errors

Under this model, the sampling distributions of the OLS estimators, b0 and b1, can be derived.


5.1.1.6.1 Inference for β1 (Slope)

Under the normal error model:

  1. Sampling Distribution of b1:

    b_1 \sim N\left(\beta_1, \frac{\sigma^2}{\sum_{i=1}^{n}(X_i - \bar{X})^2}\right).

    This indicates that b1 is an unbiased estimator of β1 with variance proportional to σ2.

  2. Test Statistic:

    t = \frac{b_1 - \beta_1}{s(b_1)} \sim t_{n-2},

    where s(b_1) is the standard error of b_1: s(b_1) = \sqrt{\frac{MSE}{\sum_{i=1}^{n}(X_i - \bar{X})^2}}.

  3. Confidence Interval:

    A (1-\alpha)100\% confidence interval for \beta_1 is:

    b_1 \pm t_{1-\alpha/2; n-2} \, s(b_1).
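
In R, this interval can be computed by hand or with confint(); the sketch below (simulated data, 95% level) shows that both give the same answer:

# Sketch: t-based confidence interval for the slope
set.seed(5)
x <- rnorm(80)
y <- 1 + 0.6 * x + rnorm(80)
fit <- lm(y ~ x)

b1    <- coef(fit)["x"]
s_b1  <- summary(fit)$coefficients["x", "Std. Error"]
alpha <- 0.05
b1 + c(-1, 1) * qt(1 - alpha / 2, df = df.residual(fit)) * s_b1

confint(fit, "x", level = 0.95)  # same interval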


5.1.1.6.2 Inference for β0 (Intercept)
  1. Sampling Distribution of b0:

    Under the normal error model, the sampling distribution of b0 is:

    b_0 \sim N\left(\beta_0, \sigma^2\left(\frac{1}{n} + \frac{\bar{X}^2}{\sum_{i=1}^{n}(X_i - \bar{X})^2}\right)\right).

  2. Test Statistic:

    t = \frac{b_0 - \beta_0}{s(b_0)} \sim t_{n-2},

    where s(b_0) is the standard error of b_0: s(b_0) = \sqrt{MSE\left(\frac{1}{n} + \frac{\bar{X}^2}{\sum_{i=1}^{n}(X_i - \bar{X})^2}\right)}.

  3. Confidence Interval:

    A (1-\alpha)100\% confidence interval for \beta_0 is:

    b_0 \pm t_{1-\alpha/2; n-2} \, s(b_0).


5.1.1.6.3 Mean Response

In regression, we often estimate the mean response of the dependent variable Y for a given level of the predictor variable X, denoted as X_h. This estimation provides a predicted average outcome for a specific value of X based on the fitted regression model.

  • Let X_h represent the level of X for which we want to estimate the mean response.
  • The mean response when X = X_h is denoted as E(Y_h).
  • A point estimator for E(Y_h) is \hat{Y}_h, which is the predicted value from the regression model:

\hat{Y}_h = b_0 + b_1 X_h.

The estimator \hat{Y}_h is unbiased because its expected value equals the true mean response E(Y_h):

E(\hat{Y}_h) = E(b_0 + b_1 X_h) = \beta_0 + \beta_1 X_h = E(Y_h).

Thus, \hat{Y}_h provides a reliable estimate of the mean response at X_h.


The variance of \hat{Y}_h reflects the uncertainty in the estimate of the mean response:

\begin{aligned} Var(\hat{Y}_h) &= Var(b_0 + b_1 X_h) && \text{(definition of } \hat{Y}_h\text{)} \\ &= Var\left((\bar{Y} - b_1 \bar{X}) + b_1 X_h\right) && \text{(since } b_0 = \bar{Y} - b_1 \bar{X}\text{)} \\ &= Var\left(\bar{Y} + b_1 (X_h - \bar{X})\right) && \text{(factor out } b_1\text{)} \\ &= Var(\bar{Y}) + (X_h - \bar{X})^2 Var(b_1) + 2(X_h - \bar{X}) Cov(\bar{Y}, b_1). \end{aligned}

Since Cov(\bar{Y}, b_1) = 0 (because the errors \epsilon_i are uncorrelated), the variance simplifies to:

Var(\hat{Y}_h) = \frac{\sigma^2}{n} + (X_h - \bar{X})^2 \frac{\sigma^2}{\sum_{i=1}^{n}(X_i - \bar{X})^2}.

This can also be expressed as:

Var(\hat{Y}_h) = \sigma^2\left(\frac{1}{n} + \frac{(X_h - \bar{X})^2}{\sum_{i=1}^{n}(X_i - \bar{X})^2}\right).

To estimate the variance of \hat{Y}_h, we replace \sigma^2 with MSE, the mean squared error from the regression:

s^2(\hat{Y}_h) = MSE\left(\frac{1}{n} + \frac{(X_h - \bar{X})^2}{\sum_{i=1}^{n}(X_i - \bar{X})^2}\right).


Under the normal error model, the sampling distribution of \hat{Y}_h is:

\hat{Y}_h \sim N\left(E(Y_h), Var(\hat{Y}_h)\right), \qquad \frac{\hat{Y}_h - E(Y_h)}{s(\hat{Y}_h)} \sim t_{n-2}.

This result follows because \hat{Y}_h is a linear combination of normally distributed random variables, and its variance is estimated using s^2(\hat{Y}_h).


A 100(1-\alpha)\% confidence interval for the mean response E(Y_h) is given by:

\hat{Y}_h \pm t_{1-\alpha/2; n-2} \, s(\hat{Y}_h),

where:

  • \hat{Y}_h: Point estimate of the mean response,
  • s(\hat{Y}_h): Estimated standard error of the mean response,
  • t_{1-\alpha/2; n-2}: Critical value from the t-distribution with n-2 degrees of freedom.
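
The sketch below (simulated data; X_h = 6 is an arbitrary value) computes s(\hat{Y}_h) and the resulting interval by hand and compares them with predict(..., interval = "confidence"):

# Sketch: confidence interval for the mean response at X_h
set.seed(9)
n <- 100
x <- rnorm(n, mean = 5, sd = 2)
y <- 2 + 1.2 * x + rnorm(n)
fit <- lm(y ~ x)

x_h    <- 6
y_hat  <- sum(coef(fit) * c(1, x_h))
mse    <- summary(fit)$sigma^2
s_yhat <- sqrt(mse * (1 / n + (x_h - mean(x))^2 / sum((x - mean(x))^2)))
y_hat + c(-1, 1) * qt(0.975, n - 2) * s_yhat

predict(fit, newdata = data.frame(x = x_h), interval = "confidence")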

5.1.1.6.4 Prediction of a New Observation

When analyzing regression results, it is important to distinguish between:

  1. Estimating the mean response at a particular value of X.
  2. Predicting an individual outcome for a particular value of X.

Mean Response vs. Individual Outcome

  • Same Point Estimate
    The formula for both the estimated mean response and the predicted individual outcome at X = X_h is identical:
    \hat{Y}_{pred} = \hat{Y}_h = b_0 + b_1 X_h.

  • Different Variance
    Although the point estimates are the same, the level of uncertainty differs. When predicting an individual outcome, we must consider not only the uncertainty in estimating the mean response (\hat{Y}_h) but also the additional random variation within the distribution of Y.

Therefore, prediction intervals (for individual outcomes) account for more uncertainty and are consequently wider than confidence intervals (for the mean response).


To predict an individual outcome for a given X_h, we combine the mean response with the random error:

Y_{pred} = \beta_0 + \beta_1 X_h + \epsilon.

Using the least squares predictor:

\hat{Y}_{pred} = b_0 + b_1 X_h,

since E(\epsilon) = 0.


The variance of the predicted value for a new observation, Y_{pred}, includes both:

  1. Variance of the estimated mean response: \sigma^2\left(\frac{1}{n} + \frac{(X_h - \bar{X})^2}{\sum_{i=1}^{n}(X_i - \bar{X})^2}\right),
  2. Variance of the error term \epsilon, which is \sigma^2.

Thus, the total variance is:

\begin{aligned} Var(Y_{pred}) &= Var(b_0 + b_1 X_h + \epsilon) \\ &= Var(b_0 + b_1 X_h) + Var(\epsilon) \\ &= \sigma^2\left(\frac{1}{n} + \frac{(X_h - \bar{X})^2}{\sum_{i=1}^{n}(X_i - \bar{X})^2}\right) + \sigma^2 \\ &= \sigma^2\left(1 + \frac{1}{n} + \frac{(X_h - \bar{X})^2}{\sum_{i=1}^{n}(X_i - \bar{X})^2}\right). \end{aligned}

We estimate the variance of the prediction using MSE, the mean squared error:

s^2(pred) = MSE\left(1 + \frac{1}{n} + \frac{(X_h - \bar{X})^2}{\sum_{i=1}^{n}(X_i - \bar{X})^2}\right).

Under the normal error model, the standardized prediction error follows a t-distribution with n-2 degrees of freedom:

\frac{Y_{pred} - \hat{Y}_h}{s(pred)} \sim t_{n-2}.

A 100(1-\alpha)\% prediction interval for Y_{pred} is:

\hat{Y}_{pred} \pm t_{1-\alpha/2; n-2} \, s(pred).
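
The corresponding R sketch (the same kind of simulated data and the same arbitrary X_h as above) builds s(pred) by hand and checks it against predict(..., interval = "prediction"); note how the extra \sigma^2 term widens the interval relative to the mean-response interval:

# Sketch: prediction interval for a new observation at X_h
set.seed(9)
n <- 100
x <- rnorm(n, mean = 5, sd = 2)
y <- 2 + 1.2 * x + rnorm(n)
fit <- lm(y ~ x)

x_h    <- 6
y_hat  <- sum(coef(fit) * c(1, x_h))
mse    <- summary(fit)$sigma^2
s_pred <- sqrt(mse * (1 + 1 / n + (x_h - mean(x))^2 / sum((x - mean(x))^2)))
y_hat + c(-1, 1) * qt(0.975, n - 2) * s_pred

predict(fit, newdata = data.frame(x = x_h), interval = "prediction")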


5.1.1.6.5 Confidence Band

In regression analysis, we often want to evaluate the uncertainty around the entire regression line, not just at a single value of the predictor variable X. This is achieved using a confidence band, which provides a confidence interval for the mean response, E(Y) = \beta_0 + \beta_1 X, over the entire range of X values.

The Working-Hotelling confidence band is a method to construct simultaneous confidence intervals for the regression line. For a given X_h, the confidence band is expressed as:

\hat{Y}_h \pm W \, s(\hat{Y}_h),

where:

  • W^2 = 2 F_{1-\alpha; 2, n-2},

    • F_{1-\alpha; 2, n-2} is the critical value from the F-distribution with 2 and n-2 degrees of freedom.
  • s(\hat{Y}_h) is the standard error of the estimated mean response at X_h:

    s^2(\hat{Y}_h) = MSE\left(\frac{1}{n} + \frac{(X_h - \bar{X})^2}{\sum_{i=1}^{n}(X_i - \bar{X})^2}\right).


Key Properties of the Confidence Band

  1. Width of the Interval:
    • The width of the confidence band changes with X_h because s(\hat{Y}_h) depends on how far X_h is from the mean of X (\bar{X}).
    • The interval is narrowest at X = \bar{X}, where the variance of the estimated mean response is minimized.
  2. Shape of the Band:
    • The boundaries of the confidence band form a hyperbolic shape around the regression line.
    • This reflects the increasing uncertainty in the mean response as X_h moves farther from \bar{X}.
  3. Simultaneous Coverage:
    • The Working-Hotelling band ensures that the true regression line E(Y) = \beta_0 + \beta_1 X lies within the band across all values of X with a specified confidence level (e.g., 95%).
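
A simple way to compute the band in R is to combine predict(..., se.fit = TRUE) with the W multiplier; the sketch below (simulated data, 95% simultaneous level) evaluates the band on a grid of X values:

# Sketch: Working-Hotelling band over a grid of X values
set.seed(3)
n <- 100
x <- rnorm(n, mean = 5, sd = 2)
y <- 2 + 1.2 * x + rnorm(n)
fit <- lm(y ~ x)

grid <- seq(min(x), max(x), length.out = 50)
pred <- predict(fit, newdata = data.frame(x = grid), se.fit = TRUE)
W <- sqrt(2 * qf(0.95, df1 = 2, df2 = n - 2))   # simultaneous critical value

band <- cbind(fit   = pred$fit,
              lower = pred$fit - W * pred$se.fit,
              upper = pred$fit + W * pred$se.fit)
head(band)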

5.1.1.7 Analysis of Variance (ANOVA) in Regression

ANOVA in regression decomposes the total variability in the response variable (Y) into components attributed to the regression model and residual error. In the context of regression, ANOVA provides a mechanism to assess the fit of the model and test hypotheses about the relationship between X and Y.

The corrected Total Sum of Squares (SSTO) quantifies the total variation in Y:

SSTO = \sum_{i=1}^{n} (Y_i - \bar{Y})^2,

where \bar{Y} is the mean of the response variable. The term “corrected” refers to the fact that the sum of squares is calculated relative to the mean (i.e., the uncorrected total sum of squares is \sum_{i=1}^{n} Y_i^2).

Using the fitted regression model \hat{Y}_i = b_0 + b_1 X_i, we estimate the conditional mean of Y at X_i. The total sum of squares can be decomposed as:

\begin{aligned} \sum_{i=1}^{n} (Y_i - \bar{Y})^2 &= \sum_{i=1}^{n} (Y_i - \hat{Y}_i + \hat{Y}_i - \bar{Y})^2 \\ &= \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 + \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2 + 2\sum_{i=1}^{n} (Y_i - \hat{Y}_i)(\hat{Y}_i - \bar{Y}) \\ &= \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 + \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2 \end{aligned}

  • The cross-product term is zero, as shown below.
  • This decomposition simplifies to:

SSTO=SSE+SSR,

where:

  • SSE = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2: Error Sum of Squares (variation unexplained by the model).

  • SSR = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2: Regression Sum of Squares (variation explained by the model), which measures how the conditional mean varies about a central value.

Degrees of freedom are partitioned as:

\begin{aligned} SSTO &= SSR + SSE \\ (n-1) &= (1) + (n-2) \end{aligned}


To confirm that the cross-product term is zero:

\begin{aligned} \sum_{i=1}^{n} (Y_i - \hat{Y}_i)(\hat{Y}_i - \bar{Y}) &= \sum_{i=1}^{n} \left(Y_i - \bar{Y} - b_1(X_i - \bar{X})\right)\left(\bar{Y} + b_1(X_i - \bar{X}) - \bar{Y}\right) && \text{(expand } Y_i - \hat{Y}_i \text{ and } \hat{Y}_i - \bar{Y}\text{)} \\ &= \sum_{i=1}^{n} \left(Y_i - \bar{Y} - b_1(X_i - \bar{X})\right) b_1 (X_i - \bar{X}) \\ &= b_1 \sum_{i=1}^{n} (Y_i - \bar{Y})(X_i - \bar{X}) - b_1^2 \sum_{i=1}^{n} (X_i - \bar{X})^2 && \text{(distribute terms in the product)} \\ &= b_1 \frac{\sum_{i=1}^{n} (Y_i - \bar{Y})(X_i - \bar{X})}{\sum_{i=1}^{n} (X_i - \bar{X})^2} \sum_{i=1}^{n} (X_i - \bar{X})^2 - b_1^2 \sum_{i=1}^{n} (X_i - \bar{X})^2 && \text{(substitute the definition of } b_1\text{)} \\ &= b_1^2 \sum_{i=1}^{n} (X_i - \bar{X})^2 - b_1^2 \sum_{i=1}^{n} (X_i - \bar{X})^2 \\ &= 0 \end{aligned}


The ANOVA table summarizes the partitioning of variability:

Source of Variation | Sum of Squares | df | Mean Square | F Statistic
Regression (Model) | SSR | 1 | MSR = \frac{SSR}{1} | F = \frac{MSR}{MSE}
Error | SSE | n-2 | MSE = \frac{SSE}{n-2} |
Total (Corrected) | SSTO | n-1 | |

The expected values of the mean squares are:

E(MSE) = \sigma^2, \qquad E(MSR) = \sigma^2 + \beta_1^2 \sum_{i=1}^{n} (X_i - \bar{X})^2.

  • If \beta_1 = 0:
    • The regression model does not explain any variation in Y beyond the mean, and E(MSR) = E(MSE) = \sigma^2.
    • This condition corresponds to the null hypothesis, H_0: \beta_1 = 0.
  • If \beta_1 \neq 0:
    • The regression model explains some variation in Y, and E(MSR) > E(MSE).
    • The additional term \beta_1^2 \sum_{i=1}^{n} (X_i - \bar{X})^2 represents the variance explained by the predictor X.

The difference between E(MSR) and E(MSE) allows us to infer whether \beta_1 \neq 0 by comparing their ratio.


Assuming the errors \epsilon_i are independent and identically distributed as N(0, \sigma^2), and under the null hypothesis H_0: \beta_1 = 0, we have:

  1. The scaled MSE follows a chi-square distribution with n-2 degrees of freedom:

    \frac{(n-2)\,MSE}{\sigma^2} = \frac{SSE}{\sigma^2} \sim \chi^2_{n-2}.

  2. Under H_0, the scaled MSR follows a chi-square distribution with 1 degree of freedom:

    \frac{MSR}{\sigma^2} = \frac{SSR}{\sigma^2} \sim \chi^2_{1}.

  3. These two chi-square random variables are independent.

The ratio of two independent chi-square random variables, scaled by their respective degrees of freedom, follows an F-distribution. Therefore, under H0:

F = \frac{MSR}{MSE} \sim F_{1, n-2}.

The F-statistic tests whether the regression model provides a significant improvement over the null model (constant E(Y)).

The hypotheses for the F-test are:

  • Null Hypothesis (H_0): \beta_1 = 0 (no relationship between X and Y).
  • Alternative Hypothesis (H_a): \beta_1 \neq 0 (a significant relationship exists between X and Y).

The rejection rule for H0 at significance level α is:

F > F_{1-\alpha; 1, n-2},

where F_{1-\alpha; 1, n-2} is the critical value from the F-distribution with 1 and n-2 degrees of freedom.

  1. If F \leq F_{1-\alpha; 1, n-2}:
    • Fail to reject H_0. There is insufficient evidence to conclude that X significantly explains variation in Y.
  2. If F > F_{1-\alpha; 1, n-2}:
    • Reject H_0. There is significant evidence that X explains some of the variation in Y.
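
In R, anova() on a fitted lm object reports SSR, SSE, the mean squares, and this F statistic; the sketch below (simulated data) also shows the corresponding critical value:

# Sketch: regression ANOVA table and overall F test
set.seed(8)
n <- 60
x <- rnorm(n)
y <- 1 + 0.9 * x + rnorm(n)
fit <- lm(y ~ x)

anova(fit)                        # SSR, SSE, MSR, MSE, and F = MSR/MSE
qf(0.95, df1 = 1, df2 = n - 2)    # critical value F_{0.95; 1, n-2}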

5.1.1.8 Coefficient of Determination (R2)

The Coefficient of Determination (R2) measures how well the linear regression model accounts for the variability in the response variable Y. It is defined as:

R^2 = \frac{SSR}{SSTO} = 1 - \frac{SSE}{SSTO},

where:

  • SSR: Regression Sum of Squares (variation explained by the model).
  • SSTO: Total Sum of Squares (total variation in Y about its mean).
  • SSE: Error Sum of Squares (variation unexplained by the model).

Properties of R2

  1. Range: 0 \leq R^2 \leq 1.

    • R^2 = 0: The model explains none of the variability in Y (e.g., \beta_1 = 0).
    • R^2 = 1: The model explains all the variability in Y (perfect fit).
  2. Proportionate Reduction in Variance: R^2 represents the proportionate reduction in the total variation of Y after fitting the model. It quantifies how much better the model predicts Y compared to simply using \bar{Y}.

  3. Potential Misinterpretation: It is not really correct to say R2 is the “variation in Y explained by X.” The term “variation explained” assumes a causative or deterministic explanation, which is not always correct. For example:

    • R2 shows how much variance in Y is accounted for by the regression model, but it does not imply causation.

    • In cases with confounding variables or spurious correlations, R2 can still be high, even if there’s no direct causal link between X and Y.


For simple linear regression, R2 is the square of the Pearson correlation coefficient, r:

R^2 = r^2,

where:

  • r=corr(X,Y) is the sample correlation coefficient.

The relationship between b_1 (the slope of the regression line) and r is given by:

b_1 = \left(\frac{\sum_{i=1}^{n}(Y_i - \bar{Y})^2}{\sum_{i=1}^{n}(X_i - \bar{X})^2}\right)^{1/2} r.

Equivalently, b_1 can be expressed as:

b_1 = \frac{s_y}{s_x} \, r,

where s_y and s_x are the sample standard deviations of Y and X, respectively.
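
The sketch below (simulated data) verifies both identities numerically: R^2 equals the squared correlation, and the slope equals r \, s_y / s_x:

# Sketch: R^2 = r^2 and b1 = r * s_y / s_x in simple regression
set.seed(4)
x <- rnorm(80, mean = 3, sd = 1.5)
y <- 0.5 + 2 * x + rnorm(80, sd = 2)
fit <- lm(y ~ x)

summary(fit)$r.squared
cor(x, y)^2                   # same value

cor(x, y) * sd(y) / sd(x)     # equals the fitted slope
coef(fit)["x"]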

5.1.1.9 Lack of Fit in Regression

The lack of fit test evaluates whether the chosen regression model adequately captures the relationship between the predictor variable X and the response variable Y. When there are repeated observations at specific values of X, we can partition the Error Sum of Squares (SSE) into two components:

  1. Pure Error
  2. Lack of Fit.

Given the observations:

  • Y_{ij}: The i-th replicate at the j-th distinct value of X,
    • Y_{11}, Y_{21}, \dots, Y_{n_1,1}: the n_1 repeated observations at X_1
    • Y_{1c}, Y_{2c}, \dots, Y_{n_c,c}: the n_c repeated observations at X_c
  • \bar{Y}_j: The mean response for the replicates at X_j,
  • \hat{Y}_{ij}: The predicted value from the regression model at X_j (the same for every replicate at X_j),

the Error Sum of Squares (SSE) can be decomposed as:

\begin{aligned} \sum_{ij} (Y_{ij} - \hat{Y}_{ij})^2 &= \sum_{ij} (Y_{ij} - \bar{Y}_j + \bar{Y}_j - \hat{Y}_{ij})^2 \\ &= \sum_{ij} (Y_{ij} - \bar{Y}_j)^2 + \sum_{j} n_j (\bar{Y}_j - \hat{Y}_{ij})^2 + \text{cross-product term} \\ &= \sum_{ij} (Y_{ij} - \bar{Y}_j)^2 + \sum_{j} n_j (\bar{Y}_j - \hat{Y}_{ij})^2 \end{aligned}

  • The cross product term is zero because the deviations within replicates and the deviations between replicates are orthogonal.
  • This simplifies to:

SSE=SSPE+SSLF,

where:

  • SSPE (Pure Error Sum of Squares): Variation within replicates for the same X_j, reflecting natural variability in the response.
    • Degrees of freedom: df_{pe} = n - c, where n is the total number of observations, and c is the number of distinct X values.
  • SSLF (Lack of Fit Sum of Squares): Variation between the replicate means \bar{Y}_j and the model-predicted values \hat{Y}_{ij}. If SSLF is large, it suggests the model may not adequately describe the relationship between X and Y.
    • Degrees of freedom: df_{lf} = c - 2, where the 2 accounts for the parameters in the linear regression model (\beta_0 and \beta_1).

  • Mean Square for Pure Error (MSPE):
    MSPE = \frac{SSPE}{df_{pe}} = \frac{SSPE}{n-c}.

  • Mean Square for Lack of Fit (MSLF):
    MSLF = \frac{SSLF}{df_{lf}} = \frac{SSLF}{c-2}.

5.1.1.9.1 The F-Test for Lack of Fit

The F-test for lack of fit evaluates whether the chosen regression model adequately captures the relationship between the predictor variable X and the response variable Y. Specifically, it tests whether any systematic deviations from the model exist that are not accounted for by random error.

  • Null Hypothesis (H_0):
    The regression model is adequate: H_0: Y_{ij} = \beta_0 + \beta_1 X_j + \epsilon_{ij}, \quad \epsilon_{ij} \sim \text{i.i.d. } N(0, \sigma^2).

  • Alternative Hypothesis (H_a):
    The regression model is not adequate and includes an additional function f(X_j, Z_1, \dots) to account for the lack of fit: H_a: Y_{ij} = \alpha_0 + \alpha_1 X_j + f(X_j, Z_1, \dots) + \epsilon_{ij}, \quad \epsilon_{ij} \sim \text{i.i.d. } N(0, \sigma^2).


Expected Mean Squares

  • The expected Mean Square for Pure Error (MSPE) is the same under both H_0 and H_a:

    E(MSPE) = \sigma^2.

  • The expected Mean Square for Lack of Fit (MSLF) depends on whether H_0 is true:

    • Under H_0 (model is adequate): E(MSLF) = \sigma^2.
    • Under H_a (model is not adequate): E(MSLF) = \sigma^2 + \frac{\sum_j n_j f(X_j, Z_1, \dots)^2}{c-2}.

The test statistic for the lack-of-fit test is:

F = \frac{MSLF}{MSPE},

where:

  • MSLF = \frac{SSLF}{c-2},
    and SSLF is the Lack of Fit Sum of Squares.
  • MSPE = \frac{SSPE}{n-c},
    and SSPE is the Pure Error Sum of Squares.

Under H_0, the F-statistic follows an F-distribution:

F \sim F_{c-2, n-c}.


Decision Rule

  • Reject H_0 at significance level \alpha if: F > F_{1-\alpha; c-2, n-c}.

  • Failing to reject H_0:

    • Indicates that there is no evidence of lack of fit.
    • Does not imply the model is “true,” but it suggests that the model provides a reasonable approximation to the true relationship.

To summarize, when repeat observations exist at some levels of X, the Error Sum of Squares (SSE) can be further partitioned into Lack of Fit (SSLF) and Pure Error (SSPE). This leads to an extended ANOVA table:

Source of Variation | Sum of Squares | df | Mean Square | F Statistic
Regression | SSR | 1 | MSR = \frac{SSR}{1} | F = \frac{MSR}{MSE}
Error | SSE | n-2 | MSE = \frac{SSE}{n-2} |
  Lack of fit | SSLF | c-2 | MSLF = \frac{SSLF}{c-2} | F = \frac{MSLF}{MSPE}
  Pure Error | SSPE | n-c | MSPE = \frac{SSPE}{n-c} |
Total (Corrected) | SSTO | n-1 | |
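
One way to carry out this test in R is to compare the straight-line fit with a model that estimates a separate mean at each distinct X level; anova() then reports exactly the MSLF/MSPE F statistic. The sketch below uses simulated data with replicates (all values are illustrative):

# Sketch: lack-of-fit F test via model comparison
set.seed(6)
x <- rep(1:5, each = 8)                      # c = 5 distinct levels with replicates
y <- 3 + 2 * x + rnorm(length(x), sd = 1)    # truly linear relationship here

linear_fit <- lm(y ~ x)           # reduced model (straight line)
means_fit  <- lm(y ~ factor(x))   # full model (one mean per X level)

anova(linear_fit, means_fit)      # F = MSLF / MSPE with (c-2, n-c) df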

Repeat observations have important implications for the coefficient of determination (R^2):

  1. R^2 Can’t Attain 1 with Repeat Observations:
    • With repeat observations, SSE (Error Sum of Squares) cannot be reduced to 0 because SSPE > 0 (variability within replicates).
  2. Maximum R^2:
    • The maximum attainable R^2 in the presence of repeat observations is:

      R^2_{\text{max}} = \frac{SSTO - SSPE}{SSTO}.

  3. Importance of Repeat Observations:
    • Not all levels of X need repeat observations, but their presence enables the separation of pure error from lack of fit, making the F-test for lack of fit possible.

Estimation of \sigma^2 with Repeat Observations

  1. Use of MSE:
    • When H_0 is appropriate (the model fits well), MSE is typically used as the estimate of \sigma^2 instead of MSPE because it has more degrees of freedom and provides a more reliable estimate.
  2. Pooling Estimates:
    • In practice, MSE and MSPE may be pooled if H_0 holds, resulting in a more precise estimate of \sigma^2.

5.1.1.10 Joint Inference for Regression Parameters

Joint inference considers the simultaneous coverage of confidence intervals for multiple regression parameters, such as \beta_0 (intercept) and \beta_1 (slope). Ensuring adequate confidence for both parameters together requires adjustments to maintain the desired family-wise confidence level.

Let:

  • \bar{A}_1: The event that the confidence interval for \beta_0 covers its true value.
  • \bar{A}_2: The event that the confidence interval for \beta_1 covers its true value.

The individual confidence levels are:

\begin{aligned} P(\bar{A}_1) &= 1 - \alpha, \\ P(\bar{A}_2) &= 1 - \alpha. \end{aligned}

The joint confidence coefficient, P(\bar{A}_1 \cap \bar{A}_2), satisfies (writing A_1 and A_2 for the complementary events that each interval fails to cover its parameter):

\begin{aligned} P(\bar{A}_1 \cap \bar{A}_2) &= 1 - P(A_1 \cup A_2), \\ &= 1 - P(A_1) - P(A_2) + P(A_1 \cap A_2), \\ &\geq 1 - P(A_1) - P(A_2), \\ &= 1 - 2\alpha. \end{aligned}

This means that if \alpha is the significance level for each parameter, the joint confidence coefficient is at least 1 - 2\alpha. This inequality is known as the Bonferroni Inequality.


Bonferroni Confidence Intervals

To ensure a desired joint confidence level of (1-\alpha) for both \beta_0 and \beta_1, the Bonferroni method adjusts the confidence level for each parameter by dividing \alpha by the number of parameters. For two parameters:

  1. The confidence level for each parameter is (1-\alpha/2).

  2. The resulting Bonferroni-adjusted confidence intervals are:

    \begin{aligned} b_0 &\pm B \cdot s(b_0), \\ b_1 &\pm B \cdot s(b_1), \end{aligned}

    where B = t_{1-\alpha/4; n-2} is the critical value from the t-distribution with n-2 degrees of freedom.


Interpretation of Bonferroni Confidence Intervals

  1. Coverage Probability:
    • If repeated samples were taken, (1-\alpha)100\% of the joint intervals would contain the true values of (\beta_0, \beta_1).
    • This implies that \alpha \times 100\% of the samples would miss at least one of the true parameter values.
  2. Conservatism:
    • The Bonferroni method ensures the family-wise confidence level but is conservative. The actual joint confidence level is often higher than (1-\alpha)100\%.
    • This conservatism reduces statistical power.
# Load necessary libraries
library(ggplot2)
library(MASS)

# Set seed for reproducibility
set.seed(123)

# Generate synthetic data
n <- 100  # Number of observations
x <- rnorm(n, mean = 0, sd = 1)  # Predictor
beta_0 <- 2  # True intercept
beta_1 <- 3  # True slope
sigma <- 1  # Standard deviation of error
y <-
    beta_0 + beta_1 * x + rnorm(n, mean = 0, sd = sigma)  # Response

# Fit linear model
model <- lm(y ~ x)
summary(model)
#> 
#> Call:
#> lm(formula = y ~ x)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -1.9073 -0.6835 -0.0875  0.5806  3.2904 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)  1.89720    0.09755   19.45   <2e-16 ***
#> x            2.94753    0.10688   27.58   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.9707 on 98 degrees of freedom
#> Multiple R-squared:  0.8859, Adjusted R-squared:  0.8847 
#> F-statistic: 760.6 on 1 and 98 DF,  p-value: < 2.2e-16

# Extract coefficients and standard errors
b0_hat <- coef(model)[1]
b1_hat <- coef(model)[2]
s_b0 <-
    summary(model)$coefficients[1, 2]  # Standard error of intercept
s_b1 <- summary(model)$coefficients[2, 2]  # Standard error of slope

# Desired confidence level
alpha <- 0.05  # Overall significance level

# Bonferroni correction: split alpha across the two parameters
adjusted_alpha <- alpha / 2  # Two-sided alpha allotted to each parameter

# Critical t-value for Bonferroni adjustment: B = t_{1 - alpha/4; n-2}
t_crit <-
    qt(1 - adjusted_alpha / 2, df = n - 2)  # n-2 degrees of freedom

# Bonferroni confidence intervals
ci_b0 <- c(b0_hat - t_crit * s_b0, b0_hat + t_crit * s_b0)
ci_b1 <- c(b1_hat - t_crit * s_b1, b1_hat + t_crit * s_b1)

# Print results
cat("Bonferroni Confidence Intervals:\n")
#> Bonferroni Confidence Intervals:
cat("Intercept (beta_0): [",
    round(ci_b0[1], 2),
    ",",
    round(ci_b0[2], 2),
    "]\n")
#> Intercept (beta_0): [ 1.68 , 2.12 ]
cat("Slope (beta_1): [",
    round(ci_b1[1], 2),
    ",",
    round(ci_b1[2], 2),
    "]\n")
#> Slope (beta_1): [ 2.7 , 3.19 ]

# Calculate the covariance matrix of coefficients
cov_matrix <- vcov(model)

# Generate points for confidence ellipse
ellipse_points <-
    MASS::mvrnorm(n = 1000,
                  mu = coef(model),
                  Sigma = cov_matrix)

# Convert to data frame for plotting
ellipse_df <- as.data.frame(ellipse_points)
colnames(ellipse_df) <- c("beta_0", "beta_1")

# Plot confidence intervals and ellipse
p <- ggplot() +
    # Confidence ellipse
    geom_point(
        data = ellipse_df,
        aes(x = beta_0, y = beta_1),
        alpha = 0.1,
        color = "grey"
    ) +
    # Point estimate
    geom_point(aes(x = b0_hat, y = b1_hat),
               color = "red",
               size = 3) +
    # Bonferroni confidence intervals
    geom_errorbar(aes(x = b0_hat, ymin = ci_b1[1], ymax = ci_b1[2]),
                  width = 0.1,
                  color = "blue") +
    geom_errorbarh(aes(y = b1_hat, xmin = ci_b0[1], xmax = ci_b0[2]),
                   height = 0.1,
                   color = "blue") +
    labs(title = "Bonferroni Confidence Intervals and Joint Confidence Region",
         x = "Intercept (beta_0)",
         y = "Slope (beta_1)") +
    theme_minimal()

print(p)

  1. The red point represents the estimated coefficients (b0_hat, b1_hat).
  2. The blue lines represent the Bonferroni-adjusted confidence intervals for beta_0 and beta_1.
  3. The grey points represent the joint confidence region based on the covariance matrix of coefficients.
  4. The Bonferroni intervals ensure family-wise confidence level but are conservative.
  5. Drawing repeated samples from the estimated distribution of the coefficients (the grey cloud) illustrates the joint uncertainty in (b_0, b_1) that the Bonferroni intervals are designed to cover.

Notes:

  1. Conservatism of Bonferroni Intervals
    • The Bonferroni interval is conservative:
      • The joint confidence level is a lower bound, ensuring family-wise coverage of at least (1-\alpha)100\%.
      • This conservatism results in wider intervals, reducing the statistical power of the test.
    • Adjustments for Conservatism:
      • Practitioners often choose a larger \alpha (e.g., \alpha = 0.1) to reduce the width of the intervals in Bonferroni joint tests.
      • A higher \alpha allows for a better balance between confidence and precision, especially for exploratory analyses.
  2. Extending Bonferroni to Multiple Parameters: The Bonferroni method is not limited to two parameters. For testing g parameters, such as \beta_0, \beta_1, \dots, \beta_{g-1}:
    • Adjusted Confidence Level for Each Parameter:
      • The confidence level for each individual parameter is (1-\alpha/g).
    • Critical t-Value:
      • For two-sided intervals, the critical value for each parameter is: t_{1-\frac{\alpha}{2g}; n-p}, where p is the total number of parameters in the regression model.
    • Example:
      • If \alpha = 0.05 and g = 10, each individual confidence interval is constructed at the: (1 - \frac{0.05}{10}) = 99.5\% \text{ confidence level}.
      • This corresponds to using t_{1-\frac{0.005}{2}; n-p} in the formula for the confidence intervals.
  3. Limitations for Large g
    • Wide Intervals:
      • As g increases, the intervals become excessively wide, often leading to reduced usefulness in practical applications.
      • This issue stems from the conservatism of the Bonferroni method, which prioritizes family-wise error control.
    • Suitability for Small g:
      • The Bonferroni procedure works well when g is relatively small (e.g., g \leq 5).
      • For larger g, alternative methods (discussed below) are more efficient.
  4. Correlation Between Parameters: Correlation of b_0 and b_1:
    • The estimated regression coefficients b_0 and b_1 are often correlated:
      • Negative correlation if \bar{X} > 0.
      • Positive correlation if \bar{X} < 0.
    • This correlation can complicate joint inference but does not affect the validity of Bonferroni-adjusted intervals.
  5. Alternatives to Bonferroni

Several alternative procedures provide more precise joint inference, especially for larger g:

  1. Scheffé’s Method:
    • Constructs simultaneous confidence regions for all possible linear combinations of parameters.
    • Suitable for exploratory analyses but may result in even wider intervals than Bonferroni.
  2. Tukey’s Honest Significant Difference:
    • Designed for pairwise comparisons in ANOVA but can be adapted for regression parameters.
  3. Holm’s Step-Down Procedure:
    • A sequential testing procedure that is less conservative than Bonferroni while still controlling the family-wise error rate.
  4. Likelihood Ratio Tests:
    • Construct joint confidence regions based on the likelihood function, offering more precision for large g.

5.1.1.11 Assumptions of Linear Regression

To ensure valid inference and reliable predictions in linear regression, the following assumptions must hold. We’ll cover them in depth in the next section.

Assumption | Description
Linearity | Linear relationship between predictors and response.
Independence of Errors | Errors are independent (important in time-series/clustering).
Homoscedasticity | Constant variance of residuals across predictors.
Normality of Errors | Residuals are normally distributed.
No Multicollinearity | Predictors are not highly correlated.
No Outliers/Leverage Points | No undue influence from outliers or high-leverage points.
Exogeneity | Predictors are uncorrelated with the error term (no endogeneity).
Full Rank | Predictors are linearly independent (no perfect multicollinearity).

5.1.1.12 Diagnostics for Model Assumptions

Constant Variance

  • To check homoscedasticity:
    • Plot residuals vs. fitted values or residuals vs. predictors.
    • Look for patterns or a funnel-shaped spread indicating heteroscedasticity.

Outliers

  • Detect outliers using:
    • Residuals vs. predictors plot.
    • Box plots.
    • Stem-and-leaf plots.
    • Scatter plots.

Standardized Residuals:

Residuals can be standardized to have unit variance, known as studentized residuals: r_i = \frac{e_i}{s(e_i)}.

Semi-Studentized Residuals:

A simplified standardization using the mean squared error (MSE): e_i^* = \frac{e_i}{\sqrt{MSE}}.

Non-Independent Error Terms

  • To detect non-independence:
    • Plot residuals vs. time for time-series data.
    • Residuals e_i are not independent because they depend on \hat{Y}_i, which is derived from the same regression function.
  • Detect dependency by plotting the residual for the i-th response vs. the (i-1)-th.

Non-Normality of Error Terms

  • To assess normality:
    • Plot distribution of residuals.
    • Create box plots, stem-and-leaf plots, or normal probability plots.
  • Issues such as an incorrect regression function or non-constant error variance can distort the residual distribution.
  • Normality tests require relatively large sample sizes to detect deviations.

Normality of Residuals

  • Use tests based on the empirical cumulative distribution function (ECDF) (check Normality Assessment)

Constancy of Error Variance

  • Statistical tests for homoscedasticity:
    • Brown-Forsythe Test (Modified Levene Test):
      • Robust against non-normality, examines the variance of residuals across levels of predictors.
    • Breusch-Pagan Test (Cook-Weisberg Test):
      • Tests for heteroscedasticity by regressing squared residuals on predictors.
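
As a brief illustration, the R sketch below (simulated data) draws a residual-vs-fitted plot and runs a Breusch-Pagan test; the test relies on the lmtest package, so it is guarded by a package check:

# Sketch: residual diagnostics and a Breusch-Pagan test
set.seed(10)
x <- rnorm(100)
y <- 1 + 2 * x + rnorm(100)
fit <- lm(y ~ x)

plot(fitted(fit), resid(fit),
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs. Fitted")
abline(h = 0, lty = 2)

if (requireNamespace("lmtest", quietly = TRUE)) {
    lmtest::bptest(fit)   # Breusch-Pagan test for heteroscedasticity
}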

5.1.1.13 Remedial Measures for Violations of Assumptions

When the assumptions of simple linear regression are violated, appropriate remedial measures can be applied to address these issues. Below is a list of measures for specific deviations from the assumptions.

5.1.1.13.1 General Remedies
  • Use more complicated models (e.g., non-linear models, generalized linear models).
  • Apply transformations (see Variable Transformation) on X and/or Y to stabilize variance, linearize relationships, or normalize residuals. Note that transformations may not always yield “optimal” results.

5.1.1.13.2 Specific Remedies for Assumption Violations
Issue | Remedy | Explanation
Non-Linearity | Apply transformations (e.g., log, square root). | Transformation of variables can help linearize the relationship between X and Y.
 | Use more complicated models (e.g., polynomial regression, splines). | Higher-order terms or non-linear models can capture non-linear relationships.
Non-Constant Error Variance | Apply Weighted Least Squares. | WLS assigns weights to observations based on the inverse of their variance.
 | Use transformations (e.g., log, square root). | Transformations can stabilize error variance.
Correlated Errors | Use serially correlated error models (e.g., ARIMA for time-series data). | Time-series models account for serial dependence in the errors.
Non-Normality of Errors | Transform Y or use non-parametric methods. | Transformations can normalize residuals; non-parametric methods do not assume normality.
Omitted Variables | Use multiple regression to include additional relevant predictors. | Adding relevant variables reduces omitted variable bias and improves model accuracy.
Outliers | Apply robust estimation techniques (e.g., Huber regression, M-estimation). | Robust methods reduce the influence of outliers on parameter estimates.

5.1.1.13.3 Remedies in Detail
  1. Non-Linearity:
    • Transformations: Apply transformations to the response variable Y or the predictor variable X. Common transformations include:
      • Logarithmic transformation: Y' = \log(Y) or X' = \log(X).
      • Polynomial terms: Include X^2, X^3, etc., to capture curvature.
    • Alternative Models:
      • Polynomial regression or splines for flexibility in modeling non-linear relationships.
  2. Non-Constant Error Variance:
    • Weighted Least Squares:
      • Assigns weights to observations inversely proportional to their variance.
    • Transformations:
      • Use a log or square root transformation to stabilize variance.
  3. Correlated Errors:
    • For time-series data:
      • Use serially correlated error models such as AR(1) or ARIMA.
      • These models explicitly account for dependency in residuals over time.
  4. Non-Normality:
    • Transformations:
      • Apply a transformation to Y (e.g., log or square root) to make the residuals approximately normal.
    • Non-parametric regression:
      • Methods like LOESS or Theil-Sen regression do not require the normality assumption.
  5. Omitted Variables:
    • Introduce additional predictors:
      • Use multiple regression to include all relevant independent variables.
    • Check for multicollinearity when adding new variables.
  6. Outliers:
    • Robust Regression:
      • Use methods such as Huber regression or M-estimation to reduce the impact of outliers on model coefficients.
    • Diagnostics:
      • Identify outliers using Cook’s Distance, leverage statistics, or studentized residuals.

5.1.1.14 Transformations in Regression Analysis

Transformations involve modifying one or both variables to address issues such as non-linearity, non-constant variance, or non-normality. However, it’s important to note that the properties of least-squares estimates apply to the transformed model, not the original variables.

When transforming the dependent variable Y, we fit the model as:

g(Y_i) = b_0 + b_1 X_i,

where g(Y_i) is the transformed response. To interpret the regression results in terms of the original Y, we need to transform back:

\hat{Y}_i = g^{-1}(b_0 + b_1 X_i).


Direct back-transformation of predictions can introduce bias. For example, in a log-transformed model:

\log(Y_i) = b_0 + b_1 X_i,

the unbiased back-transformed prediction of Y_i is:

\hat{Y}_i = \exp(b_0 + b_1 X_i + \frac{\sigma^2}{2}),

where \frac{\sigma^2}{2} accounts for the bias correction due to the log transformation.
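
The following R sketch (simulated log-normal data; all values are illustrative) contrasts the naive back-transformation with the bias-adjusted version, estimating \sigma^2 by the model MSE:

# Sketch: back-transforming predictions from a log-transformed model
set.seed(12)
n <- 200
x <- runif(n, 1, 5)
y <- exp(0.5 + 0.4 * x + rnorm(n, sd = 0.5))   # log-normal errors

fit <- lm(log(y) ~ x)
sigma2 <- summary(fit)$sigma^2

naive     <- exp(fitted(fit))                  # tends to under-predict the mean
corrected <- exp(fitted(fit) + sigma2 / 2)     # lognormal bias adjustment

c(observed = mean(y), naive = mean(naive), corrected = mean(corrected))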


5.1.1.14.1 Box-Cox Family of Transformations

The Box-Cox transformation is a versatile family of transformations defined as:

Y' = \begin{cases} \frac{Y^\lambda - 1}{\lambda}, & \text{if } \lambda \neq 0, \\ \ln(Y), & \text{if } \lambda = 0. \end{cases}

This transformation introduces a parameter \lambda that is estimated from the data. Common transformations include:

\lambda Transformation Y'
2 Y^2
0.5 \sqrt{Y}
0 \ln(Y)
-0.5 1/\sqrt{Y}
-1 1/Y

Choosing the Transformation Parameter \lambda

The value of \lambda can be selected using one of the following methods:

  1. Trial and Error:
    • Apply different transformations and compare the residual plots or model fit statistics (e.g., R^2 or AIC).
  2. Maximum Likelihood Estimation:
    • Choose \lambda to maximize the likelihood function under the assumption of normally distributed errors.
  3. Numerical Search:
    • Use computational optimization techniques to minimize the residual sum of squares (RSS) or another goodness-of-fit criterion.
# Install and load the necessary library
if (!require("MASS")) install.packages("MASS")
library(MASS)

# Fit a linear model
set.seed(123)
n <- 50
x <- rnorm(n, mean = 5, sd = 2)
y <- 3 + 2 * x + rnorm(n, mean = 0, sd = 2)
model <- lm(y ~ x)

# Apply Box-Cox Transformation
boxcox_result <- boxcox(model, lambda = seq(-2, 2, 0.1), plotit = TRUE)


# Find the optimal lambda
optimal_lambda <- boxcox_result$x[which.max(boxcox_result$y)]
cat("Optimal lambda for Box-Cox transformation:", optimal_lambda, "\n")
#> Optimal lambda for Box-Cox transformation: 0.8686869

Notes

  1. Benefits of Transformations:

    • Stabilize Variance: Helps address heteroscedasticity.

    • Linearize Relationships: Useful for non-linear data.

    • Normalize Residuals: Addresses non-normality issues.

  2. Caveats:

    • Interpretability: Transformed variables may complicate interpretation.

    • Over-Transformation: Excessive transformations can distort the relationship between variables.

  3. Applicability:

    • Transformations are most effective for issues like non-linearity or non-constant variance. They are less effective for correcting independence violations or omitted variables.
5.1.1.14.2 Variance Stabilizing Transformations

Variance stabilizing transformations are used when the standard deviation of the response variable depends on its mean. The delta method, which applies a Taylor series expansion, provides a systematic approach to find such transformations.

Given that the standard deviation of Y is a function of its mean:

\sigma = \sqrt{\text{var}(Y)} = f(\mu),

where \mu = E(Y) and f(\mu) is a smooth function of the mean, we aim to find a transformation h(Y) such that the variance of the transformed variable h(Y) is constant for all values of \mu.

Expanding h(Y) in a Taylor series around \mu:

h(Y) = h(\mu) + h'(\mu)(Y - \mu) + \text{higher-order terms}.

Ignoring higher-order terms, the variance of h(Y) can be approximated as:

\text{var}(h(Y)) = \text{var}(h(\mu) + h'(\mu)(Y - \mu)).

Since h(\mu) is a constant:

\text{var}(h(Y)) = \left(h'(\mu)\right)^2 \text{var}(Y).

Substituting \text{var}(Y) = \left(f(\mu)\right)^2, we get:

\text{var}(h(Y)) = \left(h'(\mu)\right)^2 \left(f(\mu)\right)^2.

To stabilize the variance (make it constant for all \mu), we require:

\left(h'(\mu)\right)^2 \left(f(\mu)\right)^2 = \text{constant}.

Thus, the derivative of h(\mu) must be proportional to the inverse of f(\mu):

h'(\mu) \propto \frac{1}{f(\mu)}.

Integrating both sides gives:

h(\mu) = \int \frac{1}{f(\mu)} \, d\mu.

The specific form of h(\mu) depends on the function f(\mu), which describes the relationship between the standard deviation and the mean.


Examples of Variance Stabilizing Transformations

f(\mu) | Transformation h(Y) | Purpose
\sqrt{\mu} | \int \frac{1}{\sqrt{\mu}} \, d\mu = 2\sqrt{Y} | Stabilizes variance for Poisson data.
\mu | \int \frac{1}{\mu} \, d\mu = \ln(Y) | Stabilizes variance for exponential or multiplicative models.
\mu^2 | \int \frac{1}{\mu^2} \, d\mu = -\frac{1}{Y} | Stabilizes variance for certain power-law data.

Variance stabilizing transformations are particularly useful for:

  1. Poisson-distributed data: Use h(Y) = 2\sqrt{Y} to stabilize variance.
  2. Exponential or multiplicative models: Use h(Y) = \ln(Y) for stabilization.
  3. Power law relationships: Use transformations like h(Y) = Y^{-1} or other forms derived from f(\mu).

Example: Variance Stabilizing Transformation for the Poisson Distribution

For a Poisson distribution, the variance of Y is equal to its mean:

\sigma^2 = \text{var}(Y) = E(Y) = \mu.

Thus, the standard deviation is:

\sigma = f(\mu) = \sqrt{\mu}.

Using the relationship for variance stabilizing transformations:

h'(\mu) \propto \frac{1}{f(\mu)} = \mu^{-0.5}.

Integrating h'(\mu) gives the variance stabilizing transformation:

h(\mu) = \int \mu^{-0.5} \, d\mu = 2\sqrt{\mu}.

Hence, the variance stabilizing transformation is:

h(Y) = \sqrt{Y}.

This transformation is widely used in Poisson regression to stabilize the variance of the response variable.

# Simulate Poisson data
set.seed(123)
n <- 500
x <- rnorm(n, mean = 5, sd = 2)
y <- rpois(n, lambda = exp(1 + 0.3 * x))  # Poisson-distributed Y

# Fit linear model without transformation
model_raw <- lm(y ~ x)

# Apply square root transformation
y_trans <- sqrt(y)
model_trans <- lm(y_trans ~ x)

# Compare residual plots
par(mfrow = c(2, 1), mar = c(4, 4, 2, 1))

# Residual plot for raw data
plot(
    fitted(model_raw),
    resid(model_raw),
    main = "Residuals Raw Data",
    xlab = "Fitted Values",
    ylab = "Residuals"
)
abline(h = 0, col = "red", lty = 2)

# Residual plot for transformed data
plot(
    fitted(model_trans),
    resid(model_trans),
    main = "Residuals: Transformed Data (sqrt(Y))",
    xlab = "Fitted Values",
    ylab = "Residuals"
)
abline(h = 0, col = "blue", lty = 2)

5.1.1.14.3 General Strategy When f(\mu) Is Unknown

If the relationship between \text{var}(Y) and \mu (i.e., f(\mu)) is unknown, the following steps can help:

  1. Trial and Error:
    • Apply common transformations (e.g., \log(Y), \sqrt{Y}) and examine residual plots.
    • Select the transformation that results in stabilized variance (residuals show no pattern in plots).
  2. Leverage Prior Research:
    • Consult researchers or literature on similar experiments to determine the transformations typically used.
  3. Analyze Observations with the Same Predictor Value:
    • If multiple observations Y_{ij} are available at the same X value:
      • Compute the mean \bar{Y}_i and standard deviation s_i for each group.
      • Check if s_i \propto \bar{Y}_i^{\lambda}.
        • For example, assume: s_i = a \bar{Y}_i^{\lambda}.
        • Taking the natural logarithm: \ln(s_i) = \ln(a) + \lambda \ln(\bar{Y}_i).
      • Perform a regression of \ln(s_i) on \ln(\bar{Y}_i) to estimate \lambda and suggest the form of f(\mu).
  4. Group Observations:
    • If individual observations are sparse, try grouping similar observations by X values to compute \bar{Y}_i and s_i for each group.
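
A small R sketch of step 3 above (simulated data in which the standard deviation is proportional to the mean, so the estimated \lambda should come out near 1, pointing toward a log transformation):

# Sketch: estimate lambda by regressing log(s_i) on log(ybar_i) across groups
set.seed(13)
x_levels <- rep(1:10, each = 10)
mu <- 2 + 0.8 * x_levels
y  <- rnorm(length(x_levels), mean = mu, sd = 0.3 * mu)  # sd proportional to mean

ybar <- tapply(y, x_levels, mean)
s    <- tapply(y, x_levels, sd)

coef(lm(log(s) ~ log(ybar)))   # slope estimates lambda (roughly 1 here)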

5.1.1.14.4 Common Transformations and Their Applications

The table below summarizes common transformations used to stabilize variance under various conditions, along with their appropriate contexts and comments:

Transformation | Situation | Comments
\sqrt{Y} | var(\epsilon_i) = k \, E(Y_i) | For counts following a Poisson distribution.
\sqrt{Y} + \sqrt{Y+1} | var(\epsilon_i) = k \, E(Y_i) | Useful for small counts or datasets with zeros.
\log(Y) | var(\epsilon_i) = k \, (E(Y_i))^2 | Appropriate for positive integers with a wide range.
\log(Y+1) | var(\epsilon_i) = k \, (E(Y_i))^2 | Used when the data includes zero counts.
1/Y | var(\epsilon_i) = k \, (E(Y_i))^4 | For responses mostly near zero, with occasional large values.
\arcsin(\sqrt{Y}) | var(\epsilon_i) = k \, E(Y_i)(1-E(Y_i)) | Suitable for binomial proportions or percentage data.
  1. Choosing the Transformation:
    • Start by identifying the relationship between the variance of the residuals (var(\epsilon_i)) and the mean of the response variable (E(Y_i)).
    • Select the transformation that matches the identified variance structure.
  2. Transformations for Zero Values:
    • For data with zeros, transformations like \sqrt{Y+1} or \log(Y+1) can be used to avoid undefined values, but such shifted transformations can seriously compromise the model's assumptions and the interpretation of the estimates (J. Chen and Roth 2024).
  3. Use in Regression Models:
    • Apply these transformations to the dependent variable Y in the regression model.
    • Always check residual plots to confirm that the transformation stabilizes variance and resolves non-linearity.
  4. Interpretation After Transformation:
    • After transforming Y, interpret the results in terms of the transformed variable.
    • For practical interpretation, back-transform predictions and account for any associated bias.

5.1.2 Multiple Linear Regression

The geometry of least squares regression involves projecting the response vector \mathbf{y} onto the space spanned by the columns of the design matrix \mathbf{X}. The fitted values \mathbf{\hat{y}} can be expressed as:

\begin{aligned} \mathbf{\hat{y}} &= \mathbf{Xb} \\ &= \mathbf{X(X'X)^{-1}X'y} \\ &= \mathbf{Hy}, \end{aligned}

where:

  • \mathbf{H} = \mathbf{X(X'X)^{-1}X'} is the projection operator (sometimes denoted as \mathbf{P}).
  • \mathbf{\hat{y}} is the projection of \mathbf{y} onto the linear space spanned by the columns of \mathbf{X} (the model space).

The dimension of the model space is equal to the rank of \mathbf{X} (i.e., the number of linearly independent columns in \mathbf{X}).


Properties of the Projection Matrix \mathbf{H}

  1. Symmetry:
    • The projection matrix \mathbf{H} is symmetric: \mathbf{H} = \mathbf{H}'.
  2. Idempotence:
    • Applying \mathbf{H} twice gives the same result: \mathbf{HH} = \mathbf{H}. Proof: \begin{aligned} \mathbf{HH} &= \mathbf{X(X'X)^{-1}X'X(X'X)^{-1}X'} \\ &= \mathbf{X(X'X)^{-1}IX'} \\ &= \mathbf{X(X'X)^{-1}X'} \\ &= \mathbf{H}. \end{aligned}
  3. Dimensionality:
    • \mathbf{H} is an n \times n matrix (where n is the number of observations).
    • The rank of \mathbf{H} is equal to the rank of \mathbf{X}, which is typically the number of predictors (including the intercept).
  4. Orthogonal Complement:
    • The matrix \mathbf{(I - H)}, where: \mathbf{I - H} = \mathbf{I - X(X'X)^{-1}X'}, is also a projection operator.
    • It projects onto the orthogonal complement of the space spanned by the columns of \mathbf{X} (i.e., the space orthogonal to the model space).
  5. Orthogonality of Projections:
    • \mathbf{H} and \mathbf{(I - H)} are orthogonal: \mathbf{H(I - H)} = \mathbf{0}.
    • Similarly: \mathbf{(I - H)H} = \mathbf{0}.

Intuition for \mathbf{H} and \mathbf{(I - H)}

  • \mathbf{H}: Projects \mathbf{y} onto the model space, giving the fitted values \mathbf{\hat{y}}.
  • \mathbf{I - H}: Projects \mathbf{y} onto the residual space, giving the residuals \mathbf{e}: \mathbf{e} = \mathbf{(I - H)y}.
  • \mathbf{H} and \mathbf{(I - H)} divide the response vector \mathbf{y} into two components: \mathbf{y} = \mathbf{\hat{y}} + \mathbf{e}.
    • \mathbf{\hat{y}} = \mathbf{Hy} (fitted values).
    • \mathbf{e} = \mathbf{(I - H)y} (residuals).
  • The properties of \mathbf{H} (symmetry, idempotence, dimensionality) reflect its role as a linear transformation that projects vectors onto the model space.

This geometric perspective provides insight into the mechanics of least squares regression, particularly how the response variable \mathbf{y} is decomposed into fitted values and residuals.
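
These properties are easy to confirm numerically; the R sketch below (a small simulated design matrix) builds \mathbf{H} explicitly and checks symmetry, idempotence, and that \mathbf{Hy} reproduces the fitted values from lm():

# Sketch: numerical check of the projection (hat) matrix properties
set.seed(14)
n <- 30
X <- cbind(1, rnorm(n), rnorm(n))          # design matrix with intercept
y <- drop(X %*% c(1, 2, -1)) + rnorm(n)

H <- X %*% solve(t(X) %*% X) %*% t(X)      # hat matrix

all.equal(H, t(H))                         # symmetry
all.equal(H, H %*% H)                      # idempotence
all.equal(drop(H %*% y),
          unname(fitted(lm(y ~ X[, -1]))))  # H y equals the fitted values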


Similar to simple regression, the total sum of squares in multiple regression analysis can be partitioned into components corresponding to the regression (model fit) and the residuals (errors).

The uncorrected total sum of squares is:

\mathbf{y'y} = \mathbf{\hat{y}'\hat{y} + e'e},

where:

  • \mathbf{\hat{y} = Hy} (fitted values, projected onto the model space).

  • \mathbf{e = (I - H)y} (residuals, projected onto the orthogonal complement of the model space).

Expanding this using projection matrices:

\begin{aligned} \mathbf{y'y} &= \mathbf{(Hy)'(Hy) + ((I-H)y)'((I-H)y)} \\ &= \mathbf{y'H'Hy + y'(I-H)'(I-H)y} \\ &= \mathbf{y'Hy + y'(I-H)y}. \end{aligned}

This equation shows the partition of \mathbf{y'y} into components explained by the model (\mathbf{\hat{y}}) and the unexplained variation (residuals).

For the corrected total sum of squares, we adjust for the mean (using the projection matrix \mathbf{H_1}):

\mathbf{y'(I-H_1)y = y'(H-H_1)y + y'(I-H)y}.

Here:

  • \mathbf{H_1} = \frac{1}{n} \mathbf{J}, where \mathbf{J} is an n \times n matrix of ones.

  • \mathbf{H - H_1} projects onto the subspace explained by the predictors after centering.

Aspect | Uncorrected Total Sum of Squares (\mathbf{y'y}) | Corrected Total Sum of Squares (\mathbf{y'(I-H_1)y})
Definition | Total variation in y relative to the origin. | Total variation in y relative to its mean (centered data).
Adjustment | No adjustment for the mean of y. | Adjusts for the mean of y by centering it.
Equation | \mathbf{y'y} = \mathbf{\hat{y}'\hat{y}} + \mathbf{e'e} | \mathbf{y'(I-H_1)y} = \mathbf{y'(H-H_1)y} + \mathbf{y'(I-H)y}
Projection Matrices | \mathbf{H}: projects onto the model space; \mathbf{I-H}: projects onto the residual space. | \mathbf{H_1} = \frac{1}{n} \mathbf{J}: adjusts for the mean; \mathbf{H-H_1}: projects onto the predictors after centering; \mathbf{I-H}: projects onto the residual space.
Interpretation | Includes variation due to the mean of y. | Focuses on variation in y around its mean.
Usage | Suitable for raw, uncentered data. | Common in regression and ANOVA to isolate variability explained by predictors.
Application | Measures total variability in y, including the overall level (mean). | Measures variability explained by predictors relative to the mean.

Why the Correction Matters

  • In ANOVA and regression, removing the contribution of the mean helps isolate the variability explained by predictors from the overall level of the response variable.

  • Corrected sums of squares are more common when comparing models or computing R^2, which requires centering to ensure consistency in proportionate variance explained.

The corrected total sum of squares can be decomposed into the sum of squares for regression (SSR) and the sum of squares for error (SSE):

| Source | SS | df | MS | F |
|---|---|---|---|---|
| Regression | SSR = \mathbf{y'(H - \frac{1}{n} J)y} | p - 1 | MSR = SSR / (p-1) | MSR / MSE |
| Error | SSE = \mathbf{y'(I - H)y} | n - p | MSE = SSE / (n-p) | |
| Total | SST = \mathbf{y'(I - H_1)y} | n - 1 | | |

Where:

  • p: Number of parameters (including intercept).

  • n: Number of observations.
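The R sketch below (simulated data; all values are illustrative) computes SST, SSR, and SSE as the quadratic forms shown in the table and checks the resulting F statistic against the one reported by lm().

# Sketch (simulated data): compute the corrected sums of squares as quadratic
# forms in the projection matrices and compare the F statistic with lm().
set.seed(2)
n <- 40
X <- cbind(1, rnorm(n), rnorm(n))
y <- X %*% c(0.5, 1, -2) + rnorm(n)
p <- ncol(X)

H  <- X %*% solve(t(X) %*% X) %*% t(X)
H1 <- matrix(1 / n, n, n)                    # H_1 = (1/n) J

SST <- drop(t(y) %*% (diag(n) - H1) %*% y)   # corrected total SS
SSR <- drop(t(y) %*% (H - H1) %*% y)         # regression SS
SSE <- drop(t(y) %*% (diag(n) - H) %*% y)    # error SS

c(SST = SST, SSR_plus_SSE = SSR + SSE)       # the two should be equal
(F_stat <- (SSR / (p - 1)) / (SSE / (n - p)))
summary(lm(y ~ X[, -1]))$fstatistic          # F statistic should match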


Alternatively, the regression model can be expressed as:

\mathbf{Y = X\hat{\beta} + (Y - X\hat{\beta})},

where:

  • \mathbf{\hat{Y} = X\hat{\beta}}: Vector of fitted values (in the subspace spanned by \mathbf{X}).

  • \mathbf{e = Y - X\hat{\beta}}: Vector of residuals (in the orthogonal complement of the subspace spanned by \mathbf{X}).

  • \mathbf{Y} is an n \times 1 vector in the n-dimensional space \mathbb{R}^n.

  • \mathbf{X} is an n \times p full-rank matrix, with its columns generating a p-dimensional subspace of \mathbb{R}^n. Hence, any estimator \mathbf{X\hat{\beta}} is also in this subspace.

In linear regression, the Ordinary Least Squares estimator \hat{\beta} minimizes the squared Euclidean distance \|\mathbf{Y} - \mathbf{X}\beta\|^2 between the observed response vector \mathbf{Y} and the fitted values \mathbf{X}\beta. This minimization corresponds to the orthogonal projection of \mathbf{Y} onto the column space of \mathbf{X}.

We solve the optimization problem:

\min_{\beta} \|\mathbf{Y} - \mathbf{X}\beta\|^2.

The objective function can be expanded as:

\|\mathbf{Y} - \mathbf{X}\beta\|^2 = (\mathbf{Y} - \mathbf{X}\beta)^\top (\mathbf{Y} - \mathbf{X}\beta).

Perform the multiplication:

\begin{aligned} (\mathbf{Y} - \mathbf{X}\beta)^\top (\mathbf{Y} - \mathbf{X}\beta) &= \mathbf{Y}^\top \mathbf{Y} - \mathbf{Y}^\top \mathbf{X}\beta - \beta^\top \mathbf{X}^\top \mathbf{Y} + \beta^\top (\mathbf{X}^\top \mathbf{X}) \beta. \end{aligned}

Since \mathbf{Y}^\top \mathbf{X}\beta is a scalar, it equals \beta^\top \mathbf{X}^\top \mathbf{Y}. Therefore, the expanded expression becomes:

\|\mathbf{Y} - \mathbf{X}\beta\|^2 = \mathbf{Y}^\top \mathbf{Y} - 2\beta^\top \mathbf{X}^\top \mathbf{Y} + \beta^\top (\mathbf{X}^\top \mathbf{X}) \beta.

To find the \beta that minimizes this expression, take the derivative with respect to \beta and set it to 0:

\frac{\partial}{\partial \beta} \Bigl[ \mathbf{Y}^\top \mathbf{Y} - 2\beta^\top \mathbf{X}^\top \mathbf{Y} + \beta^\top (\mathbf{X}^\top \mathbf{X}) \beta \Bigr] = 0.

Computing the gradient:

\frac{\partial}{\partial \beta}\|\mathbf{Y} - \mathbf{X}\beta\|^2 = -2\mathbf{X}^\top \mathbf{Y} + 2(\mathbf{X}^\top \mathbf{X})\beta.

Setting this to zero:

-2\mathbf{X}^\top \mathbf{Y} + 2\mathbf{X}^\top \mathbf{X}\beta = 0.

Simplify:

\mathbf{X}^\top \mathbf{X}\beta = \mathbf{X}^\top \mathbf{Y}.

If \mathbf{X}^\top \mathbf{X} is invertible, the solution is:

\hat{\beta} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{Y}.
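As a quick numerical check of this derivation, here is a minimal R sketch on simulated data (the values are illustrative) that solves the normal equations directly, compares the result with lm(), and verifies that the residuals are orthogonal to the columns of \mathbf{X}.

# Minimal sketch (simulated data): solve the normal equations X'X beta = X'Y
# directly and compare with lm(); also verify X'(Y - X beta_hat) = 0.
set.seed(123)
n <- 100
X <- cbind(1, rnorm(n), rnorm(n))            # design matrix with intercept
Y <- X %*% c(1, 2, -0.5) + rnorm(n)

beta_hat <- solve(t(X) %*% X, t(X) %*% Y)    # (X'X)^{-1} X'Y without explicit inversion
cbind(normal_eq = drop(beta_hat), lm = coef(lm(Y ~ X - 1)))

e <- Y - X %*% beta_hat                      # residual vector
round(t(X) %*% e, 10)                        # approx 0: residuals orthogonal to Col(X)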


Orthogonal Projection Interpretation

The fitted values are:

\hat{\mathbf{Y}} = \mathbf{X}\hat{\beta}.

From the normal equations, \mathbf{X}^\top(\mathbf{Y} - \mathbf{X}\hat{\beta}) = 0, which implies that the residual vector \mathbf{Y} - \hat{\mathbf{Y}} is orthogonal to every column of \mathbf{X}. Therefore:

  • \hat{\mathbf{Y}} = \mathbf{X}\hat{\beta} is the orthogonal projection of \mathbf{Y} onto \mathrm{Col}(\mathbf{X}).
  • \mathbf{Y} - \mathbf{X}\hat{\beta} lies in the orthogonal complement of \mathrm{Col}(\mathbf{X}).

Pythagoras Decomposition

The geometric interpretation gives us the decomposition:

\mathbf{Y} = \mathbf{X}\hat{\beta} + (\mathbf{Y} - \mathbf{X}\hat{\beta}),

where:

  • \mathbf{X}\hat{\beta} is the projection of \mathbf{Y} onto the column space of \mathbf{X}.

  • (\mathbf{Y} - \mathbf{X}\hat{\beta}) is the residual vector, orthogonal to \mathbf{X}\hat{\beta}.

Since the two components are orthogonal, their squared norms satisfy:

\begin{aligned}
\|\mathbf{Y}\|^2 &= \mathbf{Y}^\top \mathbf{Y} && \text{(definition of the squared norm)} \\
&= (\mathbf{X}\hat{\beta} + \mathbf{Y} - \mathbf{X}\hat{\beta})^\top(\mathbf{X}\hat{\beta} + \mathbf{Y} - \mathbf{X}\hat{\beta}) && \text{(add and subtract the same term } \mathbf{X}\hat{\beta}\text{)} \\
&= \|\mathbf{X}\hat{\beta}\|^2 \;+\; 2\,(\mathbf{X}\hat{\beta})^\top(\mathbf{Y} - \mathbf{X}\hat{\beta}) \;+\; \|\mathbf{Y} - \mathbf{X}\hat{\beta}\|^2 && \text{(expand }(a+b)^\top(a+b)\text{ and rewrite quadratic forms as norms)} \\
&= \|\mathbf{X}\hat{\beta}\|^2 + \|\mathbf{Y} - \mathbf{X}\hat{\beta}\|^2 && \text{(orthogonality: }(\mathbf{X}\hat{\beta})^\top(\mathbf{Y}-\mathbf{X}\hat{\beta}) = 0\text{)}
\end{aligned}

where the norm of a vector \mathbf{a} in \mathbb{R}^n is defined as:

\|\mathbf{a}\| = \sqrt{\mathbf{a}^\top \mathbf{a}} = \sqrt{\sum_{i=1}^n a_i^2}.

We are saying that \mathbf{Y} is decomposed into two orthogonal components:

  1. \mathbf{X}\hat{\beta} (the projection onto \mathrm{Col}(\mathbf{X})).

    • \|\mathbf{X}\hat{\beta}\| measures the part of \mathbf{Y} explained by the model.
  2. \mathbf{Y} - \mathbf{X}\hat{\beta} (the residual lying in the orthogonal complement).

    • \|\mathbf{Y} - \mathbf{X}\hat{\beta}\| measures the residual error.

This geometric interpretation (projection plus orthogonal remainder) is exactly why we call \mathbf{X}\hat{\beta} the orthogonal projection of \mathbf{Y} onto the column space of \mathbf{X}. This decomposition also underlies the analysis of variance (ANOVA) in regression.


The coefficient of multiple determination, denoted R^2, measures the proportion of the total variation in the response variable (\mathbf{Y}) that is explained by the regression model. It is defined as:

R^2 = \frac{SSR}{SSTO} = 1 - \frac{SSE}{SSTO},

where:

  • SSR: Regression sum of squares (variation explained by the model).

  • SSE: Error sum of squares (unexplained variation).

  • SSTO: Total sum of squares (total variation in \mathbf{Y}).

The adjusted R^2 adjusts R^2 for the number of predictors in the model, penalizing for adding predictors that do not improve the model’s fit substantially. It is defined as:

R^2_a = 1 - \frac{SSE/(n-p)}{SSTO/(n-1)} = 1 - \frac{(n-1)SSE}{(n-p)SSTO},

where:

  • n: Number of observations.

  • p: Number of parameters (including the intercept).


Key Differences Between R^2 and R^2_a

| Aspect | R^2 | R^2_a |
|---|---|---|
| Behavior with Predictors | Always increases (or remains constant) when more predictors are added, even if they are not statistically significant. | Includes a penalty for the number of predictors; may decrease if added predictors do not improve the model sufficiently. |
| Interpretation | Proportion of the total variation in \mathbf{Y} explained by the regression model. | Adjusted measure of explained variance, accounting for model complexity. |
| Range | Ranges between 0 and 1. | Can be lower than R^2, particularly when the model includes irrelevant predictors. |
| Usefulness | Useful for understanding the overall fit of the model. | Useful for comparing models with different numbers of predictors. |

In multiple regression, R^2_a provides a more reliable measure of model fit, especially when comparing models with different numbers of predictors.
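The following short R sketch (simulated data; values are illustrative) computes R^2 and R^2_a directly from the formulas above and compares them with the values reported by summary(lm()).

# Sketch (simulated data): compute R^2 and adjusted R^2 by hand and compare
# with the values reported by summary(lm()).
set.seed(3)
n <- 60
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 0.5 * x1 - 0.3 * x2 + rnorm(n)

fit  <- lm(y ~ x1 + x2)
SSE  <- sum(resid(fit)^2)
SSTO <- sum((y - mean(y))^2)
p    <- length(coef(fit))                    # number of parameters (incl. intercept)

R2     <- 1 - SSE / SSTO
R2_adj <- 1 - (SSE / (n - p)) / (SSTO / (n - 1))

c(R2, summary(fit)$r.squared)                # should agree
c(R2_adj, summary(fit)$adj.r.squared)        # should agree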


In a regression model with coefficients \beta = (\beta_0, \beta_1, \dots, \beta_{p-1})^\top, the sums of squares are used to evaluate the contribution of predictors to explaining the variation in the response variable.

  1. Model Sums of Squares:
    • SSM: Total model sum of squares, capturing the variation explained by all predictors: SSM = SS(\beta_0, \beta_1, \dots, \beta_{p-1}).
  2. Marginal Contribution:
    • SSM_m: Conditional model sum of squares, capturing the variation explained by the predictors after accounting for the intercept: SSM_m = SS(\beta_1, \dots, \beta_{p-1} | \beta_0).

Decompositions of SSM_m

  1. Sequential Sums of Squares (Type I SS)
  • Definition:

    • Sequential SS depends on the order in which predictors are added to the model.
    • It represents the additional contribution of each predictor given only the predictors that precede it in the sequence.
  • Formula: SSM_m = SS(\beta_1 | \beta_0) + SS(\beta_2 | \beta_0, \beta_1) + \dots + SS(\beta_{p-1} | \beta_0, \dots, \beta_{p-2}).

  • Key Points:

    • Sequential SS is not unique; it depends on the order of the predictors.
    • Default in many statistical software functions (e.g., anova() in R).
  2. Marginal Sums of Squares (Type II SS)
  • Definition:
    • Marginal SS evaluates the contribution of a predictor after accounting for all other predictors except those with which it is collinear.
    • It ignores hierarchical relationships or interactions, focusing on independent contributions.
  • Formula: SSM_m = SS(\beta_j | \beta_1, \dots, \beta_{j-1}, \beta_{j + 1}, \dots, \beta_{p-1}) where Type II SS evaluates the contribution of \beta_j while excluding any terms collinear with \beta_j.
  • Key Points:
    • Type II SS is independent of predictor order.
    • Suitable for models without interaction terms or when predictors are balanced.
  3. Partial Sums of Squares (Type III SS)
  • Definition:

    • Partial SS evaluates the contribution of each predictor after accounting for all other predictors in the model.
    • It quantifies the unique contribution of a predictor, controlling for the presence of others.
  • Formula: SSM_m = SS(\beta_1 | \beta_0, \beta_2, \dots, \beta_{p-1}) + \dots + SS(\beta_{p-1} | \beta_0, \beta_1, \dots, \beta_{p-2}).

  • Key Points:

    • Partial SS is unique for a given model.
    • More commonly used in practice for assessing individual predictor importance.

Comparison of Sequential, Marginal, and Partial SS

| Aspect | Sequential SS (Type I) | Marginal SS (Type II) | Partial SS (Type III) |
|---|---|---|---|
| Dependency | Depends on the order in which predictors are entered. | Independent of order; adjusts for non-collinear predictors. | Independent of order; evaluates unique contributions. |
| Usage | Default in software functions like anova() (Type I SS). | Models without interactions or hierarchical dependencies. | Commonly used for hypothesis testing. |
| Interpretation | Measures the additional contribution of predictors in sequence. | Measures the contribution of a predictor, ignoring collinear terms. | Measures the unique contribution of each predictor. |
| Uniqueness | Not unique; changes with predictor order. | Unique for a given model without interactions. | Unique for a given model. |

Practical Notes

  • Use Type III SS (Partial SS) when:
    • The focus is on individual predictor contributions while accounting for all others.
    • Conducting hypothesis tests on predictors in complex models with interactions or hierarchical structures.
  • Use Type II SS (Marginal SS) when:
    • Working with balanced datasets or models without interaction terms.
    • Ignoring interactions and focusing on independent effects.
  • Use Type I SS (Sequential SS) when:
    • Interested in understanding the incremental contribution of predictors based on a specific order of entry (e.g., stepwise regression).
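As a small illustration of how these options behave in R, the sketch below (simulated data) shows that Type I SS from anova() changes with predictor order, while Type II and Type III SS from car::Anova() do not. The car package is assumed to be installed; it is not part of base R.

# Sketch (simulated data): Type I SS depends on predictor order, while
# Type II/III SS from car::Anova() are order-invariant.
set.seed(4)
n  <- 100
x1 <- rnorm(n)
x2 <- 0.6 * x1 + rnorm(n)                    # correlated predictors
y  <- 1 + x1 + 0.5 * x2 + rnorm(n)

anova(lm(y ~ x1 + x2))                       # Type I: x1 entered first
anova(lm(y ~ x2 + x1))                       # Type I: x2 entered first (different SS)

# install.packages("car")                    # assumed available
car::Anova(lm(y ~ x1 + x2), type = 2)        # Type II: order-invariant
car::Anova(lm(y ~ x1 + x2), type = 3)        # Type III: unique contribution of each term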

5.1.2.1 OLS Assumptions


5.1.2.1.1 A1 Linearity

The linear regression model is expressed as:

A1: y = \mathbf{x}\beta + \epsilon

This assumption is not restrictive since x can include nonlinear transformations (e.g., interactions, natural logarithms, quadratic terms).

However, when combined with A3 (Exogeneity of Independent Variables), linearity can become restrictive.


5.1.2.1.1.1 Log Model Variants

Logarithmic transformations of variables allow for flexible modeling of nonlinear relationships. Common log model forms include:

| Model | Form | Interpretation of \beta_1 | In Words |
|---|---|---|---|
| Level-Level | y = \beta_0 + \beta_1 x + \epsilon | \Delta y = \beta_1 \Delta x | A one-unit change in x results in a \beta_1-unit change in y. |
| Log-Level | \ln(y) = \beta_0 + \beta_1 x + \epsilon | \% \Delta y = 100 \beta_1 \Delta x | A one-unit change in x results in a 100\beta_1\% change in y. |
| Level-Log | y = \beta_0 + \beta_1 \ln(x) + \epsilon | \Delta y = (\beta_1/100)\% \Delta x | A 1% change in x results in a (\beta_1/100)-unit change in y. |
| Log-Log | \ln(y) = \beta_0 + \beta_1 \ln(x) + \epsilon | \% \Delta y = \beta_1 \% \Delta x | A 1% change in x results in a \beta_1\% change in y. |
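A tiny R sketch (simulated data; the true elasticity of 1.2 is an assumed value) illustrates the log-log case: the estimated slope recovers the elasticity.

# Sketch (simulated data): in a log-log model the slope is an elasticity,
# i.e., a 1% change in x is associated with roughly a beta_1 % change in y.
set.seed(5)
x <- runif(200, 1, 10)
y <- exp(0.5 + 1.2 * log(x) + rnorm(200, sd = 0.2))   # true elasticity = 1.2

coef(lm(log(y) ~ log(x)))                              # slope should be close to 1.2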

5.1.2.1.1.2 Higher-Order Models

Higher-order terms allow the effect of x_1 on y to depend on the level of x_1. For example:

y = \beta_0 + x_1\beta_1 + x_1^2\beta_2 + \epsilon

The partial derivative of y with respect to x_1 is:

\frac{\partial y}{\partial x_1} = \beta_1 + 2x_1\beta_2

  • The effect of x_1 on y depends on the value of x_1.
  • Partial Effect at the Average: \beta_1 + 2E(x_1)\beta_2.
  • Average Partial Effect: E(\beta_1 + 2x_1\beta_2).
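The R sketch below (simulated data; coefficients are assumed values) computes both quantities from the estimated coefficients. Note that for a pure quadratic term the two coincide exactly, since the derivative is linear in x_1; they generally differ for more complicated specifications.

# Sketch (simulated data): compare the partial effect at the average,
# b1 + 2*E(x1)*b2, with the average partial effect, E(b1 + 2*x1*b2).
set.seed(6)
x1 <- rnorm(500, mean = 2)
y  <- 1 + 0.8 * x1 - 0.3 * x1^2 + rnorm(500)

fit <- lm(y ~ x1 + I(x1^2))
b   <- coef(fit)

b["x1"] + 2 * mean(x1) * b["I(x1^2)"]        # partial effect at the average
mean(b["x1"] + 2 * x1 * b["I(x1^2)"])        # average partial effect (equal here)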

5.1.2.1.1.3 Interaction Terms

Interactions capture the joint effect of two variables. For example:

y = \beta_0 + x_1\beta_1 + x_2\beta_2 + x_1x_2\beta_3 + \epsilon

  • \beta_1 is the average effect of a unit change in x_1 on y when x_2 = 0.
  • The partial effect of x_1 on y, which depends on the level of x_2, is:

\beta_1 + x_2\beta_3.
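The short R sketch below (simulated data; coefficient values are assumed) fits an interaction model and evaluates the partial effect of x_1 at a few values of x_2.

# Sketch (simulated data): with an interaction, the partial effect of x1 is
# beta_1 + x2 * beta_3, so it varies with the level of x2.
set.seed(7)
x1 <- rnorm(300); x2 <- rnorm(300)
y  <- 1 + 0.5 * x1 - 0.2 * x2 + 0.7 * x1 * x2 + rnorm(300)

fit <- lm(y ~ x1 * x2)                       # includes x1, x2, and x1:x2
b   <- coef(fit)

unname(b["x1"] + c(-1, 0, 1) * b["x1:x2"])   # partial effect of x1 at x2 = -1, 0, 1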


5.1.2.1.2 A2 Full Rank

The full rank assumption ensures the uniqueness and existence of the parameter estimates in the population regression equation. It is expressed as:

A2: \text{rank}(E(\mathbf{x'x})) = k

This assumption is also known as the identification condition.

Key Points

  1. No Perfect Multicollinearity:
    • The columns of \mathbf{x} (the matrix of predictors) must be linearly independent.
    • No column in \mathbf{x} can be written as a linear combination of other columns.
  2. Implications:
    • Ensures that each parameter in the regression equation is identifiable and unique.
    • Prevents computational issues, such as the inability to invert \mathbf{x'x}, which is required for estimating \hat{\beta}.

Example of Violation

If two predictors, x_1 and x_2, are perfectly correlated (e.g., x_2 = 2x_1), the rank of \mathbf{x} is reduced, and \mathbf{x'x} becomes singular. In such cases:

  • The regression coefficients cannot be uniquely estimated.

  • The model fails to satisfy the full rank assumption.
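The example of violation can be reproduced directly in R (simulated data): when x_2 = 2x_1, lm() flags the aliased column and the cross-product matrix is rank-deficient.

# Sketch (simulated data): when x2 = 2 * x1, X'X is singular; lm() detects the
# aliased column and reports NA for its coefficient.
set.seed(8)
x1 <- rnorm(50)
x2 <- 2 * x1                                 # perfectly collinear with x1
y  <- 1 + x1 + rnorm(50)

coef(lm(y ~ x1 + x2))                        # coefficient on x2 is NA

X <- cbind(1, x1, x2)
qr(t(X) %*% X)$rank                          # rank 2 < 3 columns: A2 is violated
# solve(t(X) %*% X)                          # would fail: matrix is computationally singular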


5.1.2.1.3 A3 Exogeneity of Independent Variables

The exogeneity assumption ensures that the independent variables (\mathbf{x}) are not systematically related to the error term (\epsilon). It is expressed as:

A3: E[\epsilon | x_1, x_2, \dots, x_k] = E[\epsilon | \mathbf{x}] = 0

This assumption is often referred to as strict exogeneity or mean independence (see Correlation and Independence).

Key Points

  1. Strict Exogeneity:
    • Independent variables carry no information about the error term \epsilon.
    • By the Law of Iterated Expectations, E(\epsilon) = 0, which can be satisfied by always including an intercept in the regression model.
  2. Implication:
    • A3 implies: E(y | \mathbf{x}) = \mathbf{x}\beta, meaning the conditional mean function is a linear function of \mathbf{x}. This aligns with A1 Linearity.
  3. Relationship with Independence:
    • Mean independence (A3) is weaker than full statistical independence of \epsilon and \mathbf{x}, but stronger than zero correlation (A3a below): independence implies A3, and A3 implies A3a, while neither converse holds.

5.1.2.1.3.1 A3a: Weaker Exogeneity Assumption

A weaker version of the exogeneity assumption is:

A3a: E(\mathbf{x_i'}\epsilon_i) = 0

This implies:

  • The independent variables (\mathbf{x}_i) are uncorrelated with the error term (\epsilon_i).

  • Weaker than mean independence in A3.


Comparison Between A3 and A3a

| Aspect | A3 (Strict Exogeneity) | A3a (Weaker Exogeneity) |
|---|---|---|
| Definition | E(\epsilon \mid \mathbf{x}) = 0. | E(\mathbf{x}_i'\epsilon_i) = 0. |
| Strength | Stronger assumption; implies A3a. | Weaker assumption; does not imply A3. |
| Interpretation | Predictors provide no information about \epsilon. | Predictors are uncorrelated with \epsilon. |
| Causality | Enables causal interpretation. | Does not allow causal interpretations. |

Notes on Practical Relevance

  • Checking for Exogeneity:
    • Strict exogeneity cannot be tested directly, but violations can manifest as omitted variable bias, endogeneity, or measurement error.
    • Including all relevant predictors and ensuring accurate measurement can help satisfy this assumption.
  • Violations of Exogeneity:
    • If A3 is violated, standard OLS estimates are biased and inconsistent.
    • In such cases, instrumental variable (IV) methods or other approaches may be required to correct for endogeneity.

5.1.2.1.4 A4 Homoskedasticity

The homoskedasticity assumption ensures that the variance of the error term (\epsilon) is constant across all levels of the independent variables (\mathbf{x}). It is expressed as:

A4: \text{Var}(\epsilon | \mathbf{x}) = \text{Var}(\epsilon) = \sigma^2

Key Points

  1. Definition:
    • The variance of the disturbance term \epsilon is the same for all observations, regardless of the values of the predictors \mathbf{x}.
  2. Practical Implication:
    • Homoskedasticity ensures that the errors do not systematically vary with the predictors.
    • This is critical for valid inference, as the standard errors of the coefficients rely on this assumption.
  3. Violation (Heteroskedasticity):
    • If the variance of \epsilon depends on \mathbf{x}, the assumption is violated.
    • Common signs include funnel-shaped patterns in residual plots or varying error sizes.
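A short R sketch (simulated data with error variance that grows with x, an assumed design) shows the typical funnel-shaped residual plot and a formal Breusch-Pagan test. The lmtest package is assumed to be installed.

# Sketch (simulated data): residual plot and Breusch-Pagan test for
# heteroskedasticity.
set.seed(9)
x <- runif(200, 1, 5)
y <- 1 + 2 * x + rnorm(200, sd = x)          # error variance grows with x

fit <- lm(y ~ x)
plot(fitted(fit), resid(fit),
     xlab = "Fitted values", ylab = "Residuals")  # funnel shape suggests heteroskedasticity

# install.packages("lmtest")                 # assumed available
lmtest::bptest(fit)                          # small p-value: reject constant variance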

5.1.2.1.5 A5 Data Generation (Random Sampling)

The random sampling assumption ensures that the observations (y_i, x_{i1}, \dots, x_{ik-1}) are drawn independently and identically distributed (iid) from the joint distribution of (y, \mathbf{x}). It is expressed as:

A5: \{y_i, x_{i1}, \dots, x_{ik-1} : i = 1, \dots, n\}

Key Points

  1. Random Sampling:
    • The dataset is assumed to be a random sample from the population.
    • Each observation is independent of others and follows the same probability distribution.
  2. Implications:
    • With A3 (Exogeneity of Independent Variables) and A4 (Homoskedasticity), random sampling implies:
      • Strict Exogeneity: E(\epsilon_i | x_1, \dots, x_n) = 0 Independent variables do not contain information for predicting \epsilon.
      • Non-Autocorrelation: E(\epsilon_i \epsilon_j | x_1, \dots, x_n) = 0 \quad \text{for } i \neq j The error terms are uncorrelated across observations, conditional on the independent variables.
      • Variance of Errors: \text{Var}(\epsilon | \mathbf{X}) = \text{Var}(\epsilon) = \sigma^2 \mathbf{I}_n
  3. When A5 May Not Hold:
    • In time series data, where observations are often autocorrelated.
    • In spatial data, where neighboring observations may not be independent.

Practical Considerations

  • Time Series Data:
    • Use methods such as autoregressive models or generalized least squares (GLS) to address dependency in observations.
  • Spatial Data:
    • Spatial econometric models may be required to handle correlation across geographic locations.
  • Checking Random Sampling:
    • While true randomness cannot always be verified, exploratory analysis of the residuals (e.g., for patterns or autocorrelation) can help detect violations.
5.1.2.1.5.1 A5a: Stationarity in Stochastic Processes

A stochastic process \{x_t\}_{t=1}^T is stationary if, for every collection of time indices \{t_1, t_2, \dots, t_m\}, the joint distribution of:

x_{t_1}, x_{t_2}, \dots, x_{t_m}

is the same as the joint distribution of:

x_{t_1+h}, x_{t_2+h}, \dots, x_{t_m+h}

for any h \geq 1.

Key Points on Stationarity

  1. Definition:
    • A stationary process has statistical properties (mean, variance, and covariance) that are invariant over time.
    • For example, the joint distribution for the first ten observations is identical to the joint distribution for the next ten observations, regardless of their position in time.
  2. Implication:
    • Stationarity ensures that the relationships observed in the data remain consistent over time.
    • Independent draws automatically satisfy the stationarity condition.

Weak Stationarity

A stochastic process \{x_t\}_{t=1}^T is weakly (covariance) stationary if:

  • Its mean and variance are constant over time, and the covariance between x_t and x_{t+h} depends only on the lag h and not on t.
  • When, in addition, the covariance diminishes as h \to \infty, the process is said to be weakly dependent: x_t and x_{t+h} become “almost independent” at long lags.

Differences Between Stationarity and Weak Stationarity

| Aspect | Stationarity | Weak Stationarity |
|---|---|---|
| Joint Distribution | Entire joint distribution remains unchanged over time. | Focuses only on the first two moments: mean and covariance. |
| Dependence Over Time | Observations at all lags are equally distributed. | Observations far apart are “almost independent.” |
| Application | Ensures strong consistency in time-series processes. | More practical for many time-series applications. |

Weak stationarity is often sufficient for many time-series analyses, especially when focusing on correlations and trends rather than full distributions.


Common Weakly Dependent Processes

1. Moving Average Process of Order 1 (MA(1))

An MA(1) process models the dependent variable y_t as a function of the current and one-period lagged stochastic error term:

y_t = u_t + \alpha_1 u_{t-1},

where u_t is white noise, independently and identically distributed (iid) with variance \sigma^2.

Key Properties

  1. Mean: E(y_t) = E(u_t) + \alpha_1E(u_{t-1}) = 0

  2. Variance: \begin{aligned} \text{Var}(y_t) &= \text{Var}(u_t) + \alpha_1^2 \text{Var}(u_{t-1}) \\ &= \sigma^2 + \alpha_1^2 \sigma^2 \\ &= \sigma^2 (1 + \alpha_1^2) \end{aligned} An increase in the absolute value of \alpha_1 increases the variance.

  3. Autocovariance:

    • For lag 1: \text{Cov}(y_t, y_{t-1}) = \alpha_1 \text{Var}(u_{t-1}) = \alpha_1 \sigma^2.
    • For lag 2 or greater: \text{Cov}(y_t, y_{t-2}) = 0.

The MA(1) process is invertible if |\alpha_1| < 1, allowing it to be rewritten as an autoregressive (AR) representation:

u_t = y_t - \alpha_1 u_{t-1}.

Invertibility implies that we can express the current observation in terms of past observations.


An MA(q) process generalizes the MA(1) process to include q lags:

y_t = u_t + \alpha_1 u_{t-1} + \dots + \alpha_q u_{t-q},

where u_t \sim WN(0, \sigma^2) (white noise with mean 0 and variance \sigma^2).

Key Characteristics:

  1. Covariance Stationary:
    • An MA(q) process is covariance stationary irrespective of the parameter values.
  2. Invertibility:
    • An MA(q) process is invertible if the parameters satisfy certain conditions (e.g., |\alpha_i| < 1).
  3. Autocorrelations:
    • The autocorrelations are nonzero for lags up to q but are 0 for lags beyond q.
  4. Conditional Mean:
    • The conditional mean of y_t depends on the q most recent shocks, so the process has a finite memory of q periods.

Example: Autocovariance of an MA(1)

  1. Lag 1: \begin{aligned} \text{Cov}(y_t, y_{t-1}) &= \text{Cov}(u_t + \alpha_1 u_{t-1}, u_{t-1} + \alpha_1 u_{t-2}) \\ &= \alpha_1 \text{Var}(u_{t-1}) \\ &= \alpha_1 \sigma^2. \end{aligned}

  2. Lag 2: \begin{aligned} \text{Cov}(y_t, y_{t-2}) &= \text{Cov}(u_t + \alpha_1 u_{t-1}, u_{t-2} + \alpha_1 u_{t-3}) \\ &= 0. \end{aligned}

An MA process captures a linear relationship between the dependent variable y_t and the current and past values of a stochastic error term u_t. Its properties make it useful for modeling time-series data with limited memory and short-term dependencies.
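The behavior described above is easy to check by simulation. The R sketch below (assumed parameter \alpha_1 = 0.6, white noise with \sigma^2 = 1) simulates an MA(1) with arima.sim() and confirms that the sample variance is near \sigma^2(1 + \alpha_1^2) and that autocorrelations are essentially zero beyond lag 1.

# Sketch: simulate an MA(1) process and check its variance and autocorrelations.
set.seed(10)
alpha1 <- 0.6
y <- arima.sim(model = list(ma = alpha1), n = 2000)

var(y)                                       # approx sigma^2 * (1 + alpha1^2) = 1.36
acf(y, lag.max = 5, plot = FALSE)            # lag-1 autocorrelation approx alpha1 / (1 + alpha1^2)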

  2. Auto-Regressive Process of Order 1 (AR(1))

An auto-regressive process of order 1 (AR(1)) is defined as:

y_t = \rho y_{t-1} + u_t, \quad |\rho| < 1

where u_t represents independent and identically distributed (i.i.d.) random noise over t with variance \sigma^2.

Covariance at lag 1:

\begin{aligned} Cov(y_t, y_{t-1}) &= Cov(\rho y_{t-1} + u_t, y_{t-1}) \\ &= \rho Var(y_{t-1}) \\ &= \rho \frac{\sigma^2}{1-\rho^2}. \end{aligned}

Covariance at lag h:

Cov(y_t, y_{t-h}) = \rho^h \frac{\sigma^2}{1-\rho^2}.

Stationarity implies that the distribution of y_t does not change over time, requiring constant mean and variance. For this process:

  1. Mean Stationarity:
    Substituting the AR(1) equation recursively, we have: y_t = \rho^t y_0 + \rho^{t-1}u_1 + \rho^{t-2}u_2 + \dots + \rho u_{t-1} + u_t.

    If the initial observation y_0 = 0, then y_t is simply a weighted sum of the random shocks u_t from all prior time periods. Thus, E(y_t) = 0 for all t.

  2. Variance Stationarity:
    The variance is computed as: Var(y_t) = Var(\rho y_{t-1} + u_t).

    Expanding and simplifying: Var(y_t) = \rho^2 Var(y_{t-1}) + Var(u_t) + 2\rho Cov(y_{t-1}, u_t). Since u_t is independent of y_{t-1}, Cov(y_{t-1}, u_t) = 0, giving: Var(y_t) = \rho^2 Var(y_{t-1}) + \sigma^2.

    Solving recursively, we find: Var(y_t) = \frac{\sigma^2}{1-\rho^2}.

For the variance to remain constant (and finite) over time, it is required that |\rho| < 1; if \rho = \pm 1, the recursion has no constant solution.

Key Insights on Stationarity

  1. Stationarity Requirement:
    The condition |\rho| < 1 ensures stationarity, as this guarantees that both the mean and variance of the process are constant over time.

  2. Weak Dependence:
    As |\rho| < 1, the dependency between observations diminishes with increasing lag h, as seen from the covariance: Cov(y_t, y_{t-h}) = \rho^h \frac{\sigma^2}{1-\rho^2}.

To estimate an AR(1) process, we utilize the Yule-Walker equations, which relate the autocovariances of the process to its parameters.

Starting with the AR(1) process: y_t = \epsilon_t + \phi y_{t-1}, multiplying both sides by y_{t-\tau} gives: y_t y_{t-\tau} = \epsilon_t y_{t-\tau} + \phi y_{t-1} y_{t-\tau}.

Taking expectations and using E(\epsilon_t y_{t-\tau}) = 0 for \tau \geq 1, the autocovariance \gamma(\tau) satisfies: \gamma(\tau) = \phi \gamma(\tau - 1).

Dividing through by the variance \gamma(0), we obtain the autocorrelation: \rho_\tau = \phi^\tau.

Thus, the autocorrelations decay geometrically as \phi^\tau, where |\phi| < 1 ensures stationarity and decay over time.


Generalizing to AR(p)

An AR(p) process extends the AR(1) structure to include p lags: y_t = \phi_1 y_{t-1} + \phi_2 y_{t-2} + \dots + \phi_p y_{t-p} + \epsilon_t.

Here, \epsilon_t is white noise with E(\epsilon_t) = 0 and Var(\epsilon_t) = \sigma^2.

The AR(p) process is covariance stationary if the roots of the characteristic equation lie outside the unit circle: 1 - \phi_1 z - \phi_2 z^2 - \dots - \phi_p z^p = 0.

For the AR(p) process, the autocorrelations \rho_\tau decay more complexly compared to the AR(1). However, they still diminish over time, ensuring weak dependence among distant observations.


The Yule-Walker equations for an AR(p) process provide a system of linear equations to estimate the parameters \phi_1, \phi_2, \dots, \phi_p: \gamma(\tau) = \phi_1 \gamma(\tau - 1) + \phi_2 \gamma(\tau - 2) + \dots + \phi_p \gamma(\tau - p), \quad \tau \geq 1.

This system can be written in matrix form for \tau = 1, \dots, p as:

\begin{bmatrix} \gamma(1) \\ \gamma(2) \\ \vdots \\ \gamma(p) \end{bmatrix} = \begin{bmatrix} \gamma(0) & \gamma(1) & \dots & \gamma(p-1) \\ \gamma(1) & \gamma(0) & \dots & \gamma(p-2) \\ \vdots & \vdots & \ddots & \vdots \\ \gamma(p-1) & \gamma(p-2) & \dots & \gamma(0) \end{bmatrix} \begin{bmatrix} \phi_1 \\ \phi_2 \\ \vdots \\ \phi_p \end{bmatrix}.

This system is solved to estimate the coefficients \phi_1, \phi_2, \dots, \phi_p.
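The R sketch below (simulated AR(2) with assumed coefficients \phi = (0.5, 0.3)) solves the Yule-Walker system directly from sample autocovariances and compares the result with ar.yw() from the stats package.

# Sketch: estimate an AR(2) by solving the Yule-Walker system from sample
# autocovariances, and compare with ar.yw().
set.seed(11)
phi <- c(0.5, 0.3)
y <- arima.sim(model = list(ar = phi), n = 5000)

g <- acf(y, lag.max = 2, type = "covariance", plot = FALSE)$acf  # gamma(0), gamma(1), gamma(2)
Gamma <- matrix(c(g[1], g[2],
                  g[2], g[1]), 2, 2)         # Toeplitz matrix of autocovariances
rhs   <- c(g[2], g[3])                       # (gamma(1), gamma(2))

solve(Gamma, rhs)                            # Yule-Walker estimates of phi_1, phi_2
ar.yw(y, order.max = 2, aic = FALSE)$ar      # should be very close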


  3. ARMA(p, q) Process

An ARMA(p, q) process combines autoregressive (AR) and moving average (MA) components to model time series data effectively. The general form is:

y_t = \phi_1 y_{t-1} + \phi_2 y_{t-2} + \dots + \phi_p y_{t-p} + \epsilon_t + \alpha_1 \epsilon_{t-1} + \alpha_2 \epsilon_{t-2} + \dots + \alpha_q \epsilon_{t-q}.

A simple case of ARMA(1, 1) is given by: y_t = \phi y_{t-1} + \epsilon_t + \alpha \epsilon_{t-1}, where:

  • \phi captures the autoregressive behavior,

  • \alpha controls the moving average component,

  • \epsilon_t represents white noise.

ARMA processes can capture more complex dependence structures than pure AR or MA models alone, often with relatively few parameters.


  4. Random Walk Process

A random walk is a non-stationary process defined as: y_t = y_0 + \sum_{s=1}^t u_s, where u_s are i.i.d. random variables.

Properties:

  1. Non-Stationarity:

    • If y_0 = 0, then E(y_t) = 0, but the variance increases over time: Var(y_t) = t \sigma^2.
  2. Not Weakly Dependent:

    • The covariance of the process does not diminish with increasing lag h: Cov\left(\sum_{s=1}^t u_s, \sum_{s=1}^{t-h} u_s\right) = (t-h)\sigma^2.

    As h increases, the covariance remains large, violating the condition for weak dependence.


5.1.2.1.5.2 A5a: Stationarity and Weak Dependence in Time Series

For time series data, the set \{y_t, x_{t1}, \dots, x_{tk-1}\}, where t = 1, \dots, T, must satisfy the conditions of stationarity and weak dependence. These properties are essential to ensure the consistency and efficiency of estimators in time-series regression models.

  1. Stationarity: A stationary process has statistical properties (e.g., mean, variance, autocovariance) that remain constant over time. This ensures that the relationships in the data do not change as t progresses, making it possible to draw meaningful inferences.
  2. Weak Dependence: Weak dependence implies that observations far apart in time are “almost independent.” While there may be short-term correlations, these diminish as the time lag increases. This property ensures that the sample averages are representative of the population mean.

The Weak Law of Large Numbers provides a foundation for the consistency of sample means. If \{z_t\} is a weakly dependent, stationary process with E(|z_t|) < \infty and E(z_t) = \mu, then:

\frac{1}{T} \sum_{t=1}^T z_t \xrightarrow{p} \mu.

Interpretation:

  • As the sample size T \to \infty, the sample mean \bar{z} = \frac{1}{T} \sum_{t=1}^T z_t converges in probability to the true mean \mu.

  • This ensures the consistency of estimators based on time-series data.

The Central Limit Theorem extends the WLLN by describing the distribution of the sample mean. Under additional regularity conditions (e.g., finite variance) (Greene 1990), the sample mean \bar{z} satisfies:

\sqrt{T}(\bar{z} - \mu) \xrightarrow{d} N(0, B),

where:

B = \text{Var}(z_t) + 2\sum_{h=1}^\infty \text{Cov}(z_t, z_{t-h}).

Interpretation:

  • The sample mean \bar{z} is approximately normally distributed for large T.

  • The variance of the limiting distribution, B, depends not only on the variance of z_t but also on the covariances between z_t and its past values.

5.1.2.1.6 A6 Normal Distribution

A6: \epsilon|\mathbf{x} \sim N(0, \sigma^2 I_n)

The assumption here implies that the error term \epsilon is normally distributed with mean zero and variance \sigma^2 I_n. This assumption is fundamental for statistical inference in linear regression models.

Using assumptions A1 Linearity, A2 Full Rank, and A3 Exogeneity of Independent Variables, we derive the identification (or orthogonality condition) for the population parameter \beta:

\begin{aligned} y &= x\beta + \epsilon && \text{(A1: Model Specification)} \\ x'y &= x'x\beta + x'\epsilon && \text{(Multiply both sides by $x'$)} \\ E(x'y) &= E(x'x)\beta + E(x'\epsilon) && \text{(Taking expectation)} \\ E(x'y) &= E(x'x)\beta && \text{(A3: Exogeneity, $E(x'\epsilon) = 0$)} \\ [E(x'x)]^{-1}E(x'y) &= [E(x'x)]^{-1}E(x'x)\beta && \text{(Invertibility of $E(x'x)$, A2)} \\ [E(x'x)]^{-1}E(x'y) &= \beta && \text{(Simplified solution for $\beta$)} \end{aligned}

Thus, \beta is identified as the vector of parameters that minimizes the expected squared error.

To find \beta, we minimize the expected value of the squared error:

\underset{\gamma}{\operatorname{argmin}} \ E\big((y - x\gamma)^2\big)

The first-order condition is derived by taking the derivative of the objective function with respect to \gamma and setting it to zero:

\begin{aligned} \frac{\partial E\big((y - x\gamma)^2\big)}{\partial \gamma} &= 0 \\ -2E(x'(y - x\gamma)) &= 0 \\ E(x'y) - E(x'x\gamma) &= 0 \\ E(x'y) &= E(x'x)\gamma \\ (E(x'x))^{-1}E(x'y) &= \gamma \end{aligned}

This confirms that \gamma = \beta.

The second-order condition ensures that the solution minimizes the objective function. Taking the second derivative:

\frac{\partial^2 E\big((y - x\gamma)^2\big)}{\partial \gamma \partial \gamma'} = 2E(x'x)

The matrix E(x'x) is always positive semi-definite (PSD), and under A2 Full Rank it is positive definite. Thus, 2E(x'x) is PSD, ensuring that the solution is a minimum.

5.1.2.1.7 Hierarchy of OLS Assumptions

This table summarizes the hierarchical nature of assumptions required to derive different properties of the OLS estimator.

Usage of Assumptions
| Assumption | Identification / Data Description | Unbiasedness / Consistency | Gauss-Markov (BLUE) / Asymptotic Inference (z and Chi-squared) | Classical LM (BUE) / Small-sample Inference (t and F) |
|---|---|---|---|---|
| A2 Full Rank (Variation in \mathbf{X}) | \checkmark | \checkmark | \checkmark | \checkmark |
| A5 Data Generation (Random Sampling) | | \checkmark | \checkmark | \checkmark |
| A1 Linearity (Linearity in Parameters) | | \checkmark | \checkmark | \checkmark |
| A3 Exogeneity of Independent Variables (Zero Conditional Mean) | | \checkmark | \checkmark | \checkmark |
| A4 Homoskedasticity | | | \checkmark | \checkmark |
| A6 Normal Distribution (Normality of Errors) | | | | \checkmark |
  • Identification / Data Description: Ensures the model is identifiable and coefficients can be estimated.
  • Unbiasedness / Consistency: Guarantees that OLS estimates are unbiased and converge to the true parameter values as the sample size increases.
  • Gauss-Markov (BLUE) / Asymptotic Inference: Requires additional assumptions (e.g., homoskedasticity) to ensure minimum variance of estimators and valid inference using large-sample tests (z and chi-squared).
  • Classical LM (BUE) / Small-sample Inference: Builds on all previous assumptions and adds normality of errors for valid t and F tests in finite samples.

5.1.2.2 Theorems

5.1.2.2.1 Frisch–Waugh–Lovell Theorem

The Frisch–Waugh–Lovell (FWL) Theorem is a fundamental result in linear regression that allows for a deeper understanding of how coefficients are computed in a multiple regression setting (Lovell 2008). Informally, it states:

When estimating the effect of a subset of variables (X_1) on y in the presence of other variables (X_2), you can “partial out” the influence of X_2 from both y and X_1. Then, regressing the residuals of y on the residuals of X_1 produces coefficients for X_1 that are identical to those obtained from the full multiple regression.

Consider the multiple linear regression model:

\mathbf{y = X\beta + \epsilon = X_1\beta_1 + X_2\beta_2 + \epsilon}

where:

  • y is an n \times 1 vector of the dependent variable.
  • X_1 is an n \times k_1 matrix of regressors of interest.
  • X_2 is an n \times k_2 matrix of additional regressors.
  • \beta_1 and \beta_2 are coefficient vectors of sizes k_1 \times 1 and k_2 \times 1, respectively.
  • \epsilon is an n \times 1 error term vector.

This can be equivalently represented in partitioned matrix form as:

\left( \begin{array}{cc} X_1'X_1 & X_1'X_2 \\ X_2'X_1 & X_2'X_2 \end{array} \right) \left( \begin{array}{c} \hat{\beta_1} \\ \hat{\beta_2} \end{array} \right) = \left( \begin{array}{c} X_1'y \\ X_2'y \end{array} \right)

The OLS estimator for the vector \begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix} is:

\begin{pmatrix} \hat{\beta}_1 \\ \hat{\beta}_2 \end{pmatrix} = \begin{pmatrix} X_1'X_1 & X_1'X_2 \\ X_2'X_1 & X_2'X_2 \end{pmatrix}^{-1} \begin{pmatrix} X_1'y \\ X_2'y \end{pmatrix}.

If we only want the coefficients on X_1, a known result from partitioned-inversion gives:

\hat{\beta}_1 = \bigl(X_1' M_2\, X_1\bigr)^{-1} \,X_1' M_2\, y,

where

M_2 = I - X_2 \bigl(X_2'X_2\bigr)^{-1} X_2'.

The matrix M_2 is often called the residual-maker or annihilator matrix for X_2. It is an n \times n symmetric, idempotent projection matrix that projects any vector in \mathbb{R}^n onto the orthogonal complement of the column space of X_2. M_2 satisfies M_2^2 = M_2, and M_2 = M_2'.

Intuitively, M_2 captures the part of y (and any other vector) that is orthogonal to the columns of X_2. This “partialling out” of X_2 from both y and X_1 lets us isolate \hat{\beta}_1.

Equivalently, we can also represent \hat{\beta_1} as:

\mathbf{\hat{\beta_1} = (X_1'X_1)^{-1}X_1'y - (X_1'X_1)^{-1}X_1'X_2\hat{\beta_2}}

From this equation, we can see that

  1. Betas from Multiple vs. Simple Regressions:
    • The coefficients (\beta) from a multiple regression are generally not the same as the coefficients from separate individual simple regressions.
  2. Impact of Additional Variables (X_2):
    • The inclusion of different sets of explanatory variables (X_2) affects all coefficient estimates, even for those in X_1.
  3. Special Cases:
    • If X_1'X_2 = 0 (orthogonality between X_1 and X_2) or \hat{\beta_2} = 0, the above points (1 and 2) do not hold. In such cases, there is no interaction between the coefficients in X_1 and X_2, making the coefficients in X_1 unaffected by X_2.

Steps in FWL:

  1. Partial Out X_2 from y: Regress y on X_2 to obtain residuals:

    \tilde{y} = M_2y.

  2. Partial Out X_2 from X_1: For each column of X_1, regress it on X_2 to obtain residuals:

    \tilde{X}_1 = M_2X_1.

  3. Regression of Residuals: Regress \tilde{y} on \tilde{X}_1:

    \tilde{y} = \tilde{X}_1\beta_1 + \text{error}.

The coefficients \beta_1 obtained here are identical to those from the full model regression:

y = X_1\beta_1 + X_2\beta_2 + \epsilon.

Why It Matters

  1. Interpretation of Partial Effects: The FWL Theorem provides a way to interpret \beta_1 as the effect of X_1 on y after removing any linear dependence on X_2.

  2. Computational Simplicity: It allows the decomposition of a large regression problem into smaller, computationally simpler pieces.

# Simulate data
set.seed(123)
n <- 100
X1 <- matrix(rnorm(n * 2), n, 2)  # Two regressors of interest
X2 <- matrix(rnorm(n * 2), n, 2)  # Two additional regressors
beta1 <- c(2,-1)  # Coefficients for X1
beta2 <- c(1, 0.5)  # Coefficients for X2
u <- rnorm(n)  # Error term
y <- X1 %*% beta1 + X2 %*% beta2 + u  # Generate dependent variable

# Full regression
full_model <- lm(y ~ X1 + X2)
summary(full_model)
#> 
#> Call:
#> lm(formula = y ~ X1 + X2)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -2.47336 -0.58010  0.07461  0.68778  2.46552 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)  0.11614    0.10000   1.161    0.248    
#> X11          1.77575    0.10899  16.293  < 2e-16 ***
#> X12         -1.14151    0.10204 -11.187  < 2e-16 ***
#> X21          0.94954    0.10468   9.071 1.60e-14 ***
#> X22          0.47667    0.09506   5.014 2.47e-06 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.9794 on 95 degrees of freedom
#> Multiple R-squared:  0.8297, Adjusted R-squared:  0.8225 
#> F-statistic: 115.7 on 4 and 95 DF,  p-value: < 2.2e-16

# Step 1: Partial out X2 from y
y_residual <- residuals(lm(y ~ X2))

# Step 2: Partial out X2 from X1
X1_residual <- residuals(lm(X1 ~ X2))

# Step 3: Regress residuals
fwl_model <- lm(y_residual ~ X1_residual - 1)
summary(fwl_model)
#> 
#> Call:
#> lm(formula = y_residual ~ X1_residual - 1)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -2.47336 -0.58010  0.07461  0.68778  2.46552 
#> 
#> Coefficients:
#>              Estimate Std. Error t value Pr(>|t|)    
#> X1_residual1   1.7758     0.1073   16.55   <2e-16 ***
#> X1_residual2  -1.1415     0.1005  -11.36   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.9643 on 98 degrees of freedom
#> Multiple R-squared:  0.8109, Adjusted R-squared:  0.807 
#> F-statistic: 210.1 on 2 and 98 DF,  p-value: < 2.2e-16

# Comparison of coefficients
cat("Full model coefficients (X1):", coef(full_model)[2:3], "\n")
#> Full model coefficients (X1): 1.775754 -1.141514
cat("FWL model coefficients:", coef(fwl_model), "\n")
#> FWL model coefficients: 1.775754 -1.141514
5.1.2.2.2 Gauss-Markov Theorem

For a linear regression model:

\mathbf{y = X\beta + \epsilon},

under the assumptions:

  1. A1: Linearity of the model.
  2. A2: Full rank of \mathbf{X}.
  3. A3: Exogeneity of \mathbf{X}.
  4. A4: Homoskedasticity of \epsilon.

The Ordinary Least Squares estimator:

\hat{\beta} = \mathbf{(X'X)^{-1}X'y},

is the Best Linear Unbiased Estimator (BLUE). This means that \hat{\beta} has the minimum variance among all linear unbiased estimators of \beta.

1. Unbiasedness

Suppose we consider any linear estimator of \beta of the form:

\tilde{\beta} = \mathbf{C\,y},

where

  • \mathbf{y} is the n \times 1 vector of observations,
  • \mathbf{C} is a k \times n matrix (with k the dimension of \beta) that depends only on the design matrix \mathbf{X}.

Our regression model is

\mathbf{y} = \mathbf{X}\beta + \boldsymbol{\epsilon}, \quad E[\boldsymbol{\epsilon} \mid \mathbf{X}] = \mathbf{0}, \quad \mathrm{Var}(\boldsymbol{\epsilon} \mid \mathbf{X}) = \sigma^2 \mathbf{I}.

We say \tilde{\beta} is unbiased if its conditional expectation (given \mathbf{X}) equals the true parameter \beta:

E(\tilde{\beta} \mid \mathbf{X}) = E(\mathbf{C\,y} \mid \mathbf{X}) = \beta.

  1. Substitute \mathbf{y} = \mathbf{X}\beta + \boldsymbol{\epsilon}:

E(\mathbf{C\,y} \mid \mathbf{X}) = E\bigl(\mathbf{C}(\mathbf{X}\beta + \boldsymbol{\epsilon}) \mid \mathbf{X}\bigr) = \mathbf{C\,X}\,\beta + \mathbf{C}\,E(\boldsymbol{\epsilon} \mid \mathbf{X}) = \mathbf{C\,X}\,\beta.

  2. For this to hold for all \beta, we require

\mathbf{C\,X} = \mathbf{I}.

In other words, \mathbf{C} must be a “right-inverse” of \mathbf{X}.

On the other hand, the OLS estimator \hat{\beta} is given by

\hat{\beta} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\,\mathbf{y}.

You can verify:

  • Let \mathbf{C}_{\text{OLS}} = (\mathbf{X}'\mathbf{X})^{-1} \mathbf{X}'.
  • Then \mathbf{C}_{\text{OLS}}\,\mathbf{X} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\,\mathbf{X} = \mathbf{I}.
  • By the argument above, this makes \hat{\beta} unbiased.

Hence, any linear estimator \tilde{\beta} = \mathbf{C\,y} that is unbiased must satisfy \mathbf{C\,X} = \mathbf{I}.

2. Minimum Variance (Gauss–Markov Part)

Among all estimators of the form \tilde{\beta} = \mathbf{C\,y} that are unbiased (so \mathbf{C\,X} = \mathbf{I}), OLS achieves the smallest covariance matrix.

  • Variance of a General Unbiased Estimator

If \tilde{\beta} = \mathbf{C\,y} with \mathbf{C\,X} = \mathbf{I}, then:

\mathrm{Var}(\tilde{\beta} \mid \mathbf{X}) = \mathrm{Var}(\mathbf{C\,y} \mid \mathbf{X}) = \mathbf{C}\,\mathrm{Var}(\mathbf{y} \mid \mathbf{X})\,\mathbf{C}' = \mathbf{C}\bigl(\sigma^2 \mathbf{I}\bigr)\mathbf{C}' = \sigma^2\,\mathbf{C}\,\mathbf{C}'.

  • Variance of the OLS Estimator

For OLS, \hat{\beta} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\,\mathbf{y}. Thus,

\mathrm{Var}(\hat{\beta} \mid \mathbf{X}) = \sigma^2\,(\mathbf{X}'\mathbf{X})^{-1}.

  • Comparing \mathrm{Var}(\tilde{\beta}) to \mathrm{Var}(\hat{\beta})

We want to show:

\mathrm{Var}(\tilde{\beta} \mid \mathbf{X}) - \mathrm{Var}(\hat{\beta} \mid \mathbf{X}) \;\;\text{is positive semi-definite.}

  1. Since both \tilde{\beta} and \hat{\beta} are unbiased, we know: \mathbf{C\,X} = \mathbf{I}, \quad (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\,\mathbf{X} = \mathbf{I}.

  2. One can show algebraically that \mathrm{Var}(\tilde{\beta} \mid \mathbf{X}) - \mathrm{Var}(\hat{\beta} \mid \mathbf{X}) = \sigma^2 \bigl[\mathbf{C}\mathbf{C}' - (\mathbf{X}'\mathbf{X})^{-1}\bigr]. Under the condition \mathbf{C\,X} = \mathbf{I}, writing \mathbf{D} = \mathbf{C} - (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' gives \mathbf{C}\mathbf{C}' - (\mathbf{X}'\mathbf{X})^{-1} = \mathbf{D}\mathbf{D}', which is positive semi-definite.

  3. Positive semi-definite difference means

\mathbf{v}' \Bigl(\mathbf{C}\mathbf{C}' - (\mathbf{X}'\mathbf{X})^{-1}\Bigr)\mathbf{v} \ge 0 \quad \text{for all vectors } \mathbf{v}.

Hence, \hat{\beta} has the smallest variance (in the sense of covariance matrices) among all linear unbiased estimators \tilde{\beta} = \mathbf{C\,y}.

Summary of the Key Points

  1. Unbiasedness:
    A linear estimator \tilde{\beta} = \mathbf{C\,y} is unbiased if E(\tilde{\beta}\mid \mathbf{X}) = \beta.
    This forces \mathbf{C\,X} = \mathbf{I}.

  2. OLS is Unbiased:
    The OLS estimator \hat{\beta} = (X'X)^{-1} X' \, y satisfies (X'X)^{-1} X' \, X = I, hence is unbiased.

  3. OLS has Minimum Variance:
    Among all \mathbf{C} that satisfy \mathbf{C\,X} = \mathbf{I}, the matrix (\mathbf{X}'\mathbf{X})^{-1} gives the smallest possible \mathrm{Var}(\tilde{\beta}).
    In matrix form, \mathrm{Var}(\tilde{\beta}) - \mathrm{Var}(\hat{\beta}) is positive semi-definite, showing OLS is optimal (the Best Linear Unbiased Estimator, BLUE).

5.1.2.3 Finite Sample Properties

The finite sample properties of an estimator are considered when the sample size n is fixed (not asymptotically large). Key properties include bias, distribution, and standard deviation of the estimator.


Bias measures how close an estimator is, on average, to the true parameter value \beta. It is defined as:

\text{Bias} = E(\hat{\beta}) - \beta

Where:

  • \beta: True parameter value.
  • \hat{\beta}: Estimator for \beta.

Unbiased Estimator: An estimator is unbiased if:

\text{Bias} = E(\hat{\beta}) - \beta = 0 \quad \text{or equivalently} \quad E(\hat{\beta}) = \beta

This means the estimator will produce estimates that are, on average, equal to the value it is trying to estimate.


An estimator is a function of random variables (data). Its distribution describes how the estimates vary across repeated samples. Key aspects include:

  • Center: Mean of the distribution, which relates to bias.
  • Spread: Variability of the estimator, captured by its standard deviation or variance.

The standard deviation of an estimator measures the spread of its sampling distribution. It indicates the variability of the estimator across different samples.


5.1.2.3.1 Ordinary Least Squares Properties

Under the standard assumptions for OLS:

  1. A1: The relationship between Y and X is linear.
  2. A2: The matrix \mathbf{X'X} is invertible.
  3. A3: E(\epsilon|X) = 0 (errors are uncorrelated with predictors).

OLS is unbiased under these assumptions. The proof is as follows:

\begin{aligned} E(\hat{\beta}) &= E(\mathbf{(X'X)^{-1}X'y}) && \text{A2}\\ &= E(\mathbf{(X'X)^{-1}X'(X\beta + \epsilon)}) && \text{A1}\\ &= E(\mathbf{(X'X)^{-1}X'X\beta + (X'X)^{-1}X'\epsilon}) \\ &= E(\beta + \mathbf{(X'X)^{-1}X'\epsilon}) \\ &= \beta + E(\mathbf{(X'X)^{-1}X'\epsilon}) \\ &= \beta + E(E(\mathbf{(X'X)^{-1}X'\epsilon}|X)) && \text{LIE (Law of Iterated Expectation)} \\ &= \beta + E(\mathbf{(X'X)^{-1}X'}E(\epsilon|X)) \\ &= \beta + E(\mathbf{(X'X)^{-1}X'} \cdot 0) && \text{A3}\\ &= \beta \end{aligned}

Key Points:

  1. Linearity of Expectation: Used to separate terms involving \beta and \epsilon.
  2. Law of Iterated Expectation (LIE): Simplifies nested expectations.
  3. Exogeneity of Errors (A3): Ensures E(\epsilon|X) = 0, eliminating bias.

Implications of Unbiasedness

  • OLS estimators are centered around the true value \beta across repeated samples.
  • In small samples, OLS estimators may exhibit variability, but their expected value remains \beta.
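A small Monte Carlo sketch in R (simulated data; the design, true \beta, and \sigma are assumed values) illustrates both points: across repeated samples the estimates center on the true \beta, and their sampling variance is close to \sigma^2(\mathbf{X'X})^{-1}.

# Sketch (simulated data): Monte Carlo check of unbiasedness and of the
# variance formula sigma^2 (X'X)^{-1}.
set.seed(12)
n <- 50
X <- cbind(1, rnorm(n))                      # design held fixed across replications
beta <- c(1, 2)
sigma <- 1.5

est <- replicate(5000, {
  y <- X %*% beta + rnorm(n, sd = sigma)
  drop(solve(t(X) %*% X, t(X) %*% y))
})

rowMeans(est)                                # approx (1, 2): unbiasedness
var(t(est))                                  # simulated Var(beta_hat)
sigma^2 * solve(t(X) %*% X)                  # theoretical Var(beta_hat | X)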

If the assumption of exogeneity (A3) is violated, the OLS estimator becomes biased. Specifically, omitted variables or endogeneity can introduce systematic errors into the estimation.

From the Frisch-Waugh-Lovell Theorem:

  • If the omitted variables have nonzero effects (\hat{\beta}_2 \neq 0) and are correlated with the included predictors (\mathbf{X_1'X_2} \neq 0), then the OLS estimator of \beta_1 will be biased.
  • This bias arises because the omitted variable contributes to the variation in the dependent variable, but its effect is incorrectly attributed to other predictors.

5.1.2.3.2 Conditional Variance of OLS Estimator

Under assumptions A1, A2, A3, and A4, the conditional variance of the OLS estimator is:

\begin{aligned} Var(\hat{\beta}|\mathbf{X}) &= Var(\beta + \mathbf{(X'X)^{-1}X'\epsilon|X}) && \text{A1-A2} \\ &= Var((\mathbf{X'X)^{-1}X'\epsilon|X}) \\ &= \mathbf{(X'X)^{-1}X'} Var(\epsilon|\mathbf{X})\mathbf{X(X'X)^{-1}} \\ &= \mathbf{(X'X)^{-1}X'} \sigma^2 I \mathbf{X(X'X)^{-1}} && \text{A4} \\ &= \sigma^2 \mathbf{(X'X)^{-1}} \end{aligned}

This result shows that the variance of \hat{\beta} depends on:

  1. \sigma^2: The variance of the errors.
  2. \mathbf{X'X}: The information content in the design matrix \mathbf{X}.

5.1.2.3.3 Sources of Variation in OLS Estimator
  1. Unexplained Variation in the Dependent Variable: \sigma^2 = Var(\epsilon_i|\mathbf{X})
  • Large \sigma^2 indicates that the amount of unexplained variation (noise) is high relative to the explained variation (\mathbf{x_i \beta}).
  • This increases the variance of the OLS estimator.
  2. Small Variation in Predictor Variables
  • If the variance of predictors (Var(x_{i1}), Var(x_{i2}), \dots) is small, the design matrix \mathbf{X} lacks information, leading to:
    • High variability in \hat{\beta}.
    • Potential issues in estimating coefficients accurately.
  • Small sample size exacerbates this issue, as fewer observations reduce the robustness of parameter estimates.
  3. Correlation Between Explanatory Variables (Collinearity)
  • Strong correlation among explanatory variables creates problems:
    • x_{i1} being highly correlated with a linear combination of 1, x_{i2}, x_{i3}, \dots contributes to inflated standard errors for \hat{\beta}_1.
    • Including many irrelevant variables exacerbates this issue.
  • Perfect Collinearity:
    • If x_1 is perfectly determined by a linear combination of other predictors, the matrix \mathbf{X'X} becomes singular.
    • This violates A2, making OLS impossible to compute.
  • Multicollinearity:
    • If x_1 is highly correlated (but not perfectly) with a linear combination of other variables, the variance of \hat{\beta}_1 increases.
    • Multicollinearity does not violate OLS assumptions but weakens inference by inflating standard errors.

5.1.2.3.4 Standard Errors

Standard errors measure the variability of an estimator, specifically the standard deviation of \hat{\beta}. They are crucial for inference, as they quantify the uncertainty associated with parameter estimates.


The variance of the OLS estimator \hat{\beta} is:

Var(\hat{\beta}|\mathbf{X}) = \sigma^2 \mathbf{(X'X)^{-1}}

Where:

  • \sigma^2: Variance of the error terms.
  • \mathbf{(X'X)^{-1}}: Inverse of the design matrix product, capturing the geometry of the predictors.

Estimation of \sigma^2

Under assumptions A1 through A5, we can estimate \sigma^2 as:

s^2 = \frac{1}{n-k} \sum_{i=1}^{n} e_i^2 = \frac{1}{n-k} SSR

Where:

  • n: Number of observations.
  • k: Number of predictors, including the intercept.
  • e_i: Residuals from the regression model (e_i = y_i - \hat{y}_i).
  • SSR: Sum of squared residuals (\sum e_i^2).

The degrees of freedom adjustment (n-k) accounts for the fact that residuals e_i are not true errors \epsilon_i. Since the regression model uses k parameters, we lose k degrees of freedom in estimating variance.

The standard error for \sigma is:

s = \sqrt{s^2}

However, s is a biased estimator of \sigma due to Jensen’s Inequality.


The standard error of each regression coefficient \hat{\beta}_{j-1} is:

SE(\hat{\beta}_{j-1}) = s \sqrt{[(\mathbf{X'X})^{-1}]_{jj}}

Alternatively, it can be expressed in terms of SST_{j-1} and R_{j-1}^2:

SE(\hat{\beta}_{j-1}) = \frac{s}{\sqrt{SST_{j-1}(1 - R_{j-1}^2)}}

Where:

  • SST_{j-1}: Total sum of squares of x_{j-1}, i.e., \sum_{i=1}^n (x_{i,j-1} - \bar{x}_{j-1})^2.
  • R_{j-1}^2: Coefficient of determination from regressing x_{j-1} on all other predictors.

This formulation highlights the role of multicollinearity, as R_{j-1}^2 reflects the correlation between x_{j-1} and other predictors.
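The R sketch below (simulated data; values are illustrative) computes s^2 and the coefficient standard errors directly from the formulas above and compares them with the ones reported by summary(lm()).

# Sketch (simulated data): compute s^2 and the coefficient standard errors by
# hand and compare with summary(lm()).
set.seed(13)
n <- 80
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 0.5 * x1 - 1 * x2 + rnorm(n, sd = 2)

fit <- lm(y ~ x1 + x2)
X   <- model.matrix(fit)
k   <- ncol(X)

s2  <- sum(resid(fit)^2) / (n - k)           # s^2 = SSR / (n - k)
se  <- sqrt(s2 * diag(solve(t(X) %*% X)))    # s * sqrt([(X'X)^{-1}]_jj)

cbind(manual = se, lm = summary(fit)$coefficients[, "Std. Error"])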


5.1.2.3.5 Summary of Finite Sample Properties of OLS Under Different Assumptions
  1. Under A1-A3: OLS is unbiased. E(\hat{\beta}) = \beta

  2. Under A1-A4: The variance of the OLS estimator is: Var(\hat{\beta}|\mathbf{X}) = \sigma^2 \mathbf{(X'X)^{-1}}

  3. Under A1-A4, A6: The OLS estimator is normally distributed: \hat{\beta} \sim N(\beta, \sigma^2 \mathbf{(X'X)^{-1}})

  4. Under A1-A4, Gauss-Markov Theorem: OLS is BLUE (Best Linear Unbiased Estimator).

  5. Under A1-A5: s^2 is an unbiased estimator of \sigma^2, so the estimated variance s^2 \mathbf{(X'X)^{-1}} is an unbiased estimator of Var(\hat{\beta}|\mathbf{X}).


5.1.2.4 Large Sample Properties

Large sample properties provide a framework to evaluate the quality of estimators when finite sample properties are either uninformative or computationally infeasible. This perspective becomes crucial in modern data analysis, especially for methods like GLS or MLE, where assumptions for finite sample analysis may not hold.

When to Use Finite vs. Large Sample Analysis
| Aspect | Finite Sample Properties | Large Sample Properties |
|---|---|---|
| Applicability | Limited to fixed sample sizes. | Relevant as n \to \infty. |
| Exactness | Exact results (e.g., distributions, unbiasedness). | Approximate results. |
| Assumptions | May require stronger assumptions (e.g., normality, independence). | Relies on asymptotic approximations (e.g., CLT, LLN). |
| Estimator Behavior | Performance may vary significantly. | Estimators stabilize and improve in accuracy. |
| Ease of Use | Often complex due to reliance on exact distributions. | Simplifies analysis by leveraging asymptotic approximations. |
| Real-World Relevance | More realistic for small datasets. | More relevant for large datasets. |
  • Finite Sample Analysis:

    • Small sample sizes (e.g., n < 30).

    • Critical for studies where exact results are needed.

    • Useful in experimental designs and case studies.

  • Large Sample Analysis:

    • Large datasets (e.g., n > 100).

    • Necessary when asymptotic approximations improve computational simplicity.


Key Concepts:

  1. Consistency:
    • Consistency ensures that an estimator converges in probability to the true parameter value as the sample size increases.
    • Mathematically, an estimator \hat{\theta} is consistent for \theta if: \hat{\theta}_n \to^p \theta \quad \text{as } n \to \infty.
    • Consistency does not imply unbiasedness, and unbiasedness does not guarantee consistency.
  2. Asymptotic Distribution:
    • The limiting distribution describes the shape of the scaled estimator as n \to \infty.
    • Asymptotic distributions often follow normality due to the Central Limit Theorem, which underpins much of inferential statistics.
  3. Asymptotic Variance:
    • Represents the spread of the estimator with respect to its limiting distribution.
    • Smaller asymptotic variance implies greater precision in large samples.

Motivation

Finite sample properties, such as unbiasedness, rely on strong assumptions such as A1 (Linearity) and A3 (Exogeneity of Independent Variables).

When these assumptions are violated or impractical to verify, finite sample properties lose relevance. In such cases, large sample properties serve as an essential alternative for evaluating estimators.

For example, let the conditional expectation function (CEF) be: \mu(\mathbf{X}) = E(y | \mathbf{X}), which represents the minimum mean squared predictor over all possible functions f(\mathbf{X}): \min_f E((y - f(\mathbf{X}))^2).

Under the assumptions A1 and A3, the CEF simplifies to: \mu(\mathbf{X}) = \mathbf{X}\beta.

The linear projection is given by: L(y | 1, \mathbf{X}) = \gamma_0 + \mathbf{X}\gamma, where: \gamma = \text{Var}(\mathbf{X})^{-1}\text{Cov}(\mathbf{X}, y).

This linear projection minimizes the mean squared error: (\gamma_0, \gamma) = \arg\min_{(a, b)} E\left[\left(E(y|\mathbf{X}) - \left(a + \mathbf{X}b\right)\right)^2\right].

Implications for OLS

  • Consistency: OLS is always consistent for the linear projection, ensuring convergence to the true parameter value as n \to \infty.
  • Causal Interpretation: The linear projection has no inherent causal interpretation—it approximates the conditional mean function.
  • Assumption Independence: Unlike the CEF, the linear projection does not depend on assumptions A1 and A3.

Evaluating Estimators via Large Sample Properties

  1. Consistency:
    • Measures the estimator’s centrality to the true value.
    • A consistent estimator ensures that with larger samples, estimates become arbitrarily close to the population parameter.
  2. Limiting Distribution:
    • Helps infer the sampling behavior of the estimator as n grows.
    • Often approximated by a normal distribution for practical use in hypothesis testing and confidence interval construction.
  3. Asymptotic Variance:
    • Quantifies the dispersion of the estimator around its limiting distribution.
    • Smaller variance is desirable for greater reliability.

An estimator \hat{\theta} is consistent for a parameter \theta if, as the sample size n increases, \hat{\theta} converges in probability to \theta:

\hat{\theta}_n \to^p \theta \quad \text{as } n \to \infty.

  1. Convergence in Probability:

    • The probability that \hat{\theta} deviates from \theta by more than a small margin (no matter how small) approaches zero as n increases.

    Formally: \forall \epsilon > 0, \quad P(|\hat{\theta}_n - \theta| > \epsilon) \to 0 \quad \text{as } n \to \infty.

  2. Interpretation: Consistency ensures that the estimator becomes arbitrarily close to the true population parameter \theta as the sample size grows.

  3. Asymptotic Behavior: Large sample properties rely on consistency to provide valid approximations of an estimator’s behavior in finite samples.


Relationship Between Consistency and Unbiasedness

  • Unbiasedness:
    • An estimator \hat{\theta} is unbiased if its expected value equals the true parameter: E(\hat{\theta}) = \theta.
    • Unbiasedness is a finite-sample property and does not depend on the sample size.
  • Consistency:
    • Consistency is a large-sample property and requires \hat{\theta} to converge to \theta as n \to \infty.

Important Distinctions

  1. Unbiasedness Does Not Imply Consistency:
    • Example: An unbiased estimator whose variance does not diminish as n increases (e.g., estimating a population mean using only the first observation, \hat{\theta}_n = Y_1) does not converge to \theta in probability.
  2. Consistency Does Not Imply Unbiasedness:
    • Example: For a sample mean \bar{Y}_n estimating \theta, the estimator \hat{\theta}_n = \frac{n-1}{n}\bar{Y}_n has E(\hat{\theta}_n) = \frac{n-1}{n}\theta, so it is biased for every finite n, yet \hat{\theta}_n \to^p \theta, making it consistent. The simulation below illustrates both cases.
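
A small simulation (an illustrative setup with \theta = 5) compares the first-observation estimator, which is unbiased but never converges, with the scaled mean \frac{n-1}{n}\bar{Y}_n, which is biased but consistent:

# Unbiasedness vs. consistency in simulation
set.seed(123)
theta <- 5
for (n in c(10, 100, 1000)) {
  est_first  <- replicate(2000, rnorm(n, mean = theta)[1])            # unbiased, not consistent
  est_scaled <- replicate(2000, (n - 1) / n * mean(rnorm(n, theta)))  # biased, consistent
  cat("n =", n,
      "| first obs: mean =", round(mean(est_first), 2),
      "sd =", round(sd(est_first), 2),
      "| scaled mean: mean =", round(mean(est_scaled), 3),
      "sd =", round(sd(est_scaled), 3), "\n")
}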

From the OLS formula: \hat{\beta} = \mathbf{(X'X)^{-1}X'y}, we can expand as: \hat{\beta} = \mathbf{(\sum_{i=1}^{n}x_i'x_i)^{-1} \sum_{i=1}^{n}x_i'y_i}, or equivalently: \hat{\beta} = (n^{-1}\mathbf{\sum_{i=1}^{n}x_i'x_i)^{-1}} n^{-1}\mathbf{\sum_{i=1}^{n}x_i'y_i}.

Taking the probability limit under assumptions A2 and A5, we apply the Weak Law of Large Numbers (for a random sample, averages converge to expectations as n \to \infty): plim(\hat{\beta}) = plim((n^{-1}\mathbf{\sum_{i=1}^{n}x_i'x_i)^{-1}})plim(n^{-1}\mathbf{\sum_{i=1}^{n}x_i'y_i}), which simplifies to: \begin{aligned} plim(\hat{\beta}) &= (E(\mathbf{x_i'x_i}))^{-1}E(\mathbf{x_i'y_i}) & \text{(WLLN, A2)}\\ &= (E(\mathbf{x_i'x_i}))^{-1} \bigl(E(\mathbf{x_i'x_i})\,\beta + E(\mathbf{x_i'\epsilon_i})\bigr) & \text{(A1)} \\ &= (E(\mathbf{x_i' x_i}))^{-1}E(\mathbf{x_i' x_i})\, \beta & \text{(A3a)}\\ &= \beta \end{aligned}

Proof:

For a model of y_i = x_i \beta + \epsilon_i, where \beta is the true parameter vector, \epsilon_i is the random error:

Expanding E(x_i'y_i):

E(x_i'y_i) = E(x_i'(x_i\beta + \epsilon_i))

By the linearity of expectation:

E(x_i'y_i) = E(x_i'x_i)\beta + E(x_i'\epsilon_i)

The second term E(x_i'\epsilon_i) = 0 under assumption A3a.

Thus, E(x_i'y_i) = E(x_i'x_i)\beta.


Consistency of OLS

Hence, in short, under assumptions A1, A2, A3a, and A5, the term E(\mathbf{x_i'\epsilon_i}) = 0, and the OLS estimator is consistent: plim(\hat{\beta}) = \beta.

However, OLS consistency does not guarantee unbiasedness in small samples.
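
A short Monte Carlo illustrates consistency in practice: as n grows, the OLS slope estimates concentrate around the true value. The data-generating process below (true slope of 2) is purely illustrative:

# Consistency of OLS: the slope estimate concentrates around the true value as n grows
set.seed(123)
beta_true <- 2
for (n in c(50, 500, 5000)) {
  slopes <- replicate(500, {
    x <- rnorm(n)
    y <- 1 + beta_true * x + rnorm(n)
    coef(lm(y ~ x))[2]
  })
  cat("n =", n,
      "| mean =", round(mean(slopes), 3),
      "| sd =", round(sd(slopes), 3), "\n")
}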


5.1.2.4.1 Asymptotic Distribution of OLS

Under the same assumptions (A1, A2, A3a, and A5), and if \mathbf{x_i'x_i} has finite first and second moments (required for the Central Limit Theorem), we have:

  1. Convergence of n^{-1}\sum_{i=1}^n \mathbf{x_i'x_i}: n^{-1}\sum_{i=1}^n \mathbf{x_i'x_i} \to^p E(\mathbf{x_i'x_i}).

  2. Asymptotic normality of \sqrt{n}(n^{-1}\sum_{i=1}^n \mathbf{x_i'\epsilon_i}): \sqrt{n}(n^{-1}\sum_{i=1}^n \mathbf{x_i'\epsilon_i}) \to^d N(0, \mathbf{B}), where \mathbf{B} = Var(\mathbf{x_i'\epsilon_i}).

From these results, the scaled difference between \hat{\beta} and \beta follows: \sqrt{n}(\hat{\beta} - \beta) = (n^{-1}\sum_{i=1}^n \mathbf{x_i'x_i})^{-1} \sqrt{n}(n^{-1}\sum_{i=1}^n \mathbf{x_i'\epsilon_i}).

By the Central Limit Theorem: \sqrt{n}(\hat{\beta} - \beta) \to^d N(0, \Sigma), where: \Sigma = (E(\mathbf{x_i'x_i}))^{-1} \mathbf{B} (E(\mathbf{x_i'x_i}))^{-1}.

This sandwich form of \Sigma is standard.

Implications for Homoskedasticity (A4) vs. Heteroskedasticity

  1. Homoskedasticity (A4) Is Not Needed:

    • The CLT and the large-sample distribution of \hat{\beta} do not require homoskedasticity. The only place homoskedasticity would simplify things is the variance term: \mathbf{B} = Var(\mathbf{x_i'\epsilon_i}) = \sigma^2 E(\mathbf{x_i'x_i}),

    which holds only if Var(\epsilon_i | \mathbf{x}_i) = \sigma^2.

    Then \Sigma = \sigma^2 (E(\mathbf{x_i'x_i}))^{-1}.

  2. Adjusting for Heteroskedasticity:

    • In practice, \sigma_i^2 can vary with \mathbf{x}_i, leading to heteroskedasticity.
    • The standard OLS formula for Var(\hat{\beta}) is then inconsistent, so one uses robust (White) standard errors, as sketched after this list.
    • Heteroskedasticity can arise from (but is not limited to):
      • Limited dependent variables.
      • Dependent variables with large or skewed ranges.
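
In R, one common route to heteroskedasticity-robust (White) standard errors is the sandwich and lmtest packages. The sketch below, on an illustrative model, contrasts classical and HC1 robust standard errors:

# Heteroskedasticity-robust (White) standard errors
library(sandwich)
library(lmtest)

model <- lm(mpg ~ hp + wt, data = mtcars)

coeftest(model)                                      # classical OLS standard errors
coeftest(model, vcov = vcovHC(model, type = "HC1"))  # robust (White) standard errors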

5.1.2.4.2 Derivation of Asymptotic Variance

The asymptotic variance of the OLS estimator is derived as follows:

\Sigma = (E(\mathbf{x_i'x_i}))^{-1}\mathbf{B}(E(\mathbf{x_i'x_i}))^{-1}.

  1. Substituting \mathbf{B} = Var(\mathbf{x_i'}\epsilon_i): \Sigma = (E(\mathbf{x_i'x_i}))^{-1}Var(\mathbf{x_i'}\epsilon_i)(E(\mathbf{x_i'x_i}))^{-1}.

  2. Using the definition of variance: \Sigma = (E(\mathbf{x_i'x_i}))^{-1}E[(\mathbf{x_i'}\epsilon_i - 0)(\mathbf{x_i'}\epsilon_i - 0)'](E(\mathbf{x_i'x_i}))^{-1}.

  3. By the Law of Iterated Expectations and A3a (weaker exogeneity), we have: E[(\mathbf{x_i'}\epsilon_i)(\mathbf{x_i'}\epsilon_i)'] = E[E(\epsilon_i^2|\mathbf{x_i})\mathbf{x_i'x_i}].

  4. Assuming homoskedasticity (A4), E(\epsilon_i^2|\mathbf{x_i}) = \sigma^2, so: \Sigma = (E(\mathbf{x_i'x_i}))^{-1}\sigma^2E(\mathbf{x_i'x_i})(E(\mathbf{x_i'x_i}))^{-1}.

  5. Simplifying: \Sigma = \sigma^2(E(\mathbf{x_i'x_i}))^{-1}.

Hence, under assumptions A1, A2, A3a, A4, and A5, we have \sqrt{n}(\hat{\beta} - \beta) \to^d N(0, \sigma^2(E(\mathbf{x_i'x_i}))^{-1}).

The asymptotic variance provides an approximation of the scaled estimator’s variance for large n. This leads to:

Avar(\sqrt{n}(\hat{\beta} - \beta)) = \sigma^2(E(\mathbf{x_i'x_i}))^{-1}.

The finite sample variance of an estimator can be approximated using the asymptotic variance for large sample sizes:

\begin{aligned} Avar(\sqrt{n}(\hat{\beta}-\beta)) &\approx Var(\sqrt{n}(\hat{\beta}-\beta)) \\ Avar(\sqrt{n}(\hat{\beta}-\beta))/n &\approx Var(\sqrt{n}(\hat{\beta}-\beta))/n = Var(\hat{\beta}) \end{aligned}

However, it is critical to note that asymptotic variance (Avar(.)) does not behave in the same manner as finite sample variance (Var(.)). This distinction is evident in the following expressions:

\begin{aligned} Avar(\sqrt{n}(\hat{\beta}-\beta))/n &\neq Avar(\sqrt{n}(\hat{\beta}-\beta)/\sqrt{n}) \\ &\neq Avar(\hat{\beta}) \end{aligned}

Thus, while Avar(.) provides a useful approximation for large samples, its conceptual properties differ from finite sample variance.


In Finite Sample Properties, the standard errors are calculated as estimates of the conditional standard deviation:

SE_{fs}(\hat{\beta}_{j-1}) = \sqrt{\hat{Var}(\hat{\beta}_{j-1}|\mathbf{X})} = \sqrt{s^2 \cdot [\mathbf{(X'X)}^{-1}]_{jj}},

where:

  • s^2 is the estimator of the error variance,

  • [\mathbf{(X'X)}^{-1}]_{jj} represents the jth diagonal element of the inverse design matrix.

In contrast, in Large Sample Properties, the standard errors are calculated as estimates of the square root of the asymptotic variance:

SE_{ls}(\hat{\beta}_{j-1}) = \sqrt{\hat{Avar}(\sqrt{n} \hat{\beta}_{j-1})/n} = \sqrt{s^2 \cdot [\mathbf{(X'X)}^{-1}]_{jj}}.

Interestingly, the standard error estimator is identical for both finite and large samples:

  • The expressions for SE_{fs} and SE_{ls} are mathematically the same.
  • However, they are conceptually estimating two different quantities:
    • Finite Sample Standard Error: An estimate of the conditional standard deviation of \hat{\beta}_{j-1} given \mathbf{X}.
    • Large Sample Standard Error: An estimate of the square root of the asymptotic variance of \hat{\beta}_{j-1}.

The assumptions required for these estimators to be valid differ in their stringency:

  1. Finite Sample Variance (Conditional Variance):
    • Requires stronger assumptions (A1-A5).
  2. Asymptotic Variance:
    • Valid under weaker assumptions (A1, A2, A3a, A4, A5).

This distinction highlights the utility of asymptotic properties in providing robust approximations when finite sample assumptions may not hold.
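
Because the two formulas coincide, the standard errors reported by summary() can be reproduced directly from s^2 and the diagonal of (\mathbf{X'X})^{-1}; a minimal check on an illustrative model:

# Reproduce the reported standard errors from s^2 * [(X'X)^{-1}]_jj
model <- lm(mpg ~ hp + wt, data = mtcars)

X  <- model.matrix(model)
e  <- resid(model)
s2 <- sum(e^2) / (nrow(X) - ncol(X))   # s^2 = SSE / (n - p)

se_manual <- sqrt(s2 * diag(solve(t(X) %*% X)))
cbind(manual = se_manual,
      summary = summary(model)$coefficients[, "Std. Error"])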

5.1.2.5 Diagnostics

5.1.2.5.1 Normality of Errors

Normality of the errors is a key assumption for exact inference in many regression models, and deviations from it can affect inference and model interpretation. For formal diagnostics of normality, see Normality Assessment.

Plots are invaluable for visual inspection of normality. One common approach is the Q-Q plot, which compares the quantiles of the residuals against those of a standard normal distribution:

# Example Q-Q plot of regression residuals
set.seed(123) # For reproducibility
x <- rnorm(100)             # Predictor
y <- 2 + 3 * x + rnorm(100) # Outcome with normally distributed errors
fit <- lm(y ~ x)            # Fit a simple regression
qqnorm(resid(fit),
       main = "Q-Q Plot of Residuals",
       xlab = "Theoretical Quantiles",
       ylab = "Sample Quantiles")
qqline(resid(fit), col = "red") # Reference line under exact normality

5.1.2.5.2 Influential Observations and Outliers

Identifying influential observations or outliers is essential for robust regression modeling. The hat matrix (\mathbf{H}) plays a key role in diagnosing influence.

5.1.2.5.2.1 Hat Matrix: Outliers in X-Space

The hat matrix is primarily concerned with leverage, which reflects how far an observation’s predictor values (X-space) are from the centroid of the predictor space.

  • What it measures: The diagonal elements of the hat matrix quantify leverage, not residual size or model fit. High leverage suggests that an observation has an unusual predictor configuration and might disproportionately influence the regression line, irrespective of the response variable.

  • What it doesn’t measure: It doesn’t directly account for outliers in Y-space or residuals.

The hat matrix, defined as:

\mathbf{H} = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'

has the following properties:

  • Fitted Values: \hat{\mathbf{Y}} = \mathbf{H}\mathbf{Y}.

  • Residuals: \mathbf{e} = (\mathbf{I} - \mathbf{H})\mathbf{Y}.

  • Variance of Residuals: \text{var}(\mathbf{e}) = \sigma^2 (\mathbf{I} - \mathbf{H}).

Diagonal Elements of the Hat Matrix (h_{ii})

  • h_{ii} is the i-th element on the main diagonal of \mathbf{H}. It must satisfy 0 \leq h_{ii} \leq 1.

  • \sum_{i=1}^{n} h_{ii} = p, where p is the number of parameters (including the intercept).

  • The variance of residuals for observation i is given by \sigma^2(e_i) = \sigma^2 (1 - h_{ii}).

  • The covariance between residuals for observations i and j (i \neq j) is -h_{ij}\sigma^2.

For large datasets, the off-diagonal elements h_{ij} tend to be small, so the covariances between residuals are negligible when the model assumptions hold.

Estimations Using MSE

  • Variance of residuals: s^2(e_i) = MSE (1 - h_{ii}), where MSE is the mean squared error.

  • Covariance of residuals: \hat{\text{cov}}(e_i, e_j) = -h_{ij}(MSE).

Interpretation of h_{ii}

If \mathbf{x}_i = [1, X_{i,1}, \ldots, X_{i,p-1}]' represents the vector of predictor values for observation i, then:

h_{ii} = \mathbf{x}_i' (\mathbf{X}'\mathbf{X})^{-1} \mathbf{x}_i

The value of h_{ii} depends on the relative positions of the design points X_{i,1}, \ldots, X_{i,p-1}. Observations with high h_{ii} are more influential and warrant closer inspection for leverage or outlier behavior.
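
Leverage values can be computed directly from this formula or extracted with hatvalues(); a common rule of thumb (a convention, not part of the derivation) flags observations with h_{ii} > 2p/n. A minimal sketch on an illustrative model:

# Leverage (hat) values and a 2p/n rule of thumb
model <- lm(mpg ~ hp + wt, data = mtcars)

X <- model.matrix(model)
H <- X %*% solve(t(X) %*% X) %*% t(X)   # hat matrix
h <- diag(H)                            # same as hatvalues(model)

p <- ncol(X)
n <- nrow(X)
sum(h)                                  # equals p, the number of parameters
which(h > 2 * p / n)                    # potentially high-leverage observations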

5.1.2.5.2.2 Studentized Residuals: Outliers in Y-Space

Residuals focus on discrepancies between observed (Y) and predicted (\hat{Y}) values, helping to identify outliers in Y-space.

  • What they measure:

    • Standardized or studentized residuals assess how far an observation’s response is from the regression line, adjusted for variance.

    • Externally studentized residuals are more robust because they exclude the i-th observation when estimating variance.

    • Large studentized residuals (e.g., >2 or >3) indicate observations that are unusual in Y-space.

  • What they don’t measure: They do not consider leverage or the X-space configuration. A point with a large residual could have low leverage, making it less influential overall.

Studentized residuals, also known as standardized residuals, adjust for the variance of residuals by dividing the residuals by their standard error:

\begin{aligned} r_i &= \frac{e_i}{s(e_i)}, \end{aligned}

where s(e_i) = \sqrt{MSE(1-h_{ii})}; r_i is approximately distributed N(0,1) and accounts for the varying variances of residuals. These residuals allow for a better comparison of model fit across observations.

  • Semi-Studentized Residuals: In contrast, the semi-studentized residuals are defined as:

e_i^* = \frac{e_i}{\sqrt{MSE}}

This approach does not adjust for the heterogeneity in variances of residuals, as e_i^* assumes equal variance for all residuals.


To assess the influence of individual observations, we consider the model without a particular value. When the i-th observation is removed, the deleted residual is defined as:

\begin{aligned} d_i &= Y_i - \hat{Y}_{i(i)} \\ &= \frac{e_i}{1-h_{ii}}, \end{aligned}

where Y_i is the actual observation, and \hat{Y}_{i(i)} is the predicted value for the i-th observation, computed using the regression model fitted to the remaining n-1 observations. Importantly, we do not need to refit the regression model for each observation to compute d_i.

  • As h_{ii} (leverage) increases, d_i also increases, indicating higher influence of the observation.

The variance of the deleted residual is given by:

s^2(d_i) = \frac{MSE_{(i)}}{1-h_{ii}},

where MSE_{(i)} is the mean squared error when the i-th case is omitted.


The studentized deleted residual accounts for variability and follows a t-distribution with n-p-1 degrees of freedom:

t_i = \frac{d_i}{s(d_i)} = \frac{e_i}{\sqrt{MSE_{(i)}(1-h_{ii})}},

where t_i helps identify outliers more effectively.

We can compute t_i without fitting the regression model multiple times. Using the relationship:

(n-p)MSE = (n-p-1)MSE_{(i)} + \frac{e_i^2}{1-h_{ii}},

we derive:

t_i = e_i \sqrt{\frac{n-p-1}{SSE(1-h_{ii}) - e_i^2}}.

This formulation avoids the need for recalculating MSE_{(i)} explicitly for each case.


Outlying Y-observations are those with large studentized deleted residuals in absolute value. To handle multiple testing when there are many residuals, we use a Bonferroni-adjusted critical value:

t_{1-\alpha/2n; n-p-1},

where \alpha is the desired significance level, n is the sample size, and p is the number of parameters in the model. Observations exceeding this threshold are flagged as potential outliers.

# Example R Code for Demonstrating Residual Diagnostics
set.seed(123) # For reproducibility

# Simulate some data
n <- 100
x <- rnorm(n, mean = 10, sd = 3)
y <- 5 + 2 * x + rnorm(n, mean = 0, sd = 2)

# Fit a linear regression model
model <- lm(y ~ x)


# Extract residuals, fitted values, and hat values
residuals <- resid(model)
hat_values <- lm.influence(model)$hat
mse <- mean(residuals ^ 2) # SSE / n; close to the text's MSE = SSE / (n - p) when n is large
n <- length(residuals)
p <- length(coefficients(model)) # Number of parameters

# Compute studentized residuals
studentized_residuals <- residuals / sqrt(mse * (1 - hat_values))

# Compute deleted residuals
deleted_residuals <- residuals / (1 - hat_values)

# Compute studentized deleted residuals
studentized_deleted_residuals <-
    residuals * sqrt((n - p - 1)
                     / (sum(residuals ^ 2) * (1 - hat_values) - residuals ^ 2))

# Flag potential outliers using Bonferroni-adjusted critical value
alpha <- 0.05
bonferroni_threshold <- qt(1 - alpha / (2 * n), df = n - p - 1)
outliers <-
    abs(studentized_deleted_residuals) > bonferroni_threshold

# Print results
results <- data.frame(
    Residuals                     = residuals,
    Hat_Values                    = hat_values,
    Studentized_Residuals         = studentized_residuals,
    Deleted_Residuals             = deleted_residuals,
    Studentized_Deleted_Residuals = studentized_deleted_residuals,
    Outlier                       = outliers
)

causalverse::nice_tab(head(results))
#>   Residuals Hat_Values Studentized_Residuals Deleted_Residuals
#> 1     -1.27       0.02                 -0.67             -1.29
#> 2      0.70       0.01                  0.36              0.70
#> 3     -0.12       0.04                 -0.07             -0.13
#> 4     -0.48       0.01                 -0.25             -0.49
#> 5     -1.68       0.01                 -0.88             -1.70
#> 6      0.30       0.04                  0.16              0.31
#>   Studentized_Deleted_Residuals Outlier
#> 1                         -0.66   FALSE
#> 2                          0.36   FALSE
#> 3                         -0.06   FALSE
#> 4                         -0.25   FALSE
#> 5                         -0.87   FALSE
#> 6                          0.15   FALSE

# Plot studentized deleted residuals for visualization
plot(
    studentized_deleted_residuals,
    main = "Studentized Deleted Residuals",
    xlab = "Observation",
    ylab = "Studentized Deleted Residuals",
    pch = 16,
    col = ifelse(outliers, "red", "black")
)
abline(
    h = c(-bonferroni_threshold, bonferroni_threshold),
    col = "blue",
    lty = 2
)
legend(
    "topright",
    legend = c("Potential Outliers", "Threshold"),
    col = c("red", "blue"),
    pch = c(16, NA),
    lty = c(NA, 2)
)

5.1.2.5.3 Identifying Influential Cases

By influential, we refer to observations whose exclusion causes major changes in the fitted regression model. Note that not all outliers are influential.

Types of Influence Measures

  • Influence on Single Fitted Values: DFFITS
  • Influence on All Fitted Values: Cook’s D
  • Influence on the Regression Coefficients: DFBETAS

Measures like Cook’s D, DFFITS, and DFBETAS combine leverage (from the hat matrix) and residual size (from studentized residuals) to assess the influence of an observation on the model as a whole. Hence, these measures effectively combine the impact of the X-space and the Y-space.

5.1.2.5.3.1 DFFITS

DFFITS measures the influence on single fitted values. It is defined as:

\begin{aligned} (DFFITS)_i &= \frac{\hat{Y}_i - \hat{Y}_{i(i)}}{\sqrt{MSE_{(i)}h_{ii}}} \\ &= t_i \left(\frac{h_{ii}}{1-h_{ii}}\right)^{1/2} \end{aligned}

Where:

  • \hat{Y}_i = fitted value for observation i using all data.
  • \hat{Y}_{i(i)} = fitted value for observation i with the ith case removed.
  • MSE_{(i)} = mean squared error with observation i excluded.
  • h_{ii} = leverage of the ith observation.
  • t_i = studentized deleted residual.

High DFFITS values occur when an observation has both high leverage and a large studentized deleted residual.

DFFITS captures the standardized difference between the fitted value for observation i with and without the ith case in the model. It is a product of:

  1. The studentized deleted residual.
  2. A scaling factor based on the leverage of the ith observation, h_{ii}.

An observation is considered influential based on DFFITS if:

  • Small to medium data sets: |DFFITS| > 1
  • Large data sets: |DFFITS| > 2 \sqrt{p/n}

Where:

  • p = number of predictors (including the intercept).
  • n = total number of observations.

This provides a practical threshold for detecting influential observations in different dataset sizes.

# Load necessary package
library(car)

# Fit a linear model
model <- lm(mpg ~ hp + wt, data = mtcars)

# Compute DFFITS
dffits_values <- dffits(model)

# Display influential observations based on the threshold for a large dataset
threshold <- 2 * sqrt(length(coefficients(model)) / nrow(mtcars))
influential_obs <- which(abs(dffits_values) > threshold)

# Results
list(
    DFFITS                   = dffits_values,
    Threshold                = threshold,
    Influential_Observations = influential_obs
)
#> $DFFITS
#>           Mazda RX4       Mazda RX4 Wag          Datsun 710      Hornet 4 Drive 
#>        -0.218494101        -0.126664789        -0.249103400         0.011699160 
#>   Hornet Sportabout             Valiant          Duster 360           Merc 240D 
#>         0.028162679        -0.253806124        -0.191618944         0.221917842 
#>            Merc 230            Merc 280           Merc 280C          Merc 450SE 
#>         0.079763706        -0.067222732        -0.190099538         0.064280875 
#>          Merc 450SL         Merc 450SLC  Cadillac Fleetwood Lincoln Continental 
#>         0.020560728        -0.135714533         0.008984366         0.227919348 
#>   Chrysler Imperial            Fiat 128         Honda Civic      Toyota Corolla 
#>         1.231668760         0.749153703         0.165329646         0.865985851 
#>       Toyota Corona    Dodge Challenger         AMC Javelin          Camaro Z28 
#>        -0.292008465        -0.253389811        -0.294709853        -0.170476763 
#>    Pontiac Firebird           Fiat X1-9       Porsche 914-2        Lotus Europa 
#>         0.207813200        -0.041423665        -0.004054382         0.471518032 
#>      Ford Pantera L        Ferrari Dino       Maserati Bora          Volvo 142E 
#>        -0.161026362        -0.129395315         0.907521354        -0.128232538 
#> 
#> $Threshold
#> [1] 0.6123724
#> 
#> $Influential_Observations
#> Chrysler Imperial          Fiat 128    Toyota Corolla     Maserati Bora 
#>                17                18                20                31
5.1.2.5.3.2 Cook’s D

Cook’s D measures the influence of the ith case on all fitted values in a regression model. It is defined as:

\begin{aligned} D_i &= \frac{\sum_{j=1}^{n}(\hat{Y}_j - \hat{Y}_{j(i)})^2}{p(MSE)} \\ &= \frac{e^2_i}{p(MSE)}\left(\frac{h_{ii}}{(1-h_{ii})^2}\right) \end{aligned}

Where:

  • \hat{Y}_j = fitted value for observation j using all data.
  • \hat{Y}_{j(i)} = fitted value for observation j with the ith case removed.
  • e_i = residual for observation i.
  • h_{ii} = leverage of the ith observation.
  • p = number of predictors (including the intercept).
  • MSE = mean squared error of the model.

Key Insights

  • Cook’s D quantifies the overall influence of the ith observation on the entire set of fitted values.
  • If either the residual e_i increases or the leverage h_{ii} increases, then D_i also increases, indicating higher influence.
  • Observations with both high leverage and large residuals are flagged as influential.

Threshold for Influence

  • D_i can be interpreted as a percentile of an F_{(p,n-p)} distribution.
  • Practical thresholds:
    • If D_i > 4/n, the ith case has major influence.
    • Alternatively, cases where D_i exceeds the 50th percentile of the F-distribution may also be considered influential.
# Load necessary package
library(car)

# Fit a linear model
model <- lm(mpg ~ hp + wt, data = mtcars)

# Compute Cook's D
cooks_d_values <- cooks.distance(model)

# Display influential observations based on the threshold
threshold <- 4 / nrow(mtcars)
influential_obs <- which(cooks_d_values > threshold)

# Results
list(
  Cooks_D                  = cooks_d_values,
  Threshold                = threshold,
  Influential_Observations = influential_obs
)
#> $Cooks_D
#>           Mazda RX4       Mazda RX4 Wag          Datsun 710      Hornet 4 Drive 
#>        1.589652e-02        5.464779e-03        2.070651e-02        4.724822e-05 
#>   Hornet Sportabout             Valiant          Duster 360           Merc 240D 
#>        2.736184e-04        2.155064e-02        1.255218e-02        1.677650e-02 
#>            Merc 230            Merc 280           Merc 280C          Merc 450SE 
#>        2.188702e-03        1.554996e-03        1.215737e-02        1.423008e-03 
#>          Merc 450SL         Merc 450SLC  Cadillac Fleetwood Lincoln Continental 
#>        1.458960e-04        6.266049e-03        2.786686e-05        1.780910e-02 
#>   Chrysler Imperial            Fiat 128         Honda Civic      Toyota Corolla 
#>        4.236109e-01        1.574263e-01        9.371446e-03        2.083933e-01 
#>       Toyota Corona    Dodge Challenger         AMC Javelin          Camaro Z28 
#>        2.791982e-02        2.087419e-02        2.751510e-02        9.943527e-03 
#>    Pontiac Firebird           Fiat X1-9       Porsche 914-2        Lotus Europa 
#>        1.443199e-02        5.920440e-04        5.674986e-06        7.353985e-02 
#>      Ford Pantera L        Ferrari Dino       Maserati Bora          Volvo 142E 
#>        8.919701e-03        5.732672e-03        2.720397e-01        5.600804e-03 
#> 
#> $Threshold
#> [1] 0.125
#> 
#> $Influential_Observations
#> Chrysler Imperial          Fiat 128    Toyota Corolla     Maserati Bora 
#>                17                18                20                31
5.1.2.5.3.3 DFBETAS

DFBETAS measures the influence of the ith observation on each regression coefficient in a regression model. It is defined as:

(DFBETAS)_{k(i)} = \frac{b_k - b_{k(i)}}{\sqrt{MSE_{(i)}c_{kk}}}

Where:

  • b_k = regression coefficient for the kth predictor using all observations.
  • b_{k(i)} = regression coefficient for the kth predictor with the ith observation omitted.
  • MSE_{(i)} = mean squared error with the ith observation excluded.
  • c_{kk} = kth diagonal element of (\mathbf{X'X})^{-1}, so that \sigma^2 c_{kk} is the variance of b_k.

Key Insights

  • DFBETAS quantifies the change in each regression coefficient (b_k) caused by omitting the ith observation.
  • The sign of DFBETAS indicates whether the inclusion of an observation increases or decreases the regression coefficient.
    • Positive DFBETAS: Inclusion increases b_k.
    • Negative DFBETAS: Inclusion decreases b_k.
  • High DFBETAS indicate that a single observation disproportionately affects one or more predictors.

The thresholds for identifying influential observations based on DFBETAS are:

  • Small data sets: |DFBETAS| > 1
  • Large data sets: |DFBETAS| > 2 / \sqrt{n}

Where n is the total number of observations.

# Load necessary package
library(car)

# Fit a linear model
model <- lm(mpg ~ hp + wt, data = mtcars)

# Compute DFBETAS
dfbetas_values <- dfbetas(model)

# Display influential observations based on the threshold for each predictor
threshold <- 2 / sqrt(nrow(mtcars))
influential_obs <-
    apply(dfbetas_values, 2, function(x)
        which(abs(x) > threshold))

# Results
list(
    DFBETAS                  = dfbetas_values,
    Threshold                = threshold,
    Influential_Observations = influential_obs
)
#> $DFBETAS
#>                      (Intercept)           hp            wt
#> Mazda RX4           -0.161347204  0.032966471  0.0639304305
#> Mazda RX4 Wag       -0.069324050  0.045785122 -0.0004066495
#> Datsun 710          -0.211199646  0.043374926  0.0972314374
#> Hornet 4 Drive       0.002672687 -0.006839301  0.0044886906
#> Hornet Sportabout    0.001784844  0.009208434 -0.0015536931
#> Valiant             -0.005985946  0.180374447 -0.1516565139
#> Duster 360           0.004705177 -0.159988770  0.0781031774
#> Merc 240D            0.034255292 -0.189552940  0.1224118752
#> Merc 230             0.019788247 -0.055075623  0.0332570461
#> Merc 280            -0.003198686  0.036709039 -0.0337297820
#> Merc 280C           -0.009045583  0.103809696 -0.0953846390
#> Merc 450SE          -0.026973686 -0.005712458  0.0356973740
#> Merc 450SL          -0.003961562  0.003399822  0.0049302300
#> Merc 450SLC          0.031572445 -0.016800308 -0.0400515832
#> Cadillac Fleetwood  -0.006420656 -0.002577897  0.0075499557
#> Lincoln Continental -0.168791258 -0.058242601  0.1903129995
#> Chrysler Imperial   -0.924056752 -0.148009806  0.9355996760
#> Fiat 128             0.605181396 -0.311246566 -0.1672758566
#> Honda Civic          0.156388333 -0.034026915 -0.0819144214
#> Toyota Corolla       0.804669969 -0.170934240 -0.4114605894
#> Toyota Corona       -0.231328587  0.066064464  0.0882138248
#> Dodge Challenger     0.003923967  0.049775308 -0.0888481611
#> AMC Javelin         -0.019610048  0.037837437 -0.0734203131
#> Camaro Z28           0.029920076 -0.128670440  0.0390740055
#> Pontiac Firebird    -0.058806962 -0.002278294  0.0868742949
#> Fiat X1-9           -0.037559007  0.010208853  0.0174261386
#> Porsche 914-2       -0.003655931  0.000316321  0.0020588013
#> Lotus Europa         0.423409344  0.188396749 -0.4072338373
#> Ford Pantera L      -0.022536462 -0.148176049  0.0999346699
#> Ferrari Dino        -0.065508308 -0.085182962  0.0869804902
#> Maserati Bora       -0.007482815  0.865763737 -0.4999048760
#> Volvo 142E          -0.080001907  0.038406565  0.0127537553
#> 
#> $Threshold
#> [1] 0.3535534
#> 
#> $Influential_Observations
#> $Influential_Observations$`(Intercept)`
#> Chrysler Imperial          Fiat 128    Toyota Corolla      Lotus Europa 
#>                17                18                20                28 
#> 
#> $Influential_Observations$hp
#> Maserati Bora 
#>            31 
#> 
#> $Influential_Observations$wt
#> Chrysler Imperial    Toyota Corolla      Lotus Europa     Maserati Bora 
#>                17                20                28                31
5.1.2.5.4 Collinearity

Collinearity (or multicollinearity) refers to the correlation among explanatory variables in a regression model. It can lead to various issues, including:

  • Large changes in the estimated regression coefficients when a predictor variable is added or removed, or when observations are altered.
  • Non-significant results for individual tests on regression coefficients of important predictor variables.
  • Regression coefficients with signs opposite to theoretical expectations or prior experience.
  • Large coefficients of simple correlation between pairs of predictor variables in the correlation matrix.
  • Wide confidence intervals for the regression coefficients representing important predictor variables.

When some X variables are highly correlated, the inverse (X'X)^{-1} either does not exist or is computationally unstable. This can result in:

  1. Non-interpretability of parameters: \mathbf{b = (X'X)^{-1}X'y}
  2. Infinite sampling variability: \mathbf{s^2(b) = MSE (X'X)^{-1}}

If some predictor variables (X) are “perfectly” correlated, the system becomes underdetermined, leading to infinitely many coefficient vectors that fit the data equally well. Specifically:

  • If X'X is singular, then (X'X)^{-1} does not exist.
  • This results in poor parameter estimation and invalid statistical inference.
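
These consequences are easy to demonstrate: adding a nearly redundant copy of an existing predictor inflates the standard errors dramatically. The constructed variable below is purely illustrative:

# Near-perfect collinearity inflates standard errors
set.seed(123)
dat <- mtcars
dat$wt2 <- dat$wt + rnorm(nrow(dat), sd = 0.01)   # nearly identical to wt

summary(lm(mpg ~ hp + wt, data = dat))$coefficients
summary(lm(mpg ~ hp + wt + wt2, data = dat))$coefficients   # wt and wt2 now have huge SEs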

5.1.2.5.4.1 Variance Inflation Factors (VIFs)

The Variance Inflation Factor (VIF) quantifies how much the variance of a regression coefficient is inflated due to multicollinearity. It is defined as:

VIF_k = \frac{1}{1-R^2_k}

Where:

  • R^2_k is the coefficient of multiple determination when X_k is regressed on the other p-2 predictor variables in the model.

Interpretation of VIFs

  • Large VIF_k values indicate that near collinearity is inflating the variance of b_k. The relationship is given by: var(b_k) \propto \sigma^2 (VIF_k)
  • Thresholds:
    • VIF > 4: Investigate the cause of multicollinearity.
    • VIF_k > 10: Serious multicollinearity problem that can lead to poor parameter estimates.
  • The mean VIF provides an estimate of the degree of multicollinearity:
    • If avg(VIF) >> 1, serious multicollinearity is present.
  • Multicollinearity and VIF:
    • High VIFs with indicator variables are normal and not problematic.

    • VIF is generally not useful for detecting multicollinearity concerns in models with fixed effects.

  • Overemphasis on Multicollinearity:
    • Multicollinearity inflates standard errors and widens confidence intervals but does not bias results.

    • If key variables have narrow confidence intervals, multicollinearity is not an issue.

  • Goldberger’s Insight (Goldberger 1991):
    • Multicollinearity is akin to small sample size (“micronumerosity”).

    • Large standard errors are expected with highly correlated independent variables.

  • Practical Implications:
    • Evaluate whether confidence intervals for key variables are sufficiently narrow.

    • If not, the study is inconclusive, and a larger dataset or redesigned study is needed.


# Load necessary packages
library(car)

# Fit a regression model
model <- lm(mpg ~ hp + wt + disp, data = mtcars)

# Compute Variance Inflation Factors
vif_values <- vif(model)

# Check for high multicollinearity
threshold <- 10
high_vif <- which(vif_values > threshold)

# Results
list(
  VIFs = vif_values,
  High_VIF_Threshold = threshold,
  High_VIF_Indices = high_vif
)
#> $VIFs
#>       hp       wt     disp 
#> 2.736633 4.844618 7.324517 
#> 
#> $High_VIF_Threshold
#> [1] 10
#> 
#> $High_VIF_Indices
#> named integer(0)
5.1.2.5.4.2 Condition Number

Condition Number is a diagnostic measure for detecting multicollinearity, derived from the spectral decomposition of the matrix \mathbf{X'X}.

The spectral decomposition of \mathbf{X'X} is:

\mathbf{X'X}= \sum_{i=1}^{p} \lambda_i \mathbf{u_i u_i'}

Where:

  • \lambda_i: Eigenvalue associated with the ith eigenvector.
  • \mathbf{u}_i: Eigenvector associated with \lambda_i.
  • \lambda_1 > \lambda_2 > \dots > \lambda_p (ordered eigenvalues).
  • The eigenvectors are orthogonal: \begin{cases} \mathbf{u_i'u_j} = 0 & \text{for } i \neq j \\ \mathbf{u_i'u_j} = 1 & \text{for } i = j \end{cases}

Definition of the Condition Number

The Condition Number is defined as:

k = \sqrt{\frac{\lambda_{\text{max}}}{\lambda_{\text{min}}}}

Where:

  • \lambda_{\text{max}}: Largest eigenvalue.
  • \lambda_{\text{min}}: Smallest eigenvalue.

Interpretation

  • k < 30: Little cause for concern.
  • 30 < k < 100: Moderate to strong dependencies among predictors; cause for concern.
  • k > 100: Strong collinearity, indicating a serious multicollinearity problem.

The Condition Index for the ith eigenvalue is defined as:

\delta_i = \sqrt{\frac{\lambda_{\text{max}}}{\lambda_i}}

Where i = 1, \dots, p.


Variance proportions can be used to identify collinearity issues. The proportion of total variance associated with the kth regression coefficient and the ith eigen mode is given by:

\frac{u_{ik}^2/\lambda_i}{\sum_j \left(u^2_{jk}/\lambda_j\right)}

Key Indicators:

  • A large condition index \delta_i suggests potential collinearity.
  • Variance proportions > 0.5 for at least two regression coefficients indicate serious collinearity problems.

# Load necessary package
library(car)

# Fit a regression model
model <- lm(mpg ~ hp + wt + disp, data = mtcars)

# Compute eigenvalues and eigenvectors of the correlation matrix of predictors
cor_matrix <- cor(mtcars[, c("hp", "wt", "disp")])
eigen_decomp <- eigen(cor_matrix)

# Extract eigenvalues and compute the Condition Number
eigenvalues <- eigen_decomp$values
condition_number <- sqrt(max(eigenvalues) / min(eigenvalues))

# Compute Condition Indices
condition_indices <- sqrt(max(eigenvalues) / eigenvalues)

# Results
list(
  Condition_Number = condition_number,
  Condition_Indices = condition_indices
)
#> $Condition_Number
#> [1] 5.469549
#> 
#> $Condition_Indices
#> [1] 1.000000 2.697266 5.469549
  1. Condition Number: A single value indicating the degree of multicollinearity.
  2. Condition Indices: A vector showing the relative dependency associated with each eigenvalue.
5.1.2.5.5 Constancy of Error Variance

Testing for the constancy of error variance (homoscedasticity) ensures that the assumptions of linear regression are met.

Below are two commonly used tests to assess error variance.


5.1.2.5.5.1 Brown-Forsythe Test (Modified Levene Test)

The Brown-Forsythe Test does not depend on the normality of errors and is suitable when error variance increases or decreases with X.

Procedure

  1. Split the residuals into two groups: e_{i1}, i = 1, \dots, n_1 \quad \text{and} \quad e_{j2}, j = 1, \dots, n_2

  2. Compute absolute deviations from the group medians: d_{i1} = |e_{i1} - \tilde{e}_1| \quad \text{and} \quad d_{j2} = |e_{j2} - \tilde{e}_2| where \tilde{e}_1 and \tilde{e}_2 are the medians of groups 1 and 2, respectively.

  3. Perform a two-sample t-test on d_{i1} and d_{j2}: t_L = \frac{\bar{d}_1 - \bar{d}_2}{s\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} where s^2 = \frac{\sum_i (d_{i1} - \bar{d}_1)^2 + \sum_j (d_{j2} - \bar{d}_2)^2}{n - 2}

  4. Reject the null hypothesis of constant error variance if: |t_L| > t_{1-\alpha/2; n-2}

# Load necessary package
library(car)

# Fit a linear model
model <- lm(mpg ~ hp + wt, data = mtcars)

# Perform the Brown-Forsythe Test
levene_test <- leveneTest(model$residuals ~ cut(mtcars$hp, 2))  # Split HP into 2 groups

# Results
levene_test
#> Levene's Test for Homogeneity of Variance (center = median)
#>       Df F value Pr(>F)
#> group  1  0.0851 0.7724
#>       30

The p-value determines whether to reject the null hypothesis of constant variance.
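
The same statistic can be computed by hand following the four steps above; a minimal sketch, splitting the sample at the median of hp (an illustrative grouping choice):

# Manual Brown-Forsythe test on the residuals
e     <- resid(model)
group <- mtcars$hp > median(mtcars$hp)      # two groups split at the median of hp

d1 <- abs(e[!group] - median(e[!group]))    # absolute deviations from group medians
d2 <- abs(e[group]  - median(e[group]))

t.test(d1, d2, var.equal = TRUE)            # two-sample t-test on the deviations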


5.1.2.5.5.2 Breusch-Pagan Test (Cook-Weisberg Test)

The Breusch-Pagan Test assumes independent and normally distributed errors. It tests the hypothesis:

  • H_0: \gamma_1 = 0 (Constant variance).
  • H_1: \gamma_1 \neq 0 (Non-constant variance).

Procedure

  1. Assume the variance of the error terms is modeled as: \sigma^2_i = \gamma_0 + \gamma_1 X_i

  2. Regress the squared residuals (e_i^2) on X_i:

    • Obtain the regression sum of squares (SSR^*).
  3. Compute the Breusch-Pagan statistic: X^2_{BP} = \frac{SSR^*/2}{\left(\frac{SSE}{n}\right)^2} where SSE is the error sum of squares from the regression of Y on X.

  4. Compare X^2_{BP} to the critical value from the \chi^2 distribution with 1 degree of freedom:

    • Reject H_0 (homogeneous variance) if: X^2_{BP} > \chi^2_{1-\alpha;1}
# Load necessary package
library(lmtest)

# Perform the Breusch-Pagan Test
bp_test <- bptest(model)

# Results
bp_test
#> 
#>  studentized Breusch-Pagan test
#> 
#> data:  model
#> BP = 0.88072, df = 2, p-value = 0.6438

If the p-value is below the chosen significance level (e.g., 0.05), reject H_0 and conclude non-constant variance.
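
Note that bptest() reports the studentized (Koenker) version by default; the classical statistic in the formula above can be computed by hand as a check (a sketch, using the model fitted above on hp and wt):

# Classical (non-studentized) Breusch-Pagan statistic, following the formula above
e   <- resid(model)
aux <- lm(e^2 ~ hp + wt, data = mtcars)          # regress squared residuals on the predictors

SSR_star <- sum((fitted(aux) - mean(e^2))^2)     # regression sum of squares
SSE      <- sum(e^2)                             # SSE from the original model
n        <- length(e)

bp_stat <- (SSR_star / 2) / (SSE / n)^2
c(statistic = bp_stat,
  p_value   = pchisq(bp_stat, df = 2, lower.tail = FALSE))   # df = number of predictors

bptest(model, studentize = FALSE)                # classical version from lmtest, for comparison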

5.1.2.5.6 Independence

Testing for independence ensures that the residuals of a regression model are uncorrelated. Violation of this assumption may lead to biased or inefficient parameter estimates. Below, we discuss three primary methods for diagnosing dependence: plots, the Durbin-Watson test, and specific methods for time-series and spatial data.


5.1.2.5.6.1 Plots

A residual plot can help detect dependence in the residuals:

  1. Plot residuals (e_i) versus the predicted values (\hat{Y}_i).
  2. Patterns such as systematic waves, trends, or clusters indicate possible dependence.
  3. Independence is suggested if residuals are randomly scattered without clear patterns.
# Fit a regression model
model <- lm(mpg ~ hp + wt, data = mtcars)

# Residual plot
plot(model$fitted.values, model$residuals,
     xlab = "Fitted Values",
     ylab = "Residuals",
     main = "Residual Plot")
abline(h = 0, col = "red")

5.1.2.5.6.2 Durbin-Watson Test

The Durbin-Watson test specifically detects autocorrelation in residuals, especially in time-series data.

  • Hypotheses:

    • H_0: Residuals are uncorrelated.

    • H_1: Residuals are autocorrelated.

  • The Durbin-Watson statistic (d) is calculated as: d = \frac{\sum_{t=2}^n (e_t - e_{t-1})^2}{\sum_{t=1}^n e_t^2}

    • d \approx 2: No autocorrelation.

    • d < 2: Positive autocorrelation.

    • d > 2: Negative autocorrelation.

# Load necessary package
library(lmtest)

# Perform the Durbin-Watson test
dw_test <- dwtest(model)

# Results
dw_test
#> 
#>  Durbin-Watson test
#> 
#> data:  model
#> DW = 1.3624, p-value = 0.02061
#> alternative hypothesis: true autocorrelation is greater than 0
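
The statistic itself can be computed directly from the residuals, which is a useful sanity check on the packaged test:

# Durbin-Watson statistic computed directly from the residuals
e <- resid(model)
sum(diff(e)^2) / sum(e^2)   # matches the DW value reported by dwtest()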
5.1.2.5.6.3 Time-Series Autocorrelation

For time-series data, autocorrelation often occurs due to the temporal structure. Key diagnostics include:

  1. Autocorrelation Function (ACF):

    • Shows the correlation of residuals with their lagged values.

    • Significant spikes in the ACF plot indicate autocorrelation.

  2. Partial Autocorrelation Function (PACF):

    • Identifies the correlation of residuals after removing the influence of intermediate lags.
# Load necessary package
library(forecast)

# Residuals from a time-series model
time_series_res <- ts(model$residuals)

# Plot ACF and PACF
acf(time_series_res, main = "ACF of Residuals")

pacf(time_series_res, main = "PACF of Residuals")

5.1.2.5.6.4 Spatial Statistics

Spatial dependence occurs when residuals are correlated across geographical or spatial locations. Two primary tests are used to diagnose spatial autocorrelation:


Moran’s I measures global spatial autocorrelation. It determines whether similar values cluster spatially. The statistic is defined as:

I = \frac{n}{W} \cdot \frac{\sum_i \sum_j w_{ij} (x_i - \bar{x})(x_j - \bar{x})}{\sum_i (x_i - \bar{x})^2}

Where:

  • n: Number of observations.
  • W: Sum of all spatial weights w_{ij}.
  • w_{ij}: Spatial weight between locations i and j.
  • x_i: Residual value at location i.
  • \bar{x}: Mean of the residuals.

Spatial Weight Matrix: The matrix of weights w_{ij} defines the spatial relationship between observations (W above is the sum of all its entries). It is often derived from:

  • Distance-based methods: E.g., k-nearest neighbors or distance bands.
  • Adjacency methods: Based on shared boundaries in geographic data.

Interpretation:

  • I > 0: Positive spatial autocorrelation (similar values cluster together).

  • I < 0: Negative spatial autocorrelation (dissimilar values are neighbors).

  • I \approx 0: Random spatial distribution.

# Load necessary packages
library(sf)
library(spdep)

# Simulate spatial data (example with mtcars dataset)
coords <- cbind(mtcars$hp, mtcars$wt)  # Coordinates based on two variables

# Add small jitter to avoid duplicate coordinates
set.seed(123)  # For reproducibility
coords_jittered <-
    coords + matrix(runif(length(coords),-0.01, 0.01), ncol = 2)

# Find nearest neighbors
neighbors <- knn2nb(knearneigh(coords_jittered, k = 3))

# Create spatial weights matrix
weights <- nb2listw(neighbors, style = "W")  # Row-standardized weights

# Fit the linear model
model <- lm(mpg ~ hp + wt, data = mtcars)

# Check lengths of residuals and weights
length(model$residuals)  # Should be 32
#> [1] 32
length(weights$neighbours)  # Should also be 32
#> [1] 32

# Compute Moran's I for residuals
moran_test <- moran.test(model$residuals, weights)
print(moran_test)
#> 
#>  Moran I test under randomisation
#> 
#> data:  model$residuals  
#> weights: weights    
#> 
#> Moran I statistic standard deviate = 1.7679, p-value = 0.03854
#> alternative hypothesis: greater
#> sample estimates:
#> Moran I statistic       Expectation          Variance 
#>        0.18544790       -0.03225806        0.01516371

# Moran's scatterplot
moran.plot(model$residuals, weights, main = "Moran's Scatterplot")

Significant Moran’s I: Indicates global clustering of similar residuals, suggesting spatial dependence.


Geary’s C measures spatial autocorrelation at a local level, emphasizing differences between neighboring observations. The statistic is defined as:

C = \frac{n - 1}{2W} \cdot \frac{\sum_i \sum_j w_{ij} (x_i - x_j)^2}{\sum_i (x_i - \bar{x})^2}

Where:

  • C ranges from 0 to 2:
    • C \approx 0: High positive spatial autocorrelation.
    • C \approx 1: Spatial randomness.
    • C \approx 2: High negative spatial autocorrelation.

Comparison of Moran’s I and Geary’s C:

  • Moran’s I is more global and measures the overall pattern of autocorrelation.

  • Geary’s C is more sensitive to local spatial autocorrelation, detecting small-scale patterns.


# Compute Geary's C for residuals
geary_test <- geary.test(model$residuals, weights)

# Results
geary_test
#> 
#>  Geary C test under randomisation
#> 
#> data:  model$residuals 
#> weights: weights 
#> 
#> Geary C statistic standard deviate = 1.0592, p-value = 0.1447
#> alternative hypothesis: Expectation greater than statistic
#> sample estimates:
#> Geary C statistic       Expectation          Variance 
#>        0.84535708        1.00000000        0.02131399

Significant Geary’s C: Highlights local spatial autocorrelation, useful for identifying specific regions or groups of observations where dependence is strong.

References

———. 2024. “Logs with Zeros? Some Problems and Solutions.” The Quarterly Journal of Economics 139 (2): 891–936.
Goldberger, Arthur Stanley. 1991. A Course in Econometrics. Harvard University Press.
Greene, William H. 1990. “A Gamma-Distributed Stochastic Frontier Model.” Journal of Econometrics 46 (1-2): 141–63.
Lovell, Michael C. 2008. “A Simple Proof of the FWL Theorem.” The Journal of Economic Education 39 (1): 88–91.