5.2 Generalized Least Squares
5.2.1 Infeasible Generalized Least Squares
Motivation for a More Efficient Estimator
- The Gauss-Markov Theorem guarantees that OLS is the Best Linear Unbiased Estimator (BLUE) under assumptions A1-A4:
- A4: \(Var(\epsilon | \mathbf{X}) = \sigma^2 \mathbf{I}_n\) (homoscedasticity and no autocorrelation).
- When A4 does not hold:
- Heteroskedasticity: \(Var(\epsilon_i | \mathbf{X}) \neq \sigma^2\).
- Serial Correlation: \(Cov(\epsilon_i, \epsilon_j | \mathbf{X}) \neq 0\) for \(i \neq j\).
Without A4, OLS is unbiased but no longer efficient. This motivates the need for an alternative approach to identify the most efficient estimator.
The unweighted (standard) regression model is given by:
\[ \mathbf{y} = \mathbf{X \beta} + \boldsymbol{\epsilon} \]
Assuming A1-A3 hold (linearity, full rank, exogeneity) but A4 does not, the variance of the error term is no longer proportional to the identity matrix:
\[ Var(\boldsymbol{\epsilon} | \mathbf{X}) = \boldsymbol{\Omega} \neq \sigma^2 \mathbf{I}_n. \]
To address the violation of A4 (\(\boldsymbol{\Omega} \neq \sigma^2 \mathbf{I}_n\)), one can premultiply both sides of the model by a full-rank matrix \(\mathbf{w}\) to obtain a weighted (transformed) regression model:
\[ \mathbf{w y} = \mathbf{w X \beta} + \mathbf{w \epsilon}, \]
where \(\mathbf{w}\) is a full-rank matrix chosen such that:
\[ \mathbf{w'w} = \boldsymbol{\Omega}^{-1}. \]
- \(\mathbf{w}\) is the Cholesky factor of \(\boldsymbol{\Omega}^{-1}\).
- The Cholesky decomposition guarantees a full-rank \(\mathbf{w}\) satisfying \(\mathbf{w'w = \Omega^{-1}}\), so \(\mathbf{w}\) acts as the “square root” of \(\boldsymbol{\Omega}^{-1}\) in matrix terms.
Throughout, \(\boldsymbol{\Omega}\) denotes the error variance-covariance matrix and \(\boldsymbol{\Omega}^{-1}\) its inverse:
\[ \begin{aligned} \boldsymbol{\Omega} &= Var(\boldsymbol{\epsilon} | \mathbf{X}), \\ \boldsymbol{\Omega}^{-1} &= Var(\boldsymbol{\epsilon} | \mathbf{X})^{-1}. \end{aligned} \]
As shown below, this choice of \(\mathbf{w}\) makes the variance of the transformed errors equal to \(\mathbf{I}_n\).
The transformed equation allows us to compute a more efficient estimator.
Using the transformed model, the Infeasible Generalized Least Squares (IGLS) estimator is:
\[ \begin{aligned} \mathbf{\hat{\beta}_{IGLS}} &= \mathbf{(X'w'wX)^{-1}X'w'wy} \\ &= \mathbf{(X' \boldsymbol{\Omega}^{-1} X)^{-1} X' \boldsymbol{\Omega}^{-1} y} \\ &= \mathbf{\beta + (X' \boldsymbol{\Omega}^{-1} X)^{-1} X' \boldsymbol{\Omega}^{-1} \boldsymbol{\epsilon}}. \end{aligned} \]
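To make the algebra concrete, here is a minimal numpy sketch: it simulates data with a known \(\boldsymbol{\Omega}\) (diagonal purely for simplicity of illustration) and verifies that OLS on the \(\mathbf{w}\)-transformed data reproduces the IGLS formula.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Simulated data with a known Omega (diagonal here only for simplicity)
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta = np.array([1.0, 2.0])
sigma2 = np.exp(0.5 * X[:, 1])                # heteroskedastic variances
Omega = np.diag(sigma2)
y = X @ beta + rng.normal(size=n) * np.sqrt(sigma2)

# IGLS directly from the formula (X' Omega^{-1} X)^{-1} X' Omega^{-1} y
Omega_inv = np.linalg.inv(Omega)
b_igls = np.linalg.solve(X.T @ Omega_inv @ X, X.T @ Omega_inv @ y)

# Equivalent route: if Omega = L L' (Cholesky, L lower triangular),
# then w = L^{-1} satisfies w'w = Omega^{-1}.
L = np.linalg.cholesky(Omega)
Xw = np.linalg.solve(L, X)                    # w X
yw = np.linalg.solve(L, y)                    # w y
b_transformed = np.linalg.lstsq(Xw, yw, rcond=None)[0]

assert np.allclose(b_igls, b_transformed)     # the two routes coincide
```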
- Unbiasedness
Since assumptions A1-A3 hold for the unweighted model:
\[ \begin{aligned} E(\hat{\beta}_{IGLS}|\mathbf{X}) &= E\left(\beta + (\mathbf{X}'\boldsymbol{\Omega}^{-1}\mathbf{X})^{-1}\mathbf{X}'\boldsymbol{\Omega}^{-1}\boldsymbol{\epsilon} \,\middle|\, \mathbf{X}\right) \\ &= \beta + (\mathbf{X}'\boldsymbol{\Omega}^{-1}\mathbf{X})^{-1}\mathbf{X}'\boldsymbol{\Omega}^{-1}E(\boldsymbol{\epsilon}|\mathbf{X}) \\ &= \beta && \text{since A3: } E(\boldsymbol{\epsilon}|\mathbf{X})=0. \end{aligned} \]
Thus, the IGLS estimator is unbiased.
- Variance
The variance of the transformed errors is given by:
\[ \begin{aligned} \mathbf{Var(w\epsilon|\mathbf{X})} &= \mathbf{wVar(\epsilon|\mathbf{X})w'} \\ &= \mathbf{w\Omega w'} \\ &= \mathbf{w(w'w)^{-1}w'} && \text{since } \mathbf{w} \text{ is full-rank,} \\ &= \mathbf{ww^{-1}(w')^{-1}w'} \\ &= \mathbf{I_n}. \end{aligned} \]
Hence, A4 holds for the transformed (weighted) equation, satisfying the Gauss-Markov conditions.
The variance of the IGLS estimator is:
\[ \begin{aligned} Var(\hat{\beta}_{IGLS}|\mathbf{X}) &= Var\left(\beta + (\mathbf{X}'\boldsymbol{\Omega}^{-1}\mathbf{X})^{-1}\mathbf{X}'\boldsymbol{\Omega}^{-1}\boldsymbol{\epsilon} \,\middle|\, \mathbf{X}\right) \\ &= (\mathbf{X}'\boldsymbol{\Omega}^{-1}\mathbf{X})^{-1}\mathbf{X}'\boldsymbol{\Omega}^{-1}\, Var(\boldsymbol{\epsilon}|\mathbf{X})\, \boldsymbol{\Omega}^{-1}\mathbf{X}(\mathbf{X}'\boldsymbol{\Omega}^{-1}\mathbf{X})^{-1} \\ &= (\mathbf{X}'\boldsymbol{\Omega}^{-1}\mathbf{X})^{-1}\mathbf{X}'\boldsymbol{\Omega}^{-1}\boldsymbol{\Omega}\boldsymbol{\Omega}^{-1}\mathbf{X}(\mathbf{X}'\boldsymbol{\Omega}^{-1}\mathbf{X})^{-1} && \text{since } Var(\boldsymbol{\epsilon}|\mathbf{X}) = \boldsymbol{\Omega}, \\ &= (\mathbf{X}'\boldsymbol{\Omega}^{-1}\mathbf{X})^{-1}. \end{aligned} \]
- Efficiency
The difference in variances between OLS and IGLS is:
\[ \mathbf{Var(\hat{\beta}_{OLS}|\mathbf{X}) - Var(\hat{\beta}_{IGLS}|\mathbf{X})} = \mathbf{A\Omega A'}, \]
where:
\[ \mathbf{A = (X'X)^{-1}X' - (X'\Omega^{-1}X)^{-1}X'\Omega^{-1}}. \]
Since \(\mathbf{\Omega}\) is positive definite, \(\mathbf{A\Omega A'}\) is positive semi-definite. Thus, the IGLS estimator is at least as efficient as OLS, with strict gains under heteroskedasticity or autocorrelation.
In short, properties of \(\mathbf{\hat{\beta}_{IGLS}}\):
- Unbiasedness: \(\mathbf{\hat{\beta}_{IGLS}}\) remains unbiased as long as A1-A3 hold.
- Efficiency: \(\mathbf{\hat{\beta}_{IGLS}}\) is more efficient than OLS under heteroskedasticity or serial correlation since it accounts for the structure of \(\boldsymbol{\Omega}\).
Why Is IGLS “Infeasible”?
The name infeasible arises because it is generally impossible to compute the estimator directly due to the structure of \(\mathbf{w}\) (or equivalently \(\boldsymbol{\Omega}^{-1}\)). The matrix \(\mathbf{w}\) is defined as:
\[ \mathbf{w} = \begin{pmatrix} w_{11} & 0 & 0 & \cdots & 0 \\ w_{21} & w_{22} & 0 & \cdots & 0 \\ w_{31} & w_{32} & w_{33} & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ w_{n1} & w_{n2} & w_{n3} & \cdots & w_{nn} \end{pmatrix}, \]
with \(n(n+1)/2\) unique elements for \(n\) observations. This results in more parameters than data points, making direct estimation infeasible.
To make the estimation feasible, assumptions about the structure of \(\mathbf{\Omega}\) are required. Common approaches include:
Heteroskedastic Errors: Assume no correlation between errors, but allow heterogeneous variances \(Var(\epsilon_i|\mathbf{X}) = \sigma_i^2\): \[ \mathbf{\Omega} = \begin{pmatrix} \sigma_1^2 & 0 & \cdots & 0 \\ 0 & \sigma_2^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma_n^2 \end{pmatrix}. \]
Estimate \(\sigma_i^2\) by modeling it as a function of predictors, e.g., a multiplicative exponential model \(\sigma_i^2 = \exp(\mathbf{x}_i \gamma)\).
Serial Correlation: Assume the errors follow a first-order autoregressive AR(1) process, \(\epsilon_t = \rho \epsilon_{t-1} + u_t\), so that \(Cov(\epsilon_t, \epsilon_{t-h}) = \rho^h Var(\epsilon_t)\) and the variance-covariance matrix has off-diagonal elements decaying geometrically with the lag: \[ \mathbf{\Omega} = \frac{\sigma_u^2}{1-\rho^2} \begin{pmatrix} 1 & \rho & \rho^2 & \cdots & \rho^{n-1} \\ \rho & 1 & \rho & \cdots & \rho^{n-2} \\ \rho^2 & \rho & 1 & \cdots & \rho^{n-3} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \rho^{n-1} & \rho^{n-2} & \rho^{n-3} & \cdots & 1 \end{pmatrix}. \]
Cluster Errors: Assume block-diagonal structure for \(\mathbf{\Omega}\) to account for grouped or panel data.
Each assumption simplifies the estimation of \(\mathbf{\Omega}\) and thus \(\mathbf{w}\), enabling Feasible Generalized Least Squares with fewer unknown parameters to estimate.
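As a rough sketch, the snippet below constructs a small \(\mathbf{\Omega}\) under each of these three assumptions; all parameter values (\(\sigma_i^2\), \(\rho\), \(\sigma_c^2\), \(\sigma_e^2\)) are invented for the example.

```python
import numpy as np
from scipy.linalg import block_diag

n = 5

# 1) Heteroskedasticity: diagonal Omega with observation-specific variances
sigma2 = np.array([1.0, 2.0, 0.5, 1.5, 3.0])          # illustrative values
omega_het = np.diag(sigma2)

# 2) AR(1): entry (i, j) equals rho^|i-j| * sigma_u^2 / (1 - rho^2)
rho, sigma2_u = 0.6, 1.0
lags = np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
omega_ar1 = sigma2_u / (1 - rho**2) * rho**lags

# 3) Clusters: block-diagonal Omega (two groups, sizes 3 and 2)
sigma2_c, sigma2_e = 0.5, 1.0

def re_block(ng):
    # within-group covariance sigma2_c everywhere, extra sigma2_e on the diagonal
    return np.full((ng, ng), sigma2_c) + sigma2_e * np.eye(ng)

omega_cluster = block_diag(re_block(3), re_block(2))
```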
5.2.2 Feasible Generalized Least Squares
5.2.2.1 Heteroskedastic Errors
Heteroskedasticity occurs when the variance of the error term is not constant across observations. Specifically:
\[ Var(\epsilon_i | x_i) = E(\epsilon_i^2 | x_i) \neq \sigma^2, \]
but instead depends on a function of \(x_i\):
\[ Var(\epsilon_i | x_i) = h(x_i) = \sigma_i^2. \]
This violates the assumption of homoscedasticity (constant variance), impacting the efficiency of OLS estimates.
For the model:
\[ y_i = x_i\beta + \epsilon_i, \]
we apply a transformation to standardize the variance:
\[ \frac{y_i}{\sigma_i} = \frac{x_i}{\sigma_i} \beta + \frac{\epsilon_i}{\sigma_i}. \]
By scaling each observation with \(1/\sigma_i\), the variance of the transformed error term becomes:
\[ \begin{aligned} Var\left(\frac{\epsilon_i}{\sigma_i} \bigg| X \right) &= \frac{1}{\sigma_i^2} Var(\epsilon_i | X) \\ &= \frac{1}{\sigma_i^2} \sigma_i^2 \\ &= 1. \end{aligned} \]
Thus, the heteroskedasticity is corrected in the transformed model.
In matrix notation, the transformed model is:
\[ \mathbf{w y} = \mathbf{w X \beta + w \epsilon}, \]
where \(\mathbf{w}\) is the weight matrix used to standardize the variance. The weight matrix \(\mathbf{w}\) is defined as:
\[ \mathbf{w} = \begin{pmatrix} 1/\sigma_1 & 0 & 0 & \cdots & 0 \\ 0 & 1/\sigma_2 & 0 & \cdots & 0 \\ 0 & 0 & 1/\sigma_3 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & 1/\sigma_n \end{pmatrix}. \]
In the presence of heteroskedasticity, the variance of the error term, \(Var(\epsilon_i|\mathbf{x}_i)\), is not constant across observations. This leads to inefficient OLS estimates.
Infeasible Weighted Least Squares (IWLS) assumes that the variances \(\sigma_i^2 = Var(\epsilon_i|\mathbf{x}_i)\) are known. This allows us to adjust the regression equation to correct for heteroskedasticity.
The model is transformed as follows:
\[ y_i = \mathbf{x}_i\beta + \epsilon_i \quad \text{(original equation)}, \]
where \(\epsilon_i\) has variance \(\sigma_i^2\). To make the errors homoskedastic, we divide through by \(\sigma_i\):
\[ \frac{y_i}{\sigma_i} = \frac{\mathbf{x}_i}{\sigma_i}\beta + \frac{\epsilon_i}{\sigma_i}. \]
Now, the transformed error term \(\epsilon_i / \sigma_i\) has a constant variance of 1:
\[ Var\left(\frac{\epsilon_i}{\sigma_i} | \mathbf{x}_i \right) = 1. \]
The IWLS estimator minimizes the weighted sum of squared residuals for the transformed model:
\[ \text{Minimize: } \sum_{i=1}^n \left( \frac{y_i - \mathbf{x}_i\beta}{\sigma_i} \right)^2. \]
In matrix form, the IWLS estimator is:
\[ \hat{\beta}_{IWLS} = (\mathbf{X}'\mathbf{W}\mathbf{X})^{-1}\mathbf{X}'\mathbf{W}\mathbf{y}, \]
where \(\mathbf{W}\) is a diagonal matrix of weights:
\[ \mathbf{W} = \begin{pmatrix} 1/\sigma_1^2 & 0 & \cdots & 0 \\ 0 & 1/\sigma_2^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1/\sigma_n^2 \end{pmatrix}. \]
Properties of IWLS
- Valid Standard Errors:
- If \(Var(\epsilon_i | \mathbf{X}) = \sigma_i^2\), the usual standard errors from IWLS are valid.
- Robustness:
- If the variance assumption is incorrect (\(Var(\epsilon_i | \mathbf{X}) \neq \sigma_i^2\)), heteroskedasticity-robust standard errors must be used instead.
The primary issue with IWLS is that \(\sigma_i^2 = Var(\epsilon_i|\mathbf{x}_i)\) is generally unknown. Specifically, we do not know:
\[ \sigma_i^2 = Var(\epsilon_i|\mathbf{x}_i) = E(\epsilon_i^2|\mathbf{x}_i). \]
The challenges are:
- Single Observation:
- For each observation \(i\), there is only one \(\epsilon_i\), which is insufficient to estimate the variance \(\sigma_i^2\) directly.
- Dependence on Assumptions:
- To estimate \(\sigma_i^2\), we must impose assumptions about its relationship to \(\mathbf{x}_i\).
To make IWLS feasible, we model \(\sigma_i^2\) as a function of the predictors \(\mathbf{x}_i\). A common approach is:
\[ \epsilon_i^2 = v_i \exp(\mathbf{x}_i\gamma), \]
where:
- \(v_i\) is an independent error term with strictly positive values, representing random noise.
- \(\exp(\mathbf{x}_i\gamma)\) is a deterministic function of the predictors \(\mathbf{x}_i\).
Taking the natural logarithm of both sides linearizes the model:
\[ \ln(\epsilon_i^2) = \mathbf{x}_i\gamma + \ln(v_i), \]
where \(\ln(v_i)\) is independent of \(\mathbf{x}_i\). This transformation enables us to estimate \(\gamma\) using standard OLS techniques.
Estimation Procedure for Feasible GLS (FGLS)
Since we do not observe the true errors \(\epsilon_i\), we approximate them using the OLS residuals \(e_i\). Here’s the step-by-step process:
1. Compute OLS Residuals: Fit the original model using OLS and calculate the residuals:
\[ e_i = y_i - \mathbf{x}_i\hat{\beta}_{OLS}. \]
2. Approximate \(\epsilon_i^2\) with \(e_i^2\): Use the squared residuals as a proxy for the squared errors:
\[ e_i^2 \approx \epsilon_i^2. \]
3. Log-Linear Model: Fit the log-transformed model to estimate \(\gamma\):
\[ \ln(e_i^2) = \mathbf{x}_i\gamma + \ln(v_i). \]
Estimate \(\gamma\) using OLS, where \(\ln(v_i)\) is treated as the error term.
4. Estimate Variances: Use the estimated coefficients \(\hat{\gamma}\) to compute the fitted variance \(\hat{\sigma}_i^2\) for each observation:
\[ \hat{\sigma}_i^2 = \exp(\mathbf{x}_i\hat{\gamma}). \]
5. Perform Weighted Least Squares: Use the estimated variances \(\hat{\sigma}_i^2\) to construct the weight matrix \(\mathbf{\hat{W}}\):
\[ \mathbf{\hat{W}} = \begin{pmatrix} 1/\hat{\sigma}_1^2 & 0 & \cdots & 0 \\ 0 & 1/\hat{\sigma}_2^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1/\hat{\sigma}_n^2 \end{pmatrix}. \]
Then, compute the Feasible GLS (FGLS) estimator:
\[ \hat{\beta}_{FGLS} = (\mathbf{X}'\mathbf{\hat{W}}\mathbf{X})^{-1}\mathbf{X}'\mathbf{\hat{W}}\mathbf{y}. \]
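A compact numpy sketch of Steps 1-5, assuming the multiplicative exponential variance model above (the function name `fgls_multiplicative` is made up for this example):

```python
import numpy as np

def fgls_multiplicative(X, y):
    """FGLS under Var(eps_i | x_i) = exp(x_i gamma); a sketch of Steps 1-5."""
    # Step 1: OLS residuals
    b_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ b_ols
    # Steps 2-3: regress ln(e^2) on x to estimate gamma
    gamma, *_ = np.linalg.lstsq(X, np.log(e**2), rcond=None)
    # Step 4: fitted variances sigma_hat_i^2 = exp(x_i gamma_hat)
    sigma2_hat = np.exp(X @ gamma)
    # Step 5: OLS on the 1/sigma_hat_i-scaled data (equivalent to using W_hat)
    Xw = X / np.sqrt(sigma2_hat)[:, None]
    yw = y / np.sqrt(sigma2_hat)
    b_fgls, *_ = np.linalg.lstsq(Xw, yw, rcond=None)
    return b_fgls
```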
5.2.2.2 Serial Correlation
Serial correlation (also called autocorrelation) occurs when the error terms in a regression model are correlated across observations. Formally:
\[ Cov(\epsilon_i, \epsilon_j | \mathbf{X}) \neq 0 \quad \text{for } i \neq j. \]
This violates the Gauss-Markov assumption that \(Cov(\epsilon_i, \epsilon_j | \mathbf{X}) = 0\), leading to inefficiencies in OLS estimates.
5.2.2.2.1 Covariance Stationarity
If the errors are covariance stationary, the covariance between two errors depends only on their relative time or positional difference \(h\), not their absolute position:
\[ Cov(\epsilon_i, \epsilon_{i+h} | \mathbf{X}) = \gamma_h, \]
where \(\gamma_h\) represents the covariance at lag \(h\).
Under covariance stationarity, the variance-covariance matrix of the error term \(\boldsymbol{\epsilon}\) takes the following form:
\[ Var(\boldsymbol{\epsilon}|\mathbf{X}) = \boldsymbol{\Omega} = \begin{pmatrix} \sigma^2 & \gamma_1 & \gamma_2 & \cdots & \gamma_{n-1} \\ \gamma_1 & \sigma^2 & \gamma_1 & \cdots & \gamma_{n-2} \\ \gamma_2 & \gamma_1 & \sigma^2 & \cdots & \gamma_{n-3} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \gamma_{n-1} & \gamma_{n-2} & \gamma_{n-3} & \cdots & \sigma^2 \end{pmatrix}. \]
Key Points:
- The diagonal elements represent the variance of the error term: \(\sigma^2\).
- The off-diagonal elements \(\gamma_h\) represent covariances at different lags \(h\).
Why Is Serial Correlation a Problem?
The matrix \(\boldsymbol{\Omega}\) introduces \(n\) parameters to estimate (namely \(\sigma^2, \gamma_1, \gamma_2, \ldots, \gamma_{n-1}\)). Estimating that many parameters from \(n\) observations is impractical, especially for large datasets. To address this, we impose additional structure to reduce the number of parameters.
5.2.2.2.2 AR(1) Model
In the AR(1) process, the errors follow a first-order autoregressive process:
\[ \begin{aligned} y_t &= \beta_0 + x_t\beta_1 + \epsilon_t, \\ \epsilon_t &= \rho \epsilon_{t-1} + u_t, \end{aligned} \]
where:
- \(\rho\) is the first-order autocorrelation coefficient, capturing the relationship between consecutive errors.
- \(u_t\) is white noise, satisfying \(Var(u_t) = \sigma_u^2\) and \(Cov(u_t, u_{t-h}) = 0\) for \(h \neq 0\).
Under the AR(1) assumption, the variance-covariance matrix of the error term \(\boldsymbol{\epsilon}\) becomes:
\[ Var(\boldsymbol{\epsilon} | \mathbf{X}) = \frac{\sigma_u^2}{1-\rho^2} \begin{pmatrix} 1 & \rho & \rho^2 & \cdots & \rho^{n-1} \\ \rho & 1 & \rho & \cdots & \rho^{n-2} \\ \rho^2 & \rho & 1 & \cdots & \rho^{n-3} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \rho^{n-1} & \rho^{n-2} & \cdots & \rho & 1 \end{pmatrix}. \]
Key Features:
- The diagonal elements represent the variance: \(Var(\epsilon_t | \mathbf{X}) = \sigma_u^2 / (1-\rho^2)\).
- The off-diagonal elements decay exponentially with lag \(h\): \(Cov(\epsilon_t, \epsilon_{t-h} | \mathbf{X}) = \rho^h \cdot Var(\epsilon_t | \mathbf{X})\).
Under AR(1), only one parameter \(\rho\) needs to be estimated (in addition to \(\sigma_u^2\)), greatly simplifying the structure of \(\boldsymbol{\Omega}\).
OLS Properties Under AR(1)
- Consistency: If assumptions A1, A2, A3a, and A5a hold, OLS remains consistent.
- Asymptotic Normality: OLS estimates are asymptotically normal.
- Inference with Serial Correlation:
- Standard OLS errors are invalid.
- Use Newey-West standard errors to obtain robust inference (a sketch follows below).
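For instance, statsmodels can compute Newey-West (HAC) standard errors directly via the `cov_type="HAC"` option; the simulated AR(1) data below is purely illustrative.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)

# Errors with AR(1) dependence (rho = 0.7), for illustration only
u = rng.normal(size=n)
eps = np.zeros(n)
for t in range(1, n):
    eps[t] = 0.7 * eps[t - 1] + u[t]
y = 1.0 + 2.0 * x + eps

X = sm.add_constant(x)
res = sm.OLS(y, X).fit(cov_type="HAC", cov_kwds={"maxlags": 4})
print(res.bse)    # Newey-West standard errors, robust to serial correlation
```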
5.2.2.2.3 Infeasible Cochrane-Orcutt
The Infeasible Cochrane-Orcutt procedure addresses serial correlation in the error terms by assuming an AR(1) process for the errors:
\[ \epsilon_t = \rho \epsilon_{t-1} + u_t, \]
where \(u_t\) is white noise and \(\rho\) is the autocorrelation coefficient.
By transforming the original regression equation:
\[ y_t = \beta_0 + x_t\beta_1 + \epsilon_t, \]
we subtract \(\rho\) times the lagged equation:
\[ \rho y_{t-1} = \rho (\beta_0 + x_{t-1}\beta_1 + \epsilon_{t-1}), \]
to obtain the weighted first-difference equation:
\[ y_t - \rho y_{t-1} = (1-\rho)\beta_0 + (x_t - \rho x_{t-1})\beta_1 + u_t. \]
Key Points:
- Dependent Variable: \(y_t - \rho y_{t-1}\).
- Independent Variable: \(x_t - \rho x_{t-1}\).
- Error Term: \(u_t\), which satisfies the Gauss-Markov assumptions (A3, A4, A5).
The Infeasible Cochrane-Orcutt (ICO) estimator minimizes the sum of squared residuals for this transformed equation.
- Standard Errors:
- If the errors truly follow an AR(1) process, the standard errors for the transformed equation are valid.
- For more complex error structures, Newey-West HAC standard errors are required.
- Loss of Observations:
- The transformation involves first differences, which means the first observation (\(y_1\)) cannot be used. This reduces the effective sample size by one.
The Problem: \(\rho\) Is Unknown
The ICO procedure is infeasible because it requires knowledge of \(\rho\), the autocorrelation coefficient. In practice, we estimate \(\rho\) from the data.
To estimate \(\rho\), we use the OLS residuals (\(e_t\)) as a proxy for the errors (\(\epsilon_t\)). The estimate \(\hat{\rho}\) is given by:
\[ \hat{\rho} = \frac{\sum_{t=2}^{T} e_t e_{t-1}}{\sum_{t=2}^{T} e_{t-1}^2}. \]
Estimation via OLS:
- Regress the OLS residuals \(e_t\) on their lagged values \(e_{t-1}\), without an intercept: \[ e_t = \rho e_{t-1} + u_t. \]
- The slope of this regression is the estimate \(\hat{\rho}\).
This estimator is consistent under the AR(1) assumption and provides a practical approximation for \(\rho\); a minimal sketch follows.
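Here is a numpy sketch of one feasible Cochrane-Orcutt pass for a bivariate model with intercept (the function name is made up; iterated versions repeat the \(\rho\)/\(\beta\) updates until convergence):

```python
import numpy as np

def cochrane_orcutt(x, y):
    """One pass of feasible Cochrane-Orcutt for y_t = b0 + b1 x_t + eps_t."""
    X = np.column_stack([np.ones_like(x), x])
    # OLS residuals from the original equation
    b_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ b_ols
    # rho_hat: slope of e_t on e_{t-1}, no intercept
    rho = (e[1:] @ e[:-1]) / (e[:-1] @ e[:-1])
    # Quasi-differenced equation; the first observation is lost
    y_star = y[1:] - rho * y[:-1]
    Z = np.column_stack([(1 - rho) * np.ones(len(y) - 1),   # (1 - rho) * b0
                         x[1:] - rho * x[:-1]])
    b_co, *_ = np.linalg.lstsq(Z, y_star, rcond=None)
    return b_co, rho      # (b0_hat, b1_hat), rho_hat
```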
5.2.2.2.4 Feasible Prais-Winsten
The Feasible Prais-Winsten (FPW) method addresses AR(1) serial correlation in regression models by transforming the data to eliminate serial dependence in the errors. Unlike the Infeasible Cochrane-Orcutt procedure, which discards the first observation, the Prais-Winsten method retains it using a weighted transformation.
The FPW transformation uses the following weighting matrix \(\mathbf{w}\):
\[ \mathbf{w} = \begin{pmatrix} \sqrt{1 - \hat{\rho}^2} & 0 & 0 & \cdots & 0 \\ -\hat{\rho} & 1 & 0 & \cdots & 0 \\ 0 & -\hat{\rho} & 1 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & -\hat{\rho} & 1 \end{pmatrix}, \]
where:
- The first row accounts for the transformation of the first observation, using \(\sqrt{1 - \hat{\rho}^2}\).
- Subsequent rows represent the AR(1) transformation for the remaining observations.
Step-by-Step Procedure
Step 1: Initial OLS Estimation
Estimate the regression model using OLS:
\[ y_t = \mathbf{x}_t \beta + \epsilon_t, \]
and compute the residuals:
\[ e_t = y_t - \mathbf{x}_t \hat{\beta}. \]
Step 2: Estimate the AR(1) Correlation Coefficient
Estimate the AR(1) correlation coefficient \(\rho\) by regressing \(e_t\) on \(e_{t-1}\) without an intercept:
\[ e_t = \rho e_{t-1} + u_t. \]
The slope of this regression is the estimated \(\hat{\rho}\).
Step 3: Transform the Data
Apply the transformation using the weighting matrix \(\mathbf{w}\) to transform both the dependent variable \(\mathbf{y}\) and the independent variables \(\mathbf{X}\):
\[ \mathbf{wy} = \mathbf{wX} \beta + \mathbf{w\epsilon}. \]
Specifically:
1. For \(t=1\), the transformed dependent and independent variables are: \[ \tilde{y}_1 = \sqrt{1 - \hat{\rho}^2} \cdot y_1, \quad \tilde{\mathbf{x}}_1 = \sqrt{1 - \hat{\rho}^2} \cdot \mathbf{x}_1. \]
2. For \(t=2, \dots, T\), the transformed variables are: \[ \tilde{y}_t = y_t - \hat{\rho} \cdot y_{t-1}, \quad \tilde{\mathbf{x}}_t = \mathbf{x}_t - \hat{\rho} \cdot \mathbf{x}_{t-1}. \]
Step 4: Feasible Prais-Winsten Estimation
Run OLS on the transformed equation:
\[ \mathbf{wy} = \mathbf{wX} \beta + \mathbf{w\epsilon}. \]
The resulting estimator is the Feasible Prais-Winsten (FPW) estimator:
\[ \hat{\beta}_{FPW} = (\mathbf{X}'\mathbf{w}'\mathbf{w}\mathbf{X})^{-1} \mathbf{X}'\mathbf{w}'\mathbf{w}\mathbf{y}. \]
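A numpy sketch of Steps 3-4, taking \(\hat{\rho}\) from Step 2 as given (the function name is hypothetical; a bivariate model with intercept is assumed):

```python
import numpy as np

def prais_winsten(x, y, rho_hat):
    """Apply the Prais-Winsten transform and run OLS, given rho_hat."""
    n = len(y)
    X = np.column_stack([np.ones(n), x])
    Xt, yt = np.empty_like(X), np.empty_like(y)
    # t = 1: scale by sqrt(1 - rho^2), keeping the first observation
    s = np.sqrt(1 - rho_hat**2)
    Xt[0], yt[0] = s * X[0], s * y[0]
    # t = 2, ..., T: quasi-differences
    Xt[1:] = X[1:] - rho_hat * X[:-1]
    yt[1:] = y[1:] - rho_hat * y[:-1]
    b_fpw, *_ = np.linalg.lstsq(Xt, yt, rcond=None)
    return b_fpw
```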
Properties of Feasible Prais-Winsten Estimator
- Infeasible Prais-Winsten Estimator: Assumes the true \(\rho\) is known; it is unbiased as long as A1-A3 hold for the original equation.
- Feasible Prais-Winsten (FPW) Estimator: The FPW estimator replaces the unknown \(\rho\) with an estimate \(\hat{\rho}\) derived from the OLS residuals, introducing bias in small samples.
- Bias:
- The FPW estimator is biased due to the estimation of \(\hat{\rho}\), which introduces an additional layer of approximation.
- Consistency:
- The FPW estimator is consistent under the following assumptions:
- A1: The model is linear in parameters.
- A2: The independent variables are linearly independent.
- A5: The data is generated through random sampling.
- Additionally: \[ E\big((\mathbf{x_t - \rho x_{t-1}})'\big(\epsilon_t - \rho \epsilon_{t-1}\big)\big) = 0. \] This condition ensures the transformed error term \(\epsilon_t - \rho \epsilon_{t-1}\) is uncorrelated with the transformed regressors \(\mathbf{x_t - \rho x_{t-1}}\).
- Note: A3a (zero conditional mean of the error term, \(E(\epsilon_t|\mathbf{x}_t) = 0\)) is not sufficient for the above condition. Full exogeneity of the independent variables (A3) is required.
- Efficiency
- Asymptotic Efficiency: The FPW estimator is asymptotically more efficient than OLS if the errors are truly generated by an AR(1) process: \[ \epsilon_t = \rho \epsilon_{t-1} + u_t, \quad Var(u_t) = \sigma_u^2. \]
- Standard Errors:
- Usual Standard Errors: If the errors are correctly specified as an AR(1) process, the usual standard errors from FPW are valid.
- Robust Standard Errors: If there is concern about a more complex dependence structure (e.g., higher-order autocorrelation or heteroskedasticity), use Newey-West Standard Errors for inference. These are robust to both serial correlation and heteroskedasticity.
5.2.2.3 Cluster Errors
Consider the regression model with clustered errors:
\[ y_{gi} = \mathbf{x}_{gi}\beta + \epsilon_{gi}, \]
where:
- \(g\) indexes the group (e.g., households, firms, schools).
- \(i\) indexes the individual within the group.
The covariance structure for the errors \(\epsilon_{gi}\) is defined as:
\[ Cov(\epsilon_{gi}, \epsilon_{hj}) \begin{cases} = 0 & \text{if } g \neq h \text{ (independent across groups)}, \\ \neq 0 & \text{for any pair } (i,j) \text{ within group } g. \end{cases} \]
Within each group, individuals’ errors may be correlated (i.e., intra-group correlation), while errors are independent across groups. This violates A4 (constant variance and no correlation of errors).
Suppose there are three groups of sizes 3, 2, and 1. The variance-covariance matrix \(\boldsymbol{\Omega}\) for the errors \(\boldsymbol{\epsilon}\) is:
\[ Var(\boldsymbol{\epsilon}| \mathbf{X}) = \boldsymbol{\Omega} = \begin{pmatrix} \sigma^2 & \delta_{12}^1 & \delta_{13}^1 & 0 & 0 & 0 \\ \delta_{12}^1 & \sigma^2 & \delta_{23}^1 & 0 & 0 & 0 \\ \delta_{13}^1 & \delta_{23}^1 & \sigma^2 & 0 & 0 & 0 \\ 0 & 0 & 0 & \sigma^2 & \delta_{12}^2 & 0 \\ 0 & 0 & 0 & \delta_{12}^2 & \sigma^2 & 0 \\ 0 & 0 & 0 & 0 & 0 & \sigma^2 \end{pmatrix}. \]
where
- \(\delta_{ij}^g = Cov(\epsilon_{gi}, \epsilon_{gj})\) is the covariance between errors for individuals \(i\) and \(j\) in group \(g\).
- \(Cov(\epsilon_{gi}, \epsilon_{hj}) = 0\) for \(g \neq h\) (independent groups).
Infeasible Generalized Least Squares (Cluster)
Assume Known Variance-Covariance Matrix: If \(\sigma^2\) and \(\delta_{ij}^g\) are known, construct \(\boldsymbol{\Omega}\) and compute its inverse \(\boldsymbol{\Omega}^{-1}\).
Infeasible GLS Estimator: The infeasible generalized least squares (IGLS) estimator is:
\[ \hat{\beta}_{IGLS} = (\mathbf{X}'\boldsymbol{\Omega}^{-1}\mathbf{X})^{-1}\mathbf{X}'\boldsymbol{\Omega}^{-1}\mathbf{y}. \]
Problem:
- We do not know \(\sigma^2\) and \(\delta_{ij}^g\), making this approach infeasible.
- Even if \(\boldsymbol{\Omega}\) is estimated, incorrect assumptions about its structure may lead to invalid inference.
To make the estimation feasible, we assume a group-level random effects specification for the error:
\[ \begin{aligned} y_{gi} &= \mathbf{x}_{gi}\beta + c_g + u_{gi}, \\ Var(c_g|\mathbf{x}_i) &= \sigma_c^2, \\ Var(u_{gi}|\mathbf{x}_i) &= \sigma_u^2, \end{aligned} \]
where:
- \(c_g\) represents the group-level random effect (common shocks within each group, independent across groups).
- \(u_{gi}\) represents the individual-level error (idiosyncratic shocks, independent across individuals and groups).
- \(\epsilon_{gi} = c_g + u_{gi}\) is the composite error.
Independence Assumptions:
- \(c_g\) and \(u_{gi}\) are independent of each other.
- Both are mean-independent of \(\mathbf{x}_i\).
Under this specification, the variance-covariance matrix \(\boldsymbol{\Omega}\) becomes block diagonal, where each block corresponds to a group:
\[ Var(\boldsymbol{\epsilon}| \mathbf{X}) = \boldsymbol{\Omega} = \begin{pmatrix} \sigma_c^2 + \sigma_u^2 & \sigma_c^2 & \sigma_c^2 & 0 & 0 & 0 \\ \sigma_c^2 & \sigma_c^2 + \sigma_u^2 & \sigma_c^2 & 0 & 0 & 0 \\ \sigma_c^2 & \sigma_c^2 & \sigma_c^2 + \sigma_u^2 & 0 & 0 & 0 \\ 0 & 0 & 0 & \sigma_c^2 + \sigma_u^2 & \sigma_c^2 & 0 \\ 0 & 0 & 0 & \sigma_c^2 & \sigma_c^2 + \sigma_u^2 & 0 \\ 0 & 0 & 0 & 0 & 0 & \sigma_c^2 + \sigma_u^2 \end{pmatrix}. \]
When the variance components \(\sigma_c^2\) and \(\sigma_u^2\) are unknown, we can use the Feasible Group-Level Random Effects (RE) estimator to simultaneously estimate these variances and the regression coefficients \(\beta\). This practical approach allows us to account for intra-group correlation in the errors and still obtain consistent and efficient estimates of the parameters.
Step-by-Step Procedure
Step 1: Initial OLS Estimation
Estimate the regression model using OLS:
\[ y_{gi} = \mathbf{x}_{gi}\beta + \epsilon_{gi}, \]
and compute the residuals:
\[ e_{gi} = y_{gi} - \mathbf{x}_{gi}\hat{\beta}. \]
Step 2: Estimate Variance Components
Use the standard OLS variance estimator \(s^2\) to estimate the total variance:
\[ s^2 = \frac{1}{n - k} \sum_{i=1}^{n} e_i^2, \]
where \(n\) is the total number of observations and \(k\) is the number of regressors (including the intercept).
Estimate the between-group variance \(\hat{\sigma}_c^2\) using:
\[ \hat{\sigma}_c^2 = \frac{1}{G} \sum_{g=1}^{G} \left( \frac{1}{n_g(n_g-1)/2} \sum_{i=1}^{n_g - 1} \sum_{j=i+1}^{n_g} e_{gi} e_{gj} \right), \]
where:
- \(G\) is the total number of groups,
- \(n_g\) is the size of group \(g\),
- the double sum runs over the \(n_g(n_g-1)/2\) distinct pairs \((i,j)\) within group \(g\), capturing the within-group covariance.
Estimate the within-group variance as:
\[ \hat{\sigma}_u^2 = s^2 - \hat{\sigma}_c^2. \]
Step 3: Construct the Variance-Covariance Matrix
Use the estimated variances \(\hat{\sigma}_c^2\) and \(\hat{\sigma}_u^2\) to construct the variance-covariance matrix \(\hat{\Omega}\) for the error term:
\[ \hat{\Omega}_{gi,hj} = \begin{cases} \hat{\sigma}_c^2 + \hat{\sigma}_u^2 & \text{if } g = h,\ i = j \text{ (diagonal elements)}, \\ \hat{\sigma}_c^2 & \text{if } g = h,\ i \neq j \text{ (within-group off-diagonals)}, \\ 0 & \text{if } g \neq h \text{ (across groups)}. \end{cases} \]
Step 4: Feasible GLS Estimation
With \(\hat{\Omega}\) in hand, perform Feasible Generalized Least Squares (FGLS) to estimate \(\beta\):
\[ \hat{\beta}_{RE} = (\mathbf{X}'\hat{\Omega}^{-1}\mathbf{X})^{-1} \mathbf{X}'\hat{\Omega}^{-1}\mathbf{y}. \]
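The following numpy sketch implements Steps 1-4 (the helper name `re_fgls` is made up; it follows the moment-based formulas above, and real implementations differ in small-sample details such as how a negative \(\hat{\sigma}_c^2\) is handled):

```python
import numpy as np

def re_fgls(X, y, groups):
    """Feasible group-level random effects estimator (sketch of Steps 1-4)."""
    n, k = X.shape
    # Step 1: OLS residuals
    b_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ b_ols
    # Step 2: variance components
    s2 = e @ e / (n - k)
    pair_avgs = []
    for g in np.unique(groups):
        eg = e[groups == g]
        ng = len(eg)
        if ng > 1:
            # sum of e_gi * e_gj over the ng*(ng-1)/2 distinct pairs
            pair_sum = (eg.sum() ** 2 - (eg ** 2).sum()) / 2
            pair_avgs.append(pair_sum / (ng * (ng - 1) / 2))
    sigma2_c = max(np.mean(pair_avgs), 0.0)   # guard against negative estimates
    sigma2_u = s2 - sigma2_c
    # Step 3: block-diagonal Omega_hat
    Omega = sigma2_u * np.eye(n)
    for g in np.unique(groups):
        idx = np.flatnonzero(groups == g)
        Omega[np.ix_(idx, idx)] += sigma2_c
    # Step 4: FGLS
    Oi = np.linalg.inv(Omega)
    return np.linalg.solve(X.T @ Oi @ X, X.T @ Oi @ y)
```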
If the assumptions about \(\boldsymbol{\Omega}\) are incorrect or infeasible, use cluster-robust standard errors to account for intra-group correlation without explicitly modeling the variance-covariance structure. These standard errors remain valid under arbitrary within-cluster dependence, provided clusters are independent.
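In practice, cluster-robust standard errors are available off the shelf; for example, statsmodels accepts `cov_type="cluster"` with a vector of group labels (the simulated grouped data below is only for illustration):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
G, ng = 30, 5                          # 30 groups of 5 observations each
groups = np.repeat(np.arange(G), ng)
c = rng.normal(size=G)[groups]         # group-level shock c_g
x = rng.normal(size=G * ng)
y = 1.0 + 2.0 * x + c + rng.normal(size=G * ng)

X = sm.add_constant(x)
res = sm.OLS(y, X).fit(cov_type="cluster", cov_kwds={"groups": groups})
print(res.bse)                         # cluster-robust standard errors
```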
Properties of the Feasible Group-Level Random Effects Estimator
- Infeasible Group RE Estimator
- The infeasible RE estimator (assuming known variances) is unbiased under assumptions A1, A2, and A3 for the unweighted equation.
- A3 requires: \[ E(\epsilon_{gi}|\mathbf{x}_i) = E(c_g|\mathbf{x}_i) + E(u_{gi}|\mathbf{x}_i) = 0. \] This assumes:
- \(E(c_g|\mathbf{x}_i) = 0\): The random effects assumption (group-level effects are uncorrelated with the regressors).
- \(E(u_{gi}|\mathbf{x}_i) = 0\): No endogeneity at the individual level.
- Feasible Group RE Estimator
- The feasible RE estimator is biased because the variances \(\sigma_c^2\) and \(\sigma_u^2\) are estimated, introducing approximation errors.
- However, the estimator is consistent under A1, A2, A3a (\(E(\mathbf{x}_i'\epsilon_{gi}) = E(\mathbf{x}_i'c_g) + E(\mathbf{x}_i'u_{gi}) = 0\)), A5a.
- Efficiency
- Asymptotic Efficiency:
- The feasible RE estimator is asymptotically more efficient than OLS if the errors follow the random effects specification.
- Standard Errors:
- If the random effects specification is correct, the usual standard errors are consistent.
- If there is concern about more complex dependence structures or heteroskedasticity, use cluster robust standard errors.
5.2.3 Weighted Least Squares
In the presence of heteroskedasticity, the errors \(\epsilon_i\) have non-constant variance \(Var(\epsilon_i|\mathbf{x}_i) = \sigma_i^2\). This violates the Gauss-Markov assumption of homoskedasticity, leading to inefficient OLS estimates.
Weighted Least Squares (WLS) addresses this by applying weights inversely proportional to the variance of the errors, ensuring that observations with larger variances have less influence on the estimation.
Weighted Least Squares is essentially Generalized Least Squares in the special case that \(\mathbf{\Omega}\) is a diagonal matrix with variances \(\sigma_i^2\) on the diagonal (i.e., errors are uncorrelated but have non-constant variance).
That is, assume the errors are uncorrelated but heteroskedastic: \(\mathbf{\Omega} = \text{diag}\bigl(\sigma_1^2, \ldots, \sigma_n^2\bigr)\), so that \(\mathbf{\Omega}^{-1} = \text{diag}\bigl(1/\sigma_1^2, \ldots, 1/\sigma_n^2\bigr)\).
Steps for Feasible Weighted Least Squares (FWLS)
1. Initial OLS Estimation
First, estimate the model using OLS:
\[ y_i = \mathbf{x}_i\beta + \epsilon_i, \]
and compute the residuals:
\[ e_i = y_i - \mathbf{x}_i \hat{\beta}. \]
2. Model the Error Variance
Transform the residuals to model the variance as a function of the predictors:
\[ \ln(e_i^2) = \mathbf{x}_i\gamma + \ln(v_i), \]
where:
- \(e_i^2\) approximates \(\epsilon_i^2\),
- \(\ln(v_i)\) is the error term in this auxiliary regression, assumed independent of \(\mathbf{x}_i\).
Estimate this equation using OLS to obtain the predicted values:
\[ \hat{g}_i = \mathbf{x}_i \hat{\gamma}. \]
3. Estimate Weights
Use the predicted values from the auxiliary regression to compute the weights:
\[ \hat{\sigma}_i = \sqrt{\exp(\hat{g}_i)}. \]
These values approximate the standard deviations of the errors; the weights used in the next step are their inverses.
4. Weighted Regression
Transform the original equation by dividing through by \(\hat{\sigma}_i\):
\[ \frac{y_i}{\hat{\sigma}_i} = \frac{\mathbf{x}_i}{\hat{\sigma}_i}\beta + \frac{\epsilon_i}{\hat{\sigma}_i}. \]
Estimate the transformed equation using OLS. The resulting estimator is the Feasible Weighted Least Squares (FWLS) estimator:
\[ \hat{\beta}_{FWLS} = (\mathbf{X}'\mathbf{\hat{W}}\mathbf{X})^{-1}\mathbf{X}'\mathbf{\hat{W}}\mathbf{y}, \]
where \(\mathbf{\hat{W}}\) is a diagonal weight matrix with elements \(1/\hat{\sigma}_i^2\).
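These steps map directly onto statsmodels' `WLS`, whose `weights` argument is taken proportional to \(1/\hat{\sigma}_i^2\); the simulated heteroskedastic data below is only for illustration.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 300
x = rng.normal(size=n)
sigma2 = np.exp(0.2 + 0.8 * x)          # multiplicative exponential variance
y = 1.0 + 2.0 * x + rng.normal(size=n) * np.sqrt(sigma2)
X = sm.add_constant(x)

# Steps 1-3: OLS residuals, auxiliary log-variance regression, fitted variances
e = sm.OLS(y, X).fit().resid
g_hat = sm.OLS(np.log(e**2), X).fit().fittedvalues
sigma2_hat = np.exp(g_hat)

# Step 4: weighted regression with weights 1 / sigma_hat_i^2
res_fwls = sm.WLS(y, X, weights=1.0 / sigma2_hat).fit()
print(res_fwls.params)
```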
Properties of the FWLS Estimator
- Unbiasedness:
- The infeasible WLS estimator (where \(\sigma_i\) is known) is unbiased under assumptions A1-A3 for the unweighted model.
- The FWLS estimator is not unbiased due to the approximation of \(\sigma_i\) using \(\hat{\sigma}_i\).
- Consistency:
- The FWLS estimator is consistent under the following assumptions:
- A1 (for the unweighted equation): The model is linear in parameters.
- A2 (for the unweighted equation): The independent variables are linearly independent.
- A5: The data is randomly sampled.
- \(E(\mathbf{x}_i'\epsilon_i/\sigma_i^2) = 0\); the weaker exogeneity assumption A3a is not sufficient for this condition, but A3 is.
- Efficiency:
- The FWLS estimator is asymptotically more efficient than OLS if the errors have multiplicative exponential heteroskedasticity: \[ Var(\epsilon_i|\mathbf{x}_i) = \sigma_i^2 = \exp(\mathbf{x}_i\gamma). \]
- Usual Standard Errors:
- If the errors are truly multiplicative exponential heteroskedastic, the usual standard errors for FWLS are valid.
- Heteroskedastic Robust Standard Errors:
- If there is potential mis-specification of the multiplicative exponential model for \(\sigma_i^2\), heteroskedastic-robust standard errors should be reported to ensure valid inference.