Chapter 7 Adequacy
The outcomes of estimation and inference alone do not demonstrate a model's performance. If the primary assumptions are violated, the estimates could be biased and the model could be useless. The same problems can arise when the model is not correctly specified. It is therefore necessary to diagnose and validate a fitted model.
7.1 Goodness of fit
This variance structure tells us how well the model can explain the data. The coefficient of determination \(R^2\) is a proportion that assesses the quality of the fitted model.
\[\begin{equation} R^2 =\frac{SSR}{SST}=1-\frac{SSE}{SST} \tag{7.1} \end{equation}\]
When \(R^2\) is close to \(1\), most of the variation in the response can be explained by the fitted model. Although \(R^2\) is not the only criterion of a good model, it is reported in most published papers. Recall the discussion in Part I: aggregated data eliminate the differences among individuals, households, or neighborhoods. Under the new variance structure, \(SSE\) will be much smaller than in a disaggregated model. The \(R^2\) in many disaggregate studies is around 0.3, while the \(R^2\) in some aggregate studies can reach 0.8. A seriously underfitting model's outputs could be biased and unstable.
Adding predictors to the model will never decrease \(R^2\), so models with different numbers of predictors are not directly comparable by \(R^2\). Adjusted \(R^2\) addresses this issue by introducing degrees of freedom, which denote the amount of independent information available.
\[\begin{equation} \begin{split} df_T =& df_R + df_E\\ n-1=&(p-1)+(n-p) \end{split} \tag{7.2} \end{equation}\]
Then the mean square (MS) of each sum of squares (SS) is \(MS=SS/df\). The mean square error is an estimate of the error variance, \(\hat\sigma^2=MSE=SSE/(n-p)\), where \(n-p\) is its degrees of freedom. The adjusted \(R^2\) is
\[\begin{equation} R_{adj}^2 = 1-\frac{MSE}{MST} = 1-\frac{SSE/(n-p)}{SST/(n-1)} \tag{7.3} \end{equation}\]
A similar measure is the \(R^2\) for prediction based on PRESS. Recall that the PRESS statistic is the prediction error sum of squares obtained by fitting the model to the \(n-1\) observations that exclude the predicted point.
\[\begin{equation} PRESS = \sum_{i=1}^n(y_i-\hat y_{(i)})^2= \sum_{i=1}^n\left(\frac{e_i}{1-h_{ii}}\right)^2 \tag{7.4} \end{equation}\]
A model with a smaller PRESS has better predictive ability. The \(R^2\) for prediction is
\[\begin{equation} R_{pred}^2 = 1-\frac{PRESS}{SST} \tag{7.5} \end{equation}\]
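The sketch below, a minimal illustration assuming only numpy and a small simulated data set (the design matrix `X`, response `y`, and coefficients are illustrative, not from the text), evaluates Equations (7.1), (7.3), and (7.5) directly.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3                                    # n observations, p coefficients (incl. intercept)
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=1.0, size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)    # OLS estimate
e = y - X @ beta_hat                            # ordinary residuals

SST = np.sum((y - y.mean()) ** 2)
SSE = np.sum(e ** 2)
R2 = 1 - SSE / SST                              # Eq. (7.1)
R2_adj = 1 - (SSE / (n - p)) / (SST / (n - 1))  # Eq. (7.3)

h = np.diag(X @ np.linalg.solve(X.T @ X, X.T))  # leverages h_ii from the hat matrix
PRESS = np.sum((e / (1 - h)) ** 2)              # Eq. (7.4)
R2_pred = 1 - PRESS / SST                       # Eq. (7.5)

print(round(R2, 3), round(R2_adj, 3), round(R2_pred, 3))
```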
7.2 Residuals Analysis
The major assumptions, both IID and normality, concern the residuals. Residual diagnosis is therefore an essential step in model validation.
Several scaled residuals can help the diagnosis. Since \(MSE\) estimates the error variance \(\hat\sigma^2\) and \(E[\varepsilon]=0\), standardized residuals should approximately follow a standard normal distribution.
\[d_i=\frac{e_i}{\sqrt{MSE}}=e_i\sqrt{\frac{n-p}{\sum_{i=1}^n e_i^2}},\quad i=1,2,...,n\]
Recall that the residual vector is \(\mathbf{e}=\mathbf{y}-\mathbf{\hat y}=(\mathbf{I}-\mathbf{H})\mathbf{y}\) with hat matrix \(\mathbf{H}=\mathbf{X}(\mathbf{X'X})^{-1}\mathbf{X'}\). Let \(h_{ii}\) denote the \(i^{th}\) diagonal element of the hat matrix. Studentized residuals can be expressed by \[r_i=\frac{e_i}{\sqrt{MSE(1-h_{ii})}},\quad i=1,2,...,n\] It can be shown that \(0\le h_{ii}\le1\). An observation with \(h_{ii}\) close to one will return a large value of \(r_i\). An \(x_i\) that has a strong influence on the fitted value is called a leverage point.
Ideally, the scaled residuals have zero mean and unit variance. Hence, an observation with \(|d_i|>3\) or \(|r_i|>3\) is a potential outlier.
The prediction error sum of squares (PRESS) can also be used to detect outliers. This method predicts the \(i^{th}\) response from a fit that excludes the \(i^{th}\) observation and examines the influence of that point. The corresponding PRESS residual is \(e_{(i)}=e_{i}/(1-h_{ii})\) with \(V[e_{(i)}]=\sigma^2/(1-h_{ii})\). Thus, if \(MSE\) is a good estimate of \(\sigma^2\), the scaled PRESS residuals are equivalent to the studentized residuals.
\[\frac{e_{(i)}}{\sqrt{V[e_{(i)}]}}=\frac{e_i/(1-h_{ii})}{\sqrt{\sigma^2/(1-h_{ii})}}=\frac{e_i}{\sqrt{\sigma^2(1-h_{ii})}}\]
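As a minimal sketch, assuming a design matrix `X` and response `y` are already available as numpy arrays, the helper below computes the standardized, studentized, and PRESS residuals defined above; the function name `scaled_residuals` is illustrative.

```python
import numpy as np

def scaled_residuals(X, y):
    """Return standardized (d), studentized (r), and PRESS residuals, plus leverages."""
    n, p = X.shape
    H = X @ np.linalg.solve(X.T @ X, X.T)   # hat matrix
    h = np.diag(H)                          # leverages h_ii
    e = y - H @ y                           # ordinary residuals
    MSE = np.sum(e ** 2) / (n - p)
    d = e / np.sqrt(MSE)                    # standardized residuals
    r = e / np.sqrt(MSE * (1 - h))          # studentized residuals
    e_press = e / (1 - h)                   # PRESS residuals
    return d, r, e_press, h

# usage sketch (X, y assumed to exist):
# d, r, e_press, h = scaled_residuals(X, y)
# outliers = np.where((np.abs(d) > 3) | (np.abs(r) > 3))[0]   # potential outliers
```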
- Residual Plot
A residual plot shows the pattern of the residuals against the fitted values \(\mathbf{\hat y}\). If the assumptions are valid, the points should form a horizontal band, evenly distributed around the line \(e=0\).
A funnel shape in the residual plot indicates that the error variance is a function of \(\hat y\); a suitable transformation of the response or a predictor could stabilize the variance. A curved shape means the linearity assumption is not valid and implies that adding quadratic or higher-order terms might be appropriate.
- Normal Probability Plot
A histogram of the residuals can check the normality assumption. For example, because the probability distribution of VMT is highly right-skewed, a log-transform of VMT is reasonable.
A better way is a normal quantile-quantile (QQ) plot of the residuals, in which an ideal normal sample plots as a straight line. Looking only at \(R^2\) and p-values cannot disclose this feature. A sketch of both diagnostic plots is given below.
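The following plotting sketch assumes matplotlib and scipy are available and uses simulated fitted values and residuals purely for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(1)
y_hat = rng.uniform(0, 10, size=100)    # illustrative fitted values
e = rng.normal(scale=1.0, size=100)     # illustrative residuals

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# residual plot: points should scatter evenly around e = 0 with no funnel or curve
axes[0].scatter(y_hat, e, s=10)
axes[0].axhline(0, color="gray", lw=1)
axes[0].set_xlabel("fitted values")
axes[0].set_ylabel("residuals")

# normal QQ plot: points close to the reference line support the normality assumption
stats.probplot(e, dist="norm", plot=axes[1])

plt.tight_layout()
plt.show()
```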
7.3 Heteroscedasticity
When the assumption of constant variance is violated, the linear model is heteroscedastic. Heteroscedasticity is common in urban studies. For example, cities of different sizes are not identical: small cities or rural areas might have homogeneous values of population density, while large cities' densities are more variable.
Recall the generalized least squares estimates (5.4) and (5.5). If the errors are independent but their variances are not constant, for example \(V[\varepsilon_i]=\sigma^2x_i^2\), a simple linear model has \(\boldsymbol{\varepsilon}\sim MVN(\mathbf{0},\sigma^2\mathbf{V})\) where
\[\begin{equation} \mathbf{V}=\begin{bmatrix} x_1^2 & 0 & \dots & 0 \\ 0 & x_2^2 & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & x_n^2 \end{bmatrix},\quad \mathbf{V}^{-1}=\begin{bmatrix} \frac1{x_1^2} & 0 & \dots & 0 \\ 0 & \frac1{x_2^2} & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & \frac1{x_n^2} \end{bmatrix} \tag{7.6} \end{equation}\]
Then, for the no-intercept model \(y_i=\beta_1x_i+\varepsilon_i\), \(\mathbf{X'V^{-1}X}=\sum_{i=1}^n x_i^2/x_i^2=n\) and the generalized (weighted) least squares solution is
\[\begin{equation} \hat\beta_{1,WLS}=\frac1n\sum_{i=1}^{n}\frac{y_i}{x_i} \tag{7.7} \end{equation}\]
and
\[\begin{equation} \hat\sigma^2_{WLS}=\frac1{n-1}\sum_{i=1}^{n}\frac{(y_i-\hat\beta_{1}x_i)^2}{x_i^2} \tag{7.8} \end{equation}\]
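The numeric sketch below, using simulated data for a no-intercept model with \(V[\varepsilon_i]=\sigma^2x_i^2\) (all values illustrative), evaluates Equations (7.7) and (7.8) directly.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
x = rng.uniform(1, 5, size=n)
beta1, sigma = 2.0, 0.5
y = beta1 * x + rng.normal(scale=sigma * x)    # error sd grows with x

beta1_wls = np.mean(y / x)                                          # Eq. (7.7)
sigma2_wls = np.sum((y - beta1_wls * x) ** 2 / x ** 2) / (n - 1)    # Eq. (7.8)

# an equivalent fit could use statsmodels: sm.WLS(y, x, weights=1/x**2).fit()
print(round(beta1_wls, 3), round(sigma2_wls, 3))
```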
In a heteroscedastic model, the OLS estimates of the coefficients are still unbiased but no longer efficient, and the estimates of their variances are biased. The corresponding hypothesis tests and confidence intervals would be misleading.
Another special case is a model with aggregated variables, as in models on geographic units. Let \(u_j\) and \(v_j\) be the response and predictor of the \(j^{th}\) household in a neighborhood, and let \(n_i\) be the sample size in the \(i^{th}\) neighborhood. Then \(y_i=\sum_{j=1}^{n_i}u_j/n_i\) and \(x_i=\sum_{j=1}^{n_i}v_j/n_i\). In this case,
\[\begin{equation} \mathbf{V}=\begin{bmatrix} \frac1{n_1} & 0 & \dots & 0 \\ 0 & \frac1{n_2} & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & \frac1{n_n} \end{bmatrix},\quad \mathbf{V}^{-1}=\begin{bmatrix} n_1 & 0 & \dots & 0 \\ 0 & n_2 & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & n_n \end{bmatrix} \tag{7.9} \end{equation}\]
Then \(\mathbf{X'V^{-1}X}=\sum_{i=1}^nn_ix_i^2\) and the WLS estimate of \(\beta_1\) is
\[\begin{equation} \hat\beta_{1,WLS}=\frac{\sum_{i=1}^{n}n_ix_iy_i}{\sum_{i=1}^{n}n_ix_i^2} \tag{7.10} \end{equation}\]
and
\[\begin{equation} V[\hat\beta_{1,WLS}]=\frac{V[\sum_{i=1}^{n}n_ix_iy_i]}{(\sum_{i=1}^{n}n_ix_i^2)^2}=\frac{\sum_{i=1}^{n}n_i^2x_i^2\sigma^2/n_i}{(\sum_{i=1}^{n}n_ix_i^2)^2}=\frac{\sigma^2}{\sum_{i=1}^{n}n_ix_i^2} \tag{7.11} \end{equation}\]
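A similar sketch for the aggregated case, with simulated neighborhood means and illustrative sample sizes \(n_i\), checks Equations (7.10) and (7.11).

```python
import numpy as np

rng = np.random.default_rng(3)
m = 100                                    # number of neighborhoods
n_i = rng.integers(5, 50, size=m)          # households sampled per neighborhood
x = rng.uniform(0, 10, size=m)             # aggregated (mean) predictor
beta1, sigma = 1.5, 2.0
y = beta1 * x + rng.normal(scale=sigma / np.sqrt(n_i))    # Var[y_i] = sigma^2 / n_i

beta1_wls = np.sum(n_i * x * y) / np.sum(n_i * x ** 2)    # Eq. (7.10)
var_beta1 = sigma ** 2 / np.sum(n_i * x ** 2)             # Eq. (7.11)
print(round(beta1_wls, 3), var_beta1)
```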
Three procedures, Bartlett's likelihood ratio test, the Goldfeld-Quandt test, and the Breusch-Pagan test, can be used to examine heteroscedasticity (Ravishanker and Dey 2020, 8.1.3, pp. 288-290).
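As an illustration, one of these procedures, the Breusch-Pagan test, is available in statsmodels; the sketch below applies it to simulated data whose error variance grows with the predictor.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(4)
n = 300
x = rng.uniform(1, 10, size=n)
X = sm.add_constant(x)
y = 1.0 + 0.5 * x + rng.normal(scale=0.3 * x)   # heteroscedastic errors

fit = sm.OLS(y, X).fit()
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(fit.resid, X)
print(lm_pvalue)   # a small p-value suggests rejecting constant variance
```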
7.4 Autocorrelation
For spatio-temporal data, the observations often have some relationship over time or space. In these cases, the assumption of independent errors is violated; a linear model with serially correlated errors is said to exhibit autocorrelation. Autocorrelation is also common in urban studies: neighboring geographic entities or consecutive time stages can impact each other or share a similar environment.
Take a special case of time-series data as an example, and assume the model has constant variance: \(E[\varepsilon_i]=0\) but \(Cov[\varepsilon_i,\varepsilon_j]=\sigma^2\rho^{|j-i|}\), \(i,j=1,2,...,n\), with \(|\rho|<1\). The variance-covariance matrix has the Toeplitz form below
\[\begin{equation} \mathbf{V}=\begin{bmatrix} 1 & \rho & \rho^2 & \dots & \rho^{n-1} \\ \rho & 1 & \rho & \dots & \rho^{n-2} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \rho^{n-1} & \rho^{n-2} & \rho^{n-3} & \dots & 1 \end{bmatrix},\quad \{\mathbf{V}^{-1}\}_{ij}=\begin{cases} \frac{1}{1-\rho^2} & \text{if } i=j=1,n \\ \frac{1+\rho^2}{1-\rho^2} & \text{if } i=j=2,...,n-1 \\ \frac{-\rho}{1-\rho^2} & \text{if } |j-i|=1 \\ 0 & \text{otherwise} \end{cases} \tag{7.12} \end{equation}\]
This is a linear regression with first-order autoregressive, AR(1), errors. The estimates of \(\boldsymbol{\hat\beta}\) are the GLS solutions, \(\boldsymbol{\hat\beta}_{GLS}=(\mathbf{X'V^{-1}X})^{-1}\mathbf{X'V^{-1}}\mathbf{y}\) and \(\widehat{V[\boldsymbol{\hat\beta}_{GLS}]}=\hat\sigma^2_{GLS}(\mathbf{X'V^{-1}X})^{-1}\), where \(\hat\sigma^2_{GLS}=\frac1{n-p}(\mathbf{y-X}\boldsymbol{\hat\beta}_{GLS})'\mathbf{V^{-1}}(\mathbf{y-X}\boldsymbol{\hat\beta}_{GLS})\).
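A sketch of this GLS fit, assuming \(\rho\) is known (in practice it must be estimated, for example iteratively), builds the Toeplitz matrix of Equation (7.12) from simulated AR(1) data and passes it to statsmodels.

```python
import numpy as np
from scipy.linalg import toeplitz
import statsmodels.api as sm

rng = np.random.default_rng(5)
n, rho = 200, 0.6
x = rng.uniform(0, 10, size=n)
X = sm.add_constant(x)

# simulate AR(1) errors with stationary unit variance
eps = np.zeros(n)
eps[0] = rng.normal()
for t in range(1, n):
    eps[t] = rho * eps[t - 1] + rng.normal(scale=np.sqrt(1 - rho ** 2))
y = 2.0 + 0.5 * x + eps

V = toeplitz(rho ** np.arange(n))        # Toeplitz correlation matrix, Eq. (7.12)
gls_fit = sm.GLS(y, X, sigma=V).fit()    # beta_hat_GLS = (X'V^-1 X)^-1 X'V^-1 y
print(gls_fit.params)
```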
It can be verified that \(V[\boldsymbol{\hat\beta}_{GLS}]\le V[\boldsymbol{\hat\beta}_{OLS}]\) always holds, with equality when \(\mathbf{V=I}\) or \(\rho=0\). This shows that \(\boldsymbol{\hat\beta}_{GLS}\) is the best linear unbiased estimator (BLUE).
This case can be extended to multiple regression models and to the autocorrelation of a stationary stochastic process at lag \(k\). The Durbin-Watson test is used to test the null hypothesis of \(\rho=0\).
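Finally, a brief illustration of the Durbin-Watson statistic on OLS residuals with statsmodels: values near 2 are consistent with \(\rho=0\), while values well below 2 suggest positive autocorrelation (data simulated for illustration).

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(6)
n, rho = 200, 0.6
x = rng.uniform(0, 10, size=n)
eps = np.zeros(n)
for t in range(1, n):
    eps[t] = rho * eps[t - 1] + rng.normal()   # AR(1) errors
y = 2.0 + 0.5 * x + eps

ols_fit = sm.OLS(y, sm.add_constant(x)).fit()
print(durbin_watson(ols_fit.resid))   # expected to be noticeably below 2 here
```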