Chapter 7 Adequacy

The outcomes of estimation and inference alone cannot demonstrate a model's performance. If the primary assumptions are violated, the estimates could be biased and the model could be useless. These problems can also arise when the model is not correctly specified. It is therefore necessary to diagnose and validate fitted models.

7.1 Goodness of fit

This structure tells us how well the model can explain the data. The coefficient of determination $R^2$ is a proportion used to assess the quality of a fitted model.

$$R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}$$

When $R^2$ is close to 1, most of the variation in the response can be explained by the fitted model. Although $R^2$ is not the only criterion of a good model, it is reported in most published papers. Recall the discussion in Part I: aggregating data eliminates the differences among individuals, households, or neighborhoods. Under the new variance structure, SSE will be much smaller than in a disaggregated model. The $R^2$ in many disaggregate studies is around 0.3, while the $R^2$ in some aggregate studies can reach 0.8. A seriously underfitting model's outputs could be biased and unstable.

Adding predictors to the model never decreases $R^2$, so models with different numbers of predictors are not directly comparable. Adjusted $R^2$ addresses this issue by introducing degrees of freedom, which denote the number of independent pieces of information in each sum of squares.

$$df_T = df_R + df_E$$
$$n - 1 = (p - 1) + (n - p)$$

Then the mean square (MS) of each sum of squares (SS) can be calculated by $MS = SS/df$. The mean square error gives the expected value of the error variance, $\hat\sigma^2 = MSE = SSE/(n-p)$, where $n-p$ is its degrees of freedom. The adjusted $R^2$ is then

$$R^2_{adj} = 1 - \frac{MSE}{MST} = 1 - \frac{SSE/(n-p)}{SST/(n-1)}$$
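
As a concrete illustration, the following Python sketch computes $R^2$ and adjusted $R^2$ from the sums of squares above. The data are simulated, not taken from this chapter's examples:

```python
# A minimal sketch: R-squared and adjusted R-squared via the SS decomposition.
import numpy as np

rng = np.random.default_rng(7)
n, p = 100, 3                       # n observations, p coefficients (incl. intercept)
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=n)

beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
e = y - X @ beta_hat                # residuals

SST = np.sum((y - y.mean()) ** 2)   # df_T = n - 1
SSE = np.sum(e ** 2)                # df_E = n - p
R2 = 1 - SSE / SST
R2_adj = 1 - (SSE / (n - p)) / (SST / (n - 1))
print(R2, R2_adj)
```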

Another similar method is the $R^2$ for prediction based on PRESS. Recall that the PRESS statistic is the prediction error sum of squares obtained by fitting the model to $n-1$ observations at a time.

$$PRESS = \sum_{i=1}^{n}\left(y_i - \hat y_{(i)}\right)^2 = \sum_{i=1}^{n}\left(\frac{e_i}{1-h_{ii}}\right)^2$$

A model with a smaller PRESS has better predictive ability. The $R^2$ for prediction is

$$R^2_{pred} = 1 - \frac{PRESS}{SST}$$
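
Because $e_i$ and $h_{ii}$ come from a single fit, PRESS does not require refitting the model $n$ times. A minimal Python sketch on simulated data:

```python
# A minimal sketch: PRESS residuals via the hat matrix, and prediction R-squared.
import numpy as np

rng = np.random.default_rng(7)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T       # hat matrix
e = y - H @ y                              # ordinary residuals
h = np.diag(H)                             # leverages h_ii

PRESS = np.sum((e / (1 - h)) ** 2)
SST = np.sum((y - y.mean()) ** 2)
R2_pred = 1 - PRESS / SST
print(PRESS, R2_pred)
```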

7.2 Residual Analysis

The major assumptions, both IID and normality, are stated in terms of the residuals. Residual diagnosis is therefore an essential step in model validation.

Several scaled residuals can help with the diagnosis. Since MSE estimates the error variance $\hat\sigma^2$ and $E[\varepsilon] = 0$, standardized residuals should approximately follow a standard normal distribution.

$$d_i = \frac{e_i}{\sqrt{MSE}} = \frac{e_i}{\sqrt{\frac{1}{n-p}\sum_{i=1}^{n} e_i^2}}, \quad i = 1, 2, \dots, n$$

Recall that the random error is $e = y - \hat y = (I - H)y$ with hat matrix $H = X(X'X)^{-1}X'$. Let $h_{ii}$ denote the $i$th diagonal element of the hat matrix. Studentized residuals can be expressed as

$$r_i = \frac{e_i}{\sqrt{MSE(1-h_{ii})}}, \quad i = 1, 2, \dots, n$$

It can be proved that $0 \le h_{ii} \le 1$. An observation with $h_{ii}$ close to one will return a large value of $r_i$. An $x_i$ that has a strong influence on the fitted value is called a leverage point.

Ideally, the scaled residuals have zero mean and unit variance. Hence, an observation with $|d_i| > 3$ or $|r_i| > 3$ is a potential outlier.

The Predicted Residual Error Sum of Squares (PRESS) can also be used to detect outliers. This method predicts the $i$th fitted response by excluding the $i$th observation and examines the influence of that point. The corresponding error is $e_{(i)} = e_i/(1-h_{ii})$ with $V[e_{(i)}] = \sigma^2/(1-h_{ii})$. Thus, if MSE is a good estimate of $\sigma^2$, PRESS residuals are equivalent to studentized residuals.

$$\frac{e_{(i)}}{\sqrt{V[e_{(i)}]}} = \frac{e_i/(1-h_{ii})}{\sqrt{\sigma^2/(1-h_{ii})}} = \frac{e_i}{\sqrt{\sigma^2(1-h_{ii})}}$$
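
A Python sketch of both scaled residuals, again on simulated data, flagging potential outliers by the $|r_i| > 3$ rule:

```python
# A minimal sketch: standardized and studentized residuals, with outlier flags.
import numpy as np

rng = np.random.default_rng(7)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=n)
y[0] += 8.0                                # inject one outlier

H = X @ np.linalg.inv(X.T @ X) @ X.T
e = y - H @ y
h = np.diag(H)
MSE = np.sum(e ** 2) / (n - p)

d = e / np.sqrt(MSE)                       # standardized residuals
r = e / np.sqrt(MSE * (1 - h))             # studentized residuals
print(np.where(np.abs(r) > 3)[0])          # indices of potential outliers
```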

  • Residual Plot

A residual plot shows the pattern of the residuals against the fitted values $\hat y$. If the assumptions are valid, the points should form an even horizontal band around the line $e = 0$.

A funnel shape in the residual plot shows that the error variance is a function of $\hat y$; a suitable transformation of the response or predictor could stabilize the variance. A curved shape means the linearity assumption is not valid and implies that adding quadratic or higher-order terms might be suitable.
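
A minimal Python sketch of such a plot, using simulated data whose error variance grows with the predictor so that the funnel shape is visible (matplotlib is assumed):

```python
# A minimal sketch: residual-vs-fitted plot; a funnel shape signals
# non-constant variance.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(7)
x = rng.uniform(1, 10, 200)
y = 2 * x + rng.normal(scale=0.5 * x)      # error sd grows with x
X = np.column_stack([np.ones_like(x), x])
y_hat = X @ np.linalg.lstsq(X, y, rcond=None)[0]

plt.scatter(y_hat, y - y_hat, s=10)
plt.axhline(0, color="red", lw=1)
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()
```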

  • Normal Probability Plot

A histogram of the residuals can check the normality assumption. For example, because the probability distribution of VMT is highly right-skewed, a log-transform of VMT is reasonable.

A better way is a normal quantile-quantile (Q-Q) plot of the residuals. An ideal normal sample plots as a straight line. Looking only at $R^2$ and p-values cannot disclose this feature.
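
A Python sketch of a Q-Q plot using scipy's probplot (the residuals here are simulated stand-ins for fitted-model residuals):

```python
# A minimal sketch: normal Q-Q plot; points along the reference line
# support the normality assumption.
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt

rng = np.random.default_rng(7)
e = rng.normal(size=200)                   # stand-in for model residuals
stats.probplot(e, dist="norm", plot=plt)
plt.show()
```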

7.3 Heteroscedasticity

When the assumption of constant variance is violated, the linear model is heteroscedastic. Heteroscedasticity is common in urban studies. For example, cities of different sizes are not identical: small cities or rural areas might have homogeneous values of population density, while large cities' densities are more variable.

Recall the generalized least squares estimates (5.4) and (5.5). If the residuals are independent but the variances are not constant, a simple linear model has $\varepsilon \sim MVN(0, \sigma^2 V)$ where

$$V = \begin{bmatrix} x_1^2 & 0 & \cdots & 0 \\ 0 & x_2^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & x_n^2 \end{bmatrix}, \quad V^{-1} = \begin{bmatrix} \frac{1}{x_1^2} & 0 & \cdots & 0 \\ 0 & \frac{1}{x_2^2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \frac{1}{x_n^2} \end{bmatrix}$$

Then $X'V^{-1}X = n$ and the generalized least squares solution is

$$\hat\beta_{1,WLS} = \frac{1}{n}\sum_{i=1}^{n}\frac{y_i}{x_i}$$

and

$$\hat\sigma^2_{WLS} = \frac{1}{n-1}\sum_{i=1}^{n}\frac{(y_i - \hat\beta_1 x_i)^2}{x_i^2}$$

In a heteroscedastic model, the OLS estimates of the coefficients are still unbiased but no longer efficient, and the estimates of their variances are biased. The corresponding hypothesis tests and confidence intervals would therefore be misleading.
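
A Python sketch of the first special case above, using statsmodels, whose WLS weights are proportional to $1/\sigma_i^2$; the simulated data have error standard deviation proportional to $x_i$, so the weights are $1/x_i^2$:

```python
# A minimal sketch: WLS vs OLS when Var[e_i] is proportional to x_i^2.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
x = rng.uniform(1, 10, 200)
y = 2 * x + rng.normal(scale=x)            # error sd proportional to x
X = x[:, None]                             # no-intercept model, as above

wls = sm.WLS(y, X, weights=1 / x**2).fit()
ols = sm.OLS(y, X).fit()
print(wls.params, wls.bse)                 # efficient estimate
print(ols.params, ols.bse)                 # unbiased but inefficient
print(np.mean(y / x))                      # matches the closed form (1/n) sum(y_i/x_i)
```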

Another special case is the model with aggregated variables, as in the case of geographic units. Let $u_j$ and $v_j$ be the response and predictors of the $j$th household in a neighborhood, and let $n_i$ be the sample size in neighborhood $i$. Then $y_i = \sum_{j=1}^{n_i} u_j / n_i$ and $X_i = \sum_{j=1}^{n_i} v_j / n_i$. In this case,

$$V = \begin{bmatrix} \frac{1}{n_1} & 0 & \cdots & 0 \\ 0 & \frac{1}{n_2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \frac{1}{n_n} \end{bmatrix}, \quad V^{-1} = \begin{bmatrix} n_1 & 0 & \cdots & 0 \\ 0 & n_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & n_n \end{bmatrix}$$

Then $X'V^{-1}X = \sum_{i=1}^{n} n_i x_i^2$ and the WLS estimate of $\beta_1$ is

$$\hat\beta_{1,WLS} = \frac{\sum_{i=1}^{n} n_i x_i y_i}{\sum_{i=1}^{n} n_i x_i^2}$$

and

$$V[\hat\beta_{1,WLS}] = \frac{V\left[\sum_{i=1}^{n} n_i x_i y_i\right]}{\left(\sum_{i=1}^{n} n_i x_i^2\right)^2} = \frac{\sum_{i=1}^{n} n_i^2 x_i^2 \cdot \sigma^2/n_i}{\left(\sum_{i=1}^{n} n_i x_i^2\right)^2} = \frac{\sigma^2}{\sum_{i=1}^{n} n_i x_i^2}$$
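
A short numpy sketch of this aggregated-data case, checking the closed-form estimate and its variance on simulated neighborhood means (taking $\sigma^2 = 1$):

```python
# A minimal sketch: WLS for aggregated (group-mean) data with weights n_i.
import numpy as np

rng = np.random.default_rng(7)
m = 50                                     # number of neighborhoods
n_i = rng.integers(5, 100, m)              # households per neighborhood
x = rng.uniform(1, 10, m)                  # aggregated predictor
y = 2 * x + rng.normal(scale=1 / np.sqrt(n_i))   # Var[y_i] = sigma^2 / n_i

beta_wls = np.sum(n_i * x * y) / np.sum(n_i * x**2)
var_beta = 1.0 / np.sum(n_i * x**2)        # sigma^2 = 1 here
print(beta_wls, var_beta)
```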

Three procedures, Bartlett's likelihood ratio test, the Goldfeld-Quandt test, and the Breusch-Pagan test, can be used to examine heteroscedasticity (Ravishanker and Dey 2020, 8.1.3, pp. 288–290).
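
As one example, statsmodels provides the Breusch-Pagan test; a small p-value rejects the null hypothesis of constant variance. A sketch on simulated heteroscedastic data:

```python
# A minimal sketch: Breusch-Pagan test on OLS residuals.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(7)
x = rng.uniform(1, 10, 200)
X = sm.add_constant(x)
y = 1 + 2 * x + rng.normal(scale=x)        # heteroscedastic errors

resid = sm.OLS(y, X).fit().resid
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(resid, X)
print(lm_stat, lm_pvalue)                  # small p-value: reject constant variance
```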

7.4 Autocorrelation

For spatio-temporal data, the observations often have some relationship over time or space. In these cases the assumption of independent errors is violated, and a linear model with serially correlated errors is said to exhibit autocorrelation. Autocorrelation is also common in urban studies: neighboring geographic entities or time stages can impact each other or share a similar environment.

Take a special case of time-series data as an example, assuming the model has constant variance, $E[\varepsilon] = 0$, and $Cov[\varepsilon_i, \varepsilon_j] = \sigma^2\rho^{|j-i|}$ for $i, j = 1, 2, \dots, n$ with $|\rho| < 1$. The variance-covariance matrix, a Toeplitz matrix, is shown below.

$$V = \begin{bmatrix} 1 & \rho & \rho^2 & \cdots & \rho^{n-1} \\ \rho & 1 & \rho & \cdots & \rho^{n-2} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \rho^{n-1} & \rho^{n-2} & \rho^{n-3} & \cdots & 1 \end{bmatrix}, \quad \{V^{-1}\}_{ij} = \begin{cases} \frac{1}{1-\rho^2} & \text{if } i = j = 1, n \\ \frac{1+\rho^2}{1-\rho^2} & \text{if } i = j = 2, \dots, n-1 \\ \frac{-\rho}{1-\rho^2} & \text{if } |j - i| = 1 \\ 0 & \text{otherwise} \end{cases}$$

This is a linear regression with autoregressive errors of order one (AR(1)). The estimates of $\hat\beta$ are the same as the GLS solutions, namely $\hat\beta_{GLS} = (X'V^{-1}X)^{-1}X'V^{-1}y$ and $\hat V[\hat\beta_{GLS}] = \hat\sigma^2_{GLS}(X'V^{-1}X)^{-1}$, where $\hat\sigma^2_{GLS} = \frac{1}{n-p}(y - X\hat\beta_{GLS})'V^{-1}(y - X\hat\beta_{GLS})$.
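
A Python sketch of GLS under AR(1) errors, building $V$ as a Toeplitz matrix with scipy; $\rho$ is assumed known here, whereas in practice it must be estimated:

```python
# A minimal sketch: GLS with a known AR(1) error covariance matrix.
import numpy as np
import statsmodels.api as sm
from scipy.linalg import toeplitz

rng = np.random.default_rng(7)
n, rho = 200, 0.6
x = rng.uniform(0, 10, n)
X = sm.add_constant(x)

e = np.zeros(n)                            # simulate AR(1) errors
for t in range(1, n):
    e[t] = rho * e[t - 1] + rng.normal()
y = 1 + 2 * x + e

V = toeplitz(rho ** np.arange(n))          # V_ij = rho^|i-j|
gls = sm.GLS(y, X, sigma=V).fit()
print(gls.params, gls.bse)
```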

It can be verified that $V[\hat\beta_{GLS}] \le V[\hat\beta_{OLS}]$ always holds, with equality when $V = I$ or $\rho = 0$. This shows that $\hat\beta_{GLS}$ is the best linear unbiased estimator (BLUE).

This case can be extended to multiple regression models and to the autocorrelation of a stationary stochastic process at lag $k$. The Durbin-Watson test is used to test the null hypothesis of $\rho = 0$.
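
A minimal sketch of the Durbin-Watson statistic on OLS residuals from a simulated AR(1) process; values near 2 are consistent with $\rho = 0$, while values well below 2 suggest positive autocorrelation:

```python
# A minimal sketch: Durbin-Watson statistic for serially correlated errors.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(7)
n, rho = 200, 0.6
x = rng.uniform(0, 10, n)
e = np.zeros(n)                            # simulate AR(1) errors
for t in range(1, n):
    e[t] = rho * e[t - 1] + rng.normal()
y = 1 + 2 * x + e

resid = sm.OLS(y, sm.add_constant(x)).fit().resid
print(durbin_watson(resid))                # well below 2 for positive rho
```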

References

Ravishanker, Nalini, and Dipak K. Dey. 2020. A First Course in Linear Model Theory. CRC Press. https://books.google.com?id=i2r0DwAAQBAJ.