CHAPTER 12 Multicollinearity

Recall:

  • the OLS estimator for the coefficients is \(\hat{\boldsymbol{\beta}}=(\textbf{X}'\textbf{X})^{-1}\textbf{X}'\textbf{Y}\)

  • the variance of the OLS estimator is \(Var(\hat{\boldsymbol{\beta}})=\sigma^2(\textbf{X}'\textbf{X})^{-1}\)

  • The design matrix \(\textbf{X}\) must be of full column rank (i.e., its columns must be linearly independent) for \((\textbf{X}'\textbf{X})^{-1}\) to exist.

Definition 12.1 Multicollinearity occurs when two or more predictors in a regression model are moderately or highly correlated.


12.1 Types and Causes of Multicollinearity


Based on strength of linear dependence

1. Perfect Multicollinearity

This exists if we can find a set of values \(c_1, c_2,...,c_p\), not all zero, such that \[ c_1 \textbf{x}_1+c_2 \textbf{x}_2 + \cdots+c_p \textbf{x}_p=\textbf{0} \] That is, an independent variable \(X_j\) is a perfect linear function of one or more of the other regressors.

This occurs when:

  • using two indicators to measure the same variable, but in different units

    e.g. Celsius and Fahrenheit

  • including a variable that can be computed (linearly) using other variables.

    For example, in modelling interest rates, domestic liquidity, money supply, deposit substitutes, and quasi money are important factors. But domestic liquidity is sometimes computed as the sum of the other three.

In practice, perfect multicollinearity is not common. If this case occurs, it is very easy to detect using basic reasoning and common sense.

Example of perfect multicollinearity
    icecream <- readr::read_csv("icecream.csv")
ice_cream_sales temp_celsius temp_fahrenheit advertising_budget
2838 22.2 71.96 510
2778 23.8 74.84 500
2780 32.8 91.04 500
3151 25.4 77.72 570
2721 25.6 78.08 490
3216 33.6 92.48 580
2427 27.3 81.14 420
2934 18.7 65.66 530
2697 21.6 70.88 510
2811 22.8 73.04 510
2962 31.1 87.98 520
2648 26.8 80.24 470
2753 27.0 80.60 480
2572 25.6 78.08 450
2502 22.2 71.96 450
2842 33.9 93.02 520
3066 27.5 81.50 520
2705 15.2 59.36 500
3098 28.5 83.30 550
3302 22.6 72.68 600
2675 19.7 67.46 480
2253 23.9 75.02 380
2990 19.9 67.82 550
2606 21.4 70.52 460
2550 21.9 71.42 470
3015 16.6 61.88 550
2748 29.2 84.56 490
2624 25.8 78.44 440
2706 19.3 66.74 510
2805 31.3 88.34 490
2864 27.1 80.78 500
2907 23.5 74.30 520
2723 29.5 85.10 480
2955 29.4 84.92 530
2793 29.1 84.38 490
2916 28.4 83.12 520
3013 27.8 82.04 550
2908 24.7 76.46 520
2713 23.5 74.30 480
3077 23.1 73.58 560
2961 21.5 70.70 550
2951 24.0 75.20 530
2866 18.7 65.66 510
2774 35.8 96.44 470
3143 31.0 87.80 570
2572 19.4 66.92 470
3344 23.0 73.40 610
3128 22.7 72.86 580
2672 28.9 84.02 490
2584 24.6 76.28 450

In this example, temp_celsius is perfectly linearly related to temp_fahrenheit because one is a direct linear transformation of the other.

    plot(icecream$temp_celsius, icecream$temp_fahrenheit)

Now, we try to create a linear model where both temp_celsius and temp_fahrenheit are included.

    icecream_model <- lm(ice_cream_sales ~ temp_celsius + temp_fahrenheit + advertising_budget, 
    data = icecream)
    summary(icecream_model)
## 
## Call:
## lm(formula = ice_cream_sales ~ temp_celsius + temp_fahrenheit + 
##     advertising_budget, data = icecream)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -123.821  -27.459    8.836   30.855  159.393 
## 
## Coefficients: (1 not defined because of singularities)
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        266.0505    95.6013   2.783 0.007734 ** 
## temp_celsius         6.5174     1.6536   3.941 0.000268 ***
## temp_fahrenheit          NA         NA      NA       NA    
## advertising_budget   4.7333     0.1668  28.372  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 53.45 on 47 degrees of freedom
## Multiple R-squared:  0.9455, Adjusted R-squared:  0.9432 
## F-statistic: 407.7 on 2 and 47 DF,  p-value: < 2.2e-16

What is the coefficient estimate for temp_fahrenheit?
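Since temp_fahrenheit is an exact linear function of temp_celsius, lm() marks it as aliased and reports NA for its coefficient. Below is a minimal sketch with simulated data (the variable names are illustrative, not taken from icecream.csv) showing how the base R function alias() recovers the exact linear dependence.

    # a perfectly collinear column is dropped by lm()
    set.seed(136)
    celsius    <- runif(30, 15, 35)
    fahrenheit <- celsius * 9/5 + 32          # exact linear transformation of celsius
    sales      <- 2000 + 30 * celsius + rnorm(30, sd = 50)
    fit <- lm(sales ~ celsius + fahrenheit)
    coef(fit)                                 # the coefficient of fahrenheit is NA
    alias(fit)                                # shows fahrenheit = 32 + 1.8 * celsius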

2. Near or imperfect Multicollinearity

Even if the inverse of \(\textbf{X}'\textbf{X}\) exists, near dependencies can still be found in the data matrix. Cases like this happen when it is possible to find constants \(c_1,c_2,\cdots,c_p\), not all zero, such that

\[ c_1 \textbf{x}_1+c_2 \textbf{x}_2 + \cdots+c_p \textbf{x}_p \approx \textbf{0} \quad \textit{(near 0)} \]

This case is more common than perfect dependencies but sometimes harder to detect and difficult to remedy, which makes it more troublesome.

Example of imperfect multicollinearity

Consider a blood pressure dataset with the following variables:
  • bp (blood pressure in mm Hg)
  • age (in years)
  • weight (in kg)
  • bsa (body surface area in sqm)
  • dur (duration of hypertension in years)
  • pulse (basal pulse in beats per minute)
  • stress (stress index)
bloodpressure <- readr::read_csv("bloodpressure.csv")
patient bp age weight bsa dur pulse stress
1 105 47 85.4 1.75 5.1 63 33
2 115 49 94.2 2.10 3.8 70 14
3 116 49 95.3 1.98 8.2 72 10
4 117 50 94.7 2.01 5.8 73 99
5 112 51 89.4 1.89 7.0 72 95
6 121 48 99.5 2.25 9.3 71 10
7 121 49 99.8 2.25 2.5 69 42
8 110 47 90.9 1.90 6.2 66 8
9 110 49 89.2 1.83 7.1 69 62
10 114 48 92.7 2.07 5.6 64 35
11 114 47 94.4 2.07 5.3 74 90
12 115 49 94.1 1.98 5.6 71 21
13 114 50 91.6 2.05 10.2 68 47
14 106 45 87.1 1.92 5.6 67 80
15 125 52 101.3 2.19 10.0 76 98
16 114 46 94.5 1.98 7.4 69 95
17 106 46 87.0 1.87 3.6 62 18
18 113 46 94.5 1.90 4.3 70 12
19 110 48 90.5 1.88 9.0 71 99
20 122 56 95.7 2.09 7.0 75 99
cor_matrix <- cor(bloodpressure[,2:8])
corrplot::corrplot(cor_matrix, method = "number")

From this correlation matrix, we can see that weight is highly correlated with bsa. Also, pulse has a moderate positive correlation with several variables. Sample scatter plots of these pairs can be used to visualize the relationships.

Let’s create a linear model that predicts bp (blood pressure) using all of the available variables.

    bp_model <- lm(bp ~ age+weight+bsa+dur+pulse+stress, data = bloodpressure)
    summary(bp_model)
## 
## Call:
## lm(formula = bp ~ age + weight + bsa + dur + pulse + stress, 
##     data = bloodpressure)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.93213 -0.11314  0.03064  0.21834  0.48454 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -12.870476   2.556650  -5.034 0.000229 ***
## age           0.703259   0.049606  14.177 2.76e-09 ***
## weight        0.969920   0.063108  15.369 1.02e-09 ***
## bsa           3.776491   1.580151   2.390 0.032694 *  
## dur           0.068383   0.048441   1.412 0.181534    
## pulse        -0.084485   0.051609  -1.637 0.125594    
## stress        0.005572   0.003412   1.633 0.126491    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4072 on 13 degrees of freedom
## Multiple R-squared:  0.9962, Adjusted R-squared:  0.9944 
## F-statistic: 560.6 on 6 and 13 DF,  p-value: 6.395e-15

In this multiple linear regression model, what is the coefficient of pulse?

Now, let’s try a simple linear regression model containing pulse as the only predictor.

lm(bp ~ pulse, data = bloodpressure) |> summary()
## 
## Call:
## lm(formula = bp ~ pulse, data = bloodpressure)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.4418 -2.4978 -0.3672  1.8455  7.6179 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   42.323     16.240   2.606 0.017871 *  
## pulse          1.030      0.233   4.420 0.000331 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.863 on 18 degrees of freedom
## Multiple R-squared:  0.5204, Adjusted R-squared:  0.4938 
## F-statistic: 19.53 on 1 and 18 DF,  p-value: 0.0003307

The sign of the coefficient of pulse is now positive, and it becomes significant, in contrast with the full model where it was negative and insignificant.


Based on cause of multicollinearity

  1. Structural Multicollinearity

    is a mathematical artifact caused by creating new predictors from other predictors — such as creating the predictor \(x^2\) from the predictor \(x\).

  2. Data-based Multicollinearity

    a result of a poorly designed experiment, reliance on purely observational data, or the inability to manipulate the system on which the data are collected.


Remarks: Multicollinearity is a non-statistical problem

  • The primary source of the problem is how the model is specified and the inclusion of several regressors that are linearly related to one another.
  • Multicollinearity has to do with specific characteristics of the data matrix and not the statistical aspects of the linear regression model.
  • This means that multicollinearity is a data problem, NOT a statistical problem.
  • It is a non-statistical problem that is of great importance to the efficiency of least squares estimation.

Multicollinear data frequently arise and cause problems in many applications of linear regression, such as in econometrics, oceanography, geophysics, and other fields that rely on non-experimental data.


12.2 Implications of Multicollinearity

Why should multicollinearity be avoided?


Perfect Multicollinearity: Mathematical Standpoint

Recall the definition of linear dependence

If \(\textbf{X}=[\textbf{x}_1, \textbf{x}_2, …,\textbf{x}_p]\), where the \(\textbf{x}_j\)s are vectors of dimension \(n\times 1\), the \(\textbf{x}_j\)s are linearly dependent if there exist constants \(c_1, c_2,…,c_p\), not all zero, such that \(\sum_{j=1}^pc_j\textbf{x}_j = \textbf{0}\).

Otherwise, if the only constants that satisfy \(\sum_{j=1}^pc_j\textbf{x}_j = \textbf{0}\) are \(c_1=c_2=\cdots=c_p=0\), then the \(\textbf{x}_j\)s are linearly independent.

The rank of the matrix \(\textbf{X}\) is defined to be the number of linearly independent rows (or columns) of \(\textbf{X}\).

Recall that the OLS estimator of \(\boldsymbol{\beta}\) is derived as \(\hat{\boldsymbol{\beta}}=(\textbf{X}'\textbf{X})^{-1}\textbf{X}'\textbf{Y}\)

  • For the case of perfect multicollinearity, the \(\textbf{x}_j\)s are not independent \(\rightarrow\) \(\textbf{X}\) is not full column rank \(\rightarrow\) \((\textbf{X}'\textbf{X})^{-1}\) does not exist
  • If the \(\textbf{x}_j\)s are nearly linearly dependent, \(\textbf{X}'\textbf{X}\) is close to singular and \((\textbf{X}'\textbf{X})^{-1}\) becomes unstable. This problem is often called an ill-conditioned problem

Exercise 12.1 What is an Ill-Conditioned Problem?


Imperfect Multicollinearity: Logical Standpoint

Imperfect Multicollinearity does not violate any assumptions since \(\hat{\boldsymbol{\beta}}\) can still be computed. So why do we still need to care about multicollinearity?

Suppose you have two variables \(X_1\) and \(X_2\) that you want to use as predictors of \(Y\). What if these two independent variables are linearly related to each other? For example, \(X_1=\alpha_0+\alpha_1X_2+\epsilon\)

  • Redundancy in Information: Highly correlated predictor variables provide redundant information, which does not add unique insights to the model.

  • Interpretation Difficulty: It becomes challenging to determine the individual effect of each predictor on the outcome, making the model’s results harder to interpret.

By avoiding multicollinearity, we ensure that each predictor variable contributes unique and valuable information to the model.


Major Implications: If multicollinearity exists, then it is likely that the model will have

  • Wrong signs for regression coefficients
  • Larger variance and standard errors of the OLS estimators.
  • Wider confidence and prediction intervals
  • Insignificant t-test statistics for individual coefficients
  • A high \(R^2\) but few significant t-test statistics.
  • Difficulty in assessing the individual contributions of explanatory variables to the explained sum of squares or \(R^2\)

12.3 Detection and Analysis of Multicollinearity Structure

There are many ways of detecting multicollinearity.

Some early indicators are:

  • signs of the coefficients are reversed compared to the expected effect of the predictors

  • the independent variables are pairwise correlated. However, take note that using correlations is pretty limited.

    • The problem of multicollinearity exists when the joint association of the independent variables affects the modeling process.

    • However, the absence of pairwise correlation does not necessarily indicate the absence of multicollinearity.

    • Joint correlation of the independent variables will not be a problem if it is too weak to affect the modeling.

  • High \(R^2\) but low t-Statistic values.

Other methods are discussed in the following sections.


Variance Inflation Factors (VIF)

Recall: \(Var(\hat{\boldsymbol{\beta}})=\sigma^2(\textbf{X}'\textbf{X})^{-1}\), hence \(Var(\hat{\beta_j})\) is the \((j+1)^{th}\) diagonal element of \(\sigma^2(\textbf{X}'\textbf{X})^{-1}\), \(j=0,1,...,k\)

Q: What does it mean when the variance of an estimator is large?

Definition 12.2 The variance inflation factor estimates how much the variance of a regression coefficient estimate is inflated due to multicollinearity. It looks at the extent to which an explanatory variable can be explained by all the other explanatory variables in the equation.

  • \(\text{VIF}\geq1\)

  • A high \(\text{VIF}_j\) for an independent variable \(X_j\) indicates a strong collinearity with other variables, suggesting the need for adjustments in the model’s structure and the selection of independent variables.

  • \(\text{VIF}_j\gg10\) indicates severe variance inflation for the parameter estimator associated with \(X_j\).

For our class, we will use 5 as the limit of the VIF

Remark 1: VIF as function of the correlation matrix of the independent variables

  • Recall Theorem 4.18

  • If \(\textbf{X}^*\) is the standardized (centered and scaled) version of \(\textbf{X}\) and \(\textbf{R}_{xx}\) is the correlation matrix of the independent variables, then

    \[ {\textbf{X}^*}^T{\textbf{X}^*}=(n-1)\textbf{R}_{xx}\\ \rightarrow \frac{1}{(n-1)} {\textbf{X}^*}^T{\textbf{X}^*}=\textbf{R}_{xx} \]

  • The \(\text{VIF}_j\) of the \(j^{th}\) variable (or the \(j^{th}\) coefficient estimate \(\hat{\beta}_j\)) is the \(j^{th}\) diagonal element of \((\textbf{R}_{xx})^{-1}\)

    Illustration

    solve(cor(bloodpressure[,-c(1,2)]))
    ##                age      weight         bsa         dur      pulse      stress
    ## age     1.76280672  0.70881848 -0.74146035 -0.19153608 -1.1197911 -0.03312947
    ## weight  0.70881848  8.41703503 -5.72973187 -0.03656198 -4.1565746  1.67143351
    ## bsa    -0.74146035 -5.72973187  5.32875147  0.09487404  2.0822594 -0.71226463
    ## dur    -0.19153608 -0.03656198  0.09487404  1.23730942 -0.3207191 -0.15317745
    ## pulse  -1.11979107 -4.15657462  2.08225936 -0.32071910  4.4135752 -1.61796715
    ## stress -0.03312947  1.67143351 -0.71226463 -0.15317745 -1.6179671  1.83484532
M <- model.matrix(bp_model)
# compute sd of each column
apply(M, 2, sd)
## (Intercept)         age      weight         bsa         dur       pulse 
##   0.0000000   2.5005263   4.2949052   0.1364821   2.1452763   3.8030459 
##      stress 
##  37.0863502
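To make the connection explicit, here is a short check (assuming the bloodpressure data and bp_model from above are in the workspace): the diagonal of the inverse of the predictors' correlation matrix reproduces the VIFs reported by the car package's vif() function, which is used again later in this section.

    # diagonal of R_xx^{-1} versus the VIFs
    Rxx <- cor(bloodpressure[, c("age", "weight", "bsa", "dur", "pulse", "stress")])
    diag(solve(Rxx))
    car::vif(bp_model)   # should match the diagonal above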

 

Remark 2: VIF as a function of \(R^2_j\)

  • Let \(R^2_j\) be the coefficient of determination obtained when \(X_j\) is regressed on the other \(k-1\) independent variables.

  • \(\text{VIF}_j=\frac{1}{1-R^2_j}\)

  • Thus, as \(R^2_j\rightarrow 0\) then \(\text{VIF}_j \rightarrow 1\)

  • Conversely, as \(R^2_j\rightarrow 1\) then \(\text{VIF}_j \rightarrow \infty\)
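As a quick illustration of this formula (assuming bloodpressure and bp_model from earlier are loaded), the VIF of weight can be computed by regressing weight on the other predictors:

    # VIF_j = 1 / (1 - R^2_j), illustrated for weight
    r2_weight <- summary(lm(weight ~ age + bsa + dur + pulse + stress,
                            data = bloodpressure))$r.squared
    1 / (1 - r2_weight)   # approximately 8.42, the VIF of weight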

 

Remark 3: How does VIF “inflate” the variance?

  • The variance of the estimator \(\hat{\beta}_j\) can be expressed as \[ Var(\hat{\beta}_j)=\frac{\sigma^2}{(n-1)\widehat{Var}(X_j)}\text{VIF}_j \] which separates the influences of several distinct factors on the variance of the coefficient estimate:

    • \(\sigma^2\) : greater error variance leads to proportionately more variance in \(\hat{\beta}_j\)
    • \(n\) : a greater sample size yields a lower variance of \(\hat{\beta}_j\)
    • \(\widehat{Var}(X_j)\) : greater variability of the particular covariate \(X_j\) results in a lower variance of \(\hat{\beta}_j\)
  • This gives us an intuitive interpretation of the \(\text{VIF}\): it is the factor by which the variance of \(\hat{\beta}_j\) is inflated beyond what is due to the error variance, the sample size \(n\), and the variation of \(X_j\).

  • This explains why \(\text{VIF}_j\) is interpreted as the inflation of the variance of the estimated coefficient due to the effects of the other variables, that is, due to multicollinearity.
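A small numerical check of this decomposition (assuming bp_model and bloodpressure from earlier) compares the formula with the corresponding diagonal entry of the estimated variance-covariance matrix of the coefficients:

    # Var(beta_hat_j) = sigma^2 / ((n - 1) * Var(X_j)) * VIF_j, checked for weight
    s2  <- summary(bp_model)$sigma^2
    n   <- nrow(bloodpressure)
    v_w <- s2 / ((n - 1) * var(bloodpressure$weight)) * car::vif(bp_model)["weight"]
    c(from_formula = unname(v_w), from_vcov = vcov(bp_model)["weight", "weight"])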

 

Definition 12.3 Some software use \(\frac{1}{VIF}\), called the tolerance value, instead of the VIF. The tolerance limits frequently used are 0.01, 0.001, 0.0001.

Limitation: \(\text{VIF}\) only measures the combined effect of the dependencies among the regressors on the variance of the \(j^{th}\) slope. Therefore, \(\text{VIF}\)s cannot distinguish between several simultaneous multicollinearities, nor reveal the structure of the dependencies.

Example of VIF in R:

You can use the vif() function of the car package to obtain the VIF of the estimated coefficients of an lm object.

bp_model <- lm(bp ~ age + weight + bsa + dur + pulse + stress, data = bloodpressure)
car::vif(bp_model)
##      age   weight      bsa      dur    pulse   stress 
## 1.762807 8.417035 5.328751 1.237309 4.413575 1.834845

In this output, the \(\text{VIF}(weight)=8.42>5\) which may imply that weight is highly collinear with the other variables, hence can be explained by the rest of the variables.


Condition Number

Recall of Eigenvalues

A square \(n\times n\) matrix \(\textbf{A}\) has an eigenvalue \(\lambda\) if \(\textbf{A}\textbf{x}=\lambda\textbf{x}\), \(\textbf{x}\neq \textbf{0}\)

  • Any \(\textbf{x}\neq \textbf{0}\) that satisfies the above equation is an eigenvector corresponding to \(\lambda\).

  • The eigenvalues are the \(n\) (real) solutions for \(\lambda\) in the equation \(|\textbf{A}-\lambda\textbf{I}|=0\)

  • \(\textbf{A}\) is singular if and only if 0 is an eigenvalue of \(\textbf{A}\).

  • If \(\textbf{A}\) has eigenvalues \(\lambda_1, \lambda_2,..., \lambda_n\), then \[ tr(\textbf{A}) = \sum_{i=1}^n\lambda_i\quad \text{and} \quad|\textbf{A}|=\prod_{i=1}^n\lambda_i \]

Definition 12.4 (Condition Number) Let \(\textbf{X}^*\) be the matrix of scaled columns of the original design matrix \(\textbf{X}\).

If the eigenvalues of \(\frac{1}{n-1}\textbf{X}^{*'}\textbf{X}^*\) are given by \(\lambda_1, \lambda_2,...,\lambda_k\), then the condition number of \(\textbf{X}\) is given by

\[ \kappa(\textbf{X})=\sqrt{\frac{\lambda_\max}{\lambda_\min}} \] where

  • \(\lambda_{max}\) and \(\lambda_{min}\) are the maximum and minimum eigenvalues of \(\frac{1}{n-1}\textbf{X}^{*'}\textbf{X}^*\)

The closer \(\lambda_{min}\) is to 0, the closer \(\textbf{X}'\textbf{X}\) is to being singular.

Note that if there is ill-conditioning, some eigenvalues of \(\textbf{X}'\textbf{X}\) are near zero, hence a possibly high value of \(\kappa\).

  • Low Condition Number \(\kappa\) (close to 1): predictors are nearly orthogonal, meaning they are not collinear.

  • High Condition Number \(\kappa\): High degree of multicollinearity

Range of \(\kappa\) State of Multicollinearity
\(\kappa<100\) No serious problem
\(100 \leq \kappa \leq 1000\) Moderate to strong
\(\kappa >1000\) Severe

In this class, we will use 100 as our limit for condition number \(\kappa\).

M <- model.matrix(bp_model)
apply(M, 2, sum)
## (Intercept)         age      weight         bsa         dur       pulse 
##       20.00      972.00     1861.80       39.96      128.60     1392.00 
##      stress 
##     1067.00
library(multiColl)
CN(model.matrix(bp_model))
## [1] 201.4958
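To see where this number comes from, here is a hand-rolled sketch (reusing M from the chunk above). The columns of the model matrix, including the intercept, are scaled to unit length before the eigenvalues are taken; this appears to be the scaling convention used by olsrr and multiColl, although conventions differ across packages.

    # condition number from the eigenvalues of the unit-length-scaled model matrix
    M_s    <- scale(M, center = FALSE, scale = sqrt(colSums(M^2)))
    eig    <- eigen(crossprod(M_s))
    lambda <- eig$values
    sqrt(max(lambda) / min(lambda))   # close to the CN() value of about 201.5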

Condition Indices

Definition 12.5 Condition indices are the ratio of the square root of maximum eigenvalue to the square root of each of the other eigenvalues, that is, \[ CI_j=\eta_j = \frac{\sqrt{\lambda_{max}}}{\sqrt{\lambda_j}} \]

  • This gives a clarification as to whether one or several dependencies are present among the \(X\)'s.

  • This can help in formulating a possible simultaneous system of equations

  • The lowest condition index is 1.

  • The condition number is the highest condition index.

Indices greater than 30 could indicate the presence of dependencies. Interpretations are more intuitive when combined with variance proportions.
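Continuing the sketch above (assuming lambda from the condition-number computation), all condition indices can be obtained at once:

    # condition indices: sqrt(lambda_max / lambda_j) for each eigenvalue
    round(sqrt(max(lambda) / lambda), 4)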


Variance Proportions

By the Spectral Decomposition, we can decompose the matrix form of the variance of the estimated coefficients as \[ Var(\hat{\boldsymbol{\beta}})=\sigma^2(\textbf{X}'\textbf{X})^{-1}= \sigma^2\textbf{T}\boldsymbol{\Lambda}^{-1}\textbf{T}' \] where

  • \(\textbf{T}\) is a \(p \times p\) matrix composed of columns of eigenvectors \(\textbf{t}_k\) of \(\textbf{X}'\textbf{X}\) corresponding to \(k^{th}\) eigenvalue. \[ \begin{align} \textbf{T}&=\begin{bmatrix}\textbf{t}_1 & \textbf{t}_2 & \cdots & \textbf{t}_p\end{bmatrix}\\ &=\begin{bmatrix}t_{11} & t_{12} & \cdots & t_{1p} \\ t_{21} & t_{22} & \cdots & t_{2p} \\ \vdots & \vdots &\ddots&\vdots \\ t_{p1} & t_{p2} & \cdots & t_{pp}\end{bmatrix}\\ \end{align} \]
  • \(\boldsymbol{\Lambda}\) is a diagonal matrix composed of eigenvalues of \(\textbf{X}'\textbf{X}\). \[ \boldsymbol{\Lambda}=\text{diag}(\lambda_1,\lambda_2,\cdots,\lambda_p) \]

Note that from here, \[ Var(\hat{\beta_j})=\sigma^2 \sum_{k=1}^p \frac{t^2_{jk}}{\lambda_k} \]

Definition 12.6 Variance Proportion of the \(j^{th}\) regressor with the \(i^{th}\) eigenvector is given by: \[ \pi_{j,i}=\frac{t_{ji}^2/\lambda_i}{\sum_{k=1}^p {t^2_{jk}}/{\lambda_k}} \]
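The definition can be implemented directly from the eigen-decomposition. Here is a sketch (assuming M_s and eig from the condition-number computation above); it should approximately reproduce the variance-proportions table from olsrr shown later in this section, although orderings and rounding may differ slightly.

    # variance-decomposition proportions from the eigen-decomposition of the scaled X'X
    V     <- eig$vectors                      # columns are eigenvectors
    phi   <- sweep(V^2, 2, eig$values, "/")   # phi[j, k] = t_jk^2 / lambda_k
    props <- phi / rowSums(phi)               # each row (coefficient) sums to 1
    round(t(props), 4)                        # rows = eigenvalues, columns = coefficients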

Rationale

  • recall from the definition of multicollinearity: There exists \(c_1,c_2,...,c_p\) (not all zero) such that \[ c_1\textbf{x}_1+c_2\textbf{x}_2+\cdots+c_p\textbf{x}_p=\textbf{0} \quad (\text{or} \approx\textbf{0}) \]

  • If \(\lambda_j\) is close to 0, and \(\textbf{t}_j=\begin{bmatrix}t_{j1} & t_{j2} & \cdots & t_{jk}\end{bmatrix}'\) is the eigenvector associated with \(\lambda_j\), then \(\sum_{i=1}^k t_{ji}X_i\approx0\) gives the structure of dependency of the Xs that leads to the problem of multicollinearity.

  • a high proportion of any variance can be associated with a large singular value even when there is no perfect collinearity.

Rule:

  • The standard approach is to check a high condition index (usually >30) associated with a large proportion of the variance of two or more coefficients when diagnosing collinearity.

  • If two or more variance proportions associated with the \(j^{th}\) condition index are relatively large and the condition index \(\eta_j\) itself is large, it signals that near dependencies are influencing the regression estimates.

  • That is, a multicollinearity problem occurs when a component associated with a high condition index contributes strongly (variance proportion greater than about 0.5) to the variance of two or more variables.

To summarize, collinearity is spotted by finding 2 or more variables that have large proportions of variance (0.50 or more) that correspond to large condition indices (>30).


To detect multicollinearity, we may ask the following questions:

  • Which regressors have coefficients with very inflated variance due to multicollinearity?

    use the variance inflation factors

  • Is there an overall problem of multicollinearity or dependencies between the regressors?

    use the condition number

  • How many dependency structures are there?

    use the condition indices

  • Which are the variables involved in these dependencies?

    use the variance proportions


Illustration of Multicollinearity Diagnostics in R

The ols_coll_diag() function from the olsrr package shows the different multicollinearity diagnostics in a single output: VIF, Condition Index, and Variance Proportions.

library(olsrr)
multi_diag <- ols_coll_diag(bp_model)
multi_diag
Variables   Tolerance        VIF
age         0.5672772   1.762807
weight      0.1188067   8.417035
bsa         0.1876612   5.328752
dur         0.8082053   1.237309
pulse       0.2265737   4.413575
stress      0.5450051   1.834845

   Eigenvalue  Condition Index  intercept     age  weight     bsa     dur   pulse  stress
1      6.6558           1.0000     0.0000  0.0000  0.0000  0.0000  0.0016  0.0000  0.0030
2      0.2679           4.9842     0.0002  0.0001  0.0000  0.0001  0.0000  0.0000  0.5512
3      0.0714           9.6536     0.0005  0.0004  0.0001  0.0003  0.9285  0.0002  0.0524
4      0.0027          50.0668     0.1027  0.0890  0.0052  0.1698  0.0007  0.0040  0.0254
5      0.0011          77.3965     0.3723  0.7775  0.0092  0.0113  0.0259  0.0050  0.0391
6      0.0009          83.7514     0.2900  0.0381  0.0061  0.0736  0.0432  0.4376  0.1433
7      0.0002         201.4958     0.2343  0.0949  0.9794  0.7450  0.0000  0.5532  0.1856

Interpretation:

  1. The variances of the estimated coefficients of weight and bsa are inflated due to multicollinearity (\(\text{VIF}>5\)). This tells us, for example, that the variance of the estimated coefficient of weight is inflated by a factor of 8.42 because weight is highly correlated with at least one of the other predictors in the model.

  2. The condition number implies that there is an overall problem of multicollinearity.

  3. We check the variance proportions corresponding to condition indices >30.

    • Checking the 7th condition index (201.4958): weight, bsa, and pulse all have variance proportions > 0.50.

    • The eigensystem indicates that these variables are multicollinear to each other, and the structure of the (near) dependency can be described by \[t_{71}\textbf{intercept}+t_{72}\textbf{age}+t_{73}\textbf{weight}+t_{74}\textbf{bsa}+t_{75}\textbf{dur}+t_{76}\textbf{pulse}+t_{77}\textbf{stress}\approx\textbf{0}\]

    where \(\textbf{t}_j=\begin{bmatrix}t_{j1} & t_{j2} & \cdots & t_{jp}\end{bmatrix}'\) is the eigenvector corresponding to \(j^{th}\) eigenvalue.
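The coefficients of this near dependency can be read off from the eigenvector attached to the smallest eigenvalue, i.e., the largest condition index. A short sketch, assuming eig and M_s from the condition-number computation earlier (eigenvector signs are arbitrary):

    # eigenvector associated with the smallest eigenvalue (largest condition index)
    t7 <- eig$vectors[, which.min(eig$values)]
    names(t7) <- colnames(M_s)
    round(t7, 3)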


12.4 Remedial Measures

No single solution can eliminate multicollinearity completely. Certain approaches may be useful.


Deletion of unimportant or redundant variables

  • If a variable is redundant, it should have never been included in the model in the first place.
  • You may also use variable selection procedures
Example

Recall blood pressure dataset.

  • bp (blood pressure in mm Hg)
  • age (in years)
  • weight (in kg)
  • bsa (body surface area in sqm)
  • dur (duration of hypertension in years)
  • pulse (basal pulse in beats per minute)
  • stress (stress index)
bloodpressure <- readr::read_csv("bloodpressure.csv")
patient bp age weight bsa dur pulse stress
1 105 47 85.4 1.75 5.1 63 33
2 115 49 94.2 2.10 3.8 70 14
3 116 49 95.3 1.98 8.2 72 10
4 117 50 94.7 2.01 5.8 73 99
5 112 51 89.4 1.89 7.0 72 95
6 121 48 99.5 2.25 9.3 71 10
7 121 49 99.8 2.25 2.5 69 42
8 110 47 90.9 1.90 6.2 66 8
9 110 49 89.2 1.83 7.1 69 62
10 114 48 92.7 2.07 5.6 64 35
11 114 47 94.4 2.07 5.3 74 90
12 115 49 94.1 1.98 5.6 71 21
13 114 50 91.6 2.05 10.2 68 47
14 106 45 87.1 1.92 5.6 67 80
15 125 52 101.3 2.19 10.0 76 98
16 114 46 94.5 1.98 7.4 69 95
17 106 46 87.0 1.87 3.6 62 18
18 113 46 94.5 1.90 4.3 70 12
19 110 48 90.5 1.88 9.0 71 99
20 122 56 95.7 2.09 7.0 75 99
cor_matrix <- cor(bloodpressure[,2:8])
corrplot::corrplot(cor_matrix, method = "shade", addCoef.col = 'grey')

We first check the variance inflation factors for detecting multicollinearity:

full_bp <- lm(bp ~ age + weight + bsa + dur + pulse + stress,
    data = bloodpressure)
car::vif(full_bp)
##      age   weight      bsa      dur    pulse   stress 
## 1.762807 8.417035 5.328751 1.237309 4.413575 1.834845

We can either manually select unimportant variables to remove based on literature, or use an automatic variable selection procedure. For this example, we use backward elimination based on sequential F-tests with \(\alpha=0.05\).

backward_bp <- olsrr::ols_step_backward_p(full_bp, p_val = 0.05, 
                                      progress = TRUE, details=TRUE)
## Backward Elimination Method 
## ---------------------------
## 
## Candidate Terms: 
## 
## 1. age 
## 2. weight 
## 3. bsa 
## 4. dur 
## 5. pulse 
## 6. stress 
## 
## 
## Step   => 0 
## Model  => bp ~ age + weight + bsa + dur + pulse + stress 
## R2     => 0.996 
## 
## Initiating stepwise selection... 
## 
## Step     => 1 
## Removed  => dur 
## Model    => bp ~ age + weight + bsa + pulse + stress 
## R2       => 0.99556 
## 
## Step     => 2 
## Removed  => pulse 
## Model    => bp ~ age + weight + bsa + stress 
## R2       => 0.99493 
## 
## Step     => 3 
## Removed  => stress 
## Model    => bp ~ age + weight + bsa 
## R2       => 0.99454 
## 
## 
## No more variables to be removed.
## 
## Variables Removed: 
## 
## => dur 
## => pulse 
## => stress

After the backward elimination algorithm, the independent variables selected for predicting blood pressure are age, weight, and body surface area.

summary(backward_bp$model)
## 
## Call:
## lm(formula = paste(response, "~", paste(c(include, cterms), collapse = " + ")), 
##     data = l)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.75810 -0.24872  0.01925  0.29509  0.63030 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -13.66725    2.64664  -5.164 9.42e-05 ***
## age           0.70162    0.04396  15.961 3.00e-11 ***
## weight        0.90582    0.04899  18.490 3.20e-12 ***
## bsa           4.62739    1.52107   3.042  0.00776 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.437 on 16 degrees of freedom
## Multiple R-squared:  0.9945, Adjusted R-squared:  0.9935 
## F-statistic: 971.9 on 3 and 16 DF,  p-value: < 2.2e-16

Now, checking for multicollinearity:

car::vif(backward_bp$model)
##      age   weight      bsa 
## 1.201901 4.403645 4.286943

This model may help individuals estimate their own blood pressures using only their age, weight, and body surface area.


Centering of observations

  • Centering a predictor is subtracting the mean of the predictor values in the data set from each predictor value. \(X_i \rightarrow (X_i-\bar{X})\)

  • This is especially effective if complex functions of the X’s (e.g., \(f(X)=X^2\)) are present in the design matrix. After centering, the non-linear relationship of \(X\) and \(f(X)\) will be more apparent, and the collinearity issue is mitigated.

  • An example of including both \(X\) and a complex function of \(X\) in the model:

    • suppose \(Y\) and \(X\) have a quadratic relationship, and can be characterized using a polynomial model:

    \[ y_i = \beta_0 +\beta_1X_{i} + \beta_2X_{i}^2+\varepsilon_i \]

    • We can use OLS to estimate \(\beta_1\) and \(\beta_2\), but for some range of values of \(X\), \(X\) and \(X^2\) may appear nearly linearly related. This is an example of Structural Multicollinearity.

    • as a solution, center the Xs, and fit the following line instead:

    \[ y_i = \beta_0^* +\beta_1^*(X_{i}-\bar{X}) + \beta_2^*(X_{i}-\bar{X})^2+\varepsilon_i \]

    • Multicollinearity is mitigated in such a way that the nonlinearity between \(X-\bar{X}\) and \((X-\bar{X})^2\) becomes more defined and apparent in the data points (see the short sketch below).
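Here is a minimal sketch with simulated data (the variable x below is hypothetical, not from any of the chapter's datasets) showing how centering weakens the near-linear association between a predictor and its square:

    # centering reduces the correlation between X and X^2
    set.seed(136)
    x  <- runif(40, 20, 70)        # a predictor observed over a positive range
    xc <- x - mean(x)              # centered predictor
    c(raw = cor(x, x^2), centered = cor(xc, xc^2))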
Example

According to researchers, the amount of immunoglobulin in the blood (igg) and the maximal oxygen uptake (oxygen) of an individual can be modeled using a quadratic (polynomial) equation.

\[ igg_i=\beta_0+\beta_1oxygen_i + \beta_{2} (oxygen_i)^2 + \varepsilon_i \]

Notice that oxygen and a complex function of oxygen are both regressors in the model.

immunity <- readr::read_csv("immunity.csv")
igg oxygen
881 34.6
1290 45.0
2147 62.3
1909 58.9
1282 42.5
1530 44.3
2067 67.9
1982 58.5
1019 35.6
1651 49.6
752 33.0
1687 52.0
1782 61.4
1529 50.2
969 34.1
1660 52.5
2121 69.9
1382 38.8
1714 50.6
1959 69.4
1158 37.4
965 35.1
1456 43.0
1273 44.1
1418 49.8
1743 54.4
1997 68.5
2177 69.5
1965 63.0
1264 43.2

The goal is to estimate the parameters \(\beta_0\), \(\beta_1\), and \(\beta_{2}\)

Fitting the model \(igg_i=\beta_0+\beta_1oxygen_i + \beta_{2} (oxygen_i)^2 + \varepsilon_i\) to the data using OLS, we have:

immunity$oxygen_sq <- immunity$oxygen^2
immunity_model <- lm(igg ~ oxygen + oxygen_sq, data = immunity)
immunity_model
## 
## Call:
## lm(formula = igg ~ oxygen + oxygen_sq, data = immunity)
## 
## Coefficients:
## (Intercept)       oxygen    oxygen_sq  
##  -1464.4042      88.3071      -0.5362

Plotting the fitted line \(\widehat{igg}_i=\hat{\beta}_0 + \hat{\beta}_1oxygen_i + \hat{\beta}_{2}oxygen^2_i\), we have:

The fitted curve is reasonable in the sense that it passes through the center of the points. However, near multicollinearity exists between \(oxygen\) and \(oxygen^2\), even though their relationship is nonlinear. To visualize:

The variances of \(\hat{\beta}_1\) and \(\hat{\beta}_{2}\) are both inflated, as indicated by the \(\text{VIF}\)s.

car::vif(immunity_model)
##    oxygen oxygen_sq 
##  99.94261  99.94261

As a solution, we center the observed values of oxygen: \(oxygen_i\rightarrow (oxygen_i-\text{mean}(oxygen))\)

Centering \(X\), we can formulate the model as:

\[ y_i = \beta_0 +\beta_1^*(X_{i}-\bar{X}) + \beta_{2}^*(X_{i}-\bar{X})^2+\varepsilon_i \]

library(dplyr)   # for %>% and mutate()

immunity <- immunity %>% 
    mutate(
        cent_oxy     = oxygen - mean(oxygen),
        cent_oxy_sq  = (oxygen - mean(oxygen))^2
    )

The centered variables now have a more clearly defined nonlinear relationship.

We expect that the multicollinearity between the two variables will be greatly reduced.

imm_model_cntr <- lm(igg ~ cent_oxy + cent_oxy_sq, data = immunity)
imm_model_cntr
## 
## Call:
## lm(formula = igg ~ cent_oxy + cent_oxy_sq, data = immunity)
## 
## Coefficients:
## (Intercept)     cent_oxy  cent_oxy_sq  
##   1632.1962      33.9995      -0.5362

The fitted line can be expressed as \[ \hat{y}_i = 1632.1962 + 33.9995(X_i - mean(X)) - 0.5362 (X_i - mean(X))^2 \]

Plotting the fitted line on the data:

library(ggplot2)   # needed for the plot below

# fitted equation from the original (uncentered) model
fitted_immunity <- function(x){ 
    # estimated parameters
    beta_0  <- coef(immunity_model)["(Intercept)"] 
    beta_1  <- coef(immunity_model)["oxygen"]
    beta_11 <- coef(immunity_model)["oxygen_sq"]
    # evaluate the fitted quadratic
    beta_0 + beta_1*x + beta_11*x^2
}

# fitted equation from the centered model
# note: center at the sample mean of oxygen, not at the mean of the plotting grid
fitted_cnter <- function(x){
    xbar <- mean(immunity$oxygen)
    imm_model_cntr$coefficients[1] + 
        imm_model_cntr$coefficients[2]*(x - xbar) + 
        imm_model_cntr$coefficients[3]*(x - xbar)^2
}

ggplot(immunity, aes(x = oxygen, y = igg)) +
    geom_point() +
    # fitted curve from the uncentered model
    geom_function(fun = fitted_immunity,
                  aes(color = "uncentered"),
                  lwd = 1, linetype = 1) +
    # fitted curve from the centered model (the two curves coincide)
    geom_function(fun = fitted_cnter,
                  aes(color = "centered"),
                  lwd = 1, linetype = 1) +
    scale_color_manual(
        breaks = c("uncentered", "centered"),
        values = c(uncentered = "red", centered = "green"),
        labels = c(expression(hat(y) == -1464.40 + 88.30*x - 0.54*x^2),
                   expression(hat(y) == 1632.19 + 33.99*(x - bar(x)) - 0.54*(x - bar(x))^2))
    ) +
    theme_bw() +
    theme(legend.title = element_blank(),
          legend.position = c(.3, .9))

The VIFs of the parameters are now much lower.

car::vif(imm_model_cntr)
##    cent_oxy cent_oxy_sq 
##    1.050628    1.050628

Shrinkage Estimation (Ridge Regression)

Instead of using the OLS estimator, we use another estimator that has less inflated variance, but compromises the unbiasedness.

In Ridge regression, the normal equations are given by \(\left(\textbf{X}'\textbf{X}+k \textbf{I}\right) \hat{\boldsymbol{\beta}}_R = \textbf{X}'\textbf{Y}\), where \(k\geq 0\) is called the biasing parameter (a hand-rolled sketch of the resulting estimator follows the list below).

  • The ridge estimator is \(\hat{\boldsymbol{\beta}}_R = \left(\textbf{X}'\textbf{X}+k \textbf{I}\right)^{-1} \textbf{X}'\textbf{Y}\)
  • \(k\) is usually chosen to be a number between 0 and 1.
  • If \(k=0\) then the ridge estimator reduces to OLS estimator.
  • If \(k\) is very large, the coefficients shrink towards 0.
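A hand-rolled sketch of this estimator (assuming the bloodpressure data from earlier is loaded). This only illustrates \(\left(\textbf{X}'\textbf{X}+k \textbf{I}\right)^{-1} \textbf{X}'\textbf{Y}\) on standardized predictors with an arbitrary \(k=0.1\); it is not the lmridge implementation, whose internal scaling differs.

    # ridge estimates on standardized predictors and a centered response
    Xs <- scale(as.matrix(bloodpressure[, c("age","weight","bsa","dur","pulse","stress")]))
    yc <- bloodpressure$bp - mean(bloodpressure$bp)
    k  <- 0.1
    solve(crossprod(Xs) + k * diag(ncol(Xs)), crossprod(Xs, yc))   # shrunken (scaled) coefficients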

Properties of Ridge Estimator

  1. The ridge estimator is a linear transformation of the OLS estimator. Hence, it is also a linear estimator of \(\boldsymbol{\beta}\).

  2. Expected value: \(E(\hat{\boldsymbol{\beta}}_R)=(\textbf{X}'\textbf{X}+k \textbf{I})^{-1} \textbf{X}'\textbf{X}\boldsymbol{\beta}\). This implies that \(\hat{\boldsymbol{\beta}}_R\) is a biased estimator.

  3. Bias:

    • \(Bias(\hat{\boldsymbol{\beta}}_R) = E(\hat{\boldsymbol{\beta}}_R)-\boldsymbol{\beta}=((\textbf{X}'\textbf{X}+k \textbf{I})^{-1} \textbf{X}'\textbf{X}-\textbf{I})\boldsymbol{\beta}\)
    • \(tr\left(Bias(\hat{\boldsymbol{\beta}}_R)\right)=k^2\boldsymbol{\beta}'(\textbf{X}'\textbf{X}+k \textbf{I})^{-1}\boldsymbol{\beta}\)
  4. Variance:

    • \(Var(\hat{\boldsymbol{\beta}}_R)=\sigma^2(\textbf{X}'\textbf{X}+k \textbf{I})^{-1} (\textbf{X}'\textbf{X})(\textbf{X}'\textbf{X}+k \textbf{I})^{-1}\)
    • \(tr\left(Var(\hat{\boldsymbol{\beta}}_R)\right)=\sigma^2\sum_{j=1}^p\frac{\lambda_j}{(\lambda_j+k)^2}\)
  5. As the biasing parameter \(k\) increases, the Variance of ridge estimator \(\hat{\boldsymbol{\beta}}_R\) “shrinks” (more reliable) but the Bias increases (less accurate).

  6. \(R^2\) decreases as \(k\) increases, since the residual sum of squares increases.

Ridge regression will not necessarily give the best fit (its estimates are biased) but it will give more stable (reliable) estimates. We want a value of \(k\) that makes the coefficient estimates more stable, but does not increase the bias too much.

Determination of \(k\)

  • Ridge Trace

    Plot the estimates of the parameters for some values of \(k\) between 0 and 1. Choose the smallest value of \(k\) at which the trace starts to stabilize.

  • VIF Trace

    Plot the VIF of the parameters for some values of \(k\) between 0 and 1. Choose the smallest value of \(k\) at which all the \(VIF\)s are below your chosen threshold (e.g., less than 5).

 

Example in R

We use the bloodpressure data again. Recall the VIF of the parameters:

car::vif(full_bp)
##      age   weight      bsa      dur    pulse   stress 
## 1.762807 8.417035 5.328751 1.237309 4.413575 1.834845

The lmridge package can help perform ridge regression. We try for different values of the biasing parameter \(k\).

library(lmridge)
bp_ridge_trace <- lmridge(bp~age+weight+bsa+dur+pulse+stress,
                    data=bloodpressure,
                    K = seq(0,1,0.005))

In selecting the value of the biasing parameter \(k\), we check at what value of \(k\) the estimates begin to stabilize.

plot(bp_ridge_trace, type ="ridge")

We can also look at the changes in \(VIF\) as \(k\) increases. We want the lowest value of \(k\) at which the \(VIF\)s are all less than 5.

plot(bp_ridge_trace, type ="vif")

We select \(k=0.1\)

bp_ridge <- lmridge(bp~age+weight+bsa+dur+pulse+stress,
                    data=bloodpressure,
                    K = 0.1)
summary(bp_ridge)
## 
## Call:
## lmridge.default(formula = bp ~ age + weight + bsa + dur + pulse + 
##     stress, data = bloodpressure, K = 0.1)
## 
## 
## Coefficients: for Ridge parameter K= 0.1 
##             Estimate Estimate (Sc) StdErr (Sc) t-value (Sc) Pr(>|t|)    
## Intercept    -3.1368    -1458.8787    116.2184     -12.5529   <2e-16 ***
## age           0.5855        6.3815      0.7711       8.2755   <2e-16 ***
## weight        0.6390       11.9620      0.8867      13.4900   <2e-16 ***
## bsa           9.9908        5.9437      0.8488       7.0028   <2e-16 ***
## dur           0.0773        0.7231      0.7010       1.0314   0.3195    
## pulse         0.1268        2.1024      0.8752       2.4021   0.0305 *  
## stress       -0.0016       -0.2560      0.7387      -0.3466   0.7340    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Ridge Summary
##        R2    adj-R2  DF ridge         F       AIC       BIC 
##   0.90670   0.87340   4.76418 181.43035 -10.31040  54.34809 
## Ridge minimum MSE= 71.22473 at K= 0.1 
## P-value for F-test ( 4.76418 , 14.47388 ) = 1.885411e-12 
## -------------------------------------------------------------------

Now, the VIFs of the parameters are lower.

vif(bp_ridge)
##          age  weight     bsa     dur   pulse  stress
## k=0.1 1.1604 1.53437 1.40577 0.95903 1.49481 1.06477

Imposing Constraints

Suppose your model is \[ Y=\beta_0+\beta_1X_1+\beta_2X_2+\beta_3X_3+\varepsilon \] The eigensystem analysis suggests that there are regressors that are linearly related, for example, \(2X_1+X_2=0\).

To fit the model, we can combine \(X_1\) and \(X_2\) by imposing the constraint \(\beta_1=2\beta_2\).

Inserting the restriction and rewriting the original regression model, we get

\[ \begin{align} Y&=\beta_0+2\beta_2X_1+\beta_2X_2+\beta_3X_3+\varepsilon\\ &=\beta_0+\beta_2(2X_1+X_2)+\beta_3X_3+\varepsilon\\ &=\beta_0+\beta_2Z+\beta_3X_3+\varepsilon \end{align} \] This will have an effect of regressing on the new variable \(Z\) rather than each of \(X_1\) and \(X_2\).
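A small sketch of how such a constraint can be imposed in R with a combined regressor (x1, x2, x3, and y below are simulated and purely illustrative):

    # fit the constrained model by regressing on Z = 2*X1 + X2
    set.seed(1)
    x1 <- rnorm(50); x2 <- rnorm(50); x3 <- rnorm(50)
    y  <- 1 + 4*x1 + 2*x2 + 3*x3 + rnorm(50)   # true coefficients satisfy beta1 = 2*beta2
    coef(lm(y ~ I(2*x1 + x2) + x3))            # the coefficient of Z estimates beta2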


Transform or Combine Multicollinear Variables

You can perform dimension reduction procedures while still using all available variables. Combining multicollinear variables will help.

  • For example, instead of including both \(GDP\) and \(population\) in the model (which may be correlated), include \(\frac{GDP}{population}=\text{GDP per capita}\).

  • Another method is the Principal Component Regression. The idea is that it combines related variables as “Principal Components”, which will be used as the regressors in the model.

    Example of Principal Component Regression

    In the following example, we model the price of bangus based on different factors:

    1. prices of different goods (gg, tilapia, pork, chicken)

    2. economic indicators

      • cpi: Consumer Price Index
      • cpifbt: Consumer Price Index of Food, Beverage, and Tobacco
    3. a climate variable

      • soi: Southern Oscillation Index
    bangus <- readr::read_csv("bangus.csv")
    …1 cpi cpifbt bangus gg tilapia pork chicken soi
    1 46.3 45.1 12.92 8.17 12.64 16.90 14.58 -2.6
    2 47.6 46.4 12.40 8.84 11.98 16.76 14.50 -3.2
    3 48.0 46.8 12.39 9.16 5.10 17.15 14.83 -4.0
    4 48.3 47.2 12.12 9.41 7.20 17.24 14.65 -3.3
    5 48.5 47.3 12.05 8.53 6.53 17.61 14.36 -3.6
    6 48.9 47.7 12.09 8.43 9.43 16.63 14.15 -5.1
    7 49.0 47.8 12.67 8.56 9.91 16.40 15.09 -4.6
    8 46.1 44.4 12.80 9.90 10.14 18.87 15.43 -6.9
    9 46.3 44.6 12.51 9.81 10.95 19.07 15.04 -7.6
    10 46.5 44.7 12.76 9.27 11.25 18.99 15.07 -5.6
    11 46.8 45.1 13.16 8.89 11.76 19.02 14.97 -2.2
    12 47.2 45.7 12.42 8.32 10.78 19.08 14.99 0.7
    13 47.9 46.5 11.78 8.67 11.33 19.06 14.89 -0.5
    14 49.0 47.5 11.22 8.52 10.60 18.95 15.21 -1.3
    15 49.9 48.4 11.63 9.34 11.03 19.03 14.70 -0.3
    16 50.2 48.6 12.24 9.24 10.43 19.04 16.10 1.7
    17 50.6 49.1 13.07 10.38 10.63 19.28 15.98 0.4
    18 53.3 51.9 14.37 11.61 11.74 20.89 17.62 -0.3
    19 57.5 56.6 13.77 14.49 14.30 22.61 19.12 -0.2
    20 60.1 59.3 18.78 14.93 15.20 23.35 19.91 0.2
    21 62.3 61.5 19.92 15.98 16.19 27.63 20.82 0.9
    22 63.6 63.1 20.63 15.58 15.75 29.34 20.34 -1.5
    23 64.7 64.2 21.25 15.69 16.71 30.63 21.44 0.3
    24 65.6 65.2 20.51 14.50 16.07 32.46 22.35 -0.1
    25 69.4 69.2 20.69 15.22 16.81 34.95 23.64 -1.3
    26 75.8 74.9 20.40 16.71 19.09 36.36 25.15 0.1
    27 77.8 76.9 21.47 18.64 18.47 35.97 24.99 0.1
    28 79.9 79.6 22.07 19.38 18.50 35.68 25.85 0.2
    29 81.3 81.3 22.11 19.77 19.03 34.50 25.30 -1.0
    30 84.3 84.5 24.35 22.12 20.56 35.84 26.40 0.4
    31 85.6 85.9 25.78 23.02 21.86 35.83 26.06 -0.7
    32 87.9 87.9 27.98 22.94 23.05 37.30 28.17 -0.7
    33 88.8 89.3 27.07 20.71 21.40 37.42 28.33 1.7
    34 89.2 89.3 26.09 18.80 20.68 37.21 26.67 0.3
    35 88.6 88.6 26.39 17.97 20.50 37.93 27.85 1.7
    36 88.7 88.5 24.90 17.78 19.98 36.33 27.15 0.3
    37 88.9 88.6 25.12 20.42 20.75 38.99 27.34 -1.5
    38 90.4 89.8 25.93 20.32 21.25 37.17 27.90 -0.4
    39 90.6 90.1 27.20 20.68 21.19 36.80 26.60 1.1
    40 90.4 89.5 26.29 21.28 21.34 36.53 27.54 -0.1
    41 89.9 88.6 26.70 22.50 22.22 37.75 27.36 -1.2
    42 90.1 88.8 27.58 21.51 22.54 36.82 27.62 -0.5
    43 90.3 89.0 29.70 22.67 28.34 37.22 27.88 0.2
    44 90.5 89.9 35.09 24.99 26.44 40.54 31.78 1.5
    45 91.3 90.9 34.73 22.51 25.57 42.03 31.11 -2.7
    46 91.5 90.9 30.97 20.73 23.42 40.57 30.25 -0.1
    47 89.6 88.4 30.06 17.44 21.68 41.41 28.44 0.1
    48 88.9 87.3 27.65 18.48 21.42 41.86 27.00 -0.9
    49 88.1 86.1 27.45 18.07 21.92 41.82 28.73 1.1
    50 87.9 85.8 27.36 19.45 21.77 40.25 28.28 0.2
    51 87.8 85.7 27.70 21.09 22.01 40.48 28.06 -1.6
    52 87.9 85.7 27.00 19.91 20.88 40.22 27.88 -1.0
    53 88.2 86.2 26.30 19.34 19.73 41.64 27.87 0.9
    54 89.0 87.4 28.58 21.99 20.82 41.74 28.05 -2.5
    55 88.7 86.8 30.29 23.54 24.01 42.90 29.61 -3.0
    56 89.3 87.9 33.00 24.27 25.33 41.32 32.49 -1.5
    57 89.7 88.4 32.97 22.64 24.69 41.75 31.32 -3.1
    58 89.7 88.3 31.33 19.44 22.86 42.09 30.94 -3.3
    59 89.7 88.0 31.12 17.32 22.95 42.39 31.79 -3.0
    60 90.5 89.3 29.82 16.64 22.85 43.23 32.36 -2.8
    61 91.6 90.7 29.47 17.98 22.60 43.24 33.14 -2.8
    62 92.7 92.1 28.06 19.12 22.04 44.05 34.24 -2.8
    63 93.1 92.5 27.59 19.06 22.63 44.37 35.66 -2.5
    64 93.3 92.8 27.55 20.12 21.80 44.29 36.14 -1.9
    65 93.3 92.5 27.65 19.53 22.33 44.69 35.91 -1.1
    66 93.8 93.1 28.64 20.87 25.27 45.34 35.05 -0.2
    67 94.9 94.7 30.62 23.75 26.69 44.30 35.74 -1.2
    68 96.7 96.6 31.19 22.04 25.37 39.54 36.22 -0.3
    69 97.4 97.0 32.50 21.35 26.24 39.75 36.31 -1.4
    70 98.2 97.7 32.05 21.71 26.13 39.73 36.46 0.1
    71 98.4 98.1 37.03 21.23 27.54 41.46 35.78 -0.1
    72 98.8 98.6 33.50 20.14 25.56 40.90 36.04 1.3
    73 99.7 99.5 32.96 21.19 25.58 40.30 37.34 -0.4
    74 100.4 100.3 31.19 22.59 23.90 40.86 37.55 1.7
    75 100.9 100.8 30.84 22.14 23.75 41.41 38.62 2.2
    76 101.2 100.9 30.40 21.72 23.78 42.40 37.27 3.4
    77 101.4 101.2 31.30 22.27 26.09 41.83 37.39 2.2
    78 102.9 103.7 33.40 24.85 27.46 42.81 38.70 3.0
    79 104.0 105.5 38.73 28.88 30.15 43.74 39.83 2.1
    80 106.3 108.2 48.59 30.12 34.14 44.93 42.16 2.7
    81 107.0 108.6 46.52 28.91 34.31 45.01 41.06 1.8
    82 107.0 108.2 45.98 26.43 33.46 45.87 41.86 1.0
    83 108.0 109.3 44.27 23.78 32.77 46.96 40.75 2.6
    84 109.1 110.9 43.41 24.52 32.30 49.34 40.94 1.9
    85 111.1 112.3 43.64 24.46 32.92 50.35 41.02 0.8
    86 112.4 113.8 43.13 25.54 34.94 51.28 40.02 1.4
    87 114.9 117.3 43.91 27.65 36.24 51.39 40.88 -1.3
    88 115.9 118.2 43.93 28.84 34.99 51.50 39.36 0.9
    89 116.8 119.3 43.95 27.09 36.70 51.90 40.02 1.0
    90 117.9 120.3 45.53 29.42 38.34 52.55 42.72 -0.6
    91 120.1 122.1 48.77 31.15 40.06 52.44 45.21 -1.2
    92 122.2 123.2 45.47 28.35 34.51 52.27 44.42 -0.4
    93 122.4 123.0 46.16 24.86 35.19 57.63 40.16 -3.9
    94 123.0 123.1 46.35 27.05 32.21 53.18 43.41 -1.9
    95 124.0 123.8 46.87 25.96 31.51 53.16 45.00 -0.1
    96 124.8 124.9 46.80 22.84 32.51 53.14 44.14 1.8
    97 126.1 125.9 46.44 23.30 32.81 53.23 43.35 -0.1
    98 128.1 127.7 45.57 24.54 33.98 53.18 43.78 0.8
    99 128.9 128.4 45.44 25.80 34.52 52.86 44.83 -1.0
    100 130.2 129.8 45.44 25.53 33.75 53.40 45.25 -1.3
    101 132.7 131.6 46.05 26.02 33.74 54.07 46.09 0.1
    102 135.2 133.5 47.51 29.96 37.23 53.92 46.28 -1.1
    103 139.5 136.6 49.70 32.92 38.65 56.48 48.67 -0.7
    104 143.7 140.6 51.93 30.96 37.95 57.53 50.33 1.0
    105 146.8 143.0 52.46 27.83 37.18 57.98 48.86 -0.1
    106 148.1 143.8 53.21 28.30 38.91 58.49 48.62 -2.2
    107 149.1 144.5 53.67 26.59 38.82 59.80 48.59 -1.7
    108 149.9 145.2 52.91 25.91 38.35 61.51 49.16 -2.4
    109 151.5 146.0 52.57 26.41 39.53 62.89 49.35 -0.9
    110 152.9 147.5 52.86 27.90 40.62 62.86 47.75 -0.2
    111 154.6 149.9 53.82 30.00 41.45 63.85 51.52 -1.4
    112 156.4 151.5 54.63 31.91 42.50 65.01 55.82 -2.9
    113 156.5 150.9 53.70 29.41 41.25 66.00 54.80 -2.4
    114 157.1 151.5 54.87 30.33 41.56 67.01 54.04 -1.4
    115 157.8 152.2 55.41 30.62 42.29 68.80 51.07 -3.7
    116 159.2 153.0 62.68 35.03 50.42 70.21 55.49 -5.6
    117 159.8 152.9 62.59 33.31 48.30 70.73 52.31 -2.3
    118 161.1 152.9 62.45 30.43 49.07 69.42 52.54 -4.8
    119 162.0 153.4 63.04 29.11 48.98 70.06 53.23 -2.3
    120 163.7 154.9 62.28 28.56 48.63 70.66 52.53 0.1
    121 165.4 156.5 59.46 28.57 48.99 71.65 53.72 -1.9
    122 167.0 157.9 56.75 27.49 48.97 72.20 56.42 -1.3
    123 168.3 159.4 54.21 29.88 49.55 72.15 61.08 0.0
    124 169.6 161.4 54.19 32.18 49.00 72.54 58.47 0.0
    125 170.0 161.5 55.30 32.45 45.95 72.65 54.87 -3.2
    126 170.5 161.8 56.12 35.60 50.67 71.45 54.15 -1.4
    127 170.7 161.7 57.19 35.49 51.90 72.09 57.64 -1.4
    128 172.3 162.3 58.98 35.58 51.81 71.12 54.97 -2.0
    129 172.9 162.4 58.22 34.20 53.30 71.28 54.05 -2.1
    130 173.6 162.3 59.94 24.40 52.20 71.20 51.24 -1.8
    131 174.0 162.6 59.40 30.65 51.70 70.33 52.82 -2.6
    132 174.6 162.9 62.05 30.71 45.99 69.42 53.88 -1.0
    133 176.4 163.8 59.72 30.66 49.62 71.15 52.57 -2.2
    134 178.9 166.8 59.34 32.59 50.71 70.19 52.93 -1.8
    135 180.1 168.0 55.45 33.95 50.34 70.45 53.53 -2.4
    136 182.4 171.2 56.28 32.61 49.98 70.32 54.62 -1.3
    137 183.7 172.9 57.87 34.68 50.72 67.97 54.38 -2.5
    138 184.1 173.1 59.24 35.87 51.13 68.13 55.57 -0.3
    139 185.0 174.1 61.66 39.48 53.49 69.93 60.26 0.1
    140 188.0 177.0 66.70 42.99 55.44 70.34 54.53 -0.5
    141 191.0 177.4 68.05 42.67 58.06 72.15 55.97 -0.1
    142 190.5 176.5 69.19 43.37 57.79 73.28 55.43 -2.2
    143 191.0 176.7 70.70 37.41 56.61 74.72 54.35 -2.9
    144 192.2 178.2 70.15 37.81 56.63 77.21 57.48 -1.7
    145 193.7 179.3 66.71 37.06 57.11 79.07 57.32 -1.5
    146 195.3 180.7 67.48 41.43 55.52 80.49 57.35 -2.9
    147 197.8 184.4 66.49 40.96 55.44 80.82 58.14 -3.0
    148 198.0 184.8 65.17 39.44 55.45 80.76 58.80 -3.0
    149 198.0 184.5 64.33 40.14 53.51 80.49 52.65 -2.6
    150 197.8 184.0 66.53 42.30 55.77 80.78 57.27 -1.2
    151 198.3 184.4 70.34 41.78 57.11 80.78 60.26 -2.6
    152 199.6 186.2 78.26 43.24 59.53 81.44 59.60 -1.0
    153 200.4 186.3 80.42 40.86 61.49 81.05 57.07 -0.8
    154 201.2 186.3 79.03 38.26 59.45 81.75 57.35 0.4
    155 202.8 187.4 78.73 38.49 59.48 81.89 57.07 -1.8
    156 205.2 190.2 76.12 38.26 57.96 82.11 56.31 -1.2
    157 207.7 192.7 71.26 36.64 56.63 82.52 58.89 -0.4
    158 209.8 195.4 69.97 38.10 56.04 81.66 59.16 0.6
    159 213.8 201.8 68.02 39.20 55.19 81.00 61.84 -0.1
    160 220.4 212.4 66.37 40.90 54.65 81.04 60.99 0.5
    161 219.7 210.9 65.14 39.34 52.29 82.40 60.09 -0.5
    162 219.7 212.2 63.64 40.88 53.97 79.59 59.61 -0.1
    163 220.1 212.6 68.78 42.02 56.81 81.91 59.39 -1.3
    164 221.9 213.9 73.16 47.93 59.20 82.63 58.46 1.7
    165 224.1 216.1 77.48 47.14 58.14 82.79 56.52 -0.2
    166 224.9 216.2 81.45 43.15 58.04 83.35 60.17 1.2
    167 225.8 216.6 81.65 41.01 58.97 88.49 61.62 1.1
    168 226.5 217.4 101.00 40.89 57.06 88.02 60.90 0.2
    169 228.1 217.8 74.08 40.61 58.22 88.85 62.80 1.6
    170 228.4 217.7 73.62 41.70 58.72 89.39 57.01 1.0
    171 230.7 220.9 74.61 43.42 57.55 91.37 56.88 0.7
    172 230.3 219.5 73.52 42.45 56.83 88.38 56.50 1.0
    173 230.1 218.4 71.94 42.69 55.76 89.74 56.71 0.7
    174 229.7 216.5 74.12 44.38 57.19 89.15 53.60 -0.3
    175 231.5 217.9 75.70 44.64 58.44 90.05 56.31 1.3
    176 233.0 218.5 78.08 44.12 57.95 89.85 56.66 0.8
    177 234.0 218.5 81.30 45.69 59.49 89.95 56.67 2.6
    178 235.8 220.4 82.74 49.34 60.33 89.45 56.31 -1.9
    179 236.1 220.5 81.17 45.91 60.15 90.36 57.21 -1.4
    180 236.0 220.1 78.83 44.07 60.10 90.70 58.61 -3.0
    181 239.0 221.4 76.09 44.41 60.17 92.52 56.47 -3.2
    182 239.3 220.9 72.80 44.13 58.60 93.25 57.41 -1.7
    183 240.9 222.6 71.13 45.47 58.26 93.73 58.15 -3.4
    184 242.2 223.8 69.47 44.48 56.68 93.09 56.52 -2.6
    185 243.0 223.7 68.85 44.43 56.22 93.09 56.19 -3.1
    186 244.4 224.3 68.83 45.49 56.77 93.26 57.28 -2.3
    187 245.4 224.7 70.29 42.56 58.72 93.11 57.86 -2.1

    We expect that some of the possible independent variables are correlated with each other.

    cor_bangus <- dplyr::select(bangus,"cpi", "cpifbt", "gg":"soi") |> cor()
    corrplot::corrplot(cor_bangus, method = "square", addCoef.col = 'grey')

    From the correlation plot, the Consumer Price Index and the prices of the different goods are highly (almost perfectly) correlated. After all, the CPI is a function of the prices of different goods.

    full_bangus <- lm(bangus~ cpi   + cpifbt + gg + tilapia + pork + chicken + soi, data = bangus)
    car::vif(full_bangus)
    ##        cpi     cpifbt         gg    tilapia       pork    chicken        soi 
    ## 692.617905 719.811114  21.096698  53.688367  72.302318  27.236422   1.343132

    The CPI is a function of the prices of different goods. Hence, it is redundant to include it in the model together with the prices of the other goods. Furthermore, it is not logical or ideal to include it, since the CPI is computed only after the prices of goods in the market have been obtained. We therefore model bangus prices based on the prices of the other goods alone.

    model_bangus <- lm(bangus ~ gg + tilapia + pork + chicken + soi, data = bangus)
    car::vif(model_bangus)
    ##        gg   tilapia      pork   chicken       soi 
    ## 17.378676 41.040410 30.910613 17.715844  1.131336

    Even with the deletion of CPI, multicollinearity still exists. We now perform a principal component regression.

    goods_X <- bangus[,c("gg","tilapia","pork","chicken")]
    goods_prcomp <- prcomp(goods_X, center=TRUE, scale=TRUE)
    goods_prcomp
    ## Standard deviations (1, .., p=4):
    ## [1] 1.9700906 0.2656296 0.1727015 0.1354924
    ## 
    ## Rotation (n x k) = (4 x 4):
    ##               PC1         PC2        PC3        PC4
    ## gg      0.4965923 -0.71080913 -0.4817312  0.1268126
    ## tilapia 0.5039203  0.05211455  0.2233552 -0.8327430
    ## pork    0.5025945 -0.04211388  0.7097182  0.4918590
    ## chicken 0.4968492  0.70018632 -0.4629768  0.2203009

    The following is the matrix of principal components. The 4 principal components are expected to be uncorrelated with one another.

    goods_prcomp$x
    PC1 PC2 PC3 PC4
    -3.4167207 -0.0001203 0.1698407 -0.3102145
    -3.4115918 -0.0499675 0.1290518 -0.2736286
    -3.5849156 -0.0777258 0.0246243 0.0873308
    -3.5139191 -0.0963687 0.0500665 -0.0153196
    -3.5761917 -0.0547596 0.1013397 0.0116158
    -3.5222834 -0.0471807 0.1200640 -0.1591628
    -3.4753746 -0.0093265 0.0839277 -0.1726761
    -3.3395010 -0.0856934 0.0949625 -0.1088417
    -3.3277669 -0.0960592 0.1284178 -0.1517635
    -3.3444151 -0.0578773 0.1531286 -0.1744306
    -3.3492204 -0.0359716 0.1810597 -0.2052130
    -3.4031262 -0.0005471 0.1947318 -0.1613564
    -3.3741583 -0.0266459 0.1889254 -0.1866360
    -3.3948745 -0.0036598 0.1723420 -0.1495971
    -3.3593646 -0.0807837 0.1599036 -0.1672460
    -3.3348538 -0.0097847 0.1128295 -0.1173765
    -3.2748194 -0.0906032 0.0758639 -0.1104269
    -3.0931223 -0.0937761 0.0355505 -0.0914443
    -2.7936697 -0.2082521 -0.0513242 -0.1251064
    -2.7029416 -0.1984972 -0.0601527 -0.1367913
    -2.4975447 -0.2297597 0.0137933 -0.0659349
    -2.5069026 -0.2306667 0.0952427 -0.0181136
    -2.4067828 -0.1852848 0.1097869 -0.0199388
    -2.4092409 -0.0690421 0.1842261 0.0519156
    -2.2541963 -0.0579142 0.2007458 0.0975607
    -2.0340074 -0.0803830 0.1621741 0.0548165
    -1.9778392 -0.2166569 0.0600604 0.0975496
    -1.9204201 -0.2241808 -0.0088195 0.1111892
    -1.9314672 -0.2721011 -0.0395051 0.0550973
    -1.7096111 -0.3730068 -0.1160192 0.0523015
    -1.6404271 -0.4444766 -0.1285688 -0.0073151
    -1.5041339 -0.3383425 -0.1283301 -0.0038532
    -1.6488485 -0.1888370 -0.0517895 0.0573160
    -1.8192667 -0.1431141 0.0693644 0.0414112
    -1.8071447 -0.0343353 0.0900783 0.0740760
    -1.8912022 -0.0535323 0.0626243 0.0521024
    -1.6797670 -0.2215504 0.0334338 0.1062389
    -1.6914620 -0.1834301 -0.0308310 0.0482940
    -1.7286755 -0.2682534 -0.0188251 0.0280087
    -1.6709518 -0.2624036 -0.0816977 0.0356368
    -1.5666046 -0.3510723 -0.0800742 0.0302777
    -1.6148414 -0.2706058 -0.0691946 -0.0140088
    -1.3682357 -0.3175491 -0.0388327 -0.2773674
    -1.1127572 -0.2983590 -0.1846204 -0.0238072
    -1.2424493 -0.1717856 -0.0168610 0.0132979
    -1.4513805 -0.0988823 0.0144951 0.0547112
    -1.6977013 0.0257305 0.2218714 0.0944125
    -1.6958045 -0.1128160 0.2312274 0.1081280
    -1.6423895 -0.0021803 0.2008476 0.1032099
    -1.6337290 -0.1121549 0.1011888 0.0856431
    -1.5529947 -0.2305944 0.0451864 0.0947772
    -1.6535172 -0.1642075 0.0802386 0.1289016
    -1.6829259 -0.1333027 0.1358446 0.2107924
    -1.5193747 -0.2966211 0.0293530 0.1924627
    -1.2728625 -0.3173693 -0.0092326 0.1002011
    -1.1381096 -0.2221210 -0.1646678 0.0509311
    -1.2623035 -0.1726526 -0.0499550 0.0557423
    -1.4704121 0.0143853 0.0915073 0.1112776
    -1.5302151 0.1943940 0.1705607 0.1010745
    -1.5265431 0.2644049 0.2085466 0.1250697
    -1.4458250 0.2120153 0.1211029 0.1651908
    -1.3549307 0.1855096 0.0538958 0.2408479
    -1.2849458 0.2579479 0.0302119 0.2388792
    -1.2467896 0.2082075 -0.0459383 0.2982190
    -1.2567130 0.2371946 0.0075052 0.2701848
    -1.1202484 0.1159179 0.0345090 0.1406390
    -0.9446589 -0.0352629 -0.1300978 0.0909952
    -1.1546053 0.1052879 -0.2377066 0.0389673
    -1.1523962 0.1574541 -0.1912946 -0.0066518
    -1.1345176 0.1404685 -0.2142182 0.0048793
    -1.0979151 0.1411222 -0.0975030 -0.0431926
    -1.2119541 0.2203061 -0.1011892 0.0343915
    -1.1327321 0.2136907 -0.2077209 0.0519032
    -1.0992288 0.1248402 -0.2816648 0.1677998
    -1.0761886 0.2037226 -0.2795270 0.1980572
    -1.1177159 0.1657725 -0.1865565 0.1933279
    -1.0313179 0.1434186 -0.2021240 0.0736387
    -0.8047394 0.0374416 -0.3091080 0.0767360
    -0.4785145 -0.1686246 -0.4592794 0.0272013
    -0.1955806 -0.1299917 -0.4964199 -0.0965433
    -0.2814407 -0.1017614 -0.4029862 -0.1339186
    -0.3753344 0.0956658 -0.3010261 -0.0898057
    -0.5312273 0.2139542 -0.1221959 -0.0790707
    -0.4512718 0.1681029 -0.0918449 0.0084399
    -0.4098543 0.1758856 -0.0512204 0.0002255
    -0.3115934 0.0617512 -0.0116103 -0.0823397
    -0.1435443 -0.0330604 -0.1120832 -0.1071730
    -0.1749990 -0.1877567 -0.1310203 -0.0509260
    -0.1728873 -0.0363109 -0.0376970 -0.1383166
    0.0895804 -0.0584918 -0.1838883 -0.1382224
    0.3025953 -0.0492945 -0.3197525 -0.1690829
    -0.0247667 0.0812403 -0.2494301 0.0596355
    -0.1876362 0.1020934 0.2198441 0.0392670
    -0.1677176 0.1103889 -0.1614689 0.1642248
    -0.1862121 0.2555277 -0.1724630 0.2095908
    -0.3293226 0.4240803 0.0069073 0.1096148
    -0.3235390 0.3570554 0.0179288 0.0902663
    -0.2176238 0.2992523 -0.0369641 0.0517394
    -0.1150820 0.2680242 -0.1291947 0.0482007
    -0.1245398 0.3023200 -0.1233906 0.1016463
    -0.0589088 0.3084267 -0.1504425 0.1352085
    0.2314277 0.0683386 -0.2908004 0.0068328
    0.5489420 -0.0144321 -0.3976671 0.0628405
    0.5167569 0.1894734 -0.3378571 0.1225904
    0.3097770 0.3233836 -0.1477016 0.1121888
    0.3871903 0.2854169 -0.1218242 0.0389812
    0.3340597 0.3942049 -0.0038517 0.0517634
    0.3461505 0.4614166 0.0568586 0.1134253
    0.4423732 0.4384659 0.0882433 0.0936351
    0.4896547 0.2677430 0.0852766 0.0322504
    0.7606339 0.3082446 -0.0841508 0.0935326
    1.0511318 0.3867588 -0.2533082 0.1532303
    0.8860047 0.4978376 -0.0946959 0.1928421
    0.9350579 0.4001397 -0.0758016 0.1991491
    0.9110547 0.2392606 0.0709330 0.1613281
    1.5406095 0.1799667 -0.1110951 -0.0958725
    1.3020206 0.1354141 0.0536097 -0.0461514
    1.1705895 0.3414383 0.1439831 -0.1440175
    1.1445521 0.4598184 0.2006357 -0.1306870
    1.0985896 0.4607809 0.2615762 -0.1168833
    1.1722217 0.5157209 0.2601956 -0.0951901
    1.2248808 0.7138051 0.2412493 -0.0545689
    1.5081558 0.7784711 -0.0054654 0.0129448
    1.5187760 0.5004979 -0.0216849 0.0372434
    1.3205677 0.3024659 0.0415702 0.1415757
    1.5573915 0.0773062 -0.0518585 -0.0942620
    1.7211729 0.2524547 -0.1193540 -0.0908958
    1.6110430 0.1216397 -0.0718614 -0.1465044
    1.5650880 0.1735984 0.0437788 -0.2473747
    0.9833378 0.6846533 0.5532210 -0.3515509
    1.2901538 0.3466661 0.1895586 -0.2485617
    1.1354317 0.3767596 0.0482293 0.0330980
    1.2378184 0.3261179 0.1951606 -0.1301856
    1.3503038 0.2208869 0.0815414 -0.1776934
    1.4279030 0.1578062 0.0051843 -0.1285111
    1.3888705 0.2970566 0.0220920 -0.1129495
    1.4457588 0.1556984 -0.1279269 -0.1809131
    1.5566595 0.1344106 -0.2078757 -0.1661129
    1.9927377 0.1219750 -0.4274381 -0.1318948
    2.0303956 -0.3758799 -0.3660190 -0.2642549
    2.1839885 -0.2817862 -0.3040819 -0.3375372
    2.2154516 -0.3565740 -0.2861784 -0.2989271
    1.9008792 -0.0203982 0.0444743 -0.2945610
    2.0812116 0.0967388 0.0080682 -0.1893106
    2.0976414 0.1366971 0.1122677 -0.1834957
    2.2842658 -0.1582175 -0.0604402 -0.0208094
    2.2941344 -0.0906372 -0.0547029 -0.0033068
    2.2451089 0.0411687 -0.0090897 -0.0132113
    2.0060613 -0.3018196 0.1175292 -0.0056256
    2.3358212 -0.2192438 -0.0843092 -0.0178411
    2.4527858 -0.0391324 -0.1366847 -0.0463629
    2.5860151 -0.1605112 -0.1279637 -0.1453192
    2.4416299 -0.1162151 0.0716494 -0.3175653
    2.2851508 0.0611003 0.1742273 -0.2266981
    2.2904196 0.0324784 0.1775527 -0.2265701
    2.2132810 0.0065206 0.1982786 -0.1598240
    2.1942335 0.2307648 0.1852594 -0.0650042
    2.2334743 0.1468801 0.0761731 -0.0332669
    2.3337191 0.1996958 -0.0893669 0.0475277
    2.3681969 0.0453837 -0.1448276 0.0827567
    2.2252682 0.0958790 -0.0351557 0.1988405
    2.2676819 -0.0180237 -0.1559745 0.0639566
    2.4510249 -0.0992511 -0.0882572 -0.0165918
    2.7810029 -0.5275893 -0.2688049 -0.0643030
    2.6509265 -0.5708729 -0.1818106 -0.0460366
    2.5990094 -0.1358414 -0.1009704 -0.0213423
    2.6928895 0.0673853 0.1254471 0.0419237
    2.5948354 0.0361383 0.1128163 0.1148285
    2.6995153 0.1466560 0.1078183 0.1001908
    2.5825691 -0.1989522 0.2641698 0.0137134
    2.6668295 -0.3261270 0.2385583 0.1341530
    2.5201598 -0.2766464 0.1891003 0.0870934
    2.5365918 -0.2884704 0.2007173 0.1764886
    2.5400300 -0.5417809 0.2228081 0.0656339
    2.7011439 -0.4284346 0.1717081 0.0664751
    2.6695863 -0.3786656 0.1711025 0.0856271
    2.7911845 -0.4772814 0.1243144 0.0295579
    2.9616681 -0.7318769 -0.0324820 0.0142268
    2.8486694 -0.4649575 0.1194469 0.0162676
    2.8169371 -0.2779173 0.1681423 0.0254332
    2.8038534 -0.4049203 0.2787650 0.0341963
    2.7915040 -0.3482024 0.2640639 0.1394220
    2.8787478 -0.4036547 0.1916210 0.1938069
    2.7161123 -0.4191634 0.2454428 0.2226688
    2.6888125 -0.4329233 0.2518497 0.2401415
    2.7948110 -0.4499394 0.1830457 0.2451470
    2.7346378 -0.2225416 0.3174185 0.1185411

    We can use the Principal Components as predictors of bangus price, as in the sketch below.
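    A sketch of such a principal component regression (assuming bangus and goods_prcomp from above; using only the first two price components plus soi is an illustrative choice):

    pcr_data  <- data.frame(price = bangus$bangus, goods_prcomp$x, soi = bangus$soi)
    pcr_model <- lm(price ~ PC1 + PC2 + soi, data = pcr_data)
    car::vif(pcr_model)   # the PCs are mutually uncorrelated, so the VIFs should be small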

    Limitation of PCR: model interpretation may be difficult.

    More details about the theory of variable reduction and Principal Component Analysis are covered in Stat 147. Here in Stat 136, it is not recommended that you use this in your paper.


12.5 Exercise

Exercise 12.2 Using the bangus dataset and the model containing all possible independent variables in the dataset, show the VIF, condition indices, and variance proportions for detecting multicollinearity. Interpret the results.

bangus_model <- lm(bangus ~ cpi + cpifbt + gg + tilapia + pork + chicken + soi, data = bangus)

© 2024 UP School of Statistics. All rights reserved.