Chapter 10 Lab 8 - 03/12/2021
In this lecture:
- we give some hints on RMarkdown, which you need to know for the PSBF exam
- we study multiple linear regression in R
10.1 RMarkdown
RMarkdown is a framework for producing reproducible data science reports that combine text, R code and the corresponding output (official website: https://rmarkdown.rstudio.com/). See this video for an introduction to RMarkdown: https://vimeo.com/178485416. To use RMarkdown you have to install the `rmarkdown` and `knitr` packages.
For the exercise part of the PSBF exam you will receive an RMarkdown document, i.e. a file with the `.Rmd` extension (see for example the file PSBF_Exam_FacSimile.Rmd available on the PSBF Moodle page). You can open the file (by double clicking) with RStudio.
In the top part of the file, as shown in Figure 10.1, you just have to write your surname, name and student ID; the rest must not be modified. You can compile the RMarkdown file into an html file by using the Knit button (see the yellow circle in Figure 10.1). If the compilation completes correctly, a web page opens with your html document, and the file PSBF_Exam_FacSimile.html is saved in the same folder. If you want to see the web page directly in the bottom right panel of RStudio, click on the wheel button (see the purple circle in Figure 10.1) and select Preview in Viewer Pane (then compile your document again with Knit; it will appear in the Viewer pane).
In Figure 10.2 you can see the beginning of Exercise 1 (see the purple rectangle) and the first sub-exercise (1.); this part must not be modified. You have instead to write your code and your comments (preceded by `#`) in the yellow area delimited by the markers `` ```{r} `` and `` ``` `` (this area is known as a code chunk). To check what your code produces you can:
- run each line of code separately by using Ctrl/Cmd + Enter. This is the approach you have used so far; you will find the results in the console and the new objects in your environment (see the top right panel);
- use the arrow located in the right part of the code chunk (see the orange arrow in Figure 10.2). This will run all the code lines included in the chunk; again, the results appear in the console and the new objects in your environment (see the top right panel).
In any case, you have to compile your document (using the Knit button) after each sub-exercise. This makes it possible to check the final html file step by step. At the end of your exam you will have to deliver both the .Rmd and the .html file (you will upload them to the PSBF Moodle page).
Remember that if some code gives errors or doesn't work, you can always keep it in your .Rmd file by commenting it out.
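For example, a minimal sketch of how a completed chunk might look (the variable names and values are hypothetical):

```{r}
# compute the mean of a hypothetical vector x
x <- c(1.2, 0.8, 1.5)
mean(x)
# mean(y)   # this line gave an error (y does not exist), so it is kept commented out
```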
10.2 Preliminaries on multiple linear regression
We introduce the multiple linear regression model, given by the following equation: \[ Y = \beta_0+\beta_1 X_1 +\beta_2X_2+\ldots+\beta_pX_p+ \epsilon \] where \(\beta_0\) is the intercept and \(\beta_j\) is the coefficient of regressor \(X_j\) (\(j=1,\ldots,p\)). The term \(\epsilon\) represents the error with mean equal to zero and variance \(\sigma^2_\epsilon\).
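As a toy illustration of this data-generating process, the following is a minimal sketch (all coefficients hypothetical) that simulates from a model with \(p=2\) regressors and checks that `lm` approximately recovers the true parameters:

set.seed(1)                            # for reproducibility
n <- 1000
X1 <- rnorm(n); X2 <- rnorm(n)         # two hypothetical regressors
eps <- rnorm(n, mean = 0, sd = 0.5)    # error term with mean zero
Y <- 1 + 2 * X1 - 0.5 * X2 + eps       # true beta0 = 1, beta1 = 2, beta2 = -0.5
coef(lm(Y ~ X1 + X2))                  # estimates should be close to the true values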
We will use the same data as in Lab 7 (file datareg_logreturns.csv), regarding the daily log-returns (i.e. relative changes) of:
- the NASDAQ index
- ibm, lenovo, apple, amazon, yahoo
- gold
- the SP500 index
- the CBOE treasury note Interest Rate (10 Year)
We start with data import and structure check:
data_logreturns = read.csv("files/datareg_logreturns.csv", sep=";")
str(data_logreturns)
## 'data.frame': 1258 obs. of 10 variables:
## $ Date : chr "27/10/2011" "28/10/2011" "31/10/2011" "01/11/2011" ...
## $ ibm : num 0.02126 0.00841 -0.01516 -0.01792 0.01407 ...
## $ lenovo: num -0.00698 -0.02268 -0.04022 -0.00374 0.07641 ...
## $ apple : num 0.010158 0.000642 -0.00042 -0.020642 0.002267 ...
## $ amazon: num 0.04137 0.04972 -0.01769 -0.00663 0.01646 ...
## $ yahoo : num 0.02004 -0.00422 -0.05716 -0.04646 0.01132 ...
## $ nasdaq: num 0.032645 -0.000541 -0.019456 -0.029276 0.012587 ...
## $ gold : num -0.023184 0.006459 0.030717 0.043197 -0.000674 ...
## $ SP : num 0.033717 0.000389 -0.025049 -0.02834 0.015976 ...
## $ rate : num 0.0836 -0.0379 -0.0585 -0.0834 0.0025 ...
10.3 Multiple linear regression model
We start by implementing (again) the simple linear regression model already described in Section 9.1.2, which considers `nasdaq` as the dependent (response) variable and `SP` as the independent variable.
mod1 = lm(nasdaq ~ SP, data = data_logreturns)
summary(mod1)
##
## Call:
## lm(formula = nasdaq ~ SP, data = data_logreturns)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.0128494 -0.0018423 0.0002159 0.0020178 0.0115080
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.443e-05 8.777e-05 0.848 0.397
## SP 1.085e+00 1.019e-02 106.471 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.003109 on 1256 degrees of freedom
## Multiple R-squared: 0.9003, Adjusted R-squared: 0.9002
## F-statistic: 1.134e+04 on 1 and 1256 DF, p-value: < 2.2e-16
anova(mod1)
## Analysis of Variance Table
##
## Response: nasdaq
## Df Sum Sq Mean Sq F value Pr(>F)
## SP 1 0.109579 0.10958 11336 < 2.2e-16 ***
## Residuals 1256 0.012141 0.00001
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
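As a side note, the quantities printed by `summary` can also be extracted programmatically, which is convenient when they are needed in later computations; a minimal sketch (assuming `mod1` is in the workspace):

coef(mod1)              # estimated coefficients (intercept and slope)
confint(mod1)           # 95% confidence intervals for the coefficients
summary(mod1)$r.squared # multiple R-squared
summary(mod1)$sigma     # residual standard error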
We now implement the full model by including all \(p=8\) available regressors. In the R function `lm` the regressors are specified by name and separated by `+`:
mod2 = lm(nasdaq ~ ibm + lenovo + apple + amazon + yahoo + gold + SP + rate,
          data = data_logreturns)
summary(mod2)
##
## Call:
## lm(formula = nasdaq ~ ibm + lenovo + apple + amazon + yahoo +
## gold + SP + rate, data = data_logreturns)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.011522 -0.001574 0.000089 0.001626 0.010487
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.097e-06 7.208e-05 0.085 0.9326
## ibm -9.625e-03 7.666e-03 -1.256 0.2095
## lenovo 5.123e-03 3.361e-03 1.524 0.1277
## apple 9.209e-02 5.008e-03 18.389 <2e-16 ***
## amazon 5.706e-02 4.285e-03 13.316 <2e-16 ***
## yahoo 3.901e-02 4.681e-03 8.333 <2e-16 ***
## gold -5.936e-03 2.931e-03 -2.025 0.0431 *
## SP 8.884e-01 1.462e-02 60.762 <2e-16 ***
## rate 2.604e-03 3.471e-03 0.750 0.4532
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.002547 on 1249 degrees of freedom
## Multiple R-squared: 0.9334, Adjusted R-squared: 0.933
## F-statistic: 2189 on 8 and 1249 DF, p-value: < 2.2e-16
In the summary table we find the parameter estimates \(\hat \beta_0,\hat \beta_1,\ldots,\hat\beta_p\) with the corresponding standard errors (a measure of estimate precision), together with the residual standard error \(\hat\sigma_\epsilon=\sqrt{\frac{SSE}{n-1-p}}\). By means of the t value and the corresponding p-value we can test the hypotheses \(H_0:\beta_j=0\) vs \(H_1:\beta_j\neq 0\) separately for each covariate coefficient (\(j=1,\ldots,p\)). In this case we do not reject \(H_0\) for the regressors `ibm`, `lenovo` and `rate` (i.e. the corresponding parameters can be considered null and there is no evidence of a linear relationship). Note that for `gold` there is only weak evidence from the data against \(H_0\).
What is the interpretation of the generic \(\hat\beta_j\)? For `apple`, for example, the interpretation is the following: when the `apple` log-return changes by one unit, we expect the `nasdaq` log-return to increase by 0.0920924 (when all other covariates are held fixed).
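This interpretation can be checked numerically with `predict`; a minimal sketch comparing two hypothetical observations that are identical except for a one-unit difference in `apple` (the other regressors are set to zero purely for illustration):

nd <- data.frame(ibm = 0, lenovo = 0, apple = c(0, 1), amazon = 0,
                 yahoo = 0, gold = 0, SP = 0, rate = 0)
diff(predict(mod2, newdata = nd))  # equals the estimated apple coefficient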
10.3.1 Analysis of variance (sequential and global F tests)
We now apply the function `anova` to the multiple regression model. This provides sequential tests on the single regressors (which are entered one at a time, in a hierarchical fashion) and is useful to assess the effect of adding a new predictor in a given order.
anova(mod2)
## Analysis of Variance Table
##
## Response: nasdaq
## Df Sum Sq Mean Sq F value Pr(>F)
## ibm 1 0.040144 0.040144 6186.8396 <2e-16 ***
## lenovo 1 0.006941 0.006941 1069.7375 <2e-16 ***
## apple 1 0.020520 0.020520 3162.4311 <2e-16 ***
## amazon 1 0.013850 0.013850 2134.4469 <2e-16 ***
## yahoo 1 0.006197 0.006197 955.0328 <2e-16 ***
## gold 1 0.000006 0.000006 0.8585 0.3543
## SP 1 0.025956 0.025956 4000.1962 <2e-16 ***
## rate 1 0.000004 0.000004 0.5629 0.4532
## Residuals 1249 0.008104 0.000006
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
In the `anova` table, 0.0081042 is the value of SSE (note that it is lower than the corresponding value observed for `mod1`).
As shown also in Figures 10.3 and 10.4, the SS values reported in the column `Sum Sq` should be read as the increase in \(SS_R\) when a new variable is added to the predictors already in the model. It is important to point out that the order of the regressors (in the `lm` formula) matters and affects the output of the `anova` function.
The value 0.0401439 represents the additional SSR of the model \(Y=\beta_0+\beta_1 ibm+\epsilon\) with respect to the model with no regressors, \(Y=\beta_0+\epsilon\) (for the latter SSR = 0). Note that the two models differ by just one predictor (that is why `Df` = 1). Analyzing the corresponding F-statistic (with 1 and 1249 degrees of freedom) and p-value, we conclude that the coefficient of `ibm` is significantly different from zero (and thus `ibm` is a useful predictor). Similarly, the value 0.0069411 represents the additional SSR of the model \(Y=\beta_0+\beta_1 ibm+\beta_2 lenovo + \epsilon\) with respect to the model containing only `ibm`: \(Y=\beta_0+\beta_1 ibm+\epsilon\). Also in this case the p-value leads us to conclude that `lenovo` is a useful covariate. This reasoning can be repeated step by step for all the regressors. Finally, by summing all the sequential additional SSRs we obtain the total SSR for the model with \(p=8\) regressors:
mod2_an = anova(mod2)
SSR = sum(mod2_an$`Sum Sq`[1:8])
SSR
## [1] 0.1136159
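The sequential reading of the table can also be verified directly: for a fitted `lm` object, `deviance` returns the residual sum of squares, so each additional SSR equals the drop in SSE between two nested models. A minimal sketch for the first step (`ibm`):

m0 <- lm(nasdaq ~ 1, data = data_logreturns)    # model with no regressors
m1 <- lm(nasdaq ~ ibm, data = data_logreturns)  # add ibm
deviance(m0) - deviance(m1)                     # additional SSR for ibm (about 0.0401)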
The value of SSE is instead contained in the last row of the `anova` table, under the name `Residuals`:
SSE = mod2_an$`Sum Sq`[9]
SSE
## [1] 0.008104249
The values of SSR and SSE can then be used for the global F-test, also reported in the last line of the `summary` output. The hypotheses are \(H_0: \beta_1=\beta_2=\ldots=\beta_p=0\) (corresponding to the model \(Y=\beta_0+\epsilon\)) vs \(H_1: \text{at least one } \beta\neq 0\). As reported in Figure 10.5, the corresponding F-statistic is given by
\[
\text{F-value}=\frac{SS_R/p}{SS_E/(n-1-p)}=\frac{MS_R}{MS_E}
\]
which, under \(H_0\), follows the F distribution with \(p\) and \(n-1-p\) degrees of freedom.
If we reject \(H_0\), it means that at least one of the covariates has predictive power in our linear model, i.e. that using the regression is predictively better than just using the average \(\bar y\) (which is the best prediction in the case of the trivial model with no regressors, \(Y=\beta_0+\epsilon\)).
The p-value of this test is reported in the last line of the `summary` output (`F-statistic: 2189 on 8 and 1249 DF, p-value: < 2.2e-16`). As expected, the p-value is very small and there is strong evidence for rejecting \(H_0\).
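As a check, the F-statistic and its p-value can also be computed manually from the quantities derived above; a minimal sketch (reusing `SSR` and `SSE`):

n <- nrow(data_logreturns)                  # 1258 observations
p <- 8                                      # number of regressors in mod2
Fval <- (SSR / p) / (SSE / (n - 1 - p))
Fval                                        # about 2189, as in the summary output
pf(Fval, p, n - 1 - p, lower.tail = FALSE)  # corresponding p-value (essentially zero)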
10.3.2 Adjusted \(R^2\)
By using SSR, SSE, SST and the corresponding degrees of freedom (see Figure 10.5), it is possible to compute the adjusted \(R^2\) goodness-of-fit index: \[ adjR^2 =1- \frac{\frac{SS_E}{n-1-p}}{\frac{SS_T}{n-1}}=1-\frac{MS_E}{MS_T} \] Since the formula of \(adjR^2\) involves \(p\), there is a penalty for the number of regressors. Thus \(adjR^2\), unlike the standard \(R^2\) introduced in Section 9.1.3, can either increase or decrease when a regressor is added: it increases only when the added variable decreases the SSE enough to compensate for the increase in \(p\).
It can be computed manually as follows:
p = 8
n = nrow(data_logreturns)
SST = sum((data_logreturns$nasdaq - mean(data_logreturns$nasdaq))^2)
1 - (SSE/(n-1-p))/(SST/(n-1))
## [1] 0.9329925
but it is also reported in the `summary` output: `Adjusted R-squared: 0.933`. This value denotes a very high goodness of fit.
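The same value can also be extracted programmatically from the fitted model; a minimal sketch:

summary(mod2)$adj.r.squared  # adjusted R-squared, about 0.933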
10.3.3 Removing non-significant regressors one by one
From the t tests reported in the `summary` output of `mod2`, we observed that some of the regressors are not significant. For this reason we remove them one by one, starting from the one with the highest p-value (`rate`). The following is the new model:
mod3 = lm(nasdaq ~ ibm + lenovo + apple + amazon + yahoo + gold + SP, data = data_logreturns)
summary(mod3)
##
## Call:
## lm(formula = nasdaq ~ ibm + lenovo + apple + amazon + yahoo +
## gold + SP, data = data_logreturns)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.0114711 -0.0015527 0.0000856 0.0016361 0.0104669
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.496e-06 7.204e-05 0.062 0.9502
## ibm -9.537e-03 7.664e-03 -1.244 0.2136
## lenovo 5.244e-03 3.357e-03 1.562 0.1185
## apple 9.211e-02 5.007e-03 18.396 <2e-16 ***
## amazon 5.692e-02 4.280e-03 13.298 <2e-16 ***
## yahoo 3.906e-02 4.680e-03 8.346 <2e-16 ***
## gold -6.045e-03 2.927e-03 -2.066 0.0391 *
## SP 8.913e-01 1.409e-02 63.258 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.002547 on 1250 degrees of freedom
## Multiple R-squared: 0.9334, Adjusted R-squared: 0.933
## F-statistic: 2502 on 7 and 1250 DF, p-value: < 2.2e-16
We note that \(R^2_{adj}\) doesn't change (thus including `rate` does not decrease the SSE appreciably). We go on by removing `ibm`, which still has a high p-value:
mod4 = lm(nasdaq ~ lenovo + apple + amazon + yahoo + gold + SP, data = data_logreturns)
summary(mod4)
##
## Call:
## lm(formula = nasdaq ~ lenovo + apple + amazon + yahoo + gold +
## SP, data = data_logreturns)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.0111649 -0.0015513 0.0000687 0.0016327 0.0108501
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.350e-06 7.199e-05 0.116 0.9077
## lenovo 5.410e-03 3.355e-03 1.612 0.1071
## apple 9.214e-02 5.008e-03 18.398 <2e-16 ***
## amazon 5.711e-02 4.278e-03 13.349 <2e-16 ***
## yahoo 3.941e-02 4.672e-03 8.435 <2e-16 ***
## gold -5.886e-03 2.925e-03 -2.013 0.0444 *
## SP 8.823e-01 1.208e-02 73.027 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.002547 on 1251 degrees of freedom
## Multiple R-squared: 0.9333, Adjusted R-squared: 0.933
## F-statistic: 2918 on 6 and 1251 DF, p-value: < 2.2e-16
We note that \(R^2_{adj}\) doesn't change. We go on by removing `lenovo`:
mod5 = lm(nasdaq ~ apple + amazon + yahoo + gold + SP, data = data_logreturns)
summary(mod5)
##
## Call:
## lm(formula = nasdaq ~ apple + amazon + yahoo + gold + SP, data = data_logreturns)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.0114026 -0.0015730 0.0000785 0.0016151 0.0107557
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.403e-06 7.202e-05 0.089 0.929
## apple 9.191e-02 5.009e-03 18.348 <2e-16 ***
## amazon 5.702e-02 4.281e-03 13.321 <2e-16 ***
## yahoo 3.987e-02 4.666e-03 8.545 <2e-16 ***
## gold -5.763e-03 2.925e-03 -1.970 0.049 *
## SP 8.871e-01 1.171e-02 75.727 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.002549 on 1252 degrees of freedom
## Multiple R-squared: 0.9332, Adjusted R-squared: 0.9329
## F-statistic: 3496 on 5 and 1252 DF, p-value: < 2.2e-16
The p-value of `gold` is very close to \(\alpha=0.05\): this is only weak evidence against the hypothesis \(H_0:\beta_{gold}=0\). We try to remove `gold` as well and see what happens:
mod6 = lm(nasdaq ~ apple + amazon + yahoo + SP, data = data_logreturns)
summary(mod6)
##
## Call:
## lm(formula = nasdaq ~ apple + amazon + yahoo + SP, data = data_logreturns)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.0114132 -0.0015924 0.0000558 0.0016247 0.0106998
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.367e-06 7.211e-05 0.102 0.919
## apple 9.186e-02 5.015e-03 18.318 <2e-16 ***
## amazon 5.731e-02 4.283e-03 13.380 <2e-16 ***
## yahoo 3.964e-02 4.670e-03 8.489 <2e-16 ***
## SP 8.871e-01 1.173e-02 75.638 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.002552 on 1253 degrees of freedom
## Multiple R-squared: 0.933, Adjusted R-squared: 0.9327
## F-statistic: 4359 on 4 and 1253 DF, p-value: < 2.2e-16
We observe that the \(R^2_{adj}\) of `mod6` is very similar to that of `mod5` (and still denotes a very high goodness of fit!), but `mod6` is more parsimonious because it has one parameter fewer (it is less complex). For this reason `mod6` should be preferred.
10.3.4 AIC computation
To compare models it is also possible to use the Akaike Information Criterion (AIC), given in the case of the multiple regression model by \[ \text{AIC}=c + n \log({\hat\sigma^2_\epsilon})+2(1+p), \] where \(c\) is a constant which is irrelevant for model comparison. The decision rule for the AIC is: the lower, the better.
It is possible to compute the AIC by using the `extractAIC` function:
extractAIC(mod5)
## [1] 6.00 -15019.69
extractAIC(mod6)
## [1] 5.0 -15017.8
The function returns a vector of length 2: the first element is the total number of parameters (\(p\) + 1, i.e. the regressors plus the intercept), while the second element is the AIC value. In this case the AIC is lower for `mod5`, but the two AIC values are very similar. For this reason, considering also what was said in the previous section, we still prefer `mod6`: it has a very high goodness of fit and is less complex.
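Note that base R also provides the `AIC` function, which is computed from the full log-likelihood and therefore differs from `extractAIC` by an additive constant (and by how the parameters are counted); for models fitted to the same data, however, the ranking is the same. A minimal sketch:

AIC(mod5)  # absolute values differ from extractAIC(), the ranking does not
AIC(mod6)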
10.4 An automatic procedure
As an alternative, we can use the `step()` function, which implements a sequential procedure to find an optimal (actually, sub-optimal) model:
mod_step <- step(mod2)
## Start: AIC=-15018.43
## nasdaq ~ ibm + lenovo + apple + amazon + yahoo + gold + SP +
## rate
##
## Df Sum of Sq RSS AIC
## - rate 1 0.0000037 0.008108 -15020
## - ibm 1 0.0000102 0.008114 -15019
## <none> 0.008104 -15018
## - lenovo 1 0.0000151 0.008119 -15018
## - gold 1 0.0000266 0.008131 -15016
## - yahoo 1 0.0004506 0.008555 -14952
## - amazon 1 0.0011505 0.009255 -14853
## - apple 1 0.0021942 0.010298 -14719
## - SP 1 0.0239564 0.032061 -13290
##
## Step: AIC=-15019.86
## nasdaq ~ ibm + lenovo + apple + amazon + yahoo + gold + SP
##
## Df Sum of Sq RSS AIC
## - ibm 1 0.0000100 0.008118 -15020
## <none> 0.008108 -15020
## - lenovo 1 0.0000158 0.008124 -15019
## - gold 1 0.0000277 0.008136 -15018
## - yahoo 1 0.0004518 0.008560 -14954
## - amazon 1 0.0011470 0.009255 -14855
## - apple 1 0.0021951 0.010303 -14720
## - SP 1 0.0259556 0.034064 -13216
##
## Step: AIC=-15020.3
## nasdaq ~ lenovo + apple + amazon + yahoo + gold + SP
##
## Df Sum of Sq RSS AIC
## <none> 0.008118 -15020
## - lenovo 1 0.000017 0.008135 -15020
## - gold 1 0.000026 0.008144 -15018
## - yahoo 1 0.000462 0.008580 -14953
## - amazon 1 0.001156 0.009274 -14855
## - apple 1 0.002196 0.010314 -14721
## - SP 1 0.034606 0.042724 -12933
summary(mod_step)
##
## Call:
## lm(formula = nasdaq ~ lenovo + apple + amazon + yahoo + gold +
## SP, data = data_logreturns)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.0111649 -0.0015513 0.0000687 0.0016327 0.0108501
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.350e-06 7.199e-05 0.116 0.9077
## lenovo 5.410e-03 3.355e-03 1.612 0.1071
## apple 9.214e-02 5.008e-03 18.398 <2e-16 ***
## amazon 5.711e-02 4.278e-03 13.349 <2e-16 ***
## yahoo 3.941e-02 4.672e-03 8.435 <2e-16 ***
## gold -5.886e-03 2.925e-03 -2.013 0.0444 *
## SP 8.823e-01 1.208e-02 73.027 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.002547 on 1251 degrees of freedom
## Multiple R-squared: 0.9333, Adjusted R-squared: 0.933
## F-statistic: 2918 on 6 and 1251 DF, p-value: < 2.2e-16
We can compare the AIC index of `mod6` with that of the model identified by the `step` function:
extractAIC(mod_step)
## [1] 7.0 -15020.3
extractAIC(mod6)
## [1] 5.0 -15017.8
As expected, `mod6` has a higher (i.e. worse) AIC value; nevertheless we still decide to keep it, bearing in mind the considerations made above.
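As a side note, `step` also accepts a `k` argument setting the penalty per parameter (the default `k = 2` corresponds to the AIC). A minimal sketch with the BIC-like penalty `k = log(n)`, which penalizes complexity more heavily and thus tends to select smaller models:

mod_step_bic <- step(mod2, k = log(nrow(data_logreturns)), trace = 0)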
10.4.1 Plot of observed and predicted values
In the case of the multiple regression model it is not possible to plot the estimated regression function, as we did for the simple regression model in Section 9.1.2: the fitted model is a hyperplane which cannot be represented in a 2-D plot. However, we can plot the observed values (\(y\)) against the predicted values (\(\hat y\), also known as fitted values) of the response variable to get an idea of how good the model is at prediction:
plot(data_logreturns$nasdaq,mod6$fitted.values)
The cloud of points is thin and narrow: this is the sign of a strong linear relationship (the correlation is equal to 0.9658989) and of a good predictive performance of the model.
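A minimal sketch of a slightly enhanced version of this plot, adding the 45-degree line (on which observed and fitted values coincide) as a visual reference and computing the correlation quoted above:

plot(data_logreturns$nasdaq, mod6$fitted.values,
     xlab = "Observed nasdaq log-returns", ylab = "Fitted values")
abline(0, 1, col = "red", lwd = 2)               # reference line: perfect predictions
cor(data_logreturns$nasdaq, mod6$fitted.values)  # about 0.966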
10.5 Variance Inflation Factor
The variance inflation factor (VIF) tells us how much the variance of \(\hat \beta_j\), for a given regressor \(X_j\), is increased by having other predictors in the model (\(j=1,\ldots,p\)). It is given by \[ VIF_j = \frac{1}{1-R^2_j}\geq 1 \] where \(R^2_j\) is the goodness-of-fit index of the model which has \(X_j\) as dependent variable and the remaining regressors as independent variables.
Note that \(VIF_j\) doesn't provide any information about the relationship between the response and \(X_j\); rather, it tells us only how correlated \(X_j\) is with the other predictors. We want the VIF to be close to one (the value it takes when all the regressors are independent).
In R the VIF can be computed by using the `vif` function contained in the `faraway` package:
library(faraway)
vif(mod6)
## apple amazon yahoo SP
## 1.335137 1.370225 1.381315 1.967353
All four values (one for each regressor in the model) are close to one, so we can conclude that there are no collinearity problems. A standard threshold for identifying problematic collinearity is a VIF of 5 (or 10).
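The definition can also be verified by hand: regress one predictor on the remaining ones and plug the resulting \(R^2_j\) into the formula. A minimal sketch for `SP`:

r2_SP <- summary(lm(SP ~ apple + amazon + yahoo, data = data_logreturns))$r.squared
1 / (1 - r2_SP)  # should match the vif(mod6) value for SP, about 1.97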
10.6 Exercise Lab 8
10.6.1 Exercise 1
A linear regression model with three predictor variables was fit to a dataset with 40 observations. The correlation between the observed data \(y\) and the predicted values \(\hat y\) is 0.65. The total sum of squares (SST) is 100.
- What is the value of the non-adjusted goodness-of-fit index (\(R^2\))? You have to think of an alternative way to compute \(R^2\)… see the theory lectures. Comment on the value.
- What is the value of the residual sum of squares (SSE)?
- What is the value of the regression sum of squares (SSR)?
- What is the estimate of \(\sigma^2_\epsilon\)?
- Fill in the following ANOVA table. To compute the p-value associated with the F-value of the F statistic you have to use the `pf` function (see `?pf`):
| Source of variability | Df | Sum of Squares (SS) | Mean Sum of Squares (MS) | F value | p(>F) |
|---|---|---|---|---|---|
| Regressors | … | … | … | … | … |
| Residuals | … | … | … | | |
| Total | … | … | | | |
- What do you think about the fitted model?
10.6.2 Exercise 2
- Complete the following ANOVA table for the linear regression model \(Y=\beta_0+\beta_1X_1+\beta_2X_2+\epsilon\). Explain all the steps of your computations. Suggestion: start with the degrees of freedom and then the p-value. To obtain the F-value given the p-value you will need the `qf` function (see `?qf`).
| Source of variability | Df | Sum of Squares (SS) | Mean Sum of Squares (MS) | F value | p(>F) |
|---|---|---|---|---|---|
| Regressors | … | … | … | … | 0.04 |
| Residuals | … | 5.66 | … | | |
| Total | 15 | … | | | |
- What do you think about the fitted model?
- Determine the value of \(R^2\) and of the adjusted \(R^2\). Comment.
10.6.3 Exercise 3
Use again the data in the prices_5Y.csv file already used for Lab 7. They refer to daily prices (`Adj.Close`) for the period 05/11/2012-03/11/2017 for the following assets: Apple (`AAPL`), Intel (`INTC`), Microsoft (`MSFT`) and Google (`GOOGL`).
- Import the data in R.
- Create a new data frame containing the log-returns for all the assets.
- Estimate the parameters of the multiple linear model which considers `GOOGL` as the dependent variable and includes all the remaining explanatory variables. Are all the coefficients significantly different from zero? Comment on the results. Do you suggest removing some of the regressors?
- Comment on the goodness of fit of the model.
- Comment on the `anova` table for the multiple linear regression model. Moreover, derive from the table the values of SSR, SSE and SST.
- Compute manually the value of the F-statistic reported in the last line of the `summary` output. Compute also the corresponding p-value and comment on the result.
- Plot the observed and predicted values of the response variable.