2.3 Evaluating and interpreting the model
We are now ready to carry out the simple linear regression analysis. The results are as follows:
```
Call:
lm(formula = happiness_2019 ~ income_2019, data = df)

Residuals:
     Min       1Q   Median       3Q      Max
-19.4572  -3.5785  -0.1413   3.8410  17.5070

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.478e+01  1.559e+00   28.72  < 2e-16 ***
income_2019 5.642e-04  5.489e-05   10.28 4.94e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.768 on 76 degrees of freedom
  (2 observations deleted due to missingness)
Multiple R-squared:  0.5816,    Adjusted R-squared:  0.5761
F-statistic: 105.6 on 1 and 76 DF,  p-value: 4.945e-16
```
From the above output, we can note the following:

- The results for ˆβ0 and ˆβ1 are under the heading `Coefficients:`. The first row, `(Intercept)`, corresponds to the intercept coefficient ˆβ0, while the second row, `income_2019`, corresponds to the slope coefficient ˆβ1.
- The estimate for β0 is `4.478e+01`. The `e+01` tells us to move the decimal point one place to the right, so we have that ˆβ0=44.78.
- The estimate for β1 is `5.642e-04`. The `e-04` tells us to move the decimal point four places to the left, so we have that, rounded to four decimal places, ˆβ1=0.0006.
- Knowing the values of ˆβ0 and ˆβ1, we can write down the estimated model as ^Happiness=44.78+0.0006×Income.
- We can interpret the value of ˆβ1=0.0006 as follows: "We estimate that, on average, for every $1 increase in GDP per capita, the happiness score is 0.0006 higher." Equivalently, a $10,000 increase in GDP per capita is associated with an estimated increase of 6 in the happiness score.
- Reading from the column labelled `Pr(>|t|)`, the p-value for the intercept coefficient is `< 2e-16`, which is very close to zero. This is a test of the form H0:β0=0 versus H1:β0≠0.
- The p-value for the slope coefficient is `4.94e-16`, which is also very close to zero. This is a test of the form H0:β1=0 versus H1:β1≠0. Since p<0.05, we reject H0 and conclude that β1 is not zero, meaning there is evidence of a significant linear association between income and happiness. (More information on this below.)
- The `Multiple R-squared` value, which can be found in the second-last row of the output, is R2=0.5816. This indicates that 58.16% of the variation in the response can be explained by the model, which is a good fit. (More information on this below.)
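Rather than reading these quantities off the printed summary, they can also be extracted from the fitted model object in R. Below is a minimal sketch using simulated data as a stand-in: the variable names match the output above, but the data values are made up for illustration, since the original data frame is not reproduced here.

```r
# Simulated stand-in for the real data set (values are made up for illustration;
# only the variable names match the analysis above).
set.seed(1)
income_2019    <- runif(78, min = 1000, max = 100000)   # GDP per capita, in dollars
happiness_2019 <- 44.78 + 0.0006 * income_2019 + rnorm(78, sd = 5.8)
df <- data.frame(income_2019, happiness_2019)

# Fit the simple linear regression, as in the output above
fit <- lm(happiness_2019 ~ income_2019, data = df)

coef(fit)                  # the estimates beta0_hat and beta1_hat
summary(fit)$r.squared     # the Multiple R-squared value
summary(fit)$coefficients  # estimates, std. errors, t values and p-values
```

The same accessors (`coef`, `summary(fit)$r.squared`, `summary(fit)$coefficients`) work on any `lm` fit, so they apply directly to the real data.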
2.3.1 Testing for H0:β1=0 versus H1:β1≠0
Recall the simple linear regression model
y=β0+β1x+ϵ.
If the true value of β1 were 0, then the regression model would become
y=β0+ϵ,
meaning y does not depend on x in any way. In other words, there would be no association between x and y. For this reason, the hypothesis test for β1 is very important.
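The t value that R reports for this test is simply the estimate divided by its standard error. As a quick check, we can reproduce the slope's t value and p-value from the numbers quoted in the Coefficients table above:

```r
# Values copied from the Coefficients table above
estimate  <- 5.642e-04   # the slope estimate beta1_hat
std_error <- 5.489e-05   # its standard error

t_value <- estimate / std_error
round(t_value, 2)        # 10.28, the t value reported by R

# Two-sided p-value from a t distribution with 76 degrees of freedom
p_value <- 2 * pt(-abs(t_value), df = 76)
p_value                  # essentially zero, consistent with the reported 4.94e-16
```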
2.3.2 R2, the Coefficient of Determination
R2 values are always between 0 and 1. In simple linear regression, the R2 value is simply the correlation squared. To see this, recall that in Section 1.1 we found the correlation coefficient to be r=0.76263. Squaring this number gives R2=0.76263²=0.5816 (to four decimal places). Conversely, taking the square root of R2 recovers the magnitude of the correlation; its sign matches the sign of the slope (positive here).
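A quick arithmetic check of this relationship, using the values quoted above:

```r
r <- 0.76263    # correlation coefficient from Section 1.1
round(r^2, 4)   # 0.5816, the Multiple R-squared from the regression output
sqrt(r^2)       # recovers the magnitude of the correlation
```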
The R2 value can be used to evaluate the fit of the model. R2 values close to 0 indicate a poor fit, whereas R2 values close to 1 indicate an excellent fit. Although the interpretation of the R2 value can differ by subject matter, for the purposes of this subject, the table below can be used as a guide when interpreting R2 values:
| R2 value | Quality of the SLR model |
|---|---|
| 0.8 ≤ R2 ≤ 1 | Excellent |
| 0.5 ≤ R2 < 0.8 | Good |
| 0.25 ≤ R2 < 0.5 | Moderate |
| 0 ≤ R2 < 0.25 | Weak |
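The table can be captured in a small helper function. Note that `r2_quality` is a hypothetical name written for these notes, not a function from any R package:

```r
# Hypothetical helper (written for these notes, not part of any R package)
# that returns the quality category from the table above.
r2_quality <- function(r2) {
  if (r2 < 0 || r2 > 1) stop("R2 must be between 0 and 1")
  if (r2 >= 0.8)       "Excellent"
  else if (r2 >= 0.5)  "Good"
  else if (r2 >= 0.25) "Moderate"
  else                 "Weak"
}

r2_quality(0.5816)   # "Good", matching the interpretation of the model above
```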