Linear regression for prediction

Learning outcomes/objective: Learn…

Sources: #TidyTuesday and tidymodels

1 Regression vs. classification

2 Linear model

2.1 Linear model (Equation) (1)

  • Linear Model = LM = Linear regression model
  • Aim (normally)
    • Model (and understand) the relationship between an outcome variable (output) and one or more explanatory variables (features)
    • But it is also a very popular machine learning model!
\(y_{i} = \underbrace{\color{blue}{\beta_{0}} + \color{orange}{\beta _{1}} \times x_{1i} + \color{orange}{\beta _{2}} \times x_{2i}}_{?} + \underbrace{\color{red}{\varepsilon}_{i}}_{?}\)


  • Q: What are \(\color{blue}{\beta_{0}}\), \(\color{orange}{\beta _{1}}\) and \(\color{orange}{\beta _{2}}\) also called?

2.2 Linear model (Equation) (2)

\(y_{i} = \underbrace{\color{blue}{\beta_{0}} + \color{orange}{\beta _{1}} \times x_{1i} + \color{orange}{\beta _{2}} \times x_{2i}}_{\hat{f}\text{ giving predicted values }\color{green}{\widehat{y}}_{i}} + \underbrace{\color{red}{\varepsilon}_{i}}_{\color{red}{Error}} = \color{green}{\widehat{y}}_{i} + \color{red}{\varepsilon}_{i}\)


  • Q: Why is the linear model called a “linear” model?

  • Important: Variable values (e.g., \(y_{i}\) or \(x_{1,i}\)) vary, parameter values (e.g., \(\boldsymbol{\color{blue}{\beta_{0}}}\)) are constant across rows

  • Important: \(\color{green}{\widehat{y}_{i}}\) varies across units

Name \(Lifesatisfaction\) (\(y_{i}\)) \(\boldsymbol{\color{blue}{\beta_{0}}}\) \(\boldsymbol{\color{orange}{\beta_{1}}}\) \(Unemployed\) (\(x_{1,i}\)) \(\boldsymbol{\color{orange}{\beta_{2}}}\) \(Education\) (\(x_{2,i}\)) \(\boldsymbol{\color{red}{\varepsilon_{i}}}\) \(\color{green}{\widehat{y}_{i}}\)
Samuel 8 ? ? 0 ? 7 ? ?
Ruth 4 ? ? 0 ? 3 ? ?
William 5 ? ? 1 ? 2 ? ?
.. .. .. .. .. .. .. .. ..

2.3 Linear model (Visualization)

  • Figure 1 visualizes the distribution of our data and a linear model that we fit to the data

Figure 1: Joint distribution + Linear Model

  • \(Lifesatisfaction_{i} = \color{blue}{\beta_{0}} + \color{orange}{\beta_{1}} \times Unemployed_{i} + \color{orange}{\beta_{2}} \times Education_{i} + \color{red}{\varepsilon}_{i}\) (Wikipedia)
  • The plane in Figure 1 is not an exact model of the data
    • An admissible model must be consistent with all the data points
    • The plane alone cannot be the model, because it does not fit all the data points exactly
    • Hence, the error term \(\color{red}{\varepsilon}_{i}\) must be included in the model equation so that the model is consistent with all data points
  • Predictive accuracy: How well does our model predict observations (in the test dataset)?
    • Calculate the average error across all errors \(\color{red}{\varepsilon}_{i}\) (in the test dataset)

2.4 Linear model (Estimation)

  • Estimation = Fitting the model to the data (by adapting/finding the parameters)
    • e.g., easy for the mean (analytical solution), but more difficult for linear (or other) models
  • Model parameters: \(\color{blue}{\beta_{0}}\), \(\color{orange}{\beta_{1}}\) and \(\color{orange}{\beta_{2}}\)
  • Ordinary Least Squares (OLS)
    • Least squares method (with origins in astronomy)
    • Choose \(\color{blue}{\beta_{0}}\), \(\color{orange}{\beta_{1}}\) and \(\color{orange}{\beta_{2}}\) (= plane) so that the sum of the squared errors \(\color{red}{\varepsilon}_{i}\) is minimized (see the graph and the sketch after this list!)
    • Q: Why do we square the errors?
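
Below is a minimal sketch of OLS estimation in base R. The data frame df and all its values are invented for illustration (they merely mirror the example table above); lm() carries out the least-squares fit.

```r
# Toy data mimicking the example table above (values are illustrative only)
df <- data.frame(
  Lifesatisfaction = c(8, 4, 5, 6, 7, 3, 9, 5, 6, 2, 8, 7),
  Unemployed       = c(0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0),
  Education        = c(7, 3, 2, 5, 4, 1, 8, 6, 5, 2, 7, 6)
)

# OLS: lm() chooses beta_0, beta_1 and beta_2 so that the sum of squared errors is minimized
fit <- lm(Lifesatisfaction ~ Unemployed + Education, data = df)

coef(fit)       # estimated parameters: intercept (beta_0), beta_1, beta_2
fitted(fit)     # predicted values y-hat_i
residuals(fit)  # estimated errors epsilon_i
```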

2.5 Linear model (Prediction)

\(y_{Samuel} = \color{blue}{6.23} + \color{orange}{-0.58} \times x_{1Samuel} + \color{orange}{0.20} \times x_{2Samuel} + \color{red}{\varepsilon}_{Samuel}\)

\(8 = \color{blue}{6.23} + \color{orange}{-0.58} \times 0 + \color{orange}{0.20} \times 7 + \color{red}{0.373} = \color{green}{7.63} + \color{red}{0.373}\)

Name \(Lifesatisfaction\) \(\boldsymbol{\color{blue}{\beta_{0}}}\) \(\boldsymbol{\color{orange}{\beta_{1}}}\) \(Unemployed\) \(\boldsymbol{\color{orange}{\beta_{2}}}\) \(Education\) \(\boldsymbol{\color{red}{\varepsilon}}\) \(\color{green}{\widehat{y}}\)
Samuel 8 6.23 -0.58 0 0.20 7 0.373 7.63
Ruth 4 6.23 -0.58 0 0.20 3 -2.83 6.83
William 5 6.23 -0.58 1 0.20 2 -1.05 6.05
.. .. .. .. .. .. .. .. ..
  • Important note on “prediction”
    • In Figure 1 we simply fitted the model to all observations in the dataset
    • When using the linear model as an ML model, we split the data first, fit the model to the observations in the training subset, and then use this fitted model to predict the observations in the test subset (see the sketch below)
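
Below is a minimal sketch of this split-then-predict workflow using tidymodels. It reuses the hypothetical df from the estimation sketch above; the split proportion and seed are arbitrary choices.

```r
library(tidymodels)

set.seed(42)

# Split first: fit on the training subset, predict the held-out test subset
split    <- initial_split(df, prop = 0.75)
df_train <- training(split)
df_test  <- testing(split)

# Fit the linear model to the training data only
lm_fit <- linear_reg() |>
  set_engine("lm") |>
  fit(Lifesatisfaction ~ Unemployed + Education, data = df_train)

# Predict previously unseen test observations (returns a .pred column)
predict(lm_fit, new_data = df_test)
```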

2.6 Linear model: Accuracy (MSE, RMSE, R-squared)

  • Mean squared error (James et al. 2013, Ch. 2.2)
    • \(MSE=\frac{1}{n}\sum_{i=1}^{n}(y_{i}- \hat{f}(x_{i}))^{2}\) (James et al. 2013, Ch. 2.2.1)
      • \(y_{i}\) is the true outcome value for the \(i\)th observation
      • \(\hat{f}(x_{i})\) is the prediction that \(\hat{f}\) gives for the \(i\)th observation, i.e., \(\hat{y}_{i}\)
      • MSE is small if the predicted responses are close to the true responses, and large if they differ substantially
  • Training MSE: MSE computed using the training data
  • Test MSE: How accurate are the predictions that we obtain when we apply our method to previously unseen test data?
    • \(\text{Ave}(y_{0} - \hat{f}(x_{0}))^{2}\): the average squared prediction error for test observations \((y_{0},x_{0})\)
    • Usually, when building a model, we use a third data split to assess accuracy, i.e., analysis (training) data, assessment (validation) data, and test data
  • Fundamental property of ML (cf. James et al. 2013, 31, Figure 2.9)
    • As model flexibility increases, training MSE will decrease, but the test MSE may not (danger of overfitting)
  • In practice we use the Root Mean Squared Error (RMSE)
    • MSE is expressed in squared units of the outcome and is therefore not directly comparable to the target/outcome variable
    • RMSE takes the square root of the MSE and brings the units back to the original scale of the outcome variable
    • RMSE is more interpretable and comparable across models/datasets
  • R-squared measures the proportion of variance in the outcome variable that is explained by the model
    • Ranges from 0 to 1, where 0 means the model explains none of the variance and 1 means the model explains all of the variance
    • Calculated as the ratio of the explained variance to the total variance of the outcome variable, i.e., \(R^{2} = 1 - \frac{\sum_{i}(y_{i}-\hat{y}_{i})^{2}}{\sum_{i}(y_{i}-\bar{y})^{2}}\) (a measure of the model’s goodness of fit; see the sketch after this list)
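
Below is a minimal sketch of computing these accuracy measures on the test predictions, continuing the hypothetical lm_fit and df_test objects from the prediction sketch above; yardstick's metrics() reports RMSE and R-squared (plus MAE) for a numeric outcome.

```r
library(tidymodels)

# Attach predictions to the test data (.pred holds the predicted values)
test_results <- df_test |>
  bind_cols(predict(lm_fit, new_data = df_test))

# MSE by hand: average squared prediction error on the test data
mean((test_results$Lifesatisfaction - test_results$.pred)^2)

# RMSE and R-squared (plus MAE) via yardstick
test_results |>
  metrics(truth = Lifesatisfaction, estimate = .pred)
```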

References

James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. An Introduction to Statistical Learning: With Applications in R. Springer Texts in Statistics. Springer.