Chapter 6 Multiple regression

6.1 Introduction

This chapter covers one of the most versatile and useful modeling techniques: ordinary least squares regression. It also discusses the distinction between explanatory and predictive modeling and presents several examples of predictive modeling using regression.

Linear regression is arguably the most well-known of the many algorithms used in predictive analytics. There are several reasons for this. First, it is a logical, linear model which has many uses and is conceptually attractive. Linear relationships are easy to think about.

Second, the technique itself is relatively easy to program, so software implementations are widely available. It can be programmed in most languages with just a few statements, as the short sketch below illustrates.
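As an illustration, the following minimal Python sketch fits a straight line by least squares in a few statements. The data values here are made up purely for the example.

```python
import numpy as np

# Toy data: one predictor x and one target y (hypothetical values).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Add a column of ones for the intercept, then solve the least squares problem.
X = np.column_stack([np.ones_like(x), x])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print("intercept, slope:", coef)
```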

Third, it is very flexible – it can be applied to many types of problems, even those that at first might not seem to be linear regression candidates. Many phenomena can be cast, at least approximately, into a linear model.

Fourth, regression has multiple, distinct uses. It is often used to build models that explain how one or more independent variables affect a continuous dependent variable. It is also used for control, in the sense that regression models can help identify cases that are in error or otherwise problematic. Finally, regression is used for prediction, which is the focus here.

Fifth, regression has a long and elaborate history of development. It is a huge topic. There are many books and courses devoted to the subject. It’s been under development for more than 100 years. The basic idea is quite simple, but there are many exceptions, special cases, and assumptions that might not fit a particular situation. In ordinary regression, the target variable is continuous, but over the years many modifications to the basic model have led to new regression-type models such as Cox regression, Poisson regression, logistic regression, and others.

6.2 Regression techniques

Several algorithms perform different forms of regression; each was developed to better align with the objectives of the analysis and the characteristics of the data. Ordinary least squares, the topic of this chapter, applies to problems with a single continuous target variable and one or more predictor variables, which can be continuous, categorical, or a mixture of both.

The models covered in this chapter are:

  • Ordinary linear regression
  • Forward selection of features
  • Backward selection of features
  • Stepwise selection of features
  • Lasso regression

Note that regression-type problems can also be addressed using other techniques such as neural networks, support vector machines, and others. These will be discussed in other chapters.

6.3 Regression for explanation

Regression is used differently in different disciplines. In economics, psychology, sociology, and other fields, regression is mostly used with the goal of developing causal explanations (Shmueli 2010).

It is assumed that most readers of this text have been introduced to multiple regression. Most statistics texts (e.g., ______) present regression in the context of building models with the objective of causal inference and explanation. A theoretical model is posed, predictor variables are identified, and observational data are collected (cross-sectional, longitudinal, or both). Hypotheses based on the theoretical model are tested using regression and related techniques. The aims of such studies include assessing both the statistical significance and the magnitude of the effects of the independent variables. This is a challenging task, and accomplishing it with any degree of confidence requires attention to several strict assumptions.

Violations of any of these assumptions can cast doubt on the validity of conclusions drawn from the analysis, so three questions should be asked about each: (1) Is the assumption met in the current situation? (2) If not, how serious are the consequences of violating it? (3) If the assumption is violated and is critical to the analysis, can remedial techniques be used to alleviate the consequences?

The assumptions need to be met to draw confident conclusions about the statistical significance of the predictors and of the overall regression model, to interpret the effect of each predictor on the target, and to construct valid confidence intervals for estimates made by the model. The Gauss-Markov theorem states that, for a linear additive model, ordinary least squares produces coefficient estimates that are unbiased and have the lowest variance among all linear unbiased estimators. This result is usually summarized by the acronym “BLUE,” for “Best Linear Unbiased Estimator.” The four Gauss-Markov assumptions are:

  • The dependent variable is a linear, additive function of a set of predictors plus an error term.
  • The error term has a conditional mean of zero.
  • The variance of the error term is constant for all values of the predictors (homoskedasticity).
  • The error term is independently distributed (no autocorrelation).

To this list, some or all the following assumptions are sometimes added:

  • Predictor variables are not correlated with the error term.
  • There is no perfect collinearity among the predictors.
  • The error term is normally distributed (and relatedly, there are no outliers or observations with undue influence).
  • The number of observations must be greater than the number of predictor variables (usually 5 to 10 times as many observations as predictors).
  • All the predictor variables have non-zero variability, i.e., are not constants.

Violation of these assumptions can result in biased estimates of the parameters of the regression model. The estimated size and variance of the predictor coefficients can be biased upward or downward. In some cases, this can render the results totally misleading.
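Several of these assumptions can be checked with standard diagnostics. The sketch below, which uses simulated data, illustrates three common checks with the statsmodels and scipy libraries: variance inflation factors for collinearity, the Breusch-Pagan test for non-constant error variance, and the Shapiro-Wilk test for normality of the residuals. The data frame and column names are invented for the example.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.stats.diagnostic import het_breuschpagan
from scipy import stats

# Simulated data with two predictors and a linear, additive target.
rng = np.random.default_rng(0)
df = pd.DataFrame({"x1": rng.normal(size=200), "x2": rng.normal(size=200)})
df["y"] = 3 + 2 * df["x1"] - df["x2"] + rng.normal(scale=1.0, size=200)

X = sm.add_constant(df[["x1", "x2"]])
model = sm.OLS(df["y"], X).fit()

# Collinearity: variance inflation factors for each predictor (excluding the constant).
vif = {col: variance_inflation_factor(X.values, i)
       for i, col in enumerate(X.columns) if col != "const"}
print("VIF:", vif)

# Homoskedasticity: Breusch-Pagan test (a small p-value suggests non-constant variance).
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(model.resid, X)
print("Breusch-Pagan p-value:", lm_pvalue)

# Normality of the error term: Shapiro-Wilk test on the residuals.
w_stat, w_pvalue = stats.shapiro(model.resid)
print("Shapiro-Wilk p-value:", w_pvalue)
```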

Example: Omitted variable bias

There is considerable controversy on the question of whether increased spending on public education leads to better student outcomes. For example, the Heritage Foundation published a report that concluded:

Federal and state policymakers should resist proposals to increase funding for public education. Historical trends and other evidence suggest that simply increasing funding for public elementary and secondary education has not led to corresponding improvement in academic achievement. (insert reference)

One approach to investigating this question is to relate expenditures per pupil in all 50 states plus the District of Columbia to a measure of student performance. The performance criterion in the data set is the average combined SAT score. The first 10 observations are shown in Table 6.1. The hypothesis is that students in states that spend more per pupil on primary and secondary schools should be better prepared for college and thus perform better on the SAT.

SAT scores were regressed on expenditures (in $1,000s). The coefficient on expense is significant (p < .001) and negative (-22.28), which suggests that increasing spending on education results in lower performance on the SAT. If true, this finding has important implications for school funding, and those who argue against spending more on education may be right.

It turns out that an important variable was omitted from the regression: the percentage of students taking the SAT by state. There was considerable variation in this measure, as shown in the chart in Figure 6.1. In Connecticut, more than 80% of students took the SAT while in Mississippi only 4% took the SAT. If the regression is run with expenditure per pupil and percentage taking the SAT, the conclusion is different. The coefficient on expense is positive (8.60) and significant (p = 0.046).

The second analysis controlled for the percentage of students taking the SAT; that is, the effect of expenditures is estimated after removing the effect of the percentage taking the SAT. By including one or more control variables in a regression, their effects are separated from the effect of the variable of interest. The control variables themselves are usually not of primary interest to the analyst. The point here is that omitting a key variable can lead to incorrect conclusions in regression. Using regression for explanation is more difficult and demanding than using it for prediction.
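A hedged sketch of the two regressions, using Python and the statsmodels formula interface, is shown below. The file name and column names (expense for expenditure per pupil, pct_taking for the percentage taking the SAT, sat for the combined score) are placeholders for whatever the actual data set uses; the point is simply that adding the control variable changes the estimated coefficient on spending.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Assumed file and column names for the state-level data in Table 6.1.
sat_data = pd.read_csv("sat_states.csv")

# Simple regression: SAT score on spending only.
m1 = smf.ols("sat ~ expense", data=sat_data).fit()

# Adding the control variable (percentage of students taking the SAT).
m2 = smf.ols("sat ~ expense + pct_taking", data=sat_data).fit()

# Compare the spending coefficient with and without the control variable.
print(m1.params["expense"], m2.params["expense"])
```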


Figure 6.1: Percent taking SAT by state and DC.

Table 6.1: Expenditures on education and SAT scores.
State + DC              Expenditure per pupil ($)   Combined SAT score
Alabama                 3627                        991
Alaska                  8330                        920
Arizona                 4309                        932
Arkansas                3700                        1005
California              4491                        897
Colorado                5064                        959
Connecticut             7602                        897
Delaware                5865                        892
District of Columbia    9259                        840
Florida                 5276                        882

6.4 Regression for prediction

Predictive modeling is the process of using a statistical model such as regression to make predictions for new or future observations of the predictor variables. In image recognition, natural language processing, and many business problems, the emphasis is frequently on prediction rather than explanation.

Breiman (2001) described what he considered two cultures in statistics. He called the most prevalent approach the “data modeling culture.” Its applications typically involve relatively small numbers of observations and an even smaller number of predictors, and the most important aim is causal inference. This approach is consistent with the explanatory use of regression.

A second approach (or culture) Breiman called the “algorithmic culture,” which focuses almost purely on predictive accuracy rather than statistical tests and interpretation of model parameters. These applications typically involved huge data sets, sometimes with millions of observations as well as many potential predictors. (In some cases, the number of potential predictors was even larger than the number of observations.) This new paradigm began in the 1980s and gained in popularity from the 1990s to the present. This is the domain of data mining, predictive analytics, machine learning, and big data.

Many have been critical of this newer approach, beginning with the published comments on Breiman’s paper by Cox (), Efron (), and other leading classical statisticians. Despite the criticism, there have been many successful applications of “pure prediction algorithms” (Efron 2019) in business, medicine, biology, and other fields. The potential for misleading or even dangerous predictions from these models remains, and it is the responsibility of the analyst to acknowledge both the limitations and the benefits of such models.

6.4.1 Revisiting regression assumptions

When regression is used for prediction rather than explanation, the role of the assumptions changes. The primary objective is prediction accuracy on new data, usually assessed by splitting a data set with many observations into training and test subsets. The training data is used to develop the predictive model and the test set stands in for the “new” data. (Footnote 1) Some of the assumptions may be violated, but if the model predicts well, it can still be useful.
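A minimal sketch of this train/test approach with scikit-learn is shown below. The file name and the target column name y are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# Assumed data frame with a continuous target column named "y".
df = pd.read_csv("some_data.csv")
X, y = df.drop(columns="y"), df["y"]

# Hold out a test set to stand in for "new" data.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = LinearRegression().fit(X_train, y_train)

# Accuracy on the held-out data is the primary criterion for a predictive model.
print("Test MAE:", mean_absolute_error(y_test, model.predict(X_test)))
```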

Allison (2014) noted that outside of academia, “…regression is primarily used for prediction.” He further notes that there are important differences between using regression for explanation versus prediction. Specifically, he writes that omitted variable bias is much less of an issue when the goal is prediction. Multicollinearity (if it is not extreme) can be tolerated in a predictive model, since the coefficients on individual variables are not of primary concern. Measurement error in the predictor variables leads to bias in the estimates of regression coefficients, but again the estimates of the predictor coefficients are not as critical. (Of course, in predictive applications, high degrees of measurement error can make predictions less accurate.)

Other differences between the two uses of regression include the following:

  • Out-of-sample prediction accuracy matters more than R^2 on the original (training) data; what counts for prediction is performance on the hold-out sample or samples. With causal modeling, even low R^2 values on the original data are not as much of a concern; in such applications the hypothesis tests on the predictors are more important.
  • Normality of the error term is not a requirement since hypothesis testing is not the goal of predictive regression.

The perspectives on predictive regression discussed above are controversial, and many statisticians would undoubtedly disagree with some of the statements made. In truth, causal modeling is important and, if done well, can lead to insights with longer-run usefulness. For present purposes, however, the focus in this chapter and in the rest of the book will be on prediction. It is critical to understand that when a predictive model is built without regard to the traditional assumptions of regression, using its results as if causality had been established can lead to serious mistakes. While it may be of interest to find which predictors have the greatest impact on the target variable, it cannot be assumed that manipulating those variables will produce the expected changes in the target. Predictive models capture correlation, not causation.

The one assumption that is critical for accurate predictive regression models is that the dependent variable is a linear, additive function of the predictors. Other algorithms, discussed in later chapters, can more or less “automatically” deal with complications such as non-linearity. With regression, however, it is up to the analyst to explore possible non-linear relationships and potential interactions among the predictors; failure to do so can lead to sub-optimal models. If some of the predictors are transformed (for example, with logarithms, polynomial terms, or interactions), the accuracy of the regression model can often be improved without collecting additional data.
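For example, the analyst might compare a plain additive model with one that adds a squared term, a log transformation, and an interaction, as in the sketch below. The statsmodels formula interface is used here; the data frame and column names (cars, price, age, km) are assumptions for illustration, and a hold-out comparison would be the proper way to judge the candidate models.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data: car price as a function of age and mileage.
cars = pd.read_csv("used_cars.csv")  # assumed file with columns price, age, km

# Baseline linear, additive model.
base = smf.ols("price ~ age + km", data=cars).fit()

# Candidate model with a squared term, a log transformation, and an interaction.
curved = smf.ols("price ~ age + I(age**2) + np.log(km + 1) + age:km", data=cars).fit()

# Quick screen only; out-of-sample error on a hold-out set is the better criterion.
print(base.rsquared_adj, curved.rsquared_adj)
```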

6.4.2 Prediction example: Used Toyota Corollas

A dataset with the prices of used Toyota Corollas for sale in The Netherlands during the late summer of 2004 was obtained from Kaggle. (reference) The full dataset has 1,436 records with 38 attributes, but a subset of 1,000 observations and a reduced set of variables is used for this example. The variables included are price (the target variable) and the following predictors:

  • Age of the car (in months)
  • Mileage (in km)
  • Fuel type (diesel, natural gas, or gasoline)
  • Horsepower
  • Automatic transmission (yes or no)
  • Engine displacement (in cc)
  • Weight (in kg)

The goal is to predict the price of a used Toyota Corolla from its specifications. The KNIME workflow for this analysis is shown in Figure 6.2.


Figure 6.2: Workflow for OLS regression with Toyota Corolla price data.
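For readers who prefer code to a workflow, the following Python sketch performs an analogous analysis with scikit-learn. The file name and column names are assumptions and should be adapted to the actual Kaggle extract; the KNIME nodes in Figure 6.2 remain the reference analysis.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

# Assumed file with a "price" column plus the predictors listed above.
cars = pd.read_csv("toyota_corolla_1000.csv")

# One-hot encode the categorical predictors (fuel type, automatic transmission).
X = pd.get_dummies(cars.drop(columns="price"), drop_first=True)
y = cars["price"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

ols = LinearRegression().fit(X_train, y_train)
pred = ols.predict(X_test)

print("R^2:", r2_score(y_test, pred))
print("MAE:", mean_absolute_error(y_test, pred))
print("RMSE:", mean_squared_error(y_test, pred) ** 0.5)
```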

The regression results are shown in Table 6.2. The measures show that the results for the training and test data are comparable. The training data shows better results for R^2 and root mean squared error, while mean absolute error and mean absolute percentage error are slightly better for the test data.

Table 6.2: Regression with Toyota1000 data.
Measure Training data Test data
R^2 0.871 0.825
Mean absolute error 1028.513 1005.412
Root mean squared error 1385.478 1455.483
Mean absolute percentage error 0.094 0.091

A scatterplot of predicted price versus actual price for the test data indicates reasonable results, with one clear outlier at row 221 (Figure 6.3). The presence of this outlier most likely explains why the measures based on squared residuals, rather than those based on absolute values, are worse for the test data: squaring the residuals (for R^2 and root mean squared error) gives a single large error a much greater effect on the performance measure.


Figure 6.3: Predicted versus actual prices of Toyota Corollas.
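The sensitivity of squared-error measures to a single outlier is easy to demonstrate. In the sketch below the residual values are invented, with one large outlier mixed in with otherwise modest errors.

```python
import numpy as np

# Illustrative residuals (hypothetical): mostly small errors plus one large outlier.
residuals = np.array([200.0, -150.0, 300.0, -250.0, 100.0, 5000.0])

mae = np.mean(np.abs(residuals))            # mean absolute error
rmse = np.sqrt(np.mean(residuals ** 2))     # root mean squared error

# Squaring weights the big error much more heavily, so RMSE is inflated far more than MAE.
print("MAE:", round(mae, 1), "RMSE:", round(rmse, 1))
```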


Figure 6.4: KNIME workflow for regression analysis of apartment prices.

Appendix: A brief history of regression

The term regression and the basic idea behind it came from a British scientist, Sir Francis Galton. He was interested in heredity and evolution; this is not surprising since he was a cousin of Charles Darwin.

Sir Francis Galton.

Galton asked the following question: “What is the relationship between the heights of children and the heights of their parents?” He was an empirical scientist, so he collected data on pairs of parents and children and studied the data carefully. He created a chart of the children’s heights (as adults) versus the heights of their parents (Figure 6.5). He fit a line to the data such that the line was as close as possible to the observations, shown as the blue line in the chart. Once Galton had fitted the line, he calculated its intercept and slope.

The slope and intercept of the fitted line, 0.611 and 26.4 respectively, are shown at the top of the chart.


Figure 6.5: Height of children vs. height of parents.

As expected, Galton found that tall parents tended to have tall children, but the relationship was not perfect. A one-to-one relationship between parents’ heights and the heights of their children would follow the red line in Figure 6.5. Instead, there was evidence that the adult heights of children of tall parents regressed, or reverted, toward the mean height of all adults, shown as the blue line in the chart. For example, if a parent’s height was about 72 inches, the expected height of the children would be about 70 inches. The same was true for shorter parents: if a parent’s height was about 64 inches, the expected height of the children would be about 65 inches. In both directions, the heights of children tended toward the mean height of all adults.
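Using the slope and intercept reported with Figure 6.5, these expectations come directly from the fitted line:

predicted child height = 26.4 + 0.611 × parent height
26.4 + 0.611 × 72 ≈ 70.4 inches
26.4 + 0.611 × 64 ≈ 65.5 inches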

Galton used the term regression for this idea of reverting, or returning, to a norm. Others then took the idea of drawing a line through data and estimating its equation, applied it to many other types of problems, and called the process regression, after Galton, even though the context and meaning were different. Karl Pearson, a collaborator and lab assistant to Galton, developed the mathematics of regression that we have today.

Footnote 1 Of course, the ability of the model to predict accurately in truly new situations needs to be monitored continuously, since the context of the problem may change. This is important for causal regression applications as well.