Chapter 3 Introduction to Regression
3.1 Motivation
This chapter establishes the basics of a crucial unit in this book: Linear Regression. Although we will not yet delve into more involved techniques for regression (including linear modeling and multiple regression), we will explore certain concepts that echo across the rest of Statistics. This is the first ‘real’ Statistics section of the book, so buckle up!
3.2 Covariance and Correlation
According to popular legend, Romulus and Remus are the undisputed founding brothers of Rome (ask Romulus for specifics). We will argue here that the same title applies to Covariance (\(s_{xy}\)) and Correlation (\(\rho\)), the two brothers that drive Regression.
First, let’s review how they relate.
\[\rho = \frac{s_{xy}}{s_xs_y}\]
That is, the correlation of variables \(X\) and \(Y\) equals the covariance of \(X\) and \(Y\) divided by the product of their standard deviations.
Usually, if you ever have to calculate correlation by hand, your path will take you through this formula. Covariance can be found in different ways in this book (more on that later), but this formula can also prove useful for finding it if you are given the correlation and the standard deviations.
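If you want to check this relationship numerically, here is a quick sketch (the data are made up purely for illustration):

```python
import numpy as np

# Toy data: hours studied and GPA for five hypothetical students.
x = np.array([5.0, 10.0, 15.0, 20.0, 25.0])
y = np.array([2.1, 2.8, 3.0, 3.4, 3.9])

# Sample covariance and sample standard deviations (ddof=1).
s_xy = np.cov(x, y, ddof=1)[0, 1]
s_x, s_y = np.std(x, ddof=1), np.std(y, ddof=1)

rho = s_xy / (s_x * s_y)           # the formula above
print(rho)                         # ~0.99
print(np.corrcoef(x, y)[0, 1])     # NumPy's direct answer; should match
```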
Anyways, these two summarize the linear relationship between two variables: its direction and, in correlation’s case, its strength. This is undoubtedly a significant jump from our other statistics (mean, median, etc.), since we are now dealing with multiple variables and their relationships.
As regression consists of building a relationship between two variables, you can likely see why these two metrics are important. Let’s break each down further.
1. Covariance
The first brother gives a measure of the direction with which two variables vary together. A positive value means that there is a direct relationship: that is, they move in the same direction. A negative value means there is an inverse relationship, or they move in opposite directions.
Let’s consider two variables: how many hours you spend studying in school and your GPA. Hopefully, these have a positive covariance. That would mean that the two vary together: the more you study, the higher the GPA tends to be (and vice versa).
An opposite example: imagine the covariance between hours exercised and heartbeats per minute. This would likely be negative, since people who exercise more are usually in better shape, and thus their hearts tend to beat fewer times per minute.
A key aspect of covariance is that it does not give relationship strength. Since covariance relies on units (of whatever variable you’re measuring) there is no set gauge to measure how strong a relationship is. So, even if you see a covariance of \(.001\), do not be quick to call it a weak relationship (we could be measuring something at the molecular level, maybe).
Another important reminder for covariance is that it gives association, NOT causation. In the example above, we can’t say that studying harder caused a higher GPA (notice how we use the word “tends”), only that when you study a lot you also seem to get a better GPA. Without actually performing a controlled experiment, we can only say that these things tend to vary together, never that one causes the other.
Finally, covariance will be relevant to us in one other setting: calculating the total variance of multiple random variables. This will become more important in Portfolio Theory; for now, the equation is:
\[Var(X + Y) = Var(X) + Var(Y) + 2Cov(X,Y)\]
Let’s imagine that \(X\) and \(Y\) are two portfolios (this context will be more important later in Modern Portfolio Theory). We are primarily concerned with the returns (money you make from the stocks in the portfolio), specifically the mean of these returns and their variance. If we wanted to find the mean of both portfolios combined, we would just add the separate means:
\[E(X + Y) = E(X) + E(Y)\]
Which is intuitive. To find the total variance of the two portfolios combined (i.e., what variance in total return you would expect if you were holding both portfolios) you would need the equation above: the sum of the two individual variances plus \(2\) times the covariance.
The idea here is that while there are individual variances that need to be taken into account, there is also an interactive variance between the two portfolios. The Covariance term captures this interactive variance, or risk.
To visualize this interactive risk, let’s consider two portfolios: one made up primarily of oil companies, and the other of large gas stations (like Cumberland Farms). If you wanted to find the total risk/variance of both portfolios combined, you would first need the individual risks, but then you would need the added interactive risk (their covariance). That is because the two portfolios are closely related (when oil companies do poorly, gas stations have to charge higher prices and thus likely also do worse), so there is an added level of risk that we must take into account.
This makes intuitive sense, since for independent variables (a concept we will formalize later; for now, one does not affect the other) the Covariance is \(0\): there is no interactive risk. You would expect close to \(0\) interactive risk between a portfolio of fruit companies and a portfolio of teddy bear companies (we can assume these are pretty much independent, although, of course, there are ripple effects in any market).
One last note about Covariance: the covariance between two identical variables is just the variance of the original variable. That is:
\[Cov(X,X) = Var(X)\]
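Both identities are easy to check numerically. A minimal sketch with simulated (made-up) variables:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=10_000)
y = 0.5 * x + rng.normal(size=10_000)   # deliberately correlated with x

# Var(X + Y) = Var(X) + Var(Y) + 2Cov(X, Y)
var_sum = np.var(x + y, ddof=1)
identity = np.var(x, ddof=1) + np.var(y, ddof=1) + 2 * np.cov(x, y, ddof=1)[0, 1]
print(var_sum, identity)                # the two agree (up to floating point)

# Cov(X, X) = Var(X)
print(np.cov(x, x, ddof=1)[0, 1], np.var(x, ddof=1))
```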
2. Correlation
The second brother is more interesting and perhaps more important. Instead of only indicating direction, Correlation measures direction and strength of a linear relationship. The sign gives direction, the magnitude strength.
Correlation is always between \(-1\) and \(1\). A Correlation of \(0\) means there is no linear relationship between two variables. As the correlation moves from \(0\) to \(-1\) or \(1\), the relationship gets stronger and stronger, culminating in a perfect relationship at either endpoint (\(-1\) or \(1\)).
What would a correlation of exactly \(0\) look like? There are a few options: first, the variables could be totally independent (not related at all). If we flipped two fair coins separately, the number of heads in the first \(100\) flips of the first coin has a correlation of \(0\) with the number of tails in the first \(100\) flips of the second coin. That’s because the two coins are totally unrelated: they are independent, so knowing something about one gives no information about the other, and thus there is no relationship.
A correlation of \(0\) could also arise from a non-linear relationship. For example, you may have learned about the Laffer Curve in an economics course. This concept states that the relationship between tax rate and tax revenue is (in theory) perfectly quadratic (it looks like an umbrella). If we used our tools of correlation to look for a relationship between tax rate and tax revenue, we would see a correlation of \(0\), since the relationship is quadratic, not linear.
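We can see this numerically. A sketch with a made-up, perfectly symmetric “umbrella”:

```python
import numpy as np

# A perfect quadratic relationship, symmetric around x = 0.
x = np.linspace(-1, 1, 201)
y = 1 - x**2                     # revenue peaks in the middle, Laffer-style

print(np.corrcoef(x, y)[0, 1])   # ~0.0: no *linear* relationship,
                                 # despite y being perfectly determined by x
```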
What would a correlation of \(1\) or \(-1\) look like? This implies a perfect relationship, which may be tough to imagine. A good example is the relationship between feet and inches: it is always exactly Inches equals twelve times Feet. Since there is absolutely no error in this relationship (you never see \(12\) inches equaling \(1.2\) feet) the correlation is \(1\). The correlation would be \(-1\) if the relationship were exact but inverse; that is, if every time inches increased by \(12\), feet decreased by \(1\).
It’s clear that these extreme correlations (\(0, -1, 1\)) aren’t really feasible in real world data. There’s always something (ripple effects, chaos theory, whatever) that relates two variables at least a little bit, and there’s always something (the imprecise nature of the real world) that keeps a relationship from being absolutely perfect. Therefore, don’t worry if you don’t see these in real world regression readouts: they are merely guides (just like you never see anything in the real world with exactly \(0\) or \(100\%\) chance of happening).
Instead, remember that the higher in magnitude the correlation, the stronger the relationship. A rough rule of thumb: \(\mid \rho \mid > .8\) is a very strong relationship, \(.5 < \mid \rho \mid \le .8\) is a relationship of medium strength, and \(\mid \rho \mid \le .5\) is a not-so-strong relationship.
Again, Correlation measures association, not causation. We could present a case where we found a correlation of \(-.93\) between the number of cowboys and the number of car accidents. This does not mean that the absence of cowboys causes more car accidents. Clearly, the lurking variable here is time: as time progressed, real life cowboys became rare, while the automobile became more and more mainstream, resulting in a raw increase in the number of car accidents. Although it would be cool to claim that cowboys prevent car accidents, we can only say how one tends to move as the other moves.
Finally, it’s important to remember that Correlation is unitless. Therefore, it does not matter if you went from measuring the relationship between height in inches versus weight in grams to height in feet versus weight in pounds: Correlation is not affected.
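A quick sketch of this invariance, using a made-up height/weight relationship:

```python
import numpy as np

rng = np.random.default_rng(1)
height_in = rng.normal(66, 3, size=500)               # height in inches
weight_lb = 2.2 * height_in + rng.normal(0, 8, 500)   # pounds (made-up relationship)

# Change units: inches -> feet, pounds -> grams.
height_ft = height_in / 12
weight_g = weight_lb * 453.6

print(np.cov(height_in, weight_lb)[0, 1])       # covariance changes with units...
print(np.cov(height_ft, weight_g)[0, 1])
print(np.corrcoef(height_in, weight_lb)[0, 1])  # ...but correlation does not
print(np.corrcoef(height_ft, weight_g)[0, 1])
```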
3.3 Ecological Correlation
As you’ve seen, the concept of correlation vs. causation tends to rear its ugly head in different sections. While most of its applications aren’t really that novel and reflect what we discussed in previous units, there is a new concept that is worth reviewing: Ecological Correlation.
The idea here is that a well-meaning researcher aggregates individual data into group data, runs a regression on the group data, and then tries to make conclusions about the individuals based on these group results.
Take the classic example of regressing fat intake per day vs. lifespan. If we ran this regression at the national level, we would likely find a positive relationship: across countries (the United States, major European powers, developing nations), fat intake is usually a good indicator of other factors like general affluence and education. Countries that are better fed on a national level are usually also better schooled and employed, leading to a higher life expectancy.
However, if we actually looked at the individuals WITHIN the United States, we might find the exact opposite story. Namely, the higher the fat intake for a US citizen, the worse that individual’s health usually is (since malnourishment is less of an issue within the country). This results, of course, in a lower life expectancy: the exact opposite effect.
The key here is that there is a different effect of fat intake on the national level than on the individual level. Therefore, we can’t draw individual conclusions based on group (here, national) results. Doing so would result in the Ecological Fallacy.
3.4 Modern Portfolio Theory
MPT is a perfect example of a relevant application of statistics for you career-minded individuals. It also provides a constant reminder that the worthwhile parts of Finance only exist because of statistical and mathematical tools (just kidding). Whatever the case, this remains one of the most finance-focused sections of this book.
The key to Modern Portfolio Theory (in this book, at least) is understanding that for our purposes it really boils down to two equations. The motivation behind this theory is trying to find the best portfolio allocation - how much we invest in each stock - given multiple stocks (in case you need a refresher, portfolios are made up of various stocks). These stocks have two metrics we are concerned with: returns and risk (measured here as variance). We want, then, to optimize the portfolio based on the possible stock allocations (\(\%\) invested in various stocks).
The most important thing to us in this book is the total return and risk of the portfolio as a whole. This is just an application of statistical algebra, derived from the following equations:
\[E(aX + bY) = aE(X) + bE(Y)\] \[Var(aX + bY) = a^2Var(X) + b^2Var(Y) + 2abCov(X,Y)\]
Where \(a\) and \(b\) are constants, and \(X\) and \(Y\) are random variables.
We can apply this concept to portfolios via the following equations:
\[E(P) = w_1r_1 + w_2r_2\]
Where \(P\) is the portfolio return, \(w_i\) gives the weight of the \(i^{th}\) stock (how much of the portfolio is invested in that stock, anywhere from \(0\) to \(100\%\)), and \(r_i\) gives the return of the \(i^{th}\) stock (how much it makes). In this case we know \(w_1 + w_2 = 1.00\), so we could re-write the above equation using \(w_2 = 1 - w_1\); indeed, you may often see it in that format.
This is pretty intuitive, then: the overall expected return of the portfolio is the weighted sum of the expected returns of the stocks that make it up (weighted by how much of the portfolio each represents). So, if we held a portfolio that was invested \(70\%\) in a stock that returned \(\$10\) and \(30\%\) in a stock that returned \(\$5\), the expected return of the portfolio would be \(E(P) = (.7)10 + (.3)5 = \$8.5\).
Slightly trickier is the total variance of the portfolio. Applying the second equation from above, we get:
\[Var(P) = w_1^2s_1^2 + w_2^2s_2^2 + 2w_1w_2s_{1,2}\]
Where \(w_i\) still gives the weight, \(s_i^2\) gives the variance of the \(i^{th}\) stock’s returns, and \(s_{i,j}\) gives the covariance of stocks \(i\) and \(j\).
Take time to convince yourself that this is the same equation from above (hint: \(w_1\) and \(w_2\) are just \(a\) and \(b\)). The intuition here is that the total risk of the portfolio consists of the (weighted) risks of each stock (the first two terms) AND the interactive risk of the stocks (the third term). This interactive risk is what we discussed earlier, with the oil and gas station example. If the two stocks are highly related (have a large covariance), then when one does poorly the other tends to do poorly (and vice versa). This risk is captured by the third term in the above equation.
So, if we have a portfolio that is \(70\%\) invested in a stock whose return has a standard deviation of \(\$3\) (so \(s_1^2 = 9\)), the other \(30\%\) invested in a stock whose return has a standard deviation of \(\$1\), and the two stocks have a covariance of \(5\), our portfolio variance would be \(.7^2(3^2) + .3^2(1^2) + 2(.7)(.3)(5) = 6.6\).
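Here is the same arithmetic as a short sketch, with the numbers from the example plugged in:

```python
w1, w2 = 0.7, 0.3        # portfolio weights
r1, r2 = 10, 5           # expected returns ($)
s1, s2 = 3, 1            # standard deviations of returns ($)
cov12 = 5                # covariance between the two stocks

exp_return = w1 * r1 + w2 * r2
variance = w1**2 * s1**2 + w2**2 * s2**2 + 2 * w1 * w2 * cov12

print(exp_return)   # 8.5, matching the earlier example
print(variance)     # 6.6
```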
In this book, we will usually only work with portfolios made up of two stocks. However, this can be applied to any number of stocks, and the general form for return and risk (as you can probably guess) is as follows:
\[E(P) = \sum\limits_{i=1}^n w_ir_i\] \[Var(P) = \sum\limits_{i=1}^n w_i^2s_i^2 + \sum\limits_{i=1}^n \sum\limits_{j \neq i} w_iw_js_{i,j}\]
For all \(n\) stocks.
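In matrix form, these two sums collapse into \(w^\top \Sigma w\), where \(\Sigma\) is the covariance matrix of the stocks (variances on the diagonal, covariances off it). A sketch with made-up numbers for three stocks:

```python
import numpy as np

w = np.array([0.5, 0.3, 0.2])        # weights for three stocks (sum to 1)
r = np.array([8.0, 5.0, 3.0])        # expected returns
Sigma = np.array([[9.0, 5.0, 1.0],   # covariance matrix: variances on the
                  [5.0, 4.0, 0.5],   # diagonal, covariances off it
                  [1.0, 0.5, 1.0]])

exp_return = w @ r                   # sum of w_i * r_i
variance = w @ Sigma @ w             # captures every variance and covariance term
print(exp_return, variance)
```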
For those following along at home, it may seem like we’ve only done half the work. While it is nice to know the mean and variance of a portfolio, these are only useful if we can optimize them (pick the best allocation/weights given the stocks that make up the portfolio). This is not a book on optimization, but for the sake of interest, here is the rest of the process. If you are not interested in optimization, please feel free to skip this section.
Let’s take, for example, minimizing risk. The method is to take the above equation and set a constraint, from which you can optimize (usually via a Lagrangian). Our constraint here is that all the weights must add up to \(1\) (since we can’t have more than \(100\%\) of our money in stocks). So, the problem for two stocks becomes:
\[min \; (w_1^2s_1^2 + w_2^2s_2^2 + 2w_1w_2s_{1,2})\] \[s.t. \; \; w_1 + w_2 = 1\]
Using the Lagrangian, we would then optimize:
\[L = (w_1^2s_1^2 + w_2^2s_2^2 + 2w_1w_2s_{1,2}) - \lambda(w_1 + w_2 - 1)\] \[\frac{\partial L}{\partial w_1} = 2w_1s_1^2 + 2w_2s_{1,2} = \lambda\] \[\frac{\partial L}{\partial w_2} = 2w_2s_2^2 + 2w_1s_{1,2} = \lambda \] \[\frac{\partial L}{\partial \lambda} =w_1 + w_2 = 1\]
By setting the first two partials equal and following through with some algebra, we obtain:
\[\frac{w_1}{w_2} = \frac{s_2^2 - s_{1,2}}{s_1^2 - s_{1,2}}\]
Which we can then combine with the final partial to solve for the optimal value:
\[w_2^{*} = \frac{1}{\frac{s_2^2 - s_{1,2}}{s_1^2 - s_{1,2}} + 1}\]
And, of course, using our constraint:
\[w_1^{*} = 1 - w_2^{*}\]
Where \(w_i^{*}\) gives the optimal weight of the \(i^{th}\) stock.
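As a sanity check, here is a sketch that computes these closed-form weights for made-up variances and compares them against a brute-force grid search:

```python
import numpy as np

s1_sq, s2_sq, s12 = 9.0, 1.0, 0.5    # variances and covariance (made-up numbers)

# Closed-form minimum-variance weights from the Lagrangian above.
w2 = 1 / ((s2_sq - s12) / (s1_sq - s12) + 1)
w1 = 1 - w2

# Brute-force: evaluate the portfolio variance over a fine grid of w1 values.
grid = np.linspace(0, 1, 100_001)
var = grid**2 * s1_sq + (1 - grid)**2 * s2_sq + 2 * grid * (1 - grid) * s12
print(w1, grid[np.argmin(var)])      # both ~0.0556: the two answers agree
```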
Again, this sort of exercise is beyond the scope of this book; if you find it interesting, feel free to investigate an optimization-oriented economics course. For now, we will stick with our applications of the Modern Portfolio Theory: finding risk and return of a portfolio.
3.5 Introduction to Regression
This section on regression is essentially a sneak peek at what will follow in statistics.
In a nutshell, regression means fitting a line to data. In this book, we will use linear regression, and will try to fit a line using the following (hopefully familiar) linear relationship:
\[\hat{Y_i} = b_1X_i + b_0\]
Where \(X_i\) represents the values of the independent variable and \(\hat{Y_i}\) the predicted values of the dependent variable.
This line is fitted by minimizing the errors of its predictions. That is, for every single \(Y\) value, there is an actual value \(Y_i\) and a value predicted by the linear model, \(\hat{Y_i}\). So, the error of the prediction is just the difference between the actual value and the predicted value:
\[error = e_i = Y_i - \hat{Y_i}\]
This makes intuitive sense, because an error of \(0\) would mean that the predicted value is the same as the actual value.
So, to build the linear relationship, we minimize the error above to get the best line possible. That is, we choose the linear equation that results in the smallest error, or the smallest distance between predicted and actual values. This is the same concept discussed in the previous section on Modern Portfolio Theory: optimizing a quantity using calculus. Again, we will not cover how to actually carry out this optimization.
Anyways, there are many different ways to minimize error, but the best for our purposes (and what we will use) is the least squares method. It is just like it sounds: we minimize the sum of the squared errors of the line we fit.
The official problem is:
\[\min_{b_0, b_1} \; \sum\limits_{i=1}^n (Y_i - b_0 - b_1X_i)^2 = \sum\limits_{i=1}^n (Y_i - \hat{Y_i})^2 = \sum\limits_{i=1}^n e_i^2\]
So, the computer takes this problem and solves it (which would take a very long time by hand) and spits out the linear equation that minimizes this error (the squared distance from predicted to actual) by choosing the optimal values for \(b_0\) and \(b_1\). This is called the least squares regression line.
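To make this concrete, here is a minimal sketch using NumPy’s polyfit, which solves exactly this least squares problem; the data are made up to foreshadow the car-price example below:

```python
import numpy as np

rng = np.random.default_rng(2)
horsepower = rng.uniform(80, 300, size=100)
# Made-up "true" relationship plus noise.
price = 10_000 + 500 * horsepower + rng.normal(0, 5_000, size=100)

# polyfit with deg=1 minimizes the sum of squared errors for a line.
b1, b0 = np.polyfit(horsepower, price, deg=1)
print(b0, b1)   # close to 10000 and 500
```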
This is all just what is going on behind the scenes when the computer spits out something like car price = \(\$10,000 + \$500\)(Horsepower). In this case, the computer figured out the optimal relationship between horsepower and car price that minimizes the errors in predicting car price.
It’s important to understand the fundamentals going on here, since concept questions will show up throughout statistics.
The important thing about generating these linear models is that they allow us to (attempt to) predict some variable. In the car price example above, we are predicting car prices based on their given horsepower. This can be pretty valuable in real life: imagine trying to predict the future price of a stock based on certain characteristics.
It’s very important to be able to interpret one of these linear models as well. Take the model we made up a second ago: car price = \(\$10,000 + \$500(Horsepower)\). How would you interpret the two constants?
First, for the slope: for every unit increase in Horsepower, our model predicts that car price will increase by \(\$500\).
For the intercept: our model predicts that a car with \(0\) Horsepower will sell for \(\$10,000\). Of course, an intercept may not make sense in real life, and this one certainly does not: you can’t have a car with no horsepower. You would then say that, while the model spits this value out, you don’t really take it into account, because it comes from fitting values outside the range of our data - more on this later.
Let’s close this chapter with a simple example. In this book, a Market Model refers to a regression of a stock’s return vs. the return of an index (the index essentially measuring how the market as a whole performs). After running the regression, you would get:
\[\widehat{\text{Stock Return}} = \alpha + \beta(\text{Index Return})\]
For some constants \(\alpha\) and \(\beta\).
You could then interpret \(\alpha\) as the baseline performance of the stock, and the slope \(\beta\) as its relationship to the market. A \(\beta\) of \(0\) would represent cash under the mattress, or no relationship to the market. For \(0 < \beta < 1\), the stock is less risky than the market, and for \(\beta > 1\) the stock is riskier than the market. If \(\beta < 0\), the stock moves in the opposite direction of the market.
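To tie the chapter together, here is a sketch that simulates a market model and recovers \(\alpha\) and \(\beta\) with least squares (all numbers are made up):

```python
import numpy as np

rng = np.random.default_rng(3)
index = rng.normal(0.05, 0.1, size=250)                 # made-up index returns
stock = 0.01 + 1.4 * index + rng.normal(0, 0.05, 250)   # a stock with true beta = 1.4

beta, alpha = np.polyfit(index, stock, deg=1)
print(alpha, beta)   # roughly 0.01 and 1.4: beta > 1, riskier than the market
```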
If you can convince yourself of these dynamics, you likely are comfortable with the basics of linear regression.