5 Statistical Models
When we use theory and generate hypotheses, we are attempting to create a model of a real-world phenomenon. For example, consider the following theory proposed by Edwin Shneidman: suicide is caused by psychache (unbearable mental pain). Imagine this was truly how suicide worked in the real world. Consider the following three possible models:
- Researcher 1: as psychache increases, so does suicide risk
- Researcher 2: as social connectedness decreases, suicide risk increases
- Researcher 3: presence of gene XX4r2 causes suicide
Each researcher could collect data to test how well their model fits the real-world phenomenon. If the model fits the data well, it can accurately predict suicide. For example, statistical analyses might indicate that increased psychache is an accurate predictor of suicide (more to come on regression and causal experiments in later chapters!). If the model does not fit the data well, it likely does not accurately represent the real-world phenomenon of interest. For example, if researcher 3 collected genetic data and the presence of the hypothetical gene did not lead to suicide in some individuals, it would indicate a poor model fit.
5.1 A basic model
Let’s try to model the mean height of psychology professors (in centimeters). You can’t measure all the psych professors in the world, so you go to the Arts and Sciences Building and measure four of your psychology professors. You get the following data.
Name | Height (cm) |
---|---|
Tyler | 181 |
Steve | 190 |
Jenny | 173 |
Cindy | 158 |
The average height of these professors is \((181 + 190 + 173 + 158)/4 = 175.5\) cm. This mean is a model. The model can be represented as:
\(x_i = \overline{x} + e_i\)
Here, \(x_i\) represents the height of professor \(i\); \(\overline{x}\) represents the sample mean height of the professors; and \(e_i\) represents the difference between that professor’s height and the mean, that is, the error.
We can assess how well the model fits the data we collected. For our model, it makes sense to calculate how large the \(e_i\) values are, as these represent the model error.
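As a minimal sketch of the mean model, the following Python snippet computes \(\overline{x}\) and each \(e_i\) for the four professors. The variable names (`heights`, `mean_height`, `errors`) are illustrative choices, not part of the chapter.

```python
# Heights (cm) of the four professors from the table above.
heights = {"Tyler": 181, "Steve": 190, "Jenny": 173, "Cindy": 158}

# The model: every professor is predicted to be the sample mean.
mean_height = sum(heights.values()) / len(heights)

# The error e_i is what is left over: observed height minus the model.
errors = {name: height - mean_height for name, height in heights.items()}

print(mean_height)  # 175.5
print(errors)       # {'Tyler': 5.5, 'Steve': 14.5, 'Jenny': -2.5, 'Cindy': -17.5}
```

Adding each error back to the mean recovers the original height, which is exactly what \(x_i = \overline{x} + e_i\) says.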
5.2 Deviations
One way to assess how well our model (the mean) fits the data is to compare how different our data are from the model. You now know that these differences are the model errors; they are also called deviations. We can subtract the mean from each value to create a numerical representation of this fit. For example, Tyler is 181 cm tall. Our model suggests that the average height is 175.5 cm. We can calculate the deviation here as:
\(e_i = (x_i - \overline{x}) = (181 - 175.5) = 5.5\)
If we calculate the error for each professor and then sum them, we get:
Name | Deviation |
---|---|
Tyler | 5.5 |
Steve | 14.5 |
Jenny | -2.5 |
Cindy | -17.5 |
So, \(\sum{e_i}=5.5 + 14.5 + (-2.5) + (-17.5) = 0\). What?? That can’t be right. Well, yes, it is. The sum of errors around a mean is always zero, because the positive and negative deviations cancel out exactly.
\(\sum_{i=1}^n{e_i}=0\)
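A short derivation shows why this holds for any data set, not just our four professors. It uses only the definition of the sample mean, \(\overline{x} = \frac{1}{n}\sum_{i=1}^n x_i\), so that \(\sum_{i=1}^n x_i = n\overline{x}\):

```latex
\sum_{i=1}^n e_i
  = \sum_{i=1}^n \left( x_i - \overline{x} \right)
  = \sum_{i=1}^n x_i - n\overline{x}
  = n\overline{x} - n\overline{x}
  = 0
```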
There is a way to bypass this statistical conundrum: square each deviation before summing, so that negative and positive errors no longer cancel.
5.3 Variance and Standard Deviation
We can summarize how well our mean model fits the data with the variance and the standard deviation. These are extremely important in statistics, so it’s imperative to become familiar with them.
Above we calculated the deviation of each score. The variance is, in essence, the average squared difference between a score and the mean. For a population, it is:
\(\sigma^2 = {\frac{\sum\limits_{i = 1}^N {\left( {x_i - \mu} \right)^2 }}{N} }\)
Here \(\mu\) is the population mean. But for a sample, we use the sample mean \(\bar x\) and divide by \(N-1\) (see last chapter for the rationale):
\(s^2 = {\frac{\sum\limits_{i = 1}^N {\left( {x_i - \bar x} \right)^2 }}{N-1} }\)
This equation simply means we add up all the squared differences between each score and the mean and divide by the number of scores minus one. So, the squared deviations are:
Name | Squared deviation |
---|---|
Tyler | 30.25 |
Steve | 210.25 |
Jenny | 6.25 |
Cindy | 306.25 |
We then add up the squared deviations, \(30.25+210.25+6.25+306.25=553\), and divide by the number of scores minus one (the sample adjustment to \(N-1\)), \(4-1=3\), to get:
\(s^2 = {\frac{\sum\limits_{i = 1}^N {\left( {x_i - \bar x} \right)^2 }}{N-1} } = \frac{30.25+210.25+6.25+306.25}{4-1} = \frac{553}{3}=184.33\)
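The arithmetic above can be checked with a few lines of Python. This is a sketch rather than a prescribed workflow: the variable names are illustrative, and `statistics.variance` from the standard library is used only as a cross-check because it also divides by \(N-1\).

```python
import statistics

heights = [181, 190, 173, 158]  # professor heights in cm
n = len(heights)
mean_height = sum(heights) / n

# Square each deviation so positive and negative errors cannot cancel.
squared_deviations = [(x - mean_height) ** 2 for x in heights]
sum_of_squares = sum(squared_deviations)

sample_variance = sum_of_squares / (n - 1)  # divide by N - 1 for a sample
population_variance = sum_of_squares / n    # divide by N for a population

print(squared_deviations)                      # [30.25, 210.25, 6.25, 306.25]
print(sum_of_squares)                          # 553.0
print(round(sample_variance, 2))               # 184.33
print(round(population_variance, 2))           # 138.25
print(round(statistics.variance(heights), 2))  # 184.33
```

Note how much the choice of denominator matters with only four scores: dividing by \(N\) instead of \(N-1\) shrinks the result from 184.33 to 138.25.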
Thus, the sample variance of the heights of our psychology professors is \(184.33\) (in squared centimeters). The standard deviation is simply the square root of the variance:
\(s = \sqrt{{\frac{\sum\limits_{i = 1}^N {\left( {x_i - \bar x} \right)^2 }}{N-1} }}\)
While you might think that the standard deviation (SD) is the average absolute difference between a score and the mean, it is not. For example, the SD of our heights is \(\sqrt{184.33} \approx 13.58\) cm. But the average absolute deviation is \(\frac{|5.5| + |14.5| + |-2.5| + |-17.5|}{4} = \frac{40}{4} = 10\) cm. It is most helpful to think of the variance as the average squared deviation and the SD as the square root of the variance.
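To see the two quantities side by side, here is a small Python sketch. It assumes the standard library’s `statistics.stdev`, which computes the sample SD with the \(N-1\) denominator; `mean_abs_dev` is an illustrative name for the average absolute deviation.

```python
import statistics

heights = [181, 190, 173, 158]  # professor heights in cm
mean_height = statistics.mean(heights)

# Sample standard deviation: square root of the N - 1 variance.
sd = statistics.stdev(heights)

# Average absolute deviation: mean of |x_i - mean|, with no squaring involved.
mean_abs_dev = sum(abs(x - mean_height) for x in heights) / len(heights)

print(round(sd, 2))  # 13.58
print(mean_abs_dev)
```

The SD comes out larger because squaring gives extra weight to the scores that sit far from the mean, such as Cindy and Steve.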
5.4 Advanced Models
While above we have simply modeled a mean, later chapters will build up to more advanced models, such as:
\(y_i=\beta_0+x_{1i}\beta_1+x_{2i}\beta_2+x_{3i}\beta_3+e_i\)
Don’t be intimidated: this is a whole lot like the classic high school \(y=mx+b\), with an intercept (\(\beta_0\)) and some slopes (\(\beta_1\), \(\beta_2\), \(\beta_3\)). More to come.
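As a rough preview, the sketch below evaluates the predicted part of such a model (everything except \(e_i\)) for one observation. The function name `predict` and all coefficient and predictor values are hypothetical, chosen only for illustration; later chapters cover how the \(\beta\) values are actually estimated.

```python
def predict(intercept, slopes, predictors):
    """Return beta_0 + x_1*beta_1 + x_2*beta_2 + ... for one observation."""
    return intercept + sum(b * x for b, x in zip(slopes, predictors))


# Hypothetical coefficients and predictor values, for illustration only.
example = predict(1.0, [2.0, 3.0, 0.5], [4.0, 5.0, 6.0])
print(example)  # 27.0, that is 1 + 8 + 15 + 3

# With no predictors at all, the model reduces to a single constant,
# which is the mean model from section 5.1.
mean_only = predict(175.5, [], [])
print(mean_only)  # 175.5
```

Seen this way, the mean is the simplest member of the same family of models: an intercept with no slopes.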
- Calculate the mean, variance, and standard deviation for both the height (in cm) and weight (in kg) of these NHL players.
Player | Height | Weight |
---|---|---|
Connor McDavid | 185.4 | 99.0 |
Auston Matthews | 190.5 | 93.0 |
Sidney Crosby | 180.0 | 91.0 |
Alex Ovechkin | 191.0 | 107.9 |
- Write out the model for NHL height.
- What are the \(e_i\) values for each player when modeling their height?
Statistic | Value |
---|---|
Mean_Height | 186.725000 |
SD_Height | 5.148058 |
var_Height | 26.502500 |
Mean_Weight | 97.725000 |
SD_Weight | 7.587435 |
var_Weight | 57.569167 |
- Write out the model for NHL height.
\(height_i=\overline{x}_{height}+e_i\)
- What are the \(e_i\) values for each player when modeling their height?
Player | Height | e_i |
---|---|---|
Connor McDavid | 185.4 | -1.325 |
Auston Matthews | 190.5 | 3.775 |
Sidney Crosby | 180.0 | -6.725 |
Alex Ovechkin | 191.0 | 4.275 |
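These answers can be reproduced with a short Python sketch using the standard library’s `statistics` module, whose `variance` and `stdev` functions use the sample (\(N-1\)) denominator. The helper name `summarize` and the dictionary layout are illustrative assumptions.

```python
import statistics

# (height in cm, weight in kg) for each player, from the exercise table.
players = {
    "Connor McDavid": (185.4, 99.0),
    "Auston Matthews": (190.5, 93.0),
    "Sidney Crosby": (180.0, 91.0),
    "Alex Ovechkin": (191.0, 107.9),
}
heights = [h for h, _ in players.values()]
weights = [w for _, w in players.values()]


def summarize(values):
    """Return the mean, sample variance, and sample SD of a list of numbers."""
    return (
        statistics.mean(values),
        statistics.variance(values),
        statistics.stdev(values),
    )


mean_h, var_h, sd_h = summarize(heights)
mean_w, var_w, sd_w = summarize(weights)

# e_i for the height model: observed height minus the mean height.
errors_h = {name: round(h - mean_h, 3) for name, (h, _) in players.items()}

print(round(mean_h, 3), round(var_h, 4), round(sd_h, 6))
print(round(mean_w, 3), round(var_w, 6), round(sd_w, 6))
print(errors_h)
```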