Chapter 1 Simple Linear Regression

1.1 Correlation

We’ll start off by loading some required package, entering in a very small data set by hand, and graphing that data set. Then we will compute some basic descriptive statistics.

require(tidyverse)
require(mosaic)
x<-c(0,1,2,3,4,4,5,6,6,9) #hours studying for exam
y<-c(20,70,65,84,72,80,75,95,64,90) #exam score
mean(x)
## [1] 4
sd(x)
## [1] 2.666667
mean(y)
## [1] 71.5
sd(y)
## [1] 20.78595
cor(y~x)
## [1] 0.6795457
exam <- tibble(x,y)
ggplot(exam,aes(x,y)) + geom_point() + theme_bw()

When I refer to the “summary statistics” of a bivariate data set, I’m referring to the means, the standard deviations, and the correlation between the two variables.

The correlation coefficient (also known as Pearson’s product-moment correlation) can be computed with a variety of algebraically equivalent formulas. Typically we will compute this with technology.

r=(xiˉx)(yiˉy)(xiˉx)2(yiˉy)2

r=SSXYSSXSSY

r=Cov(X,Y)Var(X)Var(Y)=sxysxsy

A possibly edifying form of the calculation is to base this calculation on standardizing both of the variables.

r=1n1(xiˉxsx)(yiˉysy) r=1n1zxzy

Let’s show these calculations and how the formula based on standardized z-scores looks graphically.

ggplot(exam,aes(x,y)) + geom_point() +
  geom_hline(yintercept=mean(y),linetype="dashed",color="red") +
  geom_vline(xintercept=mean(x),linetype="dashed",color="red") + 
  theme_bw()

z.x<-(x-mean(x))/sd(x)
z.y<-(y-mean(y))/sd(y)
z.xy<-z.x * z.y
sum(z.xy)/9 #calculates r
## [1] 0.6795457
z.scores <- tibble(x,y,z.x,z.y,z.xy)
z.scores
## # A tibble: 10 x 5
##        x     y    z.x     z.y    z.xy
##    <dbl> <dbl>  <dbl>   <dbl>   <dbl>
##  1     0    20 -1.5   -2.48    3.72  
##  2     1    70 -1.12  -0.0722  0.0812
##  3     2    65 -0.75  -0.313   0.235 
##  4     3    84 -0.375  0.601  -0.226 
##  5     4    72  0      0.0241  0     
##  6     4    80  0      0.409   0     
##  7     5    75  0.375  0.168   0.0631
##  8     6    95  0.75   1.13    0.848 
##  9     6    64  0.75  -0.361  -0.271 
## 10     9    90  1.88   0.890   1.67
cor(y~x,data=exam)
## [1] 0.6795457
cor(z.y~z.x,data=z.scores)
## [1] 0.6795457
ggplot(z.scores,aes(z.x,z.y)) + geom_point() +
  geom_hline(yintercept=0,linetype="dashed",color="red") +
  geom_vline(xintercept=0,linetype="dashed",color="red") + 
  theme_bw() 

We see a moderate positive correlation between hours of study and exam score. This probably matches our intutition.

1.2 Linear Model

Now let’s try to find the “line of best fit”, or in other words a linear model to describe the relationship between hours of study and exam score.

DATA=MODEL+ERROR Y=f(X)+ϵ

starting with a guess. I’m going to guess that the line y=60+4x could be a candidate for predicting exam scores. The y-intercept means that I believe students that do not study will have an average score of 60, while the positive slope indicates that I believe each hour of study will result in a 4 point increase in the exam score. Notice this model would predict you would get a score of Y=100 with x=10 hours of study.

ggplot(exam,aes(x,y)) + geom_point() +
  geom_abline(intercept=60,slope=4,color="blue")

Obviously we do not have a perfect correlation and I cannot exactly predict your exam score with just the number of hours you spent studying. The error term includes both other possible factors I haven’t considered in the model, and random error.

A residual ei=yiˆyi represents the difference between the actual value of Y that we observed in reality from the predicted value from a model, which we will denote with ˆy. For instance, student #7 in our data set had x7=5,y7=75 and ˆy=60+4(5)=80

Our model gives a residual of e7=y7ˆy7=7580=5 This student did 5 points worse than our naive model would predict. I’ll show this graphically for all 10 students.

exam2 <- exam %>%
  mutate(y.hat=60+4*x) %>%
  mutate(e=y-y.hat) 

ggplot(exam2) + geom_point(aes(x,y,color=(e>0))) +
  geom_abline(intercept=60,slope=4,color="blue") +
  geom_segment(aes(x,y,xend=x,yend=y.hat,color=(e>0)))

It seems doubtful that our naive model ˆY=60+4x which was chosen for no real mathematical reason is truly the “line of best fit”. In statistics, the principle of least squares is generally used to find the equation of the linear model. The idea is to minimize the sum of squared residuals, i.e. minimize SSE=(yiˆy)2=e2i

Graphically, this is literally the combined area of all of the squares that represent the squared residuals. In other words, SSE is the size of all of these squares, and we want to choose the line that makes this cumulative area as small as possible.

exam2 <- exam %>%
  mutate(y.hat=60+4*x) %>%
  mutate(e=y-y.hat) 

ggplot(exam2) + geom_point(aes(x,y,color=(e>0)),size=2) +
  geom_abline(intercept=60,slope=4,color="blue") +
  geom_segment(aes(x,y,xend=x,yend=y.hat,color=(e>0))) +
  geom_rect(aes(xmin=x,ymin=y,xmax=x-e,ymax=y-e,color=(e>0)),fill="white") +
  coord_fixed(ratio=1)

SSE <- sum(exam2$e^2)
SSE
## [1] 2403

Instead of using trial and error to reduce SSE and improve our model fit, equations for the least squares regression coefficients can be found through the use of calculus (which we won’t do in this class).

1.3 Assumption of the Linear Model

Let’s express our linear model as: Y=β0+β1X+ϵ with the following mathematical assumptions:

  1. Linearity: We assume a linear pattern. We’ll deal with handling this assumption later.

  2. Zero Mean: The error distribution is centered at zero and the residuals have mean zero. This means that the line will have points randomly above and below it.

  3. Uniform Spread: The error distribution has a common variance σ2. This means the variance of the response (in our example, exam score) does not change as the explanatory variable changes (in our case, as we study for fewer or more hours).

  4. Independence: Errors are assumed to be independent of each other. This assumption is often not met if the explanatory variable is time (think of the temperature or the price of a stock).

  5. Normality: When we wish to make inferences on our linear model, the errors are assumed to be random and to be normal. ϵN(0,σ2)

In a basic class like STA 135, I will give students the following formulas, which relate the estimates of the regression coefficients to basic summary statistics:

ˆβ1=rsysx

ˆβ0=ˉyˆβ1ˉx

We can also find the slope based on the sums of squares.

ˆβ1=SSXYSSX

beta.hat1<-cor(y~x,data=exam)*sd(y)/sd(x)
beta.hat0<-mean(y)-beta.hat1*mean(x)
beta.hat1
## [1] 5.296875
beta.hat0
## [1] 50.3125
SSX <- sum((x-mean(x))^2)
SSX
## [1] 64
SSXY <- sum((x-mean(x))*(y-mean(y)))
SSXY
## [1] 339
SSXY/SSX #beta.hat1
## [1] 5.296875

Our new and improved least squares regression equation is ˆY=50.3125+5.296875X

Taking student #7 again, ^y7=50.3125+5.2968755=76.79688

Rounding off, the residual is e7=7576.8=1.8

exam2 <- exam %>%
  mutate(y.hat=50.3125+5.239687*x) %>%
  mutate(e=y-y.hat) 

ggplot(exam2) + geom_point(aes(x,y,color=(e>0)),size=2) +
  geom_abline(intercept=50.3125,slope=5.239687,color="blue") +
  geom_segment(aes(x,y,xend=x,yend=y.hat,color=(e>0))) +
  geom_rect(aes(xmin=x,ymin=y,xmax=x-e,ymax=y-e,color=(e>0)),fill="white") +
  coord_fixed(ratio=1)

SSE <- sum(exam2$e^2)
SSE
## [1] 2093.592

The SSE decreased from the earlier model’s value of 2403 to 2093.592, and there is no equation of a straight line that would give a smaller SSE for this data set.

Obviously, it will be easier to compute the least squares equation, the fitted values ˆyi and the residuals ei with R.

model1 <- lm(y~x,data=exam)
model1
## 
## Call:
## lm(formula = y ~ x, data = exam)
## 
## Coefficients:
## (Intercept)            x  
##      50.313        5.297
summary(model1)
## 
## Call:
## lm(formula = y ~ x, data = exam)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -30.312  -6.438   2.297  11.805  17.797 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   50.313      9.569   5.258 0.000766 ***
## x              5.297      2.022   2.620 0.030655 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16.17 on 8 degrees of freedom
## Multiple R-squared:  0.4618,	Adjusted R-squared:  0.3945 
## F-statistic: 6.864 on 1 and 8 DF,  p-value: 0.03066
fitted(model1)
##        1        2        3        4        5        6        7        8 
## 50.31250 55.60938 60.90625 66.20312 71.50000 71.50000 76.79688 82.09375 
##        9       10 
## 82.09375 97.98438
residuals(model1)
##          1          2          3          4          5          6          7 
## -30.312500  14.390625   4.093750  17.796875   0.500000   8.500000  -1.796875 
##          8          9         10 
##  12.906250 -18.093750  -7.984375

Where is the SSE at in this information? It’s there, but not in an obvious fashion.

The book covers the standard error of regression, which is an estimate of the square root of the common variance σ2 that we assume the error term has.

ˆσϵ=(yiˆy)n2=SSEn2

We have n2 degrees of freedom for estimating the standard error of regression in a simple linear regression model (losing 1 df for each parameter in the model; we will lose more df when we fit more complex models).

For our example, we know n=10 and SSE=2093.592, so

ˆσϵ=2093.5928=16.17711 Oh, R calls this residual standard error. It is also called root mean square error or root MSE.

Another common notation for ˆσϵ is sy|x.

1.4 Tidying Our Linear Model

Let’s look at another new package to help “tidy” up our data. The broom package is also part of the tidyverse and can be used to more neatly package up the results of statisical analyses, such as fitting linear models, into “tibbles” for subsequent analyses and/or graphing. You may need to install broom.

require(broom)
tidy(model1)
## # A tibble: 2 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)    50.3       9.57      5.26 0.000766
## 2 x               5.30      2.02      2.62 0.0307
glance(model1)
## # A tibble: 1 x 11
##   r.squared adj.r.squared sigma statistic p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>   <dbl> <int>  <dbl> <dbl> <dbl>
## 1     0.462         0.395  16.2      6.86  0.0307     2  -40.9  87.8  88.7
## # ... with 2 more variables: deviance <dbl>, df.residual <int>
model1.stuff <- augment(model1)
model1.stuff
## # A tibble: 10 x 9
##        y     x .fitted .se.fit  .resid  .hat .sigma   .cooksd .std.resid
##    <dbl> <dbl>   <dbl>   <dbl>   <dbl> <dbl>  <dbl>     <dbl>      <dbl>
##  1    20     0    50.3    9.57 -30.3   0.35    9.85 1.45         -2.32  
##  2    70     1    55.6    7.93  14.4   0.241  16.1  0.165         1.02  
##  3    65     2    60.9    6.52   4.09  0.162  17.2  0.00742       0.277 
##  4    84     3    66.2    5.50  17.8   0.116  15.7  0.0895        1.17  
##  5    72     4    71.5    5.11   0.500 0.1    17.3  0.0000590     0.0326
##  6    80     4    71.5    5.11   8.50  0.1    17.0  0.0170        0.554 
##  7    75     5    76.8    5.50  -1.80  0.116  17.3  0.000912     -0.118 
##  8    95     6    82.1    6.52  12.9   0.162  16.4  0.0738        0.872 
##  9    64     6    82.1    6.52 -18.1   0.162  15.6  0.145        -1.22  
## 10    90     9    98.0   11.3   -7.98  0.491  16.8  0.230        -0.692
# help(augment)
sum(model1.stuff$.resid) #add up to zero
## [1] 1.776357e-15

The most basic residual plot places the fitted values on the x-axis and the residuals on the y-axis. When the regression assumptions are met, the points in such a plot should be randomly scattered above and below a horizontal line through zero, with no patterns that might indicated violations of linearity, normality, and/or nonconstant variance.

The normal probability plot or qq-plot can be used to look for violations of normality in the residuals. If the residuals are normally distributed, their values will coincide with “theoretical quantiles” form the standard normal distribution and the points will lie on a straight line. This is very cumbersome to compute by hand, so we will use R.

ggplot(model1.stuff,aes(x=.fitted,y=.resid)) + geom_point() +
  geom_hline(yintercept=0,linetype="dashed",color="red") + 
  ggtitle("Fitted vs Residuals")

ggplot(model1.stuff,aes(sample=.resid)) +
  stat_qq() + stat_qq_line(color="red") +
  ggtitle("Normal Probability Plot")

It is extremely difficult to make judgements based on a sample this small, so let’s look at a larger data set. Here, we will try to predict the weight of a backpack based on the body weight of a student from a sample of n=100

require(Stat2Data)
data(Backpack)
model.backpack <- lm(BackpackWeight~BodyWeight,data=Backpack)
tidy(model.backpack)
## # A tibble: 2 x 5
##   term        estimate std.error statistic p.value
##   <chr>          <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)   5.98      3.03        1.97  0.0514
## 2 BodyWeight    0.0371    0.0195      1.91  0.0592
backpack.stuff <- augment(model.backpack)

ggplot(Backpack, aes(BodyWeight,BackpackWeight)) +
  geom_point() +
  stat_ellipse() +
  ggtitle("Body Weight vs Backpack Weight")

ggplot(backpack.stuff,aes(x=.fitted,y=.resid)) + geom_point() +
  geom_hline(yintercept=0,linetype="dashed",color="red") + 
  ggtitle("Fitted vs Residuals")

ggplot(backpack.stuff,aes(sample=.resid)) +
  stat_qq() + stat_qq_line(color="red") +
  ggtitle("Normal Probability Plot")

1.5 Log Transfromation

With this data set, I’ll show a bit about when a transformation is needed (we’ll use the log transformation) with species of oak trees from the “Atlatnic” region.

Acorn Size and Geographical Range in the North American Oaks (Quercus L.) Author(s): Marcelo A. Aizen and William A. Patterson Source: Journal of Biogeography, Vol. 17, No. 3 (May, 1990) pp.327-332 URL: http://www.jstor.org/stable/2845128

An EXCEL file I saved in .csv format, then posted on my webpage http://murraystate.edu/academic/faculty/cmecklin/Acorns.csv

I will generally recommend you save an .xls or .xlsx file in the .csv format before importing into R.

Acorn.Size is weight in grams, Range.Area is area in hundreds of squared kilometers, and Tree.Height is in meters.

acorns<-read_csv(file="http://campus.murraystate.edu/academic/faculty/cmecklin/STA265/Acorns.csv")
acorns
## # A tibble: 39 x 5
##    Species                   Region   Range.Area Acorn.Size Tree.Ht
##    <chr>                     <chr>         <dbl>      <dbl>   <dbl>
##  1 Quercus alba L.           Atlantic      24196        1.4      27
##  2 Quercus bicolor Willd.    Atlantic       7900        3.4      21
##  3 Quercus macrocarpa Michx. Atlantic      23038        9.1      25
##  4 Quercus prinoides Willd.  Atlantic      17042        1.6       3
##  5 Quercus Prinus L.         Atlantic       7646       10.5      24
##  6 Quercus stellata Wang.    Atlantic      19938        2.5      17
##  7 Quercus virginiana Mill.  Atlantic       7985        0.9      15
##  8 Quercus Michauxii Nutt.   Atlantic       8897        6.8      30
##  9 Quercus lyrata Walt.      Atlantic       8982        1.8      24
## 10 Quercus Laceyi Small.     Atlantic        233        0.3      11
## # ... with 29 more rows
atlantic <- acorns %>%
  filter(Region=="Atlantic") 

ggplot(atlantic,aes(x=Acorn.Size,y=Range.Area)) +
  geom_point()

cor(Range.Area~Acorn.Size,data=atlantic)
## [1] 0.4259023
# log transform both axes

ggplot(atlantic,aes(x=Acorn.Size,y=Range.Area)) +
  geom_point() +
  coord_trans(x="log10",y="log10")

cor(log10(Range.Area)~log10(Acorn.Size),data=atlantic)
## [1] 0.6241522
# fit model with no transformation, residual plot
model.no <- lm(Range.Area~Acorn.Size,data=atlantic)
tidy(model.no)
## # A tibble: 2 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)    7399.     1950.      3.79 0.000796
## 2 Acorn.Size     1218.      508.      2.40 0.0238
ggplot(augment(model.no),aes(x=.fitted,y=.resid)) +
  geom_point() + 
  geom_hline(yintercept=0,color="red",linetype="dashed")

# nonconstant variance

# fit model with transofrmation, residual plot
model.log <- lm(log10(Range.Area)~log10(Acorn.Size),data=atlantic)
tidy(model.log)
## # A tibble: 2 x 5
##   term              estimate std.error statistic  p.value
##   <chr>                <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)          3.66     0.0881     41.5  2.70e-25
## 2 log10(Acorn.Size)    0.764    0.188       4.07 3.86e- 4
ggplot(augment(model.log),aes(x=.fitted,y=.resid)) +
  geom_point() + 
  geom_hline(yintercept=0,color="red",linetype="dashed")

Now let’s try the California region. Will the same general trend of larger acorns being associated with a larger range hold on the West Coast?

You will notice a point that is “influential”–seems to be an “outlier” that is way off the general pattern.

There exist fancier version of residuals and other statistics to help quantify the amount of “leverage” a point has on the regression equation. The general idea is that no point should exert too much influence on the fit.

What’s the story with the Calfornia outlier? Why is the acorn large but the range over which that species of oak tree is found so small?

california <- acorns %>%
  filter(Region=="California")

# log transform both axes

ggplot(california,aes(x=Acorn.Size,y=Range.Area)) +
  geom_point() +
  coord_trans(x="log10",y="log10")

cor(log10(Range.Area)~log10(Acorn.Size),data=california)
## [1] 0.02281815
# fit model with transofrmation, residual plot
model.log <- lm(log10(Range.Area)~log10(Acorn.Size),data=california)
tidy(model.log)
## # A tibble: 2 x 5
##   term              estimate std.error statistic    p.value
##   <chr>                <dbl>     <dbl>     <dbl>      <dbl>
## 1 (Intercept)         2.60       0.263    9.89   0.00000393
## 2 log10(Acorn.Size)   0.0272     0.397    0.0685 0.947
ggplot(augment(model.log),aes(x=.fitted,y=.resid)) +
  geom_point() + 
  geom_hline(yintercept=0,color="red",linetype="dashed")

# we will delete outiler, refit

RESULTS: (from the original article) Logarithmically transformed geographical areas and acorn volumes were correlated for oaks of each of the Atlantic and California regions (Fig. 2). A highly significant positive correlation was found for twenty-eight Atlantic species (r=0.635,P<0.0005). For the eleven California oaks the correlation was not significant (r=0.02,P>0.95), but the exclusion of Quercus tomentella Engelm, the only non-continental species, resulted in a highly positive correlation (r=0.667,n=10,P<0.05). The small range of Quercus tomentella is a function of the species being confined to the Channel Islands of southern California and the oceanic island of Guadalupe off the dry coast of Baja California (Little, 1976). The large acorns in this species could, in fact, be related to its condition as an island species (Carlquist, 1974). Oak species with larger trees could bear bigger and more seeds and larger and more exposed amounts of winddispersed pollen grains than species with smaller trees. Morphological evidence (Kaul & Abbe, 1984) and selfpollination experiments (Wright, 1953) indicate that the oak group is composed principally of outcrossing species.

1.6 Simple Linear Regression with the Planets

How many planets are there in our solar system?

When I was a kid, the answer was 9. For those of you educated in the 21st century, the answer is 8 after Pluto got “demoted” to “dwarf planet” status in 2006.

If you lived in ancient times, your answer was 5 (Mercury, Venus, Earth, Mars, Jupiter), the five planets visible with the naked eye.

If you lived in the 17th century, the answer became 6 when Galileo discovered Saturn in 1610. Let’s consider the data below (source: https://www.enchantedlearning.com/subjects/astronomy/planets/), where D is the distance from the sun, measured in AU (astronomical units, where 1 AU is approx. 93 million miles or 150 million km, the average distance Earth is from the sun) and T is the period of revolution around the sun (i.e. a year), measured in Earth units. We’ll need to make sure all planets are measured with the same unit for T. I’ll suggest dividing days by 365.25 to convert to years (these are Earth-years, of course).

Planet D T
Mercury 0.390 87.96 days
Venus 0.723 224.68 days
Earth 1 365.25 days
Mars 1.524 686.98 days
Jupiter 5.203 11.862 years
Saturn 9.539 29.456 years

I’m going to create a small EXCEL spreadsheet with this data, to illustrate the use of the “Import Dataset” option in the upper right hand window of R Studio. We will save in the .xls or .xlsx format. If you don’t have EXCEL, you can create a spreadsheet with OpenOffice or just a plain text file with Notepad and save in either .txt or .csv format (you can also save in these formats with EXCEL).

We will plot the following: TD ln(T)D Tln(D) ln(T)ln(D)

Which seems to be the best choice of transformations if we know how to fit a least squares linear model?

# here I'm entering data straight into R, will demonstrate using EXCEL in class
D <- c(0.390,0.723,1,1.524,5.203,9.539)
T <- c(87.96/365.25,224.68/365.25,1,686.98/365.25,11.862,29.456)
planets <- tibble(D,T)
ggplot(planets,aes(x=D,y=T)) + geom_point() + geom_smooth()

ggplot(planets,aes(x=D,y=log(T))) + geom_point() + geom_smooth()

ggplot(planets,aes(x=log(D),y=T)) + geom_point() + geom_smooth()

ggplot(planets,aes(x=log(D),y=log(T))) + geom_point() + geom_smooth()

planet.mod <- lm(log(T)~log(D),data=planets)
summary(planet.mod)
## 
## Call:
## lm(formula = log(T) ~ log(D), data = planets)
## 
## Residuals:
##          1          2          3          4          5          6 
## -0.0064103  0.0042433  0.0029735  0.0018373 -0.0008732 -0.0017707 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.002974   0.001948  -1.527    0.202    
## log(D)       1.502021   0.001592 943.259 7.58e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.004336 on 4 degrees of freedom
## Multiple R-squared:      1,	Adjusted R-squared:      1 
## F-statistic: 8.897e+05 on 1 and 4 DF,  p-value: 7.579e-12
D^3 / T^2 # equal to approx. 1 when units are AU for D and Earth-years for T
## [1] 1.0228309 0.9987723 1.0000000 1.0005693 1.0010266 1.0003708

Once we fit the equation of the least squares model, we will algebraically solve the model for T and D. Using AUs and Earth-years as units, you’ll end up finding that D3T21

If you know Kepler’s third law of planetary motion, it might be familiar.

https://en.wikipedia.org/wiki/Kepler%27s_laws_of_planetary_motion

Finally, we’ll use our model to make predictions of T for Uranus (discovered in 1761), Nepture (1846), and Pluto (1930, demoted to dwarf planet in 2006).

Planet D T
Uranus 19.18 ???
Neptune 30.06 ???
Pluto 39.53 ???

By the way, here are the actual values of T for Uranus, Neptune, and Pluto.

Planet D T
Uranus 19.18 84.07 years
Neptune 30.06 164.81 years
Pluto 39.53 247.7 years