16 Covariance, Correlation, and Least-squares regression

Before we can start on least-squares regression, we need to review covariance and correlation.

16.1 Covariance and Correlation

Covariance and Correlation provide a numerical measure of how and how strongly two random variables \(X\) and \(Y\) are related.

If you look at two plots of the same data below, you can see that the strength of the (linear) relationship between the two variables seems to depend on the scale used in the plots. This tells us that a graphical tool such as a scatter plot has limitations when it comes to judging correlation.

This demonstrates the need to develop a measurement for the correlation of two random variables that is scale and unit invariant.

Let’s start with a summary of useful formulas and properties.

  1. Recall the formulas for the expected value and the variance:
    \(\small E[X]= \sum_{ \text{all x} }x \cdot P(X=x)\) for a discrete random variable \(X\), and
    \(\small E[X]= \int_{ \text{all x} } x \cdot f(x)dx\) for a continuous random variable \(X\).
    \(\small Var[X]) = E[(X-E[X])^2] = E[X^2]-(E[X])^2\)

  2. For real valued constants a and b we have
    \(\small E[aX+b] = aE[X]+b\) and
    \(\small VAR[aX+b] = a^2 VAR[X].\)

  3. The standardized score or z-score of a random variable X with mean \(\mu\) and standard deviation \(\sigma\) is \[\small z=z_x = \frac{X-\mu}{\sigma}\] Note that \(E[z]=0\) and \(VAR[z]=1\).

  4. If \(\small X\) and \(\small Y\) are two independent random variables, then
    \(\small E[XY]=E[X]E[Y]\) and
    \(\small VAR[X \pm Y]=VAR[X]+VAR[Y].\)

  5. For two random variables \(\small X\) and \(\small Y\), the covariance is defined as \[\small Cov[X,Y] = E[(X-\mu_{X})(Y-\mu_{Y})]\] where \(\small \mu_{X}\) and \(\small \mu_{Y}\) are the expected values of X and Y resp.

Assignment Assume \(\small X\) and \(\small Y\) are two random variables. For practice, prove the following:

P1 Prove \(\small Cov[X,y] = Cov[Y,X]\)

P2 Prove \(\small Cov[aX+b,y] = aCov[X,Y]\)

P3 Prove \(\small Cov[X,Y] = E[XY] - E[X]E[Y]\).

P4 Prove \(\small Cov[X+Y,Z] = Cov[X,Z] + Cov[Y,Z]\)

P5 Prove \[ \small Cov[\sum_{i=1}^n X_i,\sum_{j=1}^m Y_j] = \sum_{i=1}^n \sum_{j=1}^m Cov(X_i Y_j)\]

P6 Prove If X and Y are independent, then Cov(X,Y)=0.
Note that the converse is false. Cov(X,Y)=0 does not imply X and Y are independent.

P7 Prove \(\small cov(X,c) = 0\) for any real constant \(\small c\).

P8 Prove \(\small VAR[X \pm Y] = VAR[X] + VAR[Y] \pm 2 Cov(X,Y)\)

P9 Prove If X and Y are independent, then \(\small VAR[X+Y] = VAR[X] + VAR[Y]\).

Let’s compute the co-variance for Bwt and Hwt for cats from the cats data set in the MASS package. We convert the body weight from kg to grams and re-compute the co-variance. Finally, we convert grams to lbs (1 lb = 453.592 gr) and re-compute the co-variance. You see that the covariance is unit dependent.

plot(cats$Bwt, cats$Hwt, xlab="Body weight in kg",ylab="Heart weight")
cb1000 <- cats$Bwt*1000
plot(cb1000, cats$Hwt, xlab="Body weight in gr",ylab="Heart weight")
cblb <- cb1000/453.592
plot(cblb, cats$Hw,xlab="Body weight in lb",ylab="Heart weight")
print(paste("Using kg, the co-variance is",cov(cats$Bwt,cats$Hwt)))
#> [1] "Using kg, the co-variance is 0.950112665112665"
print(paste("Using gr, the co-variance is",cov(cb1000, cats$Hwt)))
#> [1] "Using gr, the co-variance is 950.112665112665"
print(paste("Using lb, the co-variance is",cov(cblb, cats$Hwt)))
#> [1] "Using lb, the co-variance is 2.09464158343327"

A unit-invariant measure of the relation ship between two random variables \(\small X\) and \(\small Y\) is the correlation coefficient \[ \rho (X,Y) = \frac{Cov[X,Y]}{\sigma_X \sigma_Y},\]

where \(\sigma _X\) and \(\sigma_Y\) are the respective standard deviations of \(\small X\) and \(\small Y\).


P10 Prove Multiplying the variables by constants, or adding constants, does not change the correlation coefficient, i.e. \[\small \rho(a+bX, c+dY) = \rho(X,Y).\]

Next, we shall prove that \(\small | \rho(X,Y) | \le 1\).

Theorem For two random variables \(\small X\) and \(\small Y\), we have \(\small | \rho(X,Y) | \le 1.\).

Proof We use the standardized versions of the random variables \(\small x\) and \(\small y\):
\(\small Z_x = \frac{x-\mu_x}{\sigma_x}\) and \(\small Z_y = \frac{x-\mu_y}{\sigma_y}\). Note that \(\small \mu_{Z_X} = \mu_{Z_Y} = 0\) and \(\small \sigma_{Z_X} = \sigma_{Z_Y} = 1\).
We have

\(\begin{align} \ \small 0 &\small\le Var[Z_X + Z_Y] = Var[Z_X]+ VAR[Z_Y] + 2 Cov(Z_X,Z_Y) &\small by \hspace{3pt} P7 \\ & \small = 2 + 2 Cov(Z_X,Z_Y) = 2 + 2 \frac{Cov(Z_X,Z_Y)}{\sigma_{Z_X} \sigma_{Z_Y}} &\small because \hspace{3pt} \sigma_{Z_X} = \sigma_{Z_Y} = 1 \\ &\small = 2+2\rho(Z_X,Z_Y) & \small definition \hspace{3pt} of \hspace{3pt} \rho\\ &\small = 2+2 \rho(X,Y)&\small by \hspace{3pt} P10 \end{align}\)

Similarly, we get \(\small 0 \le 2-2\rho(X,Y)\).

Put together, we have \(\small 0 \le 2 \pm 2 \rho(X,Y)\) or \(\small \rho(X,Y) \le 1\) and \(\small -\rho(X,Y) \le 1\). This means \(\small |\rho(X,Y)| \le 1|\)

Assignment P11 Prove \(\small | \rho(X,Y) | = 1\) iff \(\small Y=aX+b\).

A positive correlation coefficient means the data are positively correlated, a negative one that they are negatively correlated. However, the magnitude of the correlation coefficient does not give the slope, it measures how “perfectly” the data are lined up. As you showed in P11, \(\small | \rho(X,Y) | = 1\) means the data form a perfect line. In that case (perfectly correlated variables), the slope of the resulting line is \(\small s_y/s_x\).

16.2 (Linear) least squares regression

When trying to find the line-of-best-fit (aka regression line, least squares regression line, trend line), we need to define what “best” means. For us, “best” means that the distance between the observed y-values (the data) and the predicted y-values (the line) is minimize. We compute the coefficients \(\small a\) and \(\small b\) in \(\small y=a+bx\) in such a way that the sum of the squares (\(\small ssq\)) of the vertical differences between a data point and \(\small y=a+bx\) is minimized. Because the line is determined by the parameters \(\small a\) and \(\small b\), we interpret \(\small ssq\) as a function \(\small ssq(a,b)\)of \(\small a\) and \(\small b\).

Assume we have paired data \(\small (x_i,y_i), i=1,...,n\). We then minimize \[\small ssq(a,b) = \sum_{i=1}^n (y_i - (a + b x_i))^2 .\] If you already took calculus 3, you will recognize this as a simple two variable optimization problem. You will have to find the partial derivatives of \(\small ssq(a,b)\) with respect to \(\small a\) and \(\small b\), set those to zero, and solve for \(\small a\) and \(\small b\). In practice, we will use R to do that.

Note if you just want to match the sample data, then this is a deterministic problem and the least squares regression line is derived without making any assumptions about the data. You could use a similar process to find non-linear regression lines. However, if you want to use the regression line derived from your sample to estimate the regression line for the whole population, then this becomes a statistical problem and we need to make assumptions. More about this later.

16.3 Pearson’s correlation coefficient

Person’s correlation coefficient \(\small r_{Pearson}\) or \(\small r_{x,y}\) measures the strength and direction of the linear relationship between two variables. You can think of it as the sample analog to the correlation \(\small \rho\) introduced above. Given a sample of \(\small (x,y)\) values, the sample Pearson correlation coefficient is computed as \[\small r_{Pearson}= \frac{\sum_{i=1}^n(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^n (x_i-\bar{x})^2}\sqrt{ \sum_{i=1}^n(y_i-\bar{y})^2}}\] Note that, if we have a population, this is the correlation coefficient introduced above.

16.4 Spearman’s rank correlation coefficient

Spearman’s rank correlation coefficient assesses whether there is a monotone relationship between the rank of two variables. This relationship may be linear or not. This works similar to Pearson’s, except we use the rank of each variable instead of the value of the variable. The advantage is that you can use this with qualitative ordinal data.

After assigning each value a rank, there may be ties. There are different methods of dealing with ties. Assuming the 2nd and 3rd data point are tied, the ranking could be:

1 3 3 4, 1 2 2 3, 1 2 2 4, 1 2 3 4 , 1 2.5 2.5 4 ….

If \(\small r(x_i)\) stands for the rank of \(\small x\), then the Spearman correlation coefficient is computed as

\[\small r_{Spearman}= r_S = \frac{\sum_{i=1}^n (r(x_i)-\bar{r}(x_i))(r(y_i)-\bar{r}(y_i))}{\sqrt{\sum_{i=1}^n (r(x_i)-\bar{r}(x_i))^2} \sqrt{ \sum_{i=1}^n (r(y_i)-\bar{r}(y_i))^2}}\]

You will sometimes see an alternative formula involving \(\small d_i = r(x_i)-r(y_i)\): \[\small r_{Spearman} = 1-\frac{6 \sum d_i^2}{n(n^2-1)},\] Careful! This formula can only be used if you have no ties, i.e. all ranks are distinct integers.

Example The example below shows data with a perfect rang correlation, but imperfect linear correlation.

#> Warning: package 'ggplot2' was built under R version 4.3.2
x <- c(0,1,2,3,5,6,8,9,11,7)
y <- 10-x^3
data <- data.frame(x,y)
ggplot(data, aes(x, y))+geom_point()+labs(title="Some sample data")
#> [1] "Pearson correlation coefficient: -0.893628265531117"
#> [1] "Spearman correlation coefficient: -1"

16.5 Sample investigation of the geyser data from MASS

Let’s put all this to practice using the geyser data from the MASS package. The MASS package contains a data set called geyser which contains data on waiting time between eruptions of Old Faithful and length of said eruptions.

16.5.1 Graphical investigation

We start by looking at a graph of our data.

ggplot(geyser, aes(waiting, duration))+geom_point()+
  labs(title="duration vs waiting time")

Looking at the graph, we find that the duration seems to be inversely correlated to the waiting time. Calculating the correlation coefficients confirms this.

print(paste("Pearson correlation coefficient:",
            cor(geyser$duration, geyser$waiting, method="pearson")))
#> [1] "Pearson correlation coefficient: -0.644623006047782"
print(paste("Spearman correlation coefficient:",
            cor(geyser$duration, geyser$waiting, method="spearman")))
#> [1] "Spearman correlation coefficient: -0.650089254703162"

To compute the equation of the regression line, you use the lm linear model function. I want to use the outputs later, so I am calling the result using Pearson glmP, and the one from Spearman glmS.

x <- geyser$waiting
y <- geyser$duration
glmP <- lm(y~x)
#> Call:
#> lm(formula = y ~ x)
#> Coefficients:
#> (Intercept)            x  
#>     7.31314     -0.05327

yr <- rank(y)
xr <- rank(x)
glmS <- lm(yr~xr)
#> Call:
#> lm(formula = yr ~ xr)
#> Coefficients:
#> (Intercept)           xr  
#>    247.2607      -0.6484

We can use the confint function to find confidence intervals for the parameters \(\small a\) and \(\small b\), the y-intercept and the slope.

#>                   2.5 %      97.5 %
#> (Intercept)  6.78191691  7.84437178
#> x           -0.06048661 -0.04605735
#>                   0.5 %      99.5 %
#> (Intercept)  6.61334245  8.01294624
#> x           -0.06277603 -0.04376793
#>                   2.5 %      97.5 %
#> (Intercept) 232.2846765 262.2367392
#> xr           -0.7349511  -0.5618584

I now want to plot the data with the regression lines added:

interceptp <- glmP$coefficients[1]
slopep <- glmP$coefficients[2]
intercepts <- glmS$coefficients[1]
slopes <- glmS$coefficients[2]

plot1 <- ggplot(geyser, aes(x=waiting, y=duration))+
  geom_point() + 
  geom_abline(intercept = interceptp, slope = slopep, color="red",linewidth=1.5)+
  labs(title="Pearson's correlation, geyser data")
plot2 <- ggplot(geyser, aes(x=rank(waiting), y=rank(duration)))+
  geom_point() + 
  geom_abline(intercept = intercepts, slope = slopes, color="red",linewidth=1.5)+
  labs(title="Spearman's correlation, geyser data")
grid.arrange(plot1, plot2, ncol=2)

As we will be (mostly) using numerical data at the ratio level, we will from now on use Pearson’s correlation.

16.6 Formal hypothesis test for correlation

We can now proceed to formal hypothesis testing. In particular we will test:

\[\small H_O: \text{there is no correlation between x and y}\] \[\small H_A: \text{there is correlation between x and y}\]

As mentioned above, we have to satisfy some pre-requisites. Formally, they are:

Let \(\small (x_1, Y_1), (x_2, Y_2), ..., (x_n, Y_n)\) be points with fixed values \(\small x_i\) and random \(\small Y_i\) values.

  1. The x-values are fixed, not random (so in our example, the waiting times are fixed or given)
  2. Each \(\small Y_i\) value is independent of the other \(\small Y_j\) values.
  3. \(\small Y_i \sim \mathcal{N}(\mu_i,\sigma^2)\) In our example, that means that for any given wait time, the possible duration is random and normally distributed. This one is really hard to check, you would need many observations for each \(\small x_i\) to be able to run a test. Luckily, this assumption is also the least important.
  4. The relationship between the x vales and the means \(\small \mu_i\) is linear. This means that when you plot the $(x_i,y_i)) pairs, you should observe a linear-ish shape, not for example a parabola. This assumption is critical, but often violated in practice. If the violation is “small” it’s accepted practice to go on, but if the violation is large, then p-values, confidence intervals, and standard errors may all be incorrect.
  5. The residuals \(\small \epsilon_i = Y_i - \mu_i\) are independent (this is critical), all have the same variance, and are normally distributed.

Our data wasn’t that linear, but also not obviously non-linear. Let’s proceed, but let’s keep in mind that our results may not be reliable. We will next check a plot of the residuals.

ggplot(glmP, aes(x, glmP$residuals))+geom_point()+
geom_hline(yintercept=0, color="purple", linewidth=1)

We see two distinct bands, which is cause for concern. Next we check if the residuals are normally distributed. A simple QQ-plot will do:

ggplot(glmP, aes(sample=glmP$residuals))+stat_qq(distribution = qnorm, dparams=list(0,sd(glmP$residuals)))+
  geom_abline(intercept=0, slope=1,color="red")

This pattern, an S-curve, implies a distribution with long tails. Not good. (BTW, a curve that looks like a cube function would imply that you have a distribution with short tails, also not good. Something that is very concave down means a right skewed distribution, something very concave up a left skewed distribution. You really want to see a pretty straight line here.)

When conducting a residual analysis, a “residuals versus fits” plot is a very useful plot. It is a scatter plot of residuals on the y axis and fitted values (estimated responses) on the x axis. The plot is used to detect non-linearity, unequal error variances, and outliers.

Don’t forget though that interpreting these plots is subjective. It is easy to over-interpret these plots, looking at every little deviation as a big problem. You want to be careful about putting too much weight on residual vs. fits plots when you are working with small data sets. That said, lets compute one. glmP already has the fitted values stored.

ggplot(glmP, aes(x=glmP$fitted.values, y=glmP$residuals))+geom_point()+
geom_hline(yintercept=0, color="purple", linewidth=1)+
  labs(title="residuals vs fits plot - geyser data",x="fits",y="residuals")

Here again we see a pattern, namely two distinct groups of data in two decreasing bands. This leads us to think we missed something. We should go back and check if there is something else that causes this grouping. Maybe the data were collected during two different seasons, bu two different operators, during different rainfalls / drought events. Further investigation is warranted.

All this leads us to assume that the pre-requisites are not satisfied, so any p-value we may compute might/will be off.

16.6.1 Diagnostic plots

R has built-in diagnostic plots for linear regression. You create them like this:



  1. Residuals vs fitted plot Here we check if there is a non-linear pattern in the residuals, which would indicate that a linear model wasn’t a good choice to begin with. A horizontal-ish line means that the linear model was indeed appropriate.
  2. Normal qq-plot
    You know qq plots. This one checks if the residuals are normally distributed. The ones that are too far from the 45 degree line are flagged so you can check into them.
  3. Scale location Plot
    This plot is used to check for homoscedasticity, i.e.the assumption of equal variance among the residuals in our regression model. If the red line is horizontal-ish, then we likely have equal variance.
  4. Residuals vs Leverage plot
    I spare you the details, but basically this is used to detect overly influential data points. We are looking for data points that fall outside “Cook’s distance”, indicated usually by dashed lines. Here, there are none, Cook’s distance lines aren’t even drawn.

16.6.2 The test

\[\small H_O: \text{there is no correlation between x and y}\] \[\small H_A: \text{there is correlation between x and y}\] If the prerequisites are satisfied, you have two options for a test statistic:

  1. You can use Pearson’s correlation coefficient \(\small r_p= \frac{\sum(x-\bar{x})(y-\bar{y})}{\sqrt{\sum (x-\bar{x})^2 \sum(y-\bar{y})^2}}\) and interpret the r-value by looking it up in a table such as http://statisticsunesa.blogspot.com/2010/12/table-of-t-values.html.
  2. The other option would be to compute a t-statistic using the formula
    \[\small t = \frac{r}{\sqrt{1-r^2}} \cdot \sqrt{n-2}\] and use a t-distribution with n-2 degrees of freedom.
x <- geyser$waiting
y <- geyser$duration
r <- cor(x,y,method="pearson")
n <- length(x)
df <- n-2
t <- r/sqrt(1-r^2)*sqrt(n-2)
#> [1] -14.53136
p <- pt(t, df=df,lower.tail=TRUE)*2
#> [1] 1.648496e-36

In R, just use the function cor(x,y,method="pearson") to find the correlation coefficient and the function cor.test(x,y,method="pearson") to test for significance. Let’s do this for the geyser data:

#>  Pearson's product-moment correlation
#> data:  geyser$waiting and geyser$duration
#> t = -14.531, df = 297, p-value < 2.2e-16
#> alternative hypothesis: true correlation is not equal to 0
#> 95 percent confidence interval:
#>  -0.7064014 -0.5730975
#> sample estimates:
#>       cor 
#> -0.644623

You see a very small p-value, so we can accept the alternative hypothesis, the true correlation is not 0. Observe that we also receive a 95% confidence interval for the correlation. But also remember that we violated some of the prerequisites, so these results are not reliable.

16.7 The coefficient of determination \(\small R^2\)

Assume you have data points \((x_i, y_i)\), that \(\bar y\) is the mean of the \(y_i\), and that you have estimates \(f(x_i) = f_i\). Then the coefficient of determination is computed as
\[R^2 = 1-\frac{\sum_i (y_i - f_i)^2}{\sum_i (y_i - \bar y )^2}= 1-\frac{SS_{residuals}}{SS_{total}}\] \(SS\) stands for sums of squares. If you are using linear regression, then \(R^2 = r_{pearson}^2\).

\(R^2\) can be re-written as \[R^2 = \frac{\sum_i (f_i - \bar y)^2}{\sum_i (y_i - \bar y )^2}= \frac{\sum_i (f_i - \bar y)^2 /n}{\sum_i (y_i - \bar y )^2/n} = \frac{SS_{explained}/n}{SS_{total}/n},\] which is the ratio of the explained variance to the total variance.

16.7.1 Another example

Here, I generate linear data with some normally distributed errors. Because I generated the data myself, I know the pre-requisites are met.

x <- seq(0,100, 0.5)
mean <- runif(201,-0.1, 0.1)
e <- rnorm(201,0,0.5 )
y <- 10-x/10 +mean+e
data <- data.frame(x,y)

Here is the code for linear and rank correlation:

glmP <- lm(y~x)
#> Call:
#> lm(formula = y ~ x)
#> Coefficients:
#> (Intercept)            x  
#>     10.0447      -0.1007
#> Call:
#> lm(formula = y ~ x)
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -0.9507 -0.3899 -0.0289  0.3113  1.8677 
#> Coefficients:
#>              Estimate Std. Error t value Pr(>|t|)    
#> (Intercept) 10.044714   0.071545  140.40   <2e-16 ***
#> x           -0.100651   0.001238  -81.33   <2e-16 ***
#> ---
#> Signif. codes:  
#> 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> Residual standard error: 0.5091 on 199 degrees of freedom
#> Multiple R-squared:  0.9708, Adjusted R-squared:  0.9706 
#> F-statistic:  6614 on 1 and 199 DF,  p-value: < 2.2e-16
yr <- rank(y)
xr <- rank(x)
glmS <- lm(yr~xr)
#> Call:
#> lm(formula = yr ~ xr)
#> Coefficients:
#> (Intercept)           xr  
#>    200.6158      -0.9863
#>                  2.5 %      97.5 %
#> (Intercept)  9.9036308 10.18579751
#> x           -0.1030918 -0.09821066
#>                  2.5 %      97.5 %
#> (Intercept) 197.929359 203.3022832
#> xr           -1.009359  -0.9632316

And a few ggplots showing data, regression lines, and residuals.

interceptp <- glmP$coefficients[1]
slopep <- glmP$coefficients[2]
intercepts <- glmS$coefficients[1]
slopes <- glmS$coefficients[2]

p1 <- ggplot(data, aes(x=x, y=y))+
  geom_point() + 
  geom_abline(intercept = interceptp, slope = slopep, color="red",linewidth=1)+
  labs(title=paste("Pearson's correlation",round(cor(x, y, method="pearson"),3)))
p2 <- ggplot(data, aes(x=xr, y=yr))+
  geom_point() + 
  geom_abline(intercept = intercepts, slope = slopes, color="red",linewidth=1)+
  labs(x="Rank x",y="Rank y",title=paste("Spearman's correlation",round(cor(x, y, method="spearman"),3)))

grid.arrange(p1, p2, ncol=2)

p3 <- ggplot(glmP, aes(x, glmP$residuals))+geom_point()+
geom_hline(yintercept=0, color="purple", size=1)+
  labs(x="x",y="Residual",title="Residual plot")
#> Warning: Using `size` aesthetic for lines was deprecated in ggplot2
#> 3.4.0.
#> ℹ Please use `linewidth` instead.
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_lifecycle_warnings()` to see where
#> this warning was generated.

p4 <- ggplot(glmP, aes(sample=glmP$residuals))+stat_qq(distribution = qnorm, dparams=list(0,sd(glmP$residuals)))+
  geom_abline(intercept=0, slope=1,color="red")+
  labs(title="Q-Q plot of residuals, Normal distribution")

p5 <- ggplot(glmP, aes(x=glmP$fitted.values, y=glmP$residuals))+geom_point()+
  geom_hline(yintercept=0, color="purple", size=1)+
  labs(title="residuals vs fits plot ",x="fits",y="residuals")
grid.arrange(p3,p4,p5, ncol=3)

Finally we look at the diagnostic plots. Note that even here, the lines are not perfectly straight and the tails are too fat (see Q-Q plot), even though the residuals were generated from a normal distribution.

The correlation test reveals that there is indeed a negative correlation between x and y, estimated to be contained in the interval [-0.9883265 -0.9797099] with 95% confidence.
#>  Pearson's product-moment correlation
#> data:  x and y
#> t = -81.325, df = 199, p-value < 2.2e-16
#> alternative hypothesis: true correlation is not equal to 0
#> 95 percent confidence interval:
#>  -0.9888441 -0.9806059
#> sample estimates:
#>        cor 
#> -0.9852867

Computing \(\small R^2\) tells us that almost 97% of the variance or variation iny is explained by the linear regression model.

rsquare <- cor(x,y)^2
#> [1] 0.9707899

16.8 Non-linear regression - a few remarks

To use a non-linear model, you use nls, non-linear least squares. You have to / get to specify the formula you think works, the data frame to use, and starting values for the parameters in the model. In the example below, I used data I generated, but I am not going to show you how.

The plot shows that the data is some sort of parabola, so it looks like a quadratic model might be worth trying.

We start by plotting the data. There is no need to pretty up the plot at this point.

ggplot(data, aes(x=x, y=y))+geom_point()

The data looks quadratic, so we will try a quadratic model. You have to enter the general quadratic formula \(\small ax^2+bx+c\) and some starting values.

model <- nls(y ~ a*x^2 + b*x + c, data=data, start=list(a=1, b=1, c=0.1))
#> Nonlinear regression model
#>   model: y ~ a * x^2 + b * x + c
#>    data: data
#>        a        b        c 
#>  1.06392 -0.04993 -0.06709 
#>  residual sum-of-squares: 4.694
#> Number of iterations to convergence: 1 
#> Achieved convergence tolerance: 3.861e-08

Next, we extract the parameters \(\small a, b, c\) and plot the data together with the estimated function.

a <- coef(model)[1]
b <- coef(model)[2]
c <- coef(model)[3]
#option 1 computing the estimated values directly
y_est <- a*x^2 + b*x + c
lines(x, y_est, col="red")
#option 2 defining a function for estimated values
qfit <- function(x, a, b, c) {a*x^2+b*x+c} 
plot(x, y)
ymod <- qfit(x, a=coef(model)[1], b=coef(model)[2], c=coef(model)[3])
lines(x,ymod, type="l", col="steelblue") 

Both options give us the same graph, use whichever you prefer. We see that there are at least three outliers, at \(\small x=-2,-1.7, 0\).

Finally, I plot the residuals and test for normality.

abline(h=0, col="red")
#>  Anderson-Darling normality test
#> data:  residuals(model)
#> A = 4.611, p-value = 1.212e-11

16.9 Assignment

Use the regression data file on Plato and perform a complete regression analysis.