6 Regression
Regression analysis will likely be the main focal point of the statistical work you do in R, so this chapter covers the model-fitting functions you will use most often.
6.1 Linear Models
The basic linear regression function in R is lm, which performs OLS regression. The lm function, like most model functions, requires two elements to get started: a formula and data. The formula is written in the form you are used to for OLS, so that \(y_i = x_i \beta + \epsilon_i\) is expressed as y ~ x. Data is supplied with the data argument (e.g., data = mtcars) and is usually a data frame. Let's look at an example.
data("mtcars")
model <- lm(mpg ~ disp, data = mtcars)
summary(model)
##
## Call:
## lm(formula = mpg ~ disp, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.8922 -2.2022 -0.9631 1.6272 7.2305
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 29.599855 1.229720 24.070 < 2e-16 ***
## disp -0.041215 0.004712 -8.747 9.38e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.251 on 30 degrees of freedom
## Multiple R-squared: 0.7183, Adjusted R-squared: 0.709
## F-statistic: 76.51 on 1 and 30 DF, p-value: 9.38e-10
plot(mtcars$disp, mtcars$mpg)
abline(coef(model), col = "blue")
In a multivariate situation we cannot show the regression fit on a single two-dimensional plot. However, we can use the avPlots function from the car package to view added-variable (partial regression) plots, which show the marginal relationship for each explanatory variable.
model2 <- lm(mpg ~ disp + cyl + wt + drat, data = mtcars)
summary(model2)
##
## Call:
## lm(formula = mpg ~ disp + cyl + wt + drat, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.4067 -1.4096 -0.4954 1.3346 6.0729
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 41.160271 7.304325 5.635 5.56e-06 ***
## disp 0.007472 0.012062 0.619 0.54080
## cyl -1.786074 0.634821 -2.814 0.00903 **
## wt -3.638075 1.102460 -3.300 0.00272 **
## drat -0.010492 1.337929 -0.008 0.99380
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.642 on 27 degrees of freedom
## Multiple R-squared: 0.8326, Adjusted R-squared: 0.8078
## F-statistic: 33.57 on 4 and 27 DF, p-value: 4.057e-10
library(car)
avPlots(model2)
6.2 Logistic Regression
To perform logistic regression in R we will focus on the generalized linear model function, which follows from the class of models by the same name. This function is imaginatively named glm and takes the same form as the OLS function lm. In this example we will model a binary outcome variable using the logit link function.
mtcars.bin <- mtcars
mtcars.bin$mpg <- 1*(mtcars.bin$mpg > 21)

model3 <- glm(mpg ~ disp + cyl + wt + drat, data = mtcars.bin,
              family = binomial(link = "logit"))
summary(model3)
##
## Call:
## glm(formula = mpg ~ disp + cyl + wt + drat, family = binomial(link = "logit"),
## data = mtcars.bin)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.866e-05 -2.110e-08 -2.110e-08 2.110e-08 2.292e-05
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.469e+02 1.544e+06 0.000 1
## disp 9.035e-01 1.490e+03 0.001 1
## cyl -7.122e+01 2.026e+05 0.000 1
## wt -4.885e+01 4.645e+05 0.000 1
## drat 8.582e+00 3.189e+05 0.000 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4.2340e+01 on 31 degrees of freedom
## Residual deviance: 1.1643e-09 on 27 degrees of freedom
## AIC: 10
##
## Number of Fisher Scoring iterations: 25
Note the enormous standard errors and near-zero residual deviance above: they indicate complete separation, meaning the four predictors classify this small binary outcome perfectly. We can still generate predictions and build a confusion matrix.
pred3.bin <- data.frame(
  mpg = mtcars.bin$mpg,
  mpg.pred = predict(model3, newdata = mtcars.bin, type = "response")
)
pred3.bin$mpg.bin <- 1*(pred3.bin$mpg.pred > 0.5)
library(caret)
confusionMatrix(factor(pred3.bin$mpg), factor(pred3.bin$mpg.bin))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 20 0
## 1 0 12
##
## Accuracy : 1
## 95% CI : (0.8911, 1)
## No Information Rate : 0.625
## P-Value [Acc > NIR] : 2.939e-07
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Sensitivity : 1.000
## Specificity : 1.000
## Pos Pred Value : 1.000
## Neg Pred Value : 1.000
## Prevalence : 0.625
## Detection Rate : 0.625
## Detection Prevalence : 0.625
## Balanced Accuracy : 1.000
##
## 'Positive' Class : 0
##
We can change our glm to match the analysis we wish to perform by changing the family of the model. The family specifies the error distribution of the outcome, and each family has an associated link function that R uses when fitting the regression. Common choices include:
binomial(link="logit")
gaussian(link="identity")
Gamma(link="inverse")
poisson(link="log")
quasi(link="identity", variance="constant")
For example, a Poisson model for a count outcome only requires swapping the family argument, as sketched below.
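This is only a sketch and not part of the example above: carb (the number of carburetors in mtcars) stands in as the count outcome purely to illustrate the syntax.
# sketch: Poisson regression on a count variable, chosen only to show the family argument
model.pois <- glm(carb ~ disp + wt, data = mtcars,
                  family = poisson(link = "log"))
summary(model.pois)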
6.3 Hypothesis Tests
One-sample t-test
# some random data
x <- rnorm(100, 1, 0.5)
# t-test
t.test(x, mu=1.2)
##
## One Sample t-test
##
## data: x
## t = -4.2235, df = 99, p-value = 5.35e-05
## alternative hypothesis: true mean is not equal to 1.2
## 95 percent confidence interval:
## 0.9004436 1.0919413
## sample estimates:
## mean of x
## 0.9961925
# directional
t.test(x, mu=1.2, alternative = "greater")
##
## One Sample t-test
##
## data: x
## t = -4.2235, df = 99, p-value = 1
## alternative hypothesis: true mean is greater than 1.2
## 95 percent confidence interval:
## 0.9160699 Inf
## sample estimates:
## mean of x
## 0.9961925
Two-sample t-test
# some random data
x <- rnorm(100, 1, 0.5)
y1 <- rnorm(100, 0, 1.2)
y2 <- rnorm(100, 1, 0.5)
# note difference in p-value
t.test(x, y1)
##
## Welch Two Sample t-test
##
## data: x and y1
## t = 9.185, df = 134.41, p-value = 6.618e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.8861011 1.3724189
## sample estimates:
## mean of x mean of y
## 1.0044594 -0.1248007
t.test(x, y2)
##
## Welch Two Sample t-test
##
## data: x and y2
## t = 0.081147, df = 197.97, p-value = 0.9354
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.1290327 0.1401076
## sample estimates:
## mean of x mean of y
## 1.0044594 0.9989219
Other tests:
- Wilcoxon rank-sum test (wilcox.test) - a non-parametric alternative to the two-sample t-test, used when the data do not meet the assumptions of a parametric test
- Correlation test (cor.test) - used when you want to test whether the correlation between two variables differs from zero
# some random data
x <- rnorm(100, 1, 0.5)
y <- rnorm(100, 0, 1.2)
wilcox.test(x, y)
##
## Wilcoxon rank sum test with continuity correction
##
## data: x and y
## W = 7917, p-value = 1.032e-12
## alternative hypothesis: true location shift is not equal to 0
cor.test(x, y)
##
## Pearson's product-moment correlation
##
## data: x and y
## t = -0.4892, df = 98, p-value = 0.6258
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2434145 0.1485016
## sample estimates:
## cor
## -0.04935612
6.4 Nonlinear Models
The built-in function nls is used to find parameter values of non-linear models in R. Non-linear models take the generic form \(y = f(x, \beta)\), where \(y\) is the outcome variable, \(x\) represents a set of explanatory variables, and \(\beta\) represents the parameters of the model, such that \(f(x, \beta)\) cannot be expressed as a linear combination of the \(\beta\). The example below is adapted from RPubs - https://rpubs.com/RobinLovelace/nls-function
len <- 24
x <- runif(len)
y <- x^3 + runif(len, min = -0.1, max = 0.1)
plot(x, y)

s <- seq(from = 0, to = 1, length = 50)
lines(s, s^3, lty = 2)

df <- data.frame(x, y)
m <- nls(y ~ I(x^power), data = df, start = list(power = 1), trace = T)
## 1.830568 (2.11e+00): par = (1)
## 0.4111573 (1.69e+00): par = (1.884938)
## 0.1150580 (5.84e-01): par = (2.832978)
## 0.08663172 (5.39e-02): par = (3.369605)
## 0.08639049 (1.44e-03): par = (3.432622)
## 0.08639032 (5.82e-05): par = (3.4309)
## 0.08639032 (2.33e-06): par = (3.43097)
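To inspect the result, one could (as a minimal sketch, assuming the scatterplot created above is still the active graphics device) overlay the fitted curve and print the estimated exponent:
lines(s, predict(m, newdata = list(x = s)), col = "red")  # fitted power curve
summary(m)  # estimated value of power and its standard error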
6.5 Other Model Types
There are obviously hundreds of model types available in R to fit the analyses you need for your research. There are too many to list here, but I will give a few highlights to point you in the right direction for commonly used model types in the social sciences.
Mixed and Random Effects Models - using the lme4 package you can fit mixed and random effects models in R. The main function for this is lmer. A great resource for this can be found here: https://m-clark.github.io/mixed-models-with-R/ .
Ordinal data - if your outcome has an ordered categorical structure, the ordinal package (via its clm function) will perform ordinal regression and can be used to analyze this type of data.
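A minimal sketch, using the wine data bundled with the ordinal package rather than data from this chapter:
library(ordinal)
data("wine")
# cumulative link (ordinal logistic) model for the ordered factor rating
ordinal.model <- clm(rating ~ temp + contact, data = wine)
summary(ordinal.model)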