Chapter 5 Exploring and Modeling Data
The purpose of this chapter is to expose you to several options for meaning-making with numbers beyond descriptive statistics. More specifically, you will be asked to apply your previous knowledge regarding descriptive statistics, data visualization, and data manipulation in the pursuit of the scientific method through statistical modeling.
Many interesting research questions are addressed with descriptive statistics. For instance, the census might ask the question: How many people live in the United States? Analyses of relationships between and within variables tend to answer more nuanced research questions—for instance, what is the relationship between median household income and educational attainment, or even, what are the most influential predictors of happiness?
5.1 Quick Review
First, let’s review some of the things we’ve learned in this text so far. To get started, install and/or load the following packages in RStudio.
library(tidyverse)
library(SDT)
library(DataExplorer)
library(psych)
library(rstatix)
library(olsrr)
library(car)
We will be using a built-in dataset from the SDT package that contains learning motivation data based on Self-Determination Theory. Call the dataset with the object name learning_motivation and assign it to the object data with the operator <-. See the code below.
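Here is a minimal sketch of that step, assuming the SDT package exposes the dataset under the name learning_motivation as described above.
# Load the built-in dataset and assign it to the object 'data'
data <- learning_motivation
head(data)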
To get a quick look at many facets of a dataset in one line of code, try DataExplorer. DataExplorer includes an automated data exploration process for analytic tasks and predictive modeling. Simply run the function create_report() on a data frame to produce multiple plots and tables related to descriptive and correlational analyses. Use the summary(), str(), and psych::describe() functions to examine the variables in the dataset and assess them for normality.
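For example, here is a quick sketch of these exploration calls in one chunk (create_report() writes an HTML report to your working directory).
# Automated exploration report from DataExplorer (produces an HTML file)
create_report(data)

# Quick numeric and structural summaries
summary(data)
str(data)
psych::describe(data)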
For some more quick review, arrange the dataset by age in descending order. Then examine the first and last six rows of the dataset.
data <- data %>% arrange(desc(age))
head(data)
## sex age intrinsic identified introjected external
## 1 2 22 1.4 3.00 3.00 3.50
## 2 1 19 4.6 2.75 3.50 2.25
## 3 1 19 1.4 2.25 1.00 2.00
## 4 2 19 3.0 3.50 3.75 4.00
## 5 1 19 2.6 3.50 3.25 2.75
## 6 1 19 4.8 4.25 1.75 1.50
tail(data)
## sex age intrinsic identified introjected external
## 1145 2 10 3.6 4.75 3.50 4.25
## 1146 1 10 3.2 2.75 3.25 2.75
## 1147 1 10 3.6 4.25 3.00 3.50
## 1148 1 10 3.4 3.25 3.50 2.75
## 1149 2 10 3.2 3.00 3.25 2.75
## 1150 2 10 3.6 3.50 3.25 4.50
Create a new column based on values from other columns without opening and saving the data file.
data <- data %>% mutate(internalizedmotivation = intrinsic + identified)
data <- data %>% mutate(internalizedmotivationlog = log10(intrinsic + identified))
head(data)
## sex age intrinsic identified introjected external internalizedmotivation
## 1 2 22 1.4 3.00 3.00 3.50 4.40
## 2 1 19 4.6 2.75 3.50 2.25 7.35
## 3 1 19 1.4 2.25 1.00 2.00 3.65
## 4 2 19 3.0 3.50 3.75 4.00 6.50
## 5 1 19 2.6 3.50 3.25 2.75 6.10
## 6 1 19 4.8 4.25 1.75 1.50 9.05
## internalizedmotivationlog
## 1 0.6434527
## 2 0.8662873
## 3 0.5622929
## 4 0.8129134
## 5 0.7853298
## 6 0.9566486
psych::describe(data)
## vars n mean sd median trimmed mad min max
## sex 1 1150 1.50 0.50 1.00 1.50 0.00 1.0 2
## age 2 1150 14.09 1.85 14.00 13.98 1.48 10.0 22
## intrinsic 3 1150 3.13 1.09 3.20 3.15 1.19 1.0 5
## identified 4 1150 3.29 1.01 3.25 3.34 1.11 1.0 5
## introjected 5 1150 2.75 1.04 2.75 2.73 1.11 1.0 5
## external 6 1150 2.94 0.94 3.00 2.95 1.11 1.0 5
## internalizedmotivation 7 1150 6.42 1.80 6.55 6.47 1.82 2.0 10
## internalizedmotivationlog 8 1150 0.79 0.14 0.82 0.80 0.12 0.3 1
## range skew kurtosis se
## sex 1.0 0.01 -2.00 0.01
## age 12.0 0.51 -0.04 0.05
## intrinsic 4.0 -0.15 -0.75 0.03
## identified 4.0 -0.39 -0.49 0.03
## introjected 4.0 0.12 -0.80 0.03
## external 4.0 -0.10 -0.49 0.03
## internalizedmotivation 8.0 -0.27 -0.39 0.05
## internalizedmotivationlog 0.7 -1.17 1.58 0.00
Select rows of a data frame based on a certain condition. Use the
dim()
function to ascertain the dimensions of the resultant dataset. If
1 represents males and 2 represents females, how many males are 18 or
older in the dataset?
malesover18 <- data %>% filter(age >= 18, sex == 1)
dim(malesover18)
## [1] 34 8
Select specific columns of a data frame using the select() function, or select all columns except one using a minus sign before the column name (e.g., select(-intrinsic)).
selected_data <- data %>% select(sex, age, intrinsic, external)
head(selected_data)
## sex age intrinsic external
## 1 2 22 1.4 3.50
## 2 1 19 4.6 2.25
## 3 1 19 1.4 2.00
## 4 2 19 3.0 4.00
## 5 1 19 2.6 2.75
## 6 1 19 4.8 1.50
Use transformations as an opportunity to label your data categorically.
newdata <- data %>% mutate(preteen = age <= 12)
head(newdata)
## sex age intrinsic identified introjected external internalizedmotivation
## 1 2 22 1.4 3.00 3.00 3.50 4.40
## 2 1 19 4.6 2.75 3.50 2.25 7.35
## 3 1 19 1.4 2.25 1.00 2.00 3.65
## 4 2 19 3.0 3.50 3.75 4.00 6.50
## 5 1 19 2.6 3.50 3.25 2.75 6.10
## 6 1 19 4.8 4.25 1.75 1.50 9.05
## internalizedmotivationlog preteen
## 1 0.6434527 FALSE
## 2 0.8662873 FALSE
## 3 0.5622929 FALSE
## 4 0.8129134 FALSE
## 5 0.7853298 FALSE
## 6 0.9566486 FALSE
Save your edited data frame to a .csv file.
write_csv(data, "sdtedit.csv")
5.2 Correlational Models
Next we will review and explore correlational models, which were covered conceptually in detail in the previous chapter. Most correlation coefficients are interpreted as a number between -1 and 1 that describes the strength of the relationship between two variables. Even if your final goal is regression or an inferential statistical test, correlations are useful to examine during exploratory data analysis before trying to fit a predictive model. Here are several correlation tasks using the rstatix library.
# Correlation test between two variables
data %>% cor_test(age, intrinsic, method = "pearson")
## # A tibble: 1 × 8
## var1 var2 cor statistic p conf.low conf.high method
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 age intrinsic -0.16 -5.61 0.0000000261 -0.219 -0.106 Pearson
# Correlation of one variable against all
data %>% cor_test(intrinsic, method = "pearson")
## # A tibble: 7 × 8
## var1 var2 cor statistic p conf.low conf.high method
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 intrinsic sex 0.22 7.60 6.03e- 14 0.163 0.273 Pears…
## 2 intrinsic age -0.16 -5.61 2.61e- 8 -0.219 -0.106 Pears…
## 3 intrinsic identified 0.48 18.4 2.61e- 66 0.431 0.520 Pears…
## 4 intrinsic introjected 0.29 10.4 2.41e- 24 0.240 0.346 Pears…
## 5 intrinsic external -0.026 -0.881 3.78e- 1 -0.0837 0.0319 Pears…
## 6 intrinsic internalizedmo… 0.87 60.3 0 0.857 0.885 Pears…
## 7 intrinsic internalizedmo… 0.84 52.7 6.05e-309 0.824 0.857 Pears…
# Pairwise correlation test between all variables
data %>% cor_test(method = "pearson")
## # A tibble: 64 × 8
## var1 var2 cor statistic p conf.low conf.high method
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 sex sex 1 Inf 0 1 1 Pears…
## 2 sex age 0.027 0.899 3.69e- 1 -0.0313 0.0842 Pears…
## 3 sex intrinsic 0.22 7.60 6.03e-14 0.163 0.273 Pears…
## 4 sex identified 0.13 4.60 4.62e- 6 0.0774 0.191 Pears…
## 5 sex introjected 0.039 1.33 1.82e- 1 -0.0185 0.0969 Pears…
## 6 sex external 0.0094 0.317 7.51e- 1 -0.0485 0.0671 Pears…
## 7 sex internalizedmotiva… 0.21 7.19 1.13e-12 0.152 0.262 Pears…
## 8 sex internalizedmotiva… 0.2 6.96 5.78e-12 0.145 0.256 Pears…
## 9 age sex 0.027 0.899 3.69e- 1 -0.0313 0.0842 Pears…
## 10 age age 1 Inf 0 1 1 Pears…
## # … with 54 more rows
# Compute correlation matrix
cor.mat <- data %>% cor_mat()
cor.mat
## # A tibble: 8 × 9
## rowname sex age intrinsic identified introjected external
## * <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 sex 1 0.027 0.22 0.13 0.039 0.0094
## 2 age 0.027 1 -0.16 -0.16 -0.13 -0.074
## 3 intrinsic 0.22 -0.16 1 0.48 0.29 -0.026
## 4 identified 0.13 -0.16 0.48 1 0.46 0.29
## 5 introjected 0.039 -0.13 0.29 0.46 1 0.5
## 6 external 0.0094 -0.074 -0.026 0.29 0.5 1
## 7 internalizedmotivation 0.21 -0.19 0.87 0.85 0.44 0.14
## 8 internalizedmotivatio… 0.2 -0.16 0.84 0.84 0.43 0.15
## # … with 2 more variables: internalizedmotivation <dbl>,
## # internalizedmotivationlog <dbl>
# Show the significance levels
cor.mat %>% cor_get_pval()
## # A tibble: 8 × 9
## rowname sex age intrinsic identified introjected external
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 sex 0 3.69e- 1 6.03e- 14 4.62e- 6 1.82e- 1 7.51e- 1
## 2 age 3.69e- 1 0 2.61e- 8 2.16e- 8 5.05e- 6 1.2 e- 2
## 3 intrinsic 6.03e-14 2.61e- 8 0 2.61e- 66 2.41e-24 3.78e- 1
## 4 identified 4.62e- 6 2.16e- 8 2.61e- 66 0 1.26e-61 4.26e-23
## 5 introjected 1.82e- 1 5.05e- 6 2.41e- 24 1.26e- 61 0 1.03e-73
## 6 external 7.51e- 1 1.2 e- 2 3.78e- 1 4.26e- 23 1.03e-73 0
## 7 internalizedmotiv… 1.13e-12 7.54e-11 0 2.08e-316 2.47e-54 9.85e- 7
## 8 internalizedmotiv… 5.78e-12 3.57e- 8 6.05e-309 1.90e-300 6.78e-52 5.24e- 7
## # … with 2 more variables: internalizedmotivation <dbl>,
## # internalizedmotivationlog <dbl>
# Replacing correlation coefficients by symbols
cor.mat %>%
  cor_as_symbols() %>%
  pull_lower_triangle()
## rowname sex age intrinsic identified introjected external
## 1 sex
## 2 age
## 3 intrinsic
## 4 identified .
## 5 introjected . .
## 6 external . .
## 7 internalizedmotivation * * .
## 8 internalizedmotivationlog * * .
## internalizedmotivation internalizedmotivationlog
## 1
## 2
## 3
## 4
## 5
## 6
## 7
## 8 *
# Mark significant correlations
cor.mat %>%
  cor_mark_significant()
## rowname sex age intrinsic identified introjected
## 1 sex
## 2 age 0.027
## 3 intrinsic 0.22**** -0.16****
## 4 identified 0.13**** -0.16**** 0.48****
## 5 introjected 0.039 -0.13**** 0.29**** 0.46****
## 6 external 0.0094 -0.074* -0.026 0.29**** 0.5****
## 7 internalizedmotivation 0.21**** -0.19**** 0.87**** 0.85**** 0.44****
## 8 internalizedmotivationlog 0.2**** -0.16**** 0.84**** 0.84**** 0.43****
## external internalizedmotivation internalizedmotivationlog
## 1
## 2
## 3
## 4
## 5
## 6
## 7 0.14****
## 8 0.15**** 0.98****
# Draw correlogram using R base plot
cor.mat %>%
  cor_reorder() %>%
  pull_lower_triangle() %>%
  cor_plot()
5.3 Simple and Multiple Regression
Researchers use either bivariate regression (simple—one dependent and one independent variable) or multiple regression (one dependent and more than one independent variable) as even more nuanced models of influence between two or more variables in a dataset. The dependent variable in each of these scenarios is often called the criterion, outcome, or response variable, while the independent variables are often called predictor or explanatory variables.
Regression is used to find relationships between variables—similar to correlation analysis—however, there are three major differences between correlation and regression.
Test Purpose
“While correlation analyses are used to determine relationships and the strength of relationships between variables, researchers use regression analysis to predict, based on information from another variable or set of variables, the outcome of one variable. Otherwise, researchers can use regression analysis as a tool to explain why, based on any given predictor variables, differences exist between participants” (Russell, 2018, p. 193).
Variable Names
There must be a dependent variable in regression that a researcher is trying to predict based on other predictor (independent) variables. This concept is not used in correlation methods.
Regression Coefficients and Slope Intercepts
Researchers can use regression coefficients to infer beyond the direction and magnitude of relationships and discuss changes in Y due to specific changes in X (e.g., for every increase of 1 unit in X, Y increases by .5).
The unstandardized regression formula is Y = a + bX, where a is the intercept (where the line of best fit crosses the y-axis) and b is the slope of the regression line (rise over run; a slope of ½ means up 1 for every 2 over). Put otherwise, a is the value of Y when X = 0 and b is the regression coefficient. The null hypothesis of the regression model is that the independent variables are not able to significantly predict the dependent variable (i.e., the slope or standardized beta weight is 0, and the predictor variable has no impact on the dependent variable).
Here’s how you perform simple linear regression in R. This example uses the mtcars dataset which describes several measures of a set of cars. Let’s model quarter mile time (qsec) as predicted by horsepower then examine the intercepts and coefficients before visualizing the relationship in a scatter plot.
The following code represents the unstandardized regression formula
using the function lm()
assigned to the object mod
mod <- lm(qsec ~ hp, data = mtcars)
Use summary(mod) to display the statistical output. See the asterisks for the level of significance, and the coefficient labelled ‘Estimate’ as a description of the extent to which the target variable should change as the predictor variable increases by 1 unit. The r-squared, F-statistic, degrees of freedom, and p-value describe model fit and parsimony, and they provide important context for the interpretation of regression coefficients.
summary(mod)
##
## Call:
## lm(formula = qsec ~ hp, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.1766 -0.6975 0.0348 0.6520 4.0972
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 20.556354 0.542424 37.897 < 2e-16 ***
## hp -0.018458 0.003359 -5.495 5.77e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.282 on 30 degrees of freedom
## Multiple R-squared: 0.5016, Adjusted R-squared: 0.485
## F-statistic: 30.19 on 1 and 30 DF, p-value: 5.766e-06
It’s helpful to store the model coefficients as objects to be referred to in plotting procedures.
beta0 <- coef(mod)[1]
beta1 <- coef(mod)[2]
Use the geom_point()
and geom_abline()
functions to display a
scatterplot with a line of best fit.
ggplot(mtcars, aes(x = hp, y = qsec)) +
geom_point() +
geom_abline(intercept = beta0, slope = beta1, colour='red') +
ggtitle("Quarter Mile Time as a function of Horsepower") +
xlab('Horsepower') +
ylab('Quarter Mile Time')
Practically, these findings suggest that horsepower is a significant predictor of quarter mile time. More specifically, the model output tells us that quarter mile time decreases by about two hundredths of a second (.018) for each unit increase of horsepower in a given car. While statistically significant, this may seem practically insignificant. However, an increase of 100 horsepower tends to predict a loss of 1.8 seconds in quarter mile time, and a change of 300 horsepower tends to predict a loss of 5.4 seconds (which is good when the goal is speed). Is horsepower acting as a proxy for another variable? Put otherwise, would another variable (or set of variables) serve as a better predictor of quarter mile time than hp?
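To see these predicted changes directly, one option is to pass hypothetical horsepower values to the fitted model with predict(); the values chosen below are purely illustrative.
# Predicted quarter mile times at a few illustrative horsepower values
predict(mod, newdata = data.frame(hp = c(100, 200, 300, 400)))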
Checking for Regression Assumptions
The assumptions of regression are as follows:
1. Linearity—relationship between dependent and independent variables is linear.
2. Homoscedasticity—the variance remains equal along the entire slope of the data
3. Normal distribution for variables included in the model.
4. Number of variables relative to sample size—a common rule of thumb is 10-20 participants per variable; others argue that as few as 5 per variable can work.
5. No multicollinearity—no two independent variables should be highly correlated as they would then be redundant in the model (no new information to interpret).
Let’s assess each of these assumptions for our model of quarter mile time as a function of horsepower.
Are the data linear? Use a quickplot to visually assess linearity.
plot(qsec ~ hp, data = mtcars, main = 'Quarter Mile time by Horsepower') # plots raw data
Are the data normally distributed? Use the plot()
function to display
output relevant to several assumptions.
plot(mod)
The residuals vs. fitted plot is used to check the linearity of the relationship; a horizontal line without distinct patterns is an indication that a linear relationship is present. The normal Q-Q plot is used to examine whether the residuals are normally distributed; if they are, the points will fall along a straight line. The scale-location plot is used to check the homogeneity of variance of the residuals; a horizontal line with equally spread points is a good indication of homogeneity of variance. The residuals vs. leverage plot is used to identify potential outliers that are influencing regression results. The red lines are curves of constant Cook’s distance, another metric for determining the influence of a point that combines leverage and residual size.
You do not want to see a pattern in the residuals when ordered by the independent variable, as “the assumption of homoscedasticity (meaning ‘same variance’) is central to linear regression models. Homoscedasticity describes a situation in which the error term (that is, the ‘noise’ or random disturbance in the relationship between the independent variables and the dependent variable) is the same across all values of the independent variables. Heteroscedasticity (the violation of homoscedasticity) is present when the size of the error term differs across values of an independent variable. The impact of violating the assumption of homoscedasticity is a matter of degree, increasing as heteroscedasticity increases” (https://www.statisticssolutions.com/homoscedasticity/).
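As a complement to the visual check, a formal test of non-constant error variance is available in the car package loaded earlier; this is an optional sketch rather than a required step.
# Breusch-Pagan-style test of non-constant variance for the fitted model
ncvTest(mod)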
You can also check the multicollinearity assumption using variance inflation factors with the command vif(). Variance inflation factors greater than 5 or 10 (depending on the area of research) are often taken as indicative of a multicollinearity problem. This is useful when constructing models with multiple predictor variables—ensuring that two predictors are not too highly correlated and predicting the same variance in the outcome variable.
mod <- lm(qsec ~ hp + cyl, data = mtcars)
vif(mod)
## GVIF Df GVIF^(1/(2*Df))
## hp 3.494957 1 1.869480
## cyl 3.494957 2 1.367289
5.3.1 Multiple Regression
To estimate multiple regression equations, simply add additional predictors. This allows you to account for potential confounds and, given sufficient internal/external validity and appropriate sampling procedures, strengthen evidence for causal claims. In the example below, quarter mile time is modeled as a function of horsepower and engine displacement (disp). The adjusted r-squared value tells us that about 51% of the variance in quarter mile time was explained by horsepower and displacement together.
mod2 <- lm(qsec ~ hp + disp, data = mtcars)
summary(mod2)
##
## Call:
## lm(formula = qsec ~ hp + disp, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.0266 -0.9156 0.1675 0.6109 4.1752
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 20.454114 0.531156 38.509 < 2e-16 ***
## hp -0.025422 0.005339 -4.761 4.93e-05 ***
## disp 0.004870 0.002954 1.649 0.11
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.247 on 29 degrees of freedom
## Multiple R-squared: 0.5443, Adjusted R-squared: 0.5129
## F-statistic: 17.32 on 2 and 29 DF, p-value: 1.124e-05
Let’s model quarter mile time as a function of mpg, number of cylinders, horsepower, and automatic/manual transmission.
mod2 <- lm(qsec ~ mpg + cyl + hp + am, data = mtcars)
summary(mod2)
##
## Call:
## lm(formula = qsec ~ mpg + cyl + hp + am, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.7437 -0.6751 0.1015 0.5319 2.1426
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 20.426987 1.772405 11.525 1.02e-11 ***
## mpg -0.012876 0.066295 -0.194 0.847512
## cyl.L -1.951811 0.631743 -3.090 0.004730 **
## cyl.Q 0.173954 0.358954 0.485 0.632009
## hp -0.008707 0.005815 -1.497 0.146337
## ammanual -2.307268 0.513173 -4.496 0.000127 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.931 on 26 degrees of freedom
## Multiple R-squared: 0.7724, Adjusted R-squared: 0.7286
## F-statistic: 17.64 on 5 and 26 DF, p-value: 1.249e-07
This model seems to indicate that the number of cylinders and type of transmission (am) are the strongest predictors of quarter mile time (about -2.31 seconds for a manual rather than an automatic transmission, along with a significant linear effect of cylinders). These variables collectively account for a larger portion of variance given the adjusted r-squared (used with multiple predictors) of .73.
We can also use the olsrr library to compare all possible subsets of predictors and identify the best fitting model. Insert a period in place of predictor names to include all other variables in the dataset.
model <- lm(qsec ~ ., data = mtcars)
modelchoice <- ols_step_all_possible(model) %>% arrange(desc(adjr)) %>% filter(n <= 4)
head(modelchoice)
## Index N Predictors R-Square Adj. R-Square Mallow's Cp
## 1 177 4 hp wt gear wtqsec_ratio 0.9735351 0.9684457 12.20563
## 2 176 4 cyl wt carb wtqsec_ratio 0.9750342 0.9648209 10.26807
## 3 180 4 mpg hp wt wtqsec_ratio 0.9661794 0.9611690 21.71273
## 4 179 4 cyl hp wt wtqsec_ratio 0.9669874 0.9606388 20.66849
## 5 182 4 hp wt vs wtqsec_ratio 0.9648571 0.9596508 23.42181
## 6 56 3 hp wt wtqsec_ratio 0.9635221 0.9596137 23.14730
model2 <- lm(qsec ~ disp + wt + vs + carb, data = mtcars)
summary(model2)
##
## Call:
## lm(formula = qsec ~ disp + wt + vs + carb, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.25620 -0.51160 -0.05549 0.37667 2.53603
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.000767 0.734959 17.689 6.90e-15 ***
## disp -0.013590 0.003696 -3.677 0.001249 **
## wt 2.099573 0.399921 5.250 2.51e-05 ***
## vsstraight 1.319133 0.588394 2.242 0.034906 *
## carb.L -3.180956 0.697400 -4.561 0.000139 ***
## carb.Q -0.022039 0.586603 -0.038 0.970355
## carb.C 0.243124 0.571370 0.426 0.674420
## carb^4 0.006197 0.563858 0.011 0.991327
## carb^5 0.246119 0.495808 0.496 0.624323
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7901 on 23 degrees of freedom
## Multiple R-squared: 0.8549, Adjusted R-squared: 0.8045
## F-statistic: 16.95 on 8 and 23 DF, p-value: 6.049e-08
After searching for the best model with the fewest number of predictors, it can be determined that displacement, weight, engine style (v-shaped vs. straight), and number of carburetors explain roughly the same amount of variance in quarter mile time (adjusted r-squared = .80) as models drawing on the full set of predictors.
plot(residuals(mod2))
You can store the elements of the summary such as r-squared.
summ_mod <- summary(mod)
summ_mod$residuals
summ_mod$r.squared
summ_mod$adj.r.squared
summ_mod$coefficients
Generalized linear models are fairly common. Logistic regression is used when the response variable is binary (e.g., did you vote, yes or no; do you have diabetes, yes or no). Poisson regression is used when the response variable is a count (i.e., numeric but not continuous; e.g., how many heart attacks have you had, how many days from outbreak to infection). Fitting such models in R is very similar to how one would fit a linear model, e.g., for Poisson regression, glm(counts ~ outcome + treatment, data, family = poisson()).
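As a minimal sketch of the binary case, we might treat transmission type (am) in mtcars as the yes/no outcome; the choice of predictors here is purely illustrative.
# Logistic regression: modeling the odds of a manual transmission from weight and horsepower
logit_mod <- glm(am ~ wt + hp, data = mtcars, family = binomial())
summary(logit_mod)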
5.4 Inferential Statistics
5.4.1 Hypothesis Testing
A hypothesis is a proposed solution to a problem. Hypothesis testing is a means to make a rational (logical) decision based on observed data. In music education, hypotheses tend to have an underlying goal of increased student success (what can success mean here?). After formulating some hypothesis (e.g., there is no difference in performance ability between 6th and 7th grade band students), we collect and use data to reject or fail to reject the hypothesis (the null hypothesis in this case).
A null hypothesis \(H_0\): a statement that represents the phenomenon you may want to provide evidence against (e.g., two means are the same where group difference is expected). An alternative hypothesis \(H_1\): represents the opposite of the null hypothesis (e.g., two group means are statistically different).
For example, if you want to test whether automatic/manual transmission affects miles per gallon (mpg), then \(H_0\) would be that manual and automatic transmission cars have the same mpg and \(H_1\) would be that manual and automatic transmission cars have different mpg.
5.4.2 The Role of Statistics in Hypothesis Testing
Statistical tests are designed to reject or not reject a null hypothesis based on data. They do not allow you to find evidence for the alternative; just evidence for/against the null hypothesis. Assuming that \(H_0\) is true, the p-value for a test is the probability of observing data as extreme or more extreme than what you actually observed (i.e., how surprised should I be about this?).
The significance level \(\alpha\) is what you decide this probability should be (i.e., the surprise threshold that you will allow), that is, the probability of a Type I error (rejecting \(H_0\) when it is true). Oftentimes, \(\alpha = 5\% = 0.05\) is accepted although much lower alpha values are common in some research communities, and higher values are used for exploratory research in less defined fields. If we set \(\alpha = 0.05\), we reject \(H_0\) if p-value < 0.05 and say that we have found evidence against \(H_0\).
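A minimal sketch of this decision rule, reusing the kind of t-test that appears later in the chapter:
# Compare a test's p-value to the chosen significance level
test <- t.test(mtcars$mpg ~ mtcars$am)
alpha <- 0.05
test$p.value < alpha  # TRUE indicates evidence against H0 at the 5% level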
Note that statistical tests and methods usually have assumptions about the data and the model. The accuracy of the analysis and the conclusions derived from an analysis depend crucially on the validity of these assumptions. So, it is important to know what those assumptions are and to determine whether they hold.
5.4.3 Checking Biases
Biases refer to preconceived notions about how something does or does not work, opinions you hold about how something should work, and your understanding of how the world operates. Positivism is an epistemological belief that knowledge serves only to describe what we observe. However, we, the observers, are entrenched in known and unknown biases. In the view of the positivist, the world is deterministic and can be modeled and understood in complete objectivity. Because positivism does not account for our biases, social scientists shifted to a post-positivist view. Post-positivism holds space for our biases by acknowledging that scientific reasoning and our acquired sense of the world are not the same. Most importantly, this epistemological view allows for the revision of theory, and holds that the goal of science is to describe reality in the most accurate way possible while understanding that we can never fully achieve that goal.
5.4.4 Tests of Group Differences
Before testing whether means are different among groups, it can be helpful to visualize the data distribution.
ggplot(mtcars) +
geom_boxplot(aes(x=am, y=mpg))
Remember, violin plots are similar to box plots except that violin plots show the probability density of the data at different values.
ggplot(mtcars) +
geom_violin(aes(x = am, y = mpg, fill = am))
T-tests can be used to test whether the means of two groups are significantly different. The null hypothesis \(H_0\) assumes that the true difference in means \(\mu_d\) is equal to zero. The alternative hypothesis can be that \(\mu_d \neq 0\), \(\mu_d >0\), or \(\mu_d<0\). By default, the alternative hypothesis is that \(\mu_d \neq 0\).
t.test(mtcars$mpg ~ mtcars$am, alternative="two.sided")
##
## Welch Two Sample t-test
##
## data: mtcars$mpg by mtcars$am
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means between group automatic and group manual is not equal to 0
## 95 percent confidence interval:
## -11.280194 -3.209684
## sample estimates:
## mean in group automatic mean in group manual
## 17.14737 24.39231
library(webr)
## Registered S3 method overwritten by 'ggforce':
## method from
## scale_type.units units
model <- t.test(mtcars$mpg ~ mtcars$am, alternative = "less")
plot(model)
The t-test results indicate that there is a significant difference in mpg between cars with automatic and manual transmissions (t = -3.77, p = .001). The webr library also allows you to easily produce distribution plots. Using this plot, we can see that data this extreme would occur only about 0.1% of the time if the null hypothesis (no difference in mpg by transmission type) were true. With reasonably large samples, differences are generally significant (p < .05) when t values exceed +/- 1.96.
T-tests require a normality assumption which may or may not be valid depending on the situation. The function wilcox.test allows us to compare group medians (not means) in a non-parametric setting (i.e., without the assumption that the data come from some probability distribution). You may have heard of the Mann-Whitney U test; this test is just a two-sample Wilcoxon test. Is the median mpg different for automatic and manual cars?
wilcox.test(mtcars$mpg ~ mtcars$am)
##
## Wilcoxon rank sum test with continuity correction
##
## data: mtcars$mpg by mtcars$am
## W = 42, p-value = 0.001871
## alternative hypothesis: true location shift is not equal to 0
Yes, there is a significant difference in the median of mpg across transmission types (W = 42, p = .002)
5.4.5 Reporting Support
Writing up statistical reports is a vital skill for researchers. The report library supports automated APA-compliant write-ups for statistical analyses. For instance, the previous t-test is assigned to an object, and that object is passed to the report() function. The output includes all the important facets for reporting a t-test—including confidence intervals, a t-value, degrees of freedom, a p-value, and Cohen’s d (i.e., a reflection of the mean difference between groups in terms of standard deviations).
library(report)
ttestoutcome <- t.test(mtcars$mpg ~ mtcars$am, alternative = "two.sided")
report(ttestoutcome)
## Effect sizes were labelled following Cohen's (1988) recommendations.
##
## The Welch Two Sample t-test testing the difference of mtcars$mpg by mtcars$am (mean in group automatic = 17.15, mean in group manual = 24.39) suggests that the effect is negative, statistically significant, and large (difference = -7.24, 95% CI [-11.28, -3.21], t(18.33) = -3.77, p = 0.001; Cohen's d = -1.41, 95% CI [-2.26, -0.53])
5.4.6 Inferential Modeling
In this section, each chunk will include the code for running a univariate method of statistical analysis in base R and other libraries. The figure below describes four separate dependent designs that are associated with two basic inferential models, which can be analyzed using a one-sample t-test, where a mean with an unknown distribution (e.g., M = 48) is assessed against a known distribution of data. The dependent samples t-test is used to measure score differences from paired or matched samples.
5.4.6.1 One Sample T-Test
We’re going to use the band participation dataset for these examples. For the one sample t-test, pass in the vector containing your data, and use the argument mu to define the previously known mean (with an unknown distribution). Examine both the output of the statistical test and the automated written report.
data <- read_csv('participationdata.csv')
data <- data %>% drop_na(GradeLevel)
t.test(data$intentionscomp, mu = 5)
##
## One Sample t-test
##
## data: data$intentionscomp
## t = 9.5182, df = 39, p-value = 1.017e-11
## alternative hypothesis: true mean is not equal to 5
## 95 percent confidence interval:
## 11.18181 14.51819
## sample estimates:
## mean of x
## 12.85
onesamplewriteup <- t.test(data$intentionscomp, mu = 5)
report(onesamplewriteup)
## Effect sizes were labelled following Cohen's (1988) recommendations.
##
## The One Sample t-test testing the difference between data$intentionscomp (mean = 12.85) and mu = 5 suggests that the effect is positive, statistically significant, and large (difference = 7.85, 95% CI [11.18, 14.52], t(39) = 9.52, p < .001; Cohen's d = 1.50, 95% CI [1.06, 1.98])
5.4.6.2 Dependent Sample T-Test
To run a dependent sample t-test, use the argument paired = TRUE.
t.test(data$Intentions1, data$Intentions2, paired = TRUE)
##
## Paired t-test
##
## data: data$Intentions1 and data$Intentions2
## t = 0.1383, df = 39, p-value = 0.8907
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
## -0.3406334 0.3906334
## sample estimates:
## mean difference
## 0.025
dependentsamplewriteup <- t.test(data$Intentions1, data$Intentions2, paired = TRUE)
report(dependentsamplewriteup)
## Effect sizes were labelled following Cohen's (1988) recommendations.
##
## The Paired t-test testing the difference between data$Intentions1 and data$Intentions2 (mean difference = 0.03) suggests that the effect is positive, statistically not significant, and very small (difference = 0.03, 95% CI [-0.34, 0.39], t(39) = 0.14, p = 0.891; Cohen's d = 0.02, 95% CI [-0.29, 0.34])
Independent designs are used to compare a measure across differing samples, whereas dependent designs assess differences within a sample.
5.4.6.3 Independent Samples T-Test
Here is another chance to run a basic t-test.
t.test(data$intentionscomp ~ data$GradeLevel, var.equal = T)
##
## Two Sample t-test
##
## data: data$intentionscomp by data$GradeLevel
## t = -2.9009, df = 38, p-value = 0.006156
## alternative hypothesis: true difference in means between group eighth-grade and group seventh-grade is not equal to 0
## 95 percent confidence interval:
## -8.125373 -1.446055
## sample estimates:
## mean in group eighth-grade mean in group seventh-grade
## 9.50000 14.28571
Let’s try plotting the difference using ggstatsplot
.
library(ggstatsplot)
ggbetweenstats(
data = data,
x = GradeLevel,
y = intentionscomp,
title = "Difference in Elective Intentions Between Grade Levels"
)
And for the write up?
ttestwriteup <- t.test(data$intentionscomp ~ data$GradeLevel, var.equal = T)
report(ttestwriteup)
## Effect sizes were labelled following Cohen's (1988) recommendations.
##
## The Two Sample t-test testing the difference of data$intentionscomp by data$GradeLevel (mean in group eighth-grade = 9.50, mean in group seventh-grade = 14.29) suggests that the effect is negative, statistically significant, and large (difference = -4.79, 95% CI [-8.13, -1.45], t(38) = -2.90, p = 0.006; Cohen's d = -1.00, 95% CI [-1.71, -0.28])
5.4.6.4 One-Way ANOVA
There are multiple new terms that emerge when discussing ANOVA approaches. These include:
Within-subjects factors: the task(s) all participants did (i.e., repeated measures)
Between-subjects factors: separate groups did certain conditions (e.g., levels of an experimental variable)
Mixed design: differences were assessed for variables by participants (within-subjects), and something separates them into groups (e.g., between-subjects)
A main effect is the overall effect of a variable on particular groups.
Sometimes you have more than one IV, so you might have two main effects: one for Variable 1 and one for Variable 2.
Also, you can have an interaction between the variables such that performance in particular “cells” of the data depends on the combination of variables.
When an interaction is significant, interpreting the main effects alone can be misleading; interpret the interaction instead (e.g., the effect of one variable changes based on membership in one of the groups of the other variable). See the sketch below.
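As a minimal sketch using the participation data loaded earlier (variables Class, GradeLevel, and intentionscomp), an interaction plot shows whether the effect of one factor differs across levels of the other.
# Interaction plot: mean elective intentions for each Class at each GradeLevel;
# non-parallel lines hint at a possible Class x GradeLevel interaction
interaction.plot(x.factor = data$GradeLevel,
                 trace.factor = data$Class,
                 response = data$intentionscomp,
                 xlab = "Grade Level",
                 trace.label = "Class",
                 ylab = "Mean intentionscomp")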
One-way ANOVAs are like t-tests, but they compare differences across more than two groups (i.e., levels of a categorical variable). For instance, in the participation dataset, you might compare socioeconomic status across the four schools contributing students to the sample. To run the ANOVA, first assign the model using the function aov() with the continuous comparison variable (e.g., SES) and the categorical vector of comparison (e.g., School). To see the statistical output, run the summary() function with the model name inside the parentheses.
model <- aov(SESComp ~ data$School, data = data)
summary(model)
## Df Sum Sq Mean Sq F value Pr(>F)
## data$School 3 2.804 0.9345 7.881 4e-04 ***
## Residuals 34 4.032 0.1186
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 2 observations deleted due to missingness
The output reflects that a significant difference is present among the four schools. However, it is not possible to determine which schools are significantly different from the others without Tukey’s HSD, which allows you to examine mean differences between each combination of schools. Note the school pairs with a p-value below .05 to see which schools differ statistically.
TukeyHSD(model)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = SESComp ~ data$School, data = data)
##
## $`data$School`
## diff lwr upr p adj
## School2-School1 0.6673611 0.2721651 1.06255711 0.0003541
## School3-School1 0.2377315 -0.2141924 0.68965540 0.4955172
## School4-School1 0.3895833 -0.2400647 1.01923134 0.3542291
## School3-School2 -0.4296296 -0.8093217 -0.04993759 0.0215282
## School4-School2 -0.2777778 -0.8577669 0.30221139 0.5732634
## School4-School3 0.1518519 -0.4681826 0.77188635 0.9107596
The reporting function produces information regarding the main effect in the model. For more on how to report Tukey’s HSD, visit https://statistics.laerd.com/spss-tutorials/one-way-anova-using-spss-statistics-2.php
report(model)
## For one-way between subjects designs, partial eta squared is equivalent to eta squared.
## Returning eta squared.
## Warning: 'data_findcols()' is deprecated and will be removed in a future update.
## Its usage is discouraged. Please use 'data_find()' instead.
## Warning: Using `$` in model formulas can produce unexpected results. Specify your
## model using the `data` argument instead.
## Try: SESComp ~ School, data =
## data
## The ANOVA (formula: SESComp ~ data$School) suggests that:
##
## - The main effect of data$School is statistically significant and large (F(3, 34) = 7.88, p < .001; Eta2 = 0.41, 95% CI [0.17, 1.00])
##
## Effect sizes were labelled following Field's (2013) recommendations.
Factorial/Two-Way ANOVA
For factorial ANOVAs, there should be more than one categorical independent variable with one dependent variable (designs with more than one dependent variable call for MANOVA). You are assessing first for main effects (between groups) and then for interaction effects (within groups).
Factorial ANOVAs are named more specifically for their number of groupings. Some examples are listed below:
2 x 2 x 2 ANOVA = Three independent variables with two levels each (e.g., gender, certified to teach or not, instrumentalist or not—main effects and interaction effects)
2 x 3 x 4 ANOVA = three independent variables, one has two levels (certified or not), one has three levels (teach in a rural, suburban, or urban school) and one has four levels (no music ed degree, music ed bachelors, music ed masters, music ed doctorate)
2 x 3 = two independent variables, one with two levels (gender) and another with three levels (degree type)
To conduct a factorial ANOVA in R, use an asterisk to connect the two categorical variables of interest, as seen below. Then, use the summary function, Tukey HSD, and report function to interpret the output.
factorial <- aov(intentionscomp ~ Class * GradeLevel, data = data)
summary(factorial)
## Df Sum Sq Mean Sq F value Pr(>F)
## Class 1 16.6 16.55 0.701 0.40783
## GradeLevel 1 178.1 178.14 7.548 0.00932 **
## Class:GradeLevel 1 16.8 16.82 0.713 0.40415
## Residuals 36 849.6 23.60
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
TukeyHSD(factorial)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = intentionscomp ~ Class * GradeLevel, data = data)
##
## $Class
## diff lwr upr p adj
## orchestra-band 1.288221 -1.831278 4.40772 0.4078281
##
## $GradeLevel
## diff lwr upr p adj
## seventh-grade-eighth-grade 4.525003 1.125608 7.924398 0.0105076
##
## $`Class:GradeLevel`
## diff lwr upr
## orchestra:eighth-grade-band:eighth-grade 2.6250000 -5.38701652 10.637017
## band:seventh-grade-band:eighth-grade 5.8365385 -0.04267434 11.715751
## orchestra:seventh-grade-band:eighth-grade 5.5083333 -0.21962035 11.236287
## band:seventh-grade-orchestra:eighth-grade 3.2115385 -4.26927785 10.692355
## orchestra:seventh-grade-orchestra:eighth-grade 2.8833333 -4.47920176 10.245868
## orchestra:seventh-grade-band:seventh-grade -0.3282051 -5.28599268 4.629582
## p adj
## orchestra:eighth-grade-band:eighth-grade 0.8139391
## band:seventh-grade-band:eighth-grade 0.0522852
## orchestra:seventh-grade-band:eighth-grade 0.0631394
## band:seventh-grade-orchestra:eighth-grade 0.6577297
## orchestra:seventh-grade-orchestra:eighth-grade 0.7187103
## orchestra:seventh-grade-band:seventh-grade 0.9979489
report(factorial)
## Warning: 'data_findcols()' is deprecated and will be removed in a future update.
## Its usage is discouraged. Please use 'data_find()' instead.
## The ANOVA (formula: intentionscomp ~ Class * GradeLevel) suggests that:
##
## - The main effect of Class is statistically not significant and small (F(1, 36) = 0.70, p = 0.408; Eta2 (partial) = 0.02, 95% CI [0.00, 1.00])
## - The main effect of GradeLevel is statistically significant and large (F(1, 36) = 7.55, p = 0.009; Eta2 (partial) = 0.17, 95% CI [0.03, 1.00])
## - The interaction between Class and GradeLevel is statistically not significant and small (F(1, 36) = 0.71, p = 0.404; Eta2 (partial) = 0.02, 95% CI [0.00, 1.00])
##
## Effect sizes were labelled following Field's (2013) recommendations.
The more complex a model is, the larger the sample size needs to be. As such, the lack of a significant effect may be due to low statistical power rather than the absence of the phenomenon of interest.
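As a rough sketch of how power relates to sample size, base R’s power.t.test() gives the per-group n needed for a simple two-group comparison (a simpler analogue of the factorial design above).
# Per-group sample size needed to detect a medium effect (d = 0.5)
# with 80% power at alpha = .05
power.t.test(delta = 0.5, sd = 1, sig.level = 0.05, power = 0.80)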
The MANOVA is an extended technique which determines the effects of independent categorical variables on multiple continuous dependent variables. It is usually used to compare several groups with respect to multiple continuous variables.
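Here is a minimal MANOVA sketch using base R and the built-in iris data as a stand-in, since the examples above use a single outcome.
# Two continuous outcomes modeled simultaneously as a function of one grouping factor
man_mod <- manova(cbind(Sepal.Length, Petal.Length) ~ Species, data = iris)
summary(man_mod)      # multivariate test (Pillai's trace by default)
summary.aov(man_mod)  # follow-up univariate ANOVAs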
5.4.6.5 ANCOVA
The Analysis of Covariance (ANCOVA) is used to examine differences in the mean values of the dependent variable related to the effect of controlled independent variables while taking into account the influence of uncontrolled independent variables included as covariates. Use an addition operator to add covariates to the model.
ancova_model <- aov(intentionscomp ~ GradeLevel + valuescomp, data = data)
summary(ancova_model, type = "III")
## Df Sum Sq Mean Sq F value Pr(>F)
## GradeLevel 1 178.9 178.92 9.73 0.00362 **
## valuescomp 1 208.4 208.37 11.33 0.00186 **
## Residuals 35 643.6 18.39
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 2 observations deleted due to missingness
report(ancova_model)
## Warning: 'data_findcols()' is deprecated and will be removed in a future update.
## Its usage is discouraged. Please use 'data_find()' instead.
## The ANOVA (formula: intentionscomp ~ GradeLevel + valuescomp) suggests that:
##
## - The main effect of GradeLevel is statistically significant and large (F(1, 35) = 9.73, p = 0.004; Eta2 (partial) = 0.22, 95% CI [0.05, 1.00])
## - The main effect of valuescomp is statistically significant and large (F(1, 35) = 11.33, p = 0.002; Eta2 (partial) = 0.24, 95% CI [0.07, 1.00])
##
## Effect sizes were labelled following Field's (2013) recommendations.
5.5 Review
In this chapter we reviewed necessary skills for successful R coding, in addition to introducing correlational and inferential statistical modeling procedures. To make sure you understand this material, there is a practice assessment to go along with this and the previous chapter at https://jayholster.shinyapps.io/StatsinRAssessment/
5.6 References
Cui, B. (2020). DataExplorer. https://cran.r-project.org/web/packages/DataExplorer/index.html
Fox, J., & Weisberg, S. (2019). An R Companion to Applied Regression, Third edition. Sage, Thousand Oaks CA. https://socialsciences.mcmaster.ca/jfox/Books/Companion/.
Kassambara, A. (2021). rstatix. https://cran.r-project.org/web/packages/rstatix/index.html
Hebbali, A. (2020). olsrr. https://cran.r-project.org/web/packages/olsrr/index.html
Makowski, D., Lüdecke, D., Mattan S., Patil, I., Wiernik, B. M., & Siegel, R. (2022). report: Automated Reporting of Results and Statistical Models. https://cran.r-project.org/web/packages/report/index.html
Revelle, W. (2017) psych: Procedures for Personality and Psychological Research, Northwestern University, Evanston, Illinois, USA, <https://CRAN.R-project.org/package=psych> Version = 1.7.8.
Russell, J. A. (2018). Statistics in music education research. Oxford University Press.
Uenlue, A. (2018). Self-Determination Theory Measures. https://cran.r-project.org/web/packages/SDT/index.html
Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L.D., François, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T.L., Miller, E., Bache, S.M., Müller, K., Ooms, J., Robinson, D., Seidel, D.P., Spinu, V., Takahashi, K., Vaughan, D., Wilke, C., Woo, K., & Yutani, H. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686.
5.6.1 R Short Course Series
Video lectures of each guidebook chapter can be found at https://osf.io/6jb9t/. For this chapter, follow the folder path Statistics in R -> AY 2021-2022 Spring and access the video files, R markdown documents, and other materials for each short course.
5.6.2 Acknowledgements
This guidebook was created with support from the Center for Research Data and Digital Scholarship and the Laboratory for Interdisciplinary Statistical Analysis at the University of Colorado Boulder, as well as the U.S. Agency for International Development under cooperative agreement #7200AA18CA00022. Individuals who contributed to materials related to this project include Jacob Holster, Eric Vance, Michael Ramsey, Nicholas Varberg, and Nickoal Eichmann-Kalwara.