Chapter 4 Inferential statistics

Inferential statistics is an essential step in drawing conclusions from your data and generalizing them to a larger population. It involves using sample data to infer the properties of an entire population, typically through hypothesis testing, confidence intervals, and regression analysis. In this chapter, we will explore how to conduct inferential statistical analyses and interpret the results using R.

4.1 Pearson Correlation: Assessing Relationships Between Variables

One of the fundamental aspects of data analysis is understanding the relationships between different variables in your data set. Pearson correlation is a statistical method used to quantify the strength and direction of a linear relationship between two continuous variables. It is often denoted as “r” and ranges from -1 to 1. A positive value indicates a positive linear relationship (as one variable increases, the other tends to increase), a negative value indicates a negative linear relationship (as one variable increases, the other tends to decrease), and a value close to 0 suggests little to no linear relationship.

You can calculate the Pearson correlation coefficient using the cor.test() function.

# Create a data set
data <- data.frame(variable1 = c(9, 15, 9, 11, 4, 6, 4, 12, 2, 5, 3, 11, 9, 10),
                   variable2 = c(4, 8, 4, 10, 3, 5, 11, 4, 12, 3, 4, 8, 3, 8))

# Calculate Pearson correlation
cor.test(data$variable1, 
         data$variable2,
         alternative = "two.sided", # can be edited to one-tailed with either "less" or "greater"
         method = "pearson",        # can be edited to "kendall" or "spearman"
         conf.level = 0.95)         # can be edited to change confidence level
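To make the output of cor.test() less of a black box, the following minimal sketch recomputes the main quantities by hand using base R only. It assumes the same two variables as above, stored as plain vectors; the names r_manual, t_manual, and p_manual are illustrative and not part of any package.

```r
# Same values as variable1 and variable2 in the data set above
variable1 <- c(9, 15, 9, 11, 4, 6, 4, 12, 2, 5, 3, 11, 9, 10)
variable2 <- c(4, 8, 4, 10, 3, 5, 11, 4, 12, 3, 4, 8, 3, 8)

# Pearson's r is the covariance scaled by both standard deviations
r_manual <- cov(variable1, variable2) / (sd(variable1) * sd(variable2))

# Under the null hypothesis r = 0, this statistic follows a
# t distribution with n - 2 degrees of freedom
n <- length(variable1)
t_manual <- r_manual * sqrt((n - 2) / (1 - r_manual^2))
p_manual <- 2 * pt(-abs(t_manual), df = n - 2) # two-sided p-value

# Compare with the values reported by cor.test()
result <- cor.test(variable1, variable2)
all.equal(r_manual, unname(result$estimate))  # TRUE
all.equal(t_manual, unname(result$statistic)) # TRUE
all.equal(p_manual, result$p.value)           # TRUE
```

The hand calculation only covers the two-sided Pearson case; the "kendall" and "spearman" methods use different statistics.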

4.2 Linear regression: Predicting outcomes based on one or more predictors

Linear regression is a statistical technique used to model the relationship between a dependent variable and one or more independent variables. This section will guide you through conducting linear regression in R, starting with a single predictor and then adding a second predictor and an interaction term.

You can perform a linear regression using the lm() function.

# Load packages
library(tidyverse)

# Create a data set
data <- data.frame(variable1 = c(9, 15, 9, 11, 4, 6, 4, 12, 2, 5, 3, 11, 9, 10),
                   variable2 = c(4, 8, 4, 10, 3, 5, 11, 4, 12, 3, 4, 8, 3, 8),
                   variable3 = c(1, 8, 6, 7, 11, 4, 12, 3, 4, 7, 3, 8, 9, 11))

# Conduct a linear regression predicting variable1 from variable2
lm(variable1 ~ variable2, 
   data) %>% # pipe the fitted model to summary() to obtain inferential statistics
  summary()

# Conduct a linear regression predicting variable1 from variable2 and variable3
lm(variable1 ~ variable2 + variable3, 
   data) %>%
  summary()

# Conduct a linear regression predicting variable1 from variable2, variable3, and their interaction
lm(variable1 ~ variable2 + variable3 + variable2:variable3, 
   data) %>%
  summary()

# Conduct a linear regression predicting variable1 from variable2, variable3, and their interaction (with a simplified syntax)
lm(variable1 ~ variable2*variable3, 
   data) %>%
  summary()
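Once a model is fitted, it can also be stored and used for prediction. The sketch below is a minimal illustration using base R only; the object names (model, new_data, full, short) and the new predictor values 5 and 10 are assumptions chosen for the example.

```r
# Same data set as above
data <- data.frame(variable1 = c(9, 15, 9, 11, 4, 6, 4, 12, 2, 5, 3, 11, 9, 10),
                   variable2 = c(4, 8, 4, 10, 3, 5, 11, 4, 12, 3, 4, 8, 3, 8),
                   variable3 = c(1, 8, 6, 7, 11, 4, 12, 3, 4, 7, 3, 8, 9, 11))

# Store the fitted model instead of piping it straight to summary()
model <- lm(variable1 ~ variable2, data = data)
coef(model) # intercept and slope

# Predict variable1 for new (hypothetical) values of variable2
new_data <- data.frame(variable2 = c(5, 10))
predicted <- predict(model, newdata = new_data)

# The same predictions by hand: intercept + slope * predictor
manual <- coef(model)[["(Intercept)"]] +
  coef(model)[["variable2"]] * new_data$variable2
all.equal(unname(predicted), manual) # TRUE

# The * shorthand expands to both main effects plus their interaction,
# so the two interaction models above give identical coefficients
full <- lm(variable1 ~ variable2 + variable3 + variable2:variable3, data = data)
short <- lm(variable1 ~ variable2 * variable3, data = data)
all.equal(coef(full), coef(short)) # TRUE
```

Storing the model in an object also makes it available to other base R helpers such as confint() and residuals().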

4.3 T-test: Comparing means between two groups

The t-test is a widely used statistical test for comparing a mean either to a fixed value or to the mean of another group. This section will walk you through performing t-tests in R in three cases: comparing a mean to a single value (one-sample), comparing independent groups (two-sample / between-participants), and comparing dependent measurements (paired-sample / within-participants).

You can perform t-tests using the t_test() function from the apa package.

# Load packages
library(apa)

# Create a data set
data <- data.frame(condition = c("high", "high", "low", "high", "low", "low", "low", "high"),
                   variable1 = c(9, 15, 9, 11, 4, 6, 4, 12),
                   variable2 = c(2, 4, 9, 6, 1, 6, 4, 3))

# Conduct a t-test (one-sample) comparing the mean of variable1 to 0
t_test(data$variable1, 
       mu = 0,                    # value to be compared to
       alternative = "two.sided", # can be edited to one-tailed with either "less" or "greater"
       conf.level = 0.95)         # can be edited to change confidence level

# Conduct a t-test (two-sample / between-participants) comparing the mean of variable1 in the "high" condition with that in the "low" condition
t_test(data$variable1[data$condition=="high"], # statement in square brackets selects values of variable1 in the high condition
       data$variable1[data$condition=="low"],  # statement in square brackets selects values of variable1 in the low condition
       alternative = "two.sided",              # can be edited to one-tailed with either "less" or "greater"
       paired = FALSE,                         # can be edited to TRUE for paired-sample / within-participants t-test (see below)
       var.equal = TRUE,                       # can be edited to FALSE for Welch test
       conf.level = 0.95)                      # can be edited to change confidence level

# Conduct a t-test (paired-sample / within-participants) comparing the mean of variable1 with that of variable2
t_test(data$variable1,
       data$variable2,
       alternative = "two.sided", # can be edited to one-tailed with either "less" or "greater"
       paired = TRUE,             # can be edited to FALSE for two-sample / between-participants t-test (see above)
       conf.level = 0.95)         # can be edited to change confidence level
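A paired-sample t-test is equivalent to a one-sample t-test on the pairwise differences. The following minimal sketch illustrates this with base R's t.test() rather than t_test() from the apa package, an assumption made here so that the example has no package dependencies; the names differences and t_manual are illustrative.

```r
# Same values as variable1 and variable2 in the data set above
variable1 <- c(9, 15, 9, 11, 4, 6, 4, 12)
variable2 <- c(2, 4, 9, 6, 1, 6, 4, 3)

# Difference score for each participant
differences <- variable1 - variable2
n <- length(differences)

# t statistic by hand: mean difference divided by its standard error
t_manual <- mean(differences) / (sd(differences) / sqrt(n))

# Paired-sample test and one-sample test on the differences
paired_result <- t.test(variable1, variable2, paired = TRUE)
one_sample_result <- t.test(differences, mu = 0)

all.equal(t_manual, unname(paired_result$statistic))       # TRUE
all.equal(paired_result$p.value, one_sample_result$p.value) # TRUE
unname(paired_result$parameter) # degrees of freedom: n - 1 = 7
```

This equivalence is why the paired test requires both variables to have the same length and order: each row must describe the same participant.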

4.4 ANOVA: Analyzing variance between multiple groups

Analysis of Variance (ANOVA) is a statistical method used to analyze the differences in means among multiple groups. This section will guide you through conducting ANOVA in R.

You can perform Analysis of Variance using the aov_ez() function from the afex package.

Let’s start with a simple example where you have a single between-participants factor with three levels (high, medium, low).

# Load packages
library(afex)

# Create a data set
data <- data.frame(id = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12),
                   condition1 = c("high", "high", "high", "high", "medium", "medium", "medium", "medium", "low", "low", "low", "low"),
                   condition2 = c("active", "control", "active", "control", "active", "control", "active", "control", "active", "control", "active", "control"),
                   positive_score = c(9, 15, 9, 11, 2, 4, 7, 3, 4, 6, 4, 12),
                   neutral_score = c(5, 6, 9, 11, 2, 4, 7, 3, 7, 6, 6, 4),
                   negative_score = c(1, 2, 9, 1, 2, 4, 3, 3, 4, 2, 4, 4))

# Conduct an Analysis of Variance comparing positive scores across "high", "medium", and "low" condition1 (between participants)
aov_ez(id = "id",
       dv = "positive_score",
       between = "condition1",
       data = data)

# Conduct an Analysis of Variance comparing positive scores across condition1 (between participants) and condition2 (between participants) 
aov_ez(id = "id",
       dv = "positive_score",
       between = c("condition1", "condition2"),
       data = data)
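To show what the F statistic in a one-way between-participants ANOVA measures, the sketch below computes it by hand from the sums of squares. It is checked against base R's aov() rather than aov_ez(), an assumption made to keep the example free of package dependencies; all object names are illustrative.

```r
# Same values as condition1 and positive_score in the data set above
condition1 <- factor(rep(c("high", "medium", "low"), each = 4))
positive_score <- c(9, 15, 9, 11, 2, 4, 7, 3, 4, 6, 4, 12)

grand_mean <- mean(positive_score)
group_means <- tapply(positive_score, condition1, mean)
group_sizes <- tapply(positive_score, condition1, length)

# Between-groups variability: group means around the grand mean
ss_between <- sum(group_sizes * (group_means - grand_mean)^2)

# Within-groups variability: scores around their own group mean
# (ave() returns the group mean for every observation)
ss_within <- sum((positive_score - ave(positive_score, condition1))^2)

df_between <- nlevels(condition1) - 1
df_within <- length(positive_score) - nlevels(condition1)

# F is the ratio of the two mean squares
f_manual <- (ss_between / df_between) / (ss_within / df_within)
p_manual <- pf(f_manual, df_between, df_within, lower.tail = FALSE)

# Compare with the ANOVA table from base R
base_table <- summary(aov(positive_score ~ condition1))[[1]]
all.equal(f_manual, base_table[["F value"]][1]) # TRUE
all.equal(p_manual, base_table[["Pr(>F)"]][1])  # TRUE
```

A large F indicates that the group means differ more than would be expected from the spread of scores within the groups.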

Now let’s consider a within-participants comparison with three levels (positive, neutral, negative). It is common for psychological scientists to save such data as three individual columns: one for positive responses, one for neutral, and one for negative.

To analyze this via the aov_ez() function, we first need to restructure the data into a long format, with each participant represented three times and the levels indicated via a new variable (similar to how between-participants conditions are coded). This can be done via the melt() function from the reshape2 package.

# Load packages
library(tidyverse)
library(reshape2)
library(afex)

# Create a data set
data <- data.frame(id = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12),
                   condition1 = c("high", "high", "high", "high", "medium", "medium", "medium", "medium", "low", "low", "low", "low"),
                   condition2 = c("active", "control", "active", "control", "active", "control", "active", "control", "active", "control", "active", "control"),
                   positive_score = c(9, 15, 9, 11, 2, 4, 7, 3, 4, 6, 4, 12),
                   neutral_score = c(5, 6, 9, 11, 2, 4, 7, 3, 7, 6, 6, 4),
                   negative_score = c(1, 2, 9, 1, 2, 4, 3, 3, 4, 2, 4, 4))

# Restructure data set
restructured_data <- data %>%
  melt(id.vars = c("id", "condition1", "condition2"),
       measure.vars = c("positive_score", "neutral_score", "negative_score"),
       variable.name = "dependent_variable",
       value.name = "score")

# Analysis of Variance comparing scores across dependent variables "positive", "neutral", and "negative" (within participants)
aov_ez(id = "id",
       dv = "score",
       within = "dependent_variable",
       data = restructured_data)

# Analysis of Variance comparing scores across condition1 (between participants) and dependent variables (within participants)
aov_ez(id = "id",
       dv = "score",
       between = "condition1",
       within = "dependent_variable",
       data = restructured_data)
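The long format produced by the restructuring step can also be built by hand, which makes its shape explicit. The following is a minimal base R sketch, not a replacement for melt(); the names wide, long, and score_columns are illustrative, and the condition columns are left out for brevity.

```r
# The score columns from the data set above, in wide format
wide <- data.frame(id = 1:12,
                   positive_score = c(9, 15, 9, 11, 2, 4, 7, 3, 4, 6, 4, 12),
                   neutral_score = c(5, 6, 9, 11, 2, 4, 7, 3, 7, 6, 6, 4),
                   negative_score = c(1, 2, 9, 1, 2, 4, 3, 3, 4, 2, 4, 4))

score_columns <- c("positive_score", "neutral_score", "negative_score")

# Stack the three score columns on top of each other and record
# which column each value came from
long <- data.frame(
  id = rep(wide$id, times = length(score_columns)),
  dependent_variable = factor(rep(score_columns, each = nrow(wide)),
                              levels = score_columns),
  score = unlist(wide[score_columns], use.names = FALSE)
)

nrow(long)                # 36: 12 participants x 3 levels
all(table(long$id) == 3)  # TRUE: every participant appears three times
```

Checking the number of rows and the number of observations per participant in this way is a quick safeguard before passing restructured data to an ANOVA function.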