3.4 Bivariate Statistics
Correlation between
- Two Continuous variables
- Two Discrete variables
- Categorical and Continuous
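| Categorical | Continuous |
---|---|---|
Categorical | Two Discrete (3.4.3) | Categorical and Continuous (3.4.2) |
Continuous | Categorical and Continuous (3.4.2) | Two Continuous (3.4.1) |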
Questions to keep in mind:
- Is the relationship linear or non-linear?
- If the variable is continuous, is it normally distributed and homoskedastic?
- How big is your dataset?
3.4.1 Two Continuous
n = 100 # (sample size)
set.seed(1) # for reproducibility
data = data.frame(A = sample(1:20, replace = TRUE, size = n),
                  B = sample(1:30, replace = TRUE, size = n))
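For two continuous variables, a common starting point is Pearson's correlation (linear association), with the rank-based Spearman and Kendall coefficients as alternatives when the relationship is monotonic but not linear. The following is a minimal sketch using base R's `cor()` and `cor.test()` on the simulated data; the seed is an assumption added only for reproducibility.

```r
set.seed(1)  # assumed seed, only so the sketch is reproducible
n <- 100
data <- data.frame(A = sample(1:20, replace = TRUE, size = n),
                   B = sample(1:30, replace = TRUE, size = n))

# Pearson: strength of the linear association
pearson <- cor(data$A, data$B, method = "pearson")

# Spearman and Kendall: rank-based, sensitive to monotonic association
spearman <- cor(data$A, data$B, method = "spearman")
kendall  <- cor(data$A, data$B, method = "kendall")

c(pearson = pearson, spearman = spearman, kendall = kendall)

# Test of the null hypothesis that the Pearson correlation is zero
cor.test(data$A, data$B, method = "pearson")
```

Because `A` and `B` are sampled independently here, all three coefficients should be close to zero.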
3.4.2 Categorical and Continuous
3.4.2.1 Point-Biserial Correlation
The point-biserial correlation coefficient is the Pearson correlation coefficient computed between a binary variable and a continuous variable, so it likewise ranges from -1 to 1, where:
-1 means a perfectly negative correlation between two variables
0 means no correlation between two variables
1 means a perfectly positive correlation between two variables
x <- c(0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0)
y <- c(12, 14, 17, 17, 11, 22, 23, 11, 19, 8, 12)
#calculate point-biserial correlation
cor.test(x, y)
#>
#> Pearson's product-moment correlation
#>
#> data: x and y
#> t = 0.67064, df = 9, p-value = 0.5193
#> alternative hypothesis: true correlation is not equal to 0
#> 95 percent confidence interval:
#> -0.4391885 0.7233704
#> sample estimates:
#> cor
#> 0.2181635
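The same value can be reproduced from the group means. With \(M_1\) and \(M_0\) the means of the continuous variable in the two groups, \(n_1\) and \(n_0\) the group sizes, and \(s_n\) the standard deviation computed with divisor \(n\), the coefficient is

\[ r_{pb} = \frac{M_1 - M_0}{s_n} \sqrt{\frac{n_1 n_0}{n^2}} \]

A minimal sketch that checks this against `cor()` for the example data (the variable names are illustrative):

```r
x <- c(0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0)
y <- c(12, 14, 17, 17, 11, 22, 23, 11, 19, 8, 12)

n  <- length(y)
n1 <- sum(x == 1)          # size of group coded 1
n0 <- sum(x == 0)          # size of group coded 0
m1 <- mean(y[x == 1])      # mean of y in group 1
m0 <- mean(y[x == 0])      # mean of y in group 0

# standard deviation with divisor n (not n - 1, which sd() uses)
s_n <- sqrt(mean((y - mean(y))^2))

r_pb <- (m1 - m0) / s_n * sqrt(n1 * n0 / n^2)
r_pb

# agrees with the Pearson correlation of x and y
all.equal(r_pb, cor(x, y))
#> [1] TRUE
```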
Alternatively
3.4.2.2 Logistic Regression
See 3.4.2.2
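A minimal sketch of the logistic-regression view, reusing the point-biserial example data. Treating the binary variable `x` as the outcome and the continuous variable `y` as the predictor is an assumption made here for illustration.

```r
x <- c(0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0)                  # binary outcome
y <- c(12, 14, 17, 17, 11, 22, 23, 11, 19, 8, 12)        # continuous predictor

# logistic regression of x on y
fit <- glm(x ~ y, family = binomial(link = "logit"))

# estimates, standard errors, z statistics, and p-values
summary(fit)$coefficients

# odds ratio for a one-unit increase in y
exp(coef(fit)["y"])
```

The sign of the slope matches the sign of the point-biserial correlation, while the p-value for `y` plays the role of the correlation test.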
3.4.3 Two Discrete
3.4.3.1 Distance Metrics
Some do not consider distance a correlation metric because it is not unit independent (i.e., if you rescale the variables, the distance changes), but it is still a useful proxy. Distance metrics are more commonly used as similarity measures.
- Euclidean Distance
- Manhattan Distance
- Chessboard Distance
- Minkowski Distance
- Canberra Distance
- Hamming Distance
- Cosine Distance
- Sum of Absolute Distance
- Sum of Squared Distance
- Mean-Absolute Error
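Most of these distances are available through base R's `dist()` or are one-liners. The sketch below computes them for two short illustrative vectors (`a` and `b` are made-up values, not data from this chapter); the chessboard distance corresponds to `method = "maximum"` in `dist()`.

```r
a <- c(1, 2, 3)
b <- c(4, 6, 3)
m <- rbind(a, b)   # dist() computes distances between rows

euclidean  <- as.numeric(dist(m, method = "euclidean"))        # sqrt(sum((a - b)^2))
manhattan  <- as.numeric(dist(m, method = "manhattan"))        # sum(abs(a - b))
chessboard <- as.numeric(dist(m, method = "maximum"))          # max(abs(a - b))
minkowski3 <- as.numeric(dist(m, method = "minkowski", p = 3)) # (sum(abs(a - b)^3))^(1/3)
canberra   <- as.numeric(dist(m, method = "canberra"))         # sum(abs(a - b) / (abs(a) + abs(b)))

hamming         <- sum(a != b)                                  # number of differing positions
cosine_distance <- 1 - sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

sum_abs <- sum(abs(a - b))     # sum of absolute distance
sum_sq  <- sum((a - b)^2)      # sum of squared distance
mae     <- mean(abs(a - b))    # mean absolute error
```

Multiplying both vectors by 10 multiplies the Euclidean distance by 10 but leaves the cosine distance unchanged, which illustrates the unit dependence mentioned above.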
3.4.3.2 Statistical Metrics
3.4.3.2.1 Chi-squared test
3.4.3.2.1.2 Cramer’s V
- between nominal categorical variables (no natural order)
\[ \text{Cramer's V} = \sqrt{\frac{\chi^2/n}{\min(c-1,r-1)}} \]
where
\(\chi^2\) = Chi-square statistic
\(n\) = sample size
\(r\) = # of rows
\(c\) = # of columns
library('lsr')
n = 100 # (sample size)
set.seed(1)
data = data.frame(A = sample(1:5, replace = TRUE, size = n),
B = sample(1:6, replace = TRUE, size = n))
cramersV(data$A, data$B)
#> [1] 0.1944616
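To connect the formula to the code, Cramer's V can also be computed directly from the chi-squared statistic returned by base R's `chisq.test()`. This is a minimal sketch; `cramers_v` is a hypothetical helper written for illustration, not a function from `lsr`.

```r
# hypothetical helper: Cramer's V from the Pearson chi-squared statistic
cramers_v <- function(a, b) {
  tab <- table(a, b)
  # Pearson chi-squared statistic without continuity correction
  chi2 <- suppressWarnings(chisq.test(tab, correct = FALSE))$statistic
  n <- sum(tab)
  as.numeric(sqrt((chi2 / n) / min(ncol(tab) - 1, nrow(tab) - 1)))
}

n <- 100
set.seed(1)
data <- data.frame(A = sample(1:5, replace = TRUE, size = n),
                   B = sample(1:6, replace = TRUE, size = n))
v <- cramers_v(data$A, data$B)
v

# two perfectly associated variables give V = 1
cramers_v(c(1, 2, 3, 1, 2, 3), c("a", "b", "c", "a", "b", "c"))
```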
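Alternatively, `DescTools::CramerV()` can also report a confidence interval for Cramer's V. Its `method` argument accepts:
- `ncchisq`: noncentral Chi-square
- `ncchisqadj`: adjusted noncentral Chi-square
- `fisher`: Fisher Z transformation
- `fisheradj`: bias-corrected Fisher Z transformation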
3.4.3.3 Ordinal Association (Rank correlation)
- Works well for non-linear relationships, as long as they are monotonic
3.4.3.3.1 Ordinal and Nominal
n = 100 # (sample size)
set.seed(1)
dt = table(data.frame(
A = sample(1:4, replace = TRUE, size = n), # ordinal
B = sample(1:3, replace = TRUE, size = n) # nominal
))
dt
#> B
#> A 1 2 3
#> 1 7 11 9
#> 2 11 6 14
#> 3 7 11 4
#> 4 6 4 10
3.4.3.3.2 Two Ordinal
n = 100 # (sample size)
set.seed(1)
dt = table(data.frame(
A = sample(1:4, replace = TRUE, size = n), # ordinal
B = sample(1:3, replace = TRUE, size = n) # ordinal
))
dt
#> B
#> A 1 2 3
#> 1 7 11 9
#> 2 11 6 14
#> 3 7 11 4
#> 4 6 4 10
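One rank-based measure for two ordinal variables is Goodman and Kruskal's Gamma, \((C - D)/(C + D)\), where \(C\) and \(D\) are the numbers of concordant and discordant pairs. The sketch below counts the pairs directly from a contingency table; `goodman_kruskal_gamma` is a hypothetical helper, and the matrix is copied from the table printed above.

```r
# hypothetical helper: Goodman and Kruskal's Gamma from a contingency table
# whose rows and columns are both in their natural (ordinal) order
goodman_kruskal_gamma <- function(tab) {
  tab <- as.matrix(tab)
  nr <- nrow(tab)
  nc <- ncol(tab)
  concordant <- 0
  discordant <- 0
  for (i in seq_len(nr)) {
    for (j in seq_len(nc)) {
      # cells below and to the right form concordant pairs with cell (i, j)
      if (i < nr && j < nc) {
        concordant <- concordant + tab[i, j] * sum(tab[(i + 1):nr, (j + 1):nc])
      }
      # cells below and to the left form discordant pairs with cell (i, j)
      if (i < nr && j > 1) {
        discordant <- discordant + tab[i, j] * sum(tab[(i + 1):nr, 1:(j - 1)])
      }
    }
  }
  (concordant - discordant) / (concordant + discordant)
}

# counts copied from the table above (rows = A, columns = B)
dt <- matrix(c( 7, 11,  9,
               11,  6, 14,
                7, 11,  4,
                6,  4, 10),
             nrow = 4, byrow = TRUE,
             dimnames = list(A = 1:4, B = 1:3))
gamma_dt <- goodman_kruskal_gamma(dt)
gamma_dt
```

Gamma ranges from -1 (all untied pairs discordant) to 1 (all untied pairs concordant); pairs tied on either variable are ignored.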
3.4.3.3.2.4 Yule’s Q and Y
- 2 ordinal variables with two levels each (i.e., a \(2 \times 2\) table)
Special case of Goodman and Kruskal's Gamma coefficient for a \((2 \times 2)\) table.
| Variable 1 | |
---|---|---|
Variable 2 | a | b |
| c | d |
\[ \text{Yule's Q} = \frac{ad - bc}{ad + bc} \]
Yule’s \(Q\) is the version typically used in practice. Yule’s \(Y\) is defined as follows and is related to \(Q\) by the two identities after it.
\[ \text{Yule's Y} = \frac{\sqrt{ad} - \sqrt{bc}}{\sqrt{ad} + \sqrt{bc}} \]
\[ Q = \frac{2Y}{1 + Y^2} \]
\[ Y = \frac{1 - \sqrt{1-Q^2}}{Q} \]
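A minimal sketch that computes both coefficients for a \(2 \times 2\) table and checks numerically that \(Q = 2Y/(1 + Y^2)\) and \(Y = (1 - \sqrt{1 - Q^2})/Q\). The cell counts are made up for illustration, and `yule_q` and `yule_y` are hypothetical helper names.

```r
# hypothetical helpers; tab is a 2 x 2 matrix laid out as
#   a b
#   c d
yule_q <- function(tab) {
  ad <- tab[1, 1] * tab[2, 2]
  bc <- tab[1, 2] * tab[2, 1]
  (ad - bc) / (ad + bc)
}

yule_y <- function(tab) {
  ad <- tab[1, 1] * tab[2, 2]
  bc <- tab[1, 2] * tab[2, 1]
  (sqrt(ad) - sqrt(bc)) / (sqrt(ad) + sqrt(bc))
}

# illustrative counts: a = 20, b = 10, c = 5, d = 15
tab <- matrix(c(20, 10,
                 5, 15), nrow = 2, byrow = TRUE)

q <- yule_q(tab)
y <- yule_y(tab)
c(Q = q, Y = y)

# the two identities linking Q and Y
all.equal(q, 2 * y / (1 + y^2))
all.equal(y, (1 - sqrt(1 - q^2)) / q)
```

Both coefficients share the same sign, and \(|Y| \le |Q|\) for any table.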
3.4.3.3.2.5 Tetrachoric Correlation
- is a special case of Polychoric Correlation when both variables are binary
library(psych)
n = 100 # (sample size)
data = data.frame(A = sample(c(0, 1), replace = TRUE, size = n),
B = sample(c(0, 1), replace = TRUE, size = n))
#view table
head(data)
#> A B
#> 1 1 0
#> 2 1 0
#> 3 0 0
#> 4 1 0
#> 5 1 0
#> 6 1 0
table(data)
#> B
#> A 0 1
#> 0 21 23
#> 1 34 22
#calculate tetrachoric correlation
tetrachoric(data)
#> Call: tetrachoric(x = data)
#> tetrachoric correlation
#> A B
#> A 1.0
#> B -0.2 1.0
#>
#> with tau of
#> A B
#> -0.15 0.13
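As a rough cross-check, the tetrachoric correlation of a \(2 \times 2\) table with cells \(a, b, c, d\) can be approximated by the cosine-pi formula \(\cos\left(\pi / (1 + \sqrt{ad/bc})\right)\). This is only an approximation and not the estimation procedure used by `tetrachoric()`; `tetrachoric_cos_pi` is a hypothetical helper, and the counts are copied from the table above.

```r
# hypothetical helper: cosine-pi approximation to the tetrachoric correlation
# tab is a 2 x 2 matrix laid out as
#   a b
#   c d
tetrachoric_cos_pi <- function(tab) {
  ad <- tab[1, 1] * tab[2, 2]
  bc <- tab[1, 2] * tab[2, 1]
  cos(pi / (1 + sqrt(ad / bc)))
}

# counts copied from table(data) above (rows = A, columns = B)
tab <- matrix(c(21, 23,
                34, 22), nrow = 2, byrow = TRUE)

r_approx <- tetrachoric_cos_pi(tab)
round(r_approx, 2)
#> [1] -0.2
```

For this table the approximation agrees with the reported value of -0.2 to the printed precision; the two can differ more when the marginal proportions are far from 0.5.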
3.4.3.3.2.6 Polychoric Correlation
- between ordinal categorical variables (natural order).
- Assumption: each ordinal variable is a discretized version of a latent, normally distributed continuous variable (e.g., income recorded as low, medium, high).
library(polycor)
n = 100 # (sample size)
data = data.frame(A = sample(1:4, replace = TRUE, size = n),
B = sample(1:6, replace = TRUE, size = n))
head(data)
#> A B
#> 1 1 3
#> 2 1 1
#> 3 3 5
#> 4 2 3
#> 5 3 5
#> 6 4 4
#calculate polychoric correlation between A and B
polychor(data$A, data$B)
#> [1] 0.01607982