3.4 Bivariate Statistics

Correlation measures by variable type:

  • Categorical and categorical: Phi coefficient, Cramer’s V, Tschuprow’s T, Freeman’s Theta, Epsilon-squared, Goodman Kruskal’s Gamma, Somers’ D, Kendall’s Tau-b, Yule’s Q and Y, Tetrachoric Correlation, Polychoric Correlation

  • Categorical and continuous: Point-Biserial Correlation, Logistic Regression

  • Continuous and continuous: Pearson Correlation, Spearman Correlation

Questions to keep in mind:

  1. Is the relationship linear or non-linear?
  2. If the variable is continuous, is it normally distributed and homoskedastic?
  3. How big is your dataset?

3.4.1 Two Continuous

n = 100 # (sample size)

data = data.frame(A = sample(1:20, replace = TRUE, size = n),
                  B = sample(1:30, replace = TRUE, size = n))

3.4.1.1 Pearson Correlation

  • Good with linear relationships
library(Hmisc)
rcorr(data$A, data$B, type="pearson") 
#>      x    y
#> x 1.00 0.17
#> y 0.17 1.00
#> 
#> n= 100 
#> 
#> 
#> P
#>   x      y     
#> x        0.0878
#> y 0.0878

3.4.1.2 Spearman Correlation

library(Hmisc)
rcorr(data$A, data$B, type="spearman") 
#>      x    y
#> x 1.00 0.18
#> y 0.18 1.00
#> 
#> n= 100 
#> 
#> 
#> P
#>   x    y   
#> x      0.08
#> y 0.08
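
Spearman's coefficient is the Pearson correlation computed on ranks, which is why it picks up monotonic association even when the relationship is not linear. The following is a minimal base-R sketch of that equivalence; the vectors `x` and `y` are illustrative assumptions and are not the `data` object used above.

```r
# Illustrative data (assumption: not from the original example)
set.seed(42)
x <- rnorm(50)
y <- exp(x) + rnorm(50, sd = 0.1)   # monotonic but non-linear in x

# Spearman's rho via the built-in method
rho_builtin <- cor(x, y, method = "spearman")

# The same quantity obtained as Pearson's r on the ranks
rho_ranks <- cor(rank(x), rank(y), method = "pearson")

all.equal(rho_builtin, rho_ranks)
```

With tied values, `rank()` assigns average ranks by default, which is also what `cor(..., method = "spearman")` uses.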

3.4.2 Categorical and Continuous

3.4.2.1 Point-Biserial Correlation

The point-biserial correlation measures the association between a binary variable and a continuous variable. Like the Pearson correlation coefficient, it lies between -1 and 1, where:

  • -1 means a perfectly negative correlation between two variables

  • 0 means no correlation between two variables

  • 1 means a perfectly positive correlation between two variables

x <- c(0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0)
y <- c(12, 14, 17, 17, 11, 22, 23, 11, 19, 8, 12)

#calculate point-biserial correlation
cor.test(x, y)
#> 
#>  Pearson's product-moment correlation
#> 
#> data:  x and y
#> t = 0.67064, df = 9, p-value = 0.5193
#> alternative hypothesis: true correlation is not equal to 0
#> 95 percent confidence interval:
#>  -0.4391885  0.7233704
#> sample estimates:
#>       cor 
#> 0.2181635

Alternatively, using the ltm package:

ltm::biserial.cor(y,x, use = c("all.obs"), level = 2)
#> [1] 0.2181635
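
Because the point-biserial coefficient is Pearson's r with one 0/1 variable, it can also be written in terms of the two group means. The sketch below reuses `x` and `y` from the example above; the helper names `m1`, `m0`, `p`, and `s_n` are introduced here for illustration only.

```r
x <- c(0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0)
y <- c(12, 14, 17, 17, 11, 22, 23, 11, 19, 8, 12)

m1  <- mean(y[x == 1])                 # mean of y in the group x = 1
m0  <- mean(y[x == 0])                 # mean of y in the group x = 0
p   <- mean(x == 1)                    # proportion of observations with x = 1
s_n <- sqrt(mean((y - mean(y))^2))     # SD of y dividing by n (not n - 1)

# point-biserial correlation from group means
r_pb <- (m1 - m0) / s_n * sqrt(p * (1 - p))
r_pb

# agrees with Pearson's r on the same data
all.equal(r_pb, cor(x, y))
```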

3.4.2.2 Logistic Regression

See the section of this book that covers logistic regression.

3.4.3 Two Discrete

3.4.3.1 Distance Metrics

Some do not consider distance a correlation metric because it is not unit independent (i.e., rescaling the variables changes the distance), but it is still a useful proxy. Distance metrics are more commonly used as similarity measures.

  • Euclidean Distance

  • Manhattan Distance

  • Chessboard Distance

  • Minkowski Distance

  • Canberra Distance

  • Hamming Distance

  • Cosine Distance

  • Sum of Absolute Distance

  • Sum of Squared Distance

  • Mean-Absolute Error
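
Several of the distances listed above are available through `dist()` in base R, and the rest are one-liners. The following sketch uses two illustrative vectors `a` and `b` (assumptions, not data from this chapter).

```r
# Two illustrative observations (assumption: not from the original text)
a <- c(0, 0, 1, 1)
b <- c(3, 4, 1, 0)
m <- rbind(a, b)

# Distances implemented by stats::dist()
euclidean  <- as.numeric(dist(m, method = "euclidean"))
manhattan  <- as.numeric(dist(m, method = "manhattan"))
chessboard <- as.numeric(dist(m, method = "maximum"))      # largest coordinate gap
minkowski3 <- as.numeric(dist(m, method = "minkowski", p = 3))
canberra   <- as.numeric(dist(m, method = "canberra"))

# Distances computed by hand
hamming     <- sum(a != b)                                  # positions that differ
cosine_dist <- 1 - sum(a * b) / sqrt(sum(a^2) * sum(b^2))
sum_abs     <- sum(abs(a - b))                              # equals Manhattan
sum_sq      <- sum((a - b)^2)                               # squared Euclidean
mae         <- mean(abs(a - b))

c(euclidean = euclidean, manhattan = manhattan, chessboard = chessboard,
  hamming = hamming, cosine = cosine_dist, mae = mae)
```

Multiplying `a` and `b` by a constant changes most of these values, which illustrates the unit dependence mentioned above; the cosine distance is the exception.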

3.4.3.2 Statistical Metrics

3.4.3.2.1 Chi-squared test
3.4.3.2.1.1 Phi coefficient
  • 2 binary variables
dt = matrix(c(1,4,3,5), nrow = 2)
dt
#>      [,1] [,2]
#> [1,]    1    3
#> [2,]    4    5
psych::phi(dt)
#> [1] -0.18
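
For a 2 x 2 table with cells a, b (first row) and c, d (second row), the phi coefficient is (ad - bc) divided by the square root of the product of the four marginal totals. A minimal base-R sketch on the same table; the name `c_` is used only to avoid masking `base::c`.

```r
dt <- matrix(c(1, 4, 3, 5), nrow = 2)   # same 2 x 2 table as above

a  <- dt[1, 1]; b <- dt[1, 2]
c_ <- dt[2, 1]; d <- dt[2, 2]

# phi from cell counts and marginal totals
phi_manual <- (a * d - b * c_) /
  sqrt((a + b) * (c_ + d) * (a + c_) * (b + d))
round(phi_manual, 2)

# |phi| also equals sqrt(chi-squared / n) without continuity correction
chi2 <- suppressWarnings(chisq.test(dt, correct = FALSE))$statistic
all.equal(abs(phi_manual), as.numeric(sqrt(chi2 / sum(dt))))
```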
3.4.3.2.1.2 Cramer’s V
  • between nominal categorical variables (no natural order)

\[ \text{Cramer's V} = \sqrt{\frac{\chi^2/n}{\min(c-1,r-1)}} \]

where

  • \(\chi^2\) = Chi-square statistic

  • \(n\) = sample size

  • \(r\) = # of rows

  • \(c\) = # of columns

library('lsr')
n = 100 # (sample size)
set.seed(1)
data = data.frame(A = sample(1:5, replace = TRUE, size = n),
                  B = sample(1:6, replace = TRUE, size = n))


cramersV(data$A, data$B)
#> [1] 0.1944616
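
The formula above can also be evaluated directly from `chisq.test()`. The function `cramers_v()` below is a hypothetical helper written for illustration, not part of lsr or DescTools; it assumes the input is a contingency table with no empty rows or columns.

```r
# Hypothetical helper: Cramer's V from a contingency table
cramers_v <- function(tab) {
  # Pearson chi-squared statistic without continuity correction
  chi2 <- suppressWarnings(chisq.test(tab, correct = FALSE))$statistic
  n <- sum(tab)                          # sample size
  k <- min(nrow(tab), ncol(tab)) - 1     # min(r - 1, c - 1)
  as.numeric(sqrt((chi2 / n) / k))
}

# Same data-generating code as the example above
n <- 100
set.seed(1)
data <- data.frame(A = sample(1:5, replace = TRUE, size = n),
                   B = sample(1:6, replace = TRUE, size = n))

cramers_v(table(data$A, data$B))
```

A value of 0 corresponds to no association and 1 to perfect association, regardless of the table dimensions.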

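Alternatively, DescTools::CramerV also reports a confidence interval. Its method argument selects how the interval is computed:

  • ncchisq noncentral Chi-square

  • ncchisqadj Adjusted noncentral Chi-square

  • fisher Fisher Z transformation

  • fisheradj bias correction Fisher z transformation

# pass the two variables (or their contingency table), not the raw data frame;
# a data frame would be read as a 100 x 2 table of counts
DescTools::CramerV(data$A, data$B, conf.level = 0.95, method = "ncchisqadj")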
3.4.3.2.1.3 Tschuprow’s T
  • 2 nominal variables
# pass the two variables (or their contingency table), not the raw data frame
DescTools::TschuprowT(data$A, data$B)
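
Tschuprow's T uses the same chi-squared statistic as Cramer's V but normalizes by the square root of (r - 1)(c - 1) instead of min(r - 1, c - 1). A minimal sketch under that definition; `tschuprow_t()` is a hypothetical helper and the two tables are illustrative assumptions.

```r
# Hypothetical helper: Tschuprow's T from a contingency table
tschuprow_t <- function(tab) {
  chi2 <- suppressWarnings(chisq.test(tab, correct = FALSE))$statistic
  n <- sum(tab)
  as.numeric(sqrt((chi2 / n) / sqrt((nrow(tab) - 1) * (ncol(tab) - 1))))
}

# Perfect association in a square table gives T = 1
square_tab <- diag(c(5, 5, 5))
tschuprow_t(square_tab)

# In a non-square table T cannot reach 1, even when the
# row variable is fully determined by the column variable
wide_tab <- matrix(c(10, 0,
                      0, 10,
                     10, 0), nrow = 2)
tschuprow_t(wide_tab)
```

For square tables T and Cramer's V coincide; for non-square tables T is the smaller of the two.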

3.4.3.3 Ordinal Association (Rank correlation)

  • Good with non-linear (monotonic) relationships
3.4.3.3.1 Ordinal and Nominal
n = 100 # (sample size)
set.seed(1)
dt = table(data.frame(
    A = sample(1:4, replace = TRUE, size = n), # ordinal
    B = sample(1:3, replace = TRUE, size = n)  # nominal
)) 
dt
#>    B
#> A    1  2  3
#>   1  7 11  9
#>   2 11  6 14
#>   3  7 11  4
#>   4  6  4 10
3.4.3.3.1.1 Freeman’s Theta
  • Ordinal and nominal
# at the time of writing, rcompanion could not be installed on R >= 4.0.0,
# so no output is shown
# group = "column" because the column variable is the grouping (nominal) variable
rcompanion::freemanTheta(dt, group = "column")
3.4.3.3.1.2 Epsilon-squared
  • Ordinal and nominal
# at the time of writing, rcompanion could not be installed on R >= 4.0.0,
# so no output is shown
# group = "column" because the column variable is the grouping (nominal) variable
rcompanion::epsilonSquared(dt, group = "column")
3.4.3.3.2 Two Ordinal
n = 100 # (sample size)
set.seed(1)
dt = table(data.frame(
    A = sample(1:4, replace = TRUE, size = n), # ordinal
    B = sample(1:3, replace = TRUE, size = n)  # ordinal
)) 
dt
#>    B
#> A    1  2  3
#>   1  7 11  9
#>   2 11  6 14
#>   3  7 11  4
#>   4  6  4 10
3.4.3.3.2.1 Goodman Kruskal’s Gamma
  • 2 ordinal variables
DescTools::GoodmanKruskalGamma(dt, conf.level = 0.95)
#>        gamma       lwr.ci       upr.ci 
#>  0.006781013 -0.229032069  0.242594095
3.4.3.3.2.2 Somers’ D
  • or Somers’ Delta

  • 2 ordinal variables

DescTools::SomersDelta(dt, conf.level = 0.95)
#>       somers       lwr.ci       upr.ci 
#>  0.005115859 -0.172800185  0.183031903
3.4.3.3.2.3 Kendall’s Tau-b
  • 2 ordinal variables
DescTools::KendallTauB(dt, conf.level = 0.95)
#>        tau_b       lwr.ci       upr.ci 
#>  0.004839732 -0.163472443  0.173151906
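
Gamma, Somers' D, and Kendall's Tau-b share the same numerator, the number of concordant pairs minus the number of discordant pairs, and differ only in how tied pairs enter the denominator. The sketch below counts the pairs by hand for the table printed above; all variable names are introduced here for illustration, and the Somers' D denominator shown (pairs not tied on the column variable) is the one that reproduces the DescTools value above.

```r
# Contingency table printed above (rows = A, columns = B)
tab <- matrix(c( 7, 11,  9,
                11,  6, 14,
                 7, 11,  4,
                 6,  4, 10), nrow = 4, byrow = TRUE)

nr <- nrow(tab); nc <- ncol(tab)
conc <- 0; disc <- 0
for (i in seq_len(nr - 1)) {
  below <- tab[(i + 1):nr, , drop = FALSE]   # all rows below row i
  for (j in seq_len(nc)) {
    # concordant: the other observation is higher on both variables
    if (j < nc) conc <- conc + tab[i, j] * sum(below[, (j + 1):nc])
    # discordant: higher on A but lower on B
    if (j > 1)  disc <- disc + tab[i, j] * sum(below[, 1:(j - 1)])
  }
}

n <- sum(tab)
not_tied_rows <- (n^2 - sum(rowSums(tab)^2)) / 2   # pairs not tied on A
not_tied_cols <- (n^2 - sum(colSums(tab)^2)) / 2   # pairs not tied on B

gk_gamma <- (conc - disc) / (conc + disc)
somers_d <- (conc - disc) / not_tied_cols
tau_b    <- (conc - disc) / sqrt(not_tied_rows * not_tied_cols)

c(gamma = gk_gamma, somers = somers_d, tau_b = tau_b)
```

Gamma ignores ties entirely, so its absolute value is never smaller than that of the other two coefficients on the same table.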
3.4.3.3.2.4 Yule’s Q and Y
  • 2 binary variables

Special version \((2 \times 2)\) of the Goodman Kruskal’s Gamma coefficient.

              Variable 1
Variable 2    a    b
              c    d

\[ \text{Yule's Q} = \frac{ad - bc}{ad + bc} \]

Yule’s \(Q\) is typically used in practice. Yule’s \(Y\) is defined as follows, and the two identities after it relate \(Y\) and \(Q\).

\[ \text{Yule's Y} = \frac{\sqrt{ad} - \sqrt{bc}}{\sqrt{ad} + \sqrt{bc}} \]

\[ Q = \frac{2Y}{1 + Y^2} \]

\[ Y = \frac{1 - \sqrt{1-Q^2}}{Q} \]

n = 100 # (sample size)
set.seed(1)
dt = table(data.frame(A = sample(c(0, 1), replace = TRUE, size = n),
                  B = sample(c(0, 1), replace = TRUE, size = n)))
dt
#>    B
#> A    0  1
#>   0 25 24
#>   1 28 23

DescTools::YuleQ(dt)
#> [1] -0.07778669
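
Both coefficients can be computed directly from the four cell counts, which also makes it easy to check the identities linking \(Q\) and \(Y\). A minimal base-R sketch using the table printed above; the name `c_` is used only to avoid masking `base::c`.

```r
# 2 x 2 table printed above (rows = A, columns = B)
tab <- matrix(c(25, 24,
                28, 23), nrow = 2, byrow = TRUE)

a  <- tab[1, 1]; b <- tab[1, 2]
c_ <- tab[2, 1]; d <- tab[2, 2]

Q <- (a * d - b * c_) / (a * d + b * c_)
Y <- (sqrt(a * d) - sqrt(b * c_)) / (sqrt(a * d) + sqrt(b * c_))
c(Q = Q, Y = Y)

# identities linking the two coefficients
all.equal(Q, 2 * Y / (1 + Y^2))
all.equal(Y, (1 - sqrt(1 - Q^2)) / Q)
```

The second identity is undefined when \(Q = 0\), in which case \(Y = 0\) as well.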
3.4.3.3.2.5 Tetrachoric Correlation
library(psych)

n = 100 # (sample size)

data = data.frame(A = sample(c(0, 1), replace = TRUE, size = n),
                  B = sample(c(0, 1), replace = TRUE, size = n))

#view table
head(data)
#>   A B
#> 1 1 0
#> 2 1 0
#> 3 0 0
#> 4 1 0
#> 5 1 0
#> 6 1 0

table(data)
#>    B
#> A    0  1
#>   0 21 23
#>   1 34 22


#calculate tetrachoric correlation
tetrachoric(data)
#> Call: tetrachoric(x = data)
#> tetrachoric correlation 
#>   A    B   
#> A  1.0     
#> B -0.2  1.0
#> 
#>  with tau of 
#>     A     B 
#> -0.15  0.13
3.4.3.3.2.6 Polychoric Correlation
  • between ordinal categorical variables (natural order).
  • Assumption: each ordinal variable is a discretized version of a latent, normally distributed continuous variable (e.g., income recorded as low, normal, high).
library(polycor)

n = 100 # (sample size)

data = data.frame(A = sample(1:4, replace = TRUE, size = n),
                  B = sample(1:6, replace = TRUE, size = n))

head(data)
#>   A B
#> 1 1 3
#> 2 1 1
#> 3 3 5
#> 4 2 3
#> 5 3 5
#> 6 4 4


#calculate polychoric correlation between ratings
polychor(data$A, data$B)
#> [1] 0.01607982