## 3.4 Bivariate Statistics

Which correlation (or association) measure to use depends on the types of the two variables involved.

Questions to keep in mind:

1. Is the relationship linear or non-linear?
2. If the variable is continuous, is it normally distributed and homoscedastic?
3. How big is your dataset?

### 3.4.1 Two Continuous

```r
n = 100 # (sample size)

data = data.frame(A = sample(1:20, replace = TRUE, size = n),
                  B = sample(1:30, replace = TRUE, size = n))
```

#### 3.4.1.1 Pearson Correlation

• Good for linear relationships
```r
library(Hmisc)
rcorr(data$A, data$B, type = "pearson")
#>      x    y
#> x 1.00 0.17
#> y 0.17 1.00
#>
#> n= 100
#>
#>
#> P
#>   x      y
#> x        0.0878
#> y 0.0878
```

#### 3.4.1.2 Spearman Correlation

• Good for monotonic (possibly non-linear) relationships; rank-based, so robust to outliers

```r
library(Hmisc)
rcorr(data$A, data$B, type = "spearman")
#>      x    y
#> x 1.00 0.18
#> y 0.18 1.00
#>
#> n= 100
#>
#>
#> P
#>   x    y
#> x      0.08
#> y 0.08
```

### 3.4.2 Categorical and Continuous

#### 3.4.2.1 Point-Biserial Correlation

Similar to the Pearson correlation coefficient, the point-biserial correlation coefficient is between -1 and 1 where:

• -1 means a perfectly negative correlation between two variables

• 0 means no correlation between two variables

• 1 means a perfectly positive correlation between two variables

```r
x <- c(0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0)
y <- c(12, 14, 17, 17, 11, 22, 23, 11, 19, 8, 12)

# calculate point-biserial correlation
cor.test(x, y)
#>
#>  Pearson's product-moment correlation
#>
#> data:  x and y
#> t = 0.67064, df = 9, p-value = 0.5193
#> alternative hypothesis: true correlation is not equal to 0
#> 95 percent confidence interval:
#>  -0.4391885  0.7233704
#> sample estimates:
#>       cor
#> 0.2181635
```

Alternatively,

```r
ltm::biserial.cor(y, x, use = c("all.obs"), level = 2)
#> [1] 0.2181635
```

See 3.4.2.2

### 3.4.3 Two Discrete

#### 3.4.3.1 Distance Metrics

Some do not consider distance a correlation metric because it is not unit-independent (i.e., if you rescale the data, the metric changes), but it is still a useful proxy. Distance metrics are more commonly used as similarity measures.

• Euclidean Distance

• Manhattan Distance

• Chessboard Distance

• Minkowski Distance

• Canberra Distance

• Hamming Distance

• Cosine Distance

• Sum of Absolute Distance

• Sum of Squared Distance

• Mean-Absolute Error
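
Several of these can be sketched with base R's `dist()`, while cosine and Hamming distances take one-liners (a minimal sketch; nothing beyond base R is assumed):

```r
# Two illustrative numeric vectors
x <- c(1, 2, 3, 4)
y <- c(2, 4, 6, 8)
m <- rbind(x, y)

euclidean  <- dist(m, method = "euclidean")        # sqrt(sum((x - y)^2))
manhattan  <- dist(m, method = "manhattan")        # sum(|x - y|)
chessboard <- dist(m, method = "maximum")          # max(|x - y|), a.k.a. Chebyshev
minkowski  <- dist(m, method = "minkowski", p = 3) # general L^p norm
canberra   <- dist(m, method = "canberra")

# Cosine distance: 1 minus cosine similarity
cosine <- 1 - sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))

# Hamming distance: number of positions that differ
hamming <- sum(x != y)
```

Here `y = 2 * x`, so the cosine distance is 0 (the vectors point in the same direction) even though the Euclidean distance is not, which illustrates why the choice of metric matters.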

#### 3.4.3.2 Statistical Metrics

##### 3.4.3.2.1 Chi-squared test
###### 3.4.3.2.1.1 Phi coefficient
• 2 binary variables

```r
dt = matrix(c(1, 4, 3, 5), nrow = 2)
dt
#>      [,1] [,2]
#> [1,]    1    3
#> [2,]    4    5
psych::phi(dt)
#> [1] -0.18
```
###### 3.4.3.2.1.2 Cramer’s V
• between nominal categorical variables (no natural order)

$\text{Cramer's V} = \sqrt{\frac{\chi^2/n}{\min(c-1,r-1)}}$

where

• $$\chi^2$$ = Chi-square statistic

• $$n$$ = sample size

• $$r$$ = # of rows

• $$c$$ = # of columns
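
As a sanity check on the formula, Cramér's V can be computed by hand from base R's `chisq.test()` (continuity correction turned off); the table here is purely illustrative:

```r
# Illustrative 2 x 3 contingency table (hypothetical counts)
tbl <- matrix(c(10, 20, 15, 25, 30, 12), nrow = 2)

chi2 <- unname(chisq.test(tbl, correct = FALSE)$statistic) # Chi-square statistic
n    <- sum(tbl)   # sample size
r    <- nrow(tbl)  # number of rows
c    <- ncol(tbl)  # number of columns

# Plug directly into the formula above
V <- sqrt((chi2 / n) / min(c - 1, r - 1))
V
```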

```r
library(lsr)
n = 100 # (sample size)
set.seed(1)
data = data.frame(A = sample(1:5, replace = TRUE, size = n),
                  B = sample(1:6, replace = TRUE, size = n))

cramersV(data$A, data$B)
#> [1] 0.1944616
```

Alternatively,

• `ncchisq`: noncentral Chi-square

• `ncchisqadj`: adjusted noncentral Chi-square

• `fisher`: Fisher Z transformation

• `fisheradj`: bias-corrected Fisher Z transformation

```r
DescTools::CramerV(data, conf.level = 0.95, method = "ncchisqadj")
#>  Cramer V    lwr.ci    upr.ci
#> 0.3472325 0.3929964 0.4033053
```
###### 3.4.3.2.1.3 Tschuprow’s T
• 2 nominal variables

```r
DescTools::TschuprowT(data)
#> [1] 0.1100808
```

#### 3.4.3.3 Ordinal Association (Rank correlation)

• Good for monotonic, possibly non-linear relationships
##### 3.4.3.3.1 Ordinal and Nominal
```r
n = 100 # (sample size)
set.seed(1)
dt = table(data.frame(
  A = sample(1:4, replace = TRUE, size = n), # ordinal
  B = sample(1:3, replace = TRUE, size = n)  # nominal
))
dt
#>    B
#> A    1  2  3
#>   1  7 11  9
#>   2 11  6 14
#>   3  7 11  4
#>   4  6  4 10
```
###### 3.4.3.3.1.1 Freeman’s Theta
• Ordinal and nominal

```r
# this package is not available for R >= 4.0.0
rcompanion::freemanTheta(dt, group = "column")
# because column is the grouping variable (i.e., nominal)
```
###### 3.4.3.3.1.2 Epsilon-squared
• Ordinal and nominal

```r
# this package is not available for R >= 4.0.0
rcompanion::epsilonSquared(dt, group = "column")
# because column is the grouping variable (i.e., nominal)
```
##### 3.4.3.3.2 Two Ordinal
```r
n = 100 # (sample size)
set.seed(1)
dt = table(data.frame(
  A = sample(1:4, replace = TRUE, size = n), # ordinal
  B = sample(1:3, replace = TRUE, size = n)  # ordinal
))
dt
#>    B
#> A    1  2  3
#>   1  7 11  9
#>   2 11  6 14
#>   3  7 11  4
#>   4  6  4 10
```
###### 3.4.3.3.2.1 Goodman Kruskal’s Gamma
• 2 ordinal variables

```r
DescTools::GoodmanKruskalGamma(dt, conf.level = 0.95)
#>        gamma       lwr.ci       upr.ci
#>  0.006781013 -0.229032069  0.242594095
```
###### 3.4.3.3.2.2 Somers’ D
• or Somers’ Delta

• 2 ordinal variables

```r
DescTools::SomersDelta(dt, conf.level = 0.95)
#>       somers       lwr.ci       upr.ci
#>  0.005115859 -0.172800185  0.183031903
```
###### 3.4.3.3.2.3 Kendall’s Tau-b
• 2 ordinal variables

```r
DescTools::KendallTauB(dt, conf.level = 0.95)
#>        tau_b       lwr.ci       upr.ci
#>  0.004839732 -0.163472443  0.173151906
```
###### 3.4.3.3.2.4 Yule’s Q and Y
• 2 ordinal variables

A special case (for $$2 \times 2$$ tables) of the Goodman Kruskal's Gamma coefficient, with cell counts labeled as follows:

|                | Variable 1 |     |
|----------------|:----------:|:---:|
| **Variable 2** | a          | b   |
|                | c          | d   |

$\text{Yule's Q} = \frac{ad - bc}{ad + bc}$

We typically use Yule's $$Q$$ in practice, while Yule's $$Y$$ has the following relationship with $$Q$$:

$\text{Yule's Y} = \frac{\sqrt{ad} - \sqrt{bc}}{\sqrt{ad} + \sqrt{bc}}$

$Q = \frac{2Y}{1 + Y^2}$

$Y = \frac{1 - \sqrt{1-Q^2}}{Q}$
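
A quick numeric check of these identities in base R, using cell counts that match the $$2 \times 2$$ example below:

```r
# Cell counts of a 2 x 2 table (a b / c d)
a <- 25; b <- 24
c <- 28; d <- 23

Q <- (a * d - b * c) / (a * d + b * c)
Y <- (sqrt(a * d) - sqrt(b * c)) / (sqrt(a * d) + sqrt(b * c))

# Verify the identities relating Q and Y
all.equal(Q, 2 * Y / (1 + Y^2))        # TRUE
all.equal(Y, (1 - sqrt(1 - Q^2)) / Q)  # TRUE
```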

```r
n = 100 # (sample size)
set.seed(1)
dt = table(data.frame(A = sample(c(0, 1), replace = TRUE, size = n),
                      B = sample(c(0, 1), replace = TRUE, size = n)))
dt
#>    B
#> A    0  1
#>   0 25 24
#>   1 28 23

DescTools::YuleQ(dt)
#> [1] -0.07778669
```
###### 3.4.3.3.2.5 Tetrachoric Correlation
```r
library(psych)

n = 100 # (sample size)

data = data.frame(A = sample(c(0, 1), replace = TRUE, size = n),
                  B = sample(c(0, 1), replace = TRUE, size = n))

# view the first few rows
head(data)
#>   A B
#> 1 1 0
#> 2 1 0
#> 3 0 0
#> 4 1 0
#> 5 1 0
#> 6 1 0

table(data)
#>    B
#> A    0  1
#>   0 21 23
#>   1 34 22

# calculate tetrachoric correlation
tetrachoric(data)
#> Call: tetrachoric(x = data)
#> tetrachoric correlation
#>   A    B
#> A  1.0
#> B -0.2  1.0
#>
#>  with tau of
#>     A     B
#> -0.15  0.13
```
###### 3.4.3.3.2.6 Polychoric Correlation
• Between ordinal categorical variables (natural order).

• Assumption: each ordinal variable is a discrete representation of a latent, normally distributed continuous variable (e.g., income = low, medium, high).

```r
library(polycor)

n = 100 # (sample size)

data = data.frame(A = sample(1:4, replace = TRUE, size = n),
                  B = sample(1:6, replace = TRUE, size = n))

polychor(data$A, data$B)
#> [1] 0.01607982
```