3.7 Two Discrete

3.7.1 Distance Metrics

Some argue that distance is not a correlation metric because it is not unit-independent (i.e., rescaling the variables changes the metric), but it is still a useful proxy. Distance metrics are more commonly used as similarity measures (a quick sketch of a few of them follows the list).

  • Euclidean Distance

  • Manhattan Distance

  • Chessboard Distance

  • Minkowski Distance

  • Canberra Distance

  • Hamming Distance

  • Cosine Distance

  • Sum of Absolute Distance

  • Sum of Squared Distance

  • Mean-Absolute Error
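A minimal sketch, using base R's dist() (plus a hand-rolled cosine distance, since base R has no built-in), of several of these metrics on two small vectors:

x <- c(1, 2, 3)
y <- c(4, 5, 6)

dist(rbind(x, y), method = "euclidean")        # sqrt(sum((x - y)^2))
dist(rbind(x, y), method = "manhattan")        # sum(|x - y|)
dist(rbind(x, y), method = "maximum")          # chessboard (Chebyshev) distance
dist(rbind(x, y), method = "canberra")
dist(rbind(x, y), method = "minkowski", p = 3)

# cosine distance = 1 - cosine similarity
1 - sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))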

3.7.2 Statistical Metrics

3.7.2.1 Chi-squared test

3.7.2.1.1 Phi coefficient
  • 2 binary variables
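For a \(2 \times 2\) table with cells \(a, b\) in the first row and \(c, d\) in the second, the phi coefficient is

\[ \phi = \frac{ad - bc}{\sqrt{(a+b)(c+d)(a+c)(b+d)}} \]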
dt = matrix(c(1,4,3,5), nrow = 2)
dt
#>      [,1] [,2]
#> [1,]    1    3
#> [2,]    4    5
psych::phi(dt)
#> [1] -0.18
3.7.2.1.2 Cramer’s V
  • between nominal categorical variables (no natural order)

\[ \text{Cramer's V} = \sqrt{\frac{\chi^2/n}{\min(c-1,r-1)}} \]

where

  • \(\chi^2\) = Chi-square statistic

  • \(n\) = sample size

  • \(r\) = # of rows

  • \(c\) = # of columns

library('lsr')
n = 100 # (sample size)
set.seed(1)
data = data.frame(A = sample(1:5, replace = TRUE, size = n),
                  B = sample(1:6, replace = TRUE, size = n))


cramersV(data$A, data$B)
#> [1] 0.1944616

Alternatively, DescTools::CramerV() also provides confidence intervals, with several methods:

  • ncchisq: noncentral chi-square

  • ncchisqadj: adjusted noncentral chi-square

  • fisher: Fisher z-transformation

  • fisheradj: bias-corrected Fisher z-transformation

DescTools::CramerV(data, conf.level = 0.95, method = "ncchisqadj")
#>  Cramer V    lwr.ci    upr.ci 
#> 0.3472325 0.3929964 0.4033053
3.7.2.1.3 Tschuprow’s T
  • 2 nominal variables
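It is closely related to Cramer's V, but uses the geometric mean of \(r - 1\) and \(c - 1\) in place of the minimum:

\[ \text{Tschuprow's } T = \sqrt{\frac{\chi^2/n}{\sqrt{(r-1)(c-1)}}} \]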
DescTools::TschuprowT(data)
#> [1] 0.1100808

3.7.3 Ordinal Association (Rank correlation)

  • Works well with monotonic, non-linear relationships (see the quick illustration below)
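A small base R illustration (simulated data, assumed here purely for demonstration): a rank correlation fully captures a monotonic but non-linear relationship that Pearson's \(r\) understates.

set.seed(1)
x <- runif(100)
y <- exp(5 * x) # strictly increasing, strongly non-linear

cor(x, y, method = "pearson")  # substantially below 1
cor(x, y, method = "spearman") # exactly 1: the ranks agree perfectly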

3.7.3.1 Ordinal and Nominal

n = 100 # (sample size)
set.seed(1)
dt = table(data.frame(
    A = sample(1:4, replace = TRUE, size = n), # ordinal
    B = sample(1:3, replace = TRUE, size = n)  # nominal
)) 
dt
#>    B
#> A    1  2  3
#>   1  7 11  9
#>   2 11  6 14
#>   3  7 11  4
#>   4  6  4 10
3.7.3.1.1 Freeman’s Theta
  • Ordinal and nominal
# at the time of writing, this package was not available for R >= 4.0.0
rcompanion::freemanTheta(dt, group = "column") 
# because column is the grouping variable (i.e., nominal)
3.7.3.1.2 Epsilon-squared
  • Ordinal and nominal
# at the time of writing, this package was not available for R >= 4.0.0
rcompanion::epsilonSquared(dt,group = "column" ) 
# because column is the grouping variable (i.e., nominal)

3.7.3.2 Two Ordinal

n = 100 # (sample size)
set.seed(1)
dt = table(data.frame(
    A = sample(1:4, replace = TRUE, size = n), # ordinal
    B = sample(1:3, replace = TRUE, size = n)  # ordinal
)) 
dt
#>    B
#> A    1  2  3
#>   1  7 11  9
#>   2 11  6 14
#>   3  7 11  4
#>   4  6  4 10
3.7.3.2.1 Goodman Kruskal’s Gamma
  • 2 ordinal variables
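With \(C\) concordant pairs and \(D\) discordant pairs (ties ignored):

\[ G = \frac{C - D}{C + D} \]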
DescTools::GoodmanKruskalGamma(dt, conf.level = 0.95)
#>        gamma       lwr.ci       upr.ci 
#>  0.006781013 -0.229032069  0.242594095
3.7.3.2.2 Somers’ D
  • or Somers’ Delta

  • 2 ordinal variables
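Using the same \(C\) and \(D\) as Gamma, Somers' \(\Delta\) additionally counts \(T_Y\), the number of pairs tied only on the dependent variable \(Y\), which makes the measure asymmetric:

\[ D_{YX} = \frac{C - D}{C + D + T_Y} \]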

DescTools::SomersDelta(dt, conf.level = 0.95)
#>       somers       lwr.ci       upr.ci 
#>  0.005115859 -0.172800185  0.183031903
3.7.3.2.3 Kendall’s Tau-b
  • 2 ordinal variables
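Tau-b adjusts for ties on both variables, where \(T_X\) and \(T_Y\) are the numbers of pairs tied only on \(X\) and only on \(Y\), respectively:

\[ \tau_b = \frac{C - D}{\sqrt{(C + D + T_X)(C + D + T_Y)}} \]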
DescTools::KendallTauB(dt, conf.level = 0.95)
#>        tau_b       lwr.ci       upr.ci 
#>  0.004839732 -0.163472443  0.173151906
3.7.3.2.4 Yule’s Q and Y
  • 2 ordinal variables

A special case of the Goodman Kruskal's Gamma coefficient for a \(2 \times 2\) table of the form

                Variable 1
Variable 2       a     b
                 c     d

\[ \text{Yule's Q} = \frac{ad - bc}{ad + bc} \]

Yule's \(Q\) is typically used in practice, while Yule's \(Y\) relates to \(Q\) as follows:

\[ \text{Yule's Y} = \frac{\sqrt{ad} - \sqrt{bc}}{\sqrt{ad} + \sqrt{bc}} \]

\[ Q = \frac{2Y}{1 + Y^2} \]

\[ Y = \frac{1 - \sqrt{1-Q^2}}{Q} \]

n = 100 # (sample size)
set.seed(1)
dt = table(data.frame(A = sample(c(0, 1), replace = TRUE, size = n),
                      B = sample(c(0, 1), replace = TRUE, size = n)))
dt
#>    B
#> A    0  1
#>   0 25 24
#>   1 28 23

DescTools::YuleQ(dt)
#> [1] -0.07778669
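As a sanity check, computing \(Q\) by hand from the cells of the table above gives the same value:

a <- 25; b <- 24; c <- 28; d <- 23
(a * d - b * c) / (a * d + b * c) # (575 - 672) / (575 + 672) = -0.0778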
3.7.3.2.5 Tetrachoric Correlation
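  • 2 binary variables; a special case of the polychoric correlation (next section), assuming each binary variable is a discretized version of a latent normally distributed continuous variable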
library(psych)

n = 100 # (sample size)

data = data.frame(A = sample(c(0, 1), replace = TRUE, size = n),
                  B = sample(c(0, 1), replace = TRUE, size = n))

# view the first rows of the data
head(data)
#>   A B
#> 1 1 0
#> 2 1 0
#> 3 0 0
#> 4 1 0
#> 5 1 0
#> 6 1 0

table(data)
#>    B
#> A    0  1
#>   0 21 23
#>   1 34 22


# calculate tetrachoric correlation
tetrachoric(data)
#> Call: tetrachoric(x = data)
#> tetrachoric correlation 
#>   A    B   
#> A  1.0     
#> B -0.2  1.0
#> 
#>  with tau of 
#>     A     B 
#> -0.15  0.13
3.7.3.2.6 Polychoric Correlation
  • between ordinal categorical variables (natural order).
  • Assumption: the ordinal variable is a discrete representation of a latent, normally distributed continuous variable (e.g., income coded as low, medium, high).
library(polycor)

n = 100 # (sample size)

data = data.frame(A = sample(1:4, replace = TRUE, size = n),
                  B = sample(1:6, replace = TRUE, size = n))

head(data)
#>   A B
#> 1 1 3
#> 2 1 1
#> 3 3 5
#> 4 2 3
#> 5 3 5
#> 6 4 4


# calculate polychoric correlation between ratings
polychor(data$A, data$B)
#> [1] 0.01607982

3.7.4 Summary

library(tidyverse)

data("mtcars")
df = mtcars %>%
    dplyr::select(cyl, vs, carb)


df_factor = df %>%
    dplyr::mutate(
        cyl = factor(cyl),
        vs = factor(vs),
        carb = factor(carb)
    )
# summary(df)
str(df)
#> 'data.frame':    32 obs. of  3 variables:
#>  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
#>  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
#>  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
str(df_factor)
#> 'data.frame':    32 obs. of  3 variables:
#>  $ cyl : Factor w/ 3 levels "4","6","8": 2 2 1 2 3 2 3 1 1 2 ...
#>  $ vs  : Factor w/ 2 levels "0","1": 1 1 2 2 1 2 1 2 2 2 ...
#>  $ carb: Factor w/ 6 levels "1","2","3","4",..: 4 4 1 1 2 1 4 2 2 4 ...

Get the correlation table for continuous variables only

cor(df)
#>             cyl         vs       carb
#> cyl   1.0000000 -0.8108118  0.5269883
#> vs   -0.8108118  1.0000000 -0.5696071
#> carb  0.5269883 -0.5696071  1.0000000

# only complete obs
# cor(df, use = "complete.obs")

Alternatively, Hmisc::rcorr() also reports the sample size and p-values:

Hmisc::rcorr(as.matrix(df), type = "pearson")
#>        cyl    vs  carb
#> cyl   1.00 -0.81  0.53
#> vs   -0.81  1.00 -0.57
#> carb  0.53 -0.57  1.00
#> 
#> n= 32 
#> 
#> 
#> P
#>      cyl    vs     carb  
#> cyl         0.0000 0.0019
#> vs   0.0000        0.0007
#> carb 0.0019 0.0007
modelsummary::datasummary_correlation(df)
        cyl     vs   carb
cyl       1      .      .
vs     −.81      1      .
carb    .53   −.57      1
ggcorrplot::ggcorrplot(cor(df))

Comparing correlations across different types of variables (e.g., continuous vs. categorical) can be problematic, as can distinguishing non-linear from linear relationships. One solution is to use mutual information from information theory (i.e., how much knowing one variable reduces uncertainty about the other).
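For two discrete variables, mutual information is defined as

\[ I(X;Y) = \sum_{x}\sum_{y} p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)} \]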

To operationalize mutual information, we rely on the following chain of approximations:

\[ \downarrow \text{prediction error} \approx \downarrow \text{uncertainty} \approx \uparrow \text{association strength} \]

More specifically, following the X2Y metric, we have the following steps:

  1. Predict \(y\) without \(x\) (i.e., baseline model)

    1. Average of \(y\) when \(y\) is continuous

    2. Most frequent value when \(y\) is categorical

  2. Predict \(y\) with \(x\) (e.g., linear, random forest, etc.)

  3. Calculate the reduction in prediction error from step 1 to step 2

To have a comprehensive table that could handle

  • continuous vs. continuous

  • categorical vs. continuous

  • continuous vs. categorical

  • categorical vs. categorical

the suggested model would be Classification and Regression Trees (CART). But we can certainly use other models as well.
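A rough sketch of this idea (not the original X2Y implementation; x2y_score() is a hypothetical helper, with rpart standing in as the CART model):

library(rpart)

# % reduction in prediction error when using x to predict y
x2y_score <- function(x, y) {
    if (is.numeric(y)) {
        # baseline: predict the mean of y; error metric: MAE
        baseline_error <- mean(abs(y - mean(y)))
        fit <- rpart(y ~ x) # CART regression tree
        model_error <- mean(abs(y - predict(fit)))
    } else {
        y <- as.factor(y)
        # baseline: predict the most frequent class; error: misclassification rate
        mode_y <- names(which.max(table(y)))
        baseline_error <- mean(y != mode_y)
        fit <- rpart(y ~ x, method = "class") # CART classification tree
        model_error <- mean(predict(fit, type = "class") != y)
    }
    # 0 = x does not help at all; 100 = x predicts y perfectly
    100 * (baseline_error - model_error) / baseline_error
}

# note the asymmetry: x2y_score(x, y) need not equal x2y_score(y, x)
x2y_score(mtcars$cyl, mtcars$carb)
x2y_score(mtcars$carb, mtcars$cyl)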

This method has two main drawbacks:

  1. Asymmetry: \((x,y) \neq (y,x)\), since predicting \(y\) from \(x\) is not the same task as predicting \(x\) from \(y\)
  2. Comparability: different pairs of variables may be scored with different error metrics (e.g., misclassification error vs. MAE)
library(ppsr)

# keep only the first three columns of iris
iris <- iris %>% 
  select(1:3)

# ppsr::score_df(iris) # if you want a dataframe
ppsr::score_matrix(iris,
                   do_parallel = TRUE,
                   n_cores = parallel::detectCores() / 2)
#>              Sepal.Length Sepal.Width Petal.Length
#> Sepal.Length   1.00000000  0.04632352    0.5491398
#> Sepal.Width    0.06790301  1.00000000    0.2376991
#> Petal.Length   0.61608360  0.24263851    1.0000000

# if you want a similar correlation matrix
ppsr::score_matrix(df,
                   do_parallel = TRUE,
                   n_cores = parallel::detectCores() / 2)
#>             cyl        vs      carb
#> cyl  1.00000000 0.3982789 0.2092533
#> vs   0.02514286 1.0000000 0.2000000
#> carb 0.30798148 0.2537309 1.0000000

3.7.5 Visualization

corrplot::corrplot(cor(df))

Alternatively,

PerformanceAnalytics::chart.Correlation(df, histogram = T, pch = 19)

heatmap(as.matrix(df))

More generally, using the predictive power score:

ppsr::visualize_pps(
    df = iris,
    do_parallel = TRUE,
    n_cores = parallel::detectCores() / 2
)

ppsr::visualize_correlations(
    df = iris
)

Both the heatmap and the correlation matrix at the same time:

ppsr::visualize_both(
    df = iris,
    do_parallel = TRUE,
    n_cores = parallel::detectCores() / 2
)

Further customization with ggplot2:

ppsr::visualize_pps(
    df = iris,
    color_value_high = 'red',
    color_value_low = 'yellow',
    color_text = 'black'
) +
    ggplot2::theme_classic() +
    ggplot2::theme(plot.background = 
                       ggplot2::element_rect(fill = "lightgrey")) +
    ggplot2::theme(title = ggplot2::element_text(size = 15)) +
    ggplot2::labs(
        title = 'Correlation and Heatmap',
        subtitle = 'Subtitle',
        caption = 'Caption',
        x = 'More info'
    )