4.6 Divergence Metrics and Tests for Comparing Distributions
Divergence metrics are powerful tools used to measure the similarity or dissimilarity between probability distributions. Unlike deviation and deviance statistics, divergence metrics focus on the broader relationships between entire distributions, rather than individual data points or specific model fit metrics. Let’s clarify these differences:
- Deviation Statistics: Measure the difference between the realization of a variable and some reference value (e.g., the mean). Common statistics derived from deviations include:
- Standard deviation
- Average absolute deviation
- Median absolute deviation
- Maximum absolute deviation
- Deviance Statistics: Assess the goodness-of-fit of statistical models. These are analogous to the sum of squared residuals in ordinary least squares (OLS) but are generalized for use in cases with maximum likelihood estimation (MLE). Deviance statistics are frequently employed in generalized linear models (GLMs).
Divergence statistics differ fundamentally by focusing on statistical distances between entire probability distributions, rather than on individual data points or model errors.
1. Divergence Metrics
Definition: Divergence metrics measure how much one probability distribution differs from another.
Key Properties:
- Asymmetry: Many divergence metrics, such as Kullback-Leibler (KL) divergence, are not symmetric (i.e., \(D(P \| Q) \neq D(Q \| P)\)).
- Non-Metric: They don’t necessarily satisfy the properties of a metric (e.g., symmetry, triangle inequality).
- Unitless: Divergences are often expressed in terms of information (e.g., bits or nats).
When to Use:
- Use divergence metrics to assess the degree of mismatch between two probability distributions, especially in machine learning, statistical inference, or model evaluation.
2. Distance Metrics
Definition: Distance metrics measure the “distance” or dissimilarity between two objects, including probability distributions, datasets, or points in space.
Key Properties:
- Symmetry: \(D(P, Q) = D(Q, P)\).
- Triangle Inequality: \(D(P, R) \leq D(P, Q) + D(Q, R)\).
- Non-Negativity: \(D(P, Q) \geq 0\), with \(D(P, Q) = 0\) if and only if \(P = Q\).
When to Use:
- Use distance metrics to compare datasets, distributions, or clustering outcomes where symmetry and geometric properties are important.
Aspect | Divergence Metrics | Distance Metrics |
---|---|---|
Symmetry | Often asymmetric (e.g., KL divergence). | Always symmetric (e.g., Wasserstein). |
Triangle Inequality | Not necessarily satisfied. | Satisfied. |
Use Case | Quantifying how different distributions are. | Measuring the dissimilarity or “cost” of transformation. |
Applications of Divergence Metrics
Divergence metrics have found wide utility across domains, including:
- Detecting Data Drift in Machine Learning: Used to monitor whether the distribution of incoming data differs significantly from training data.
- Feature Selection: Employed to identify features with the most distinguishing power by comparing their distributions across different classes.
- Variational Autoencoders (VAEs): Divergence metrics (such as Kullback-Leibler divergence) are central to the loss functions used in training VAEs.
- Reinforcement Learning: Measure the similarity between policy distributions to improve decision-making processes.
- Assessing Consistency: Compare the distributions of two variables representing constructs to test their relationship or agreement.
Divergence metrics are also highly relevant in business settings, providing insights and solutions for a variety of applications, such as:
- Customer Segmentation and Targeting: Compare the distributions of customer demographics or purchase behavior across market segments to identify key differences and target strategies more effectively.
- Market Basket Analysis: Measure divergence between distributions of product co-purchases across regions or customer groups to optimize product bundling and cross-selling strategies.
- Marketing Campaign Effectiveness: Evaluate whether the distribution of customer responses (e.g., click-through rates or conversions) differs significantly before and after a marketing campaign, providing insights into its success.
- Fraud Detection: Monitor divergence in transaction patterns over time to detect anomalies that may indicate fraudulent activities.
- Supply Chain Optimization: Compare demand distributions across time periods or regions to optimize inventory allocation and reduce stock-outs or overstocking.
- Pricing Strategy Evaluation: Analyze the divergence between pricing and purchase distributions across products or customer segments to refine pricing models and improve profitability.
- Churn Prediction: Compare distributions of engagement metrics (e.g., frequency of transactions or usage time) between customers likely to churn and those who stay, to design retention strategies.
- Financial Portfolio Analysis: Assess divergence between the expected returns and actual performance distributions of different asset classes to adjust investment strategies.
4.6.1 Kolmogorov-Smirnov Test
The Kolmogorov-Smirnov (KS) test is a non-parametric test used to determine whether two distributions differ significantly or whether a sample distribution matches a reference distribution. It is applicable to continuous distributions and is widely used in hypothesis testing and model evaluation.
Mathematical Definition
The KS statistic is defined as:
\[ D = \sup_x |F_P(x) - F_Q(x)| \]
Where:
\(F_P(x)\) is the cumulative distribution function (CDF) of the first distribution (or sample).
\(F_Q(x)\) is the CDF of the second distribution (or theoretical reference distribution).
\(D\) measures the maximum vertical distance between the two CDFs.
Hypotheses
- Null Hypothesis (\(H_0\)): The empirical distribution follows a specified distribution (or the two samples are drawn from the same distribution).
- Alternative Hypothesis (\(H_1\)): The empirical distribution does not follow the specified distribution (or the two samples are drawn from different distributions).
Properties of the KS Statistic
- Range: \[
D \in [0, 1]
\]
- \(D = 0\): Perfect match between the distributions.
- \(D = 1\): Maximum dissimilarity between the distributions.
- Non-parametric Nature: The KS test makes no assumptions about the underlying distribution of the data.
The KS test is useful in various scenarios, including:
- Comparing two empirical distributions to evaluate similarity.
- Testing goodness-of-fit for a sample against a theoretical distribution.
- Detecting data drift or shifts in distributions over time.
- Validating simulation outputs by comparing them to real-world data.
Example 1: Continuous Distributions
# Load necessary libraries
library(stats)
# Generate two sample distributions
set.seed(1)
sample_1 <- rnorm(100) # Sample from a standard normal distribution
sample_2 <- rnorm(100, mean = 1) # Sample with mean shifted to 1
# Perform Kolmogorov-Smirnov test
ks_test_result <- ks.test(sample_1, sample_2)
print(ks_test_result)
#>
#> Asymptotic two-sample Kolmogorov-Smirnov test
#>
#> data: sample_1 and sample_2
#> D = 0.36, p-value = 4.705e-06
#> alternative hypothesis: two-sided
This compares the CDFs of the two samples. The p-value indicates whether the null hypothesis (that the samples come from the same distribution) can be rejected.
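To see how the statistic relates to its definition, \(D\) can also be computed directly from the two empirical CDFs. A minimal sketch, using the samples defined above:
# Evaluate both empirical CDFs on the pooled sample and take the largest vertical gap
ecdf_1 <- ecdf(sample_1)
ecdf_2 <- ecdf(sample_2)
pooled <- sort(c(sample_1, sample_2))
D_manual <- max(abs(ecdf_1(pooled) - ecdf_2(pooled)))
print(D_manual) # should match the D = 0.36 reported by ks.test() above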
Example 2: Discrete Data with Bootstrapped KS Test
For discrete data, a bootstrapped version of the KS test is often used to bypass the continuity requirement.
library(Matching)
# Define two discrete samples (identical here, so the test should not reject H0)
discrete_sample_1 <- c(0:10)
discrete_sample_2 <- c(0:10)
# Perform bootstrapped KS test
ks_boot_result <- ks.boot(Tr = discrete_sample_1, Co = discrete_sample_2)
print(ks_boot_result)
#> $ks.boot.pvalue
#> [1] 1
#>
#> $ks
#>
#> Exact two-sample Kolmogorov-Smirnov test
#>
#> data: Tr and Co
#> D = 0, p-value = 1
#> alternative hypothesis: two-sided
#>
#>
#> $nboots
#> [1] 1000
#>
#> attr(,"class")
#> [1] "ks.boot"
This method performs a bootstrapped version of the KS test, suitable for discrete data. The p-value indicates whether the null hypothesis (that the samples come from the same distribution) can be rejected.
Example 3: Comparing Multiple Distributions with KL Divergence
To extend the analysis beyond hypothesis tests, divergence measures such as KL divergence can be computed for every pair of samples:
library(entropy)
library(tidyverse)
# Define multiple samples
lst <- list(sample_1 = c(1:20), sample_2 = c(2:30), sample_3 = c(3:30))
# Compute KL divergence between all pairs of distributions
result <- expand.grid(1:length(lst), 1:length(lst)) %>%
rowwise() %>%
mutate(KL = KL.empirical(lst[[Var1]], lst[[Var2]]))
print(result)
#> # A tibble: 9 × 3
#> # Rowwise:
#> Var1 Var2 KL
#> <int> <int> <dbl>
#> 1 1 1 0
#> 2 2 1 0.150
#> 3 3 1 0.183
#> 4 1 2 0.704
#> 5 2 2 0
#> 6 3 2 0.0679
#> 7 1 3 0.622
#> 8 2 3 0.0870
#> 9 3 3 0
This calculates the KL divergence for all pairs of distributions in the list, offering additional insights into the relationships between the distributions.
4.6.2 Anderson-Darling Test
The Anderson-Darling (AD) test is a goodness-of-fit test that evaluates whether a sample of data comes from a specific distribution. It is an enhancement of the Kolmogorov-Smirnov test, with greater sensitivity to deviations in the tails of the distribution.
The Anderson-Darling test statistic is defined as:
\[ A^2 = -n - \frac{1}{n} \sum_{i=1}^n \left[ (2i - 1) \left( \log F(Y_i) + \log(1 - F(Y_{n+1-i})) \right) \right] \]
Where:
\(n\) is the sample size.
\(F\) is the cumulative distribution function (CDF) of the theoretical distribution being tested.
\(Y_i\) are the ordered sample values.
The AD test modifies the basic framework of the KS test by giving more weight to the tails of the distribution, making it particularly sensitive to tail discrepancies.
Hypotheses
- Null Hypothesis (\(H_0\)): The sample data follows the specified distribution.
- Alternative Hypothesis (\(H_1\)): The sample data does not follow the specified distribution.
Key Properties
- Tail Sensitivity: Unlike the Kolmogorov-Smirnov test, the Anderson-Darling test emphasizes discrepancies in the tails of the distribution.
- Distribution-Specific Critical Values: The AD test provides critical values tailored to the specific distribution being tested (e.g., normal, exponential).
The Anderson-Darling test is commonly used in:
- Testing goodness-of-fit for a sample against theoretical distributions such as normal, exponential, or uniform.
- Evaluating the appropriateness of parametric models in hypothesis testing.
- Assessing distributional assumptions in quality control and reliability analysis.
Example: Testing Normality with the Anderson-Darling Test
library(nortest)
# Generate a sample from a normal distribution
set.seed(1)
sample_data <- rnorm(100, mean = 0, sd = 1)
# Perform the Anderson-Darling test for normality
ad_test_result <- ad.test(sample_data)
print(ad_test_result)
#>
#> Anderson-Darling normality test
#>
#> data: sample_data
#> A = 0.16021, p-value = 0.9471
If the p-value is below a chosen significance level (e.g., 0.05), the null hypothesis that the data is normally distributed is rejected.
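To see how the \(A^2\) formula above maps to code, the statistic can be computed by hand, plugging the sample mean and standard deviation into the normal CDF. This is a simplified sketch; nortest applies an additional small-sample correction when computing the p-value, so expect the value to be close to (not necessarily identical to) the A reported above.
# Manual Anderson-Darling statistic for normality (parameters estimated from the data)
y <- sort(sample_data)
n <- length(y)
F_y <- pnorm(y, mean = mean(y), sd = sd(y)) # theoretical CDF at the ordered values
i <- seq_len(n)
A2_manual <- -n - mean((2 * i - 1) * (log(F_y) + log(1 - rev(F_y))))
print(A2_manual) # approximately the A = 0.16 reported by ad.test() above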
Example: Comparing Two Empirical Distributions
The AD test can also be applied to compare two empirical distributions using resampling techniques.
# Define two samples
set.seed(1)
sample_1 <- rnorm(100, mean = 0, sd = 1)
sample_2 <- rnorm(100, mean = 1, sd = 1)
# Perform resampling-based Anderson-Darling test (custom implementation or packages like twosamples)
library(twosamples)
ad_test_result_empirical <- ad_test(sample_1, sample_2)
print(ad_test_result_empirical)
#> Test Stat P-Value
#> 6796.70454 0.00025
This evaluates whether the two empirical distributions differ significantly.
4.6.3 Chi-Square Goodness-of-Fit Test
The Chi-Square Goodness-of-Fit Test is a non-parametric statistical test used to evaluate whether a sample data set comes from a population with a specific distribution. It compares observed frequencies with expected frequencies under a hypothesized distribution.
- Null Hypothesis (\(H_0\)): The data follow the specified distribution.
- Alternative Hypothesis (\(H_a\)): The data do not follow the specified distribution.
The Chi-Square test statistic is computed as:
\[ \chi^2 = \sum_{i=1}^k \frac{(O_i - E_i)^2}{E_i} \]
Where:
\(O_i\): Observed frequency for category \(i\).
\(E_i\): Expected frequency for category \(i\).
\(k\): Number of categories.
The test statistic follows a Chi-Square distribution with degrees of freedom:
\[ \nu = k - 1 - p \]
Where \(p\) is the number of parameters estimated from the data.
Assumptions of the Test
- Random Sampling: The sample data are drawn randomly from the population.
- Minimum Expected Frequency: The expected frequencies \(E_i\) are sufficiently large (typically \(E_i \geq 5\)).
- Independence: Observations in the sample are independent of each other.
Decision Rule
- Compute the test statistic \(\chi^2\) using the observed and expected frequencies.
- Determine the critical value \(\chi^2_{\alpha, \nu}\) for the chosen significance level \(\alpha\) and degrees of freedom \(\nu\).
- Compare \(\chi^2\) to \(\chi^2_{\alpha, \nu}\):
- Reject \(H_0\) if \(\chi^2 > \chi^2_{\alpha, \nu}\).
- Alternatively, use the p-value approach:
- Reject \(H_0\) if \(p \leq \alpha\).
- Fail to reject \(H_0\) if \(p > \alpha\).
Steps for the Chi-Square Goodness-of-Fit Test
- Define the expected frequencies based on the hypothesized distribution.
- Compute the observed frequencies from the data.
- Calculate the test statistic \(\chi^2\).
- Determine the degrees of freedom \(\nu\).
- Compare \(\chi^2\) with the critical value or use the p-value for decision-making.
Example: Testing a Fair Die
Suppose you are testing whether a six-sided die is fair. The die is rolled 60 times, and the observed frequencies of the outcomes are:
- Observed Frequencies: \([10, 12, 8, 11, 9, 10]\)
- Expected Frequencies: A fair die has equal probability for each face, so \(E_i = 60 / 6 = 10\) for each face.
# Observed frequencies
observed <- c(10, 12, 8, 11, 9, 10)
# Expected frequencies under a fair die
expected <- rep(10, 6)
# Perform Chi-Square Goodness-of-Fit Test
chisq_test <- chisq.test(x = observed, p = expected / sum(expected))
# Display results
chisq_test
#>
#> Chi-squared test for given probabilities
#>
#> data: observed
#> X-squared = 1, df = 5, p-value = 0.9626
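The same result can be reproduced by applying the formula and the decision rule directly:
# Manual computation of the chi-square statistic and p-value
chi_sq_manual <- sum((observed - expected)^2 / expected) # (10-10)^2/10 + (12-10)^2/10 + ... = 1
dof <- length(observed) - 1 # no parameters estimated, so df = k - 1 = 5
p_value <- pchisq(chi_sq_manual, df = dof, lower.tail = FALSE)
print(c(statistic = chi_sq_manual, df = dof, p.value = p_value)) # matches chisq.test() above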
Example: Testing a Loaded Die
For a die with unequal probabilities (e.g., a loaded die), the expected probabilities are defined explicitly:
# Observed frequencies
observed <- c(10, 12, 8, 11, 9, 10)
# Expected probabilities (e.g., for a loaded die)
probabilities <- c(0.1, 0.2, 0.3, 0.1, 0.2, 0.1)
# Expected frequencies
expected <- probabilities * sum(observed)
# Perform Chi-Square Goodness-of-Fit Test
chisq_test_loaded <- chisq.test(x = observed, p = probabilities)
# Display results
chisq_test_loaded
#>
#> Chi-squared test for given probabilities
#>
#> data: observed
#> X-squared = 15.806, df = 5, p-value = 0.007422
Limitations of the Chi-Square Test
- Minimum Expected Frequency: If \(E_i < 5\) for any category, the chi-square approximation becomes unreliable. Consider merging categories to meet this criterion.
- Independence: Assumes observations are independent. Violations of this assumption can invalidate the test.
- Sample Size Sensitivity: Large sample sizes may result in significant \(\chi^2\) values even for minor deviations from the expected distribution.
The Chi-Square Goodness-of-Fit Test is a versatile tool for evaluating the fit of observed data to a hypothesized distribution, widely used in fields like quality control, genetics, and market research.
4.6.4 Cramér-von Mises Test
The Cramér-von Mises (CvM) Test is a goodness-of-fit test that evaluates whether a sample data set comes from a specified distribution. Similar to the Kolmogorov-Smirnov Test (KS) and Anderson-Darling Test (AD), it assesses the discrepancy between the empirical and theoretical cumulative distribution functions (CDFs). However, the CvM test has equal sensitivity across the entire distribution, unlike the KS test (focused on the maximum difference) or the AD test (emphasizing the tails).
The Cramér-von Mises test statistic is defined as:
\[ W^2 = n \int_{-\infty}^{\infty} \left( F_n(x) - F(x) \right)^2 dF(x) \]
Where:
\(n\) is the sample size.
\(F_n(x)\) is the empirical cumulative distribution function (ECDF) of the sample.
\(F(x)\) is the CDF of the specified theoretical distribution.
For practical implementation, the test statistic is often computed as:
\[ W^2 = \sum_{i=1}^n \left[ F(X_i) - \frac{2i - 1}{2n} \right]^2 + \frac{1}{12n} \]
Where \(X_i\) are the ordered sample values.
Hypotheses
- Null Hypothesis (\(H_0\)): The sample data follow the specified distribution.
- Alternative Hypothesis (\(H_a\)): The sample data do not follow the specified distribution.
Key Properties
- Equal Sensitivity:
- The CvM test gives equal weight to discrepancies across all parts of the distribution, unlike the AD test, which emphasizes the tails.
- Non-parametric:
- The test makes no strong parametric assumptions about the data, aside from the specified distribution.
- Complementary to KS and AD Tests:
- While the KS test focuses on the maximum distance between CDFs and the AD test emphasizes tails, the CvM test provides a balanced sensitivity across the entire range of the distribution.
The Cramér-von Mises test is widely used in:
- Goodness-of-Fit Testing: Assessing whether data follow a specified theoretical distribution (e.g., normal, exponential).
- Model Validation: Evaluating the fit of probabilistic models in statistical and machine learning contexts.
- Complementary Testing: Used alongside KS and AD tests for a comprehensive analysis of distributional assumptions.
Example 1: Testing Normality
library(nortest)
# Generate a sample from a normal distribution
set.seed(1)
sample_data <- rnorm(100, mean = 0, sd = 1)
# Perform the Cramér-von Mises test for normality
cvm_test_result <- cvm.test(sample_data)
print(cvm_test_result)
#>
#> Cramer-von Mises normality test
#>
#> data: sample_data
#> W = 0.026031, p-value = 0.8945
The test evaluates whether the sample data follow a normal distribution.
Example 2: Goodness-of-Fit for Custom Distributions
For distributions other than the normal, you can use resampling techniques or a custom implementation. Here is a simplified comparison of the empirical and theoretical CDFs for an exponential distribution:
# Custom ECDF and theoretical CDF comparison
set.seed(1)
sample_data <-
rexp(100, rate = 1) # Sample from exponential distribution
theoretical_cdf <-
function(x) {
pexp(x, rate = 1)
} # Exponential CDF
# Compute empirical CDF
empirical_cdf <- ecdf(sample_data)
# Compute the mean squared difference between the empirical and theoretical CDFs
# (a rough proxy for the CvM statistic, rather than the exact W^2 formula above)
cvm_statistic <-
sum((empirical_cdf(sample_data) - theoretical_cdf(sample_data)) ^ 2) / length(sample_data)
print(paste("Cramér-von Mises Statistic (Custom):", round(cvm_statistic, 4)))
#> [1] "Cramér-von Mises Statistic (Custom): 0.0019"
This demonstrates a simplified custom comparison between the empirical and theoretical CDFs for an exponential distribution; it is a rough proxy rather than the exact \(W^2\) statistic defined above.
- Normality Test: The cvm.test function evaluates whether the sample data follow a normal distribution. A small p-value indicates a significant deviation from normality.
- Custom Goodness-of-Fit: A custom implementation allows testing against distributions other than the normal. The statistic summarizes the squared differences between the empirical and theoretical CDFs (the exact practical formula is sketched below).
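For completeness, the practical form of \(W^2\) given earlier can be coded directly against the exponential CDF. A minimal sketch using the sample from Example 2 (its value will differ somewhat from the simplified proxy above):
# Cramér-von Mises statistic via the practical formula
x_sorted <- sort(sample_data) # the exponential sample from Example 2
n <- length(x_sorted)
u <- pexp(x_sorted, rate = 1) # theoretical CDF at the ordered sample values
W2_manual <- sum((u - (2 * seq_len(n) - 1) / (2 * n))^2) + 1 / (12 * n)
print(W2_manual)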
Advantages and Limitations
Advantages:
Balanced sensitivity across the entire distribution.
Complements KS and AD tests by providing a different perspective on goodness-of-fit.
Limitations:
Critical values are distribution-specific.
The test may be less sensitive to tail deviations compared to the AD test.
The Cramér-von Mises test is a robust and versatile goodness-of-fit test, offering balanced sensitivity across the entire distribution. Its complementarity to KS and AD tests makes it an essential tool for validating distributional assumptions in both theoretical and applied contexts.
4.6.5 Kullback-Leibler Divergence
Kullback-Leibler (KL) divergence, also known as relative entropy, quantifies how one probability distribution differs from a second, reference distribution. It plays a critical role in statistical inference, machine learning, and information theory. However, KL divergence is not a true metric, as it is asymmetric and does not satisfy the triangle inequality.
Key Properties of KL Divergence
Not a Metric: KL divergence fails to meet the triangle inequality requirement, and it is not symmetric, meaning: \[ D_{KL}(P \| Q) \neq D_{KL}(Q \| P) \]
Generalization to Multivariate Case: KL divergence can be extended for multivariate distributions, making it flexible for complex analyses.
Quantifies Information Loss: It measures the “information loss” when approximating the true distribution \(P\) with the predicted distribution \(Q\). Thus, smaller values indicate closer similarity between the distributions.
Mathematical Definitions
KL divergence is defined differently for discrete and continuous distributions.
1. Discrete Case
For two discrete probability distributions \(P = \{P_i\}\) and \(Q = \{Q_i\}\), the KL divergence is given by: \[
D_{KL}(P \| Q) = \sum_i P_i \log\left(\frac{P_i}{Q_i}\right)
\]
2. Continuous Case
For continuous probability density functions \(P(x)\) and \(Q(x)\): \[
D_{KL}(P \| Q) = \int P(x) \log\left(\frac{P(x)}{Q(x)}\right) dx
\]
- Range: \[
D_{KL}(P \| Q) \in [0, \infty)
\]
- \(D_{KL} = 0\) indicates identical distributions (\(P = Q\)).
- Larger values indicate greater dissimilarity between \(P\) and \(Q\).
- Non-Symmetric Nature: As noted, \(D_{KL}(P \| Q)\) and \(D_{KL}(Q \| P)\) are not equal, emphasizing its directed nature.
library(philentropy)
# Example 1: Continuous case
# Define two continuous probability distributions with distinct patterns
X_continuous <- c(0.1, 0.2, 0.3, 0.4) # Normalized to sum to 1
Y_continuous <- c(0.4, 0.3, 0.2, 0.1) # Normalized to sum to 1
# Calculate KL divergence (logarithm base 2)
KL_continuous <- KL(rbind(X_continuous, Y_continuous), unit = "log2")
print(paste("KL divergence (continuous):", round(KL_continuous, 2)))
#> [1] "KL divergence (continuous): 0.66"
# Example 2: Discrete case
# Define two discrete probability distributions
X_discrete <- c(5, 10, 15, 20) # Counts for events
Y_discrete <- c(20, 15, 10, 5) # Counts for events
# Estimate probabilities empirically and compute KL divergence
KL_discrete <- KL(rbind(X_discrete, Y_discrete), est.prob = "empirical")
print(paste("KL divergence (discrete):", round(KL_discrete, 2)))
#> [1] "KL divergence (discrete): 0.66"
Insights:
- In Example 1, the probability values are supplied directly and are already normalized.
- In Example 2, the probabilities are estimated empirically from the raw counts.
- Observe how KL divergence quantifies the mismatch between the two distributions; because the counts in Example 2 normalize to exactly the probabilities in Example 1, both examples return the same value (0.66).
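To connect the output to the discrete formula above, the divergence in Example 1 can also be evaluated term by term (base-2 logarithm, so the result is in bits):
# Direct evaluation of the discrete KL formula for Example 1
kl_manual <- sum(X_continuous * log2(X_continuous / Y_continuous))
print(kl_manual) # approximately 0.66, matching KL() above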
4.6.6 Jensen-Shannon Divergence
Jensen-Shannon (JS) divergence is a symmetric and bounded measure of the similarity between two probability distributions. It is derived from the Kullback-Leibler Divergence (KL) but addresses its asymmetry and unboundedness by incorporating a mixed distribution.
The Jensen-Shannon divergence is defined as: \[ D_{JS}(P \| Q) = \frac{1}{2} \left( D_{KL}(P \| M) + D_{KL}(Q \| M) \right) \] where:
\(M = \frac{1}{2}(P + Q)\) is the mixed distribution, representing the average of \(P\) and \(Q\).
\(D_{KL}\) is the Kullback-Leibler divergence.
Key Properties
Symmetry: Unlike KL divergence, JS divergence is symmetric: \[ D_{JS}(P \| Q) = D_{JS}(Q \| P) \]
Boundedness:
- For base-2 logarithms: \[ D_{JS} \in [0, 1] \]
- For natural logarithms (base-\(e\)): \[ D_{JS} \in [0, \ln(2)] \]
Interpretability: The JS divergence measures the average information gain when moving from the mixed distribution \(M\) to either \(P\) or \(Q\). Its bounded nature makes it easier to compare across datasets.
# Load the required library
library(philentropy)
# Example 1: Continuous case
# Define two numeric sequences of unequal length (raw values, not normalized probabilities)
X_continuous <- 1:10 # Shorter sequence (recycled by rbind below)
Y_continuous <- 1:20 # Longer sequence
# Compute JS divergence (logarithm base 2)
JS_continuous <- JSD(rbind(X_continuous, Y_continuous), unit = "log2")
print(paste("JS divergence (continuous):", round(JS_continuous, 2)))
#> [1] "JS divergence (continuous): 20.03"
# Because the inputs are raw, unnormalized values (and rbind recycles the shorter vector),
# the result is not confined to the [0, 1] bound that holds for probability vectors;
# compare Example 2, where counts are converted to probabilities via est.prob.
# The mixed distribution (M) is computed internally as the average of the two rows.
# Example 2: Discrete case
# Define two discrete distributions
X_discrete <- c(5, 10, 15, 20) # Observed counts for events
Y_discrete <- c(20, 15, 10, 5) # Observed counts for events
# Compute JS divergence with empirical probability estimation
JS_discrete <- JSD(rbind(X_discrete, Y_discrete), est.prob = "empirical")
print(paste("JS divergence (discrete):", round(JS_discrete, 2)))
#> [1] "JS divergence (discrete): 0.15"
# X_discrete and Y_discrete represent event counts.
# Probabilities are estimated empirically before calculating the divergence.
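The definition can be checked by hand for the discrete example: form the mixture \(M\) and average the two KL divergences (base-2 logarithm, so the result is in bits).
# Manual Jensen-Shannon divergence for the discrete example
P <- X_discrete / sum(X_discrete) # normalize counts to probabilities
Q <- Y_discrete / sum(Y_discrete)
M <- (P + Q) / 2 # mixed distribution
js_manual <- 0.5 * sum(P * log2(P / M)) + 0.5 * sum(Q * log2(Q / M))
print(js_manual) # approximately 0.15, matching JSD() above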
4.6.7 Hellinger Distance
The Hellinger distance is a bounded and symmetric measure of similarity between two probability distributions. It is widely used in statistics and machine learning to quantify how “close” two distributions are, with values ranging between 0 (identical distributions) and 1 (completely disjoint distributions).
Mathematical Definition
The Hellinger distance between two probability distributions \(P\) and \(Q\) is defined as:
\[ H(P, Q) = \frac{1}{\sqrt{2}} \sqrt{\sum_x \left(\sqrt{P(x)} - \sqrt{Q(x)}\right)^2} \]
Where:
\(P(x)\) and \(Q(x)\) are the probability densities or probabilities at point \(x\) for the distributions \(P\) and \(Q\).
Working with the square roots \(\sqrt{P(x)}\) and \(\sqrt{Q(x)}\) makes the comparison geometric: up to scaling, the Hellinger distance is the Euclidean distance between the square-root vectors of the two distributions.
Alternatively, for continuous distributions, the Hellinger distance can be expressed as:
\[ H(P, Q) = \frac{1}{\sqrt{2}} \sqrt{\int \left(\sqrt{P(x)} - \sqrt{Q(x)}\right)^2 dx} \]
Key Properties
Symmetry: \[ H(P, Q) = H(Q, P) \] The distance is symmetric, unlike Kullback-Leibler divergence.
Boundedness: \[ H(P, Q) \in [0, 1] \]
- \(H = 0\): The distributions are identical (\(P(x) = Q(x)\) for all \(x\)).
- \(H = 1\): The distributions have no overlap (their supports are disjoint, so \(P(x) Q(x) = 0\) for every \(x\)).
Interpretability:
- Hellinger distance provides a scale-invariant measure, making it suitable for comparing distributions in various contexts.
Hellinger distance is widely used in:
- Hypothesis Testing: Comparing empirical distributions to theoretical models.
- Machine Learning: Feature selection, classification, and clustering tasks.
- Bayesian Analysis: Quantifying differences between prior and posterior distributions.
- Economics and Ecology: Measuring dissimilarity in distributions like income, species abundance, or geographical data.
library(philentropy)
# Example 1: Compute Hellinger Distance for Discrete Distributions
# Define two discrete distributions as probabilities
P_discrete <- c(0.1, 0.2, 0.3, 0.4) # Normalized probabilities
Q_discrete <- c(0.3, 0.3, 0.2, 0.2) # Normalized probabilities
# Compute Hellinger distance
hellinger_discrete <- distance(rbind(P_discrete, Q_discrete), method = "hellinger")
print(paste("Hellinger Distance (Discrete):", round(hellinger_discrete, 4)))
#> [1] "Hellinger Distance (Discrete): 0.465"
# Example 2: Compute Hellinger Distance for Empirical Distributions
# Define two empirical distributions (counts)
P_empirical <- c(10, 20, 30, 40) # Counts for distribution P
Q_empirical <- c(30, 30, 20, 20) # Counts for distribution Q
# Normalize counts to probabilities
P_normalized <- P_empirical / sum(P_empirical)
Q_normalized <- Q_empirical / sum(Q_empirical)
# Compute Hellinger distance
hellinger_empirical <- distance(rbind(P_normalized, Q_normalized), method = "hellinger")
print(paste("Hellinger Distance (Empirical):", round(hellinger_empirical, 4)))
#> [1] "Hellinger Distance (Empirical): 0.465"
4.6.8 Bhattacharyya Distance
The Bhattacharyya Distance is a statistical measure used to quantify the similarity or overlap between two probability distributions. It is commonly used in pattern recognition, signal processing, and statistics to evaluate how closely related two distributions are. The Bhattacharyya distance is particularly effective for comparing both discrete and continuous distributions.
The Bhattacharyya distance between two probability distributions \(P\) and \(Q\) is defined as:
\[ D_B(P, Q) = -\ln \left( \sum_x \sqrt{P(x) Q(x)} \right) \]
For continuous distributions, the Bhattacharyya distance is expressed as:
\[ D_B(P, Q) = -\ln \left( \int \sqrt{P(x) Q(x)} dx \right) \]
Where:
\(P(x)\) and \(Q(x)\) are the probability densities or probabilities for the distributions \(P\) and \(Q\).
The term \(\int \sqrt{P(x) Q(x)} dx\) is known as the Bhattacharyya coefficient.
Key Properties
Symmetry: \[ D_B(P, Q) = D_B(Q, P) \]
Range: \[ D_B(P, Q) \in [0, \infty) \]
- \(D_B = 0\): The distributions are identical (\(P = Q\)).
- Larger values indicate less overlap and greater dissimilarity between \(P\) and \(Q\).
Relation to Hellinger Distance:
- The Bhattacharyya coefficient is related to the Hellinger distance: \[ H(P, Q) = \sqrt{1 - \sum_x \sqrt{P(x) Q(x)}} \]
The Bhattacharyya distance is widely used in:
- Classification: Measuring the similarity between feature distributions in machine learning.
- Hypothesis Testing: Evaluating the closeness of observed data to a theoretical model.
- Image Processing: Comparing pixel intensity distributions or color histograms.
- Economics and Ecology: Assessing similarity in income distributions or species abundance.
Example 1: Discrete Distributions
# Define two discrete probability distributions
P_discrete <- c(0.1, 0.2, 0.3, 0.4) # Normalized probabilities
Q_discrete <- c(0.3, 0.3, 0.2, 0.2) # Normalized probabilities
# Compute Bhattacharyya coefficient
bhattacharyya_coefficient <- sum(sqrt(P_discrete * Q_discrete))
# Compute Bhattacharyya distance
bhattacharyya_distance <- -log(bhattacharyya_coefficient)
# Display results
print(paste(
"Bhattacharyya Coefficient:",
round(bhattacharyya_coefficient, 4)
))
#> [1] "Bhattacharyya Coefficient: 0.9459"
print(paste(
"Bhattacharyya Distance (Discrete):",
round(bhattacharyya_distance, 4)
))
#> [1] "Bhattacharyya Distance (Discrete): 0.0556"
A smaller Bhattacharyya distance indicates greater similarity between the two distributions.
Example 2: Continuous Distributions (Approximation)
For continuous distributions, the Bhattacharyya distance can be approximated using numerical integration or discretization.
# Generate two continuous distributions
set.seed(1)
P_continuous <-
rnorm(1000, mean = 0, sd = 1) # Standard normal distribution
Q_continuous <-
rnorm(1000, mean = 1, sd = 1) # Normal distribution with mean 1
# Create histograms to approximate probabilities
hist_P <- hist(P_continuous, breaks = 50, plot = FALSE)
hist_Q <- hist(Q_continuous, breaks = 50, plot = FALSE)
# Normalize histograms to probabilities
prob_P <- hist_P$counts / sum(hist_P$counts)
prob_Q <- hist_Q$counts / sum(hist_Q$counts)
# Compute Bhattacharyya coefficient
bhattacharyya_coefficient_continuous <- sum(sqrt(prob_P * prob_Q))
# Compute Bhattacharyya distance
bhattacharyya_distance_continuous <-
-log(bhattacharyya_coefficient_continuous)
# Display results
print(paste(
"Bhattacharyya Coefficient (Continuous):",
round(bhattacharyya_coefficient_continuous, 4)
))
#> [1] "Bhattacharyya Coefficient (Continuous): 0.9823"
print(paste(
"Bhattacharyya Distance (Continuous Approximation):",
round(bhattacharyya_distance_continuous, 4)
))
#> [1] "Bhattacharyya Distance (Continuous Approximation): 0.0178"
Continuous distributions are discretized into histograms to compute the Bhattacharyya coefficient and distance.
- Discrete Case:
- The Bhattacharyya coefficient quantifies the overlap between \(P\) and \(Q\).
- The Bhattacharyya distance translates this overlap into a logarithmic measure of dissimilarity.
- Continuous Case:
- Distributions are discretized into histograms to approximate the Bhattacharyya coefficient and distance.
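One caveat with the histogram approximation above: hist() chooses its breakpoints separately for each sample, so the two probability vectors do not necessarily refer to the same bins. A sketch of the same computation using a shared set of breaks (the resulting value will differ slightly from the one printed above):
# Recompute the Bhattacharyya coefficient and distance on aligned bins
breaks <- seq(min(c(P_continuous, Q_continuous)), max(c(P_continuous, Q_continuous)), length.out = 51) # 50 shared bins
prob_P_shared <- hist(P_continuous, breaks = breaks, plot = FALSE)$counts / length(P_continuous)
prob_Q_shared <- hist(Q_continuous, breaks = breaks, plot = FALSE)$counts / length(Q_continuous)
bc_shared <- sum(sqrt(prob_P_shared * prob_Q_shared))
print(-log(bc_shared)) # Bhattacharyya distance with a common binning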
4.6.9 Wasserstein Distance
The Wasserstein distance, also known as the Earth Mover’s Distance (EMD), is a measure of similarity between two probability distributions. It quantifies the “cost” of transforming one distribution into another, making it particularly suitable for continuous data and applications where the geometry of the data matters.
Mathematical Definition
The Wasserstein distance of order \(p\) between two one-dimensional probability distributions \(P\) and \(Q\) is defined as:
\[ W_p(P, Q) = \left( \int_0^1 \left| F_P^{-1}(u) - F_Q^{-1}(u) \right|^p du \right)^{\frac{1}{p}} \]
Where:
\(F_P^{-1}(u)\) and \(F_Q^{-1}(u)\) are the quantile functions (inverse CDFs) of \(P\) and \(Q\).
\(p \geq 1\) is the order of the Wasserstein distance (commonly \(p = 1\)).
For the case of \(p = 1\), the formula reduces to the area between the two CDFs over the domain \(\mathcal{X}\):
\[ W_1(P, Q) = \int_{\mathcal{X}} |F_P(x) - F_Q(x)| dx \]
This represents the minimum “cost” of transforming the distribution \(P\) into \(Q\), where cost is proportional to the distance a “unit of mass” must move.
Key Properties
- Interpretability: Represents the “effort” required to morph one distribution into another.
- Metric: Wasserstein distance satisfies the properties of a metric, including symmetry, non-negativity, and the triangle inequality.
- Flexibility: Can handle both empirical and continuous distributions.
Wasserstein distance is widely used in various fields, including:
- Machine Learning:
- Training generative models such as Wasserstein GANs.
- Monitoring data drift in online systems.
- Statistics:
- Comparing empirical distributions derived from observed data.
- Robustness testing under distributional shifts.
- Economics:
- Quantifying disparities in income or wealth distributions.
- Image Processing:
- Measuring structural differences between image distributions.
library(transport)
library(twosamples)
# Example 1: Compute Wasserstein Distance (1D case)
set.seed(1)
dist_1 <- rnorm(100) # Generate a sample from a standard normal distribution
dist_2 <- rnorm(100, mean = 1) # Generate a sample with mean shifted to 1
# Calculate the Wasserstein distance
wass_distance <- wasserstein1d(dist_1, dist_2)
print(paste("1D Wasserstein Distance:", round(wass_distance, 4)))
#> [1] "1D Wasserstein Distance: 0.8533"
# Example 2: Wasserstein Metric as a Statistic
set.seed(1)
wass_stat_value <- wass_stat(dist_1, dist_2)
print(paste("Wasserstein Statistic:", round(wass_stat_value, 4)))
#> [1] "Wasserstein Statistic: 0.8533"
# Example 3: Wasserstein Test (Permutation-based Two-sample Test)
set.seed(1)
wass_test_result <- wass_test(dist_1, dist_2)
print(wass_test_result)
#> Test Stat P-Value
#> 0.8533046 0.0002500
# - Example 1 calculates the 1-Wasserstein distance between the two samples with transport::wasserstein1d.
# - Example 2 computes the same quantity with twosamples::wass_stat, packaged as a test statistic.
# - Example 3 performs a permutation-based two-sample test using the Wasserstein metric.
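For two samples of equal size, the 1-Wasserstein distance reduces to the average absolute difference between matched order statistics, which gives a quick sanity check on the value above:
# 1-Wasserstein distance via sorted samples (equal sample sizes)
w1_manual <- mean(abs(sort(dist_1) - sort(dist_2)))
print(w1_manual) # should reproduce the 0.8533 reported by wasserstein1d() above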
4.6.10 Energy Distance
The Energy Distance is a statistical metric used to quantify the similarity between two probability distributions. It is particularly effective for comparing multi-dimensional distributions.
The Energy Distance between two distributions \(P\) and \(Q\) is defined as:
\[ E(P, Q) = 2 \mathbb{E}[||X - Y||] - \mathbb{E}[||X - X'||] - \mathbb{E}[||Y - Y'||] \]
Where:
\(X\) and \(X'\) are independent and identically distributed (i.i.d.) random variables from \(P\).
\(Y\) and \(Y'\) are i.i.d. random variables from \(Q\).
\(||\cdot||\) denotes the Euclidean distance.
Alternatively, for empirical distributions, the Energy Distance can be approximated as:
\[ E(P, Q) = \frac{2}{mn} \sum_{i=1}^m \sum_{j=1}^n ||X_i - Y_j|| - \frac{1}{m^2} \sum_{i=1}^m \sum_{j=1}^m ||X_i - X_j|| - \frac{1}{n^2} \sum_{i=1}^n \sum_{j=1}^n ||Y_i - Y_j|| \]
Where:
\(m\) and \(n\) are the sample sizes from distributions \(P\) and \(Q\) respectively.
\(X_i\) and \(Y_j\) are samples from \(P\) and \(Q\).
Key Properties
- Metric:
- Energy distance satisfies the properties of a metric: symmetry, non-negativity, and the triangle inequality.
- Range: \[
E(P, Q) \geq 0
\]
- \(E(P, Q) = 0\): The distributions are identical.
- Larger values indicate greater dissimilarity.
- Effectiveness for Multi-dimensional Data:
- Energy distance is designed to work well in higher-dimensional spaces, unlike some traditional metrics.
The Energy Distance is widely used in:
- Hypothesis Testing: Testing whether two distributions are the same.
- Energy Test for equality of distributions.
- Clustering: Measuring dissimilarity between clusters in multi-dimensional data.
- Feature Selection: Comparing distributions of features across different classes to evaluate their discriminative power.
Example 1: Comparing Two Distributions
# Load the 'energy' package
library(energy)
# Generate two sample distributions
set.seed(1)
X <- matrix(rnorm(1000, mean = 0, sd = 1), ncol = 2) # Distribution P
Y <- matrix(rnorm(1000, mean = 1, sd = 1), ncol = 2) # Distribution Q
# Combine X and Y and create a group identifier
combined <- rbind(X, Y)
groups <- c(rep(1, nrow(X)), rep(2, nrow(Y)))
# Compute Energy Distance
energy_dist <- edist(combined, sizes = table(groups))
# Print the Energy Distance
print(paste("Energy Distance:", round(energy_dist, 4)))
#> [1] "Energy Distance: 201.9202"
This calculates the energy distance between two multi-dimensional distributions.
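As a cross-check, the empirical formula above can be coded directly from pairwise Euclidean distances. Note that the statistic reported by edist() appears to include an additional factor of \(\frac{mn}{m+n}\) (the scaling used in the energy test), so the raw value below is expected to differ from the printed 201.92 by roughly that factor; this is a sketch under that assumption:
# Direct evaluation of the empirical energy distance formula
d_all <- as.matrix(dist(rbind(X, Y))) # pairwise Euclidean distances of the pooled sample
m <- nrow(X)
n <- nrow(Y)
d_xy <- mean(d_all[1:m, (m + 1):(m + n)]) # average cross-distance, estimating E||X - Y||
d_xx <- mean(d_all[1:m, 1:m]) # average within-X distance, estimating E||X - X'||
d_yy <- mean(d_all[(m + 1):(m + n), (m + 1):(m + n)]) # average within-Y distance, estimating E||Y - Y'||
E_hat <- 2 * d_xy - d_xx - d_yy
print(E_hat) # empirical E(P, Q) as defined above
print(E_hat * m * n / (m + n)) # assumed edist() scaling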
Example 2: Energy Test for Equality of Distributions
# Perform the Energy Test
energy_test <-
eqdist.etest(rbind(X, Y), sizes = c(nrow(X), nrow(Y)), R = 999)
print(energy_test)
#>
#> Multivariate 2-sample E-test of equal distributions
#>
#> data: sample sizes 500 500, replicates 999
#> E-statistic = 201.92, p-value = 0.001
The energy test evaluates the null hypothesis that the two distributions are identical.
Energy Distance:
- Provides a single metric to quantify the dissimilarity between two distributions, considering all dimensions of the data.
Energy Test:
- Tests for equality of distributions using Energy Distance.
- The p-value indicates whether the distributions are significantly different.
Advantages of Energy Distance
- Multi-dimensional Applicability: Works seamlessly with high-dimensional data, unlike some divergence metrics which may suffer from dimensionality issues.
- Non-parametric: Makes no assumptions about the form of the distributions.
- Robustness: Effective even with complex data structures.
4.6.11 Total Variation Distance
The Total Variation (TV) Distance is a measure of the maximum difference between two probability distributions. It is widely used in probability theory, statistics, and machine learning to quantify how dissimilar two distributions are.
The Total Variation Distance between two probability distributions \(P\) and \(Q\) is defined as:
\[ D_{TV}(P, Q) = \frac{1}{2} \sum_x |P(x) - Q(x)| \]
Where:
\(P(x)\) and \(Q(x)\) are the probabilities assigned to the outcome \(x\) by the distributions \(P\) and \(Q\).
The factor \(\frac{1}{2}\) ensures that the distance lies within the range \([0, 1]\).
Alternatively, for continuous distributions, the TV distance can be expressed as:
\[ D_{TV}(P, Q) = \frac{1}{2} \int |P(x) - Q(x)| dx \]
Key Properties
Range: \[ D_{TV}(P, Q) \in [0, 1] \]
- \(D_{TV} = 0\): The distributions are identical (\(P = Q\)).
- \(D_{TV} = 1\): The distributions are completely disjoint (no overlap).
Symmetry: \[ D_{TV}(P, Q) = D_{TV}(Q, P) \]
Interpretability:
- \(D_{TV}(P, Q)\) equals the largest possible difference between the probabilities that \(P\) and \(Q\) assign to the same event; equivalently, it is the minimum amount of probability mass that must be moved to transform \(P\) into \(Q\).
The Total Variation Distance is used in:
- Hypothesis Testing: Quantifying the difference between observed and expected distributions.
- Machine Learning: Evaluating similarity between predicted and true distributions.
- Information Theory: Comparing distributions in contexts like communication and cryptography.
Example 1: Discrete Distributions
# Define two discrete probability distributions
P_discrete <- c(0.1, 0.2, 0.3, 0.4) # Normalized probabilities
Q_discrete <- c(0.3, 0.3, 0.2, 0.2) # Normalized probabilities
# Compute Total Variation Distance
tv_distance <- sum(abs(P_discrete - Q_discrete)) / 2
print(paste("Total Variation Distance (Discrete):", round(tv_distance, 4)))
#> [1] "Total Variation Distance (Discrete): 0.3"
This computes half the sum of absolute differences between the two probability vectors, which equals the largest difference in probability that the two distributions assign to any event and lies between 0 and 1.
Example 2: Continuous Distributions (Approximation)
For continuous distributions, the TV distance can be approximated using discretization or numerical integration. Here’s an example using random samples:
# Generate two continuous distributions
set.seed(1)
P_continuous <-
rnorm(1000, mean = 0, sd = 1) # Standard normal distribution
Q_continuous <-
rnorm(1000, mean = 1, sd = 1) # Normal distribution with mean 1
# Create histograms to approximate probabilities
hist_P <- hist(P_continuous, breaks = 50, plot = FALSE)
hist_Q <- hist(Q_continuous, breaks = 50, plot = FALSE)
# Normalize histograms to probabilities
prob_P <- hist_P$counts / sum(hist_P$counts)
prob_Q <- hist_Q$counts / sum(hist_Q$counts)
# Compute Total Variation Distance
tv_distance_continuous <- sum(abs(prob_P - prob_Q)) / 2
print(paste(
"Total Variation Distance (Continuous Approximation):",
round(tv_distance_continuous, 4)
))
#> [1] "Total Variation Distance (Continuous Approximation): 0.125"
The continuous distributions are discretized into histograms, and TV distance is computed based on the resulting probabilities.
Discrete Case:
- The TV distance quantifies the largest difference in probability that \(P\) and \(Q\) assign to any event.
- In this example, 30% of the probability mass would have to be moved to transform \(P\) into \(Q\).
Continuous Case:
- For continuous distributions, the TV distance is approximated using discretized probabilities from histograms.
- This approach provides an intuitive measure of dissimilarity for large samples; as with any histogram-based approximation, the result depends on the choice of bins.
The Total Variation Distance provides an intuitive and interpretable measure of the maximum difference between two distributions. Its symmetry and bounded nature make it a versatile tool for comparing both discrete and continuous distributions.
4.6.12 Summary
1. Tests for Comparing Distributions
Test Name | Purpose | Type of Data | Advantages | Limitations |
---|---|---|---|---|
Kolmogorov-Smirnov Test | Tests if two distributions are the same or if a sample matches a reference distribution. | Empirical Distributions (Continuous) | Non-parametric, detects global differences. | Less sensitive to tail differences, limited to one-dimensional data. |
Anderson-Darling Test | Tests goodness-of-fit with emphasis on the tails. | Continuous Data | Strong sensitivity to tail behavior. | Requires specifying a reference distribution. |
Chi-Square Goodness-of-Fit Test | Tests if observed frequencies match expected frequencies. | Categorical Data | Simple, intuitive for discrete data. | Requires large sample sizes and sufficiently large expected frequencies. |
Cramér-von Mises Test | Evaluates goodness-of-fit using cumulative distribution functions. | Empirical Distributions (Continuous) | Sensitive across the entire distribution. | Limited to one-dimensional data; requires cumulative distribution functions. |
2. Divergence Metrics
Metric Name | Purpose | Type of Data | Advantages | Limitations |
---|---|---|---|---|
Kullback-Leibler Divergence | Measures how one probability distribution diverges from another. | Probability Distributions (Continuous/Discrete) | Provides a clear measure of information loss. | Asymmetric, sensitive to zero probabilities. |
Jensen-Shannon Divergence | Symmetric measure of similarity between two probability distributions. | Probability Distributions (Continuous/Discrete) | Symmetric and bounded; intuitive for comparison. | Less sensitive to tail differences. |
Hellinger Distance | Measures geometric similarity between two probability distributions. | Discrete or Continuous Probability Distributions | Easy to interpret; bounded between 0 and 1. | Computationally expensive for large datasets. |
Bhattacharyya Distance | Quantifies overlap between two statistical distributions. | Probability Distributions (Continuous/Discrete) | Useful for classification and clustering tasks. | Less interpretable in large-scale applications. |
3. Distance Metrics
Metric Name | Purpose | Type of Data | Advantages | Limitations |
---|---|---|---|---|
Wasserstein Distance | Measures the “effort” or “cost” to transform one distribution into another. | Continuous or Empirical Distributions | Provides geometric interpretation; versatile. | Computationally expensive for large-scale data. |
Energy Distance | Measures statistical dissimilarity between multivariate distributions. | Multivariate Empirical Distributions | Non-parametric, works well for high-dimensional data. | Requires pairwise calculations; sensitive to outliers. |
Total Variation Distance | Measures the maximum absolute difference between probabilities of two distributions. | Probability Distributions (Discrete/Continuous) | Intuitive and strict divergence measure. | Ignores structural differences beyond the largest deviation. |