Chapter 2 General Statistics

2.1 Descriptive Statistics

2.1.1 Measure of Central Tendency

Measures of central tendency summarize a dataset with a single representative value. In this section, three measures of central tendency will be discussed, namely the mean, the median, and the mode.

2.1.1.1 Mean

1. Arithmetic Mean

The arithmetic mean of a sample is calculated by summing all values, then dividing by the number of observations. The formula commonly used is as follows:

\[ \bar{x}=\frac{1}{n} (x_1+x_2+ ... + x_n) \]

R provides the mean() function to calculate the arithmetic mean. Here is an example calculation using this function:

data <- c(168,167,178,170,169,175,165,173,170,168,189, 156, 167, 173)
mean(data)
## [1] 170.5714

For data grouped by a categorical variable, we can compute the arithmetic mean of each category with the following code:

category <- c(rep(1,5), rep(2,3), rep(3,6))
number <- c(168,167,178,170,169,175,165,173,170,168,189,156,167,173)
(data <- data.frame(category, number))
##    category number
## 1         1    168
## 2         1    167
## 3         1    178
## 4         1    170
## 5         1    169
## 6         2    175
## 7         2    165
## 8         2    173
## 9         3    170
## 10        3    168
## 11        3    189
## 12        3    156
## 13        3    167
## 14        3    173
aggregate(data$number, list(data$category), mean)
##   Group.1     x
## 1       1 170.4
## 2       2 171.0
## 3       3 170.5

2. Geometric Mean

The geometric mean of a sample is calculated by multiplying all values, then taking the \(n\)-th root of the product. The formula commonly used is as follows:

\[ \bar{x}_{geom}=(x_1 \cdot x_2 \cdot ... \cdot x_n)^{\frac{1}{n}} \]

Base R does not provide a function to calculate the geometric mean. Here’s an example calculation using a custom function in R:

mean.geom <- function(data){
  # geometric mean computed via logarithms: exp of the mean of the log values
  exp(mean(log(data)))
}

data <- c(168, 167, 178, 170, 169, 175, 165, 173, 170, 168, 189, 156, 167, 173)
mean.geom(data)
## [1] 170.4247
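
Equivalently, the geometric mean can be computed directly from the product. The log-based form above is generally preferred, since the raw product can overflow for long vectors of large values; a minimal sketch of the direct form:

# Direct computation from the product; agrees with mean.geom(data)
# up to floating-point rounding
prod(data)^(1/length(data))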

3. Harmonic Mean

The harmonic mean is the reciprocal of the arithmetic mean of the reciprocals of the values. The formula commonly used for the harmonic mean is as follows:

\[ \bar{x}_{H}=\frac{n}{\frac{1}{x_1} + \frac{1}{x_2} + ... + \frac{1}{x_n}} \]

Base R does not provide a function to calculate the harmonic mean. Here’s an example calculation using a custom function in R:

mean.harm <- function(data){
  # harmonic mean: reciprocal of the arithmetic mean of the reciprocals
  1/(mean(1/data))
}

data <- c(168,167,178,170,169,175,165,173,170,168,189, 156, 167, 173)
mean.harm(data)
## [1] 170.28

2.1.1.2 Median

The median is the middle value of the sorted data. If the number of data points is odd, the median is the \(\frac{n+1}{2}\)-th data point. If the number of data points is even, the median is the average of the \(\frac{n}{2}\)-th and \(\left(\frac{n}{2}+1\right)\)-th data points. The median is often used to represent skewed data distributions. R provides the median() function to calculate the median. Here’s an example calculation using this function:

data <- c(168,167,178,170,169,175,165,173,170,168,189, 156, 167, 173)
median(data)
## [1] 169.5
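
Here the number of data points is even (\(n = 14\)), so 169.5 is the average of the 7th and 8th sorted values, 169 and 170. With an odd number of data points, the single middle value is returned, as in this example using the first five values of the sample above:

# Odd number of data points: the median is the middle sorted value
data_odd <- c(168, 167, 178, 170, 169)
median(data_odd)
## [1] 169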

2.1.1.3 Mode

Mode is the value that appears most frequently in a dataset. Base R does not provide a function to calculate the statistical mode (the built-in mode() function returns an object’s storage mode instead). Here’s an example calculation using a custom function in R:

mode <- function(data){  # note: this masks the built-in mode() function
  uqx <- unique(data)
  tab <- table(data)           # frequency of each distinct value
  sort(uqx)[tab == max(tab)]   # value(s) with the highest frequency
}

data <- c(168,167,178,170,169,175,165,173,170,168,189, 156, 167, 173, 167)
mode(data)
## [1] 167
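
Note that this custom function returns every value tied for the highest frequency, so multimodal data yields more than one mode:

# Two values tie for the highest frequency, so both are returned
mode(c(1, 1, 2, 2, 3))
## [1] 1 2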

2.1.2 Measures of Data Dispersion

Measures of data dispersion indicate how far the data spread from their mean. This section will discuss several measures of data dispersion: the range, the variance and standard deviation, and the interquartile range and quartile deviation.

2.1.2.1 Range

Range is the difference between the largest and smallest data values. For data sorted in ascending order, it can be written with the formula as follows:

\[ Range = x_{(n)} - x_{(1)} \]

Here’s an example calculation of the range:

data <- c(168,167,178,170,169,175,165,173,170,168,189, 156, 167, 173)

# The range() function returns the minimum and maximum values, not their difference
range(data)
## [1] 156 189
# Custom function to get the range value
range.data <- function(data){
  max(data) - min(data)
}
range.data(data)
## [1] 33

2.1.2.2 Variance and Standard Deviation

Variance is the average of the squared deviations from the mean of the data. The variance of a population can be formulated as follows:

\[ var(x)=\frac{1}{N} \sum_{i=1}^{N} {(x_i - \mu)}^2 \]

The variance of a sample can be formulated as follows:

\[ var(x)=\frac{1}{n-1} \sum_{i=1}^{n} {(x_i - \bar{x})}^2 \]

Standard deviation is the square root of the variance.

R provides functions to calculate the variance and standard deviation of a sample dataset: the var() and sd() functions. Here’s an example calculation using these functions:

data <- c(168,167,178,170,169,175,165,173,170,168,189, 156, 167, 173)
var(data)
## [1] 54.72527
sd(data)
## [1] 7.397653
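
As a quick check, the sample variance formula above can be evaluated manually, and the standard deviation is indeed the square root of the variance:

# Manual sample variance: sum of squared deviations divided by n - 1
n <- length(data)
sum((data - mean(data))^2)/(n - 1)
## [1] 54.72527

# sd() is the square root of var()
sqrt(var(data))
## [1] 7.397653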

2.1.2.3 Interquartile Range and Quartile Deviation

Interquartile range is the difference between Quartile 3 and Quartile 1. Meanwhile, quartile deviation, also known as semi-interquartile range, is half of the interquartile range. Both statistics can be written with the formulas as follows:

\[ IQR = Q_3 - Q_1 \]

\[ QD = \frac{Q_3 - Q_1}{2} \]

R provides the quantile() function to find quartile values. (The stats package also ships an IQR() function, which the custom function below masks.) Using the obtained values of \(Q_1\) and \(Q_3\), we can calculate the interquartile range and quartile deviation using custom functions as follows:

data <- c(168,167,178,170,169,175,165,173,170,168,189, 156, 167, 173)

IQR <- function(data){
  as.numeric(quantile(data, 0.75) - quantile(data,0.25))
}
IQR(data)
## [1] 5.75
QD <- function(data){
  as.numeric(0.5*(quantile(data, 0.75) - quantile(data,0.25)))
}
QD(data)
## [1] 2.875

2.2 Inferential Statistics

Inferential statistics means using statistical methods to draw conclusions about a population based on sample data. In this section, we will focus on hypothesis testing and confidence intervals.

2.2.1 Confidence Interval

A confidence interval (CI) is a range of values, derived from sample statistics, that is likely to contain the true population parameter. It provides an estimate of where the true parameter lies with a certain level of confidence (usually 95%). A 95% confidence level means that if you were to take 100 different samples and compute a confidence interval for each, approximately 95 of the intervals would contain the true population parameter.

Relation to Hypothesis Testing : If the confidence interval does not include the value stated in the null hypothesis, we reject the null hypothesis at the corresponding significance level.
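
As a minimal sketch, a 95% confidence interval for a mean can be computed in R with t.test(), whose conf.int component holds the interval. Reusing the sample from the one-sample t-test below, the interval matches the one printed in that section’s output:

# Sample data (same values as in the one-sample t-test example)
data <- c(5.1, 5.5, 6.3, 5.8, 6.1, 5.9, 6.2, 5.6, 5.7, 6.0)

# t.test() reports a 95% confidence interval for the mean by default
t.test(data)$conf.int
## [1] 5.561414 6.078586
## attr(,"conf.level")
## [1] 0.95

# Equivalent manual computation: mean +/- t-quantile * standard error
mean(data) + qt(c(0.025, 0.975), df = length(data) - 1) * sd(data)/sqrt(length(data))
## [1] 5.561414 6.078586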

2.2.2 Hypothesis Testing

The aim of hypothesis testing is to decide whether there is enough evidence to reject a null hypothesis about a population parameter based on sample data. Steps in hypothesis testing :

  1. Formulate hypotheses : Null Hypothesis (H0) and Alternative Hypothesis (H1 or Ha)
    • Null Hypothesis (H0) : A statement of no effect or no difference, which we aim to test. It represents the status quo or a baseline assumption.
    • Alternative Hypothesis (H1 or Ha) : A statement that contradicts the null hypothesis, indicating the presence of an effect or a difference.
  2. Select the significance level : \(\alpha\)
    • The significance level represents the probability of rejecting the null hypothesis when it is actually true (Type I error).
    • Commonly used values are 0.05, 0.01, or 0.10.
  3. Choose the appropriate test :
    • The choice depends on the data type, distribution, sample size, and whether the samples are independent or paired.
  4. Calculate the test statistic, determine the p-value, and make a decision.
    • If p-value \(\le \alpha\) : Reject the null hypothesis (evidence suggests H1 is true).
    • If p-value > \(\alpha\) : Fail to reject the null hypothesis (insufficient evidence to support H1).

2.2.2.1 One-sample t-test

A one-sample t-test determines whether the mean of a single sample is significantly different from a known or hypothesized population mean. Example of hypotheses used:

  • Null Hypothesis (H0) : The mean of the sample is equal to 5.5.
  • Alternative Hypothesis (H1) : The mean of the sample is not equal to 5.5.

# Generating sample data
data <- c(5.1, 5.5, 6.3, 5.8, 6.1, 5.9, 6.2, 5.6, 5.7, 6.0)

# Perform a one-sample t-test
# H0 : The mean of the sample is equal to 5.5.
result <- t.test(data, mu = 5.5)
print(result)
## 
##  One Sample t-test
## 
## data:  data
## t = 2.7994, df = 9, p-value = 0.02073
## alternative hypothesis: true mean is not equal to 5.5
## 95 percent confidence interval:
##  5.561414 6.078586
## sample estimates:
## mean of x 
##      5.82

Since the p-value (0.02073) is less than the significance level of 0.05, we reject the null hypothesis. This indicates that there is significant statistical evidence that the true population mean differs from 5.5. Additionally, the 95% confidence interval for the mean, [5.561414, 6.078586], does not include 5.5, further supporting the conclusion that the population mean is significantly different from the hypothesized value. The sample mean is estimated to be 5.82.

2.2.2.2 Two-sample t-test

A two-sample t-test compares the means of two independent samples to determine whether they are significantly different from each other. By default, R’s t.test() performs Welch’s t-test, which does not assume equal variances (hence the “Welch Two Sample t-test” heading in the output below). Hypotheses used :

  • Null Hypothesis (H0) : The means of the two samples are equal.
  • Alternative Hypothesis (H1) : The means of the two samples are not equal.

# Generating sample data for two groups
group1 <- c(5.1, 5.5, 6.3, 5.8, 6.1)
group2 <- c(5.9, 6.2, 5.6, 5.7, 6.0)

# Perform a two-sample t-test
result <- t.test(group1, group2)
print(result)
## 
##  Welch Two Sample t-test
## 
## data:  group1 and group2
## t = -0.50262, df = 5.8824, p-value = 0.6335
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.7070361  0.4670361
## sample estimates:
## mean of x mean of y 
##      5.76      5.88

Since the p-value (0.6335) is much greater than the significance level of 0.05, we fail to reject the null hypothesis. This indicates that there is insufficient statistical evidence to suggest a significant difference between the means of the two groups. Additionally, the 95% confidence interval for the difference in means includes 0, further supporting the conclusion that the difference in means is not statistically significant.

2.2.2.3 Paired t-test

A paired t-test tests whether the means of two related samples (e.g., before-and-after measurements) are different. Example of hypotheses used :

  • Null Hypothesis (H0) : There is no difference in mean weight before and after the diet program.
  • Alternative Hypothesis (H1) : There is a significant difference in mean weight before and after the diet program.

# Weights of participants before and after the diet program
before <- c(80, 82, 78, 75, 77, 85, 90, 88, 82, 84)
after <- c(78, 80, 76, 73, 75, 82, 87, 85, 80, 82)

# Perform a paired t-test
result <- t.test(before, after, paired = TRUE)
print(result)
## 
##  Paired t-test
## 
## data:  before and after
## t = 15.057, df = 9, p-value = 1.092e-07
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  1.95445 2.64555
## sample estimates:
## mean of the differences 
##                     2.3

Since the p-value (1.092e-07) is much smaller than the significance level of 0.05, we reject the null hypothesis. This indicates that there is significant statistical evidence of a difference in mean weight before and after the diet program. The mean difference is estimated to be 2.3 units, with a 95% confidence interval of [1.95445, 2.64555]. The very small p-value, together with an interval that excludes zero, indicates a highly significant difference.

2.3 Correlations

Correlation is a statistical measure that describes the strength and direction of a relationship between two variables.

  • Positive Correlation: As one variable increases, the other variable also increases. Example: height and weight typically have a positive correlation.
  • Negative Correlation: As one variable increases, the other variable decreases.
  • No Correlation: No predictable relationship between the variables.

Before calculating a correlation coefficient, we can plot the data to inspect the relationship visually. A scatter plot is a graphical representation of the relationship between two variables, where each point represents one observation; see the sketch below. (Details about corrplot will be covered in the Data Visualization section.)
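
As a minimal sketch, base R’s plot() function draws a scatter plot (the sample values here are illustrative):

# Illustrative sample data
x <- c(10, 20, 30, 40, 50)
y <- c(12, 24, 33, 48, 55)

# Scatter plot: each point is one (x, y) observation
plot(x, y,
     main = "Scatter Plot of y vs x",
     xlab = "x", ylab = "y",
     pch = 19)  # pch = 19 draws solid circles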

2.3.1 Pearson Correlation Coefficient (𝑟)

The Pearson correlation coefficient measures the linear relationship between two continuous variables. It ranges from -1 to 1 and is symmetric: the correlation between X and Y is the same as between Y and X. It is important to note, however, that correlation does not imply causation; a high correlation between two variables does not mean that one variable causes the other to change.

Formulation of the Pearson Correlation Coefficient :

\[r = \frac{\sum (X - \bar{X}) (Y - \bar{Y})} {\sqrt{\sum (X - \bar{X})^2 \sum (Y - \bar{Y})^2}}\]

  • \(0.0 < |r| \le 0.3\) : Weak linear relationship.
  • \(0.3 < |r| \le 0.7\) : Moderate linear relationship.
  • \(0.7 < |r| \le 1.0\) : Strong linear relationship.

# Sample data
x <- c(10, 20, 30, 40, 50)
y <- c(12, 24, 33, 48, 55)

# Calculating Pearson correlation
correlation <- cor(x, y)
print(correlation)
## [1] 0.9954038
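
To test whether a correlation is significantly different from zero, R’s cor.test() function can be used; for the Pearson method it reports the t statistic, the p-value, and a 95% confidence interval for the coefficient:

# Significance test for the correlation (H0 : true correlation is 0)
cor.test(x, y)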

2.3.2 Spearman’s Rank Correlation Coefficient (𝜌)

Spearman’s rank correlation coefficient measures the strength and direction of the monotonic relationship between two ranked variables. Spearman’s correlation works with the ranks of the data rather than the raw data values, making it less sensitive to outliers and applicable to ordinal data. It ranges from -1 to 1; the closer \(\rho\) is to -1 or 1, the stronger the monotonic (negative or positive, respectively) relationship.

Formulation of Spearman’s Rank Correlation Coefficient :

\[\rho = 1- \frac{6\sum{d_i}^2}{n(n^2 - 1)}\]

where \(d_i\) is the difference between the ranks of each pair of observations. (This form assumes no tied ranks; with ties, R computes \(\rho\) as the Pearson correlation of the ranks.)

# Sample data
x <- c(10, 20, 30, 40, 50)
y <- c(1, 2, 3, 5, 4)

# Calculating Spearman correlation
correlation <- cor(x, y, method = "spearman")
print(correlation)
## [1] 0.9
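
To illustrate the reduced sensitivity to outliers mentioned above, here is a small sketch with made-up data: one extreme value pulls the Pearson coefficient well below 1, while the Spearman coefficient, which only sees the ranks, remains exactly 1:

# Made-up data: y increases with x, but the last value is an extreme outlier
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 6, 8, 100)

# Pearson is pulled well below 1 by the outlier (roughly 0.74)
cor(x, y)

# Spearman sees only the ranks, which are still perfectly ordered
cor(x, y, method = "spearman")
## [1] 1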