Chapter 8 Descriptive statistics
8.1 Introduction
Descriptive statistics is the branch of statistics that deals with summarizing and organizing data in a meaningful way. It provides simple summaries of a sample and its measurements, so that researchers can present the data in an understandable and interpretable form.
Key objectives of descriptive statistics include:
- Summarization: Reduce complex datasets to simpler summaries.
- Visualization: Present data in an easily interpretable graphical format.
- Comparison: Enable a straightforward comparison between datasets or groups.
The primary tools used in descriptive statistics are measures of central tendency (mean, median, and mode), measures of variability (range, variance, and standard deviation), and data visualization techniques (histograms, box plots, scatter plots).
In this chapter, we will explore descriptive statistics using R, a powerful open-source statistical programming language.
8.2 Measures of Central Tendency
Measures of central tendency describe the typical or central value of a dataset.
- Mean: The arithmetic average of a dataset.
- Median: The middle value when data is ordered.
- Mode: The most frequently occurring value.
8.2.1 Example in R:
# Sample data
data <- c(7, 8, 10, 15, 10, 20, 25)
# Calculate mean, median, and mode
mean_value <- mean(data) # Mean
median_value <- median(data) # Median
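# Mode function (base R has no built-in statistical mode; mode() returns the storage type)
get_mode <- function(x) {
  unique_values <- unique(x)
  # Count how often each unique value occurs and return the most frequent one
  # (on ties, which.max() picks the value that appears first)
  unique_values[which.max(tabulate(match(x, unique_values)))]
}
mode_value <- get_mode(data) # Mode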
# Print results
cat("Mean:", mean_value, "\n")
cat("Median:", median_value, "\n")
cat("Mode:", mode_value, "\n")
## Mean: 13.57143
## Median: 10
## Mode: 10
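The mode function above returns a single value, so when two or more values tie for the highest frequency it reports only the first. As a minimal sketch (the helper name `get_all_modes` and the `bimodal` vector are illustrative assumptions, not part of the chapter's example), every tied value could be returned instead:

```r
# Hypothetical helper: return every value tied for the highest frequency
get_all_modes <- function(x) {
  unique_values <- unique(x)
  counts <- tabulate(match(x, unique_values))
  unique_values[counts == max(counts)]
}

data <- c(7, 8, 10, 15, 10, 20, 25)
bimodal <- c(data, 20) # Illustrative data: 10 and 20 now both occur twice

modes_single <- get_all_modes(data)
modes_tied <- get_all_modes(bimodal)

print(modes_single) # [1] 10
print(modes_tied)   # [1] 10 20
```

Returning a vector keeps ties visible; whether that or a single value is preferable depends on how the result is used downstream.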
8.3 Measures of Variability
Variability measures describe the spread or dispersion of data.
- Range: Difference between the maximum and minimum values.
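- Variance: The average squared deviation from the mean. R's var() computes the sample variance, which divides the sum of squared deviations by n - 1 rather than n.
- Standard Deviation: The square root of the variance, expressed in the same units as the data; it indicates the typical deviation from the mean.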
8.3.1 Example in R:
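# Calculate range, variance, and standard deviation
range_value <- range(data) # Returns c(min, max); diff(range_value) gives the difference
variance_value <- var(data) # Sample variance (divides by n - 1)
sd_value <- sd(data) # Sample standard deviation
# Print results
cat("Range:", range_value[1], "to", range_value[2], "\n")
cat("Variance:", variance_value, "\n")
cat("Standard Deviation:", sd_value, "\n")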
## Range: 7 to 25
## Variance: 45.61905
## Standard Deviation: 6.754187
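To make the definitions concrete, the following sketch recomputes the same quantities by hand and checks them against the built-in functions. The variable names (`squared_dev`, `sample_var`, `population_var`, `spread`) are illustrative assumptions; the point is only to show where the n - 1 divisor enters and how the range as a single number relates to the output of range().

```r
data <- c(7, 8, 10, 15, 10, 20, 25)
n <- length(data)

# Squared deviation of each observation from the mean
squared_dev <- (data - mean(data))^2

sample_var <- sum(squared_dev) / (n - 1) # Divisor used by var()
population_var <- sum(squared_dev) / n   # Plain average of squared deviations

# Both comparisons should report TRUE
print(all.equal(sample_var, var(data)))
print(all.equal(sqrt(sample_var), sd(data)))

# Range as a single number: maximum minus minimum (25 - 7)
spread <- diff(range(data))
print(spread) # [1] 18
```

The population version is always somewhat smaller than the sample version for the same data, and the gap shrinks as n grows.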
8.3.2 Example in R: Data Visualization
# Histogram
hist(data, main = "Histogram of Data", xlab = "Values", col = "lightblue", border = "black")
# Scatter Plot
set.seed(123) # For reproducibility
data2 <- data + rnorm(length(data), 0, 2) # Simulated second variable
plot(data, data2, main = "Scatter Plot of Data", xlab = "Variable 1", ylab = "Variable 2", col = "blue", pch = 19)
Each plot provides a unique perspective:
- The histogram shows frequency distribution.
- The boxplot reveals quartiles and outliers.
- The scatter plot shows relationships between variables.
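The list above mentions a boxplot, which the example code does not draw. A minimal sketch of how one might be produced for the same data follows; the title, axis label, and color are arbitrary choices, and the variable names are illustrative. It also prints the numbers behind the plot using base R's fivenum() and boxplot.stats().

```r
data <- c(7, 8, 10, 15, 10, 20, 25)

# Boxplot: the box spans the lower to upper quartile, the inner line marks the median
boxplot(data, main = "Boxplot of Data", ylab = "Values", col = "lightgreen", border = "black")

# Five-number summary: minimum, lower hinge, median, upper hinge, maximum
five_num <- fivenum(data)
print(five_num) # [1]  7.0  9.0 10.0 17.5 25.0

# Points beyond 1.5 times the box height from the box are flagged as outliers
outliers <- boxplot.stats(data)$out
print(length(outliers)) # [1] 0
```

For this dataset no point is flagged, so the whiskers simply extend to the minimum and maximum.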