Chapter 8 Descriptive Statistics

8.1 Introduction

Descriptive statistics is the branch of statistics concerned with summarizing and organizing data. It provides simple summaries of a sample and its measurements, so that researchers can present data in an interpretable form.

Key objectives of descriptive statistics include:
- Summarization: Reduce complex datasets to simpler summaries.
- Visualization: Present data in an easily interpretable graphical format.
- Comparison: Enable a straightforward comparison between datasets or groups.

The primary tools used in descriptive statistics are measures of central tendency (mean, median, and mode), measures of variability (range, variance, and standard deviation), and data visualization techniques (histograms, box plots, scatter plots).

In this chapter, we will explore descriptive statistics using R, a powerful open-source statistical programming language.

8.2 Measures of Central Tendency

Measures of central tendency describe the typical or central value of a dataset.

Mean: The arithmetic average of a dataset.
Median: The middle value when the data are ordered (the average of the two middle values when the count is even).
Mode: The most frequently occurring value.
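The definitions of the mean and median above can be checked by computing them step by step. The following is a minimal sketch, not a replacement for R's built-in mean() and median(); the names x, manual_mean, and manual_median are illustrative, and x holds the same seven values used as sample data in this chapter.

```r
x <- c(7, 8, 10, 15, 10, 20, 25)   # same values as the chapter's sample data

# Mean: sum of the values divided by how many there are
manual_mean <- sum(x) / length(x)

# Median: sort, then take the middle value
# (average the two middle values when the count is even)
sorted_x <- sort(x)
n <- length(sorted_x)
manual_median <- if (n %% 2 == 1) {
  sorted_x[(n + 1) / 2]
} else {
  mean(sorted_x[c(n / 2, n / 2 + 1)])
}

all.equal(manual_mean, mean(x))   # TRUE
manual_median == median(x)        # TRUE
```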

8.2.1 Example in R:

# Sample data
data <- c(7, 8, 10, 15, 10, 20, 25)

# Calculate mean, median, and mode
mean_value <- mean(data)       # Mean
median_value <- median(data)   # Median

# Mode function (R has no built-in for this; on ties it returns the tied value that appears first)
get_mode <- function(x) {
  unique_values <- unique(x)
  unique_values[which.max(tabulate(match(x, unique_values)))]
}

mode_value <- get_mode(data)   # Mode

# Print results
cat("Mean:", mean_value, "\n")

## Mean: 13.57143

cat("Median:", median_value, "\n")

## Median: 10

cat("Mode:", mode_value, "\n")

## Mode: 10
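The get_mode function above returns a single value because which.max() stops at the first maximum. When several values share the highest frequency, it may be useful to see all of them. The sketch below is one possible variant; get_all_modes and the tied vector are hypothetical names and data introduced only for illustration.

```r
# Hypothetical helper: return every value that attains the highest frequency
get_all_modes <- function(x) {
  unique_values <- unique(x)
  counts <- tabulate(match(x, unique_values))
  unique_values[counts == max(counts)]
}

tied <- c(7, 8, 8, 10, 10, 15)   # illustrative data in which 8 and 10 both occur twice
get_all_modes(tied)
# 8 10
```

For data with a single most frequent value, such as the sample data in this section, this variant returns the same result as get_mode.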

8.3 Measures of Variability

Variability measures describe the spread or dispersion of data.

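Range: Difference between the maximum and minimum values. Note that R's range() returns the two endpoints (minimum and maximum) rather than their difference.
Variance: Average squared deviation from the mean. R's var() computes the sample variance, which divides the sum of squared deviations by n - 1 rather than n.
Standard Deviation: Square root of the variance, expressing the typical deviation from the mean in the same units as the data.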

8.3.1 Example in R:

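# Calculate range, variance, and standard deviation
range_value <- range(data)     # c(min, max), not the max - min difference
variance_value <- var(data)    # sample variance (divides by n - 1)
sd_value <- sd(data)           # square root of the sample variance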

# Print results
cat("Range:", range_value[1], "to", range_value[2], "\n")

## Range: 7 to 25

cat("Variance:", variance_value, "\n")

## Variance: 45.61905

cat("Standard Deviation:", sd_value, "\n")

## Standard Deviation: 6.754187
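To connect these results to the definitions, the same quantities can be computed by hand. This is a minimal sketch assuming the sample data from Section 8.2.1; the names range_width, manual_var, population_var, and manual_sd are illustrative. It also shows the n - 1 divisor that var() and sd() use.

```r
data <- c(7, 8, 10, 15, 10, 20, 25)   # sample data from Section 8.2.1

# Range as a single number: maximum minus minimum
range_width <- diff(range(data))      # 25 - 7 = 18

# Sample variance: squared deviations summed and divided by n - 1
n <- length(data)
deviations <- data - mean(data)
manual_var <- sum(deviations^2) / (n - 1)

# Dividing by n instead gives the population form, which is smaller
population_var <- sum(deviations^2) / n

# Standard deviation: square root of the sample variance
manual_sd <- sqrt(manual_var)

all.equal(manual_var, var(data))   # TRUE
all.equal(manual_sd, sd(data))     # TRUE
```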

8.3.1.1 Data Visualization

Data visualization provides an intuitive way to understand data distributions and patterns.

Histogram: Shows the frequency distribution.
Boxplot: Highlights data spread and potential outliers.
Scatter Plot: Displays relationships between two variables.

8.3.2 Example in R:

# Histogram
hist(data, main = "Histogram of Data", xlab = "Values", col = "lightblue", border = "black")

# Boxplot
boxplot(data, main = "Boxplot of Data", ylab = "Values", col = "lightgreen")

# Scatter Plot
set.seed(123) # For reproducibility
data2 <- data + rnorm(length(data), 0, 2) # Simulated second variable
plot(data, data2, main = "Scatter Plot of Data", xlab = "Variable 1", ylab = "Variable 2", col = "blue", pch = 19)

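Each plot answers a different question about the sample:
- The histogram shows how many values fall into each bin, and therefore where the data are concentrated.
- The boxplot shows the median and quartiles; values lying far beyond the box are drawn as individual points, and none are in this sample.
- The scatter plot shows a positive relationship between the two variables, which is expected here because Variable 2 was simulated as Variable 1 plus random noise.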
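The numbers behind the boxplot and histogram can also be inspected directly, which helps when reading the plots. The following sketch assumes the sample data from Section 8.2.1 and uses the base R functions fivenum(), boxplot.stats(), and hist() with plot = FALSE; the variable names five, outliers, and h are illustrative.

```r
data <- c(7, 8, 10, 15, 10, 20, 25)   # sample data from Section 8.2.1

# Five-number summary that the boxplot is drawn from:
# minimum, lower hinge, median, upper hinge, maximum
five <- fivenum(data)
five
# 7.0  9.0 10.0 17.5 25.0

# Values more than 1.5 times the box length beyond the hinges are flagged as outliers
outliers <- boxplot.stats(data)$out
length(outliers)   # 0 for this sample

# Bin edges and counts that hist() would draw, computed without opening a plot
h <- hist(data, plot = FALSE)
h$breaks
h$counts
```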

8.4 Summary

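This chapter introduced descriptive statistics with practical examples in R, covering measures of central tendency, measures of variability, and basic plots. Combining numerical summaries with visualizations gives a fuller picture of a dataset than either provides alone, and supports better-informed decisions.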