23.2 Probability Sampling
Probability sampling methods ensure that every element in the population has a known, nonzero probability of being selected. These methods are preferred in inferential statistics since they allow for the estimation of sampling error.
23.2.1 Simple Random Sampling
In simple random sampling (SRS), every element in the population has an equal chance of being selected. Sampling can be done with replacement, where an element can be chosen more than once, or without replacement, where each element can appear at most once.
Below is an example of drawing a simple random sample without replacement from a population of 100 elements:
set.seed(123)
population <- 1:100 # A population of 100 elements
sample_srs <- sample(population, size = 10, replace = FALSE)
sample_srs
#> [1] 31 79 51 14 67 42 50 43 97 25
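For contrast, here is a minimal sketch of the with-replacement case, reusing the same population vector. The name sample_srs_wr is introduced here only for illustration. With replace = TRUE the same element can be drawn more than once, so the number of distinct values may be smaller than the sample size.

```r
set.seed(123)
population <- 1:100  # same population of 100 elements

# Draw 10 elements with replacement: duplicates are possible
sample_srs_wr <- sample(population, size = 10, replace = TRUE)

# Number of draws versus number of distinct elements drawn
length(sample_srs_wr)
length(unique(sample_srs_wr))
```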
Advantages:
Simple and easy to implement
Ensures unbiased selection
Disadvantages:
May not represent subgroups well, especially in heterogeneous populations
Requires access to a complete list of the population
23.2.1.1 Using dplyr
The sample_n() function in dplyr draws a simple random sample of rows from a data frame (since dplyr 1.0.0 it has been superseded by slice_sample(), but it still works):
library(dplyr)
iris_df <- iris
set.seed(1)
sample_n(iris_df, 5) # Randomly selects 5 rows from the iris dataset
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
#> 1          5.8         2.7          4.1         1.0 versicolor
#> 2          6.4         2.8          5.6         2.1  virginica
#> 3          4.4         3.2          1.3         0.2     setosa
#> 4          4.3         3.0          1.1         0.1     setosa
#> 5          7.0         3.2          4.7         1.4 versicolor
23.2.1.2 Using the sampling Package
The sampling package provides functions for random sampling with and without replacement.
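library(sampling)
# Assign a unique ID to each row in the dataset
iris_df$id <- seq_len(nrow(iris_df))
# Simple random sampling without replacement:
# srswor() returns a 0/1 indicator vector of length N (1 = selected)
srs_sample <- srswor(10, nrow(iris_df))
sampled_rows <- iris_df[srs_sample == 1, ]
# sampled_rows
# Simple random sampling with replacement:
# srswr() returns, for each unit, the number of times it was selected
srs_sample_wr <- srswr(10, nrow(iris_df))
sampled_rows_wr <- iris_df[rep(iris_df$id, times = srs_sample_wr), ]
# sampled_rows_wr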
23.2.1.3 Using the sampler Package
The sampler package provides additional functionality, such as oversampling to account for non-response.
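The oversampling idea can be sketched in base R without relying on the sampler package's own interface. The sketch below assumes a target of 10 completed responses and an anticipated non-response rate of 20%. Both numbers, and the names target_n, nonresponse_rate, n_to_draw, and oversample, are illustrative assumptions rather than anything taken from the package.

```r
set.seed(123)

target_n <- 10            # completed responses wanted (assumed)
nonresponse_rate <- 0.20  # anticipated share of non-respondents (assumed)

# Inflate the draw so that, on average, target_n units respond
n_to_draw <- ceiling(target_n / (1 - nonresponse_rate))

# Simple random sample without replacement of the inflated size
oversample <- iris[sample(nrow(iris), size = n_to_draw), ]
nrow(oversample)
```

Dividing by the expected response rate is one common convention; other rules for inflating the sample size are possible.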
23.2.2 Stratified Sampling
Stratified sampling involves dividing the population into distinct strata based on a characteristic (e.g., age, income level, region). A random sample is then drawn from each stratum, often in proportion to its size within the population. This method ensures that all subgroups are adequately represented, improving the precision of estimates.
The following example demonstrates stratified sampling where individuals belong to three different groups (A, B, C), and a random sample is drawn from each.
library(dplyr)
set.seed(123)
data <- data.frame(
  ID = 1:100,
  Group = sample(c("A", "B", "C"), 100, replace = TRUE)
)
# Stratified random sampling: selecting 10 elements per group
stratified_sample <- data %>%
  group_by(Group) %>%
  sample_n(size = 10)
# stratified_sample
Advantages:
Ensures representation of all subgroups
Often yields more precise estimates than simple random sampling, particularly when elements within each stratum are similar to one another
Reduces sampling error by accounting for population variability
Disadvantages:
Requires prior knowledge of population strata
More complex to implement than SRS
23.2.2.1 Using dplyr for Stratified Sampling
Sampling by Fixed Number of Rows
Here, we extract 5 random observations from each species in the iris dataset.
library(dplyr)
set.seed(123)
sample_iris <- iris %>%
  group_by(Species) %>%
  sample_n(5) # Selects 5 samples per species
# sample_iris
Sampling by Fraction of Each Stratum
Instead of selecting a fixed number, we can sample 15% of each species:
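A minimal sketch of fraction-based sampling, assuming dplyr 1.0.0 or later, where slice_sample() with the prop argument draws a share of the rows in each group. The older sample_frac() is the counterpart of sample_n() and should behave similarly, though the two may round group sizes differently. The object name sample_iris_frac is introduced here for illustration.

```r
library(dplyr)
set.seed(123)

sample_iris_frac <- iris %>%
  group_by(Species) %>%
  slice_sample(prop = 0.15) %>%  # about 15% of the rows in each species
  ungroup()

# Rows drawn per species
count(sample_iris_frac, Species)
```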
23.2.2.2 Using the sampler Package
The sampler package supports stratified sampling with proportional allocation.
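Proportional allocation itself can be sketched with dplyr alone, without assuming anything about the sampler package's functions. Under proportional allocation, a stratum holding N_h of the N population units receives roughly total_n * N_h / N of the total_n sampled units. The population pop_strata, its stratum sizes, and the overall sample size of 20 are hypothetical values chosen for illustration.

```r
library(dplyr)
set.seed(123)

# Hypothetical population with three strata of unequal size
pop_strata <- data.frame(
  ID = 1:100,
  Group = rep(c("A", "B", "C"), times = c(50, 30, 20))
)

total_n <- 20  # overall sample size (assumed)

# Proportional allocation: n_h = total_n * N_h / N, rounded to whole units
# (in general, rounding can make the n_h differ slightly from total_n in sum)
allocation <- pop_strata %>%
  count(Group, name = "N_h") %>%
  mutate(n_h = round(total_n * N_h / sum(N_h)))

# Draw n_h rows from each stratum
prop_sample <- pop_strata %>%
  group_split(Group) %>%
  lapply(function(stratum) {
    n_h <- allocation$n_h[allocation$Group == stratum$Group[1]]
    slice_sample(stratum, n = n_h)
  }) %>%
  bind_rows()

count(prop_sample, Group)
```

Looking up each stratum's size by name, rather than relying on the order of the split, keeps the allocation and the draw aligned even if the strata are reordered.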
23.2.3 Systematic Sampling
Systematic sampling selects every kth element from an ordered list after a random starting point between 1 and k.
k <- 10 # Select every 10th element
start <- sample(1:k, 1) # Random start point
sample_systematic <- population[seq(start, length(population), by = k)]
Advantages:
Simple to implement
Ensures even coverage
Disadvantages:
Can introduce bias if the ordering of the data follows a periodic pattern that lines up with the sampling interval k
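The risk from periodic patterns can be illustrated with a deliberately extreme, hypothetical population whose values repeat with the same period as the sampling interval. In this sketch (the names periodic_values and means_by_start are illustrative), each systematic sample sees only one point of the cycle, so its mean can be far from the population mean.

```r
# Hypothetical population whose values repeat with period 10
periodic_values <- rep(1:10, times = 10)

k <- 10  # sampling interval equal to the period

# Mean of the systematic sample for each possible starting point
means_by_start <- sapply(1:k, function(start) {
  mean(periodic_values[seq(start, length(periodic_values), by = k)])
})

mean(periodic_values)  # population mean
means_by_start         # each start reproduces a single value of the cycle
```

Choosing an interval that does not share the period of the data, or randomizing the list order first, is one way to reduce this risk.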
23.2.4 Cluster Sampling
Instead of selecting individuals, entire clusters (e.g., cities, schools) are randomly chosen, and all members of selected clusters are included.
data$Cluster <- sample(1:10, 100, replace = TRUE) # Assign 10 clusters
chosen_clusters <- sample(1:10, size = 3) # Select 3 clusters
cluster_sample <- filter(data, Cluster %in% chosen_clusters)
Advantages:
Cost-effective when the population is large
Useful when the population is naturally divided into groups
Disadvantages:
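Typically higher sampling variability than SRS for the same sample size, because members of the same cluster tend to resemble one another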
Risk of unrepresentative clusters
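The higher variability of cluster sampling can be seen in a small simulation. This is only a sketch under assumed conditions: a hypothetical population of 20 clusters with 10 members each, where members of a cluster share a common cluster effect. All names and parameter values below are illustrative. It compares the spread of the sample mean when 4 whole clusters are drawn against a simple random sample of the same size (40 units).

```r
set.seed(42)

n_clusters <- 20
cluster_size <- 10

# Hypothetical population: outcome = overall level + cluster effect + noise
cluster_effect <- rnorm(n_clusters, sd = 5)
pop <- data.frame(cluster = rep(1:n_clusters, each = cluster_size))
pop$y <- 50 + cluster_effect[pop$cluster] + rnorm(nrow(pop), sd = 1)

# Sample mean from one cluster sample: 4 clusters, all members included
one_cluster_mean <- function() {
  chosen <- sample(n_clusters, 4)
  mean(pop$y[pop$cluster %in% chosen])
}

# Sample mean from one simple random sample of the same size
one_srs_mean <- function() mean(sample(pop$y, 40))

cluster_means <- replicate(2000, one_cluster_mean())
srs_means <- replicate(2000, one_srs_mean())

# Standard deviation of the sample mean under each design
c(cluster = sd(cluster_means), srs = sd(srs_means))
```

The gap between the two designs depends on how strongly members of a cluster resemble one another; with weak cluster effects the two standard deviations would be much closer.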