23.2 Probability Sampling

Probability sampling methods ensure that every element in the population has a known, nonzero probability of being selected. Because the selection probabilities are known, the sampling error of an estimate can itself be estimated, which is why these methods are preferred for statistical inference.

23.2.1 Simple Random Sampling

Simple Random Sampling (SRS) gives every element in the population an equal chance of being selected. Sampling can be done with or without replacement, which determines whether an element can be chosen more than once.

Below is an example of drawing a simple random sample without replacement from a population of 100 elements:

set.seed(123)
population <- 1:100  # A population of 100 elements
sample_srs <- sample(population, size = 10, replace = FALSE)
sample_srs
#>  [1] 31 79 51 14 67 42 50 43 97 25
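For contrast, a with-replacement draw from the same population can be sketched as follows. The object names `sample_srswr` and `incl_prob` are illustrative and not part of the example above. The last lines compute the inclusion probability under SRS without replacement, which is $n/N$ for every element.

```r
set.seed(123)
population <- 1:100  # Same population of 100 elements

# With replacement: the same element may be drawn more than once
sample_srswr <- sample(population, size = 10, replace = TRUE)
anyDuplicated(sample_srswr) > 0  # TRUE if some element was drawn at least twice

# Without replacement, each element is included with probability n / N
incl_prob <- 10 / length(population)
incl_prob
#> [1] 0.1
```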

Advantages:

Simple and easy to implement
Ensures unbiased selection

Disadvantages:

May not represent subgroups well, especially in heterogeneous populations
Requires access to a complete list of the population

23.2.1.1 Using `dplyr`

The sample_n() function in dplyr draws a simple random sample of rows from a data frame:

library(dplyr)
iris_df <- iris
set.seed(1)
sample_n(iris_df, 5)  # Randomly selects 5 rows from the iris dataset
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
#> 1          5.8         2.7          4.1         1.0 versicolor
#> 2          6.4         2.8          5.6         2.1  virginica
#> 3          4.4         3.2          1.3         0.2     setosa
#> 4          4.3         3.0          1.1         0.1     setosa
#> 5          7.0         3.2          4.7         1.4 versicolor
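In dplyr 1.0.0 and later, sample_n() is marked as superseded in favor of slice_sample(). A minimal equivalent, assuming a recent dplyr version, might look like this (the object names `iris_srs` and `iris_wr` are illustrative):

```r
library(dplyr)

set.seed(1)
# slice_sample() is the current dplyr verb for random row selection
iris_srs <- slice_sample(iris, n = 5)                  # 5 rows, without replacement
iris_wr  <- slice_sample(iris, n = 5, replace = TRUE)  # 5 rows, with replacement
nrow(iris_srs)
```

Because the two functions are implemented differently, the same seed is not guaranteed to select the same rows in both.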

23.2.1.2 Using the `sampling` Package

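The sampling package provides srswor() and srswr() for simple random sampling without and with replacement. Both return a vector with one entry per population unit (a 0/1 selection indicator for srswor(), a selection count for srswr()), not the sampled rows themselves.

library(sampling)
# Assign a unique ID to each row in the dataset
iris_df$id <- seq_len(nrow(iris_df))

# Simple random sampling without replacement:
# srswor(n, N) returns a 0/1 indicator of length N
srs_sample <- srswor(10, nrow(iris_df))
iris_df[srs_sample == 1, ]  # The 10 selected rows

# Simple random sampling with replacement:
# srswr(n, N) returns how many times each unit was drawn
srs_sample_wr <- srswr(10, nrow(iris_df))
iris_df[rep(iris_df$id, times = srs_sample_wr), ]  # Repeats rows drawn more than once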

23.2.1.3 Using the `sampler` Package

The sampler package provides additional functionality, such as oversampling to account for non-response.

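library(sampler)
# Draw a simple random sample of 260 rows from the albania data set,
# oversampled by 10% (over = 0.1) and without replacement (rep = FALSE)
rsamp(albania, n = 260, over = 0.1, rep = FALSE)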

23.2.1.4 Handling Missing Data in Sample Collection

To compare the drawn sample with the data actually collected and identify which sampled units are missing:

alsample <- rsamp(df = albania, 544)  # Initial sample of 544 units
alreceived <- rsamp(df = alsample, 390)  # Stand-in for collected data: 390 of the sampled units
rmissing(sampdf = alsample, colldf = alreceived, col_name = qvKod)  # Sampled units absent from the collected data
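The idea behind this comparison can be sketched in base R without sampler: keep the identifiers that appear in the drawn sample but not in the collected data. The data frames and the `unit_id` column below are made up for illustration; they do not come from the albania data.

```r
set.seed(42)
# Hypothetical sampling frame with one identifier per unit
frame_df <- data.frame(unit_id = 1:1000)

# Drawn sample of 50 units, of which 35 are assumed to have responded
drawn <- frame_df[sample(nrow(frame_df), 50), , drop = FALSE]
received <- drawn[sample(nrow(drawn), 35), , drop = FALSE]

# Units that were sampled but never collected
missing_ids <- setdiff(drawn$unit_id, received$unit_id)
length(missing_ids)
#> [1] 15
```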

23.2.2 Stratified Sampling

Stratified sampling involves dividing the population into distinct strata based on a characteristic (e.g., age, income level, region). A random sample is then drawn from each stratum, often in proportion to its size within the population. This method ensures that all subgroups are adequately represented, improving the precision of estimates.

The following example demonstrates stratified sampling where individuals belong to three different groups (A, B, C), and a random sample is drawn from each.

library(dplyr)

set.seed(123)
data <- data.frame(
  ID = 1:100,
  Group = sample(c("A", "B", "C"), 100, replace = TRUE)
)

# Stratified random sampling: selecting 10 elements per group
stratified_sample <- data %>%
  group_by(Group) %>%
  sample_n(size = 10)

# stratified_sample
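The example above takes a fixed 10 units per group. Proportional allocation, mentioned in the description of the method, instead sets each stratum's sample size to $n_h = n \cdot N_h / N$. A base R sketch under that assumption follows; the total `n_total = 30` and the object names are illustrative, and rounding means the stratum sizes may not sum exactly to `n_total`.

```r
set.seed(123)
data <- data.frame(
  ID = 1:100,
  Group = sample(c("A", "B", "C"), 100, replace = TRUE)
)

n_total <- 30  # Illustrative overall sample size
# Stratum sizes N_h as a named integer vector
N_h <- vapply(split(data$ID, data$Group), length, integer(1))
# Proportional allocation: n_h = n * N_h / N, rounded to whole units
n_h <- round(n_total * N_h / sum(N_h))

# Draw n_h rows at random within each stratum, then stack the pieces
pieces <- lapply(names(n_h), function(g) {
  rows <- which(data$Group == g)
  data[rows[sample.int(length(rows), n_h[[g]])], ]
})
prop_sample <- do.call(rbind, pieces)
table(prop_sample$Group)
```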

Advantages:

Ensures representation of all subgroups
More precise estimates than Simple Random Sampling when units within each stratum are similar
Reduces sampling error by removing between-stratum variability from the estimate

Disadvantages:

Requires prior knowledge of population strata
More complex to implement than SRS
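The precision claim can be checked with a small simulation. The sketch below is illustrative rather than a formal comparison: it repeatedly estimates the mean `Sepal.Length` in iris from 30 rows, once by SRS and once by taking 10 rows per species, and compares the spread of the two sets of estimates. The number of repetitions and the object names are arbitrary choices.

```r
set.seed(123)
y <- iris$Sepal.Length
strata <- iris$Species
n_reps <- 2000

# SRS: 30 rows from the whole data set
srs_means <- replicate(n_reps, mean(y[sample.int(length(y), 30)]))

# Stratified: 10 rows from each of the 3 species (the strata are equal-sized,
# so the plain mean of the 30 values estimates the overall mean)
strat_means <- replicate(n_reps, {
  idx <- unlist(lapply(split(seq_along(y), strata), function(i) sample(i, 10)))
  mean(y[idx])
})

# Standard deviation of the estimates under each design
c(srs = sd(srs_means), stratified = sd(strat_means))
```

Because sepal length differs markedly between species, the stratified estimates should vary less than the SRS estimates.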

23.2.2.1 Using `dplyr` for Stratified Sampling

Sampling by Fixed Number of Rows

Here, we extract 5 random observations from each species in the iris dataset.

library(dplyr)

set.seed(123)
sample_iris <- iris %>%
  group_by(Species) %>%
  sample_n(5)  # Selects 5 samples per species

# sample_iris

Sampling by Fraction of Each Stratum

Instead of selecting a fixed number, we can sample 15% of each species:

set.seed(123)
sample_iris <- iris %>%
  group_by(Species) %>%
  sample_frac(size = 0.15)  # Selects 15% of each species

# sample_iris

23.2.2.2 Using the `sampler` Package

The sampler package allows stratified sampling with proportional allocation:

library(sampler)

# Stratified sample using proportional allocation without replacement
ssamp(df = albania, n = 360, strata = qarku, over = 0.1)

23.2.2.3 Handling Missing Data in Stratified Sampling

To count, by stratum, how many sampled units are missing from the collected data:

alsample <- rsamp(df = albania, 544)  # Initial sample
alreceived <- rsamp(df = alsample, 390)  # Collected data

smissing(
  sampdf = alsample,
  colldf = alreceived,
  strata = qarku,   # Strata column
  col_name = qvKod  # Column for checking missing values
)
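A base R sketch of the same per-stratum count is shown below, for readers who want to see the logic spelled out. The data frame, the `region` strata, and the `unit_id` column are hypothetical and unrelated to the albania data.

```r
set.seed(7)
# Hypothetical drawn sample: 60 units spread over three regions
drawn <- data.frame(
  unit_id = 1:60,
  region = rep(c("North", "Centre", "South"), each = 20)
)
# Assume 45 of the 60 sampled units were actually collected
received <- drawn[sample(nrow(drawn), 45), ]

# Flag sampled units with no collected record, then count them per stratum
drawn$missing <- !(drawn$unit_id %in% received$unit_id)
missing_by_region <- tapply(drawn$missing, drawn$region, sum)
missing_by_region
```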

23.2.3 Systematic Sampling

Systematic sampling selects every $k$th element from an ordered list, starting from a randomly chosen position between 1 and $k$.

k <- 10  # Select every 10th element
start <- sample(1:k, 1)  # Random start point
sample_systematic <- population[seq(start, length(population), by = k)]

Advantages:

Simple to implement
Spreads the sample evenly across the ordered list

Disadvantages:

If the list has a periodic pattern that coincides with the sampling interval $k$, the sample can be biased
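The periodicity problem is easy to reproduce. In the contrived sketch below, the population values repeat every 10 positions and the sampling interval is also 10, so every systematic sample contains a single repeated value and the estimate depends entirely on the starting point. The population `y` is made up for illustration.

```r
# Hypothetical population whose values repeat with period 10
y <- rep(1:10, times = 10)   # 100 values; the true mean is 5.5
k <- 10                      # Sampling interval equal to the period

# Compute the systematic-sample mean for every possible starting point
means_by_start <- sapply(1:k, function(start) {
  mean(y[seq(start, length(y), by = k)])
})
means_by_start
#>  [1]  1  2  3  4  5  6  7  8  9 10
```

None of the ten possible samples recovers the true mean of 5.5. Reordering the list or choosing an interval unrelated to the period avoids this.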

23.2.4 Cluster Sampling

Cluster sampling randomly selects entire clusters (e.g., cities, schools) rather than individuals, and includes all members of the selected clusters.

data$Cluster <- sample(1:10, 100, replace = TRUE)  # Assign each row to one of 10 clusters
chosen_clusters <- sample(1:10, size = 3)  # Randomly select 3 clusters
cluster_sample <- dplyr::filter(data, Cluster %in% chosen_clusters)  # Keep every member of the chosen clusters

Advantages:

Cost-effective when the population is large
Useful when the population is naturally divided into groups

Disadvantages:

Higher sampling variability than SRS of the same size when members of a cluster resemble each other
Risk of unrepresentative clusters
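The variability cost can be illustrated with a simulation on a made-up population in which members of the same cluster share a common effect. The cluster structure, effect sizes, and number of repetitions below are assumptions chosen to make the contrast visible.

```r
set.seed(123)
# Hypothetical population: 10 clusters of 10 members each, where members
# of a cluster share a common cluster-level effect
cluster_id <- rep(1:10, each = 10)
cluster_effect <- rnorm(10, mean = 50, sd = 5)
y <- cluster_effect[cluster_id] + rnorm(100, sd = 1)
n_reps <- 2000

# Cluster sampling: pick 3 clusters and keep all 30 of their members
cluster_means <- replicate(n_reps, {
  chosen <- sample(1:10, 3)
  mean(y[cluster_id %in% chosen])
})

# SRS of the same size: 30 individuals from the whole population
srs_means <- replicate(n_reps, mean(y[sample.int(100, 30)]))

# Standard deviation of the estimates under each design
c(cluster = sd(cluster_means), srs = sd(srs_means))
```

With strongly homogeneous clusters such as these, the cluster-sample estimates should spread out much more than the SRS estimates.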