23 Sampling
Sampling allows us to draw conclusions about a population without analyzing every individual in it. In business applications such as marketing research and financial forecasting, sampling enables efficient decision-making while reducing costs and effort.
23.1 Population and Sample
This section is a quick refresher on sampling terminology.
- Population (N): The complete set of all elements under study.
- Sample (n): A subset of the population selected for analysis.
- Parameter: A numerical measure that describes a characteristic of a population (e.g., population mean $\mu$, population variance $\sigma^2$).
- Statistic: A numerical measure computed from a sample, used to estimate a population parameter (e.g., sample mean $\bar{x}$, sample variance $s^2$).
A well-chosen sample ensures that results generalize to the population, reducing sampling bias.
23.2 Probability Sampling
Probability sampling methods ensure that every element in the population has a known, nonzero probability of being selected. These methods are preferred in inferential statistics since they allow for the estimation of sampling error.
23.2.1 Simple Random Sampling
Simple Random Sampling (SRS) ensures that every element in the population has an equal chance of being selected. This can be done with replacement or without replacement, impacting whether an element can be chosen more than once.
Below is an example of drawing a simple random sample without replacement from a population of 100 elements:
set.seed(123)
population <- 1:100 # A population of 100 elements
sample_srs <- sample(population, size = 10, replace = FALSE)
sample_srs
#> [1] 31 79 51 14 67 42 50 43 97 25
Advantages:
Simple and easy to implement
Ensures unbiased selection
Disadvantages:
May not represent subgroups well, especially in heterogeneous populations
Requires access to a complete list of the population
23.2.1.1 Using dplyr
The sample_n() function in dplyr allows for simple random sampling from a dataset:
library(dplyr)
iris_df <- iris
set.seed(1)
sample_n(iris_df, 5) # Randomly selects 5 rows from the iris dataset
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 5.8 2.7 4.1 1.0 versicolor
#> 2 6.4 2.8 5.6 2.1 virginica
#> 3 4.4 3.2 1.3 0.2 setosa
#> 4 4.3 3.0 1.1 0.1 setosa
#> 5 7.0 3.2 4.7 1.4 versicolor
23.2.1.2 Using the sampling Package
The sampling package provides functions for random sampling with and without replacement.
library(sampling)
# Assign a unique ID to each row in the dataset
iris_df$id <- 1:nrow(iris_df)
# Simple random sampling without replacement
srs_sample <- srswor(10, length(iris_df$id))
# srs_sample
# Simple random sampling with replacement
srs_sample_wr <- srswr(10, length(iris_df$id))
# srs_sample_wr
23.2.1.3 Using the sampler Package
The sampler package provides additional functionality, such as oversampling to account for non-response.
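A minimal sketch, assuming rsamp()'s over argument specifies the desired oversampling proportion (the albania dataset ships with sampler and is also used in the next example):
library(sampler)
# Draw an initial sample of 544 respondents from the albania dataset,
# adding roughly 10% extra units to offset anticipated non-response
# (over assumed to be the oversampling proportion)
alsample_over <- rsamp(df = albania, n = 544, over = 0.1, rep = FALSE)
# nrow(alsample_over)  # roughly 544 * 1.1 rows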
23.2.1.4 Handling Missing Data in Sample Collection
To compare a sample with received (collected) data and identify missing elements:
library(sampler)
alsample <- rsamp(df = albania, 544) # Initial sample
alreceived <- rsamp(df = alsample, 390) # Collected data
rmissing(sampdf = alsample, colldf = alreceived, col_name = qvKod)
23.2.2 Stratified Sampling
Stratified sampling involves dividing the population into distinct strata based on a characteristic (e.g., age, income level, region). A random sample is then drawn from each stratum, often in proportion to its size within the population. This method ensures that all subgroups are adequately represented, improving the precision of estimates.
The following example demonstrates stratified sampling where individuals belong to three different groups (A, B, C), and a random sample is drawn from each.
library(dplyr)
set.seed(123)
data <- data.frame(
ID = 1:100,
Group = sample(c("A", "B", "C"), 100, replace = TRUE)
)
# Stratified random sampling: selecting 10 elements per group
stratified_sample <- data %>%
group_by(Group) %>%
sample_n(size = 10)
# stratified_sample
Advantages:
Ensures representation of all subgroups
More precise estimates compared to Simple Random Sampling
Reduces sampling error by accounting for population variability
Disadvantages:
Requires prior knowledge of population strata
More complex to implement than SRS
23.2.2.1 Using dplyr for Stratified Sampling
Sampling by Fixed Number of Rows
Here, we extract 5 random observations from each species in the iris dataset.
library(dplyr)
set.seed(123)
sample_iris <- iris %>%
group_by(Species) %>%
sample_n(5) # Selects 5 samples per species
# sample_iris
Sampling by Fraction of Each Stratum
Instead of selecting a fixed number, we can sample 15% of each species:
set.seed(123)
sample_iris <- iris %>%
group_by(Species) %>%
sample_frac(size = 0.15) # Selects 15% of each species
# sample_iris
23.2.2.2 Using the sampler Package
The sampler package allows stratified sampling with proportional allocation:
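A minimal sketch, assuming the intended function is sampler's ssamp(), which allocates the sample across strata in proportion to stratum size; it reuses the albania data and the qarku strata column that also appear in the next subsection:
library(sampler)
# Stratified sample of 544 respondents, allocated proportionally
# across the qarku strata of the albania dataset
# (ssamp(df, n, strata) assumed to perform proportional allocation)
ssample <- ssamp(df = albania, n = 544, strata = qarku)
# head(ssample)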
23.2.2.3 Handling Missing Data in Stratified Sampling
To identify the number of missing values by stratum between the initial sample and the collected data:
alsample <- rsamp(df = albania, 544) # Initial sample
alreceived <- rsamp(df = alsample, 390) # Collected data
smissing(
sampdf = alsample,
colldf = alreceived,
strata = qarku, # Strata column
col_name = qvKod # Column for checking missing values
)
23.2.3 Systematic Sampling
Systematic sampling selects every k-th element after a random starting point.
k <- 10 # Select every 10th element
start <- sample(1:k, 1) # Random start point
sample_systematic <- population[seq(start, length(population), by = k)]
Advantages:
Simple to implement
Ensures even coverage
Disadvantages:
If the data follow a periodic pattern, bias may be introduced
23.2.4 Cluster Sampling
Instead of selecting individuals, entire clusters (e.g., cities, schools) are randomly chosen, and all members of selected clusters are included.
data$Cluster <- sample(1:10, 100, replace = TRUE) # Randomly assign each of the 100 units to one of 10 clusters
chosen_clusters <- sample(1:10, size = 3) # Select 3 clusters
cluster_sample <- filter(data, Cluster %in% chosen_clusters)
Advantages:
Cost-effective when the population is large
Useful when the population is naturally divided into groups
Disadvantages:
Higher variability
Risk of unrepresentative clusters
23.3 Non-Probability Sampling
These methods do not give all elements a known probability of selection. They are used in exploratory research but are not suitable for making formal statistical inferences.
23.3.1 Convenience Sampling
Selecting individuals who are easiest to reach (e.g., mall surveys).
Pros: Quick and inexpensive
Cons: High risk of bias, not generalizable
23.4 Unequal Probability Sampling
Unequal probability sampling assigns different selection probabilities to elements in the population. This approach is often used when certain units are more important, have higher variability, or require higher precision in estimation.
Common methods for unequal probability sampling include:
Probability Proportional to Size (PPS): Selection probability is proportional to a given auxiliary variable (e.g., revenue, population size).
Poisson Sampling: Independent selection of each unit with a given probability.
Systematic Sampling with Unequal Probabilities: Uses a systematic approach while ensuring different probabilities.
The following functions from the sampling package implement various unequal probability sampling methods; each takes a vector of first-order inclusion probabilities (pik) as its main argument:
library(sampling)
# Different methods for unequal probability sampling
UPbrewer() # Brewer's method
UPmaxentropy() # Maximum entropy method
UPmidzuno() # Midzuno’s method
UPmidzunopi2() # Midzuno’s method with second-order inclusion probabilities
UPmultinomial() # Multinomial method
UPpivotal() # Pivotal method
UPrandompivotal() # Randomized pivotal method
UPpoisson() # Poisson sampling
UPsampford() # Sampford’s method
UPsystematic() # Systematic sampling
UPrandomsystematic() # Randomized systematic sampling
UPsystematicpi2() # Systematic sampling with second-order probabilities
UPtille() # Tillé’s method
UPtillepi2() # Tillé’s method with second-order inclusion probabilities
Each of these methods has specific use cases and theoretical justifications. For example:
Poisson sampling selects each unit independently with its own probability, which is simple and flexible, but the realized sample size is random.
Systematic sampling is useful when population elements are arranged in a meaningful order.
Tillé’s method ensures better control over the sample’s second-order inclusion probabilities.
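As an illustration, the sketch below draws a probability-proportional-to-size sample with UPsystematic(); the revenue vector is made up, and inclusionprobabilities() converts it into first-order inclusion probabilities for a sample of five units.
library(sampling)
set.seed(123)
# Hypothetical auxiliary size measure (e.g., revenue) for 20 units
revenue <- c(120, 45, 300, 80, 60, 150, 500, 90, 70, 40,
             200, 30, 110, 95, 160, 55, 75, 130, 85, 65)
# First-order inclusion probabilities proportional to size, for n = 5
pik <- inclusionprobabilities(revenue, 5)
# Draw the sample; UPsystematic() returns a 0/1 selection indicator
s <- UPsystematic(pik)
which(s == 1) # Indices of the selected units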
23.5 Balanced Sampling
Balanced sampling ensures that the means of auxiliary variables in the sample match those in the population. This method improves estimation efficiency and reduces variability without introducing bias.
Balanced sampling differs from purposive selection because it still involves randomization, ensuring statistical validity.
The balancing equation is given by:
$$\sum_{k \in S} \frac{\mathbf{x}_k}{\pi_k} = \sum_{k \in U} \mathbf{x}_k$$
where:
$\mathbf{x}_k$ is a vector of auxiliary variables (e.g., income, age, household size).
$\pi_k$ is the inclusion probability of unit $k$.
$S$ is the sample, and $U$ is the population.
This ensures that the total weighted sum of auxiliary variables in the sample matches the total sum in the population.
23.5.1 Cube Method for Balanced Sampling
The Cube Method is a widely used approach for balanced sampling, consisting of two phases:
Flight Phase: Ensures initial balance on auxiliary variables.
Landing Phase: Adjusts the sample to meet constraints while keeping randomness.
library(sampling)
# Cube method functions
samplecube() # Standard cube method
fastflightcube() # Optimized flight phase
landingcube() # Landing phase method
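A minimal sketch of the cube method on made-up data: including pik among the balancing variables keeps the sample size (nearly) fixed, and samplecube() returns a 0/1 selection indicator.
library(sampling)
set.seed(123)
# Hypothetical population of 100 units with two auxiliary variables
N <- 100
x1 <- runif(N, 20, 80)                  # e.g., age
x2 <- rgamma(N, shape = 2, rate = 0.1)  # e.g., income
pik <- rep(20 / N, N)                   # Equal inclusion probabilities for n = 20
X <- cbind(pik, x1, x2)                 # Balancing variables (pik fixes the sample size)
s <- samplecube(X, pik, comment = FALSE)
sum(s)                                  # Realized sample size (equal or close to 20)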
23.5.2 Balanced Sampling with Stratification
Stratification attempts to replicate the population structure in the sample by preserving the original multivariate histogram.
library(survey)
data("api")
# Stratified design with proportional allocation
strat_design <- svydesign(data = apistrat,
weights = ~pw,
fpc = ~fpc,
strata = ~stype,
id = ~1)
An additional method for balanced stratification, sketched below, ensures that within each stratum the sample retains the original proportions of auxiliary variables.
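A sketch using balancedstratification() from the sampling package, which appears to be the intended tool here, on made-up data with three strata and one auxiliary variable besides pik:
library(sampling)
set.seed(123)
# Hypothetical population: 90 units in 3 strata, one auxiliary variable
strata <- rep(1:3, each = 30)
income <- rgamma(90, shape = 2, rate = 0.1)
pik <- rep(1/3, 90)      # Equal inclusion probabilities (n = 30 overall)
X <- cbind(pik, income)  # Balancing variables
# Assumed signature: balancedstratification(X, strata, pik, comment)
s <- balancedstratification(X, strata, pik, comment = FALSE)
table(strata[s == 1])    # Selected units per stratum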
23.5.3 Balanced Sampling in Cluster Sampling
Cluster sampling involves selecting entire groups (clusters) instead of individual units. A balanced approach ensures that the sampled clusters preserve the overall distribution of auxiliary variables.
library(survey)
data("api")
# Cluster sampling design
clus_design <- svydesign(data = apiclus1,
weights = ~pw,
fpc = ~fpc,
id = ~dnum)
For explicitly balanced cluster sampling (sketched below), the aim is that the cluster-level characteristics of the sample match those of the population.
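A sketch using the sampling package's balancedcluster() function on made-up data; the call assumes it selects m clusters balanced on the cluster totals of X and returns, for each unit, a selection indicator alongside its inclusion probability.
library(sampling)
set.seed(123)
# Hypothetical population: 200 units grouped into 20 clusters
cluster <- rep(1:20, each = 10)
income  <- rgamma(200, shape = 2, rate = 0.1)
X <- cbind(income)  # Unit-level balancing variable(s)
# Select 5 clusters; selection = 1 assumed to draw clusters proportional to size
s <- balancedcluster(X, m = 5, cluster = cluster, selection = 1, comment = FALSE)
head(s)             # Selection indicator and inclusion probability per unit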
23.5.4 Balanced Sampling in Two-Stage Sampling
Two-stage sampling first selects primary units (e.g., schools, cities) and then samples within them. A balanced approach ensures representative selection at both stages.
library(survey)
data("api")
# Two-stage sampling design
twostage_design <- svydesign(data = apiclus2,
fpc = ~fpc1 + fpc2,
id = ~dnum + snum)
For explicitly balanced two-stage sampling (sketched below), auxiliary variables remain balanced across both selection stages, reducing variability while maintaining randomness.
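A sketch using balancedtwostage() from the sampling package on made-up data; the argument meanings assumed here are m primary units to select, n final units in total, and PU the primary-unit labels, so treat the call as illustrative.
library(sampling)
set.seed(123)
# Hypothetical population: 200 students in 20 schools (primary units)
PU <- rep(1:20, each = 10)
score <- rnorm(200, mean = 650, sd = 50)
X <- cbind(score)   # Balancing variable(s)
# Select 5 primary units, then 25 final units in total;
# selection = 1 assumed to draw primary units proportional to size
s <- balancedtwostage(X, selection = 1, m = 5, n = 25, PU = PU, comment = FALSE)
head(s)             # Selection indicators and inclusion probabilities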
23.6 Sample Size Determination
The appropriate sample size depends on the margin of error, confidence level, and population variability. A commonly used formula for estimating the required sample size for a proportion is:
$$n = \frac{Z^2 \, p(1 - p)}{E^2}$$
where:
$Z$ is the Z-score corresponding to the confidence level
$p$ is the estimated proportion
$E$ is the margin of error
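For example, at a 95% confidence level with $p = 0.5$ (the most conservative choice) and a margin of error of 5 percentage points, the formula gives $n \approx 385$:
Z <- qnorm(0.975) # 1.96 for a 95% confidence level
p <- 0.5          # Most conservative proportion estimate
E <- 0.05         # Margin of error of +/- 5 percentage points
n <- Z^2 * p * (1 - p) / E^2
ceiling(n)        # Round up to the next whole respondent
#> [1] 385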