4 Margin of Error
Margin of error (MoE) is a statistical measure that quantifies the uncertainty in survey results and other sample-based estimates. Adding and subtracting the MoE from an estimate gives a range (a confidence interval) within which the true population parameter is likely to fall at a stated confidence level.
4.1 Why is MoE Important?
When conducting surveys or experiments, we rarely measure an entire population. Instead, we take a sample and use it to infer information about the whole group. However, because a sample is only a subset of the population, there is always some level of error. The margin of error helps account for this uncertainty.
4.2 Importance of Sample Size
In statistics, sample size plays a crucial role in determining the reliability and stability of an estimate. A larger sample size leads to more precise and consistent estimates of the population parameters. Here’s why:
- Reduces Variability: Smaller samples tend to exhibit greater fluctuations in their estimates, whereas larger samples provide more stable and reliable results.
- Improves Accuracy: A small sample may fail to accurately represent the overall population. However, as the sample size increases, the sample mean converges toward the true population mean (Law of Large Numbers).
- Minimizes Sampling Error: Larger samples result in smaller sampling errors (the standard error of the mean shrinks in proportion to 1/√n), making conclusions from a properly drawn sample more generalizable to the broader population.
- Enhances Statistical Power: A greater sample size increases the statistical power of tests, improving the ability to detect true differences or effects.
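As a rough numeric illustration of the points above, the sketch below computes the standard error of the sample mean, σ/√n, for several sample sizes. The function name `standard_error` is hypothetical, and σ = 15 is an assumption chosen to match the population SD used in the simulation in this section.

```python
import math

def standard_error(sigma, n):
    """Standard error of the sample mean for population SD sigma and sample size n."""
    return sigma / math.sqrt(n)

sigma = 15  # assumed population SD (same value as the simulated population)
for n in (20, 50, 100, 500):
    print(f"n = {n:>3}: standard error = {standard_error(sigma, n):.2f}")
```

Going from n = 20 to n = 500 (25 times the data) shrinks the standard error by a factor of 5, from about 3.35 to about 0.67.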
To gain a clearer understanding of this concept, let’s visualize how different sample sizes impact the distribution of sample means:
4.2.1 Python Code
import numpy as np
import pandas as pd
import plotly.express as px
# Set random seed for reproducibility
np.random.seed(123)
# Generate a normally distributed population
population = np.random.normal(loc=50, scale=15, size=10000)  # Mean=50, SD=15
# Function to take samples and compute mean
def sample_means(sample_size, n_samples=1000):
    means = [np.mean(np.random.choice(population, sample_size, replace=True))
             for _ in range(n_samples)]
    return pd.DataFrame({'SampleSize': f"n = {sample_size}", 'Mean': means})
# Create sample distributions for different sizes
samples_20 = sample_means(20)
samples_50 = sample_means(50)
samples_100 = sample_means(100)
samples_500 = sample_means(500)
# Combine all into one dataset
sample_data = pd.concat([samples_20, samples_50, samples_100, samples_500])
# Define high-contrast colors for readability
custom_colors = {
    "n = 20": "#D72638",   # Bright Red
    "n = 50": "#F49D37",   # Deep Orange
    "n = 100": "#3F88C5",  # Strong Blue
    "n = 500": "#2E933C"   # Bold Green
}
# Create an interactive violin plot (without jitter)
fig = px.violin(sample_data, x="SampleSize", y="Mean", color="SampleSize",
                box=True, hover_data=["SampleSize"],
                color_discrete_map=custom_colors)
# Update layout for readability
fig.update_layout(
    title="Effect of Sample Size on Stability of Estimates",
    xaxis_title="Sample Size",
    yaxis_title="Sample Mean",
    template="plotly_white",
    font=dict(size=16),
    legend_title_text="Sample Size"
)
# Display the plot
# fig.show()
4.2.2 R Code
# Load required libraries
library(ggplot2)
library(dplyr)
library(plotly)
set.seed(123)  # For reproducibility
# Generate a normally distributed population
population <- rnorm(10000, mean = 50, sd = 15)
# Function to take samples and compute mean
sample_means <- function(sample_size, n_samples = 1000) {
  means <- replicate(n_samples, mean(sample(population, sample_size, replace = TRUE)))
  return(data.frame(SampleSize = paste0("n = ", sample_size), Mean = means))
}
# Create sample distributions for different sizes
samples_20 <- sample_means(20)
samples_50 <- sample_means(50)
samples_100 <- sample_means(100)
samples_500 <- sample_means(500)
# Combine all into one dataset
sample_data <- bind_rows(samples_20, samples_50, samples_100, samples_500)
# Convert SampleSize to a factor for better visualization
sample_data$SampleSize <- factor(sample_data$SampleSize,
                                 levels = c("n = 20",
                                            "n = 50",
                                            "n = 100",
                                            "n = 500"))
# Define high-contrast colors for readability
custom_colors <- c("n = 20" = "#D72638",   # Bright Red
                   "n = 50" = "#F49D37",   # Deep Orange
                   "n = 100" = "#3F88C5",  # Strong Blue
                   "n = 500" = "#2E933C")  # Bold Green
# Create violin plot with matching outline, avoiding duplicate legends
p <- ggplot(sample_data, aes(x = SampleSize,
                             y = Mean,
                             fill = SampleSize)) +
  geom_violin(alpha = 0.6, color = NA) +  # Remove outline from legend
  geom_boxplot(width = 0.1, fill = "white",
               outlier.shape = NA, aes(color = SampleSize), size = 0.6) +
  scale_fill_manual(values = custom_colors) +  # Custom fill colors
  scale_color_manual(values = custom_colors, guide = "none") +
  labs(
    title = "Effect of Sample Size on Stability of Estimates",
    x = "Sample Size", y = "Sample Mean",
    fill = "Sample Size"
  ) +
  theme_minimal(base_size = 14) +
  theme(
    plot.title = element_text(size = 14, face = "bold"),
    axis.title = element_text(size = 12),
    legend.position = "top"
  )
# Convert ggplot to interactive Plotly plot
ggplotly(p)
4.3 Factors Affecting Sample Size
Determining the appropriate sample size is crucial for obtaining accurate and meaningful results in statistical analysis. Several key factors influence the required sample size:
- Population Variability (Standard Deviation, SD)
The more diverse a population, the larger the sample needed to capture its full characteristics. If a population has high variability, a small sample may not be representative, leading to unreliable estimates.
- Confidence Level
The confidence level (e.g., 90%, 95%, or 99%) describes how often intervals built this way would contain the true population value over repeated sampling. A higher confidence level uses a larger Z-score and therefore requires a larger sample size for the same margin of error.
- Margin of Error (MoE)
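The margin of error is the largest difference expected between the sample estimate and the true population value at the chosen confidence level. A smaller margin of error (e.g., ±1% instead of ±5%) requires a larger sample: because the required sample size grows with the inverse square of the margin of error, halving the margin roughly quadruples the sample size.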
- Population Size
For small populations, a larger share of the population must be sampled to achieve reliable results. For large populations, the required sample size depends very little on the population size, and beyond a certain point each additional observation yields diminishing returns in accuracy.
- Study Complexity and Statistical Power
More complex studies (e.g., subgroup analysis, machine learning models) require larger sample sizes to ensure meaningful results. In hypothesis testing, a larger sample size increases statistical power, making it easier to detect true effects.
While a larger sample size improves accuracy, it also requires more time, cost, and resources. The goal is to find the optimal balance that ensures reliable results without unnecessary effort.
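The population-size factor can be illustrated with a small sketch. It assumes the usual sample-size formula for a mean, n0 = (Zσ/E)², together with the standard finite population correction n = n0 / (1 + (n0 − 1)/N). The correction is not stated in this chapter, and the function names and parameter values (σ = 15, Z = 1.96, E = 2) are illustrative assumptions.

```python
import math

def base_sample_size(sigma, z, moe):
    """Sample size for estimating a mean, ignoring population size."""
    return (z * sigma / moe) ** 2

def fpc_sample_size(n0, population_size):
    """Apply the finite population correction to an uncorrected size n0."""
    return n0 / (1 + (n0 - 1) / population_size)

n0 = base_sample_size(sigma=15, z=1.96, moe=2)  # about 216.09
for N in (500, 2_000, 10_000, 1_000_000):
    print(f"N = {N:>9,}: n = {math.ceil(fpc_sample_size(n0, N))}")
```

Under these assumptions the required sample is 152 for N = 500 (about 30% of the population) but only 217 for N = 1,000,000, which is why population size matters mainly when the population is small.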
4.3.1 Python Code
import numpy as np
import pandas as pd
import plotly.express as px
# Sample Size Calculation Function
def calculate_sample_size(std_dev, confidence_z, margin_of_error):
    return ((confidence_z * std_dev) / margin_of_error) ** 2
# Define Parameters
sd_values = np.arange(5, 35, 5)  # Population standard deviations
confidence_levels = [1.645, 1.96, 2.576]  # 90%, 95%, 99% Z-scores
margin_errors = np.arange(1, 11, 1)  # Different margin of errors
# Generate Data for Population Variability Impact
var_data = pd.DataFrame([(sd, moe, calculate_sample_size(sd, 1.96, moe))
                         for sd in sd_values for moe in margin_errors],
                        columns=["SD", "MarginError", "SampleSize"])
var_data["MarginError"] = var_data["MarginError"].astype(str)
# Generate Data for Confidence Level Impact
conf_data = pd.DataFrame([(z, moe, calculate_sample_size(15, z, moe))
                          for z in confidence_levels for moe in margin_errors],
                         columns=["ConfidenceZ", "MarginError", "SampleSize"])
conf_data["ConfidenceZ"] = conf_data["ConfidenceZ"].map({1.645: "90%",
                                                         1.96: "95%", 2.576: "99%"})
conf_data["MarginError"] = conf_data["MarginError"].astype(str)
# Generate Data for Margin of Error Impact
me_data = pd.DataFrame({"MarginError": margin_errors,
                        "SampleSize": [calculate_sample_size(15, 1.96, moe)
                                       for moe in margin_errors]})
# Plot 1: Effect of Population Variability
fig1 = px.line(var_data, x="SD", y="SampleSize", color="MarginError",
               title="Effect of Population Variability on Sample Size",
               labels={"SD": "Population Standard Deviation (SD)",
                       "SampleSize": "Required Sample Size"},
               color_discrete_sequence=px.colors.qualitative.Set1)
# Plot 2: Effect of Confidence Level
fig2 = px.line(conf_data, x="ConfidenceZ", y="SampleSize", color="MarginError",
               title="Effect of Confidence Level on Sample Size",
               labels={"ConfidenceZ": "Confidence Level",
                       "SampleSize": "Required Sample Size"},
               color_discrete_sequence=px.colors.qualitative.Set2,
               markers=True)
# Plot 3: Effect of Margin of Error
fig3 = px.line(me_data, x="MarginError", y="SampleSize",
               title="Effect of Margin of Error on Sample Size",
               labels={"MarginError": "Margin of Error",
                       "SampleSize": "Required Sample Size"},
               markers=True, line_shape="spline",
               color_discrete_sequence=["blue"])  # Single color for clarity
# Show Plots
fig1.show()
fig2.show()
fig3.show()
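The Z-scores hard-coded above (1.645, 1.96, 2.576) are the two-sided critical values of the standard normal distribution. As a minimal sketch, they can be derived from the confidence level with Python's standard library; the helper name `z_for_confidence` is hypothetical.

```python
from statistics import NormalDist

def z_for_confidence(confidence):
    """Two-sided critical value of the standard normal for a confidence level."""
    alpha = 1 - confidence
    return NormalDist().inv_cdf(1 - alpha / 2)

for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%}: Z = {z_for_confidence(level):.3f}")
```

This prints 1.645, 1.960, and 2.576 for the 90%, 95%, and 99% levels, matching the values used in the code above.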
4.3.2 R Code
# Load required libraries
library(plotly)
library(dplyr)
# Sample Size Calculation Function
calculate_sample_size <- function(sd, confidence_z, margin_error) {
  return(((confidence_z * sd) / margin_error)^2)
}
# Define Parameters
sd_values <- seq(5, 30, by = 5)  # Population standard deviations
confidence_levels <- c(1.645, 1.96, 2.576)  # 90%, 95%, 99% Z-scores
margin_errors <- seq(1, 10, by = 1)  # Different margin of errors
# Generate Data for Variability Impact
var_data <- expand.grid(SD = sd_values, MarginError = margin_errors)
var_data$SampleSize <- mapply(calculate_sample_size, var_data$SD, 1.96, var_data$MarginError)
# Generate Data for Confidence Level Impact
conf_data <- expand.grid(ConfidenceZ = confidence_levels, MarginError = margin_errors)
conf_data$SampleSize <- mapply(calculate_sample_size, 15, conf_data$ConfidenceZ, conf_data$MarginError)
conf_data$ConfidenceZ <- factor(conf_data$ConfidenceZ, levels = confidence_levels, labels = c("90%", "95%", "99%"))
# Generate Data for Margin of Error Impact
me_data <- data.frame(
  MarginError = margin_errors,
  SampleSize = calculate_sample_size(15, 1.96, margin_errors)
)
# Create Plotly Plots
# Plot 1: Effect of Population Variability
p1 <- plot_ly(var_data, x = ~SD, y = ~SampleSize, color = ~as.factor(MarginError),
              type = 'scatter', mode = 'lines', line = list(width = 2)) %>%
  layout(title = "Effect of Population Variability on Sample Size",
         xaxis = list(title = "Population Standard Deviation (SD)"),
         yaxis = list(title = "Required Sample Size"),
         legend = list(title = list(text = "Margin of Error")))
# Plot 2: Effect of Confidence Level
p2 <- plot_ly(conf_data, x = ~ConfidenceZ, y = ~SampleSize, color = ~as.factor(MarginError),
              type = 'scatter', mode = 'lines+markers', line = list(width = 2)) %>%
  layout(title = "Effect of Confidence Level on Sample Size",
         xaxis = list(title = "Confidence Level"),
         yaxis = list(title = "Required Sample Size"),
         legend = list(title = list(text = "Margin of Error")))
# Plot 3: Effect of Margin of Error
p3 <- plot_ly(me_data, x = ~MarginError, y = ~SampleSize,
              type = 'scatter', mode = 'lines+markers', line = list(color = 'blue', width = 2)) %>%
  layout(title = "Effect of Margin of Error on Sample Size",
         xaxis = list(title = "Margin of Error"),
         yaxis = list(title = "Required Sample Size"))
# Display Interactive Plots
p1
p2
p3
4.4 Probability Sample Size
Probability sampling ensures that every member of the population has a known, non-zero chance of being selected. Here are examples of different probability sampling methods and how to calculate the required sample size.
4.4.1 Simple Random Sampling (SRS)
Every individual in the population has an equal chance of being selected.
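Example: Estimating Average Test Scores
A school wants to estimate the average math test score of its students.
- Population size (N) = 2,000 students
- Standard deviation (σ) = 15
- Desired confidence level = 95% (Z = 1.96)
- Margin of error (E) = 2 points
The required sample size is:
$$n = \left(\frac{Z \sigma}{E}\right)^2 = \left(\frac{1.96 \times 15}{2}\right)^2 = 14.7^2 = 216.09$$
Conclusion: Rounding up, the school should randomly select 217 students to estimate the average score within ±2 points at 95% confidence.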
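The calculation in this example can be sketched in a few lines. The sketch assumes the usual formula n = (Zσ/E)² rounded up, which reproduces the 217 in the conclusion, and then draws the sample from a hypothetical frame of student IDs 1 to 2,000. The function name, the ID scheme, and the seed are illustrative assumptions.

```python
import math
import random

def srs_sample_size(sigma, z, margin_of_error):
    """Required n for estimating a mean: (z * sigma / E)^2, rounded up."""
    return math.ceil((z * sigma / margin_of_error) ** 2)

n = srs_sample_size(sigma=15, z=1.96, margin_of_error=2)

# Hypothetical sampling frame: student IDs 1..2000
rng = random.Random(2024)  # fixed seed only so the draw is repeatable
selected_ids = rng.sample(range(1, 2001), n)  # sampling without replacement

print(n)                       # 217
print(len(set(selected_ids)))  # 217 distinct students
```

Because `rng.sample` draws without replacement, every student has the same chance of selection and no student is picked twice.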
4.4.2 Stratified Random Sampling
The population is divided into subgroups (strata) based on characteristics (e.g., gender, grade level), and a random sample is taken from each.
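Example: Employee Satisfaction Survey
A company wants to assess employee satisfaction across three departments.
- Departments: HR (100 employees), IT (200 employees), Sales (300 employees)
- Total Population (N) = 600
- Required Sample Size (n) = 200
To ensure proportional representation, each department's sample size is its share of the population multiplied by n:
$$n_h = \frac{N_h}{N} \times n$$
- HR: (100 / 600) × 200 ≈ 33
- IT: (200 / 600) × 200 ≈ 67
- Sales: (300 / 600) × 200 = 100
Conclusion: The company should survey 33 HR, 67 IT, and 100 Sales employees.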
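A minimal sketch of proportional allocation for this example follows. The function name `proportional_allocation` is hypothetical, and simple rounding is an assumption: it works for these numbers, but in general rounded allocations may not sum exactly to n and may need a small adjustment.

```python
def proportional_allocation(strata_sizes, total_sample):
    """Allocate total_sample across strata in proportion to stratum size."""
    population = sum(strata_sizes.values())
    return {name: round(total_sample * size / population)
            for name, size in strata_sizes.items()}

departments = {"HR": 100, "IT": 200, "Sales": 300}
allocation = proportional_allocation(departments, total_sample=200)
print(allocation)                 # {'HR': 33, 'IT': 67, 'Sales': 100}
print(sum(allocation.values()))   # 200
```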
4.4.3 Systematic Sampling
Every k-th individual is selected from an ordered list.
Example: Checking Product Quality
A factory produces 10,000 smartphones daily. The quality team inspects 500 of them.
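To select samples systematically, first compute the sampling interval:
$$k = \frac{N}{n} = \frac{10{,}000}{500} = 20$$
A random starting point between 1 and 20 is chosen (e.g., 7), then every 20th phone is selected: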
7, 27, 47, 67, 87…
Conclusion: A total of 500 phones will be inspected systematically.
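The selection rule in this example can be sketched as follows, assuming the sampling interval is k = N/n = 10,000/500 = 20 and units are numbered from 1. The function name and its `start`/`seed` parameters are illustrative assumptions.

```python
import random

def systematic_sample(population_size, sample_size, start=None, seed=None):
    """Return 1-based positions: a start in 1..k, then every k-th unit."""
    k = population_size // sample_size  # sampling interval
    if start is None:
        start = random.Random(seed).randint(1, k)  # random start within the first interval
    return list(range(start, population_size + 1, k))[:sample_size]

positions = systematic_sample(10_000, 500, start=7)
print(positions[:5])   # [7, 27, 47, 67, 87]
print(len(positions))  # 500
```

Passing `start=7` reproduces the sequence in the example; leaving `start` unset picks the starting point at random, which is what makes the method a probability sample.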
4.4.4 Cluster Sampling
The population is divided into clusters, and entire clusters are randomly selected.
Example: Measuring Household Electricity Consumption
A city has 10 districts, each with 5,000 households. Instead of surveying individuals across all districts, researchers randomly select 3 districts and survey all households within them.
Conclusion: With 5,000 households per district, 3 districts × 5,000 = 15,000 households will be surveyed.
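As a small sketch of the cluster selection in this example, the code below picks 3 of 10 districts at random and counts the households that would be surveyed. The district names, the function name, and the seed are hypothetical.

```python
import random

def cluster_sample(clusters, n_clusters, seed=None):
    """Randomly pick whole clusters; every unit in a picked cluster is surveyed."""
    rng = random.Random(seed)
    return rng.sample(sorted(clusters), n_clusters)

# Hypothetical frame: 10 districts with 5,000 households each
districts = {f"District {i}": 5_000 for i in range(1, 11)}

chosen = cluster_sample(districts, n_clusters=3, seed=7)
total_households = sum(districts[d] for d in chosen)
print(chosen)
print(total_households)  # 15000
```

The randomness applies to districts, not households, so the sample size is fixed by the sizes of whichever clusters are drawn.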
4.5 Non-Probability Sample Size
Non-probability sampling is used when random selection is not feasible, often due to time, cost, or accessibility constraints. In these methods, sample size determination is more subjective, as it depends on practical considerations rather than statistical formulas. Below are common non-probability sampling methods and how sample sizes are determined.
4.5.1 Convenience Sampling
Selection is based on availability and willingness of participants.
Example: Coffee Shop Customer Survey
A coffee shop wants to understand customer preferences. The manager surveys the first 100 customers who visit in the morning.
Sample Size Consideration:
- There is no fixed formula; the sample size depends on time constraints and available respondents.
- The manager stops at 100 responses, assuming this is sufficient for insights.
4.5.2 Purposive (Judgmental) Sampling
Participants are handpicked based on specific criteria.
Example: Expert Opinion on Climate Change
A researcher wants insights from climate scientists. They select 50 experts based on qualifications and experience.
Sample Size Consideration:
- Depends on the research objective.
- Common practice: 10-50 experts for specialized studies.
4.5.3 Quota Sampling
The researcher sets quotas for different subgroups.
Example: Market Research for a New Product
A company surveys 500 people, ensuring:
- 250 males, 250 females
- 100 young adults (18-25), 200 middle-aged (26-45), 200 older adults (46+)
Sample Size Consideration:
- Based on market representation and business needs.
- Typically uses proportional allocation.
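One way to picture how quotas are filled in practice is the sketch below: respondents are accepted in arrival order until each gender and age quota is met. The arrival stream is an entirely hypothetical, deterministic stand-in for real respondents, and `fill_quotas` is an illustrative name rather than a standard routine.

```python
from collections import Counter
from itertools import cycle, islice, product

def fill_quotas(respondents, gender_quota, age_quota):
    """Accept respondents in arrival order while both of their quotas have room."""
    by_gender, by_age = Counter(), Counter()
    accepted = []
    target = sum(gender_quota.values())
    for person in respondents:
        g, a = person["gender"], person["age_group"]
        if by_gender[g] < gender_quota.get(g, 0) and by_age[a] < age_quota.get(a, 0):
            accepted.append(person)
            by_gender[g] += 1
            by_age[a] += 1
        if len(accepted) == target:
            break  # all quotas filled
    return accepted, by_gender, by_age

# Hypothetical arrival stream cycling through every gender/age combination
combos = product(["male", "female"], ["18-25", "26-45", "46+"])
stream = ({"gender": g, "age_group": a} for g, a in islice(cycle(combos), 3000))

accepted, by_gender, by_age = fill_quotas(
    stream,
    gender_quota={"male": 250, "female": 250},
    age_quota={"18-25": 100, "26-45": 200, "46+": 200},
)
print(len(accepted), dict(by_gender), dict(by_age))
```

Selection here depends on who happens to arrive first rather than on random draws, which is the reason quota sampling cannot support a conventional margin of error.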
4.5.4 Snowball Sampling
Used for hard-to-reach populations, where participants refer others.
Example: Studying Homeless Individuals
A researcher interviews 10 homeless individuals, who refer others, growing the sample to 100 participants.
Sample Size Consideration:
- Sample grows organically until data saturation (new responses add little value).
- Often 50-200 participants.
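The referral process can be sketched as a breadth-first walk over a referral network: interview the initial participants, then the people they refer, and so on until a target size is reached or referrals run out. The network below and the name `snowball_sample` are hypothetical and serve only to show the mechanics.

```python
from collections import deque

def snowball_sample(seeds, referrals, target_size):
    """Breadth-first recruitment: interview seeds, then the people they refer."""
    sample, seen = [], set()
    queue = deque(seeds)
    while queue and len(sample) < target_size:
        person = queue.popleft()
        if person in seen:
            continue  # already interviewed
        seen.add(person)
        sample.append(person)
        queue.extend(referrals.get(person, []))  # add this person's referrals
    return sample

# Hypothetical referral network (who refers whom)
referrals = {"A": ["C", "D"], "B": ["D", "E"], "C": ["F"], "E": ["A", "G"]}

print(snowball_sample(["A", "B"], referrals, target_size=6))
# ['A', 'B', 'C', 'D', 'E', 'F']
print(len(snowball_sample(["A", "B"], referrals, target_size=100)))
# 7 (the network is exhausted before the target is reached)
```

The final sample depends heavily on who the initial participants are, so people outside their social networks can never be reached.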
4.6 Real-World Examples
In this study, you will compare how Probability Sampling and Non-Probability Sampling handle Margin of Error (MoE) when estimating university students’ monthly food expenses. Apply every method from both Probability Sampling and Non-Probability Sampling to gain a comprehensive understanding of their differences and effectiveness.
4.6.1 Selecting Sampling Methods
A. Probability Sampling
- Simple Random Sampling (SRS)
  - Randomly select students from the entire population using a random number generator or a random number table.
- Stratified Sampling
  - Divide the population into groups (strata), such as faculty or academic year.
  - Randomly select students from each stratum proportionally.
- Systematic Sampling
  - Select every k-th student from a sorted list of students.
  - Example: If there are 10,000 students and a sample of 200 is needed, select every 50th student (10,000/200 = 50).
- Cluster Sampling
  - Randomly select some classes or student groups and survey all students in those groups.
  - Example: Choose 5 random classes and interview all students in those classes.
- Multi-Stage Sampling
  - Combine multiple techniques, such as:
    - Stage 1: Randomly select faculties
    - Stage 2: Randomly select classes within faculties
    - Stage 3: Randomly select students within those classes
B. Non-Probability Sampling
- Convenience Sampling
  - Interview students who are easily accessible, such as in the cafeteria or library.
- Quota Sampling
  - Ensure a fixed number of students are surveyed in each category (e.g., 50 students per faculty) without random selection.
- Judgmental (Purposive) Sampling
  - Select students who are believed to be representative, such as dormitory residents who may have more consistent food expenses.
- Snowball Sampling
  - Start with a few students and ask them to recommend other students for the survey.
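Of the methods listed above, multi-stage sampling is the hardest to picture, so here is a minimal sketch of its three stages. The nested dictionary standing in for the university, the function name, and all sizes are hypothetical; a real study would start from actual enrollment lists.

```python
import random

def multi_stage_sample(university, n_faculties, n_classes, n_students, seed=None):
    """Select faculties, then classes within them, then students within those classes."""
    rng = random.Random(seed)
    chosen = []
    for faculty in rng.sample(sorted(university), n_faculties):                 # Stage 1
        classes = university[faculty]
        for cls in rng.sample(sorted(classes), min(n_classes, len(classes))):   # Stage 2
            students = classes[cls]
            picked = rng.sample(students, min(n_students, len(students)))       # Stage 3
            chosen.extend((faculty, cls, s) for s in picked)
    return chosen

# Hypothetical sampling frame: 4 faculties x 5 classes x 30 students
university = {
    f"Faculty {f}": {
        f"Class {f}{c}": [f"S{f}{c}-{i:02d}" for i in range(30)]
        for c in range(1, 6)
    }
    for f in "ABCD"
}

sample = multi_stage_sample(university, n_faculties=2, n_classes=3, n_students=10, seed=42)
print(len(sample))  # 2 faculties x 3 classes x 10 students = 60
```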
4.6.2 Data Collection
- Apply each sampling method to select student samples.
- Record their monthly food expenses.
- Store data in a spreadsheet or statistical software for analysis.
4.6.3 Calculate Margin of Error
For Probability Sampling, calculate the Margin of Error (MoE) using the formula:
$$MoE = Z \times \frac{s}{\sqrt{n}}$$
Where:
- $Z$ = 1.96 (for 95% confidence level)
- $s$ = Sample standard deviation (calculated from data)
- $n$ = Sample size
Perform MoE calculations for each Probability Sampling method and compare the results.
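A minimal sketch of this calculation, using only Python's standard library, is shown below. The expense values are made up purely for illustration and carry no real-world meaning; the function name `margin_of_error` is likewise an assumption.

```python
import math
import statistics

def margin_of_error(values, z=1.96):
    """MoE for a sample mean: Z * s / sqrt(n), with s the sample standard deviation."""
    s = statistics.stdev(values)  # sample SD (divides by n - 1)
    return z * s / math.sqrt(len(values))

# Hypothetical monthly food expenses (arbitrary currency units), illustration only
expenses = [1200, 950, 1100, 1300, 1050, 990, 1250, 1150, 1020, 1080]

mean = statistics.mean(expenses)
moe = margin_of_error(expenses)
print(f"mean = {mean:.1f}, MoE = ±{moe:.1f}")
print(f"95% interval: {mean - moe:.1f} to {mean + moe:.1f}")
```

Running the same function on the sample from each probability method gives directly comparable MoE values, since only the data passed in changes.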
4.6.4 Bias Analysis
- Explain the sources of bias in each Non-Probability Sampling method.
- Discuss how this bias affects survey results and the difference compared to Probability Sampling.
4.6.5 Determine Sample Size
Use the following formula to determine required Sample Size for MoE =
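$$n = \left(\frac{Z \times s}{MoE}\right)^2$$
Where:
- $Z$ = Z-score for the chosen confidence level (1.96 for 95%)
- $s$ = Standard deviation (estimated from the collected data)
- $MoE$ = Target margin of error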
Calculate the minimum required sample size and interpret the results.
4.6.6 Create a Study Report
The final report should include:
- Introduction – Purpose of the study and importance of MoE in sampling
- Sampling Methods Used – Explanation of all methods applied
- MoE Calculations for Probability Sampling
- Bias Analysis in Non-Probability Sampling
- Comparison of All Methods
- Required Sample Size for MoE =
- Conclusion and Recommendations
4.6.7 Additional Instructions
- Use Excel, R, or Python to calculate MoE and generate comparison charts.
- If using Python, use numpy, pandas, and scipy.stats for analysis.
- If using R, use qnorm(), sqrt(), and mean() functions.
- Ensure all calculations and results are clearly interpreted.
- Submit your answer on RPubs or Google Colab.