Sampling is the process of selecting a subset of individuals from a larger population to make statistical inferences. It can be broadly categorized into Probability Sampling and Non-Probability Sampling.
3.1 Probability Sampling
Probability sampling ensures that every individual in the population has a known, nonzero chance of being selected. This allows for generalizable and unbiased results.
3.1.1 Simple Random Sampling
Simple Random Sampling is a method where each individual in the population has an equal chance of being chosen. This technique ensures that the sample is random and unbiased by using random selection methods.
Characteristics:
Equal Probability → Every individual has the same chance of being selected.
No Specific Pattern → The selection process is entirely random.
Objective Representation → The method avoids bias and ensures a fair representation of the population.
There are two primary ways to perform random sampling:
Using a Random Number Generator
Suppose we have 1000 students in a university, and we need a random sample of 100 students. The steps are:
Assign numbers from 1 to 1000 to each student.
Use a random number generator to select 100 unique numbers.
The students corresponding to those numbers will be included in the sample.
Python Code:
import pandas as pd

# Create a dataset (example: student data)
students = pd.DataFrame({'ID': range(1, 1001),
                         'Name': ['Student ' + str(i) for i in range(1, 1001)]})

# Set seed for reproducibility
random_state = 123

# Randomly select 100 students from the dataset
sample_students = students.sample(n=100, random_state=random_state)

# Print the selected sample
print(sample_students)
R Code:
# Create a dataset (example: student data)
students <- data.frame(ID = 1:1000, Name = paste("Student", 1:1000))

# Set seed for reproducibility
set.seed(123)

# Randomly select 100 students from the dataset
sample_students <- students[sample(nrow(students), 100, replace = FALSE), ]

# Print the selected sample
print(head(sample_students))
The Lottery Method is one of the Simple Random Sampling techniques where each individual in the population has an equal chance of being selected. This method is called “Lottery” because it resembles a lottery system, such as a raffle or prize draw, where names or numbers are placed in a container, shuffled, and randomly drawn.
Python Code: Lottery Method
import random

# List of students (example names)
students = ["Syifa", "Nabila", "Alya", "Isnaini", "Bagas", "Alfayed", "Shalfa",
            "Olivia", "Nabila", "Fika", "Luthfi", "Nabil", "Joans", "Riyadh",
            "Rachelia", "Nova", "Zain", "Ragil", "Dadan", "Dwi", "Chello", "Siti"]

# Number of samples to draw
num_samples = 5

# Shuffle the list (simulating shuffling the papers)
random.shuffle(students)

# Randomly draw the required number of samples
selected_students = random.sample(students, num_samples)

# Print the selected names
print("Selected students:", selected_students)
R Code: Lottery Method
# List of students (example names)
students <- c("Syifa", "Nabila", "Alya", "Isnaini", "Bagas", "Alfayed", "Shalfa",
              "Olivia", "Nabila", "Fika", "Luthfi", "Nabil", "Joans", "Riyadh",
              "Rachelia", "Nova", "Zain", "Ragil", "Dadan", "Dwi", "Chello", "Siti")

# Number of samples to draw
num_samples <- 5

# Shuffle the list (simulating shuffling the papers)
students <- sample(students)

# Randomly draw the required number of samples
selected_students <- sample(students, num_samples)

# Print the selected names
print(selected_students)
[1] "Nabila" "Luthfi" "Syifa" "Alya" "Alfayed"
Here are the advantages and disadvantages of Simple Random Sampling:
Advantages:
Minimizes Bias → Every individual has an equal chance, making the process fair.
Simple to Implement → Especially with software tools.
Applicable to Large Populations → Works well with technology-assisted selection.
Disadvantages:
Requires a Complete Population List → A full database of individuals is needed.
Inefficient for Large Populations → If done manually, it can be time-consuming.
Might Not Ensure Proportional Representation → Some subgroups may be underrepresented by pure randomness.
Simple Random Sampling is a fair, easy, and objective method for selecting representative samples in research, surveys, and experiments. It is highly effective when a complete list of the population is available.
3.1.2 Systematic Sampling
Systematic Sampling is a probabilistic sampling technique where elements are selected from a population at fixed intervals after choosing a random starting point. Instead of selecting samples purely at random, this method follows a structured approach, making it more efficient and easier to implement than Simple Random Sampling.
Python Code: Systematic Sampling
import numpy as np
import pandas as pd

# Create a sample population dataset
data = pd.DataFrame({'Student_ID': np.arange(1, 101),
                     'Name': ['Student_' + str(i) for i in range(1, 101)]})

# Define sample size and interval
N = len(data)   # Population size
n = 10          # Desired sample size
k = N // n      # Sampling interval

# Randomly choose a starting point
np.random.seed(42)
start = np.random.randint(0, k)

# Select every k-th element
systematic_sample = data.iloc[start::k]

# Display results
print("Selected Sample:")
print(systematic_sample)
If the population follows a specific pattern or cycle, Systematic Sampling may introduce bias. For example, if an employee work schedule is arranged in a repeating morning-afternoon-night shift pattern and we select every 3rd employee, we might only sample morning-shift workers, leading to biased results. To mitigate this risk, researchers should check for patterns in the population before applying Systematic Sampling.
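To make the shift example concrete, the short sketch below builds a hypothetical roster whose shifts repeat in a fixed morning-afternoon-night cycle; selecting every 3rd employee then returns workers from a single shift only. The roster size, column names, and starting point are illustrative assumptions, not taken from the example above.

import pandas as pd

# Hypothetical roster: 30 employees whose shifts repeat in a fixed 3-step cycle
roster = pd.DataFrame({
    'Employee_ID': range(1, 31),
    'Shift': ['Morning', 'Afternoon', 'Night'] * 10
})

k = 3        # Sampling interval equal to the cycle length
start = 0    # Starting point (any start yields a single shift here)

# Systematic selection: every k-th employee from the starting point
systematic_sample = roster.iloc[start::k]

# All selected employees share the same shift, illustrating the periodicity bias
print(systematic_sample['Shift'].value_counts())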
3.1.3 Stratified Sampling
Stratified Sampling is a probability sampling technique in which the population is divided into subgroups (strata) based on shared characteristics. A sample is then drawn proportionally from each stratum to ensure that all groups are adequately represented.
This method is particularly useful when the population is heterogeneous and contains distinct categories, such as gender, age groups, income levels, or education levels.
Example Scenario:
Imagine a university with 10,000 students divided into three faculties:
Faculty     Population    Proportion (%)    Sample Size (out of 500)
Science     5,000         50%               250
Arts        3,000         30%               150
Business    2,000         20%               100
The total sample size is 500 students, and the number of students from each faculty is selected proportionally to its representation in the population.
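Python Code: Stratified Sampling
The setup below is a sketch that assumes the same 22 students and faculty assignments used in the R example later in this subsection, stored in a pandas DataFrame named df; train_test_split from scikit-learn is imported here and performs the proportional draw in the next chunk.

import pandas as pd
from sklearn.model_selection import train_test_split

# Example dataset (assumed to match the R example below)
df = pd.DataFrame({
    'Name': ["Syifa", "Nabila", "Alya", "Isnaini", "Rizky", "Alfayed", "Whirdyana",
             "Olivia", "Nabila A", "Fika", "Luthfi", "Nabil", "Joans", "Riyadh",
             "Rachelia", "Nova", "Zain", "Ragil", "Dadan", "Dwi", "Chello", "Siti"],
    'Faculty': ["Science", "Arts", "Science", "Business", "Science", "Arts", "Business",
                "Arts", "Science", "Business", "Science", "Arts", "Business", "Arts",
                "Science", "Business", "Science", "Arts", "Business", "Arts",
                "Science", "Business"]
})

# Show the original data distribution
print(df['Faculty'].value_counts())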
Faculty
Science 8
Arts 7
Business 7
Name: count, dtype: int64
# Stratified Sampling (30% from each group)
stratified_sample, _ = train_test_split(df, test_size=0.7,
                                        stratify=df['Faculty'], random_state=42)

# Display sample group sizes
print("\nSampled data distribution (should be ~30% of each group):")
print(stratified_sample['Faculty'].value_counts())
Sampled data distribution (should be ~30% of each group):
Faculty
Science 2
Business 2
Arts 2
Name: count, dtype: int64
# Show sampled data
print("\nStratified Sample:")
Stratified Sample:
print(stratified_sample)
Name Faculty
16 Zain Science
18 Dadan Business
5 Alfayed Arts
14 Rachelia Science
17 Ragil Arts
3 Isnaini Business
R Code: Stratified Sampling
library(dplyr)
library(purrr)

# Dataset with given names
data <- data.frame(
  Name = c("Syifa", "Nabila", "Alya", "Isnaini", "Rizky", "Alfayed", "Whirdyana",
           "Olivia", "Nabila A", "Fika", "Luthfi", "Nabil", "Joans", "Riyadh",
           "Rachelia", "Nova", "Zain", "Ragil", "Dadan", "Dwi", "Chello", "Siti"),
  Faculty = c("Science", "Arts", "Science", "Business", "Science", "Arts", "Business",
              "Arts", "Science", "Business", "Science", "Arts", "Business", "Arts",
              "Science", "Business", "Science", "Arts", "Business", "Arts",
              "Science", "Business")
)

# Show original data distribution
cat("Original data distribution:\n")
Original data distribution:
print(table(data$Faculty))
Arts Business Science
7 7 8
# Determine sample size per stratum (30% per group, rounded down)
sample_sizes <- data %>%
  count(Faculty) %>%
  mutate(sample_size = floor(n * 0.3))

# Perform stratified sampling with exact count per group
set.seed(42)
stratified_sample <- sample_sizes %>%
  split(.$Faculty) %>%
  map2(.x = ., .y = sample_sizes$sample_size,
       ~ data %>% filter(Faculty == .x$Faculty) %>% slice_sample(n = .y)) %>%
  bind_rows() %>%
  select(Name, Faculty)

# Show sampled data distribution
cat("\nSampled data distribution (should be exactly 30% of each group):\n")
Sampled data distribution (should be exactly 30% of each group):
print(table(stratified_sample$Faculty))
Arts Business Science
2 2 2
# Display the sampled dataprint(stratified_sample)
Name Faculty
1 Nabila Arts
2 Riyadh Arts
3 Isnaini Business
4 Siti Business
5 Alya Science
6 Nabila A Science
3.1.4 Cluster Sampling
Cluster Sampling is a probability sampling technique in which entire groups (clusters), rather than individuals, are randomly selected. Once a cluster is selected, all individuals within that cluster are included in the sample.
This method is widely used for large populations where individual random selection is costly or impractical. It is especially useful in geographically spread-out populations or organizational structures like schools, hospitals, or companies.
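Python Code: Cluster Sampling
Since the worked example below is in R, here is a minimal Python sketch of the same idea. The dataset is hypothetical: 15 students assigned evenly to three schools; the names and school assignments are illustrative assumptions only.

import numpy as np
import pandas as pd

# Hypothetical population: 15 students spread across 3 schools (clusters)
data = pd.DataFrame({
    'Name': ['Student_' + str(i) for i in range(1, 16)],
    'School': ['School A'] * 5 + ['School B'] * 5 + ['School C'] * 5
})

# Randomly select clusters (e.g., choose 1 out of 3 schools)
rng = np.random.default_rng(42)
selected_clusters = rng.choice(data['School'].unique(), size=1, replace=False)

# Select all individuals from the chosen cluster(s)
cluster_sample = data[data['School'].isin(selected_clusters)]

print("Selected Cluster(s):", list(selected_clusters))
print(cluster_sample)

Every member of the chosen school enters the sample, which is what distinguishes cluster sampling from stratified sampling, where a few members are drawn from every group.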
R Code: Cluster Sampling
library(dplyr)

# Assumes an existing data frame 'data' with Name and School columns (three schools)
# Randomly select clusters (e.g., choose 1 out of 3 schools)
set.seed(42)
selected_clusters <- sample(unique(data$School), size = 1)

# Select all individuals from the chosen clusters
cluster_sample <- data %>% filter(School %in% selected_clusters)

# Display results
cat("\nSelected Cluster(s):", selected_clusters, "\n")
Selected Cluster(s): School A
cat("\nCluster Sample:\n")
Cluster Sample:
print(cluster_sample)
Name School
1 Syifa School A
2 Nabila School A
3 Alya School A
4 Isnaini School A
5 Rizky School A
3.2 Non-Probability Sampling
Non-probability sampling does not provide every individual with a known chance of selection, making it prone to bias but useful in exploratory research.
3.2.1 Convenience Sampling
Convenience Sampling is a non-probability sampling method where subjects are selected based on ease of access, availability, and proximity rather than randomness. It is commonly used in exploratory research, pilot studies, or situations where time and resources are limited.
Instead of carefully choosing a representative sample, researchers select participants who are easiest to reach—such as nearby students, colleagues, or online survey respondents.
Python Code: Convenience Sampling
import pandas as pd

# Example dataset of students
data = pd.DataFrame({
    'Student_ID': range(1, 21),
    'Name': ['Student_' + str(i) for i in range(1, 21)],
    'Location': ['Campus'] * 10 + ['Online'] * 10   # 10 from campus, 10 online
})

# Selecting the first 5 students available (e.g., from campus)
convenience_sample = data.head(5)

# Display selected sample
print(convenience_sample)
3.2.2 Quota Sampling
Quota Sampling is a non-probability sampling method where researchers divide the population into subgroups (quotas) based on specific characteristics (e.g., age, gender, occupation) and select participants non-randomly to meet a predefined quota for each subgroup.
Unlike stratified random sampling, where individuals are randomly selected within each subgroup, quota sampling allows researchers to handpick individuals within quotas based on convenience or judgment, which introduces potential bias.
Python Code: Quota Sampling
import pandas as pd

# Creating a dataset with 100 individuals (50 males, 50 females)
data = pd.DataFrame({
    'ID': range(1, 101),
    'Name': ['Person_' + str(i) for i in range(1, 101)],
    'Gender': ['Male'] * 50 + ['Female'] * 50,
})

# Defining quotas: 5 males and 5 females
quota_male = data[data['Gender'] == 'Male'].head(5)
quota_female = data[data['Gender'] == 'Female'].head(5)

# Combining quota-based sample
quota_sample = pd.concat([quota_male, quota_female])

# Displaying the selected sample
print(quota_sample)
ID Name Gender
0 1 Person_1 Male
1 2 Person_2 Male
2 3 Person_3 Male
3 4 Person_4 Male
4 5 Person_5 Male
50 51 Person_51 Female
51 52 Person_52 Female
52 53 Person_53 Female
53 54 Person_54 Female
54 55 Person_55 Female
3.2.3 Snowball Sampling
Snowball Sampling is a non-probability sampling method used to study hard-to-reach or hidden populations (e.g., drug users, undocumented immigrants, people with rare diseases).
Instead of selecting participants randomly, researchers start with a small group of known individuals (seeds), who then recruit others from their social networks, creating a “snowball” effect.
Python Code: Snowball Sampling
import pandas as pd
import random

# Creating a dataset of 100 individuals
data = pd.DataFrame({
    'ID': range(1, 101),
    'Name': ['Person_' + str(i) for i in range(1, 101)],
    'Group': ['Hidden Population'] * 100
})

# Start with 2 "seed" participants (42 is just a common seed; any number can be used)
initial_sample = data.sample(n=2, random_state=42)

# Snowball effect: each round, the current participants recruit twice their number
snowball_sample = initial_sample.copy()
for _ in range(3):  # Repeat recruitment process
    new_recruits = data.sample(n=len(snowball_sample) * 2,
                               random_state=random.randint(1, 100))
    snowball_sample = pd.concat([snowball_sample, new_recruits]).drop_duplicates()

# Displaying the selected sample
print(snowball_sample.head())
ID Name Group
83 84 Person_84 Hidden Population
53 54 Person_54 Hidden Population
1 2 Person_2 Hidden Population
55 56 Person_56 Hidden Population
9 10 Person_10 Hidden Population
R Code: Snowball Sampling
# Load necessary library
library(dplyr)

# Creating a dataset
data <- data.frame(
  ID = 1:100,
  Name = paste("Person", 1:100, sep = "_"),
  Group = rep("Hidden Population", 100)
)

# Start with 2 "seed" participants
set.seed(42)
initial_sample <- sample_n(data, 2)

# Snowball effect: each round, the current participants recruit twice their number
snowball_sample <- initial_sample
for (i in 1:3) {  # Repeat recruitment process
  new_recruits <- sample_n(data, nrow(snowball_sample) * 2)
  snowball_sample <- distinct(bind_rows(snowball_sample, new_recruits))
}

# Displaying the selected sample
print(head(snowball_sample))
ID Name Group
1 49 Person_49 Hidden Population
2 65 Person_65 Hidden Population
3 25 Person_25 Hidden Population
4 74 Person_74 Hidden Population
5 18 Person_18 Hidden Population
6 47 Person_47 Hidden Population
3.3 Hybrid Sampling
In some research scenarios, Probability Sampling (random selection) and Non-Probability Sampling (subjective selection) can be combined to balance representation and practicality; this combined approach is called Hybrid Sampling.
3.3.1 Python Code: Hybrid Sampling
Here, we randomly select faculties (Probability Sampling), then select experienced members within each faculty (Judgement Sampling).
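A minimal sketch of this two-stage idea follows; the dataset is hypothetical, and the faculty names, the Years_Experience column, and the 5-year cutoff used to define "experienced" members are illustrative assumptions, not prescribed above.

import random
import pandas as pd

random.seed(42)  # For reproducibility

# Hypothetical dataset: members of four faculties with years of experience
data = pd.DataFrame({
    'Name': ['Person_' + str(i) for i in range(1, 41)],
    'Faculty': ['Science', 'Arts', 'Business', 'Engineering'] * 10,
    'Years_Experience': [random.randint(1, 20) for _ in range(40)]
})

# Step 1: Probability Sampling - randomly select 2 of the 4 faculties
selected_faculties = random.sample(list(data['Faculty'].unique()), 2)

# Step 2: Judgement Sampling - within the selected faculties, handpick
# members with more than 5 years of experience (illustrative cutoff)
hybrid_sample = data[
    data['Faculty'].isin(selected_faculties) &
    (data['Years_Experience'] > 5)
]

print("Selected faculties:", selected_faculties)
print(hybrid_sample)

The first stage keeps the faculty-level selection unbiased, while the second stage trades randomness for the researcher's judgement about who is most informative.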