5 Data Cleaning
Data cleaning is the process of identifying and correcting or removing inaccurate, incomplete, irrelevant, or erroneous data in a dataset. It is a crucial step in data analysis and machine learning to ensure accurate and reliable results.
5.1 Data Cleaning Process
In the data analysis process, the first and most critical step is ensuring that the data is clean and ready for further processing. Poor-quality data can lead to errors in analysis and generate misleading insights. Therefore, Data Cleaning is an essential component of the data preprocessing pipeline.
The following Mind Map illustrates the key stages of the Data Cleaning process, including duplicate removal, handling of missing values, normalization, and the treatment of anomalies such as outliers and inconsistencies. By following these steps, data quality can be significantly improved, leading to more accurate and reliable analytical outcomes.
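Before applying these steps, it helps to quantify how dirty the data actually is. The following Python sketch (assuming a pandas DataFrame named df, such as the one generated in Section 5.2 below) counts missing values and duplicate rows so that each cleaning step can be prioritized.

import pandas as pd
# Audit data quality before cleaning
print(df.isna().sum())        # missing values per column
print(df.duplicated().sum())  # number of fully duplicated rows
print(df.dtypes)              # spot inconsistent column types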

5.2 Case Study
5.2.1 About the Dataset
This is a synthetic (dummy) dataset covering the past three years, with 12 variables and deliberately injected missing values, duplicates, and outliers.
| Feature | Description |
|---|---|
| Date | Observation date (with missing & duplicate values) |
| CPO_Price | CPO price per ton (includes outliers) |
| Production | Production volume (contains missing values) |
| Exports | CPO export volume |
| Imports | CPO import volume |
| USD_IDR | Exchange rate (USD to IDR) |
| Market_Sentiment | Market sentiment (text inconsistencies) |
| Demand_Index | Synthetic demand index (100-200 range) |
| Supply_Index | Synthetic supply index (90-190 range) |
| Rainfall | Rainfall in mm |
| Temperature | Temperature in Celsius |
| Political_Stability | Political stability score (1-10 scale) |
5.2.2 Generate Raw Dataset
Python Code
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
# Generate 1000 dates
start_date = datetime.today() - timedelta(days=3*365)
dates = [start_date + timedelta(days=i) for i in range(1000)]
# Random Data Generation
np.random.seed(42)
cpo_prices = np.random.normal(800, 50, 1000)
cpo_prices[np.random.randint(0, 1000, 5)] = np.random.choice([2000, 2500, 50, 100], 5)
production = np.random.randint(90, 130, 1000).astype(float)
production[np.random.randint(0, 1000, 20)] = np.nan
exports = np.random.randint(200, 500, 1000)
imports = np.random.randint(50, 150, 1000)
usd_idr = np.random.normal(14500, 300, 1000)
demand_index = np.random.randint(100, 200, 1000)
supply_index = np.random.randint(90, 190, 1000)
rainfall = np.random.randint(50, 200, 1000)
temperature = np.random.uniform(25, 35, 1000)
political_stability = np.random.randint(1, 10, 1000).astype(float)
political_stability[np.random.randint(0, 1000, 15)] = np.nan
sentiments = np.random.choice(["Positive", "neutral", "NEGATIVE"], 1000)
# Introduce duplicate and missing values in Date
dates[100] = dates[99]
dates[500] = None
# Create DataFrame
df = pd.DataFrame({
'Date': dates, 'CPO_Price': cpo_prices, 'Production': production,
'Exports': exports, 'Imports': imports, 'USD_IDR': usd_idr,
'Market_Sentiment': sentiments, 'Demand_Index': demand_index,
'Supply_Index': supply_index, 'Rainfall': rainfall,
'Temperature': temperature, 'Political_Stability': political_stability
})
# Ensure dataset is sorted by Date for Time Series modeling
df = df.sort_values(by='Date')
print(df.head())
Date CPO_Price ... Temperature Political_Stability
0 2022-07-04 13:53:53.569358 824.835708 ... 31.604697 3.0
1 2022-07-05 13:53:53.569358 793.086785 ... 29.708164 5.0
2 2022-07-06 13:53:53.569358 832.384427 ... 26.989298 3.0
3 2022-07-07 13:53:53.569358 876.151493 ... 31.016866 5.0
4 2022-07-08 13:53:53.569358 788.292331 ... 25.606831 3.0
[5 rows x 12 columns]
R Code
library(dplyr)
library(lubridate)
library(tidyr)
# Generate 1000 dates
set.seed(42)
start_date <- Sys.Date() - years(3)
dates <- seq.Date(from = start_date, by = "day", length.out = 1000)
# Random Data Generation
cpo_prices <- rnorm(1000, mean = 800, sd = 50)
cpo_prices[sample(1:1000, 5)] <- sample(c(2000, 2500, 50, 100), 5, replace = TRUE)
production <- as.numeric(sample(90:130, 1000, replace = TRUE))
production[sample(1:1000, 20)] <- NA
exports <- sample(200:500, 1000, replace = TRUE)
imports <- sample(50:150, 1000, replace = TRUE)
usd_idr <- rnorm(1000, mean = 14500, sd = 300)
demand_index <- sample(100:200, 1000, replace = TRUE)
supply_index <- sample(90:190, 1000, replace = TRUE)
rainfall <- sample(50:200, 1000, replace = TRUE)
temperature <- runif(1000, min = 25, max = 35)
political_stability <- as.numeric(sample(1:10, 1000, replace = TRUE))
political_stability[sample(1:1000, 15)] <- NA
sentiments <- sample(c("Positive", "neutral", "NEGATIVE"), 1000, replace = TRUE)
# Introduce duplicate and missing values in Date
dates[100] <- dates[99]
dates[500] <- NA
# Create DataFrame
df <- data.frame(
Date = dates, CPO_Price = cpo_prices, Production = production,
Exports = exports, Imports = imports, USD_IDR = usd_idr,
Market_Sentiment = sentiments, Demand_Index = demand_index,
Supply_Index = supply_index, Rainfall = rainfall,
Temperature = temperature, Political_Stability = political_stability
)
# Ensure dataset is sorted by Date for Time Series modeling
df <- df %>% arrange(Date)
print(head(df))
Date CPO_Price Production Exports Imports USD_IDR Market_Sentiment
1 2022-07-03 868.5479 102 244 60 14442.79 Positive
2 2022-07-04 771.7651 115 287 50 14765.80 neutral
3 2022-07-05 818.1564 99 208 77 14457.90 Positive
4 2022-07-06 50.0000 90 467 107 14033.63 NEGATIVE
5 2022-07-07 820.2134 130 305 148 14881.60 NEGATIVE
6 2022-07-08 794.6938 92 490 111 14127.73 Positive
Demand_Index Supply_Index Rainfall Temperature Political_Stability
1 159 188 124 26.44452 9
2 131 182 74 32.59184 3
3 198 143 118 26.40042 2
4 144 185 158 25.83467 4
5 198 96 97 27.25106 3
6 132 177 192 27.59168 7
5.3 Applying the Data Cleaning Process
5.3.1 Handling Missing Values
- Remove rows/columns with excessive missing data (`dropna()` in Pandas, `na.omit()` in R)
- Fill missing values using the mean, median, mode, or interpolation (`fillna()` in Pandas, `tidyr::fill()`/`replace_na()` in R)
Python Code
# Convert 'Date' to datetime and fix missing values
df['Date'] = pd.to_datetime(df['Date']).interpolate()
# Fill missing numerical values
df['Production'] = df['Production'].ffill()
df['Political_Stability'] = df['Political_Stability'].fillna(df['Political_Stability'].mean())
# Fill categorical missing values
df['Market_Sentiment'] = df['Market_Sentiment'].fillna(df['Market_Sentiment'].mode()[0])
# Drop remaining missing values
df_dropna = df.dropna()
R Code
library(zoo)
# Convert 'Date' to datetime and fix missing values
df$Date <- as.Date(df$Date)
df$Date <- as.Date(zoo::na.approx(as.numeric(df$Date), na.rm = FALSE), origin = "1970-01-01")
# Fill missing numerical values
df$Production <- tidyr::fill(df, Production, .direction = "down")$Production
df$Political_Stability[is.na(df$Political_Stability)] <- mean(df$Political_Stability, na.rm = TRUE)
# Fill categorical missing values
df$Market_Sentiment[is.na(df$Market_Sentiment)] <- names(sort(table(df$Market_Sentiment), decreasing = TRUE))[1]
# Drop remaining missing values
df_dropna <- na.omit(df)
print(head(df_dropna))
Date CPO_Price Production Exports Imports USD_IDR Market_Sentiment
1 2022-07-03 868.5479 102 244 60 14442.79 Positive
2 2022-07-04 771.7651 115 287 50 14765.80 neutral
3 2022-07-05 818.1564 99 208 77 14457.90 Positive
4 2022-07-06 50.0000 90 467 107 14033.63 NEGATIVE
5 2022-07-07 820.2134 130 305 148 14881.60 NEGATIVE
6 2022-07-08 794.6938 92 490 111 14127.73 Positive
Demand_Index Supply_Index Rainfall Temperature Political_Stability
1 159 188 124 26.44452 9
2 131 182 74 32.59184 3
3 198 143 118 26.40042 2
4 144 185 158 25.83467 4
5 198 96 97 27.25106 3
6 132 177 192 27.59168 7
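The bullet list above also mentions the median and interpolation as fill strategies. The lines below are a minimal Python sketch of those alternatives (the column choices are illustrative assumptions, not part of the original pipeline).

# Alternative fills (sketch): median imputation and linear interpolation
df['Production'] = df['Production'].fillna(df['Production'].median())
df['USD_IDR'] = df['USD_IDR'].interpolate(method='linear')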
5.3.2 Removing Duplicates
- Use `drop_duplicates()` in Pandas (Python)
- Use `distinct()` in R (dplyr)
Python Code
df_duplicates = df_dropna.drop_duplicates()
R Code
# Remove duplicate rows
df_duplicates <- df_dropna %>% distinct()
print(head(df_duplicates))
Date CPO_Price Production Exports Imports USD_IDR Market_Sentiment
1 2022-07-03 868.5479 102 244 60 14442.79 Positive
2 2022-07-04 771.7651 115 287 50 14765.80 neutral
3 2022-07-05 818.1564 99 208 77 14457.90 Positive
4 2022-07-06 50.0000 90 467 107 14033.63 NEGATIVE
5 2022-07-07 820.2134 130 305 148 14881.60 NEGATIVE
6 2022-07-08 794.6938 92 490 111 14127.73 Positive
Demand_Index Supply_Index Rainfall Temperature Political_Stability
1 159 188 124 26.44452 9
2 131 182 74 32.59184 3
3 198 143 118 26.40042 2
4 144 185 158 25.83467 4
5 198 96 97 27.25106 3
6 132 177 192 27.59168 7
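With no arguments, `drop_duplicates()` and `distinct()` remove only rows that are identical across every column. To deduplicate on a key column such as Date, a subset can be specified; the Python line below is a sketch of that variant (keeping the first observation per date is an assumption, not the author's stated rule).

# Deduplicate on the Date key only (sketch); keep='last' is the other common choice
df_duplicates = df_duplicates.drop_duplicates(subset=['Date'], keep='first')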
5.3.3 Fixing Text Formatting Issues
- Ensure uniform formats for dates, numbers, and text
- Use regex to clean text data
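The code below normalizes casing and trims whitespace; for messier text, the regex point above applies. The following Python sketch (the pattern and chaining are illustrative assumptions) collapses repeated whitespace and strips non-letter characters before normalizing case.

# Regex cleanup (sketch): trim, collapse whitespace, drop stray characters
df_duplicates['Market_Sentiment'] = (
    df_duplicates['Market_Sentiment']
    .str.strip()
    .str.replace(r'\s+', ' ', regex=True)
    .str.replace(r'[^A-Za-z ]', '', regex=True)
    .str.capitalize()
)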
Python Code
df_duplicates['Market_Sentiment'] = df_duplicates['Market_Sentiment'].str.capitalize().str.strip()
R Code
# Standardize text formatting in Market_Sentiment (mirrors the Python step above)
df_duplicates$Market_Sentiment <- stringr::str_to_sentence(trimws(df_duplicates$Market_Sentiment))
print(head(df_duplicates))
Date CPO_Price Production Exports Imports USD_IDR Market_Sentiment
1 2022-07-03 868.5479 102 244 60 14442.79 Positive
2 2022-07-04 771.7651 115 287 50 14765.80 Neutral
3 2022-07-05 818.1564 99 208 77 14457.90 Positive
4 2022-07-06 50.0000 90 467 107 14033.63 Negative
5 2022-07-07 820.2134 130 305 148 14881.60 Negative
6 2022-07-08 794.6938 92 490 111 14127.73 Positive
Demand_Index Supply_Index Rainfall Temperature Political_Stability
1 159 188 124 26.44452 9
2 131 182 74 32.59184 3
3 198 143 118 26.40042 2
4 144 185 158 25.83467 4
5 198 96 97 27.25106 3
6 132 177 192 27.59168 7
5.3.4 Handling Outliers
- Visualize with boxplots to detect anomalies (see the sketch after this list)
- Use the IQR (interquartile range) or Z-score method
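A boxplot makes the injected CPO_Price spikes visible before any thresholding. A minimal Python sketch, assuming matplotlib is available:

import matplotlib.pyplot as plt
# Visualize CPO_Price to spot outliers before removing them
plt.boxplot(df_duplicates['CPO_Price'])
plt.title('CPO_Price distribution (outliers visible as points)')
plt.show()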
Python Code
# Remove outliers using IQR
Q1 = df_duplicates['CPO_Price'].quantile(0.25)
Q3 = df_duplicates['CPO_Price'].quantile(0.75)
IQR = Q3 - Q1
df_outlier1 = df_duplicates[(df_duplicates['CPO_Price'] >= (Q1 - 1.5 * IQR)) &
                            (df_duplicates['CPO_Price'] <= (Q3 + 1.5 * IQR))]
# Remove outliers using Z-Score for 'USD_IDR'
from scipy import stats
df_outlier2 = df_duplicates[(np.abs(stats.zscore(df_duplicates['USD_IDR'])) < 3)]
R Code
# Remove outliers using IQR
Q1 <- quantile(df_duplicates$CPO_Price, 0.25)
Q3 <- quantile(df_duplicates$CPO_Price, 0.75)
IQR <- Q3 - Q1
df_outlier1 <- df_duplicates[df_duplicates$CPO_Price >= (Q1 - 1.5 * IQR) &
df_duplicates$CPO_Price <= (Q3 + 1.5 * IQR), ]
# Remove outliers using Z-Score for 'USD_IDR'
z_scores <- scale(df_duplicates$USD_IDR)
df_outlier2 <- df_duplicates[abs(z_scores) < 3, ]
5.3.5 Standardizing & Normalizing
Apply Min-Max Scaling or Standard Scaling for statistical models and ML
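For reference: min-max scaling maps each value to (x - min(x)) / (max(x) - min(x)), compressing the column into the [0, 1] range, while standard (Z-score) scaling maps it to (x - mean(x)) / sd(x), yielding zero mean and unit variance. Min-max is often preferred for bounded inputs; standardization is often preferred when a model assumes roughly normal features.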
Python Code
# Normalize 'Temperature' & 'Rainfall'
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df[['Temperature', 'Rainfall']] = scaler.fit_transform(df[['Temperature', 'Rainfall']])
# Standardize 'Demand_Index' & 'Supply_Index'
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df[['Demand_Index', 'Supply_Index']] = scaler.fit_transform(df[['Demand_Index', 'Supply_Index']])
R Code
# Normalize 'Temperature' & 'Rainfall' using Min-Max Scaling
normalize <- function(x) {
return((x - min(x)) / (max(x) - min(x)))
}
df$Temperature <- normalize(df$Temperature)
df$Rainfall <- normalize(df$Rainfall)
# Standardize 'Demand_Index' & 'Supply_Index' using Z-score Standardization
# scale() returns a matrix; as.numeric() keeps the columns as plain vectors
df$Demand_Index <- as.numeric(scale(df$Demand_Index))
df$Supply_Index <- as.numeric(scale(df$Supply_Index))
5.3.6 Finalizing the Dataset
- Drop unnecessary columns
- Filter data that does not align with the analysis context
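As a complement to the two bullets above, the Python lines below sketch how a column drop and a context filter might look (the column name and date cutoff are purely illustrative assumptions, not part of the case study):

# Hypothetical examples: drop an unneeded column and restrict the analysis window
df = df.drop(columns=['Imports'])      # assumption: Imports is out of scope
df = df[df['Date'] >= '2023-01-01']    # assumption: the analysis starts in 2023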
Python Code
# Ensure there are no missing values
df.dropna(inplace=True)
# Convert categorical variables into numerical format (if needed for regression)
df = pd.get_dummies(df, columns=['Market_Sentiment'], drop_first=True)
# Final Cleaned Dataset
print(df.head())
Date ... Market_Sentiment_neutral
0 2022-07-04 13:53:53.569358 ... True
1 2022-07-05 13:53:53.569358 ... False
2 2022-07-06 13:53:53.569358 ... False
3 2022-07-07 13:53:53.569358 ... True
4 2022-07-08 13:53:53.569358 ... True
[5 rows x 13 columns]
R Code
# Ensure there are no missing values
df <- na.omit(df)
# Convert categorical variables into numerical format (if needed for regression)
dummies <- model.matrix(~ Market_Sentiment - 1, data = df) %>% as.data.frame()
# Bind the dummy columns back, dropping the original categorical column
df <- cbind(df[, !colnames(df) %in% "Market_Sentiment"], dummies)
# Print the encoded sentiment columns of the cleaned dataset
head(dummies)
Market_SentimentNEGATIVE Market_Sentimentneutral Market_SentimentPositive
1 0 0 1
2 0 1 0
3 0 0 1
4 1 0 0
5 1 0 0
6 0 0 1
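As a final sanity check, the short Python sketch below (assuming the cleaned df from the Python pipeline above) verifies that the cleaning goals were met:

# Confirm no missing values or duplicate rows remain
assert df.isna().sum().sum() == 0, "missing values remain"
assert df.duplicated().sum() == 0, "duplicate rows remain"
print(df.shape)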