5 Data Cleaning
Data cleaning is the process of identifying and correcting or removing inaccurate, incomplete, irrelevant, or erroneous data in a dataset. It is a crucial step in data analysis and machine learning to ensure accurate and reliable results.
5.1 Data Cleaning Process
In the data analysis process, the first and most critical step is ensuring that the data is clean and ready for further processing. Poor-quality data can lead to errors in analysis and generate misleading insights. Therefore, Data Cleaning is an essential component of the data preprocessing pipeline.
The following mind map illustrates the key stages of the data cleaning process: duplicate removal, handling of missing values, normalization, and the treatment of anomalies such as outliers and inconsistencies. By following these steps, data quality can be significantly improved, leading to more accurate and reliable analytical outcomes.
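As a compact illustration, these stages can be sketched end-to-end on a toy DataFrame (the column names and values here are invented for illustration, not taken from the study case):

```python
import pandas as pd
import numpy as np

# Toy data with a duplicate row, a missing value, inconsistent text, and an outlier
raw = pd.DataFrame({
    'price': [100.0, 100.0, 102.0, np.nan, 5000.0, 98.0],
    'label': [' a ', ' a ', 'B', 'b', 'a', 'B'],
})

clean = raw.drop_duplicates().copy()                             # 1. remove duplicates
clean['price'] = clean['price'].fillna(clean['price'].median())  # 2. fill missing values
clean['label'] = clean['label'].str.strip().str.lower()          # 3. normalize text
q1, q3 = clean['price'].quantile([0.25, 0.75])                   # 4. drop outliers (IQR rule)
iqr = q3 - q1
clean = clean[clean['price'].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]
print(clean)
```

Each of these steps is treated in detail, with the study-case dataset, in the sections below.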
5.2 Study Case Example
5.2.1 About the Dataset
This is a synthetic (dummy) dataset covering the past three years, with 12 variables that intentionally include missing values, duplicates, and outliers.
| Feature | Description |
|---|---|
| Date | Observation date (with missing & duplicate values) |
| CPO_Price | CPO price per ton (includes outliers) |
| Production | Production volume (contains missing values) |
| Exports | CPO export volume |
| Imports | CPO import volume |
| USD_IDR | Exchange rate (USD to IDR) |
| Market_Sentiment | Market sentiment (text inconsistencies) |
| Demand_Index | Synthetic demand index (100-200 range) |
| Supply_Index | Synthetic supply index (90-190 range) |
| Rainfall | Rainfall in mm |
| Temperature | Temperature in Celsius |
| Political_Stability | Political stability score (1-10 scale) |
5.2.2 Generate Raw Dataset
Python Code
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
# Generate 1000 dates
start_date = datetime.today() - timedelta(days=3*365)
dates = [start_date + timedelta(days=i) for i in range(1000)]

# Random Data Generation
np.random.seed(42)
cpo_prices = np.random.normal(800, 50, 1000)
cpo_prices[np.random.randint(0, 1000, 5)] = np.random.choice([2000, 2500, 50, 100], 5)

production = np.random.randint(90, 130, 1000).astype(float)
production[np.random.randint(0, 1000, 20)] = np.nan

exports = np.random.randint(200, 500, 1000)
imports = np.random.randint(50, 150, 1000)

usd_idr = np.random.normal(14500, 300, 1000)
demand_index = np.random.randint(100, 200, 1000)
supply_index = np.random.randint(90, 190, 1000)

rainfall = np.random.randint(50, 200, 1000)
temperature = np.random.uniform(25, 35, 1000)

political_stability = np.random.randint(1, 10, 1000).astype(float)
political_stability[np.random.randint(0, 1000, 15)] = np.nan

sentiments = np.random.choice(["Positive", "neutral", "NEGATIVE"], 1000)

# Introduce duplicate and missing values in Date
dates[100] = dates[99]
dates[500] = None

# Create DataFrame
df = pd.DataFrame({
    'Date': dates, 'CPO_Price': cpo_prices, 'Production': production,
    'Exports': exports, 'Imports': imports, 'USD_IDR': usd_idr,
    'Market_Sentiment': sentiments, 'Demand_Index': demand_index,
    'Supply_Index': supply_index, 'Rainfall': rainfall,
    'Temperature': temperature, 'Political_Stability': political_stability
})

# Ensure dataset is sorted by Date for Time Series modeling
df = df.sort_values(by='Date')
print(df.head())
Date CPO_Price ... Temperature Political_Stability
0 2022-03-18 10:28:05.851223 824.835708 ... 31.604697 3.0
1 2022-03-19 10:28:05.851223 793.086785 ... 29.708164 5.0
2 2022-03-20 10:28:05.851223 832.384427 ... 26.989298 3.0
3 2022-03-21 10:28:05.851223 876.151493 ... 31.016866 5.0
4 2022-03-22 10:28:05.851223 788.292331 ... 25.606831 3.0
[5 rows x 12 columns]
R Code
library(dplyr)
library(lubridate)
library(tidyr)
# Generate 1000 dates
set.seed(42)
start_date <- Sys.Date() - years(3)
dates <- seq.Date(from = start_date, by = "day", length.out = 1000)

# Random Data Generation
cpo_prices <- rnorm(1000, mean = 800, sd = 50)
cpo_prices[sample(1:1000, 5)] <- sample(c(2000, 2500, 50, 100), 5, replace = TRUE)

production <- as.numeric(sample(90:130, 1000, replace = TRUE))
production[sample(1:1000, 20)] <- NA

exports <- sample(200:500, 1000, replace = TRUE)
imports <- sample(50:150, 1000, replace = TRUE)

usd_idr <- rnorm(1000, mean = 14500, sd = 300)
demand_index <- sample(100:200, 1000, replace = TRUE)
supply_index <- sample(90:190, 1000, replace = TRUE)

rainfall <- sample(50:200, 1000, replace = TRUE)
temperature <- runif(1000, min = 25, max = 35)

political_stability <- as.numeric(sample(1:10, 1000, replace = TRUE))
political_stability[sample(1:1000, 15)] <- NA

sentiments <- sample(c("Positive", "neutral", "NEGATIVE"), 1000, replace = TRUE)

# Introduce duplicate and missing values in Date
dates[100] <- dates[99]
dates[500] <- NA

# Create DataFrame
df <- data.frame(
  Date = dates, CPO_Price = cpo_prices, Production = production,
  Exports = exports, Imports = imports, USD_IDR = usd_idr,
  Market_Sentiment = sentiments, Demand_Index = demand_index,
  Supply_Index = supply_index, Rainfall = rainfall,
  Temperature = temperature, Political_Stability = political_stability
)

# Ensure dataset is sorted by Date for Time Series modeling
df <- df %>% arrange(Date)
print(head(df))
Date CPO_Price Production Exports Imports USD_IDR Market_Sentiment
1 2022-03-17 868.5479 102 244 60 14442.79 Positive
2 2022-03-18 771.7651 115 287 50 14765.80 neutral
3 2022-03-19 818.1564 99 208 77 14457.90 Positive
4 2022-03-20 50.0000 90 467 107 14033.63 NEGATIVE
5 2022-03-21 820.2134 130 305 148 14881.60 NEGATIVE
6 2022-03-22 794.6938 92 490 111 14127.73 Positive
Demand_Index Supply_Index Rainfall Temperature Political_Stability
1 159 188 124 26.44452 9
2 131 182 74 32.59184 3
3 198 143 118 26.40042 2
4 144 185 158 25.83467 4
5 198 96 97 27.25106 3
6 132 177 192 27.59168 7
5.3 Data Cleaning Process
5.3.1 Handling Missing Values
- Remove rows/columns with excessive missing data (dropna() in Pandas, na.omit() in R)
- Fill missing values using mean, median, mode, or interpolation (fillna() in Pandas, tidyr::fill() or replace_na() in R)
Python Code
# Convert 'Date' to datetime and fix missing values
df['Date'] = pd.to_datetime(df['Date']).interpolate()

# Fill missing numerical values
df['Production'] = df['Production'].ffill()
df['Political_Stability'] = df['Political_Stability'].fillna(df['Political_Stability'].mean())

# Fill categorical missing values
df['Market_Sentiment'] = df['Market_Sentiment'].fillna(df['Market_Sentiment'].mode()[0])

# Drop remaining missing values
df_dropna = df.dropna()
R Code
library(zoo)
# Convert 'Date' to datetime and fix missing values
df$Date <- as.Date(df$Date)
# Note: na.approx() returns a numeric vector, which is why Date prints as day
# numbers in the output below
df$Date <- zoo::na.approx(df$Date, na.rm = FALSE)

# Fill missing numerical values
df$Production <- tidyr::fill(df, Production, .direction = "down")$Production
df$Political_Stability[is.na(df$Political_Stability)] <- mean(df$Political_Stability, na.rm = TRUE)

# Fill categorical missing values
df$Market_Sentiment[is.na(df$Market_Sentiment)] <- names(sort(table(df$Market_Sentiment), decreasing = TRUE))[1]

# Drop remaining missing values
df_dropna <- na.omit(df)
print(head(df_dropna))
Date CPO_Price Production Exports Imports USD_IDR Market_Sentiment
1 19068 868.5479 102 244 60 14442.79 Positive
2 19069 771.7651 115 287 50 14765.80 neutral
3 19070 818.1564 99 208 77 14457.90 Positive
4 19071 50.0000 90 467 107 14033.63 NEGATIVE
5 19072 820.2134 130 305 148 14881.60 NEGATIVE
6 19073 794.6938 92 490 111 14127.73 Positive
Demand_Index Supply_Index Rainfall Temperature Political_Stability
1 159 188 124 26.44452 9
2 131 182 74 32.59184 3
3 198 143 118 26.40042 2
4 144 185 158 25.83467 4
5 198 96 97 27.25106 3
6 132 177 192 27.59168 7
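Two imputation strategies mentioned in the bullet list above but not used in the snippets are median imputation and linear interpolation. A minimal pandas sketch (the values are made up):

```python
import pandas as pd
import numpy as np

s = pd.Series([90.0, np.nan, 110.0, np.nan, 120.0])

median_filled = s.fillna(s.median())  # robust to outliers, unlike the mean
interpolated = s.interpolate()        # linear fill between neighbouring values

print(median_filled.tolist())  # [90.0, 110.0, 110.0, 110.0, 120.0]
print(interpolated.tolist())   # [90.0, 100.0, 110.0, 115.0, 120.0]
```

Interpolation is usually preferable for time series such as this one, because the filled value respects the local trend rather than the global centre.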
5.3.2 Removing Duplicates
- Use drop_duplicates() in Pandas (Python)
- Use distinct() in R (dplyr)
Python Code
df_duplicates = df_dropna.drop_duplicates()
R Code
# Remove duplicate rows
df_duplicates <- df_dropna %>% distinct()
print(head(df_duplicates))
Date CPO_Price Production Exports Imports USD_IDR Market_Sentiment
1 19068 868.5479 102 244 60 14442.79 Positive
2 19069 771.7651 115 287 50 14765.80 neutral
3 19070 818.1564 99 208 77 14457.90 Positive
4 19071 50.0000 90 467 107 14033.63 NEGATIVE
5 19072 820.2134 130 305 148 14881.60 NEGATIVE
6 19073 794.6938 92 490 111 14127.73 Positive
Demand_Index Supply_Index Rainfall Temperature Political_Stability
1 159 188 124 26.44452 9
2 131 182 74 32.59184 3
3 198 143 118 26.40042 2
4 144 185 158 25.83467 4
5 198 96 97 27.25106 3
6 132 177 192 27.59168 7
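Because the duplicate in this dataset was introduced in the Date column specifically, it can also be useful to deduplicate on that column alone; drop_duplicates() accepts subset and keep parameters for this. A sketch with toy values:

```python
import pandas as pd

df_demo = pd.DataFrame({'Date': ['2022-01-01', '2022-01-01', '2022-01-02'],
                        'CPO_Price': [800.0, 805.0, 810.0]})

# Compare rows on 'Date' only and keep the first occurrence of each date
deduped = df_demo.drop_duplicates(subset='Date', keep='first')
print(deduped['CPO_Price'].tolist())  # [800.0, 810.0]
```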
5.3.3 Fixing Text Formatting Issues
- Ensure uniform formats for dates, numbers, and text
- Use regex to clean text data
Python Code
df_duplicates['Market_Sentiment'] = df_duplicates['Market_Sentiment'].str.capitalize().str.strip()
R Code
# Standardize text formatting in Market_Sentiment (trim whitespace, unify capitalization)
library(stringr)
df_duplicates$Market_Sentiment <- str_to_sentence(str_trim(df_duplicates$Market_Sentiment))
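The second bullet above mentions regex cleaning, which the snippets in this section do not actually use. One possible sketch with pandas' str.replace (the sample strings are invented):

```python
import pandas as pd

s = pd.Series(['  Positive!!', 'NEGATIVE??', 'neu tral'])

# Drop non-letter characters, remove internal whitespace, then normalize case
clean = (s.str.replace(r'[^A-Za-z\s]', '', regex=True)
          .str.replace(r'\s+', '', regex=True)
          .str.capitalize())
print(clean.tolist())  # ['Positive', 'Negative', 'Neutral']
```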
5.3.4 Handling Outliers
- Visualize with boxplots to detect anomalies
- Use IQR (Interquartile Range) or Z-score method
Python Code
# Remove outliers using IQR
Q1 = df_duplicates['CPO_Price'].quantile(0.25)
Q3 = df_duplicates['CPO_Price'].quantile(0.75)
IQR = Q3 - Q1
df_outlier1 = df_duplicates[(df_duplicates['CPO_Price'] >= (Q1 - 1.5 * IQR)) &
                            (df_duplicates['CPO_Price'] <= (Q3 + 1.5 * IQR))]

# Remove outliers using Z-Score for 'USD_IDR'
from scipy import stats
df_outlier2 = df_duplicates[(np.abs(stats.zscore(df_duplicates['USD_IDR'])) < 3)]
R Code
# Remove outliers using IQR
Q1 <- quantile(df_duplicates$CPO_Price, 0.25)
Q3 <- quantile(df_duplicates$CPO_Price, 0.75)
IQR <- Q3 - Q1
df_outlier1 <- df_duplicates[df_duplicates$CPO_Price >= (Q1 - 1.5 * IQR) &
                             df_duplicates$CPO_Price <= (Q3 + 1.5 * IQR), ]

# Remove outliers using Z-Score for 'USD_IDR'
z_scores <- scale(df_duplicates$USD_IDR)
df_outlier2 <- df_duplicates[abs(z_scores) < 3, ]
5.3.5 Standardizing & Normalizing
Apply Min-Max Scaling or Standard Scaling for statistical models and ML
Python Code
# Normalize 'Temperature' & 'Rainfall'
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df[['Temperature', 'Rainfall']] = scaler.fit_transform(df[['Temperature', 'Rainfall']])

# Standardize 'Demand_Index' & 'Supply_Index'
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df[['Demand_Index', 'Supply_Index']] = scaler.fit_transform(df[['Demand_Index', 'Supply_Index']])
R Code
# Normalize 'Temperature' & 'Rainfall' using Min-Max Scaling
normalize <- function(x) {
  return((x - min(x)) / (max(x) - min(x)))
}
df$Temperature <- normalize(df$Temperature)
df$Rainfall <- normalize(df$Rainfall)

# Standardize 'Demand_Index' & 'Supply_Index' using Z-score Standardization
df$Demand_Index <- scale(df$Demand_Index)
df$Supply_Index <- scale(df$Supply_Index)
5.3.6 Finalizing the Dataset
- Drop unnecessary columns
- Filter data that does not align with the analysis context
Python Code
# Ensure there are no missing values
df.dropna(inplace=True)

# Convert categorical variables into numerical format (if needed for regression)
df = pd.get_dummies(df, columns=['Market_Sentiment'], drop_first=True)

# Final Cleaned Dataset
print(df.head())
Date ... Market_Sentiment_neutral
0 2022-03-18 10:28:05.851223 ... True
1 2022-03-19 10:28:05.851223 ... False
2 2022-03-20 10:28:05.851223 ... False
3 2022-03-21 10:28:05.851223 ... True
4 2022-03-22 10:28:05.851223 ... True
[5 rows x 13 columns]
R Code
# Ensure there are no missing values
df <- na.omit(df)

# Convert categorical variables into numerical format (if needed for regression)
dummies <- model.matrix(~ Market_Sentiment - 1, data = df) %>% as.data.frame()

# Bind back to original dataframe (excluding the original categorical column)
df <- cbind(df[, !colnames(df) %in% "Market_Sentiment"], dummies)
# Print first few rows of the cleaned dataset
head(df)
Market_SentimentNEGATIVE Market_Sentimentneutral Market_SentimentPositive
1 0 0 1
2 0 1 0
3 0 0 1
4 1 0 0
5 1 0 0
6 0 0 1
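The bullets at the start of this subsection also mention dropping unnecessary columns and filtering rows that do not fit the analysis context, which the snippets above do not show. A minimal pandas sketch with invented names and values:

```python
import pandas as pd

df_demo = pd.DataFrame({'CPO_Price': [800.0, -5.0, 820.0],
                        'Notes': ['ok', 'typo', 'ok']})

df_demo = df_demo.drop(columns=['Notes'])    # drop a column not needed for the analysis
df_demo = df_demo[df_demo['CPO_Price'] > 0]  # filter rows that make no sense in context
print(df_demo.shape)  # (2, 1)
```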