5  Data Cleaning

Data cleaning is the process of identifying and correcting or removing inaccurate, incomplete, irrelevant, or erroneous data in a dataset. It is a crucial step in data analysis and machine learning to ensure accurate and reliable results.

5.1 Data Cleaning Process

In the data analysis process, one of the first and most critical steps is ensuring that the data is clean and ready for further processing. Poor-quality data can lead to errors in analysis and generate misleading insights. Data cleaning is therefore an essential component of the data preprocessing pipeline.

The outline below illustrates the key stages of the data cleaning process, including duplicate removal, handling of missing values, normalization, and the treatment of anomalies such as outliers and inconsistencies. Following these steps can significantly improve data quality, leading to more accurate and reliable analytical outcomes.

  • Remove Duplicates: use drop_duplicates() in Pandas; use distinct() in R (dplyr)
  • Handle Missing Values: remove excessive missing data (dropna(), na.omit()); fill with mean, median, or mode (fillna(), mutate())
  • Fix Data Inconsistencies: ensure uniform formats for dates, numbers, and text; use regex to clean text
  • Remove Outliers: use the IQR or Z-score method; detect anomalies using boxplots
  • Convert Data Types: ensure numerical data is not stored as strings; convert date formats to datetime
  • Normalize Data: apply Min-Max Scaling or Standard Scaling; useful for machine learning models
  • Remove Irrelevant Data: drop unnecessary columns; filter data relevant to the analysis

5.2 Case Study Example

5.2.1 About the Dataset

This is a synthetic (dummy) dataset covering the past three years, with 12 variables that include missing values, duplicates, and outliers.

Feature              Description
Date                 Observation date (with missing & duplicate values)
CPO_Price            CPO price per ton (includes outliers)
Production           Production volume (contains missing values)
Exports              CPO export volume
Imports              CPO import volume
USD_IDR              Exchange rate (USD to IDR)
Market_Sentiment     Market sentiment (text inconsistencies)
Demand_Index         Synthetic demand index (100-200 range)
Supply_Index         Synthetic supply index (90-190 range)
Rainfall             Rainfall in mm
Temperature          Temperature in Celsius
Political_Stability  Political stability score (1-10 scale)

5.2.2 Generate Raw Dataset

Python Code

import pandas as pd
import numpy as np
from datetime import datetime, timedelta

# Generate 1000 dates
start_date = datetime.today() - timedelta(days=3*365)
dates = [start_date + timedelta(days=i) for i in range(1000)]

# Random Data Generation
np.random.seed(42)
cpo_prices = np.random.normal(800, 50, 1000)
cpo_prices[np.random.randint(0, 1000, 5)] = np.random.choice([2000, 2500, 50, 100], 5)

production = np.random.randint(90, 130, 1000).astype(float)
production[np.random.randint(0, 1000, 20)] = np.nan

exports = np.random.randint(200, 500, 1000)
imports = np.random.randint(50, 150, 1000)

usd_idr = np.random.normal(14500, 300, 1000)
demand_index = np.random.randint(100, 200, 1000)
supply_index = np.random.randint(90, 190, 1000)

rainfall = np.random.randint(50, 200, 1000)
temperature = np.random.uniform(25, 35, 1000)

political_stability = np.random.randint(1, 10, 1000).astype(float)
political_stability[np.random.randint(0, 1000, 15)] = np.nan

sentiments = np.random.choice(["Positive", "neutral", "NEGATIVE"], 1000)

# Introduce duplicate and missing values in Date
dates[100] = dates[99]
dates[500] = None

# Create DataFrame
df = pd.DataFrame({
    'Date': dates, 'CPO_Price': cpo_prices, 'Production': production,
    'Exports': exports, 'Imports': imports, 'USD_IDR': usd_idr,
    'Market_Sentiment': sentiments, 'Demand_Index': demand_index,
    'Supply_Index': supply_index, 'Rainfall': rainfall,
    'Temperature': temperature, 'Political_Stability': political_stability
})

# Ensure dataset is sorted by Date for Time Series modeling
df = df.sort_values(by='Date')

print(df.head())
                        Date   CPO_Price  ...  Temperature  Political_Stability
0 2022-03-18 10:28:05.851223  824.835708  ...    31.604697                  3.0
1 2022-03-19 10:28:05.851223  793.086785  ...    29.708164                  5.0
2 2022-03-20 10:28:05.851223  832.384427  ...    26.989298                  3.0
3 2022-03-21 10:28:05.851223  876.151493  ...    31.016866                  5.0
4 2022-03-22 10:28:05.851223  788.292331  ...    25.606831                  3.0

[5 rows x 12 columns]

R Code

library(dplyr)
library(lubridate)
library(tidyr)

# Generate 1000 dates
set.seed(42)
start_date <- Sys.Date() - years(3)
dates <- seq.Date(from = start_date, by = "day", length.out = 1000)

# Random Data Generation
cpo_prices <- rnorm(1000, mean = 800, sd = 50)
cpo_prices[sample(1:1000, 5)] <- sample(c(2000, 2500, 50, 100), 5, replace = TRUE)

production <- as.numeric(sample(90:130, 1000, replace = TRUE))
production[sample(1:1000, 20)] <- NA

exports <- sample(200:500, 1000, replace = TRUE)
imports <- sample(50:150, 1000, replace = TRUE)

usd_idr <- rnorm(1000, mean = 14500, sd = 300)
demand_index <- sample(100:200, 1000, replace = TRUE)
supply_index <- sample(90:190, 1000, replace = TRUE)

rainfall <- sample(50:200, 1000, replace = TRUE)
temperature <- runif(1000, min = 25, max = 35)

political_stability <- as.numeric(sample(1:10, 1000, replace = TRUE))
political_stability[sample(1:1000, 15)] <- NA

sentiments <- sample(c("Positive", "neutral", "NEGATIVE"), 1000, replace = TRUE)

# Introduce duplicate and missing values in Date
dates[100] <- dates[99]
dates[500] <- NA

# Create DataFrame
df <- data.frame(
  Date = dates, CPO_Price = cpo_prices, Production = production,
  Exports = exports, Imports = imports, USD_IDR = usd_idr,
  Market_Sentiment = sentiments, Demand_Index = demand_index,
  Supply_Index = supply_index, Rainfall = rainfall,
  Temperature = temperature, Political_Stability = political_stability
)

# Ensure dataset is sorted by Date for Time Series modeling
df <- df %>% arrange(Date)

print(head(df))
        Date CPO_Price Production Exports Imports  USD_IDR Market_Sentiment
1 2022-03-17  868.5479        102     244      60 14442.79         Positive
2 2022-03-18  771.7651        115     287      50 14765.80          neutral
3 2022-03-19  818.1564         99     208      77 14457.90         Positive
4 2022-03-20   50.0000         90     467     107 14033.63         NEGATIVE
5 2022-03-21  820.2134        130     305     148 14881.60         NEGATIVE
6 2022-03-22  794.6938         92     490     111 14127.73         Positive
  Demand_Index Supply_Index Rainfall Temperature Political_Stability
1          159          188      124    26.44452                   9
2          131          182       74    32.59184                   3
3          198          143      118    26.40042                   2
4          144          185      158    25.83467                   4
5          198           96       97    27.25106                   3
6          132          177      192    27.59168                   7
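
Before moving on to the cleaning steps, it can help to audit the raw dataset and confirm that it really contains the issues described above. The sketch below is a minimal check in Python, not part of the generation code.

Python Code

# Count missing values per column
print(df.isna().sum())

# Count fully duplicated rows and duplicated dates
print(df.duplicated().sum())
print(df['Date'].duplicated().sum())

# Inspect the inconsistent sentiment labels
print(df['Market_Sentiment'].unique())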

5.3 Data Cleaning Process

5.3.1 Handling Missing Values

  • Remove rows/columns with excessive missing data (dropna() in Pandas, na.omit() in R)
  • Fill missing values using the mean, median, mode, or interpolation (fillna() in Pandas; mutate() with tidyr::replace_na() in R; an interpolation sketch follows at the end of this subsection)

Python Code

# Convert 'Date' to datetime and fill the missing date
# (interpolating datetime values requires a recent pandas version;
#  ffill() is a safe alternative on older releases)
df['Date'] = pd.to_datetime(df['Date'])
df['Date'] = df['Date'].interpolate()

# Fill missing numerical values
df['Production'] = df['Production'].ffill()
df['Political_Stability'] = df['Political_Stability'].fillna(df['Political_Stability'].mean())

# Fill categorical missing values
df['Market_Sentiment'] = df['Market_Sentiment'].fillna(df['Market_Sentiment'].mode()[0])

# Drop remaining missing values
df_dropna = df.dropna()

R Code

library(zoo)
# Convert 'Date' to Date class and interpolate the missing value;
# na.approx() returns a numeric vector, so convert back to Date
df$Date <- as.Date(df$Date)
df$Date <- as.Date(zoo::na.approx(df$Date, na.rm = FALSE), origin = "1970-01-01")

# Fill missing numerical values
df$Production <- tidyr::fill(df, Production, .direction = "down")$Production
df$Political_Stability[is.na(df$Political_Stability)] <- mean(df$Political_Stability, na.rm = TRUE)

# Fill categorical missing values
df$Market_Sentiment[is.na(df$Market_Sentiment)] <- names(sort(table(df$Market_Sentiment), decreasing = TRUE))[1]

# Drop remaining missing values
df_dropna <- na.omit(df)

print(head(df_dropna))
        Date CPO_Price Production Exports Imports  USD_IDR Market_Sentiment
1 2022-03-17  868.5479        102     244      60 14442.79         Positive
2 2022-03-18  771.7651        115     287      50 14765.80          neutral
3 2022-03-19  818.1564         99     208      77 14457.90         Positive
4 2022-03-20   50.0000         90     467     107 14033.63         NEGATIVE
5 2022-03-21  820.2134        130     305     148 14881.60         NEGATIVE
6 2022-03-22  794.6938         92     490     111 14127.73         Positive
  Demand_Index Supply_Index Rainfall Temperature Political_Stability
1          159          188      124    26.44452                   9
2          131          182       74    32.59184                   3
3          198          143      118    26.40042                   2
4          144          185      158    25.83467                   4
5          198           96       97    27.25106                   3
6          132          177      192    27.59168                   7
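
The interpolation mentioned in the bullet list can also be applied to numeric columns. A brief sketch in Python, as an alternative to the ffill() call above rather than an additional pipeline step:

Python Code

# Linear interpolation estimates each gap from its neighbours,
# which suits smoothly varying series better than forward-filling
df['Production'] = df['Production'].interpolate(method='linear')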

5.3.2 Removing Duplicates

  • Use drop_duplicates() in Pandas (Python)
  • Use distinct() in R (dplyr)

Python Code

df_duplicates = df_dropna.drop_duplicates()

R Code

# Remove duplicate rows
df_duplicates <- df_dropna %>% distinct()

print(head(df_duplicates))
        Date CPO_Price Production Exports Imports  USD_IDR Market_Sentiment
1 2022-03-17  868.5479        102     244      60 14442.79         Positive
2 2022-03-18  771.7651        115     287      50 14765.80          neutral
3 2022-03-19  818.1564         99     208      77 14457.90         Positive
4 2022-03-20   50.0000         90     467     107 14033.63         NEGATIVE
5 2022-03-21  820.2134        130     305     148 14881.60         NEGATIVE
6 2022-03-22  794.6938         92     490     111 14127.73         Positive
  Demand_Index Supply_Index Rainfall Temperature Political_Stability
1          159          188      124    26.44452                   9
2          131          182       74    32.59184                   3
3          198          143      118    26.40042                   2
4          144          185      158    25.83467                   4
5          198           96       97    27.25106                   3
6          132          177      192    27.59168                   7
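
Note that drop_duplicates() and distinct() only remove rows that are identical across every column. In the generated dataset the duplicate was injected into Date alone (dates[100] = dates[99]) while the other columns differ, so a full-row comparison leaves it in place. A hedged sketch for deduplicating on the key column instead (the name df_dedup_by_date is illustrative):

Python Code

# Keep the first observation for each date; 'subset' restricts the
# comparison to the Date column rather than the full row
df_dedup_by_date = df_duplicates.drop_duplicates(subset=['Date'], keep='first')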

5.3.3 Fixing Text Formatting Issues

  • Ensure uniform formats for dates, numbers, and text
  • Use regex to clean text data (a sketch follows at the end of this subsection)

Python Code

df_duplicates['Market_Sentiment'] = df_duplicates['Market_Sentiment'].str.strip().str.capitalize()

R Code

# Standardize sentiment labels: trim whitespace, then capitalize only
# the first letter ("neutral" -> "Neutral", "NEGATIVE" -> "Negative"),
# mirroring the Python step above
x <- tolower(trimws(df_duplicates$Market_Sentiment))
df_duplicates$Market_Sentiment <- paste0(toupper(substring(x, 1, 1)), substring(x, 2))
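
The second bullet mentions regex. A minimal sketch in Pandas (the pattern is illustrative, not part of the pipeline above):

Python Code

# Drop any character that is not a letter, digit, or space; regex=True
# makes str.replace interpret the pattern as a regular expression
df_duplicates['Market_Sentiment'] = (
    df_duplicates['Market_Sentiment']
    .str.replace(r'[^A-Za-z0-9 ]', '', regex=True)
)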

5.3.4 Handling Outliers

  • Visualize with boxplots to detect anomalies (a sketch follows at the end of this subsection)
  • Use IQR (Interquartile Range) or Z-score method

Python Code

# Remove outliers using IQR
Q1 = df_duplicates['CPO_Price'].quantile(0.25)
Q3 = df_duplicates['CPO_Price'].quantile(0.75)
IQR = Q3 - Q1
df_outlier1 = df_duplicates[(df_duplicates['CPO_Price'] >= (Q1 - 1.5 * IQR)) & 
     (df_duplicates['CPO_Price'] <= (Q3 + 1.5 * IQR))]
# Remove outliers using Z-Score for 'USD_IDR'
from scipy import stats
df_outlier2 = df_duplicates[(np.abs(stats.zscore(df_duplicates['USD_IDR'])) < 3)]

R Code

# Remove outliers using IQR
Q1 <- quantile(df_duplicates$CPO_Price, 0.25)
Q3 <- quantile(df_duplicates$CPO_Price, 0.75)
IQR <- Q3 - Q1
df_outlier1 <- df_duplicates[df_duplicates$CPO_Price >= (Q1 - 1.5 * IQR) & 
                            df_duplicates$CPO_Price <= (Q3 + 1.5 * IQR), ]
# Remove outliers using Z-Score for 'USD_IDR'
# scale() returns a one-column matrix; flatten it to a plain vector
z_scores <- as.vector(scale(df_duplicates$USD_IDR))
df_outlier2 <- df_duplicates[abs(z_scores) < 3, ]
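
The first bullet recommends visual inspection. A short sketch using matplotlib (assumed to be installed; it is not used elsewhere in this chapter) to compare CPO_Price before and after the IQR filter:

Python Code

import matplotlib.pyplot as plt

# Side-by-side boxplots: outliers appear as points beyond the whiskers
fig, axes = plt.subplots(1, 2, figsize=(8, 4))
axes[0].boxplot(df_duplicates['CPO_Price'].dropna())
axes[0].set_title('Before IQR filter')
axes[1].boxplot(df_outlier1['CPO_Price'].dropna())
axes[1].set_title('After IQR filter')
plt.show()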

5.3.5 Standardizing & Normalizing

Apply Min-Max Scaling or Standard Scaling so that features share comparable ranges for statistical models and machine learning

Python Code

# Normalize 'Temperature' & 'Rainfall'
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df[['Temperature', 'Rainfall']] = scaler.fit_transform(df[['Temperature', 'Rainfall']])
# Standardize 'Demand_Index' & 'Supply_Index'
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df[['Demand_Index', 'Supply_Index']] = scaler.fit_transform(df[['Demand_Index', 'Supply_Index']])

R Code

# Normalize 'Temperature' & 'Rainfall' using Min-Max Scaling
normalize <- function(x) {
  return((x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE)))
}

df$Temperature <- normalize(df$Temperature)
df$Rainfall <- normalize(df$Rainfall)
# Standardize 'Demand_Index' & 'Supply_Index' using Z-score Standardization
# scale() returns a one-column matrix; as.numeric() flattens it
df$Demand_Index <- as.numeric(scale(df$Demand_Index))
df$Supply_Index <- as.numeric(scale(df$Supply_Index))
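
When the scaled features feed a machine learning model, a common practice (not shown in the pipeline above) is to fit the scaler on the training portion only and reuse it on the test portion, so no information leaks from the test set. A hedged sketch:

Python Code

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Split chronologically (shuffle=False preserves the time order),
# then fit the scaler on the training rows only
train, test = train_test_split(df, test_size=0.2, shuffle=False)
scaler = StandardScaler().fit(train[['Demand_Index', 'Supply_Index']])
train_scaled = scaler.transform(train[['Demand_Index', 'Supply_Index']])
test_scaled = scaler.transform(test[['Demand_Index', 'Supply_Index']])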

5.3.6 Ensuring the Final Dataset

  • Drop unnecessary columns
  • Filter out data that does not align with the analysis context

Python Code

# Ensure there are no missing values
df.dropna(inplace=True)

# Convert categorical variables into numerical format (if needed for regression)
df = pd.get_dummies(df, columns=['Market_Sentiment'], drop_first=True)
# Final Cleaned Dataset
print(df.head())
                        Date  ...  Market_Sentiment_neutral
0 2022-03-18 10:28:05.851223  ...                      True
1 2022-03-19 10:28:05.851223  ...                     False
2 2022-03-20 10:28:05.851223  ...                     False
3 2022-03-21 10:28:05.851223  ...                      True
4 2022-03-22 10:28:05.851223  ...                      True

[5 rows x 13 columns]

R Code

# Ensure there are no missing values
df <- na.omit(df)

# Convert categorical variables into numerical format (if needed for regression)
dummies <- model.matrix(~ Market_Sentiment - 1, data = df) %>% as.data.frame()

# Bind the indicator columns back to the dataframe
# (dropping the original categorical column)
df <- cbind(df[, !colnames(df) %in% "Market_Sentiment"], dummies)

# Print the new indicator columns
head(dummies)
  Market_SentimentNEGATIVE Market_Sentimentneutral Market_SentimentPositive
1                        0                       0                        1
2                        0                       1                        0
3                        0                       0                        1
4                        1                       0                        0
5                        1                       0                        0
6                        0                       0                        1
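
As a final sanity check, the sketch below (in Python; the R equivalent would use anyNA() and str()) confirms the cleaned dataset is ready for modeling:

Python Code

# Confirm that no missing values remain and review the final schema
assert df.isna().sum().sum() == 0, "missing values remain"
print(df.dtypes)
print(df.shape)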