5  Data Cleaning

Data cleaning is the process of identifying and correcting or removing inaccurate, incomplete, irrelevant, or erroneous data in a dataset. It is a crucial step in data analysis and machine learning to ensure accurate and reliable results.

5.1 Data Cleaning Process

In the data analysis process, one of the first and most critical steps is ensuring that the data is clean and ready for further processing. Poor-quality data can lead to errors in analysis and generate misleading insights. Data cleaning is therefore an essential component of the data preprocessing pipeline.

The outline below illustrates the key stages of the data cleaning process, including duplicate removal, handling of missing values, normalization, and the treatment of anomalies such as outliers and inconsistencies. Following these steps can significantly improve data quality, leading to more accurate and reliable analytical outcomes.

  • Remove Duplicates: use drop_duplicates() in Pandas; use distinct() in R (dplyr)
  • Handle Missing Values: remove excessive missing data (dropna(), na.omit()); fill with mean, median, or mode (fillna(), mutate())
  • Fix Data Inconsistencies: ensure uniform formats for dates, numbers, and text; use regex to clean text
  • Remove Outliers: use the IQR or Z-score method; detect anomalies using boxplots
  • Convert Data Types: ensure numerical data is not stored as strings; convert date formats to datetime
  • Normalize Data: apply Min-Max Scaling or Standard Scaling; useful for machine learning models
  • Remove Irrelevant Data: drop unnecessary columns; filter data relevant to the analysis

5.2 Case Study Example

5.2.1 About the Dataset

This is a synthetic (dummy) dataset covering the past three years, with 12 variables that include missing values, duplicates, and outliers.

Feature              Description
Date                 Observation date (with missing & duplicate values)
CPO_Price            CPO price per ton (includes outliers)
Production           Production volume (contains missing values)
Exports              CPO export volume
Imports              CPO import volume
USD_IDR              Exchange rate (USD to IDR)
Market_Sentiment     Market sentiment (text inconsistencies)
Demand_Index         Synthetic demand index (100-200 range)
Supply_Index         Synthetic supply index (90-190 range)
Rainfall             Rainfall in mm
Temperature          Temperature in Celsius
Political_Stability  Political stability score (1-10 scale)

5.2.2 Generate Raw Dataset

Python Code

import pandas as pd
import numpy as np
from datetime import datetime, timedelta

# Generate 1000 dates
start_date = datetime.today() - timedelta(days=3*365)
dates = [start_date + timedelta(days=i) for i in range(1000)]

# Random Data Generation
np.random.seed(42)
cpo_prices = np.random.normal(800, 50, 1000)
cpo_prices[np.random.randint(0, 1000, 5)] = np.random.choice([2000, 2500, 50, 100], 5)

production = np.random.randint(90, 130, 1000).astype(float)
production[np.random.randint(0, 1000, 20)] = np.nan

exports = np.random.randint(200, 500, 1000)
imports = np.random.randint(50, 150, 1000)

usd_idr = np.random.normal(14500, 300, 1000)
demand_index = np.random.randint(100, 200, 1000)
supply_index = np.random.randint(90, 190, 1000)

rainfall = np.random.randint(50, 200, 1000)
temperature = np.random.uniform(25, 35, 1000)

political_stability = np.random.randint(1, 10, 1000).astype(float)
political_stability[np.random.randint(0, 1000, 15)] = np.nan

sentiments = np.random.choice(["Positive", "neutral", "NEGATIVE"], 1000)

# Introduce duplicate and missing values in Date
dates[100] = dates[99]
dates[500] = None

# Create DataFrame
df = pd.DataFrame({
    'Date': dates, 'CPO_Price': cpo_prices, 'Production': production,
    'Exports': exports, 'Imports': imports, 'USD_IDR': usd_idr,
    'Market_Sentiment': sentiments, 'Demand_Index': demand_index,
    'Supply_Index': supply_index, 'Rainfall': rainfall,
    'Temperature': temperature, 'Political_Stability': political_stability
})

# Ensure dataset is sorted by Date for Time Series modeling
df = df.sort_values(by='Date')

print(df.head())
                        Date   CPO_Price  ...  Temperature  Political_Stability
0 2022-03-18 10:28:05.851223  824.835708  ...    31.604697                  3.0
1 2022-03-19 10:28:05.851223  793.086785  ...    29.708164                  5.0
2 2022-03-20 10:28:05.851223  832.384427  ...    26.989298                  3.0
3 2022-03-21 10:28:05.851223  876.151493  ...    31.016866                  5.0
4 2022-03-22 10:28:05.851223  788.292331  ...    25.606831                  3.0

[5 rows x 12 columns]

R Code

library(dplyr)
library(lubridate)
library(tidyr)

# Generate 1000 dates
set.seed(42)
start_date <- Sys.Date() - years(3)
dates <- seq.Date(from = start_date, by = "day", length.out = 1000)

# Random Data Generation
cpo_prices <- rnorm(1000, mean = 800, sd = 50)
cpo_prices[sample(1:1000, 5)] <- sample(c(2000, 2500, 50, 100), 5, replace = TRUE)

production <- as.numeric(sample(90:130, 1000, replace = TRUE))
production[sample(1:1000, 20)] <- NA

exports <- sample(200:500, 1000, replace = TRUE)
imports <- sample(50:150, 1000, replace = TRUE)

usd_idr <- rnorm(1000, mean = 14500, sd = 300)
demand_index <- sample(100:200, 1000, replace = TRUE)
supply_index <- sample(90:190, 1000, replace = TRUE)

rainfall <- sample(50:200, 1000, replace = TRUE)
temperature <- runif(1000, min = 25, max = 35)

political_stability <- as.numeric(sample(1:10, 1000, replace = TRUE))
political_stability[sample(1:1000, 15)] <- NA

sentiments <- sample(c("Positive", "neutral", "NEGATIVE"), 1000, replace = TRUE)

# Introduce duplicate and missing values in Date
dates[100] <- dates[99]
dates[500] <- NA

# Create DataFrame
df <- data.frame(
  Date = dates, CPO_Price = cpo_prices, Production = production,
  Exports = exports, Imports = imports, USD_IDR = usd_idr,
  Market_Sentiment = sentiments, Demand_Index = demand_index,
  Supply_Index = supply_index, Rainfall = rainfall,
  Temperature = temperature, Political_Stability = political_stability
)

# Ensure dataset is sorted by Date for Time Series modeling
df <- df %>% arrange(Date)

print(head(df))
        Date CPO_Price Production Exports Imports  USD_IDR Market_Sentiment
1 2022-03-17  868.5479        102     244      60 14442.79         Positive
2 2022-03-18  771.7651        115     287      50 14765.80          neutral
3 2022-03-19  818.1564         99     208      77 14457.90         Positive
4 2022-03-20   50.0000         90     467     107 14033.63         NEGATIVE
5 2022-03-21  820.2134        130     305     148 14881.60         NEGATIVE
6 2022-03-22  794.6938         92     490     111 14127.73         Positive
  Demand_Index Supply_Index Rainfall Temperature Political_Stability
1          159          188      124    26.44452                   9
2          131          182       74    32.59184                   3
3          198          143      118    26.40042                   2
4          144          185      158    25.83467                   4
5          198           96       97    27.25106                   3
6          132          177      192    27.59168                   7
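
Before moving on to the cleaning steps, it can help to audit the raw dataset and confirm that it really contains the issues described above. The sketch below is a minimal check in Python, not part of the generation code.

Python Code

# Count missing values per column
print(df.isna().sum())

# Count fully duplicated rows and duplicated dates
print(df.duplicated().sum())
print(df['Date'].duplicated().sum())

# Inspect the inconsistent sentiment labels
print(df['Market_Sentiment'].unique())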

5.3 Data Cleaning Process

5.3.1 Handling Missing Values

  • Remove rows/columns with excessive missing data (dropna() in Pandas, na.omit() in R)
  • Fill missing values using the mean, median, mode, or interpolation (fillna() in Pandas; mutate() with tidyr::replace_na() in R; an interpolation sketch follows at the end of this subsection)

Python Code

# Convert 'Date' to datetime and fill the missing date
# (interpolating datetime values requires a recent pandas version;
#  ffill() is a safe alternative on older releases)
df['Date'] = pd.to_datetime(df['Date'])
df['Date'] = df['Date'].interpolate()

# Fill missing numerical values
df['Production'] = df['Production'].ffill()
df['Political_Stability'] = df['Political_Stability'].fillna(df['Political_Stability'].mean())

# Fill categorical missing values
df['Market_Sentiment'] = df['Market_Sentiment'].fillna(df['Market_Sentiment'].mode()[0])

# Drop remaining missing values
df_dropna = df.dropna()

R Code

library(zoo)
# Convert 'Date' to Date class and interpolate the missing value;
# na.approx() returns a numeric vector, so convert back to Date
df$Date <- as.Date(df$Date)
df$Date <- as.Date(zoo::na.approx(df$Date, na.rm = FALSE), origin = "1970-01-01")

# Fill missing numerical values
df$Production <- tidyr::fill(df, Production, .direction = "down")$Production
df$Political_Stability[is.na(df$Political_Stability)] <- mean(df$Political_Stability, na.rm = TRUE)

# Fill categorical missing values
df$Market_Sentiment[is.na(df$Market_Sentiment)] <- names(sort(table(df$Market_Sentiment), decreasing = TRUE))[1]

# Drop remaining missing values
df_dropna <- na.omit(df)

print(head(df_dropna))
        Date CPO_Price Production Exports Imports  USD_IDR Market_Sentiment
1 2022-03-17  868.5479        102     244      60 14442.79         Positive
2 2022-03-18  771.7651        115     287      50 14765.80          neutral
3 2022-03-19  818.1564         99     208      77 14457.90         Positive
4 2022-03-20   50.0000         90     467     107 14033.63         NEGATIVE
5 2022-03-21  820.2134        130     305     148 14881.60         NEGATIVE
6 2022-03-22  794.6938         92     490     111 14127.73         Positive
  Demand_Index Supply_Index Rainfall Temperature Political_Stability
1          159          188      124    26.44452                   9
2          131          182       74    32.59184                   3
3          198          143      118    26.40042                   2
4          144          185      158    25.83467                   4
5          198           96       97    27.25106                   3
6          132          177      192    27.59168                   7
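
The interpolation mentioned in the bullet list can also be applied to numeric columns. A brief sketch in Python, as an alternative to the ffill() call above rather than an additional pipeline step:

Python Code

# Linear interpolation estimates each gap from its neighbours,
# which suits smoothly varying series better than forward-filling
df['Production'] = df['Production'].interpolate(method='linear')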

5.3.2 Removing Duplicates

  • Use drop_duplicates() in Pandas (Python)
  • Use distinct() in R (dplyr)

Python Code

df_duplicates = df_dropna.drop_duplicates()

R Code

# Remove duplicate rows
df_duplicates <- df_dropna %>% distinct()

print(head(df_duplicates))
        Date CPO_Price Production Exports Imports  USD_IDR Market_Sentiment
1 2022-03-17  868.5479        102     244      60 14442.79         Positive
2 2022-03-18  771.7651        115     287      50 14765.80          neutral
3 2022-03-19  818.1564         99     208      77 14457.90         Positive
4 2022-03-20   50.0000         90     467     107 14033.63         NEGATIVE
5 2022-03-21  820.2134        130     305     148 14881.60         NEGATIVE
6 2022-03-22  794.6938         92     490     111 14127.73         Positive
  Demand_Index Supply_Index Rainfall Temperature Political_Stability
1          159          188      124    26.44452                   9
2          131          182       74    32.59184                   3
3          198          143      118    26.40042                   2
4          144          185      158    25.83467                   4
5          198           96       97    27.25106                   3
6          132          177      192    27.59168                   7
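
Note that drop_duplicates() and distinct() only remove rows that are identical across every column. In the generated dataset the duplicate was injected into Date alone (dates[100] = dates[99]) while the other columns differ, so a full-row comparison leaves it in place. A hedged sketch for deduplicating on the key column instead (the name df_dedup_by_date is illustrative):

Python Code

# Keep the first observation for each date; 'subset' restricts the
# comparison to the Date column rather than the full row
df_dedup_by_date = df_duplicates.drop_duplicates(subset=['Date'], keep='first')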

5.3.3 Fixing Text Formatting Issues

  • Ensure uniform formats for dates, numbers, and text
  • Use regex to clean text data (a sketch follows at the end of this subsection)

Python Code

df_duplicates['Market_Sentiment'] = df_duplicates['Market_Sentiment'].str.strip().str.capitalize()

R Code

# Standardize sentiment labels: trim whitespace, then capitalize only
# the first letter ("neutral" -> "Neutral", "NEGATIVE" -> "Negative"),
# mirroring the Python step above
x <- tolower(trimws(df_duplicates$Market_Sentiment))
df_duplicates$Market_Sentiment <- paste0(toupper(substring(x, 1, 1)), substring(x, 2))
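
The second bullet mentions regex. A minimal sketch in Pandas (the pattern is illustrative, not part of the pipeline above):

Python Code

# Drop any character that is not a letter, digit, or space; regex=True
# makes str.replace interpret the pattern as a regular expression
df_duplicates['Market_Sentiment'] = (
    df_duplicates['Market_Sentiment']
    .str.replace(r'[^A-Za-z0-9 ]', '', regex=True)
)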

5.3.4 Handling Outliers

  • Visualize with boxplots to detect anomalies (a sketch follows at the end of this subsection)
  • Use IQR (Interquartile Range) or Z-score method

Python Code

# Remove outliers using IQR
Q1 = df_duplicates['CPO_Price'].quantile(0.25)
Q3 = df_duplicates['CPO_Price'].quantile(0.75)
IQR = Q3 - Q1
df_outlier1 = df_duplicates[(df_duplicates['CPO_Price'] >= (Q1 - 1.5 * IQR)) & 
     (df_duplicates['CPO_Price'] <= (Q3 + 1.5 * IQR))]
# Remove outliers using Z-Score for 'USD_IDR'
from scipy import stats
df_outlier2 = df_duplicates[(np.abs(stats.zscore(df_duplicates['USD_IDR'])) < 3)]

R Code

# Remove outliers using IQR
Q1 <- quantile(df_duplicates$CPO_Price, 0.25)
Q3 <- quantile(df_duplicates$CPO_Price, 0.75)
IQR <- Q3 - Q1
df_outlier1 <- df_duplicates[df_duplicates$CPO_Price >= (Q1 - 1.5 * IQR) & 
                            df_duplicates$CPO_Price <= (Q3 + 1.5 * IQR), ]
# Remove outliers using Z-Score for 'USD_IDR'
# scale() returns a one-column matrix; flatten it to a plain vector
z_scores <- as.vector(scale(df_duplicates$USD_IDR))
df_outlier2 <- df_duplicates[abs(z_scores) < 3, ]
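
The first bullet recommends visual inspection. A short sketch using matplotlib (assumed to be installed; it is not used elsewhere in this chapter) to compare CPO_Price before and after the IQR filter:

Python Code

import matplotlib.pyplot as plt

# Side-by-side boxplots: outliers appear as points beyond the whiskers
fig, axes = plt.subplots(1, 2, figsize=(8, 4))
axes[0].boxplot(df_duplicates['CPO_Price'].dropna())
axes[0].set_title('Before IQR filter')
axes[1].boxplot(df_outlier1['CPO_Price'].dropna())
axes[1].set_title('After IQR filter')
plt.show()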

5.3.5 Standardizing & Normalizing

Apply Min-Max Scaling or Standard Scaling so that features share comparable ranges for statistical models and machine learning

Python Code

# Normalize 'Temperature' & 'Rainfall'
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df[['Temperature', 'Rainfall']] = scaler.fit_transform(df[['Temperature', 'Rainfall']])
# Standardize 'Demand_Index' & 'Supply_Index'
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df[['Demand_Index', 'Supply_Index']] = scaler.fit_transform(df[['Demand_Index', 'Supply_Index']])

R Code

# Normalize 'Temperature' & 'Rainfall' using Min-Max Scaling
normalize <- function(x) {
  return((x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE)))
}

df$Temperature <- normalize(df$Temperature)
df$Rainfall <- normalize(df$Rainfall)
# Standardize 'Demand_Index' & 'Supply_Index' using Z-score Standardization
# scale() returns a one-column matrix; as.numeric() flattens it
df$Demand_Index <- as.numeric(scale(df$Demand_Index))
df$Supply_Index <- as.numeric(scale(df$Supply_Index))
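
When the scaled features feed a machine learning model, a common practice (not shown in the pipeline above) is to fit the scaler on the training portion only and reuse it on the test portion, so no information leaks from the test set. A hedged sketch:

Python Code

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Split chronologically (shuffle=False preserves the time order),
# then fit the scaler on the training rows only
train, test = train_test_split(df, test_size=0.2, shuffle=False)
scaler = StandardScaler().fit(train[['Demand_Index', 'Supply_Index']])
train_scaled = scaler.transform(train[['Demand_Index', 'Supply_Index']])
test_scaled = scaler.transform(test[['Demand_Index', 'Supply_Index']])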

5.3.6 Ensuring the Final Dataset

  • Drop unnecessary columns
  • Filter out data that does not align with the analysis context

Python Code

# Ensure there are no missing values
df.dropna(inplace=True)

# Convert categorical variables into numerical format (if needed for regression)
df = pd.get_dummies(df, columns=['Market_Sentiment'], drop_first=True)
# Final Cleaned Dataset
print(df.head())
                        Date  ...  Market_Sentiment_neutral
0 2022-03-18 10:28:05.851223  ...                      True
1 2022-03-19 10:28:05.851223  ...                     False
2 2022-03-20 10:28:05.851223  ...                     False
3 2022-03-21 10:28:05.851223  ...                      True
4 2022-03-22 10:28:05.851223  ...                      True

[5 rows x 13 columns]

R Code

# Ensure there are no missing values
df <- na.omit(df)

# Convert categorical variables into numerical format (if needed for regression)
dummies <- model.matrix(~ Market_Sentiment - 1, data = df) %>% as.data.frame()

# Bind the indicator columns back to the dataframe
# (dropping the original categorical column)
df <- cbind(df[, !colnames(df) %in% "Market_Sentiment"], dummies)

# Print the new indicator columns
head(dummies)
  Market_SentimentNEGATIVE Market_Sentimentneutral Market_SentimentPositive
1                        0                       0                        1
2                        0                       1                        0
3                        0                       0                        1
4                        1                       0                        0
5                        1                       0                        0
6                        0                       0                        1
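
As a final sanity check, the sketch below (in Python; the R equivalent would use anyNA() and str()) confirms the cleaned dataset is ready for modeling:

Python Code

# Confirm that no missing values remain and review the final schema
assert df.isna().sum().sum() == 0, "missing values remain"
print(df.dtypes)
print(df.shape)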