Chapter 6 Data Cleaning and Preprocessing

6.1 Introduction to Data Cleaning and Preprocessing

Data cleaning and preprocessing are essential steps in preparing raw data for analysis. High-quality data ensures reliable results, reduces noise, and facilitates better decision-making. This chapter explores techniques for handling missing values, normalizing data, and other preprocessing strategies such as Standard Normal Variate (SNV) transformations.

6.1.1 Why Data Cleaning and Preprocessing Are Important

  • Improves Data Quality: Ensures accuracy, consistency, and completeness.

  • Enhances Model Performance: Preprocessed data leads to better analytical and machine-learning results.

  • Reduces Bias: Identifies and mitigates issues like missing values and outliers.

6.2 Data Cleaning

6.2.1 Handling Missing Values

Missing values (represented as NA in R) are common in real datasets and must be handled carefully.

6.2.1.1 Identifying Missing Values

library(dplyr)

# Small example dataset with missing values in Name, Age, and Score
data <- tibble::tibble(
  ID = 1:6,
  Name = c("Alice", "Bob", "Charlie", "David", NA, "Frank"),
  Age = c(25, 30, NA, 40, 45, 50),
  Score = c(88, NA, 85, 87, 90, 93)
)

print(data)  

# Check for missing values  
print(colSums(is.na(data)))  
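
The per-column counts show how many values are missing; to see which rows are affected, here is a small sketch using dplyr's if_any() (the object name rows_with_na is illustrative):

# Show every row that contains at least one missing value
rows_with_na <- data %>% filter(if_any(everything(), is.na))
print(rows_with_na)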

6.2.1.2 Removing Rows or Columns

Remove rows or columns containing missing values if their proportion is small.

library(tidyr)  # drop_na() comes from tidyr, not dplyr

cleaned_data <- data %>% drop_na()
print(cleaned_data)
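
drop_na() removes rows; to drop sparse columns instead, a minimal sketch that keeps only columns whose share of missing values stays at or below an illustrative 50% threshold (no column crosses it in this small example):

# Keep columns where at most half of the values are missing (threshold chosen for illustration)
cleaned_cols <- data %>% select(where(function(x) mean(is.na(x)) <= 0.5))
print(cleaned_cols)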

6.2.1.3 Imputing Missing Values

Impute missing values using statistical methods.

# Replace with mean  
data <- data %>% mutate(Age = ifelse(is.na(Age), mean(Age, na.rm = TRUE), Age))  
print(data)  
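
The same pattern works with other statistics; below is a brief sketch filling the missing Score with the median, stored in a separate object (data_median is an illustrative name) so the running example is left unchanged:

# Replace the missing Score with the median of the observed Scores
data_median <- data %>% mutate(Score = ifelse(is.na(Score), median(Score, na.rm = TRUE), Score))
print(data_median)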

6.2.1.4 Advanced Imputation

Use predictive models for imputation, such as those provided by the mice or missForest packages.

# Example using the mice package (predictive mean matching); shown commented out
# library(mice)
# imputed_data <- mice(data, method = "pmm", maxit = 5)
# completed_data <- complete(imputed_data)

6.2.2 Handling Outliers

Outliers can distort statistical analysis and machine-learning models.

6.2.2.1 Identifying Outliers

6.2.2.1.1 Using Boxplots

boxplot(data$Score, main = "Score Boxplot", horizontal = TRUE)

6.2.2.1.2 Using the Interquartile Range (IQR)

# Values below lower_bound or above upper_bound are treated as potential outliers
iqr <- IQR(data$Score, na.rm = TRUE)
lower_bound <- quantile(data$Score, 0.25, na.rm = TRUE) - 1.5 * iqr
upper_bound <- quantile(data$Score, 0.75, na.rm = TRUE) + 1.5 * iqr
print(lower_bound)
print(upper_bound)
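
The bounds can then be used to flag or drop suspect rows; a brief sketch follows (flagged, is_outlier, and filtered_data are illustrative names, and rows with a missing Score are kept as-is):

# Flag rows whose Score falls outside the IQR fences
flagged <- data %>% mutate(is_outlier = !is.na(Score) & (Score < lower_bound | Score > upper_bound))
print(flagged)

# Keep only rows inside the fences (or with a missing Score)
filtered_data <- data %>% filter(is.na(Score) | (Score >= lower_bound & Score <= upper_bound))
print(filtered_data)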

6.3 Data Preprocessing

6.3.1 Normalizing Data

Normalization scales data to a uniform range, making features comparable.

6.3.1.1 Min-Max Scaling

scaled_data <- data %>%
  mutate(Score = (Score - min(Score, na.rm = TRUE)) /
           (max(Score, na.rm = TRUE) - min(Score, na.rm = TRUE)))
print(scaled_data)  

6.3.1.2 Z-Score Standardization

standardized_data <- data %>% mutate(Score = (Score - mean(Score, na.rm = TRUE)) / sd(Score, na.rm = TRUE))  
print(standardized_data)  
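
Base R's scale() performs the same centering and scaling; here is a small sketch applying it to the Age and Score columns in one step (the column choice and the name standardized_all are illustrative):

# Standardize selected numeric columns at once; scale() returns a one-column matrix,
# so as.numeric() flattens it back to a plain vector
standardized_all <- data %>% mutate(across(c(Age, Score), function(x) as.numeric(scale(x))))
print(standardized_all)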

6.3.2 Transformations for Normality

Some analyses assume data is normally distributed. Transformations help achieve normality.

6.3.2.1 Logarithmic Transformation

Useful for right-skewed data.

# Add 1 before taking the log so that zero values do not produce -Inf
log_transformed <- data %>% mutate(Score = log(Score + 1))
print(log_transformed)  

6.3.2.2 Square Root Transformation

sqrt_transformed <- data %>% mutate(Score = sqrt(Score))  
print(sqrt_transformed)  

6.3.3 Standard Normal Variate (SNV) Transformation

SNV is often used in spectroscopy and other fields to remove scatter effects and standardize data.

6.3.3.1 Applying SNV

The SNV transformation standardizes each observation (for example, each spectrum) by subtracting that observation's mean and dividing by its standard deviation. The helper below implements the formula for a numeric vector; applied to a single column, as in this small example, it coincides with z-score standardization.

# Centre a numeric vector on its mean and scale it by its standard deviation
snv <- function(x) {
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}

data$Score_SNV <- snv(data$Score)
print(data)
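
In spectroscopy the transformation is applied row by row, one spectrum at a time. A minimal sketch reusing the snv() helper on a small matrix of made-up spectra (the values are purely illustrative):

# Three hypothetical spectra (rows) measured at four wavelengths (columns)
spectra <- matrix(c(0.10, 0.25, 0.40, 0.55,
                    0.20, 0.35, 0.52, 0.70,
                    0.05, 0.18, 0.33, 0.45),
                  nrow = 3, byrow = TRUE)

# Apply snv() to each row, then transpose back to the original orientation
spectra_snv <- t(apply(spectra, 1, snv))
print(spectra_snv)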

6.3.4 Categorical Data Encoding

Encoding categorical data is necessary for machine-learning models.

6.3.4.1 One-Hot Encoding

Convert categorical variables into binary indicator columns.

# The example data has no categorical column, so add an illustrative Group variable first
data <- data %>% mutate(Group = c("A", "B", "A", "B", "A", "B"))

encoded_data <- data %>% mutate(Group_A = ifelse(Group == "A", 1, 0), Group_B = ifelse(Group == "B", 1, 0))
print(encoded_data)
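
For variables with many categories, writing each indicator by hand does not scale; base R's model.matrix() can build the columns automatically. A brief sketch (the one_hot name is illustrative):

# Build one indicator column per level of Group; "- 1" drops the intercept so every level gets a column
one_hot <- model.matrix(~ Group - 1, data = data)
print(one_hot)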

6.3.4.2 Label Encoding

Assign numerical values to categories. factor() orders levels alphabetically by default, so "A" becomes 1 and "B" becomes 2 here.

# Convert the categories to integer codes based on the factor levels
data <- data %>% mutate(Group = as.numeric(factor(Group)))
print(data)

6.4 Summary

This chapter covered essential techniques for cleaning and preprocessing data in R. Properly handling missing values and outliers, and scaling or transforming variables, produces high-quality datasets that are ready for analysis. Techniques such as the SNV and logarithmic transformations are particularly useful in specialized fields, while categorical encoding makes data compatible with machine-learning workflows.