13 Data Cleaning and Preprocessing (11/7)
You might spend up to 80% of your time finding, cleaning, and preparing data for analysis. Typically, you will create a dataset that has a lot of noise, and maybe a few errors. Today we will address methods to clean and validate datasets that emerge from your research.
Things to consider:
Data-type: when values in a column must be a particular data type
Range of data: the expected ranges for responses
Missing data: Non-responses
Text field patterns: phone numbers in a similar format
Unnecessary information: extra columns that aren’t important
Dummy variables and labels
Outliers
The need to transform/creating new columns
13.1 Case Study 1: Titanic Data
These data were collected based on the titanic passenger list. Variables include:
Survived (0 = no, 1 = yes)
Pclass (1 = first, 2 = second, 3 = third)
Name (full name)
Sex (male or female)
Age (continuous)
SiblingsorSpouses (continuous)
ParentsorChildren (continuous)
Fare (cost of ticket, continuous).
There are several errors in the dataset, including typos in the sex column as well as errant numbers in the age column. Additionally, we need to a) label the class and survived variables with text that describes the data, b) deal with missing data, c) create a column for last names, d) transform some non-normal data, and e) make a combined number of family members column.
13.1.1 Typos in the Sex Column
library(tidyverse)
library(summarytools)
<- read_csv('titanic.csv') titanic
## Rows: 887 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): Name, Sex
## dbl (6): Survived, Pclass, Age, SiblingsorSpouses, ParentsorChildren, Fare
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Use the unique() function to identify unique terms in the Sex column.
unique(titanic$Sex)
## [1] "male" "female" "Female" "Male" "Man" "Woman" "woman"
Rename responses such that they are either “male” or “female”
$Sex <- str_replace_all(titanic$Sex, "Female", "female")
titanic$Sex <- str_replace_all(titanic$Sex, "Male", "male")
titanic$Sex <- str_replace_all(titanic$Sex, "Man", "male")
titanic$Sex <- str_replace_all(titanic$Sex, "Woman", "female")
titanic$Sex <- str_replace_all(titanic$Sex, "woman", "female")
titanicunique(titanic$Sex)
## [1] "male" "female"
13.1.2 Missing Data and Typos in the Age Column
Let’s check out the age column. Notice the 14 NA’s? These are missing data. We can take two approaches. First, delete cases that have missing values in the age column.
summary(titanic$Age)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.42 20.50 28.00 30.53 38.00 610.00 14
<- titanic |> filter(!is.na(Age))
titanic2 nrow(titanic2)
## [1] 873
Or impute the mean value into cells with missing data.
$Age[is.na(titanic$Age)] <- mean(titanic$Age, na.rm=TRUE)
titanicnrow(titanic)
## [1] 887
Assess the age column for outliers and remove cases above a reasonable threshold.
summary(titanic$Age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.42 21.00 28.00 30.53 38.00 610.00
<- titanic |> filter(titanic$Age < 100)
titanic summary(titanic$Age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.42 21.00 28.00 29.48 37.00 80.00
nrow(titanic) #two cases were removed
## [1] 885
13.1.3 Label Class and Survived Columns with Text
$Survived <- str_replace_all(titanic$Survived, "0", "No")
titanic$Survived <- str_replace_all(titanic$Survived, "1", "Yes")
titanic titanic
## # A tibble: 885 × 8
## Survived Pclass Name Sex Age Sibli…¹ Paren…² Fare
## <chr> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 No 3 Mr. Owen Harris Braund male 22 1 0 7.25
## 2 Yes 3 Miss. Laina Heikkinen fema… 26 0 0 7.92
## 3 Yes 1 Mrs. Jacques Heath (Lily M… fema… 35 1 0 53.1
## 4 No 3 Mr. William Henry Allen male 35 0 0 8.05
## 5 No 3 Mr. James Moran male 27 0 0 8.46
## 6 No 1 Mr. Timothy J McCarthy male 54 0 0 51.9
## 7 No 3 Master. Gosta Leonard Pals… male 2 3 1 21.1
## 8 Yes 3 Mrs. Oscar W (Elisabeth Vi… fema… 27 0 2 11.1
## 9 Yes 2 Mrs. Nicholas (Adele Achem… fema… 14 1 0 30.1
## 10 Yes 3 Miss. Marguerite Rut Sands… fema… 4 1 1 16.7
## # … with 875 more rows, and abbreviated variable names ¹SiblingsorSpouses,
## # ²ParentsorChildren
$Pclass <- str_replace_all(titanic$Pclass, "1", "First-Class")
titanic$Pclass <- str_replace_all(titanic$Pclass, "2", "Second-Class")
titanic$Pclass <- str_replace_all(titanic$Pclass, "3", "Third-Class")
titanic titanic
## # A tibble: 885 × 8
## Survived Pclass Name Sex Age Sibli…¹ Paren…² Fare
## <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 No Third-Class Mr. Owen Harris Brau… male 22 1 0 7.25
## 2 Yes Third-Class Miss. Laina Heikkinen fema… 26 0 0 7.92
## 3 Yes First-Class Mrs. Jacques Heath (… fema… 35 1 0 53.1
## 4 No Third-Class Mr. William Henry Al… male 35 0 0 8.05
## 5 No Third-Class Mr. James Moran male 27 0 0 8.46
## 6 No First-Class Mr. Timothy J McCart… male 54 0 0 51.9
## 7 No Third-Class Master. Gosta Leonar… male 2 3 1 21.1
## 8 Yes Third-Class Mrs. Oscar W (Elisab… fema… 27 0 2 11.1
## 9 Yes Second-Class Mrs. Nicholas (Adele… fema… 14 1 0 30.1
## 10 Yes Third-Class Miss. Marguerite Rut… fema… 4 1 1 16.7
## # … with 875 more rows, and abbreviated variable names ¹SiblingsorSpouses,
## # ²ParentsorChildren
#set as factors
$Survived <- as.factor(titanic$Survived)
titanic$Pclass <- as.factor(titanic$Pclass) titanic
13.1.4 Create a Composite Family Members Column
$totalfamily <- titanic$SiblingsorSpouses + titanic$ParentsorChildren
titanic titanic
## # A tibble: 885 × 9
## Survived Pclass Name Sex Age Sibli…¹ Paren…² Fare total…³
## <fct> <fct> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 No Third-Class Mr. Owen Har… male 22 1 0 7.25 1
## 2 Yes Third-Class Miss. Laina … fema… 26 0 0 7.92 0
## 3 Yes First-Class Mrs. Jacques… fema… 35 1 0 53.1 1
## 4 No Third-Class Mr. William … male 35 0 0 8.05 0
## 5 No Third-Class Mr. James Mo… male 27 0 0 8.46 0
## 6 No First-Class Mr. Timothy … male 54 0 0 51.9 0
## 7 No Third-Class Master. Gost… male 2 3 1 21.1 4
## 8 Yes Third-Class Mrs. Oscar W… fema… 27 0 2 11.1 2
## 9 Yes Second-Class Mrs. Nichola… fema… 14 1 0 30.1 1
## 10 Yes Third-Class Miss. Margue… fema… 4 1 1 16.7 2
## # … with 875 more rows, and abbreviated variable names ¹SiblingsorSpouses,
## # ²ParentsorChildren, ³totalfamily
13.1.5 Normality Transformation
Skewness and kurtosis were improved to normality via a log10 transformation. The log10 transformation calculates the exponent to which 10 must be raised to equal the number provided in the cell. 1 is added to deal with negative values.
::describe(titanic$Fare) psych
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 885 32.29 49.81 14.45 21.45 10.28 0 512.33 512.33 4.76 32.99
## se
## X1 1.67
$FareLog <- log10(titanic$Fare + 1)
titanic::describe(titanic$FareLog) psych
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 885 1.29 0.42 1.19 1.25 0.36 0 2.71 2.71 0.39 0.96 0.01
13.1.6 Adding a Last Name Column
library(stringi)
$LastName <- stri_extract_last_words(titanic$Name)
titanicdfSummary(titanic$LastName)
## titanic$LastName was converted to a data frame
## Data Frame Summary
## titanic
## Dimensions: 885 x 1
## Duplicates: 224
##
## --------------------------------------------------------------------------------------------------
## No Variable Stats / Values Freqs (% of Valid) Graph Valid Missing
## ---- ------------- ---------------- -------------------- -------------------- ---------- ---------
## 1 LastName 1. Andersson 9 ( 1.0%) 885 0
## [character] 2. Sage 7 ( 0.8%) (100.0%) (0.0%)
## 3. Carter 6 ( 0.7%)
## 4. Goodwin 6 ( 0.7%)
## 5. Johnson 6 ( 0.7%)
## 6. Panula 6 ( 0.7%)
## 7. Skoog 6 ( 0.7%)
## 8. Rice 5 ( 0.6%)
## 9. Asplund 4 ( 0.5%)
## 10. Baclini 4 ( 0.5%)
## [ 651 others ] 826 (93.3%) IIIIIIIIIIIIIIIIII
## --------------------------------------------------------------------------------------------------
13.2 Case Study 2: Expert Teaching Data
Your turn, clean the dataset called “expertteaching.csv” Use R for any tasks listed above. To start, open the file in excel and remove excess information. Identify next steps to organize the dataset. We will need to know what subscales each of the items belong to. Reference the survey on qualtrics (shared with your email addresses) for subscale lists.
<- read_csv('expertteaching.csv') expertteaching
## Rows: 117 Columns: 68
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (68): StartDate, EndDate, Status, IPAddress, Progress, Duration (in seco...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.