13 Data Cleaning and Preprocessing (11/7)

You might spend up to 80% of your time finding, cleaning, and preparing data for analysis. Typically, you will create a dataset that has a lot of noise, and maybe a few errors. Today we will address methods to clean and validate datasets that emerge from your research.

Things to consider:

  • Data-type: when values in a column must be a particular data type

  • Range of data: the expected ranges for responses

  • Missing data: Non-responses

  • Text field patterns: phone numbers in a similar format

  • Unnecessary information: extra columns that aren’t important

  • Dummy variables and labels

  • Outliers

  • The need to transform/creating new columns

13.1 Case Study 1: Titanic Data

These data were collected based on the titanic passenger list. Variables include:

  • Survived (0 = no, 1 = yes)

  • Pclass (1 = first, 2 = second, 3 = third)

  • Name (full name)

  • Sex (male or female)

  • Age (continuous)

  • SiblingsorSpouses (continuous)

  • ParentsorChildren (continuous)

  • Fare (cost of ticket, continuous).

There are several errors in the dataset, including typos in the sex column as well as errant numbers in the age column. Additionally, we need to a) label the class and survived variables with text that describes the data, b) deal with missing data, c) create a column for last names, d) transform some non-normal data, and e) make a combined number of family members column.

13.1.1 Typos in the Sex Column

library(tidyverse)
library(summarytools)
titanic <- read_csv('titanic.csv')
## Rows: 887 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): Name, Sex
## dbl (6): Survived, Pclass, Age, SiblingsorSpouses, ParentsorChildren, Fare
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Use the unique() function to identify unique terms in the Sex column.

unique(titanic$Sex)
## [1] "male"   "female" "Female" "Male"   "Man"    "Woman"  "woman"

Rename responses such that they are either “male” or “female”

titanic$Sex <- str_replace_all(titanic$Sex, "Female", "female") 
titanic$Sex <- str_replace_all(titanic$Sex, "Male", "male")
titanic$Sex <- str_replace_all(titanic$Sex, "Man", "male") 
titanic$Sex <- str_replace_all(titanic$Sex, "Woman", "female") 
titanic$Sex <- str_replace_all(titanic$Sex, "woman", "female") 
unique(titanic$Sex)
## [1] "male"   "female"

13.1.2 Missing Data and Typos in the Age Column

Let’s check out the age column. Notice the 14 NA’s? These are missing data. We can take two approaches. First, delete cases that have missing values in the age column.

summary(titanic$Age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.42   20.50   28.00   30.53   38.00  610.00      14
titanic2 <- titanic |> filter(!is.na(Age))
nrow(titanic2)
## [1] 873

Or impute the mean value into cells with missing data.

titanic$Age[is.na(titanic$Age)] <- mean(titanic$Age, na.rm=TRUE)
nrow(titanic)
## [1] 887

Assess the age column for outliers and remove cases above a reasonable threshold.

summary(titanic$Age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.42   21.00   28.00   30.53   38.00  610.00
titanic <- titanic |> filter(titanic$Age < 100)
summary(titanic$Age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.42   21.00   28.00   29.48   37.00   80.00
nrow(titanic) #two cases were removed
## [1] 885

13.1.3 Label Class and Survived Columns with Text

titanic$Survived <- str_replace_all(titanic$Survived, "0", "No")
titanic$Survived <- str_replace_all(titanic$Survived, "1", "Yes")
titanic
## # A tibble: 885 × 8
##    Survived Pclass Name                        Sex     Age Sibli…¹ Paren…²  Fare
##    <chr>     <dbl> <chr>                       <chr> <dbl>   <dbl>   <dbl> <dbl>
##  1 No            3 Mr. Owen Harris Braund      male     22       1       0  7.25
##  2 Yes           3 Miss. Laina Heikkinen       fema…    26       0       0  7.92
##  3 Yes           1 Mrs. Jacques Heath (Lily M… fema…    35       1       0 53.1 
##  4 No            3 Mr. William Henry Allen     male     35       0       0  8.05
##  5 No            3 Mr. James Moran             male     27       0       0  8.46
##  6 No            1 Mr. Timothy J McCarthy      male     54       0       0 51.9 
##  7 No            3 Master. Gosta Leonard Pals… male      2       3       1 21.1 
##  8 Yes           3 Mrs. Oscar W (Elisabeth Vi… fema…    27       0       2 11.1 
##  9 Yes           2 Mrs. Nicholas (Adele Achem… fema…    14       1       0 30.1 
## 10 Yes           3 Miss. Marguerite Rut Sands… fema…     4       1       1 16.7 
## # … with 875 more rows, and abbreviated variable names ¹​SiblingsorSpouses,
## #   ²​ParentsorChildren
titanic$Pclass <- str_replace_all(titanic$Pclass, "1", "First-Class")
titanic$Pclass <- str_replace_all(titanic$Pclass, "2", "Second-Class")
titanic$Pclass <- str_replace_all(titanic$Pclass, "3", "Third-Class")
titanic
## # A tibble: 885 × 8
##    Survived Pclass       Name                  Sex     Age Sibli…¹ Paren…²  Fare
##    <chr>    <chr>        <chr>                 <chr> <dbl>   <dbl>   <dbl> <dbl>
##  1 No       Third-Class  Mr. Owen Harris Brau… male     22       1       0  7.25
##  2 Yes      Third-Class  Miss. Laina Heikkinen fema…    26       0       0  7.92
##  3 Yes      First-Class  Mrs. Jacques Heath (… fema…    35       1       0 53.1 
##  4 No       Third-Class  Mr. William Henry Al… male     35       0       0  8.05
##  5 No       Third-Class  Mr. James Moran       male     27       0       0  8.46
##  6 No       First-Class  Mr. Timothy J McCart… male     54       0       0 51.9 
##  7 No       Third-Class  Master. Gosta Leonar… male      2       3       1 21.1 
##  8 Yes      Third-Class  Mrs. Oscar W (Elisab… fema…    27       0       2 11.1 
##  9 Yes      Second-Class Mrs. Nicholas (Adele… fema…    14       1       0 30.1 
## 10 Yes      Third-Class  Miss. Marguerite Rut… fema…     4       1       1 16.7 
## # … with 875 more rows, and abbreviated variable names ¹​SiblingsorSpouses,
## #   ²​ParentsorChildren
#set as factors
titanic$Survived <- as.factor(titanic$Survived)
titanic$Pclass <- as.factor(titanic$Pclass)

13.1.4 Create a Composite Family Members Column

titanic$totalfamily <- titanic$SiblingsorSpouses + titanic$ParentsorChildren
titanic
## # A tibble: 885 × 9
##    Survived Pclass       Name          Sex     Age Sibli…¹ Paren…²  Fare total…³
##    <fct>    <fct>        <chr>         <chr> <dbl>   <dbl>   <dbl> <dbl>   <dbl>
##  1 No       Third-Class  Mr. Owen Har… male     22       1       0  7.25       1
##  2 Yes      Third-Class  Miss. Laina … fema…    26       0       0  7.92       0
##  3 Yes      First-Class  Mrs. Jacques… fema…    35       1       0 53.1        1
##  4 No       Third-Class  Mr. William … male     35       0       0  8.05       0
##  5 No       Third-Class  Mr. James Mo… male     27       0       0  8.46       0
##  6 No       First-Class  Mr. Timothy … male     54       0       0 51.9        0
##  7 No       Third-Class  Master. Gost… male      2       3       1 21.1        4
##  8 Yes      Third-Class  Mrs. Oscar W… fema…    27       0       2 11.1        2
##  9 Yes      Second-Class Mrs. Nichola… fema…    14       1       0 30.1        1
## 10 Yes      Third-Class  Miss. Margue… fema…     4       1       1 16.7        2
## # … with 875 more rows, and abbreviated variable names ¹​SiblingsorSpouses,
## #   ²​ParentsorChildren, ³​totalfamily

13.1.5 Normality Transformation

Skewness and kurtosis were improved to normality via a log10 transformation. The log10 transformation calculates the exponent to which 10 must be raised to equal the number provided in the cell. 1 is added to deal with negative values.

psych::describe(titanic$Fare)
##    vars   n  mean    sd median trimmed   mad min    max  range skew kurtosis
## X1    1 885 32.29 49.81  14.45   21.45 10.28   0 512.33 512.33 4.76    32.99
##      se
## X1 1.67
titanic$FareLog <- log10(titanic$Fare + 1)
psych::describe(titanic$FareLog)
##    vars   n mean   sd median trimmed  mad min  max range skew kurtosis   se
## X1    1 885 1.29 0.42   1.19    1.25 0.36   0 2.71  2.71 0.39     0.96 0.01

13.1.6 Adding a Last Name Column

library(stringi)
titanic$LastName <- stri_extract_last_words(titanic$Name)
dfSummary(titanic$LastName)
## titanic$LastName was converted to a data frame
## Data Frame Summary  
## titanic  
## Dimensions: 885 x 1  
## Duplicates: 224  
## 
## --------------------------------------------------------------------------------------------------
## No   Variable      Stats / Values   Freqs (% of Valid)   Graph                Valid      Missing  
## ---- ------------- ---------------- -------------------- -------------------- ---------- ---------
## 1    LastName      1. Andersson       9 ( 1.0%)                               885        0        
##      [character]   2. Sage            7 ( 0.8%)                               (100.0%)   (0.0%)   
##                    3. Carter          6 ( 0.7%)                                                   
##                    4. Goodwin         6 ( 0.7%)                                                   
##                    5. Johnson         6 ( 0.7%)                                                   
##                    6. Panula          6 ( 0.7%)                                                   
##                    7. Skoog           6 ( 0.7%)                                                   
##                    8. Rice            5 ( 0.6%)                                                   
##                    9. Asplund         4 ( 0.5%)                                                   
##                    10. Baclini        4 ( 0.5%)                                                   
##                    [ 651 others ]   826 (93.3%)          IIIIIIIIIIIIIIIIII                       
## --------------------------------------------------------------------------------------------------

13.1.7 Save the Dataset

titanic <- titanic |> select(-SiblingsorSpouses, -ParentsorChildren, -Name, -Fare)
write_csv(titanic, 'titaniccleaned.csv')

13.2 Case Study 2: Expert Teaching Data

Your turn, clean the dataset called “expertteaching.csv” Use R for any tasks listed above. To start, open the file in excel and remove excess information. Identify next steps to organize the dataset. We will need to know what subscales each of the items belong to. Reference the survey on qualtrics (shared with your email addresses) for subscale lists.

expertteaching <- read_csv('expertteaching.csv')
## Rows: 117 Columns: 68
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (68): StartDate, EndDate, Status, IPAddress, Progress, Duration (in seco...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.