9 Data Cleaning

Data Cleaning is one of the most important, frustrating, and time consuming tasks you will have to do. But if you cannot trust your data, you also cant trust your analysis and conclusions. Garbage in - garbage out.

To practice, I created a data set with some climate data called dirtydata.csv. You should download it and follow along. It is good practice to save the original data file under a different name (I call it dd_original) so that you have ready access to it. The data frame I will be working with is dd, short for dirty data.

dd_original <-read.csv("dirtydata.csv")
dd <- dd_original

You will need the packages dplyr and stringr.

library(dplyr)
#> Warning: package 'dplyr' was built under R version 4.3.2
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(stringr)
#> Warning: package 'stringr' was built under R version 4.3.2

Let’s have a look and see what we are dealing with.

str(dd)
#> 'data.frame':    157 obs. of  8 variables:
#>  $ Obs..Number: int  1 2 3 4 5 6 7 8 9 10 ...
#>  $ Ozone      : int  41 36 12 18 NA 28 23 19 8 NA ...
#>  $ Wind       : num  7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
#>  $ Temp       : int  67 72 74 62 56 66 65 59 61 69 ...
#>  $ Collector  : chr  "KC" "BM" "BM" "KC" ...
#>  $ Rain       : chr  "0" "1,087" "0" "217" ...
#>  $ Date       : chr  "1973 5 1" "5 2 1973" "5 3 1973" "5 4 1973" ...
#>  $ Location   : int  2 3 1 2 1 1 2 1 2 3 ...

9.1 Removing incomplete and duplicate data sets

Looks like we have 157 observations (rows) of 8 different variables (columns). Some of these entries have missing values NA, which I do not want. You remove them with the na.omit command.

dd <- na.omit(dd)
str(dd)
#> 'data.frame':    119 obs. of  8 variables:
#>  $ Obs..Number: int  1 2 3 4 6 7 8 9 11 12 ...
#>  $ Ozone      : int  41 36 12 18 28 23 19 8 7 16 ...
#>  $ Wind       : num  7.4 8 12.6 11.5 14.9 8.6 13.8 20.1 6.9 9.7 ...
#>  $ Temp       : int  67 72 74 62 66 65 59 61 74 69 ...
#>  $ Collector  : chr  "KC" "BM" "BM" "KC" ...
#>  $ Rain       : chr  "0" "1,087" "0" "217" ...
#>  $ Date       : chr  "1973 5 1" "5 2 1973" "5 3 1973" "5 4 1973" ...
#>  $ Location   : int  2 3 1 2 1 2 1 2 3 2 ...
#>  - attr(*, "na.action")= 'omit' Named int [1:38] 5 10 25 26 27 32 33 34 35 36 ...
#>   ..- attr(*, "names")= chr [1:38] "5" "10" "25" "26" ...

We now have only 119 observations, but they are complete. Some of the entries might be duplicates, we check and remove them like this:

dd <- dd[!duplicated(dd),]

The ! means “not”, so [!duplicate(dd,] means keep those rows of dd that are not duplicate, and all the columns.

9.2 Fixing data types

Next, check the data types. Note how Rain is a chr variable, when it really should be numeric. However, a numeric variable can’t have a comma , in it. We need to remove the commas, then convert the result to a numeric variable. Remember that you can check that the conversion was done correctly either with the str function or by checking the environment.

dd$Rain <- as.numeric(str_remove(dd$Rain,","))

Location should be a factor, not an integer:

dd$Location <- as.factor(dd$Location)

Let’s check our work so far:

str(dd)
#> 'data.frame':    116 obs. of  8 variables:
#>  $ Obs..Number: int  1 2 3 4 6 7 8 9 11 12 ...
#>  $ Ozone      : int  41 36 12 18 28 23 19 8 7 16 ...
#>  $ Wind       : num  7.4 8 12.6 11.5 14.9 8.6 13.8 20.1 6.9 9.7 ...
#>  $ Temp       : int  67 72 74 62 66 65 59 61 74 69 ...
#>  $ Collector  : chr  "KC" "BM" "BM" "KC" ...
#>  $ Rain       : num  0 1087 0 217 1918 ...
#>  $ Date       : chr  "1973 5 1" "5 2 1973" "5 3 1973" "5 4 1973" ...
#>  $ Location   : Factor w/ 3 levels "1","2","3": 2 3 1 2 1 2 1 2 3 2 ...
#>  - attr(*, "na.action")= 'omit' Named int [1:38] 5 10 25 26 27 32 33 34 35 36 ...
#>   ..- attr(*, "names")= chr [1:38] "5" "10" "25" "26" ...

9.3 Working with dates

Date is stored as a character right now, it should be a date format. The built-in as.Date function handles dates without times, the package ‘chron’ handles dates and times, but not time zones, and the POSIXct and POSIXlt formats allow for dates,times, and time zones. In general, you want to use the simplest format possible. To use as.Date, you need to tell R what format to expect.The as.Date() function will convert the characters that it recognizes to be part of a dates and ignore all other characters in the string.

date <- as.Date(dd$Date,format='%Y %m %d')
Format Interpretation
%d Day of the month as a number
%a Day,abbreviated
%A Day, full name
%m Month as a number
%b Month, abbreviated
%B Month, full name
%y Year, two digits
%Y Year, four digits
%H Decimal hour (24 hours)
%I Decimal hour (12 hours)
%M Decimal minute
%S Decimal second
%p AM/PM

Our problem is that ‘Date’ is not stored in a consistent format. Sometimes the year comes first, sometimes last. as.Date will mis-interpret those entries that do not conform to the specified format, check your environment.

An easy fix is to us the POSIX format . POSIXct stores date and time in seconds with the number of seconds beginning at 1 January 1970. Negative numbers are used to store dates prior to 1970. POSIXlt stores a list of day, month, year, hour, minute, second, etc. From what I have read, POSIXct is optimized for storage and computation and might be preferrable to POSIXlt..

We first turn all the dates into a variable ‘date’ transformed into the POSIX format.

Date <- as.POSIXct(dd$Date, format = "%m %d %Y")

If a date was not in the correct format, it will show up as ‘NA’. We can find those ‘NA’, have a look at them, and assign them a corrected value.

missing <- which(is.na(Date))
missing
#> [1]  1 22
dd$Date[missing]
#> [1] "1973 5 1"  "1973 5 24"
dd$Date[1] <- "5 1 1973"
dd$Date[22] <- "5 24 1973"
dd$Date <- as.POSIXct(dd$Date, format = "%m %d %Y")

We can now extract the year and month.

ddyear <- as.numeric(format(dd$Date, format="%Y"))
ddmonth <- format(dd$Date, format="%m")

This gives the levels labels and adds the new columns to our data frame.

ddmonth <- factor(ddmonth, level=c("05", "06", "07", "08", "09"), labels=c("May","June","July","Aug","Sept"))
dd <- data.frame(dd, ddyear, ddmonth)

9.4 Detecting and removing unreasonable values

Next, let’s look at some of the numeric variables, starting with ozone. The EPA measures ozone on a scale from 0 to 500, so values between 1 and 168 are reasonable.

summary(dd$Ozone)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>    1.00   18.00   31.50   42.13   63.25  168.00

Wind speed also seems reasonable:

summary(dd$Wind)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>    2.30    7.40    9.70   10.66   11.62  103.00

The temperatures on the other hand seem a little high:

summary(dd$Temp)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>   57.00   71.00   79.50   83.53   85.00  730.00

and rain should not be negative:

summary(dd$Rain)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#> -1741.0     0.0    50.0   570.4  1384.8  1994.0

For a factor variable, you can print out the levels like this:

print(levels(dd$Location))
#> [1] "1" "2" "3"

For a chr variable, get a list of all unique values like this:

unique(dd$Collector)
#> [1] "KC"             "BM"             "JD"            
#> [4] "Captain Picard" "Frodo"

Next, we remove all garbage data sets.

out <- which(dd$Rain <0)
dd <- dd[-out,]
out <- which(dd$Collector == "Frodo")
dd <- dd[-out,]
out <- which(dd$Collector == "Captain Picard")
dd <- dd[-out,]
out <- which(dd$Temp >= 120)
dd <- dd[-out,]

If you prefer, you can also combine removing and detecting invalid data like this:

out <- which(dd$Rain <0 | dd$Collector == "Frodo" | dd$Collector == "Captain Picard" | dd$Temp >= 120)

dd <- dd[-out,]

9.5 Adding and removing data, renaming variables

Assume you forgot a row of data or rows of data. You can add them with the rbind command. Note that the names in the original and added data frame have to match.

last <-max(dd$Obs..Number)
newRow <- data.frame(Obs..Number = last+1, Ozone=12,  Wind=0, Temp=36, Collector = "JD", Rain = 0, Date = "10-4-2022",Location="1",ddyear=2022,ddmonth="Oct" ) 
dd <- rbind(dd, newRow)
manyrows <- data.frame(Obs..Number =c(1,2,3,4), Ozone= c(10,11,12,12), Wind=c(1,34,21,1),Temp=c(35,37,38,66),  Collector=c("JD","BM","BM","KC"), Rain=c(12,13,1,0), Date=c("10-5-2022","10-6-2022","10-7-2022","10-8-2022"), Location=c("2","2","3","1"), ddyear=c(2022,2022,2022,2022), ddmonth=c("Oct","Oct","Oct","Oct"))
dd <- rbind(dd, manyrows)

I do not want to keep the observation number. You can remove a variable with the subset command or, if you know the column number you want to remove, with the ‘-’ command. Or you can list the name(s) of the columns you want to drop. Below are three methods listed.

dd <- subset(dd, select= -c(Obs..Number))
dd2 <- dd[,-1]
dd3 <- dd[,!names(dd) %in% c("Obs..Number")]

Lets rename Temp Temperature:

dd<- rename(dd,Temperature=Temp)

You can change the names to all upper case or all lower case with these commands:

dd<- rename_with(dd, toupper)
dd<- rename_with(dd, tolower)

We can change entries, for example change the name of our collectors from initials to full names. We first find out which initials are being used, and then assign the full names.

unique(dd$collector)
#> [1] "KC" "BM" "JD"
KC <- which(dd$collector =="KC")
BM <- which(dd$collector =="BM")
JD <- which(dd$collector =="JD")
dd$collector[KC] <- "Kevin Cool"
dd$collector[BM] <- "Bob Mole"
dd$collector[JD] <- "John Dorian"

Finally, let’s assume you want to add a new column. You need to make sure that the new column has as many rows as the rest of the data frame. The ‘nrow’ command finds how many rows you have.

new <- runif(nrow(dd), 0, 10)
dd <- data.frame(dd, new)
str(dd)
#> 'data.frame':    116 obs. of  10 variables:
#>  $ ozone      : num  41 36 12 18 28 23 19 8 7 16 ...
#>  $ wind       : num  7.4 8 12.6 11.5 14.9 8.6 13.8 20.1 6.9 9.7 ...
#>  $ temperature: num  67 72 74 62 66 65 59 61 74 69 ...
#>  $ collector  : chr  "Kevin Cool" "Bob Mole" "Bob Mole" "Kevin Cool" ...
#>  $ rain       : num  0 1087 0 217 1918 ...
#>  $ date       : POSIXct, format: "1973-05-01" ...
#>  $ location   : Factor w/ 3 levels "1","2","3": 2 3 1 2 1 2 1 2 3 2 ...
#>  $ ddyear     : num  1973 1973 1973 1973 1973 ...
#>  $ ddmonth    : Factor w/ 6 levels "May","June","July",..: 1 1 1 1 1 1 1 1 1 1 ...
#>  $ new        : num  3.797 2.593 0.27 6.907 0.486 ...

Once you are done, you should save the cleaned data as a .csv file.

write.csv(dd, "dd_clean.csv", row.names = FALSE)

9.6 Assignment

Clean the border crossing data posted on Plato.