Chapter 4 Data Preparation

Data prep in R is not easy. Consider this page to still be under construction.

4.1 Packages Needed for Data Preparation

This code will check that required packages for this chapter are installed, install them if needed, and load them into your session.

req <- substitute(require(x, character.only = TRUE))
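# dplyr is listed last so that plyr, Hmisc, and reshape do not mask its verbs (mutate, rename, summarize)
libs <- c("Hmisc", "DataExplorer", "funModeling", "reshape", "plyr", "car", "dplyr")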
sapply(libs, function(x) eval(req) || {install.packages(x); eval(req)})

4.2 Knowing Thy Data

The following commands are useful in getting to know your data.

View(mydata)        # This is useful for viewing your data set (note the capital V)
head(mydata)        # Shows the first six rows of data (use head(mydata, n = 10) for more)
dplyr::glimpse(mydata)     # glimpse is a function of the dplyr package  
funModeling::status(mydata) # shows you the number/% of zeroes, NAs, and variable type in data set
Hmisc::describe(mydata)    # describe is a function of the Hmisc package (I favor this one)
class(mydata$var)   # To see how R currently classifies a particular variable
DataExplorer::create_report(mydata)    # Useful for creating a thorough report on your data
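
As a quick check of what these commands return, here is a minimal sketch that uses only base R and the built-in mtcars data set. mtcars is a stand-in for mydata and is not part of the original examples; str() is used as a base-R counterpart to glimpse().

```r
# mtcars ships with R; it stands in for `mydata` here (an assumption for illustration)
mydata <- mtcars

first_rows <- head(mydata)          # default is n = 6 rows
first_ten  <- head(mydata, n = 10)  # ask for a different number of rows with n

str(mydata)                          # one line per variable with its type
var_class <- class(mydata$cyl)       # how R classifies one variable
na_counts <- colSums(is.na(mydata))  # number of NAs per variable

print(var_class)
print(na_counts)
```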

4.3 Cleaning and Prepping Data

Removing a variable

mydata$var <- NULL  # Base R approach

Keeping select variables (consider assigning the variables to a new dataset, as below)

mydata2 <- dplyr::select(mydata, var1, var2, var3)  # tidyverse approach

Cloning a variable or even a data set

mydata$newvar <- mydata$oldvar   # Base R approach
mydata <- dplyr::mutate(mydata, clonevar = oldvar) # tidyverse approach

clone_dataset <- original_dataset   # I often use this to simplify the name of my data set

Generating a variable based on other variable(s) (examples)

mydata$newvar <- mydata$oldvar1 / mydata$oldvar2  # Base R approach
mydata$newvar <- mydata$oldvar / 1000  # Base R approach
mydata <- dplyr::mutate(mydata, newvar = var1/var2)  # tidyverse approach
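
To make the pattern concrete, here is a small sketch on a toy data frame. The data and the variable names (income, hh_size) are hypothetical. It uses only base R; transform() is shown as a base-R way to compute a variable in a single call, similar in spirit to mutate().

```r
# Toy data; the variable names are hypothetical
mydata <- data.frame(income  = c(52000, 38000, 71000),
                     hh_size = c(2, 1, 4))

# Base R: assign a new column computed from existing ones
mydata$income_pc <- mydata$income / mydata$hh_size
mydata$income_k  <- mydata$income / 1000

# transform() computes a new column from existing ones in one call
mydata <- transform(mydata, income_pc_k = income_pc / 1000)
print(mydata)
```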

Renaming a Variable (using the “reshape” or “dplyr” package)

mydata <- reshape::rename(mydata, c("oldname" = "newname"))  # reshape approach: old = new

mydata <- dplyr::rename(mydata, newname = oldname)  # tidyverse approach: new = old

Recoding a categorical variable (using plyr, car, or forcats packages)

mydata$newvar <- plyr::revalue(mydata$origvar, c("1" = "2", "2" = "2", "3" = "1"))
mydata$newvar <- plyr::revalue(mydata$origvar, c("1 label" = "# label", "2 label" = "# label"))

mydata$newvar <- car::recode(mydata$origvar, "old=new; old=new; old=new")

mydata$newvar <- forcats::fct_recode(mydata$origvar, "new label" = "old label")  # forcats approach: new = old
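
Whichever package you use, it helps to confirm that the recode did what you intended. The sketch below is a base-R alternative, not one of the package functions above: it recodes a toy factor with a named lookup vector, using the same 1 -> 2, 2 -> 2, 3 -> 1 mapping as the first revalue() call, and then cross-tabulates old against new values. The data are hypothetical.

```r
# Toy factor; values are hypothetical
origvar <- factor(c("1", "2", "3", "2", "1"))

# Lookup table: names are the old values, elements are the new values
lookup <- c("1" = "2", "2" = "2", "3" = "1")

# Index the lookup by the old value (as character), then rebuild the factor
newvar <- factor(unname(lookup[as.character(origvar)]))

# Cross-tabulate old against new to confirm every value landed where intended
print(table(origvar, newvar))
```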

Reverse Coding

mydata$var_r <- (maxvalue + minvalue) - mydata$var  # e.g., 6 - var for a 1-to-5 scale
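
For a scale that runs from minvalue to maxvalue, the reversed score is (maxvalue + minvalue) minus the original score; subtracting from maxvalue alone gives the same result only when the scale starts at 0. A worked sketch on a hypothetical 5-point item:

```r
# Hypothetical 5-point item (1 = strongly disagree ... 5 = strongly agree)
var <- c(1, 2, 3, 4, 5, 5, 2)
minvalue <- 1
maxvalue <- 5

# Reverse coding reflects each score around the scale midpoint
var_r <- (maxvalue + minvalue) - var
print(var_r)

# Reversing a second time should return the original scores
var_rr <- (maxvalue + minvalue) - var_r
print(identical(var_rr, var))
```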

Z-scoring continuous variables

mydata$zvar <- as.numeric(scale(mydata$var))  # scale() returns a one-column matrix; as.numeric() keeps zvar a plain vector
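
The sketch below, on hypothetical data, shows what scale() returns and checks it against the z-score formula computed by hand (subtract the mean, divide by the sample standard deviation).

```r
# Hypothetical continuous variable
mydata <- data.frame(var = c(10, 12, 15, 19, 24))

z_matrix <- scale(mydata$var)        # one-column matrix with centering/scaling attributes
mydata$zvar <- as.numeric(z_matrix)  # drop to a plain numeric vector

# Same result by hand: subtract the mean, divide by the sample SD
z_by_hand <- (mydata$var - mean(mydata$var)) / sd(mydata$var)
print(mydata)
print(all.equal(mydata$zvar, z_by_hand))
```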

Adding Value Labels

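Value labels attach to factor variables, so a categorical variable must be declared as a factor. Calling factor() on its own does that and uses the existing values as labels; supplying levels and labels, as in the second command below, declares the factor and labels it in one step.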

mydata$catvar <- factor(mydata$catvar)

# In this generic code, I use values 1 through 4. Your values may differ; adjust the code accordingly.

mydata$catvar <- factor(mydata$catvar, levels = c(1,2,3,4), labels = c("label1", "label2", "label3", "label4"))
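
Here is the same command applied to a small hypothetical variable, so you can see how the codes map onto the labels. One thing worth checking afterwards: any code that is not listed in levels becomes NA.

```r
# Hypothetical numeric codes for a 4-category variable
mydata <- data.frame(catvar = c(1, 3, 2, 4, 1, 2))

# levels = the codes as stored; labels = the text to show, in the same order
mydata$catvar <- factor(mydata$catvar,
                        levels = c(1, 2, 3, 4),
                        labels = c("label1", "label2", "label3", "label4"))

print(levels(mydata$catvar))
print(table(mydata$catvar))

# Codes missing from `levels` would turn into NA, so count them
n_missing <- sum(is.na(mydata$catvar))
print(n_missing)
```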

4.4 Subsetting Data and Imposing Conditions

Generating Subset Datasets

You may want to consult this blog post on approaches to subsetting a data frame in R.

If you want to generate a new dataset that is a subset of an existing one, this may be the most straightforward approach. It makes use of dplyr’s select() function.

mydata_subset <- dplyr::select(mydata, var1, var2, var3, var4)

You can also restrict the new dataset to observations that meet a condition by adding dplyr’s filter() function. The following command applies the same selection but keeps only those observations that have a value of 2 on var4.

mydata_subset <- mydata %>%
   filter(var4 == 2) %>%
   select(var1, var2, var3, var4)
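
If you prefer not to depend on dplyr, base R's subset() can do the row filter and the column selection in one call. The sketch below is that base-R equivalent, not the dplyr pipeline itself, run on hypothetical toy data (var5 is added only to show a column being dropped).

```r
# Toy data; names follow the generic var1..var4 used in this section
mydata <- data.frame(var1 = 1:6,
                     var2 = c("a", "b", "a", "b", "a", "b"),
                     var3 = c(2.5, 3.1, 4.8, 1.2, 5.0, 3.3),
                     var4 = c(2, 1, 2, 3, 2, 1),
                     var5 = letters[1:6])

# Rows: keep observations with var4 equal to 2
# Columns: keep var1 through var4 (var5 is dropped)
mydata_subset <- subset(mydata, var4 == 2, select = c(var1, var2, var3, var4))
print(mydata_subset)
```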

Embedding Conditions within Commands

If you simply want to run an analysis under certain conditions without first generating another dataset, you can embed the condition(s) within the command. For example, the following commands show a regular ANOVA command followed by commands that restrict the sample based on conditions tied to var1 and var2. Note that < and > are not meaningful for unordered factor variables (R returns NA with a warning); test them with == or %in% instead. When you combine | with &, wrap the | conditions in ( ), because & is evaluated before |.

object <- aov(intvar ~ catvar, data = mydata)
object <- aov(intvar ~ catvar, data = subset(mydata, var1<3 & var2 == 2))
object <- aov(intvar ~ catvar, data = subset(mydata, (var1 == 1 | var1 == 2) & var2 < 40))

object <- lm(dv ~ iv1 + iv2, data = filter(mydata, var3 < 10 & var4 == 3))
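
The point about factor variables is easy to verify. In this sketch (hypothetical data, base R only), comparing an unordered factor with < returns NA for every row, while matching levels explicitly with %in% works; %in% here is shorthand for the chained == ... | == ... condition used above.

```r
# Hypothetical factor and numeric variables
mydata <- data.frame(var1 = factor(c("1", "2", "3", "1", "2", "3")),
                     var2 = c(35, 42, 18, 39, 27, 55))

# `<` is not defined for unordered factors: R warns and returns NA for every row
lt_result <- suppressWarnings(mydata$var1 < "3")
print(lt_result)

# Match the levels explicitly instead, then add the numeric condition
kept_rows <- subset(mydata, var1 %in% c("1", "2") & var2 < 40)
print(kept_rows)
```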

object <- mydata %>%
   filter(var3 < 10 & var4 == 3) %>%
   ggplot(aes(x = xvar, y = yvar)) +
   geom_point()   # geom assumed for illustration (the original layer is missing); requires ggplot2

object <- mydata %>%
   select(var1, var2, var3) %>%