Chapter 4 Data Preparation

Data preparation in R is not easy. This page is still under construction.

4.1 Packages Needed for Data Preparation

This code will check that required packages for this chapter are installed, install them if needed, and load them into your session.

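# Template call: require(x), with x treated as a character string
req <- substitute(require(x, character.only = TRUE))
# plyr is listed before dplyr because loading plyr after dplyr masks dplyr functions
libs <- c("plyr", "dplyr", "Hmisc", "DataExplorer", "funModeling", "reshape", "car")
# Try to load each package; if that fails, install it and try again
sapply(libs, function(x) eval(req) || {install.packages(x); eval(req)})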

4.2 Knowing Thy Data

The following commands are useful in getting to know your data.

View(mydata)        # This is useful for viewing your data set (note the capital V)
str(mydata)         # Shows the structure: each variable's type and first few values
head(mydata)        # Shows the first six rows of data (use head(mydata, n = 5) for five)
dplyr::glimpse(mydata)     # glimpse is a function of the dplyr package  
funModeling::status(mydata) # shows you the number/% of zeroes, NAs, and variable type in data set
Hmisc::describe(mydata)    # describe is a function of the Hmisc package (I favor this one)
class(mydata$var)   # To see how R currently classifies a particular variable
DataExplorer::create_report(mydata)    # Useful for creating a thorough report on your data
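
To try these commands without your own data, a small made-up data frame is enough. The sketch below uses base R only; the data frame and its column names are invented for illustration and are not part of any real data set.

```r
# Hypothetical toy data set (all names and values are made up)
mydata <- data.frame(
  id     = 1:8,
  age    = c(23, 35, 41, NA, 52, 29, 38, 47),
  group  = c(1, 2, 2, 3, 1, 3, 2, 1),
  income = c(32000, 54000, 0, 61000, 47000, 0, 58000, 75000)
)

str(mydata)             # structure: variable names, types, first values
head(mydata)            # first six rows by default
head(mydata, n = 3)     # first three rows
class(mydata$group)     # numeric until you declare it a factor
colSums(is.na(mydata))  # number of NAs per variable (base R)
summary(mydata)         # quick summary of every variable
```

The package-based commands above (glimpse, status, describe) should accept this same data frame once the packages are loaded.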

4.3 Cleaning and Prepping Data

Removing a variable

mydata$var <- NULL

Keeping select variables (consider assigning the variables to a new dataset, as below)

mydata2 <- mydata[ , c("var1", "var2", "var3")]

Cloning a variable or even a data set

mydata$newvar <- mydata$oldvar
clone_dataset <- original_dataset   # I often use this to simplify the name of my data set

Generating a variable based on other variable(s) (examples)

mydata$newvar <- mydata$oldvar1 / mydata$oldvar2
mydata$newvar <- mydata$oldvar / 1000

Renaming a Variable (using “reshape” package)

mydata <- reshape::rename(mydata, c("oldname" = "newname"))
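
If you would rather not load a package just to rename a column, base R can do the same thing. This is a minimal sketch on a made-up data frame; oldname, newname, and other are placeholder names.

```r
# Hypothetical data frame with a column we want to rename
mydata <- data.frame(oldname = c(10, 20, 30), other = c("a", "b", "c"))

# Find the column called "oldname" and overwrite its name
names(mydata)[names(mydata) == "oldname"] <- "newname"

names(mydata)   # check the result

# dplyr offers a similar function, but note the order is new = old:
# mydata <- dplyr::rename(mydata, newname = oldname)
```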

Recoding a categorical variable (using plyr or car packages)

# revalue() expects a factor or character vector; wrap numeric codes in as.character() first
mydata$newvar <- plyr::revalue(mydata$origvar, c("1" = "2", "2" = "2", "3" = "1"))
mydata$newvar <- plyr::revalue(mydata$origvar, c("1 label" = "# label", "2 label" = "# label"))

mydata$newvar <- car::recode(mydata$origvar, "old=new; old=new; old=new")
mydata              # Print the data to check the recode
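
To see what a recode like this does to real values, it can help to run one on a tiny made-up variable and cross-tabulate old against new. The sketch below is a base R alternative (a named lookup vector), not the plyr or car approach; origvar and the mapping 1 to 2, 2 to 2, 3 to 1 are assumptions chosen to mirror the first example above.

```r
# Hypothetical variable coded 1-3, with one missing value
mydata <- data.frame(origvar = c(1, 2, 3, 3, 2, 1, NA))

# Named lookup vector: names are the old values, elements are the new values
lookup <- c("1" = 2, "2" = 2, "3" = 1)

# Index the lookup by the old value (converted to character); NA stays NA
mydata$newvar <- unname(lookup[as.character(mydata$origvar)])

# Cross-tabulate old against new to confirm the mapping
table(mydata$origvar, mydata$newvar, useNA = "ifany")
```

Whichever function you use, a cross-tab like this is a quick way to confirm that every old value landed where you intended.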

Reverse Coding

summary(mydata$var)   # Check the minimum and maximum first
mydata$var_r <- (minvalue + maxvalue) - mydata$var   # e.g., 6 - var for a 1-to-5 scale
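
A common way to reverse a scale is to subtract each score from the sum of the scale's lowest and highest possible values. Here is a small worked sketch for a 1-to-5 item; the data and the names minvalue and maxvalue are made up for illustration.

```r
# Hypothetical 1-to-5 Likert item
mydata <- data.frame(var = c(1, 2, 3, 4, 5, 5, 2))

minvalue <- 1   # lowest possible score on the scale
maxvalue <- 5   # highest possible score on the scale

# Reverse code: 1 becomes 5, 2 becomes 4, and so on
mydata$var_r <- (minvalue + maxvalue) - mydata$var

# Cross-tabulate to confirm each value was flipped
table(mydata$var, mydata$var_r)
```

Use the scale's possible minimum and maximum rather than the observed ones if some response options were never chosen.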

Z-scoring continuous variables

mydata$zvar <- as.numeric(scale(mydata$var))   # scale() returns a one-column matrix; as.numeric() stores it as a plain vector
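
By default, scale() subtracts the mean and divides by the standard deviation. The sketch below checks that against the formula written out by hand, using made-up values; as.numeric() is used here so the result is stored as an ordinary vector rather than a one-column matrix.

```r
# Hypothetical continuous variable
mydata <- data.frame(var = c(2, 4, 4, 4, 5, 5, 7, 9))

# z-score with scale(), stored as a plain numeric vector
mydata$zvar <- as.numeric(scale(mydata$var))

# The same thing by hand: (value - mean) / standard deviation
mydata$zvar_manual <- (mydata$var - mean(mydata$var)) / sd(mydata$var)

all.equal(mydata$zvar, mydata$zvar_manual)   # the two versions should agree
mean(mydata$zvar)                            # should be (essentially) zero
sd(mydata$zvar)                              # should be one
```

If the variable has missing values, remember that mean() and sd() need na.rm = TRUE in the by-hand version; scale() skips NAs on its own.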

Adding Value Labels

Before adding value labels, you must first declare your categorical variables as factor variables.

mydata$catvar <- factor(mydata$catvar)

# In this generic code, I use values 1 through 4. Your values may differ; adjust the code accordingly.

mydata$catvar <- factor(mydata$catvar, levels = c(1,2,3,4), labels = c("label1", "label2", "label3", "label4"))
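
As a concrete sketch, here is the same command applied to a made-up variable coded 1 through 4. The labels (class years) are purely an assumption for illustration.

```r
# Hypothetical categorical variable stored as numbers
mydata <- data.frame(catvar = c(1, 3, 2, 4, 4, 1))

# Declare it a factor and attach labels in the same order as the levels
mydata$catvar <- factor(mydata$catvar,
                        levels = c(1, 2, 3, 4),
                        labels = c("Freshman", "Sophomore", "Junior", "Senior"))

levels(mydata$catvar)       # the labels, in level order
table(mydata$catvar)        # frequencies now display with labels
as.integer(mydata$catvar)   # the underlying codes are still available
```

Any value not listed in levels becomes NA, so it is worth running table() on the variable before and after labeling.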

4.4 Subsetting Data and Imposing Conditions

Generating Subset Datasets

You may want to consult this blog post on how to subset a data frame in R: https://www.r-bloggers.com/2016/11/5-ways-to-subset-a-data-frame-in-r/

If you want to generate a new dataset that is a subset of an existing one, this may be the most straightforward approach:

mydata_subset <- subset(mydata, select = c("Variable 1", "Variable 2", "Variable 3"))

You can also include an option to restrict the new dataset based on other conditions. The following command replicates the above subset command, but specifies that the new dataset should include only those cases that have a value of 2 on variable 4.

mydata_subset <- subset(mydata, var4 == 2, select = c("Variable 1", "Variable 2", "Variable 3"))
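
Here is a small runnable sketch of subsetting rows and columns at once, with the bracket-based equivalent alongside for comparison. The data frame and the names var1 through var4 are made up for illustration.

```r
# Hypothetical data set
mydata <- data.frame(
  var1 = c(10, 20, 30, 40, 50),
  var2 = c("a", "b", "c", "d", "e"),
  var3 = c(TRUE, FALSE, TRUE, FALSE, TRUE),
  var4 = c(2, 1, 2, 3, 2)
)

# Keep var1-var3, but only for cases where var4 equals 2
mydata_subset <- subset(mydata, var4 == 2, select = c(var1, var2, var3))

# The same result using square brackets: [rows, columns]
mydata_subset2 <- mydata[mydata$var4 == 2, c("var1", "var2", "var3")]

mydata_subset
```

One difference worth knowing: when the condition evaluates to NA for a case, subset() drops that case, while the bracket version keeps it as a row of NAs.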

Embedding Conditions within Commands

If you simply want to run an analysis under certain conditions without first generating another dataset, you can embed the condition(s) within the command. For example, the following commands show a regular ANOVA command followed by commands that restrict the sample based on conditions tied to var1 and var2. Note that < and > do not work with (unordered) factor variables when subsetting; use == instead, and wrap alternatives joined by | in parentheses when you combine them with a condition on another variable.

object <- aov(intvar ~ catvar, data = mydata)
object <- aov(intvar ~ catvar, data = subset(mydata, var1 < 3 & var2 == 2))
object <- aov(intvar ~ catvar, data = subset(mydata, (var1 == 1 | var1 == 2) & var2 < 40))
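
To try this pattern on real numbers, the sketch below uses mtcars, a data set that ships with R. Treating mpg as the outcome and cylinder count as the grouping factor is only an assumption for illustration; cars and cyl_f are made-up names.

```r
# mtcars ships with R; mpg plays the role of intvar, cyl_f the role of catvar
cars <- mtcars
cars$cyl_f <- factor(cars$cyl)

# Full sample
fit_all <- aov(mpg ~ cyl_f, data = cars)

# Restricted sample: numeric conditions can use < and >
fit_sub <- aov(mpg ~ cyl_f, data = subset(cars, hp < 150 & am == 1))

# Condition on a factor: use == (joined by |), not < or >
fit_46 <- aov(mpg ~ cyl_f,
              data = subset(cars, (cyl_f == "4" | cyl_f == "6") & hp < 150))

summary(fit_sub)
```

It is a good habit to check how many cases survived the condition, for example with nrow(subset(cars, hp < 150 & am == 1)), before interpreting the results.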