Chapter 4 Data Preparation
Data preparation in R is not easy. Consider this page still under construction.
4.1 Packages Needed for Data Preparation
This code will check that required packages for this chapter are installed, install them if needed, and load them into your session.
req <- substitute(require(x, character.only = TRUE))
libs <- c("plyr", "dplyr", "Hmisc", "DataExplorer", "funModeling", "reshape", "car") # plyr is listed before dplyr because loading plyr after dplyr triggers a masking warning
sapply(libs, function(x) eval(req) || {install.packages(x); eval(req)})
4.2 Knowing Thy Data
The following commands are useful in getting to know your data.
View(mydata) # This is useful for viewing your data set (note the capital V)
str(mydata)
head(mydata) # Shows the first six rows of data by default; use head(mydata, n = 5) for five
dplyr::glimpse(mydata) # glimpse is a function of the dplyr package
funModeling::status(mydata) # shows you the number/% of zeroes, NAs, and variable type in data set
Hmisc::describe(mydata) # describe is a function of the Hmisc package (I favor this one)
class(mydata$var) # To see how R currently classifies a particular variable
DataExplorer::create_report(mydata) # Useful for creating a thorough report on your data
4.3 Cleaning and Prepping Data
Removing a variable
mydata$var <- NULL
Keeping select variables (consider assigning the variables to a new dataset, as below)
mydata2 <- mydata[ , c("var1", "var2", "var3")]
Cloning a variable or even a data set
mydata$newvar <- mydata$oldvar
clone_dataset <- original_dataset # I often use this to simplify the name of my data set
Generating a variable based on other variable(s) (examples)
mydata$newvar <- mydata$oldvar1 / mydata$oldvar2
mydata$newvar <- mydata$oldvar / 1000
Renaming a Variable (using “reshape” package)
mydata <- reshape::rename(mydata, c("oldname" = "newname"))
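If you would rather not depend on the reshape package, a variable can also be renamed in base R. The following is a minimal sketch; the toy data set and the names oldname and newname are placeholders made up for illustration.

```r
# Hypothetical toy data set; "oldname" and "newname" are placeholder names
mydata <- data.frame(id = 1:3, oldname = c(10, 20, 30))

# Find the column called "oldname" and overwrite its name
names(mydata)[names(mydata) == "oldname"] <- "newname"

names(mydata)
```

Matching on the name rather than the column position means the code keeps working if the column order changes.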
Recoding a categorical variable (using plyr or car packages)
mydata$newvar <- plyr::revalue(mydata$origvar, c("1" = "2", "2" = "2", "3" = "1"))
mydata$newvar <- plyr::revalue(mydata$origvar, c("1 label" = "# label", "2 label" = "# label"))
mydata$newvar <- car::recode(mydata$origvar, "old=new; old=new; old=new") # replace each old=new pair with actual values, e.g. "1=2; 2=2; 3=1"
mydata
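To see what a recode does to actual values, here is a small base R sketch that uses a named lookup vector instead of plyr or car. It is only one possible approach; the toy data and the mapping (1 to 2, 2 to 2, 3 to 1) are assumptions chosen to mirror the generic example above.

```r
# Hypothetical toy data: a categorical variable coded 1, 2, 3 (with one missing value)
mydata <- data.frame(origvar = c(1, 2, 3, 3, 1, NA))

# Lookup table: the names are the old values, the elements are the new values
lookup <- c("1" = 2, "2" = 2, "3" = 1)

# Index the lookup table by the old values (as character); NA stays NA
mydata$newvar <- unname(lookup[as.character(mydata$origvar)])

# Cross-tabulate old against new to confirm the recode did what you intended
table(mydata$origvar, mydata$newvar, useNA = "ifany")
```

Cross-tabulating the original and recoded variables is a quick check worth running after any recode, whichever function you use.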
Reverse Coding
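summary(mydata$var) # check the minimum and maximum values of the variable
mydata$var_r <- (minvalue + maxvalue) - mydata$var # keeps the original range; for a scale that starts at 0 this reduces to maxvalue - var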
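As a worked illustration of reverse coding, the sketch below assumes a hypothetical 5-point item scored 1 to 5. Subtracting each score from the sum of the scale minimum and maximum flips the scale while keeping the same range; minvalue and maxvalue here are assumed values, not something R supplies.

```r
# Hypothetical 5-point item scored 1 (low) to 5 (high), with one missing value
mydata <- data.frame(var = c(1, 2, 3, 4, 5, NA))

minvalue <- 1  # lowest possible score on the scale (assumed)
maxvalue <- 5  # highest possible score on the scale (assumed)

# Subtracting from (min + max) flips the scale but keeps the 1-to-5 range
mydata$var_r <- (minvalue + maxvalue) - mydata$var

# Check: 1 should pair with 5, 2 with 4, and so on
table(mydata$var, mydata$var_r)
```

It is usually safer to use the theoretical minimum and maximum of the scale than the observed values from summary(), since a sample may not contain every possible score.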
Z-scoring continuous variables
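mydata$zvar <- as.numeric(scale(mydata$var)) # scale() returns a one-column matrix; as.numeric() stores it as an ordinary numeric variable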
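A quick way to confirm that z-scoring worked is to check that the new variable has a mean of 0 and a standard deviation of 1. The sketch below uses made-up values and wraps scale() in as.numeric() so the result is a plain vector.

```r
# Hypothetical continuous variable with one missing value
mydata <- data.frame(var = c(2, 4, 4, 6, 9, NA))

# scale() subtracts the mean and divides by the SD.
# Missing values are ignored in those calculations and remain NA.
mydata$zvar <- as.numeric(scale(mydata$var))

mean(mydata$zvar, na.rm = TRUE)  # should be 0, up to rounding error
sd(mydata$zvar, na.rm = TRUE)    # should be 1
```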
Adding Value Labels
Before adding value labels, you must first declare your categorical variables as factor variables.
mydata$catvar <- factor(mydata$catvar)
# In this generic code, I use values 1 through 4. Your values may differ; adjust the code accordingly.
mydata$catvar <- factor(mydata$catvar, levels = c(1,2,3,4), labels = c("label1", "label2", "label3", "label4"))
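To make the labeling step concrete, here is a small sketch on made-up data. The values 1 through 4 and the label names are placeholders, as in the generic code above.

```r
# Hypothetical categorical variable stored as numbers
mydata <- data.frame(catvar = c(1, 3, 2, 4, 4, 1))

# Declare the variable as a factor and attach a label to each value
mydata$catvar <- factor(mydata$catvar,
                        levels = c(1, 2, 3, 4),
                        labels = c("label1", "label2", "label3", "label4"))

levels(mydata$catvar)  # the labels, in level order
table(mydata$catvar)   # counts per label
```

Note that any value in the data that is not listed in levels becomes NA, so it is worth comparing the table of counts before and after labeling.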
4.4 Subsetting Data and Imposing Conditions
Generating Subset Datasets
You may want to consult this blog post on how to subset a data frame in R: https://www.r-bloggers.com/2016/11/5-ways-to-subset-a-data-frame-in-r/
If you want to generate a new dataset that is a subset of an existing one, this may be the most straightforward approach:
mydata_subset <- subset(mydata, select = c("Variable 1", "Variable 2", "Variable 3"))
You can also include an option to restrict the new dataset based on other conditions. The following command replicates the above subset command, but specifies that the new dataset should include only those cases that have a value of 2 on variable 4.
mydata_subset <- subset(mydata, var4 == 2, select = c("Variable 1", "Variable 2", "Variable 3"))
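If you want to try both forms of subset() on real data, the sketch below uses mtcars, a data set that ships with R. The object names cars_small and cars_4cyl are made up for this example.

```r
# mtcars ships with R; mpg, hp, wt, and cyl are columns in that data set
data(mtcars)

# Keep three variables for all cases
cars_small <- subset(mtcars, select = c("mpg", "hp", "wt"))

# Keep the same three variables, but only for cars with 4 cylinders
cars_4cyl <- subset(mtcars, cyl == 4, select = c("mpg", "hp", "wt"))

dim(cars_small)  # rows and columns in the full selection
dim(cars_4cyl)   # fewer rows, same columns
```

The condition can refer to a variable (here cyl) that is not among the selected columns, because the rows are filtered on the original data set.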
Embedding Conditions within Commands
If you simply want to run an analysis under certain conditions without first generating another dataset, you can embed the condition(s) within the command. For example, the following commands show a regular ANOVA command followed by commands that restrict the sample based on conditions tied to var1 and var2. Note that you cannot use < or > with unordered factor variables when subsetting; you will have to use == instead, and wrap alternatives in ( ) when combining them with a condition on another variable.
object <- aov(intvar ~ catvar, data = mydata)
object <- aov(intvar ~ catvar, data = subset(mydata, var1 < 3 & var2 == 2))
object <- aov(intvar ~ catvar, data = subset(mydata, (var1 == 1 | var1 == 2) & var2 < 40))
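The same pattern can be tried on the built-in mtcars data. This is only a sketch: cyl_f is a hypothetical factor version of cyl created for the example, and %in% is used as a shorter alternative to chaining several == conditions with |.

```r
data(mtcars)
mtcars$cyl_f <- factor(mtcars$cyl)  # hypothetical factor version of cyl

# ANOVA on the full sample
fit_all <- aov(mpg ~ cyl_f, data = mtcars)

# Same model, restricted by a numeric condition and a factor condition.
# Factor levels are compared with == or %in%, not with < or >.
fit_sub <- aov(mpg ~ cyl_f,
               data = subset(mtcars, hp < 150 & cyl_f %in% c("4", "6")))

nobs(fit_all)     # cases used in the full model
nobs(fit_sub)     # cases used in the restricted model
summary(fit_sub)
```

Comparing the number of cases used by each model is a simple way to confirm that the embedded condition restricted the sample as intended.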