Chapter 4 Data Preparation
Data prep in R is not easy. Consider this page to still be under construction.
4.1 Packages Needed for Data Preparation
This code will check that required packages for this chapter are installed, install them if needed, and load them into your session.
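# Template call; x is supplied by the anonymous function below
req <- substitute(require(x, character.only = TRUE))
# dplyr is listed last so that plyr, reshape, and Hmisc do not mask its functions (e.g., mutate, rename, summarise)
libs <- c("Hmisc", "DataExplorer", "funModeling", "reshape", "plyr", "car", "dplyr")
# Try to load each package; if loading fails, install it and try again
sapply(libs, function(x) eval(req) || {install.packages(x); eval(req)})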
4.2 Knowing Thy Data
The following commands are useful in getting to know your data.
View(mydata) # This is useful for viewing your data set (note the capital V)
str(mydata)
head(mydata) # Shows the first six rows of data (the default); use head(mydata, n) for n rows
dplyr::glimpse(mydata) # glimpse is a function of the dplyr package
funModeling::status(mydata) # shows you the number/% of zeroes, NAs, and variable type in data set
Hmisc::describe(mydata) # describe is a function of the Hmisc package (I favor this one)
class(mydata$var) # To see how R currently classifies a particular variable
DataExplorer::create_report(mydata) # Useful for creating a thorough report on your data
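To try a few of these commands without your own data, here is a minimal sketch on a made-up data frame. The name toy and its columns are hypothetical, and only base R functions are used.

```r
# Hypothetical toy data frame (all names and values are illustrative only)
toy <- data.frame(
  id    = 1:10,
  score = c(3.2, 4.1, NA, 5.0, 2.7, 3.9, 4.4, 0, 3.3, 4.8),
  group = c("a", "b", "a", "c", "b", "a", "c", "b", "a", "c"),
  stringsAsFactors = FALSE
)

str(toy)                        # one line per variable, with its type
first_rows <- head(toy)         # head() returns six rows unless told otherwise
nrow(first_rows)
group_class <- class(toy$group) # "character", because stringsAsFactors = FALSE
na_counts <- colSums(is.na(toy)) # number of NAs in each variable
na_counts
```

Checking class() and the NA counts early tends to save trouble later, since many of the commands in this chapter behave differently for characters, factors, and numeric variables.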
4.3 Cleaning and Prepping Data
Removing a variable
mydata$var <- NULL # Base R approach
Keeping select variables (consider assigning the variables to a new dataset, as below)
mydata2 <- select(mydata, var1, var2, var3) # tidyverse approach
Cloning a variable or even a data set
mydata$newvar <- mydata$oldvar # Base R approach
mydata <- dplyr::mutate(mydata, clonevar = oldvar) # tidyverse approach
clone_dataset <- original_dataset # I often use this to simplify the name of my data set
Generating a variable based on other variable(s) (examples)
mydata$newvar <- mydata$oldvar1 / mydata$oldvar2 # Base R approach
mydata$newvar <- mydata$oldvar / 1000 # Base R approach
mydata <- dplyr::mutate(mydata, newvar = var1/var2) # tidyverse approach
Renaming a variable (using the reshape or dplyr package)
mydata <- reshape::rename(mydata, c("oldname" = "newname")) # reshape approach: old name on the left
mydata <- dplyr::rename(mydata, newname = oldname) # tidyverse approach
Recoding a categorical variable (using plyr, car, or forcats packages)
mydata$newvar <- plyr::revalue(mydata$origvar, c("1" = "2", "2" = "2", "3" = "1"))
mydata$newvar <- plyr::revalue(mydata$origvar, c("1 label" = "# label", "2 label" = "# label"))
mydata$newvar <- car::recode(mydata$origvar, "old=new; old=new; old=new")
mydata$newvar <- forcats::fct_recode(mydata$origvar, "new label" = "old label") # forcats approach: new label on the left
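If none of these packages is available, a similar recode can be sketched in base R with a named lookup vector. This is a swapped-in technique rather than one of the package functions above; the vector names origvar, lookup, and newvar are hypothetical, and the mapping mirrors the first revalue example (1 becomes 2, 2 stays 2, 3 becomes 1).

```r
# Hypothetical original variable coded 1, 2, 3 (with one missing value)
origvar <- c(1, 3, 2, 2, 1, NA)

# Named lookup vector: names are the old values, elements are the new values
lookup <- c("1" = 2, "2" = 2, "3" = 1)

# Index the lookup by the old value converted to character; NA stays NA
newvar <- unname(lookup[as.character(origvar)])
newvar
```

Any old value that is missing from the lookup also comes back as NA, so it is worth tabulating the old and new variables against each other afterwards, for example with table(origvar, newvar, useNA = "ifany").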
Reverse Coding
summary(mydata$var) # Check the minimum and maximum values first
mydata$var_r <- (minvalue + maxvalue) - mydata$var # e.g., 6 - var for a 1-to-5 scale
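Reverse coding maps the lowest value to the highest and vice versa. For a scale with known endpoints, that amounts to (minimum + maximum) - value. The sketch below assumes a hypothetical item scored 1 to 5; item, minvalue, and maxvalue are illustrative names.

```r
# Hypothetical item on a 1-to-5 scale, with one missing response
item <- c(1, 2, 3, 4, 5, NA)

minvalue <- 1
maxvalue <- 5

# Reverse code: 1 <-> 5, 2 <-> 4, 3 stays 3; NA stays NA
item_r <- (minvalue + maxvalue) - item
item_r
```

A quick check is that item + item_r equals minvalue + maxvalue for every non-missing row.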
Z-scoring continuous variables
mydata$zvar <- as.numeric(scale(mydata$var)) # scale() returns a one-column matrix; as.numeric() stores it as a plain vector
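As a small check on what scale() does, the sketch below compares it with the z-score formula written out by hand, (x - mean) / sd. The vector x is made up for illustration.

```r
# Hypothetical continuous variable
x <- c(10, 12, 9, 15, 14)

z_matrix <- scale(x)           # centers and scales; result is a one-column matrix
is.matrix(z_matrix)

z <- as.numeric(z_matrix)      # drop the matrix structure and attributes
z_manual <- (x - mean(x)) / sd(x)

all.equal(z, z_manual)         # same values as the hand-written formula
mean(z)                        # approximately 0
sd(z)                          # approximately 1
```

Note that sd() uses the sample standard deviation (n - 1 in the denominator), and so does scale() by default.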
Adding Value Labels
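In R, value labels live on the levels of a factor, so a categorical variable has to be a factor before it can carry labels. The first command below only converts the variable to a factor; the second converts it and attaches labels in one step.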
mydata$catvar <- factor(mydata$catvar)
# In this generic code, I use values 1 through 4. Your values may differ; adjust the code accordingly.
mydata$catvar <- factor(mydata$catvar, levels = c(1,2,3,4), labels = c("label1", "label2", "label3", "label4"))
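To see how levels and labels line up, here is a minimal sketch on a made-up vector. The names catvar and catvar_f are hypothetical, and the value 5 is included on purpose to show what happens to a value that is not listed in levels.

```r
# Hypothetical categorical variable stored as numbers; 5 is not a declared level
catvar <- c(2, 1, 4, 3, 5)

catvar_f <- factor(catvar,
                   levels = c(1, 2, 3, 4),
                   labels = c("label1", "label2", "label3", "label4"))

levels(catvar_f)                  # the four labels, in level order
labels_out <- as.character(catvar_f)
labels_out                        # the undeclared value 5 has become NA
table(catvar_f, useNA = "ifany")  # counts per label, plus the NA
```

Because undeclared values silently turn into NA, it helps to compare table() output before and after labeling.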
4.4 Subsetting Data and Imposing Conditions
Generating Subset Datasets
You may want to consult this blog post on approaches to subsetting a data frame in R.
If you want to generate a new dataset that is a subset of an existing one, this may be the most straightforward approach. It makes use of dplyr’s select() function.
mydata_subset <- dplyr::select(mydata, var1, var2, var3, var4)
You can also include an option to restrict the new dataset based on other conditions by making use of dplyr’s filter() function. The following command replicates the above selection but specifies that the new dataset should include only those observations that have a value of 2 on variable 4.
mydata_subset <- mydata %>%
  filter(var4 == 2) %>%
  select(var1, var2, var3, var4)
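For comparison, roughly the same filter-then-select step can be sketched in base R with subset(), which is swapped in here for the dplyr functions. The data frame toy and its values are made up for illustration.

```r
# Hypothetical data frame with five variables
toy <- data.frame(
  var1 = 1:6,
  var2 = c(10, 20, 30, 40, 50, 60),
  var3 = c(2.5, 3.1, 4.7, 1.2, 9.9, 0.4),
  var4 = c(2, 1, 2, NA, 2, 3),
  var5 = c("x", "y", "z", "x", "y", "z"),
  stringsAsFactors = FALSE
)

# Keep rows where var4 equals 2, and keep only var1 to var4
toy_subset <- subset(toy, var4 == 2, select = c(var1, var2, var3, var4))
toy_subset
```

Like dplyr's filter(), subset() drops rows where the condition evaluates to NA, so the row with a missing var4 is not carried into the new dataset.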
Embedding Conditions within Commands
If you simply want to run an analysis on part of the sample without first generating another dataset, you can embed the condition(s) within the command. The commands below show a regular ANOVA command followed by commands that restrict the sample based on conditions tied to other variables. Note that < and > do not work with unordered factor variables when subsetting (R returns NA with a warning); use == or %in% instead. When you combine an "or" condition with an "and" condition, wrap the "or" part in ( ), as in the third command.
object <- aov(intvar ~ catvar, data = mydata)
object <- aov(intvar ~ catvar, data = subset(mydata, var1 < 3 & var2 == 2))
object <- aov(intvar ~ catvar, data = subset(mydata, (var1 == 1 | var1 == 2) & var2 < 40))
object <- lm(dv ~ iv1 + iv2, data = dplyr::filter(mydata, var3 < 10 & var4 == 3))
# The next example needs ggplot2, which is not loaded in section 4.1
object <- mydata %>%
  filter(var3 < 10 & var4 == 3) %>%
  ggplot(aes(x = xvar, y = yvar)) +
  geom_point()
# The next example needs the corrmorant package, which is not loaded in section 4.1
object <- mydata %>%
  select(var1, var2, var3) %>%
  corrmorant()
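The point about < and > with factor variables can be seen in a small sketch. The vectors grp and grp_ord are made up for illustration; the ordered factor is included because, unlike an unordered one, it does support < and >.

```r
# Hypothetical unordered factor
grp <- factor(c("low", "mid", "high", "mid"))

# < is not meaningful for an unordered factor: R warns and returns NA
unordered_cmp <- suppressWarnings(grp < "mid")
unordered_cmp

# == and %in% work as expected
eq_cmp <- grp == "mid"
in_cmp <- grp %in% c("low", "mid")

# An ordered factor has a defined level order, so < and > are allowed
grp_ord <- factor(c("low", "mid", "high", "mid"),
                  levels = c("low", "mid", "high"),
                  ordered = TRUE)
ord_cmp <- grp_ord < "high"
ord_cmp
```

Inside subset() or filter(), a condition that evaluates to NA drops the row, so an accidental < on an unordered factor can silently produce an empty dataset.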