# Chapter 4 Data Preparation

Data prep in R is not easy. Consider this page to still be under construction.

## 4.1 Packages Needed for Data Preparation

This code will check that required packages for this chapter are installed, install them if needed, and load them into your session.

```
# Load plyr and reshape before dplyr so dplyr's rename(), mutate(), etc. are not masked
libs <- c("plyr", "reshape", "car", "Hmisc", "DataExplorer", "funModeling", "dplyr")
for (pkg in libs) {
  if (!requireNamespace(pkg, quietly = TRUE)) install.packages(pkg)
  library(pkg, character.only = TRUE)
}
```

## 4.2 Knowing Thy Data

The following commands are useful in getting to know your data.

```
View(mydata) # This is useful for viewing your data set (note the capital V)
str(mydata) # Shows the structure: each variable's type and its first few values
head(mydata) # Shows the first six rows of data
dplyr::glimpse(mydata) # glimpse is a function of the dplyr package
funModeling::status(mydata) # shows you the number/% of zeroes, NAs, and variable type in data set
Hmisc::describe(mydata) # describe is a function of the Hmisc package (I favor this one)
class(mydata$var) # To see how R currently classifies a particular variable
DataExplorer::create_report(mydata) # Useful for creating a thorough report on your data
```
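
If you want to try a few of these before loading your own data, here is a minimal sketch that uses only base R and the built-in `mtcars` data frame. The choice of `mtcars` is mine for illustration; the two `colSums()` lines are a rough base R stand-in for the zero and NA counts that `funModeling::status()` reports.

```r
# mtcars ships with R, so nothing needs to be installed
mydata <- mtcars

str(mydata)              # one line per variable: type plus the first few values
head(mydata)             # first six rows (the default)
head(mydata, n = 3)      # use n to ask for a different number of rows
class(mydata$mpg)        # how R classifies one variable
colSums(is.na(mydata))   # number of missing values in each column
colSums(mydata == 0)     # number of zeroes in each column
```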

## 4.3 Cleaning and Prepping Data

**Removing a variable**

`mydata$var <- NULL # Base R approach`

**Keeping select variables** (consider assigning the variables to a new dataset, as below)

`mydata2 <- dplyr::select(mydata, var1, var2, var3) # tidyverse approach`
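
Here is a small sketch of removing and keeping variables that runs without any packages. It uses the built-in `mtcars` data, and the column names are simply the ones that data set happens to have. Keeping columns by name with `[ , ]` is a base R alternative to the select approach, not a replacement for it.

```r
mydata <- mtcars                             # built-in data frame with 11 variables

mydata$carb <- NULL                          # remove one variable
ncol(mydata)                                 # one fewer column than before

mydata2 <- mydata[, c("mpg", "cyl", "wt")]   # keep three variables in a new data set
names(mydata2)
```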

**Cloning a variable or even a data set**

```
mydata$newvar <- mydata$oldvar # Base R approach
mydata <- dplyr::mutate(mydata, clonevar = oldvar) # tidyverse approach
clone_dataset <- original_dataset # I often use this to simplify the name of my data set
```

**Generating a variable based on other variable(s) (examples)**

```
mydata$newvar <- mydata$oldvar1 / mydata$oldvar2 # Base R approach
mydata$newvar <- mydata$oldvar / 1000 # Base R approach
mydata <- dplyr::mutate(mydata, newvar = var1/var2) # tidyverse approach
```
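
A quick worked version of the same idea, sketched on the built-in `mtcars` data (my choice for illustration; the new variable names are made up). The last line is a reminder that R does not stop on division by zero, so it is worth checking new ratio variables for `Inf`.

```r
mydata <- mtcars

# Ratio of two existing variables
mydata$hp_per_wt <- mydata$hp / mydata$wt

# Rescale one variable by a constant
mydata$disp_k <- mydata$disp / 1000

summary(mydata$hp_per_wt)

# Division by zero returns Inf rather than an error
any(is.infinite(mydata$hp_per_wt))
```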

**Renaming a Variable (using the reshape or dplyr package)**

```
mydata <- reshape::rename(mydata, c("oldname" = "newname"))
mydata <- dplyr::rename(mydata, newname = oldname) # tidyverse approach
```
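
If you would rather not load a package just to rename one variable, base R can do it too. This is a sketch on the built-in `mtcars` data; the new name is hypothetical.

```r
mydata <- mtcars

# Find the position of the old name and overwrite it with the new one
names(mydata)[names(mydata) == "mpg"] <- "miles_per_gallon"

"miles_per_gallon" %in% names(mydata)   # TRUE
"mpg" %in% names(mydata)                # FALSE
```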

**Recoding a categorical variable (using plyr, car, or forcats packages)**

```
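# revalue() expects a factor or character vector; quote the old and new values
mydata$newvar <- plyr::revalue(mydata$origvar, c("1" = "2", "2" = "2", "3" = "1"))
mydata$newvar <- plyr::revalue(mydata$origvar, c("1 label" = "# label", "2 label" = "# label"))
# car::recode() takes one string of old=new pairs separated by semicolons
mydata$newvar <- car::recode(mydata$origvar, "old=new; old=new; old=new")
# forcats::fct_recode() takes "new" = "old" pairs, the reverse of revalue() (forcats must be installed)
mydata$newvar <- forcats::fct_recode(mydata$origvar, "new label" = "old label")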
```
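
Whichever function you use, it is worth checking the result. The sketch below recodes a toy factor with a base R lookup vector (a different technique from the package functions above, chosen so the example runs with nothing installed) and then cross-tabulates old against new. The values and labels are hypothetical.

```r
origvar <- factor(c("1", "2", "3", "3", "1", "2"))

# Named lookup vector: names are the old values, elements are the new values
lookup <- c("1" = "low", "2" = "low", "3" = "high")

# Index the lookup by the old value (as character), then drop the carried-over names
newvar <- factor(unname(lookup[as.character(origvar)]))

# Each old value should land in exactly one new category
table(origvar, newvar)
```

The same `table(old, new)` check works after `revalue()`, `recode()`, or `fct_recode()`.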

**Reverse Coding**

```
summary(mydata$var) # Check the minimum and maximum values first
mydata$var_r <- (minvalue + maxvalue) - mydata$var # e.g., 6 - var for a 1-to-5 scale
```
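
Here is a sketch of reverse coding on a hypothetical 1-to-5 item. Subtracting each response from (minimum + maximum) keeps the reversed item on the same 1-to-5 range; the response values are made up.

```r
item <- c(1, 2, 3, 4, 5, 5, 2)   # hypothetical responses on a 1-to-5 scale
minvalue <- 1
maxvalue <- 5

item_r <- (minvalue + maxvalue) - item
item_r                           # 5 4 3 2 1 1 4

# Cross-tabulate to confirm 1 <-> 5, 2 <-> 4, and 3 stays 3
table(item, item_r)
```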

**Z-scoring continuous variables**

`mydata$zvar <- as.numeric(scale(mydata$var)) # scale() returns a one-column matrix; as.numeric() stores a plain vector`
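
To see what the z-score is doing, this sketch compares `scale()` with the formula written out by hand: subtract the mean, then divide by the standard deviation. It uses `mpg` from the built-in `mtcars` data purely for illustration.

```r
x <- mtcars$mpg

z <- as.numeric(scale(x))          # as.numeric() drops the matrix wrapper
z_manual <- (x - mean(x)) / sd(x)  # the same calculation written out

all.equal(z, z_manual)             # TRUE
mean(z)                            # zero, apart from floating-point rounding
sd(z)                              # 1
```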

**Adding Value Labels**

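R stores value labels as the levels of a factor, so a categorical variable must be a factor before it can carry labels. The first command below only converts the variable; the second converts it and attaches labels in one step, so it can be run on its own.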

```
mydata$catvar <- factor(mydata$catvar)
# In this generic code, I use values 1 through 4. Your values may differ; adjust the code accordingly.
mydata$catvar <- factor(mydata$catvar, levels = c(1,2,3,4), labels = c("label1", "label2", "label3", "label4"))
```
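
A small runnable sketch of the labeling step, with made-up values and labels. The last lines show something to watch for: any value that is not listed in `levels` becomes `NA`.

```r
catvar <- c(1, 2, 3, 4, 2, 1, 4)   # hypothetical codes

catvar_f <- factor(catvar,
                   levels = c(1, 2, 3, 4),
                   labels = c("never", "rarely", "often", "always"))

levels(catvar_f)
table(catvar, catvar_f)            # confirms which code received which label

# A code outside the listed levels (here, 5) is turned into NA
extra <- factor(c(1, 5), levels = c(1, 2, 3, 4),
                labels = c("never", "rarely", "often", "always"))
is.na(extra)                       # FALSE TRUE
```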

## 4.4 Subsetting Data and Imposing Conditions

**Generating Subset Datasets**

You may want to consult this blog post on approaches to subsetting a data frame in R.

If you want to generate a new dataset that is a subset of an existing one, this may be the most straightforward approach. It makes use of dplyr’s select() function.

`mydata_subset <- dplyr::select(mydata, var1, var2, var3, var4)`

You can also include an option to restrict the new dataset based on other conditions by making use of dplyr’s filter() function. The following command replicates the above selection but specifies that the new dataset should include only those observations that have a value of 2 on variable 4.

```
mydata_subset <- mydata %>%
  filter(var4 == 2) %>%
  select(var1, var2, var3, var4)
```
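
If you want to try the filter-then-select pattern without installing anything, here is a sketch of the same idea using base R's `subset()` on the built-in `mtcars` data. This is a base R stand-in for the dplyr pipeline, and the variables are simply ones that `mtcars` contains.

```r
# Keep only 4-cylinder cars, and only four of the columns
mydata_subset <- subset(mtcars, cyl == 4, select = c(mpg, cyl, hp, wt))

dim(mydata_subset)             # rows kept, columns kept
unique(mydata_subset$cyl)      # only the value 4 remains
```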

**Embedding Conditions within Commands**

If you simply want to run an analysis on part of the sample without first generating another dataset, you can embed the condition(s) within the command. The following commands show a regular ANOVA, then ANOVA and regression commands that restrict the sample based on conditions tied to other variables, and finally two pipelines that filter or select before plotting. The two pipelines require the ggplot2 and corrmorant packages, which are not among those loaded in Section 4.1. Note that < and > do not work with unordered factor variables when subsetting (R returns NA with a warning), so use == for those. When a condition mixes | and &, wrap the | part in ( ), because R evaluates & before |.

```
object <- aov(intvar ~ catvar, data = mydata) # Full sample
object <- aov(intvar ~ catvar, data = subset(mydata, var1 < 3 & var2 == 2))
object <- aov(intvar ~ catvar, data = subset(mydata, (var1 == 1 | var1 == 2) & var2 < 40))
object <- lm(dv ~ iv1 + iv2, data = dplyr::filter(mydata, var3 < 10 & var4 == 3))
object <- mydata %>%
  filter(var3 < 10 & var4 == 3) %>%
  ggplot(aes(x = xvar, y = yvar)) +
  geom_point()
object <- mydata %>%
  select(var1, var2, var3) %>%
  corrmorant()
```
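
The two cautions about factors and parentheses are easier to see in a runnable sketch. This one uses base R and the built-in `mtcars` data; the toy factor and the particular cutoffs are my own choices for illustration.

```r
# 1. < and > on an unordered factor return NA (with a warning)
grp <- factor(c("1", "2", "3"))
suppressWarnings(grp < "2")            # NA NA NA
as.numeric(as.character(grp)) < 2      # convert first if the codes are really numbers

# 2. & is evaluated before |, so parentheses change which rows are kept
with_parens <- subset(mtcars, (gear == 3 | gear == 4) & hp < 150)
no_parens   <- subset(mtcars, gear == 3 | gear == 4 & hp < 150)

nrow(with_parens)
nrow(no_parens)                        # larger: every gear == 3 car is kept
all(with_parens$hp < 150)              # TRUE
all(no_parens$hp < 150)                # FALSE

# The restricted sample can go straight into the analysis command
fit <- aov(mpg ~ factor(cyl), data = with_parens)
summary(fit)
```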