Chapter 3 Manage variables within a dataset

Once you have imported a dataset and done some checking for data anomalies, the next step is to get your data ready for analyses. In this chapter, you will learn how to:

  • Clean up data anomalies,
  • Code categorical variables as factor variables,
  • Collapse categorical variables into fewer levels,
  • Transform continuous variables,
  • Bin continuous variables,
  • Set values to missing, and
  • Rename variables.

This chapter covers managing variables within a dataset; the next chapter covers managing datasets.

NOTE: It is best practice to keep a copy of the original dataset and/or the original variables so you can compare your data before vs. after manipulation and verify that any differences are as you intended. It is very easy to make mistakes, so ALWAYS, ALWAYS, ALWAYS check any derivations by comparing before to after. If your final dataset will have the same number of observations as the original dataset, with rows in the same order, then you can use a copy of the original dataset for comparison. If not, then within the new dataset create a copy of each variable you are going to change. In either case, after any data manipulation, compare the original variable to the new variable to verify that all changes are as expected. In particular, watch for the unintentional creation of missing values.
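As a minimal sketch of this before vs. after check, the following uses a small made-up data frame rather than the chapter's dataset. The names dat, sex_code, and sex_code_orig, and the assumption that a code of 9 means "unknown", are hypothetical and only for illustration.

```r
# Hypothetical toy data; names and codes are illustrative only
dat <- data.frame(sex_code = c(1, 2, 2, 1, 9, NA))

# Keep a copy of the original variable before changing it
dat$sex_code_orig <- dat$sex_code

# Example derivation: assume code 9 means "unknown" and set it to missing
# (%in% returns FALSE for NA, so existing missing values are left alone)
dat$sex_code[dat$sex_code %in% 9] <- NA

# Cross-tabulate original vs. new, showing missing values if present
check <- table(orig = dat$sex_code_orig, new = dat$sex_code, useNA = "ifany")
print(check)

# Count missing values before and after the change
n_missing_before <- sum(is.na(dat$sex_code_orig))
n_missing_after  <- sum(is.na(dat$sex_code))
```

In the cross-tabulation, every original value should land only in the new category you intended. Here the count of missing values rises from 1 to 2, which matches the single value of 9 that was deliberately set to missing.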

For this chapter, we will use a dataset based on the Rheumatoid Arthritis study data but modified to illustrate various points in this chapter. See Section 1.20.1 for more information about the original dataset.

load("Data/RheumArth-Chapter3.RData")

Using ls(), you can see that the dataset was named mydat when it was saved to the file RheumArth-Chapter3.RData.

ls()
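If your workspace already contains other objects, ls() will list those as well. One alternative, sketched here with a temporary file and a made-up object name (mydat_demo), is to capture the value returned by load(), which is a character vector of the names of the objects it created. Loading into a separate environment, as assumed here, also avoids overwriting anything in your workspace.

```r
# Hypothetical toy object saved to a temporary .RData file
mydat_demo <- data.frame(id = 1:3)
tmp <- tempfile(fileext = ".RData")
save(mydat_demo, file = tmp)
rm(mydat_demo)

# Load into a fresh environment and capture the names of the loaded objects
env <- new.env()
loaded_names <- load(tmp, envir = env)
print(loaded_names)
## [1] "mydat_demo"

# The loaded object is available inside that environment
head(env$mydat_demo)
```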

Create a tibble version for the tidyverse code we will be using.

library(tidyverse)
mydat_tibble <- as_tibble(mydat)

While mydat is of class data.frame, mydat_tibble has three classes.

class(mydat)
## [1] "data.frame"
class(mydat_tibble)
## [1] "tbl_df"     "tbl"        "data.frame"
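Because "data.frame" is among its classes, a tibble can generally be used wherever a data.frame is expected. The following is a small sketch using a made-up data frame (toy_df is a hypothetical name) and assuming the tibble package is installed. It checks class membership and converts back to a plain data.frame.

```r
library(tibble)

# Hypothetical toy data, not the chapter's dataset
toy_df  <- data.frame(id = 1:3, group = c("a", "b", "a"))
toy_tbl <- as_tibble(toy_df)

# A tibble inherits from data.frame
is_df  <- inherits(toy_tbl, "data.frame")
is_tbl <- is_tibble(toy_tbl)

# Convert back to a plain data.frame if a function requires one
back_to_df <- as.data.frame(toy_tbl)
class(back_to_df)
## [1] "data.frame"
```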