4 Variables

Because you can have many objects open simultaneously in R, whenever you refer to a variable you need to first specify the dataset that contains it.

Variables are referred to as dataset$variable or dataset[["variable"]], e.g.

flights$year
flights[["year"]]
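As a quick check, here is a minimal sketch using a small made-up data frame (toy is a hypothetical stand-in for flights, and its values are invented). It suggests that the two forms return the same vector, and that the double-bracket form also accepts a variable name stored in another object.

```r
# Hypothetical mini dataset standing in for flights (values are made up)
toy <- data.frame(year = c(2013, 2013, 2013),
                  dep_delay = c(5, -2, NA))

# Both forms pull out the same column as a plain vector
by_dollar  <- toy$year
by_bracket <- toy[["year"]]
identical(by_dollar, by_bracket)

# The [[ ]] form also works when the name is stored in an object
col <- "year"
by_name <- toy[[col]]
```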

4.1 Variable types

There are different types of variables in R. Variable type is automatically assigned when the data are read in, but can also be changed manually.

int = integers
dbl = real numbers
chr = character vectors/strings
dttm = date-time
lgl = logical
fct = factor (categorical with fixed possible values)
date = date
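To make "changed manually" concrete, here is a small sketch in base R. The vector x and its values are made up for illustration; class() reports the current type and the as.*() family (plus factor()) converts between types.

```r
# Hypothetical example vector; the values are made up
x <- c(1L, 2L, 3L)
class(x)                  # "integer"

x_chr <- as.character(x)  # convert to character strings
class(x_chr)              # "character"

x_fct <- factor(x_chr)    # convert to a factor with fixed levels
levels(x_fct)

x_dbl <- as.numeric(x)    # convert to a real number (double)
class(x_dbl)              # "numeric"
```

The same conversions can be applied to a variable inside a dataset, e.g. by assigning the converted values to a new variable.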

4.2 Variable recoding

Sometimes we want to change a variable from one type to another, or change how information is stored in the variable. This is called variable recoding! As an example of variable recoding, let’s create a new variable within the flights dataset that marks whether a flight was delayed. We’ll work with an existing variable called “dep_delay”, which indicates the length of departure delays, in minutes.

4.2.1 A brief pause to talk about “piping”

The tidyverse approach is designed to let you string code together into one long piece of code made up of many connected steps. This is called a “pipeline.” Steps are connected with the pipe operator, “%>%”, which passes the result of one step to the next step as its first input. I will comment on this notation throughout.
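A minimal sketch of what the pipe does, assuming the dplyr package is installed and loaded (the vector x is made up). Writing x %>% f() is another way of writing f(x), so a chain of pipes reads left to right instead of inside out.

```r
library(dplyr)  # makes the %>% operator available

x <- c(1, 4, 9)  # made-up values

# Nested version: read from the inside out
nested <- sum(sqrt(x))

# Piped version: read from left to right
piped <- x %>%
  sqrt() %>%
  sum()

identical(nested, piped)
```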

4.2.2 Variable recoding in the tidyverse approach: mutate

In all my coding, I almost never change the variables that come with a dataset. Instead, I create new, recoded variables. To create a new variable, we’ll use the “mutate” function. Unlike the base R approach (shown below) we can create our new variable in one line of code.

flights <- flights %>% 
  mutate(delayed = ifelse(dep_delay > 0, 1, 0))

Let’s break this down. Skip “flights <-” for now. I begin with my dataset flights and then I add a pipe operator, which passes flights to the next function as its first input. In effect, this tells R I’m working within the dataset flights. This gets around the need to specify data$variable each time I mention a variable in flights. By piping the dataset “flights” into the other commands, I’m specifying that any code should be applied to the dataset flights.

I like to move to the next line at this point, to keep my code organized, but you don’t have to. Now I draw on the mutate function. I start with the new variable name; I’m calling it “delayed”. Then I have to set values for this new variable. I’m using the ifelse function. For info see:

?ifelse

As the documentation page explains, the input for ifelse is ifelse(test, yes, no). I first test whether dep_delay is greater than 0; if yes, I assign a value of 1 for delayed. If no, I assign a value of 0.
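Here is a small sketch of ifelse() on its own, using a made-up vector of delays rather than the real flights data. Note what happens to the missing value: when the test is NA, the result is NA as well.

```r
# Made-up departure delays in minutes, including one missing value
dep_delay <- c(12, -3, 0, NA)

# test: dep_delay > 0; yes: 1; no: 0
delayed <- ifelse(dep_delay > 0, 1, 0)
delayed
```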

Finally, I have to make sure I’m saving my changes to my data object “flights”, so I put “flights <-” in front of everything. If I were saving the result as a different dataset, I would change the name before the arrow. If I skip this step, R will not save the changes; it will only display the result in the console.

Note that we can create multiple variables within the same mutate function.

flights <- flights %>%
  mutate(delayed = ifelse(dep_delay > 0, 1, 0),
         long.distance = ifelse(distance > 1000, 1, 0))

The variable long.distance takes on a value of 1 if the flight distance was greater than 1000 miles, and a 0 otherwise. I’ve separated the two new variables with a comma.

Both of these new variables use the function ifelse, which is super helpful, especially for dummy variable construction. Within mutate, you can also use any number of other functions. You can add (+), subtract (-), multiply (*) and divide (/) existing variables. You can calculate row means (rowMeans()) or row sums (rowSums()), or just set the whole variable equal to one value. A more advanced version of ifelse() is case_when().
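As a sketch of case_when(), here is one way to build a variable with more than two categories. The toy data frame, the variable name delay.cat, and the 60-minute cutoff are all assumptions made up for illustration; dplyr is assumed to be installed. Conditions are checked in order, and the first one that is TRUE determines the value.

```r
library(dplyr)

# Hypothetical mini dataset (values are made up)
toy <- data.frame(dep_delay = c(-5, 0, 20, 90, NA))

toy <- toy %>%
  mutate(delay.cat = case_when(
    is.na(dep_delay) ~ NA_character_,  # keep missing values missing
    dep_delay <= 0   ~ "on time",
    dep_delay <= 60  ~ "short delay",  # 60 minutes is an arbitrary cutoff
    TRUE             ~ "long delay"    # everything else
  ))

toy
```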

R’s tidyverse approach is an efficient way to create and recode variables. I would encourage you to explore on your own as you work with your own data. For example, try mutate_at() to manipulate (multiple) existing variables in the same way.
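Here is a hedged sketch of mutate_at(), converting two delay variables from minutes to hours in one step. The toy data frame is made up and dplyr is assumed to be installed. In dplyr 1.0.0 and later, mutate_at() is superseded by across() used inside mutate(), so I show both forms side by side; they should give the same result.

```r
library(dplyr)

# Hypothetical mini dataset (values are made up)
toy <- data.frame(dep_delay = c(30, 120),
                  arr_delay = c(60, 90),
                  distance  = c(500, 1200))

# mutate_at(): apply the same function to the variables listed in vars()
toy_at <- toy %>%
  mutate_at(vars(dep_delay, arr_delay), ~ . / 60)

# across(): the same idea in newer versions of dplyr
toy_across <- toy %>%
  mutate(across(c(dep_delay, arr_delay), ~ .x / 60))

all.equal(toy_at, toy_across)
```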

4.2.3 Old school variable recoding in base R

For people who are familiar with Stata, sometimes base R is a more straightforward transition. And, sometimes I like to do things old school. So here’s how to create the same variable in base R.

First, create a blank variable in flights called “delayed” by giving every observation a 0. This is equivalent to Stata’s “generate” command.

The syntax reads, in English: the variable “delayed” in the dataset “flights” is equal to (<-) repeated values of 0 for the number of rows in the dataset flights.

flights$delayed <- rep(0, nrow(flights))

Then, give delayed flights (or flights that have a departure delay of greater than 0 minutes) a value of 1 on the new variable delayed. This is equivalent to Stata’s “replace” command.

R uses hard brackets [ ] to denote data subsets. You should read them as “where” or “if”.

The syntax reads: the variable “delayed” in the dataset “flights” WHERE the variable “dep_delay” in “flights” is greater than 0, should be set equal to (<-) 1.

flights$delayed[flights$dep_delay > 0] <- 1

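Finally, we need to accommodate missing data. The ifelse() call in the tidyverse example handles this automatically, because it returns NA wherever dep_delay is missing, but the step-by-step base R approach does not: flights with a missing dep_delay still have the starting value of 0. R holds missing data as “NA” and has a specific function to test whether a value is missing: is.na().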

Note: sometimes people will try to use a logical statement such as “== NA”. This won’t work, because any comparison with NA returns NA rather than TRUE or FALSE. You have to use is.na().
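A tiny demonstration of the difference, using a made-up vector:

```r
x <- c(3, NA, 7)  # made-up values with one missing

eq_result <- x == NA   # every element comes back as NA
na_result <- is.na(x)  # TRUE only where the value is missing

eq_result
na_result
```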

The syntax reads: the variable “delayed” in the dataset “flights” WHERE the variable “dep_delay” in “flights” is missing, is equal to (<-) NA (missing).

flights$delayed[is.na(flights$dep_delay)] <- NA
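To see the three steps work together, here is a sketch that runs them on a made-up data frame (toy is hypothetical, not the real flights data) and compares the result with the one-line ifelse() version.

```r
# Hypothetical mini dataset (values are made up)
toy <- data.frame(dep_delay = c(12, -3, 0, NA))

# Step 1: start everyone at 0
toy$delayed <- rep(0, nrow(toy))

# Step 2: set delayed to 1 WHERE dep_delay is greater than 0
toy$delayed[toy$dep_delay > 0] <- 1

# Step 3: set delayed to NA WHERE dep_delay is missing
toy$delayed[is.na(toy$dep_delay)] <- NA

# The one-line version for comparison
one.step <- ifelse(toy$dep_delay > 0, 1, 0)
all.equal(toy$delayed, one.step)
```

If you leave out step 3, the flight with the missing delay keeps its starting value of 0, which would wrongly mark it as not delayed.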

4.3 Variable renaming

We can easily rename existing variables using the rename() function within the tidyverse. For example, let’s rename the variable “dest” to “destination”.

flights <- flights %>%
  rename(destination = dest) # the new name goes first

4.3.1 Renaming multiple variables at once

You can also rename several variables at the same time with rename_at(). This is challenging, so feel free to skip. Let’s look at a mini dataset I’m calling children.

children

##      V1 V2    V3
## 1    al  6  blue
## 2   bea  7 green
## 3 carol  4   red

Right now, the variables are named V1, V2, and V3. That’s not super informative. They should actually be called “name”, “age”, and “fav.color”.

var.list <- c("V1", "V2", "V3") #this is a vector of the existing variables
names <- c("name", "age", "fav.color") #this is a vector, of the same length, with the new variable names
children <- children %>%
  rename_at(vars(all_of(var.list)), ~ names)

We begin by saving over and piping in the children dataset. Then we use rename_at(). vars(all_of(var.list)) tells R to focus on the variables that are in the list we created, and “~ names” means apply the values in names to those variables. If we look at the dataset again:

children

##    name age fav.color
## 1    al   6      blue
## 2   bea   7     green
## 3 carol   4       red
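If rename_at() feels like too much, a simpler route to the same result is to give rename() several new = old pairs at once. This is a sketch that rebuilds the children dataset by hand from the printout above (the column types are my assumption) and assumes dplyr is installed.

```r
library(dplyr)

# Rebuild the mini dataset from the printout (column types are assumed)
children <- data.frame(V1 = c("al", "bea", "carol"),
                       V2 = c(6, 7, 4),
                       V3 = c("blue", "green", "red"))

# Several renames in one call; the new name goes first each time
children <- children %>%
  rename(name = V1, age = V2, fav.color = V3)

names(children)
```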

Note: mutate_at() works similarly to rename_at().