5 Data Manipulation

Once our data are clean and we’ve created or recoded any necessary variables, we’re ready to start working with the data.

5.1 Subsetting

To create or look at subsets of your data, you can use the filter() function to select observations based on their values. Within the filter command, we use logical statements. Here is a short list of logical notation used in R.

== # equals
> >= # greater than, greater than or equal to
< <= # less than, less than or equal to
& # and. can also be denoted with a comma between statements
| # or
! # not
%in% # tests whether value falls within a range, specified with a vector
between() # another helpful function

For example, let’s look at flights that occurred on January 1st.

flights %>% filter(month==1,day==1) 
## # A tibble: 842 x 21
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl> <chr>  
##  1  2013     1     1      517            515         2      830            819        11 UA     
##  2  2013     1     1      533            529         4      850            830        20 UA     
##  3  2013     1     1      542            540         2      923            850        33 AA     
##  4  2013     1     1      544            545        -1     1004           1022       -18 B6     
##  5  2013     1     1      554            600        -6      812            837       -25 DL     
##  6  2013     1     1      554            558        -4      740            728        12 UA     
##  7  2013     1     1      555            600        -5      913            854        19 B6     
##  8  2013     1     1      557            600        -3      709            723       -14 EV     
##  9  2013     1     1      557            600        -3      838            846        -8 B6     
## 10  2013     1     1      558            600        -2      753            745         8 AA     
## # … with 832 more rows, and 11 more variables: flight <int>, tailnum <chr>, origin <chr>,
## #   destination <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
## #   time_hour <dttm>, delayed <dbl>, long.distance <dbl>

If I wanted to save this as its own dataset, I could add use the “<-” notation introduced in the last section.

jan1 <- flights %>% filter(month==1,day==1) 

Although slightly different from subsetting, you can also select certain variables using the select() function. Here’s an example, in which I’m also using pipeline notation to first filter my data and then select certain variables.

flights %>% filter(month == 1, day == 1) %>%
  select(tailnum, dep_time, arr_time)
## # A tibble: 842 x 3
##    tailnum dep_time arr_time
##    <chr>      <int>    <int>
##  1 N14228       517      830
##  2 N24211       533      850
##  3 N619AA       542      923
##  4 N804JB       544     1004
##  5 N668DN       554      812
##  6 N39463       554      740
##  7 N516JB       555      913
##  8 N829AS       557      709
##  9 N593JB       557      838
## 10 N3ALAA       558      753
## # … with 832 more rows

5.2 Merging

Oppositely from subsetting, we can merge multiple datasets together, as long as they have at least one variable in common. The flights dataset we’ve been using is connected to several other datasets: airlines, airports, planes, and weather. I’m going to use the airlines dataset below.

airlines
## # A tibble: 16 x 2
##    carrier name                       
##    <chr>   <chr>                      
##  1 9E      Endeavor Air Inc.          
##  2 AA      American Airlines Inc.     
##  3 AS      Alaska Airlines Inc.       
##  4 B6      JetBlue Airways            
##  5 DL      Delta Air Lines Inc.       
##  6 EV      ExpressJet Airlines Inc.   
##  7 F9      Frontier Airlines Inc.     
##  8 FL      AirTran Airways Corporation
##  9 HA      Hawaiian Airlines Inc.     
## 10 MQ      Envoy Air                  
## 11 OO      SkyWest Airlines Inc.      
## 12 UA      United Air Lines Inc.      
## 13 US      US Airways Inc.            
## 14 VX      Virgin America             
## 15 WN      Southwest Airlines Co.     
## 16 YV      Mesa Airlines Inc.

5.2.1 left_join()

One option for merging datasets is to preserve all of your original data, or keep all observations in your original dataset, and then add on more variables where there’s a match. We do this with the function left_join(). You need two pieces of information: the new dataset you are merging with your original, and the variable that’s shared between the two.

# first I'm going to make a smaller version of the flights data
flights2 <- flights %>% select(year:day, hour, tailnum,carrier)
# now I'm going to merge in the airlines dataset
flights2 %>% left_join(airlines, by = "carrier")
## # A tibble: 336,776 x 7
##     year month   day  hour tailnum carrier name                    
##    <int> <int> <int> <dbl> <chr>   <chr>   <chr>                   
##  1  2013     1     1     5 N14228  UA      United Air Lines Inc.   
##  2  2013     1     1     5 N24211  UA      United Air Lines Inc.   
##  3  2013     1     1     5 N619AA  AA      American Airlines Inc.  
##  4  2013     1     1     5 N804JB  B6      JetBlue Airways         
##  5  2013     1     1     6 N668DN  DL      Delta Air Lines Inc.    
##  6  2013     1     1     5 N39463  UA      United Air Lines Inc.   
##  7  2013     1     1     6 N516JB  B6      JetBlue Airways         
##  8  2013     1     1     6 N829AS  EV      ExpressJet Airlines Inc.
##  9  2013     1     1     6 N593JB  B6      JetBlue Airways         
## 10  2013     1     1     6 N3ALAA  AA      American Airlines Inc.  
## # … with 336,766 more rows

5.2.2 Other join commands

left_join() preserve all the original data. There are other join functions as well. For example, right_join() preserves what you’e adding while full_join() preserves both and fills in NAs where any cell is missing information.

5.2.3 merge

The join commands are built into R’s tidyverse. Another option is to use the merge() command, which works similarly.

flights2 %>% merge(airlines, by = "carrier")

Note that since the merge command is in base R and not the tidyverse, if you were to save the results using “<-”, they’d be saved as a data frame and not a tibble.

5.3 Hierarchical data structures

5.3.1 Going from wide to long

To convert wide data into long format, we use the function pivot_longer(). For clarity, let’s see a small example.

table4a
## # A tibble: 3 x 3
##   country     `1999` `2000`
## * <chr>        <int>  <int>
## 1 Afghanistan    745   2666
## 2 Brazil       37737  80488
## 3 China       212258 213766

I’m now going to use the pivot_longer() function to convert the columns 1999 and 2000 into values of a new variable, year. The values in the variables 1999 and 2000 will be held in a new variable, cases.

tiny4a <- table4a %>% pivot_longer(c('1999','2000'), names_to = "year", values_to = "cases")
tiny4a
## # A tibble: 6 x 3
##   country     year   cases
##   <chr>       <chr>  <int>
## 1 Afghanistan 1999     745
## 2 Afghanistan 2000    2666
## 3 Brazil      1999   37737
## 4 Brazil      2000   80488
## 5 China       1999  212258
## 6 China       2000  213766

Walking through this notation, I’m focusing on the variables 1999 and 2000 within table4a and I’m converting the column names to a new variable called year and moving the values to a new variable called cases.

5.3.2 Going from long to wide

If we have long data but want to convert it to wide format, we can use pivot_wider(), which follows nearly the same syntax. Let’s start with a long format dataset.

table2
## # A tibble: 12 x 4
##    country      year type            count
##    <chr>       <int> <chr>           <int>
##  1 Afghanistan  1999 cases             745
##  2 Afghanistan  1999 population   19987071
##  3 Afghanistan  2000 cases            2666
##  4 Afghanistan  2000 population   20595360
##  5 Brazil       1999 cases           37737
##  6 Brazil       1999 population  172006362
##  7 Brazil       2000 cases           80488
##  8 Brazil       2000 population  174504898
##  9 China        1999 cases          212258
## 10 China        1999 population 1272915272
## 11 China        2000 cases          213766
## 12 China        2000 population 1280428583

I’m going to create variables for cases and population. In the syntax below, I pull new variable names from the original variable “type” and then fill in the values from the original variable “count”.

tiny2 <- table2 %>% pivot_wider(names_from = "type", values_from = "count")
tiny2
## # A tibble: 6 x 4
##   country      year  cases population
##   <chr>       <int>  <int>      <int>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3 Brazil       1999  37737  172006362
## 4 Brazil       2000  80488  174504898
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583