5 Data Manipulation
Once our data are clean and we’ve created or recoded any necessary variables, we’re ready to start working with the data.
5.1 Subsetting
To create or look at subsets of your data, you can use the filter() function to select observations based on their values. Within the filter command, we use logical statements. Here is a short list of logical notation used in R.
== # equals
> >= # greater than, greater than or equal to
< <= # less than, less than or equal to
& # and. can also be denoted with a comma between statements
| # or
! # not
%in% # tests whether value falls within a range, specified with a vector
between() # another helpful function
For example, let’s look at flights that occurred on January 1st.
## # A tibble: 842 x 21
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
## 1 2013 1 1 517 515 2 830 819 11 UA
## 2 2013 1 1 533 529 4 850 830 20 UA
## 3 2013 1 1 542 540 2 923 850 33 AA
## 4 2013 1 1 544 545 -1 1004 1022 -18 B6
## 5 2013 1 1 554 600 -6 812 837 -25 DL
## 6 2013 1 1 554 558 -4 740 728 12 UA
## 7 2013 1 1 555 600 -5 913 854 19 B6
## 8 2013 1 1 557 600 -3 709 723 -14 EV
## 9 2013 1 1 557 600 -3 838 846 -8 B6
## 10 2013 1 1 558 600 -2 753 745 8 AA
## # … with 832 more rows, and 11 more variables: flight <int>, tailnum <chr>, origin <chr>,
## # destination <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
## # time_hour <dttm>, delayed <dbl>, long.distance <dbl>
If I wanted to save this as its own dataset, I could add use the “<-” notation introduced in the last section.
Although slightly different from subsetting, you can also select certain variables using the select() function. Here’s an example, in which I’m also using pipeline notation to first filter my data and then select certain variables.
## # A tibble: 842 x 3
## tailnum dep_time arr_time
## <chr> <int> <int>
## 1 N14228 517 830
## 2 N24211 533 850
## 3 N619AA 542 923
## 4 N804JB 544 1004
## 5 N668DN 554 812
## 6 N39463 554 740
## 7 N516JB 555 913
## 8 N829AS 557 709
## 9 N593JB 557 838
## 10 N3ALAA 558 753
## # … with 832 more rows
5.2 Merging
Oppositely from subsetting, we can merge multiple datasets together, as long as they have at least one variable in common. The flights dataset we’ve been using is connected to several other datasets: airlines, airports, planes, and weather. I’m going to use the airlines dataset below.
## # A tibble: 16 x 2
## carrier name
## <chr> <chr>
## 1 9E Endeavor Air Inc.
## 2 AA American Airlines Inc.
## 3 AS Alaska Airlines Inc.
## 4 B6 JetBlue Airways
## 5 DL Delta Air Lines Inc.
## 6 EV ExpressJet Airlines Inc.
## 7 F9 Frontier Airlines Inc.
## 8 FL AirTran Airways Corporation
## 9 HA Hawaiian Airlines Inc.
## 10 MQ Envoy Air
## 11 OO SkyWest Airlines Inc.
## 12 UA United Air Lines Inc.
## 13 US US Airways Inc.
## 14 VX Virgin America
## 15 WN Southwest Airlines Co.
## 16 YV Mesa Airlines Inc.
5.2.1 left_join()
One option for merging datasets is to preserve all of your original data, or keep all observations in your original dataset, and then add on more variables where there’s a match. We do this with the function left_join(). You need two pieces of information: the new dataset you are merging with your original, and the variable that’s shared between the two.
# first I'm going to make a smaller version of the flights data
flights2 <- flights %>% select(year:day, hour, tailnum,carrier)
# now I'm going to merge in the airlines dataset
flights2 %>% left_join(airlines, by = "carrier")
## # A tibble: 336,776 x 7
## year month day hour tailnum carrier name
## <int> <int> <int> <dbl> <chr> <chr> <chr>
## 1 2013 1 1 5 N14228 UA United Air Lines Inc.
## 2 2013 1 1 5 N24211 UA United Air Lines Inc.
## 3 2013 1 1 5 N619AA AA American Airlines Inc.
## 4 2013 1 1 5 N804JB B6 JetBlue Airways
## 5 2013 1 1 6 N668DN DL Delta Air Lines Inc.
## 6 2013 1 1 5 N39463 UA United Air Lines Inc.
## 7 2013 1 1 6 N516JB B6 JetBlue Airways
## 8 2013 1 1 6 N829AS EV ExpressJet Airlines Inc.
## 9 2013 1 1 6 N593JB B6 JetBlue Airways
## 10 2013 1 1 6 N3ALAA AA American Airlines Inc.
## # … with 336,766 more rows
5.2.2 Other join commands
left_join() preserve all the original data. There are other join functions as well. For example, right_join() preserves what you’e adding while full_join() preserves both and fills in NAs where any cell is missing information.
5.2.3 merge
The join commands are built into R’s tidyverse. Another option is to use the merge() command, which works similarly.
Note that since the merge command is in base R and not the tidyverse, if you were to save the results using “<-”, they’d be saved as a data frame and not a tibble.
5.3 Hierarchical data structures
5.3.1 Going from wide to long
To convert wide data into long format, we use the function pivot_longer(). For clarity, let’s see a small example.
## # A tibble: 3 x 3
## country `1999` `2000`
## * <chr> <int> <int>
## 1 Afghanistan 745 2666
## 2 Brazil 37737 80488
## 3 China 212258 213766
I’m now going to use the pivot_longer() function to convert the columns 1999 and 2000 into values of a new variable, year. The values in the variables 1999 and 2000 will be held in a new variable, cases.
## # A tibble: 6 x 3
## country year cases
## <chr> <chr> <int>
## 1 Afghanistan 1999 745
## 2 Afghanistan 2000 2666
## 3 Brazil 1999 37737
## 4 Brazil 2000 80488
## 5 China 1999 212258
## 6 China 2000 213766
Walking through this notation, I’m focusing on the variables 1999 and 2000 within table4a and I’m converting the column names to a new variable called year and moving the values to a new variable called cases.
5.3.2 Going from long to wide
If we have long data but want to convert it to wide format, we can use pivot_wider(), which follows nearly the same syntax. Let’s start with a long format dataset.
## # A tibble: 12 x 4
## country year type count
## <chr> <int> <chr> <int>
## 1 Afghanistan 1999 cases 745
## 2 Afghanistan 1999 population 19987071
## 3 Afghanistan 2000 cases 2666
## 4 Afghanistan 2000 population 20595360
## 5 Brazil 1999 cases 37737
## 6 Brazil 1999 population 172006362
## 7 Brazil 2000 cases 80488
## 8 Brazil 2000 population 174504898
## 9 China 1999 cases 212258
## 10 China 1999 population 1272915272
## 11 China 2000 cases 213766
## 12 China 2000 population 1280428583
I’m going to create variables for cases and population. In the syntax below, I pull new variable names from the original variable “type” and then fill in the values from the original variable “count”.
## # A tibble: 6 x 4
## country year cases population
## <chr> <int> <int> <int>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583