5.4 Working with columns
5.4.1 Select()
To select columns of a data frame, use select()
. This is a great way to create a smaller data frame with only the variables you are interested in.
select()
is a function. Remember that functions have arguments which go inside the parentheses. The first argument to this function is the name of the data object, which in this case is measles_us
, and the subsequent arguments are the names of the columns we want to keep, separated by commas.
Let’s try selecting the Admin1Name
column and the CountValue
column.
select(measles_us, Admin1Name, CountValue)
## # A tibble: 422,051 x 2
## Admin1Name CountValue
## <chr> <dbl>
## 1 WISCONSIN 85
## 2 WISCONSIN 120
## 3 WISCONSIN 84
## 4 WISCONSIN 106
## 5 WISCONSIN 39
## 6 WISCONSIN 45
## 7 WISCONSIN 28
## 8 WISCONSIN 140
## 9 WISCONSIN 48
## 10 WISCONSIN 85
## # ... with 422,041 more rows
As you can imagine, if you had a number of columns to select, it could get tiresome to write them all out. One way around this is to use a colon :
to name a range of adjacent columns.
select(measles_us, ConditionName:Admin1ISO)
## # A tibble: 422,051 x 4
## ConditionName CountryName Admin1Name Admin1ISO
## <chr> <chr> <chr> <chr>
## 1 Measles UNITED STATES OF AMERICA WISCONSIN US-WI
## 2 Measles UNITED STATES OF AMERICA WISCONSIN US-WI
## 3 Measles UNITED STATES OF AMERICA WISCONSIN US-WI
## 4 Measles UNITED STATES OF AMERICA WISCONSIN US-WI
## 5 Measles UNITED STATES OF AMERICA WISCONSIN US-WI
## 6 Measles UNITED STATES OF AMERICA WISCONSIN US-WI
## 7 Measles UNITED STATES OF AMERICA WISCONSIN US-WI
## 8 Measles UNITED STATES OF AMERICA WISCONSIN US-WI
## 9 Measles UNITED STATES OF AMERICA WISCONSIN US-WI
## 10 Measles UNITED STATES OF AMERICA WISCONSIN US-WI
## # ... with 422,041 more rows
Another helpful tool that the tidyverse
provides is the pipe
operator which looks like %>%
. The pipe is made available via the magrittr
package, installed automatically with dplyr
. If you use RStudio, you can type the pipe with Ctrl + Shift + M if you have a PC or Cmd + Shift + M if you have a Mac.
With the pipe you start with your data object and pipe it to the function, rather than naming the data as your first argument. So, the pipe becomes especially valuable when you have a number of steps that you want to connect. Another benefit of using the pipe in RStudio is that the interface will supply column names to you in the auto complete. This helps so you do not need to remember sometimes lengthy column names, and you are less likely to get an error from a typo.
%>%
measles_us select(Admin1Name, PartOfCumulativeCountSeries)
## # A tibble: 422,051 x 2
## Admin1Name PartOfCumulativeCountSeries
## <chr> <dbl>
## 1 WISCONSIN 0
## 2 WISCONSIN 0
## 3 WISCONSIN 0
## 4 WISCONSIN 0
## 5 WISCONSIN 0
## 6 WISCONSIN 0
## 7 WISCONSIN 0
## 8 WISCONSIN 0
## 9 WISCONSIN 0
## 10 WISCONSIN 0
## # ... with 422,041 more rows
Now unfortunately, we did not save this to an object, so while it prints out to the console, we cannot really do anything with these results. So now, let’s think through which columns we want for our analysis and save this to a new object called measles_us_mod
.
So if we look at our data, we see that the columns ConditionName
, CountryName
, and SourceName
have the same values throughout the data set. So it seems redundant to keep those. Admin1ISO
has redundant information to Admin1Name
, so we can probably get rid of that as well. For this particular analysis we are going to take a state level view of measles, so we do not need Admin2Name
and CityName
either. That leaves us with Admin1Name
, PeriodStartDate
, PeriodEndDate
, PartOfCumulativeCountSeries
, and CountValue
.
<-
measles_us_mod %>%
measles_us select(
Admin1Name,
PeriodStartDate,
PeriodEndDate,
PartOfCumulativeCountSeries,
CountValue
)
#inspect our new data frame
glimpse(measles_us_mod)
## Rows: 422,051
## Columns: 5
## $ Admin1Name <chr> "WISCONSIN", "WISCONSIN", "WISCONSIN", "WI~
## $ PeriodStartDate <date> 1927-11-20, 1927-11-27, 1927-12-04, 1927-~
## $ PeriodEndDate <date> 1927-11-26, 1927-12-03, 1927-12-10, 1927-~
## $ PartOfCumulativeCountSeries <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ CountValue <dbl> 85, 120, 84, 106, 39, 45, 28, 140, 48, 85,~
Now when you look in your environment pane, you should see your new object which as the same number of rows but 6 instead of 11 columns (or variables)
5.4.2 Renaming columns
Sometimes when you receive data, you may find that the column names are not very descriptive or useful, and it may be necessary to rename them. You can assign new names to columns when you select them select(newColumnName = OldColumnName)
or you can use the rename()
function. Like naming objects, you should use a simple, descriptive, relatively short name without spaces for your column names.
<-
measles_us_mod rename(measles_us_mod, State = Admin1Name)
5.4.3 Creating new columns with mutate()
Frequently you’ll want to create new columns based on the values in existing columns, for example to do unit conversions, or to find the ratio of values in two columns. For this you can use the mutate()
function. The transmute()
function is similar, but replaces old columns with the new one.
We know that each row in our dataset represents some time span, but we’re not sure what that span is. Below we use mutate()
to create a new column called Span, which gives the number of days between the PeriodEndDate column and the PeriodStartDate column. This is why it is important to make sure your date columns are parsed properly as dates when importing your data. If the dates had been left as character data, it would not have been possible to do arithmetic with them.
<-
measles_us_mod mutate(measles_us_mod, Span = PeriodEndDate - PeriodStartDate)
measles_us_mod
## # A tibble: 422,051 x 6
## State PeriodStartDate PeriodEndDate PartOfCumulativeCount~ CountValue Span
## <chr> <date> <date> <dbl> <dbl> <drt>
## 1 WISCON~ 1927-11-20 1927-11-26 0 85 6 da~
## 2 WISCON~ 1927-11-27 1927-12-03 0 120 6 da~
## 3 WISCON~ 1927-12-04 1927-12-10 0 84 6 da~
## 4 WISCON~ 1927-12-18 1927-12-24 0 106 6 da~
## 5 WISCON~ 1927-12-25 1927-12-31 0 39 6 da~
## 6 WISCON~ 1928-01-01 1928-01-07 0 45 6 da~
## 7 WISCON~ 1928-01-08 1928-01-14 0 28 6 da~
## 8 WISCON~ 1928-01-15 1928-01-21 0 140 6 da~
## 9 WISCON~ 1928-01-22 1928-01-28 0 48 6 da~
## 10 WISCON~ 1928-01-29 1928-02-04 0 85 6 da~
## # ... with 422,041 more rows
It’s hard to get a sense of what different time spans are present in our data. We can use the count()
function to help us figure that out.
count(measles_us_mod,Span)
## # A tibble: 59 x 2
## Span n
## <drtn> <int>
## 1 6 days 338063
## 2 13 days 558
## 3 20 days 693
## 4 27 days 782
## 5 34 days 857
## 6 41 days 929
## 7 48 days 913
## 8 55 days 997
## 9 62 days 1083
## 10 69 days 1136
## # ... with 49 more rows
All those different time spans will get us into trouble when we try to work with the data. So next we need to filter the dataset for only the rows we want.