5.4 Working with columns

5.4.1 Select()

To select columns of a data frame, use select(). This is a great way to create a smaller data frame with only the variables you are interested in.

select() is a function. Remember that functions have arguments which go inside the parentheses. The first argument to this function is the name of the data object, which in this case is measles_us, and the subsequent arguments are the names of the columns we want to keep, separated by commas.

Let’s try selecting the Admin1Name column and the CountValue column.

select(measles_us, Admin1Name, CountValue)
## # A tibble: 422,051 x 2
##    Admin1Name CountValue
##    <chr>           <dbl>
##  1 WISCONSIN          85
##  2 WISCONSIN         120
##  3 WISCONSIN          84
##  4 WISCONSIN         106
##  5 WISCONSIN          39
##  6 WISCONSIN          45
##  7 WISCONSIN          28
##  8 WISCONSIN         140
##  9 WISCONSIN          48
## 10 WISCONSIN          85
## # ... with 422,041 more rows

As you can imagine, if you had a number of columns to select, it could get tiresome to write them all out. One way around this is to use a colon : to name a range of adjacent columns.

select(measles_us, ConditionName:Admin1ISO)
## # A tibble: 422,051 x 4
##    ConditionName CountryName              Admin1Name Admin1ISO
##    <chr>         <chr>                    <chr>      <chr>    
##  1 Measles       UNITED STATES OF AMERICA WISCONSIN  US-WI    
##  2 Measles       UNITED STATES OF AMERICA WISCONSIN  US-WI    
##  3 Measles       UNITED STATES OF AMERICA WISCONSIN  US-WI    
##  4 Measles       UNITED STATES OF AMERICA WISCONSIN  US-WI    
##  5 Measles       UNITED STATES OF AMERICA WISCONSIN  US-WI    
##  6 Measles       UNITED STATES OF AMERICA WISCONSIN  US-WI    
##  7 Measles       UNITED STATES OF AMERICA WISCONSIN  US-WI    
##  8 Measles       UNITED STATES OF AMERICA WISCONSIN  US-WI    
##  9 Measles       UNITED STATES OF AMERICA WISCONSIN  US-WI    
## 10 Measles       UNITED STATES OF AMERICA WISCONSIN  US-WI    
## # ... with 422,041 more rows

Another helpful tool that the tidyverse provides is the pipe operator which looks like %>%. The pipe is made available via the magrittr package, installed automatically with dplyr. If you use RStudio, you can type the pipe with Ctrl + Shift + M if you have a PC or Cmd + Shift + M if you have a Mac.

With the pipe you start with your data object and pipe it to the function, rather than naming the data as your first argument. So, the pipe becomes especially valuable when you have a number of steps that you want to connect. Another benefit of using the pipe in RStudio is that the interface will supply column names to you in the auto complete. This helps so you do not need to remember sometimes lengthy column names, and you are less likely to get an error from a typo.

measles_us %>% 
  select(Admin1Name, PartOfCumulativeCountSeries)
## # A tibble: 422,051 x 2
##    Admin1Name PartOfCumulativeCountSeries
##    <chr>                            <dbl>
##  1 WISCONSIN                            0
##  2 WISCONSIN                            0
##  3 WISCONSIN                            0
##  4 WISCONSIN                            0
##  5 WISCONSIN                            0
##  6 WISCONSIN                            0
##  7 WISCONSIN                            0
##  8 WISCONSIN                            0
##  9 WISCONSIN                            0
## 10 WISCONSIN                            0
## # ... with 422,041 more rows

Now unfortunately, we did not save this to an object, so while it prints out to the console, we cannot really do anything with these results. So now, let’s think through which columns we want for our analysis and save this to a new object called measles_us_mod.

So if we look at our data, we see that the columns ConditionName, CountryName, and SourceName have the same values throughout the data set. So it seems redundant to keep those. Admin1ISO has redundant information to Admin1Name, so we can probably get rid of that as well. For this particular analysis we are going to take a state level view of measles, so we do not need Admin2Name and CityName either. That leaves us with Admin1Name, PeriodStartDate, PeriodEndDate, PartOfCumulativeCountSeries, and CountValue.

measles_us_mod <-
  measles_us %>% 
    select(
    Admin1Name,
    PeriodStartDate,
    PeriodEndDate,
    PartOfCumulativeCountSeries,
    CountValue
  )

#inspect our new data frame
glimpse(measles_us_mod)
## Rows: 422,051
## Columns: 5
## $ Admin1Name                  <chr> "WISCONSIN", "WISCONSIN", "WISCONSIN", "WI~
## $ PeriodStartDate             <date> 1927-11-20, 1927-11-27, 1927-12-04, 1927-~
## $ PeriodEndDate               <date> 1927-11-26, 1927-12-03, 1927-12-10, 1927-~
## $ PartOfCumulativeCountSeries <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ CountValue                  <dbl> 85, 120, 84, 106, 39, 45, 28, 140, 48, 85,~

Now when you look in your environment pane, you should see your new object which as the same number of rows but 6 instead of 11 columns (or variables)

5.4.2 Renaming columns

Sometimes when you receive data, you may find that the column names are not very descriptive or useful, and it may be necessary to rename them. You can assign new names to columns when you select them select(newColumnName = OldColumnName) or you can use the rename() function. Like naming objects, you should use a simple, descriptive, relatively short name without spaces for your column names.

measles_us_mod <-
  rename(measles_us_mod, State = Admin1Name)

5.4.3 Creating new columns with mutate()

Frequently you’ll want to create new columns based on the values in existing columns, for example to do unit conversions, or to find the ratio of values in two columns. For this you can use the mutate() function. The transmute() function is similar, but replaces old columns with the new one.

We know that each row in our dataset represents some time span, but we’re not sure what that span is. Below we use mutate() to create a new column called Span, which gives the number of days between the PeriodEndDate column and the PeriodStartDate column. This is why it is important to make sure your date columns are parsed properly as dates when importing your data. If the dates had been left as character data, it would not have been possible to do arithmetic with them.

measles_us_mod <- 
  mutate(measles_us_mod, Span = PeriodEndDate - PeriodStartDate)

measles_us_mod
## # A tibble: 422,051 x 6
##    State   PeriodStartDate PeriodEndDate PartOfCumulativeCount~ CountValue Span 
##    <chr>   <date>          <date>                         <dbl>      <dbl> <drt>
##  1 WISCON~ 1927-11-20      1927-11-26                         0         85 6 da~
##  2 WISCON~ 1927-11-27      1927-12-03                         0        120 6 da~
##  3 WISCON~ 1927-12-04      1927-12-10                         0         84 6 da~
##  4 WISCON~ 1927-12-18      1927-12-24                         0        106 6 da~
##  5 WISCON~ 1927-12-25      1927-12-31                         0         39 6 da~
##  6 WISCON~ 1928-01-01      1928-01-07                         0         45 6 da~
##  7 WISCON~ 1928-01-08      1928-01-14                         0         28 6 da~
##  8 WISCON~ 1928-01-15      1928-01-21                         0        140 6 da~
##  9 WISCON~ 1928-01-22      1928-01-28                         0         48 6 da~
## 10 WISCON~ 1928-01-29      1928-02-04                         0         85 6 da~
## # ... with 422,041 more rows

It’s hard to get a sense of what different time spans are present in our data. We can use the count() function to help us figure that out.

count(measles_us_mod,Span)
## # A tibble: 59 x 2
##    Span         n
##    <drtn>   <int>
##  1  6 days 338063
##  2 13 days    558
##  3 20 days    693
##  4 27 days    782
##  5 34 days    857
##  6 41 days    929
##  7 48 days    913
##  8 55 days    997
##  9 62 days   1083
## 10 69 days   1136
## # ... with 49 more rows

All those different time spans will get us into trouble when we try to work with the data. So next we need to filter the dataset for only the rows we want.