5.5 Working with rows

5.5.1 `filter()`

select() lets us choose columns. To choose rows based on a specific criteria, we can use the filter() function. The arguments after the data frame are the condition(s) we want for our final dataframe to adhere to. Specify conditions using logical operator:

operator	meaning
==	exactly equal
!=	not equal to
<	less than
<=	less than or equal to
>	greater than
>=	greater than or equal to
x\|y	x or y
x&y	x and y
!x	not x

We’ll come back to our problem of different time spans in a moment. First, let’s try filtering our data by just one condition. We want to see just the rows that contain counts from Maryland.

measles_us_mod %>% filter(State == "MARYLAND")

## # A tibble: 7,246 x 6
##    State   PeriodStartDate PeriodEndDate PartOfCumulativeCount~ CountValue Span 
##    <chr>   <date>          <date>                         <dbl>      <dbl> <drt>
##  1 MARYLA~ 1927-11-27      1927-12-03                         0         64 6 da~
##  2 MARYLA~ 1927-12-04      1927-12-10                         0         88 6 da~
##  3 MARYLA~ 1927-12-18      1927-12-24                         0        105 6 da~
##  4 MARYLA~ 1927-12-25      1927-12-31                         0        109 6 da~
##  5 MARYLA~ 1928-01-01      1928-01-07                         0        175 6 da~
##  6 MARYLA~ 1928-01-08      1928-01-14                         0        249 6 da~
##  7 MARYLA~ 1928-01-15      1928-01-21                         0        345 6 da~
##  8 MARYLA~ 1928-01-22      1928-01-28                         0        365 6 da~
##  9 MARYLA~ 1928-01-29      1928-02-04                         0        504 6 da~
## 10 MARYLA~ 1928-02-05      1928-02-11                         0        563 6 da~
## # ... with 7,236 more rows

When matching strings you must be exact. R is case-sensitive. So State == "Maryland" or State == "maryland" would return 0 rows.

You can add additional conditions to filter by, separated by commas or other logical operators like &, >, and >.

Below we want just the rows for Maryland, and only include periods where the count was more than 500 reported cases. Note that while you need quotation marks around character data, you do not need them around numeric data.

measles_us_mod %>% 
  filter(State == "MARYLAND" & CountValue > 500)

## # A tibble: 328 x 6
##    State   PeriodStartDate PeriodEndDate PartOfCumulativeCount~ CountValue Span 
##    <chr>   <date>          <date>                         <dbl>      <dbl> <drt>
##  1 MARYLA~ 1928-01-29      1928-02-04                         0        504 6 da~
##  2 MARYLA~ 1928-02-05      1928-02-11                         0        563 6 da~
##  3 MARYLA~ 1928-02-12      1928-02-18                         0        696 6 da~
##  4 MARYLA~ 1928-02-19      1928-02-25                         0        750 6 da~
##  5 MARYLA~ 1928-02-26      1928-03-03                         0       1012 6 da~
##  6 MARYLA~ 1928-03-04      1928-03-10                         0        951 6 da~
##  7 MARYLA~ 1928-03-11      1928-03-17                         0       1189 6 da~
##  8 MARYLA~ 1928-03-18      1928-03-24                         0       1163 6 da~
##  9 MARYLA~ 1928-03-25      1928-03-31                         0       1020 6 da~
## 10 MARYLA~ 1928-04-01      1928-04-07                         0        753 6 da~
## # ... with 318 more rows

5.5.2 Grouping and Summarizing data

Many data analysis tasks can be approached using the split-apply-combine paradigm: split the data into groups, apply some analysis to each group, and then combine the results. dplyr makes this very easy through the use of the group_by() function.

group_by() is often used together with summarize(), which collapses each group into a single-row summary of that group. group_by() takes as arguments the column names that contain the categorical variables for which you want to calculate the summary statistics.

Let’s return to our measles_us_mod data frame. To analyze this further, we need to separate out the rows that represent cumulative counts from the rows that have a fixed interval. So we use filter() as we did above, this time saving to a new object called non_cumulative_period.

#filter out cumulative counts

non_cumulative <-
  measles_us_mod %>%
    filter(PartOfCumulativeCountSeries == 0)

How can we calculate the total number of measles cases in each state for each reported time period?

First we need to group our data by state, and then group by period. Remember that in our original data set each state contained counts for multiple municipalities. That means in our current data frame there will be multiple rows for each period that needs to be condensed.

#group - prepares for summary calculations
count_by_period <-
  non_cumulative %>%
  group_by(State, PeriodStartDate)

count_by_period

## # A tibble: 337,637 x 6
## # Groups:   State, PeriodStartDate [150,481]
##    State   PeriodStartDate PeriodEndDate PartOfCumulativeCount~ CountValue Span 
##    <chr>   <date>          <date>                         <dbl>      <dbl> <drt>
##  1 WISCON~ 1927-11-20      1927-11-26                         0         85 6 da~
##  2 WISCON~ 1927-11-27      1927-12-03                         0        120 6 da~
##  3 WISCON~ 1927-12-04      1927-12-10                         0         84 6 da~
##  4 WISCON~ 1927-12-18      1927-12-24                         0        106 6 da~
##  5 WISCON~ 1927-12-25      1927-12-31                         0         39 6 da~
##  6 WISCON~ 1928-01-01      1928-01-07                         0         45 6 da~
##  7 WISCON~ 1928-01-08      1928-01-14                         0         28 6 da~
##  8 WISCON~ 1928-01-15      1928-01-21                         0        140 6 da~
##  9 WISCON~ 1928-01-22      1928-01-28                         0         48 6 da~
## 10 WISCON~ 1928-01-29      1928-02-04                         0         85 6 da~
## # ... with 337,627 more rows

When you inspect your new data frame, everything should look the same. Grouping prepares your data for summarize, but it does not do anything visually to the data.

Now let’s trying summarizing that data. summarize() takes the dataframe as it’s first argument, then you need to provide some calculation and the name for the new column where the results of that calculation will go. In this case, we will use the sum() function on the CountValue column and put this in a new column called TotalCount

count_by_period <-
  non_cumulative %>%
  group_by(State, PeriodStartDate) %>% 
  summarize(TotalCount = sum(CountValue))

## `summarise()` has grouped output by 'State'. You can override using the `.groups` argument.

Notice how the use of pipes really comes in handy here. It saved us from having to create and keep track of a number of intermediate objects.

Now each row in our dataset represents a week of data. It might be better if we can look at yearly trends. So, let’s try separating out the year from PeriodStartDate into its own column. In addition to mutate(), we will also use the year() function from lubridate. Since lubridate is not loaded with the core tidyverse let’s go ahead and load that now.

library(lubridate)

## 
## Attaching package: 'lubridate'

## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

Remember, you only need to install a package once, but you have to load it with each new session.

#Get year in separate column from PeriodStartDate
count_by_period <- 
  count_by_period %>% 
  mutate(Year = year(PeriodStartDate))

count_by_period

## # A tibble: 150,481 x 4
## # Groups:   State [56]
##    State   PeriodStartDate TotalCount  Year
##    <chr>   <date>               <dbl> <dbl>
##  1 ALABAMA 1909-05-23               1  1909
##  2 ALABAMA 1910-01-16               1  1910
##  3 ALABAMA 1910-02-06              13  1910
##  4 ALABAMA 1910-04-10             127  1910
##  5 ALABAMA 1910-04-17              81  1910
##  6 ALABAMA 1910-04-24              73  1910
##  7 ALABAMA 1910-05-01              62  1910
##  8 ALABAMA 1910-05-09              72  1910
##  9 ALABAMA 1910-05-15              39  1910
## 10 ALABAMA 1910-05-22              38  1910
## # ... with 150,471 more rows

Now we can do our grouping and summarizing again as we did above.

#Get totals for each state each year.
yearly_count_state <-
  count_by_period %>% 
  group_by(Year, State) %>% 
  summarise(TotalCount = sum(TotalCount))

## `summarise()` has grouped output by 'Year'. You can override using the `.groups` argument.

yearly_count_state

## # A tibble: 4,210 x 3
## # Groups:   Year [96]
##     Year State         TotalCount
##    <dbl> <chr>              <dbl>
##  1  1906 CALIFORNIA           224
##  2  1906 CONNECTICUT           23
##  3  1906 FLORIDA                4
##  4  1906 ILLINOIS             187
##  5  1906 INDIANA               20
##  6  1906 KENTUCKY               2
##  7  1906 MAINE                 26
##  8  1906 MASSACHUSETTS        282
##  9  1906 MICHIGAN             320
## 10  1906 MISSOURI             274
## # ... with 4,200 more rows

Alternatively, to see a national level summary, we could group just by Year

#Get totals each year.
yearly_count <-
  count_by_period %>% 
  group_by(Year) %>%
  summarise(TotalCount = sum(TotalCount))

yearly_count

## # A tibble: 96 x 2
##     Year TotalCount
##    <dbl>      <dbl>
##  1  1906       2345
##  2  1907      40699
##  3  1908      55632
##  4  1909      56317
##  5  1910      87191
##  6  1911      62306
##  7  1912      66388
##  8  1913     117571
##  9  1914      57363
## 10  1915      95507
## # ... with 86 more rows