5.5 Working with rows

5.5.1 filter()

select() lets us choose columns. To choose rows based on a specific criteria, we can use the filter() function. The arguments after the dataframe are the condition(s) we want for our final dataframe to adhere to. Specify conditions using logical operator:

operator meaning
== exactly equal
!= not equal to
< less than
<= less than or equal to
> greater than
>= greater than or equal to
x|y x or y
x&y x and y
!x not x

First, let’s try filtering by just one condition. We want to see just the rows that contain counts from Maryland.

filter(measles_us_modified, State == "MARYLAND")
## # A tibble: 7,246 x 5
##    State    PeriodStartDate PeriodEndDate PartOfCumulativeCountSeries CountValue
##    <chr>    <date>          <date>                              <dbl>      <dbl>
##  1 MARYLAND 1927-11-27      1927-12-03                              0         64
##  2 MARYLAND 1927-12-04      1927-12-10                              0         88
##  3 MARYLAND 1927-12-18      1927-12-24                              0        105
##  4 MARYLAND 1927-12-25      1927-12-31                              0        109
##  5 MARYLAND 1928-01-01      1928-01-07                              0        175
##  6 MARYLAND 1928-01-08      1928-01-14                              0        249
##  7 MARYLAND 1928-01-15      1928-01-21                              0        345
##  8 MARYLAND 1928-01-22      1928-01-28                              0        365
##  9 MARYLAND 1928-01-29      1928-02-04                              0        504
## 10 MARYLAND 1928-02-05      1928-02-11                              0        563
## # … with 7,236 more rows

When matching strings you must be exact. R is case-sensitive. So Admin1Name == “Maryland” or Admin1Name == “maryland” would return 0 rows.

You can add additional conditions to filter by, separated by commas or other logical operators like &, >, and >. Below we want just the rows for Maryland, we want so separate out the overlapping time periods, and only include periods where the count was more than 500 reported cases. Note that while you need quotation marks around character data, you do not need them around numeric data.

filter(
  measles_us_modified,
  State == "MARYLAND" &
    PartOfCumulativeCountSeries == 0 & CountValue > 500
)
## # A tibble: 141 x 5
##    State    PeriodStartDate PeriodEndDate PartOfCumulativeCountSeries CountValue
##    <chr>    <date>          <date>                              <dbl>      <dbl>
##  1 MARYLAND 1928-01-29      1928-02-04                              0        504
##  2 MARYLAND 1928-02-05      1928-02-11                              0        563
##  3 MARYLAND 1928-02-12      1928-02-18                              0        696
##  4 MARYLAND 1928-02-19      1928-02-25                              0        750
##  5 MARYLAND 1928-02-26      1928-03-03                              0       1012
##  6 MARYLAND 1928-03-04      1928-03-10                              0        951
##  7 MARYLAND 1928-03-11      1928-03-17                              0       1189
##  8 MARYLAND 1928-03-18      1928-03-24                              0       1163
##  9 MARYLAND 1928-03-25      1928-03-31                              0       1020
## 10 MARYLAND 1928-04-01      1928-04-07                              0        753
## # … with 131 more rows

5.5.2 Grouping and Summarizing data

Many data analysis tasks can be approached using the split-apply-combine paradigm: split the data into groups, apply some analysis to each group, and then combine the results. dplyr makes this very easy through the use of the group_by() function.

group_by() is often used together with summarize(), which collapses each group into a single-row summary of that group. group_by() takes as arguments the column names that contain the categorical variables for which you want to calculate the summary statistics.

Let’s return to our measles_us_modified dataframe. To analyze this further, we need to separate out the rows that represent cumulative counts and the rows where the counts represent only fatalities. So we use filter() as we did above, this time saving to a new object called non-fatal.

#filter out cumulative counts, summarize counts per week.

#first just the filter part
non_cumulative <-
  filter(measles_us_modified,
         PartOfCumulativeCountSeries == 0)

How can we calculate the total number of measles cases in each state for each reported time period?

First we need to group our data by state, and then group by period. Remember that in our original dataset each state contained counts for multiple municipalities. That means in our current dataframe there will be multiple rows for each period that needs to be condensed.

#group - prepares for summary calculations
non_cumulative_group <-
  group_by(non_cumulative, State, PeriodStartDate)

non_cumulative_group
## # A tibble: 337,637 x 5
## # Groups:   State, PeriodStartDate [150,481]
##    State     PeriodStartDate PeriodEndDate PartOfCumulativeCountSeri… CountValue
##    <chr>     <date>          <date>                             <dbl>      <dbl>
##  1 WISCONSIN 1927-11-20      1927-11-26                             0         85
##  2 WISCONSIN 1927-11-27      1927-12-03                             0        120
##  3 WISCONSIN 1927-12-04      1927-12-10                             0         84
##  4 WISCONSIN 1927-12-18      1927-12-24                             0        106
##  5 WISCONSIN 1927-12-25      1927-12-31                             0         39
##  6 WISCONSIN 1928-01-01      1928-01-07                             0         45
##  7 WISCONSIN 1928-01-08      1928-01-14                             0         28
##  8 WISCONSIN 1928-01-15      1928-01-21                             0        140
##  9 WISCONSIN 1928-01-22      1928-01-28                             0         48
## 10 WISCONSIN 1928-01-29      1928-02-04                             0         85
## # … with 337,627 more rows

When you inspect your new dataframe, everything should look the same. Grouping prepares your data for summarize, but it does not do anything visually to the data.

Now let’s trying summarizing that data. summarize() takes the dataframe as it’s first argument, then you need to provide some calculation and the name for the new column where the results of that calculation will go. In this case, we will use the sum() function on the CountValue column and put this in a new column called TotalCount

#summarize the groups
non_cumulative_period <-
  summarize(non_cumulative_group, TotalCount = sum(CountValue))
## `summarise()` has grouped output by 'State'. You can override using the `.groups` argument.