5.5 Working with rows
5.5.1 filter()
select()
lets us choose columns. To choose rows based on a specific criteria, we can use the filter()
function. The arguments after the data frame are the condition(s) we want for our final dataframe to adhere to. Specify conditions using logical operator:
operator | meaning |
---|---|
== | exactly equal |
!= | not equal to |
< | less than |
<= | less than or equal to |
> | greater than |
>= | greater than or equal to |
x|y | x or y |
x&y | x and y |
!x | not x |
We’ll come back to our problem of different time spans in a moment. First, let’s try filtering our data by just one condition. We want to see just the rows that contain counts from Maryland.
%>% filter(State == "MARYLAND") measles_us_mod
## # A tibble: 7,246 x 6
## State PeriodStartDate PeriodEndDate PartOfCumulativeCount~ CountValue Span
## <chr> <date> <date> <dbl> <dbl> <drt>
## 1 MARYLA~ 1927-11-27 1927-12-03 0 64 6 da~
## 2 MARYLA~ 1927-12-04 1927-12-10 0 88 6 da~
## 3 MARYLA~ 1927-12-18 1927-12-24 0 105 6 da~
## 4 MARYLA~ 1927-12-25 1927-12-31 0 109 6 da~
## 5 MARYLA~ 1928-01-01 1928-01-07 0 175 6 da~
## 6 MARYLA~ 1928-01-08 1928-01-14 0 249 6 da~
## 7 MARYLA~ 1928-01-15 1928-01-21 0 345 6 da~
## 8 MARYLA~ 1928-01-22 1928-01-28 0 365 6 da~
## 9 MARYLA~ 1928-01-29 1928-02-04 0 504 6 da~
## 10 MARYLA~ 1928-02-05 1928-02-11 0 563 6 da~
## # ... with 7,236 more rows
When matching strings you must be exact. R is case-sensitive. So State == "Maryland"
or State == "maryland"
would return 0 rows.
You can add additional conditions to filter by, separated by commas or other logical operators like &
, >
, and >
.
Below we want just the rows for Maryland, and only include periods where the count was more than 500 reported cases. Note that while you need quotation marks around character data, you do not need them around numeric data.
%>%
measles_us_mod filter(State == "MARYLAND" & CountValue > 500)
## # A tibble: 328 x 6
## State PeriodStartDate PeriodEndDate PartOfCumulativeCount~ CountValue Span
## <chr> <date> <date> <dbl> <dbl> <drt>
## 1 MARYLA~ 1928-01-29 1928-02-04 0 504 6 da~
## 2 MARYLA~ 1928-02-05 1928-02-11 0 563 6 da~
## 3 MARYLA~ 1928-02-12 1928-02-18 0 696 6 da~
## 4 MARYLA~ 1928-02-19 1928-02-25 0 750 6 da~
## 5 MARYLA~ 1928-02-26 1928-03-03 0 1012 6 da~
## 6 MARYLA~ 1928-03-04 1928-03-10 0 951 6 da~
## 7 MARYLA~ 1928-03-11 1928-03-17 0 1189 6 da~
## 8 MARYLA~ 1928-03-18 1928-03-24 0 1163 6 da~
## 9 MARYLA~ 1928-03-25 1928-03-31 0 1020 6 da~
## 10 MARYLA~ 1928-04-01 1928-04-07 0 753 6 da~
## # ... with 318 more rows
5.5.2 Grouping and Summarizing data
Many data analysis tasks can be approached using the split-apply-combine paradigm: split the data into groups, apply some analysis to each group, and then combine the results. dplyr
makes this very easy through the use of the group_by()
function.
group_by()
is often used together with summarize()
, which collapses each group into a single-row summary of that group. group_by()
takes as arguments the column names that contain the categorical variables for which you want to calculate the summary statistics.
Let’s return to our measles_us_mod
data frame. To analyze this further, we need to separate out the rows that represent cumulative counts from the rows that have a fixed interval. So we use filter()
as we did above, this time saving to a new object called non_cumulative_period
.
#filter out cumulative counts
<-
non_cumulative %>%
measles_us_mod filter(PartOfCumulativeCountSeries == 0)
How can we calculate the total number of measles cases in each state for each reported time period?
First we need to group our data by state, and then group by period. Remember that in our original data set each state contained counts for multiple municipalities. That means in our current data frame there will be multiple rows for each period that needs to be condensed.
#group - prepares for summary calculations
<-
count_by_period %>%
non_cumulative group_by(State, PeriodStartDate)
count_by_period
## # A tibble: 337,637 x 6
## # Groups: State, PeriodStartDate [150,481]
## State PeriodStartDate PeriodEndDate PartOfCumulativeCount~ CountValue Span
## <chr> <date> <date> <dbl> <dbl> <drt>
## 1 WISCON~ 1927-11-20 1927-11-26 0 85 6 da~
## 2 WISCON~ 1927-11-27 1927-12-03 0 120 6 da~
## 3 WISCON~ 1927-12-04 1927-12-10 0 84 6 da~
## 4 WISCON~ 1927-12-18 1927-12-24 0 106 6 da~
## 5 WISCON~ 1927-12-25 1927-12-31 0 39 6 da~
## 6 WISCON~ 1928-01-01 1928-01-07 0 45 6 da~
## 7 WISCON~ 1928-01-08 1928-01-14 0 28 6 da~
## 8 WISCON~ 1928-01-15 1928-01-21 0 140 6 da~
## 9 WISCON~ 1928-01-22 1928-01-28 0 48 6 da~
## 10 WISCON~ 1928-01-29 1928-02-04 0 85 6 da~
## # ... with 337,627 more rows
When you inspect your new data frame, everything should look the same. Grouping prepares your data for summarize, but it does not do anything visually to the data.
Now let’s trying summarizing that data. summarize()
takes the dataframe as it’s first argument, then you need to provide some calculation and the name for the new column where the results of that calculation will go. In this case, we will use the sum()
function on the CountValue column and put this in a new column called TotalCount
<-
count_by_period %>%
non_cumulative group_by(State, PeriodStartDate) %>%
summarize(TotalCount = sum(CountValue))
## `summarise()` has grouped output by 'State'. You can override using the `.groups` argument.
Notice how the use of pipes really comes in handy here. It saved us from having to create and keep track of a number of intermediate objects.
Now each row in our dataset represents a week of data. It might be better if we can look at yearly trends. So, let’s try separating out the year from PeriodStartDate into its own column. In addition to mutate()
, we will also use the year()
function from lubridate
. Since lubridate
is not loaded with the core tidyverse
let’s go ahead and load that now.
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
Remember, you only need to install a package once, but you have to load it with each new session.
#Get year in separate column from PeriodStartDate
<-
count_by_period %>%
count_by_period mutate(Year = year(PeriodStartDate))
count_by_period
## # A tibble: 150,481 x 4
## # Groups: State [56]
## State PeriodStartDate TotalCount Year
## <chr> <date> <dbl> <dbl>
## 1 ALABAMA 1909-05-23 1 1909
## 2 ALABAMA 1910-01-16 1 1910
## 3 ALABAMA 1910-02-06 13 1910
## 4 ALABAMA 1910-04-10 127 1910
## 5 ALABAMA 1910-04-17 81 1910
## 6 ALABAMA 1910-04-24 73 1910
## 7 ALABAMA 1910-05-01 62 1910
## 8 ALABAMA 1910-05-09 72 1910
## 9 ALABAMA 1910-05-15 39 1910
## 10 ALABAMA 1910-05-22 38 1910
## # ... with 150,471 more rows
Now we can do our grouping and summarizing again as we did above.
#Get totals for each state each year.
<-
yearly_count_state %>%
count_by_period group_by(Year, State) %>%
summarise(TotalCount = sum(TotalCount))
## `summarise()` has grouped output by 'Year'. You can override using the `.groups` argument.
yearly_count_state
## # A tibble: 4,210 x 3
## # Groups: Year [96]
## Year State TotalCount
## <dbl> <chr> <dbl>
## 1 1906 CALIFORNIA 224
## 2 1906 CONNECTICUT 23
## 3 1906 FLORIDA 4
## 4 1906 ILLINOIS 187
## 5 1906 INDIANA 20
## 6 1906 KENTUCKY 2
## 7 1906 MAINE 26
## 8 1906 MASSACHUSETTS 282
## 9 1906 MICHIGAN 320
## 10 1906 MISSOURI 274
## # ... with 4,200 more rows
Alternatively, to see a national level summary, we could group just by Year
#Get totals each year.
<-
yearly_count %>%
count_by_period group_by(Year) %>%
summarise(TotalCount = sum(TotalCount))
yearly_count
## # A tibble: 96 x 2
## Year TotalCount
## <dbl> <dbl>
## 1 1906 2345
## 2 1907 40699
## 3 1908 55632
## 4 1909 56317
## 5 1910 87191
## 6 1911 62306
## 7 1912 66388
## 8 1913 117571
## 9 1914 57363
## 10 1915 95507
## # ... with 86 more rows