6.1 Long vs Wide formats
For data to be considered “tidy,” it should be in what is called “long” format. Each column is a variable, each row is an observation, and each cell is a value. Our state population data is in “wide” format, which is often preferable for human-readability, but is less ideal for machine-readability.
We will use the package tidyr
and the function pivot_longer
to convert our population data to a long format, thus making it easier to join with our measles data.
Each column in our population dataset represents a state. To make it tidy we are going to reduce those to one column called State with the state names as the values of the column. We will then need to create a new column for population containing the current cell values. To remember that the population data is provided in 1000s of persons, we will call this new column pop1000.
pivot_longer()
takes four principal arguments:
- the data
- cols are the names of the columns we use to fill the new values variable (or to drop).
- the names_to column variable we wish to create from the cols provided.
- the values_to column variable we wish to create and fill with values associated with the cols provided.
library(tidyr)
<- hist_pop_by_state %>%
hist_pop_long pivot_longer(ALASKA:WYOMING,
names_to = "State",
values_to = "pop1000")
View(hist_pop_long)
Now our two datasets have similar structures, a column of state names, a column of years, and a column of values. Let’s join these two datasets by the state and year columns. Note that if both sets have the same column names, you do not need to specify anything in the by argument. We use a left join here which preserves all the rows in our measles dataset and adds the matching rows from the population dataset.
<-
joined_df left_join(yearly_count_state, hist_pop_long, by=c("State" = "State", "Year" = "DATE" ))
joined_df
## # A tibble: 4,210 x 4
## # Groups: Year [96]
## Year State TotalCount pop1000
## <dbl> <chr> <dbl> <dbl>
## 1 1906 CALIFORNIA 224 1976
## 2 1906 CONNECTICUT 23 1033
## 3 1906 FLORIDA 4 628
## 4 1906 ILLINOIS 187 5309
## 5 1906 INDIANA 20 2663
## 6 1906 KENTUCKY 2 2234
## 7 1906 MAINE 26 729
## 8 1906 MASSACHUSETTS 282 3107
## 9 1906 MICHIGAN 320 2626
## 10 1906 MISSOURI 274 3223
## # ... with 4,200 more rows
Now we can use our old friend mutate()
to add a rate column calculated from the count and pop1000 columns.
#Add column for rate (per 1000) of measles
<- mutate(joined_df, rate = TotalCount / pop1000)
rate_by_year rate_by_year
## # A tibble: 4,210 x 5
## # Groups: Year [96]
## Year State TotalCount pop1000 rate
## <dbl> <chr> <dbl> <dbl> <dbl>
## 1 1906 CALIFORNIA 224 1976 0.113
## 2 1906 CONNECTICUT 23 1033 0.0223
## 3 1906 FLORIDA 4 628 0.00637
## 4 1906 ILLINOIS 187 5309 0.0352
## 5 1906 INDIANA 20 2663 0.00751
## 6 1906 KENTUCKY 2 2234 0.000895
## 7 1906 MAINE 26 729 0.0357
## 8 1906 MASSACHUSETTS 282 3107 0.0908
## 9 1906 MICHIGAN 320 2626 0.122
## 10 1906 MISSOURI 274 3223 0.0850
## # ... with 4,200 more rows
Now, let’s plot that data
<-
rate_1963 %>%
rate_by_year filter(Year == 1963) %>%
ggplot(aes(x = State, y = rate)) +
geom_bar(stat = "identity") +
theme(axis.text.x = element_text(angle = 90))
rate_1963
## Warning: Removed 1 rows containing missing values (position_stack).
It can be helpful to look at bar graphs sorted
<-
rate_1963 %>%
rate_by_year filter(Year == 1963) %>%
ggplot(aes(x = reorder(State, -rate), y = rate)) +
geom_bar(stat = "identity") +
theme(axis.text.x = element_text(angle = 90))
rate_1963
## Warning: Removed 1 rows containing missing values (position_stack).