5.9 Joining datasets
Of course, looking at total counts in each state is not the most helpful metric without taking population into account. To rectify this, let’s try joining some historical population data with our measles data.
First we need to import the population data.
#load csv of populations by state over time, changing some of the datatypes from default
<-
hist_pop_by_state read_csv(
"data/Historical_Population_by_State.csv",
col_types = cols(ALASKA = col_double(), HAWAII = col_double())
)View(hist_pop_by_state)
As we saw in our measles data import, the presence of NAs makes it necessary to explicitly state that the Hawaii and Alaska columns contain numerical data.
5.9.1 Long vs Wide formats
For data to be considered “tidy,” it should be in what is called “long” format. Each column is a variable, each row is an observation, and each cell is a value. Our state population data is in “wide” format, which is often preferable for human-readability, but is less ideal for machine-readability.
We will use the package tidyr
and the function pivot_longer
to convert our population data to a long format, thus making it easier to join with our measles data.
Each column in our population dataset represents a state. To make it tidy we are going to reduce those to one column called State with the state names as the values of the column. We will then need to create a new column for population containing the current cell values. To remember that the population data is provided in 1000s of persons, we will call this new column pop1000.
pivot_longer()
takes four principal arguments:
- the data
- cols are the names of the columns we use to fill the new values variable (or to drop).
- the names_to column variable we wish to create from the cols provided.
- the values_to column variable we wish to create and fill with values associated with the cols provided.
library(tidyr)
<- hist_pop_by_state %>%
hist_pop_long pivot_longer(ALASKA:WYOMING,
names_to = "State",
values_to = "pop1000")
View(hist_pop_long)
Now our two datasets have similar structures, a column of state names, a column of years, and a column of values. Let’s join these two datasets by the state and year columns. Note that if both sets have the same column names, you do not need to specify anything in the by argument. We use a left join here which preserves all the rows in our measles dataset and adds the matching rows from the population dataset.
<- left_join(non_cumulative_year, hist_pop_long, by=c("State" = "State", "Year" = "DATE" ))
joined_df
joined_df
## # A tibble: 4,210 x 4
## # Groups: State [56]
## State Year TotalCount pop1000
## <chr> <dbl> <dbl> <dbl>
## 1 ALABAMA 1909 1 2108
## 2 ALABAMA 1910 606 2150
## 3 ALABAMA 1911 587 2185
## 4 ALABAMA 1912 109 2217
## 5 ALABAMA 1913 7 2279
## 6 ALABAMA 1914 4 2336
## 7 ALABAMA 1915 2 2345
## 8 ALABAMA 1916 66 2359
## 9 ALABAMA 1917 5134 2361
## 10 ALABAMA 1918 1651 2343
## # … with 4,200 more rows
Now we can use our old friend mutate()
to add a rate column calculated from the count and pop1000 columns.
#Add column for rate (per 1000) of measles
<- mutate(joined_df, rate = TotalCount / pop1000)
rate_by_year rate_by_year
## # A tibble: 4,210 x 5
## # Groups: State [56]
## State Year TotalCount pop1000 rate
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 ALABAMA 1909 1 2108 0.000474
## 2 ALABAMA 1910 606 2150 0.282
## 3 ALABAMA 1911 587 2185 0.269
## 4 ALABAMA 1912 109 2217 0.0492
## 5 ALABAMA 1913 7 2279 0.00307
## 6 ALABAMA 1914 4 2336 0.00171
## 7 ALABAMA 1915 2 2345 0.000853
## 8 ALABAMA 1916 66 2359 0.0280
## 9 ALABAMA 1917 5134 2361 2.17
## 10 ALABAMA 1918 1651 2343 0.705
## # … with 4,200 more rows
Now, let’s plot that data
<-
rate_1963 %>%
rate_by_year filter(Year == 1963) %>%
ggplot(aes(x = State, y = rate)) +
geom_bar(stat = "identity") +
theme(axis.text.x = element_text(angle = 90))
rate_1963
## Warning: Removed 1 rows containing missing values (position_stack).
It can be helpful to look at bar graphs sorted
<-
rate_1963 %>%
rate_by_year filter(Year == 1963) %>%
ggplot(aes(x = reorder(State, -rate), y = rate)) +
geom_bar(stat = "identity") +
theme(axis.text.x = element_text(angle = 90))
rate_1963
## Warning: Removed 1 rows containing missing values (position_stack).