4.5 Mutate & Group By
We are going to introduce two new functions at once now.
mutate() is a function you will use a lot. It is used any time you wish to create a new variable. It comes in two main flavours: mutate() and transmute(). mutate() creates a new variable and preserves the existing one, while transmute() replaces the variable. mutate() returns the complete dataset, while transmute() returns only the variable that you transmuted.
Often we wish to conduct a particular action only split by some group in the data. Grouping doesn’t change the look of the data, but changes how the other functions work on the data.
4.5.1 Mutate
Let’s say we wish to look at gdp on a logarithmic scale. This is easily doable with mutate. Note the slightly different syntax, where you have to express the name of the new variable and assign using a single ‘=’ instead of the assignment operator (<-).
sdg %>% mutate(log.gdp = log(gdp)) %>% select(country, gdp, log.gdp) %>% head()
## country gdp log.gdp
## 1 Afghanistan 561.7787 6.331108
## 2 Angola 3110.8080 8.042638
## 3 Argentina 12449.2200 9.429413
## 4 Armenia 3606.1520 8.190397
## 5 Bangladesh 1358.7800 7.214343
## 6 Belize 4810.5660 8.478570
So we don’t have to scroll all the way through the dataset to see these values, we combine mutate() and select() to extract the changed columns.
sdg %>% mutate(log.gdp = log(gdp)) %>%
select(country, gdp, log.gdp) %>% head()
## country gdp log.gdp
## 1 Afghanistan 561.7787 6.331108
## 2 Angola 3110.8080 8.042638
## 3 Argentina 12449.2200 9.429413
## 4 Armenia 3606.1520 8.190397
## 5 Bangladesh 1358.7800 7.214343
## 6 Belize 4810.5660 8.478570
Sometimes we might wish to do more complex transformations. What if we wanted to look at the difference in GDP from the mean?
We add na.rm to the mean() function to tell R it should ignore missing values.
sdg %>% mutate(diff.gdp = gdp - mean(gdp, na.rm = T)) %>%
select(country, gdp, diff.gdp) %>% head()
## country gdp diff.gdp
## 1 Afghanistan 561.7787 -15043.51
## 2 Angola 3110.8080 -12494.48
## 3 Argentina 12449.2200 -3156.07
## 4 Armenia 3606.1520 -11999.14
## 5 Bangladesh 1358.7800 -14246.51
## 6 Belize 4810.5660 -10794.72
To make a fairer comparison, we might want to look within each WHO region to determine how far each GDP is from its regional mean. We do this by adding a group_by() statement.
sdg %>% group_by(reg) %>%
mutate(diff.gdp = gdp - mean(gdp, na.rm = T)) %>%
select(country, gdp, diff.gdp) %>% head()
## Adding missing grouping variables: `reg`
## # A tibble: 6 x 4
## # Groups: reg [5]
## reg country gdp diff.gdp
## <fct> <fct> <dbl> <dbl>
## 1 EMR Afghanistan 562. -10447.
## 2 AFR Angola 3111. 1005.
## 3 AMR Argentina 12449. -1972.
## 4 EUR Armenia 3606. -27290.
## 5 SEA Bangladesh 1359. -1737.
## 6 AMR Belize 4811. -9611.
The possibilities of mutate() are pretty endless. Here are a few use cases. We will look at recoding data with mutate() in the Tidying Data section.
#--- Get the percentile rank of GDP
sdg %>% arrange(gdp) %>%
mutate(perc = percent_rank(gdp)) %>%
select(country, gdp, perc) %>%
head()
## country gdp perc
## 1 Burundi 285.7274 0.000000000
## 2 Malawi 300.7948 0.004926108
## 3 Niger 363.2270 0.009852217
## 4 Mozambique 382.0693 0.014778325
## 5 Central African Republic 382.2132 0.019704433
## 6 Madagascar 401.3189 0.024630542
#--- Divide TB incidence into quintiles
sdg %>% mutate(tb.quint = ntile(tb, 5)) %>%
select(country, tb, tb.quint) %>%
arrange(tb.quint) %>%
head()
## country tb tb.quint
## 1 Jordan 7.0 1
## 2 Antigua and Barbuda 7.5 1
## 3 Barbados 0.0 1
## 4 Bermuda 0.0 1
## 5 British Virgin Islands 0.0 1
## 6 Canada 5.1 1
#--- Create a z score for proportion living in slums by region
# A z-score shows where the country sits on a bell curve of all recorded slum values
# The mean is at 0 and the standard deviation is 1 - more about this on STEPH!
sdg %>% group_by(reg) %>%
filter(!is.na(slums)) %>%
mutate(z.slums = (slums - mean(slums))/sd(slums)) %>%
select(country:reg, slums, z.slums) %>%
arrange(reg, desc(z.slums)) %>%
head()
## # A tibble: 6 x 5
## # Groups: reg [1]
## country code reg slums z.slums
## <fct> <fct> <fct> <dbl> <dbl>
## 1 South Sudan SSD AFR 95.6 1.97
## 2 Central African Republic CAF AFR 93.3 1.84
## 3 Chad TCD AFR 88.2 1.57
## 4 Sao Tome and Principe STP AFR 86.6 1.49
## 5 Guinea-Bissau GNB AFR 82.3 1.26
## 6 Mozambique MOZ AFR 80.3 1.15
#--- Convert all numeric variables to factor variables
sdg %>% mutate_if(is.numeric, as.factor) %>% head()
## country code reg gdp gini pop delta.pop int.migrant urb
## 1 Afghanistan AFG EMR 561.7787 <NA> 34656032 33.84 1.18 27.13
## 2 Angola AGO AFR 3110.808 <NA> 28813463 42.2 0.43 44.82
## 3 Argentina ARG AMR 12449.22 42.7 43847430 10.84 4.81 91.89
## 4 Armenia ARM EUR 3606.152 32.4 2924816 -1.14 6.34 62.56
## 5 Bangladesh BGD SEA 1358.78 <NA> 163000000 12.1 0.88 35.04
## 6 Belize BLZ AMR 4810.566 <NA> 366954 26.21 14.99 43.85
## delta.urb emp.ratio slums pop.density largest.city sanitation water
## 1 3.9 48.05 62.7 53.08 51.49 45.1 78.2
## 2 7.88 63.81 55.5 23.11 44.43 88.6 75.4
## 3 1.63 57.02 16.7 16.02 38.06 96.2 99
## 4 -1.59 52.9 14.4 102.73 56.85 96.2 100
## 5 7.52 59.68 55.1 1251.84 31.94 57.7 86.5
## 6 -2.2 62.22 10.8 16.09 <NA> 93.5 98.9
## million tb urb.pov electric pollution urban.pov.hc primary health.exp
## 1 13.97 189 27.6 98.7 48.01676 27.6 <NA> 63.87641
## 2 24.55 370 <NA> 51 36.39543 <NA> 84.01231 23.96155
## 3 43.94 25 13.6 <NA> 13.44397 4.7 99.34679 30.72721
## 4 35.57 41 <NA> 100 25.50769 30 96.07425 53.51334
## 5 14.66 225 <NA> 90.7 89.39291 21.3 90.50861 66.97587
## 6 <NA> 25 4.7 100 27.04049 <NA> 96.14116 23.00839
## tb.cure case.d diarrhea.trt imm.dpt mat.mort nurse.mw beds ari
## 1 87 58 40.7 65 1291 0.36 0.5 61.5
## 2 34 64 <NA> 64 <NA> <NA> <NA> <NA>
## 3 52 87 59.1 92 32.4 <NA> 4.7 94.3
## 4 78 89 <NA> 94 19 4.994 3.9 56.8
## 5 93 57 66.1 97 210 0.213 0.6 42
## 6 35 87 42.5 95 45 1.959 1.1 82.2
## lmic
## 1 Low income
## 2 Lower middle income
## 3 Upper middle income
## 4 Lower middle income
## 5 Lower middle income
## 6 Upper middle income
EXERCISE: Replace the proportion living in slums variable with the mean of its region
EXERCISE: Get the cumulative mean incidence of TB as you go from low GDP to high GDP. (hint: filter, arrange, mutate)
CHALLENGE EXERCISE: Plot the cumulative mean incidence of TB as you go from low to high GDP (hint: ?reorder)