4.5 Mutate & Group By

We are going to introduce two new functions at once now.

mutate() is a function you will use a lot. It is used any time you wish to create a new variable. It comes in two main flavours: mutate() and transmute(). mutate() creates a new variable and preserves the existing one, while transmute() replaces the variable. mutate() returns the complete dataset, while transmute() returns only the variable that you transmuted.

Often we wish to conduct a particular action only split by some group in the data. Grouping doesn’t change the look of the data, but changes how the other functions work on the data.

4.5.1 Mutate

Let’s say we wish to look at gdp on a logarithmic scale. This is easily doable with mutate. Note the slightly different syntax, where you have to express the name of the new variable and assign using a single ‘=’ instead of the assignment operator (<-).

sdg %>% mutate(log.gdp = log(gdp)) %>% select(country, gdp, log.gdp) %>% head()

##       country        gdp  log.gdp
## 1 Afghanistan   561.7787 6.331108
## 2      Angola  3110.8080 8.042638
## 3   Argentina 12449.2200 9.429413
## 4     Armenia  3606.1520 8.190397
## 5  Bangladesh  1358.7800 7.214343
## 6      Belize  4810.5660 8.478570

So we don’t have to scroll all the way through the dataset to see these values, we combine mutate() and select() to extract the changed columns.

sdg %>% mutate(log.gdp = log(gdp)) %>% 
  select(country, gdp, log.gdp) %>% head()

##       country        gdp  log.gdp
## 1 Afghanistan   561.7787 6.331108
## 2      Angola  3110.8080 8.042638
## 3   Argentina 12449.2200 9.429413
## 4     Armenia  3606.1520 8.190397
## 5  Bangladesh  1358.7800 7.214343
## 6      Belize  4810.5660 8.478570

Sometimes we might wish to do more complex transformations. What if we wanted to look at the difference in GDP from the mean?

We add na.rm to the mean() function to tell R it should ignore missing values.

sdg %>% mutate(diff.gdp = gdp - mean(gdp, na.rm = T)) %>% 
  select(country, gdp, diff.gdp) %>% head()

##       country        gdp  diff.gdp
## 1 Afghanistan   561.7787 -15043.51
## 2      Angola  3110.8080 -12494.48
## 3   Argentina 12449.2200  -3156.07
## 4     Armenia  3606.1520 -11999.14
## 5  Bangladesh  1358.7800 -14246.51
## 6      Belize  4810.5660 -10794.72

To make a fairer comparison, we might want to look within each WHO region to determine how far each GDP is from its regional mean. We do this by adding a group_by() statement.

sdg %>% group_by(reg) %>% 
  mutate(diff.gdp = gdp - mean(gdp, na.rm = T)) %>% 
  select(country, gdp, diff.gdp) %>% head()

## Adding missing grouping variables: `reg`

## # A tibble: 6 x 4
## # Groups:   reg [5]
##   reg   country        gdp diff.gdp
##   <fct> <fct>        <dbl>    <dbl>
## 1 EMR   Afghanistan   562.  -10447.
## 2 AFR   Angola       3111.    1005.
## 3 AMR   Argentina   12449.   -1972.
## 4 EUR   Armenia      3606.  -27290.
## 5 SEA   Bangladesh   1359.   -1737.
## 6 AMR   Belize       4811.   -9611.

The possibilities of mutate() are pretty endless. Here are a few use cases. We will look at recoding data with mutate() in the Tidying Data section.

#--- Get the percentile rank of GDP
sdg %>% arrange(gdp) %>% 
  mutate(perc = percent_rank(gdp)) %>% 
  select(country, gdp, perc) %>% 
  head()

##                    country      gdp        perc
## 1                  Burundi 285.7274 0.000000000
## 2                   Malawi 300.7948 0.004926108
## 3                    Niger 363.2270 0.009852217
## 4               Mozambique 382.0693 0.014778325
## 5 Central African Republic 382.2132 0.019704433
## 6               Madagascar 401.3189 0.024630542

#--- Divide TB incidence into quintiles
sdg %>% mutate(tb.quint = ntile(tb, 5)) %>% 
  select(country, tb, tb.quint) %>% 
  arrange(tb.quint) %>% 
  head()

##                  country  tb tb.quint
## 1                 Jordan 7.0        1
## 2    Antigua and Barbuda 7.5        1
## 3               Barbados 0.0        1
## 4                Bermuda 0.0        1
## 5 British Virgin Islands 0.0        1
## 6                 Canada 5.1        1

#--- Create a z score for proportion living in slums by region
# A z-score shows where the country sits on a bell curve of all recorded slum values
# The mean is at 0 and the standard deviation is 1 - more about this on STEPH!
sdg %>% group_by(reg) %>%  
  filter(!is.na(slums)) %>% 
  mutate(z.slums = (slums - mean(slums))/sd(slums)) %>% 
  select(country:reg, slums, z.slums) %>% 
  arrange(reg, desc(z.slums)) %>% 
  head()

## # A tibble: 6 x 5
## # Groups:   reg [1]
##   country                  code  reg   slums z.slums
##   <fct>                    <fct> <fct> <dbl>   <dbl>
## 1 South Sudan              SSD   AFR    95.6    1.97
## 2 Central African Republic CAF   AFR    93.3    1.84
## 3 Chad                     TCD   AFR    88.2    1.57
## 4 Sao Tome and Principe    STP   AFR    86.6    1.49
## 5 Guinea-Bissau            GNB   AFR    82.3    1.26
## 6 Mozambique               MOZ   AFR    80.3    1.15

#--- Convert all numeric variables to factor variables
sdg %>% mutate_if(is.numeric, as.factor) %>% head()

##       country code reg      gdp gini       pop delta.pop int.migrant   urb
## 1 Afghanistan  AFG EMR 561.7787 <NA>  34656032     33.84        1.18 27.13
## 2      Angola  AGO AFR 3110.808 <NA>  28813463      42.2        0.43 44.82
## 3   Argentina  ARG AMR 12449.22 42.7  43847430     10.84        4.81 91.89
## 4     Armenia  ARM EUR 3606.152 32.4   2924816     -1.14        6.34 62.56
## 5  Bangladesh  BGD SEA  1358.78 <NA> 163000000      12.1        0.88 35.04
## 6      Belize  BLZ AMR 4810.566 <NA>    366954     26.21       14.99 43.85
##   delta.urb emp.ratio slums pop.density largest.city sanitation water
## 1       3.9     48.05  62.7       53.08        51.49       45.1  78.2
## 2      7.88     63.81  55.5       23.11        44.43       88.6  75.4
## 3      1.63     57.02  16.7       16.02        38.06       96.2    99
## 4     -1.59      52.9  14.4      102.73        56.85       96.2   100
## 5      7.52     59.68  55.1     1251.84        31.94       57.7  86.5
## 6      -2.2     62.22  10.8       16.09         <NA>       93.5  98.9
##   million  tb urb.pov electric pollution urban.pov.hc  primary health.exp
## 1   13.97 189    27.6     98.7  48.01676         27.6     <NA>   63.87641
## 2   24.55 370    <NA>       51  36.39543         <NA> 84.01231   23.96155
## 3   43.94  25    13.6     <NA>  13.44397          4.7 99.34679   30.72721
## 4   35.57  41    <NA>      100  25.50769           30 96.07425   53.51334
## 5   14.66 225    <NA>     90.7  89.39291         21.3 90.50861   66.97587
## 6    <NA>  25     4.7      100  27.04049         <NA> 96.14116   23.00839
##   tb.cure case.d diarrhea.trt imm.dpt mat.mort nurse.mw beds  ari
## 1      87     58         40.7      65     1291     0.36  0.5 61.5
## 2      34     64         <NA>      64     <NA>     <NA> <NA> <NA>
## 3      52     87         59.1      92     32.4     <NA>  4.7 94.3
## 4      78     89         <NA>      94       19    4.994  3.9 56.8
## 5      93     57         66.1      97      210    0.213  0.6   42
## 6      35     87         42.5      95       45    1.959  1.1 82.2
##                  lmic
## 1          Low income
## 2 Lower middle income
## 3 Upper middle income
## 4 Lower middle income
## 5 Lower middle income
## 6 Upper middle income

EXERCISE: Replace the proportion living in slums variable with the mean of its region

EXERCISE: Get the cumulative mean incidence of TB as you go from low GDP to high GDP. (hint: filter, arrange, mutate)

CHALLENGE EXERCISE: Plot the cumulative mean incidence of TB as you go from low to high GDP (hint: ?reorder)