4.3 Select
select() allows you to pick which variables (columns) you’d like to look at.
Here are some ‘use cases’ for select(). I append head() to keep the output short (it prints only the first six rows); feel free to delete it.
#--- View the names of the columns of the sdg dataset
names(sdg)
## [1] "country" "code" "reg" "gdp"
## [5] "gini" "pop" "delta.pop" "int.migrant"
## [9] "urb" "delta.urb" "emp.ratio" "slums"
## [13] "pop.density" "largest.city" "sanitation" "water"
## [17] "million" "tb" "urb.pov" "electric"
## [21] "pollution" "urban.pov.hc" "primary" "health.exp"
## [25] "tb.cure" "case.d" "diarrhea.trt" "imm.dpt"
## [29] "mat.mort" "nurse.mw" "beds" "ari"
## [33] "lmic"
#--- Select only the TB column
sdg %>% select(tb) %>% head()
## tb
## 1 189
## 2 370
## 3 25
## 4 41
## 5 225
## 6 25
#--- Select every column from tb through case.d (TB Case Detection Rate), inclusive
sdg %>% select(tb:case.d) %>% head()
## tb urb.pov electric pollution urban.pov.hc primary health.exp tb.cure
## 1 189 27.6 98.7 48.01676 27.6 NA 63.87641 87
## 2 370 NA 51.0 36.39543 NA 84.01231 23.96155 34
## 3 25 13.6 NA 13.44397 4.7 99.34679 30.72721 52
## 4 41 NA 100.0 25.50769 30.0 96.07425 53.51334 78
## 5 225 NA 90.7 89.39291 21.3 90.50861 66.97587 93
## 6 25 4.7 100.0 27.04049 NA 96.14116 23.00839 35
## case.d
## 1 58
## 2 64
## 3 87
## 4 89
## 5 57
## 6 87
#--- Select all the TB related columns and the country column
sdg %>% select(country, tb, tb.cure, case.d) %>% head()
## country tb tb.cure case.d
## 1 Afghanistan 189 87 58
## 2 Angola 370 34 64
## 3 Argentina 25 52 87
## 4 Armenia 41 78 89
## 5 Bangladesh 225 93 57
## 6 Belize 25 35 87
#--- Select all the columns that are numeric
sdg %>% select_if(is.numeric) %>% head()
## gdp gini pop delta.pop int.migrant urb delta.urb
## 1 561.7787 NA 34656032 33.84 1.18 27.13 3.90
## 2 3110.8080 NA 28813463 42.20 0.43 44.82 7.88
## 3 12449.2200 42.7 43847430 10.84 4.81 91.89 1.63
## 4 3606.1520 32.4 2924816 -1.14 6.34 62.56 -1.59
## 5 1358.7800 NA 163000000 12.10 0.88 35.04 7.52
## 6 4810.5660 NA 366954 26.21 14.99 43.85 -2.20
## emp.ratio slums pop.density largest.city sanitation water million tb
## 1 48.05 62.7 53.08 51.49 45.1 78.2 13.97 189
## 2 63.81 55.5 23.11 44.43 88.6 75.4 24.55 370
## 3 57.02 16.7 16.02 38.06 96.2 99.0 43.94 25
## 4 52.90 14.4 102.73 56.85 96.2 100.0 35.57 41
## 5 59.68 55.1 1251.84 31.94 57.7 86.5 14.66 225
## 6 62.22 10.8 16.09 NA 93.5 98.9 NA 25
## urb.pov electric pollution urban.pov.hc primary health.exp tb.cure
## 1 27.6 98.7 48.01676 27.6 NA 63.87641 87
## 2 NA 51.0 36.39543 NA 84.01231 23.96155 34
## 3 13.6 NA 13.44397 4.7 99.34679 30.72721 52
## 4 NA 100.0 25.50769 30.0 96.07425 53.51334 78
## 5 NA 90.7 89.39291 21.3 90.50861 66.97587 93
## 6 4.7 100.0 27.04049 NA 96.14116 23.00839 35
## case.d diarrhea.trt imm.dpt mat.mort nurse.mw beds ari
## 1 58 40.7 65 1291.0 0.360 0.5 61.5
## 2 64 NA 64 NA NA NA NA
## 3 87 59.1 92 32.4 NA 4.7 94.3
## 4 89 NA 94 19.0 4.994 3.9 56.8
## 5 57 66.1 97 210.0 0.213 0.6 42.0
## 6 87 42.5 95 45.0 1.959 1.1 82.2
Many tidyverse functions come with small variants that handle specific tasks. We see this in action with select_if(), whose purpose is intuitive: select the columns that meet a particular criterion.
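In dplyr 1.0.0 and later, select_if() is marked as superseded: it still works, but the same task can be written with select() and the where() helper. Here is a minimal sketch of the two forms side by side, assuming a reasonably recent dplyr. The data frame `toy` and its values are made up for illustration and are not part of the sdg dataset.

```r
library(dplyr)

#--- A small made-up data frame (hypothetical values, not the sdg data)
toy <- data.frame(
  country = c("A", "B", "C"),
  tb      = c(189, 370, 25),
  tb.cure = c(87, 34, 52),
  stringsAsFactors = FALSE
)

#--- Variant function: keep the columns for which is.numeric() is TRUE
old_style <- toy %>% select_if(is.numeric)

#--- Same idea with select() and where() (assumes dplyr 1.0.0 or later)
new_style <- toy %>% select(where(is.numeric))

#--- Check that both approaches keep the same columns
identical(names(old_style), names(new_style))
```

Either form should work on sdg as well. The where() version has the advantage that it can be combined with column names and other helpers inside a single select() call.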
EXERCISE: Extract all columns that have to do with population and save them as a data frame called ‘pop’.
EXERCISE: Extract all columns that start with the letter ‘u’ (hint: ?select).
EXERCISE: Drop the lmic column.
CHALLENGE EXERCISE: Move the gdp column to the front of the data frame, move the tb column to the back, and drop the urb, urb.pov, and urban.pov.hc variables.