4.3 Select

select() lets you pick which variables (columns) of a data frame you'd like to keep and look at.

Here are some 'use cases' for select(). I append head() to keep the output short; feel free to delete it.

#--- View the names of the columns of the sdg dataset
names(sdg)

##  [1] "country"      "code"         "reg"          "gdp"         
##  [5] "gini"         "pop"          "delta.pop"    "int.migrant" 
##  [9] "urb"          "delta.urb"    "emp.ratio"    "slums"       
## [13] "pop.density"  "largest.city" "sanitation"   "water"       
## [17] "million"      "tb"           "urb.pov"      "electric"    
## [21] "pollution"    "urban.pov.hc" "primary"      "health.exp"  
## [25] "tb.cure"      "case.d"       "diarrhea.trt" "imm.dpt"     
## [29] "mat.mort"     "nurse.mw"     "beds"         "ari"         
## [33] "lmic"

#--- Select only the TB column
sdg %>% select(tb) %>% head()

##    tb
## 1 189
## 2 370
## 3  25
## 4  41
## 5 225
## 6  25

#--- Select every column between TB and TB Case Detection Rate
sdg %>% select(tb:case.d) %>% head()

##    tb urb.pov electric pollution urban.pov.hc  primary health.exp tb.cure
## 1 189    27.6     98.7  48.01676         27.6       NA   63.87641      87
## 2 370      NA     51.0  36.39543           NA 84.01231   23.96155      34
## 3  25    13.6       NA  13.44397          4.7 99.34679   30.72721      52
## 4  41      NA    100.0  25.50769         30.0 96.07425   53.51334      78
## 5 225      NA     90.7  89.39291         21.3 90.50861   66.97587      93
## 6  25     4.7    100.0  27.04049           NA 96.14116   23.00839      35
##   case.d
## 1     58
## 2     64
## 3     87
## 4     89
## 5     57
## 6     87
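
Note that a range such as tb:case.d is positional: it keeps every column sitting between the two endpoints in the data frame's current column order, endpoints included. A minimal sketch on a made-up data frame (toy and its columns a to d are hypothetical and not part of sdg):

```r
library(dplyr)

# Hypothetical toy data frame, used only to illustrate range selection
toy <- data.frame(a = 1:2, b = 3:4, c = 5:6, d = 7:8)

# b:d keeps b, c, and d because c sits between them
range_cols <- toy %>% select(b:d) %>% names()

# After reordering the columns, the same range covers different columns
reordered <- toy %>% select(a, b, d, c)
range_cols_reordered <- reordered %>% select(b:d) %>% names()

print(range_cols)            # "b" "c" "d"
print(range_cols_reordered)  # "b" "d"
```

If the column order of a dataset might change, listing the columns by name is the safer choice.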

#--- Select all the TB related columns and the country column
sdg %>% select(country, tb, tb.cure, case.d) %>% head()

##       country  tb tb.cure case.d
## 1 Afghanistan 189      87     58
## 2      Angola 370      34     64
## 3   Argentina  25      52     87
## 4     Armenia  41      78     89
## 5  Bangladesh 225      93     57
## 6      Belize  25      35     87

#--- Select all the columns that are numeric
sdg %>% select_if(is.numeric) %>% head()

##          gdp gini       pop delta.pop int.migrant   urb delta.urb
## 1   561.7787   NA  34656032     33.84        1.18 27.13      3.90
## 2  3110.8080   NA  28813463     42.20        0.43 44.82      7.88
## 3 12449.2200 42.7  43847430     10.84        4.81 91.89      1.63
## 4  3606.1520 32.4   2924816     -1.14        6.34 62.56     -1.59
## 5  1358.7800   NA 163000000     12.10        0.88 35.04      7.52
## 6  4810.5660   NA    366954     26.21       14.99 43.85     -2.20
##   emp.ratio slums pop.density largest.city sanitation water million  tb
## 1     48.05  62.7       53.08        51.49       45.1  78.2   13.97 189
## 2     63.81  55.5       23.11        44.43       88.6  75.4   24.55 370
## 3     57.02  16.7       16.02        38.06       96.2  99.0   43.94  25
## 4     52.90  14.4      102.73        56.85       96.2 100.0   35.57  41
## 5     59.68  55.1     1251.84        31.94       57.7  86.5   14.66 225
## 6     62.22  10.8       16.09           NA       93.5  98.9      NA  25
##   urb.pov electric pollution urban.pov.hc  primary health.exp tb.cure
## 1    27.6     98.7  48.01676         27.6       NA   63.87641      87
## 2      NA     51.0  36.39543           NA 84.01231   23.96155      34
## 3    13.6       NA  13.44397          4.7 99.34679   30.72721      52
## 4      NA    100.0  25.50769         30.0 96.07425   53.51334      78
## 5      NA     90.7  89.39291         21.3 90.50861   66.97587      93
## 6     4.7    100.0  27.04049           NA 96.14116   23.00839      35
##   case.d diarrhea.trt imm.dpt mat.mort nurse.mw beds  ari
## 1     58         40.7      65   1291.0    0.360  0.5 61.5
## 2     64           NA      64       NA       NA   NA   NA
## 3     87         59.1      92     32.4       NA  4.7 94.3
## 4     89           NA      94     19.0    4.994  3.9 56.8
## 5     57         66.1      97    210.0    0.213  0.6 42.0
## 6     87         42.5      95     45.0    1.959  1.1 82.2

Many tidyverse functions come with small variants that handle specific tasks. We see this in action with select_if(), whose purpose is intuitive: select the columns that meet a particular criterion (here, being numeric).
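
In dplyr 1.0.0 and later, select_if() is marked as superseded, and the same selection can be written with the where() helper inside select(). A minimal sketch, assuming a reasonably recent dplyr and using a small made-up data frame (toy and its values are illustrative and not taken from sdg):

```r
library(dplyr)

# Hypothetical toy data frame with one character and two numeric columns
toy <- data.frame(
  country = c("A", "B", "C"),
  tb      = c(189, 370, 25),
  tb.cure = c(87, 34, 52),
  stringsAsFactors = FALSE
)

# Scoped variant, as used above
old_style <- toy %>% select_if(is.numeric)

# where() applies the predicate to each column and keeps those returning TRUE
new_style <- toy %>% select(where(is.numeric))

print(names(old_style))
print(names(new_style))
```

Both calls should keep the same columns here; the where() form has the advantage that it can be combined with other selections in a single select() call, e.g. select(country, where(is.numeric)).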

EXERCISE: Extract all columns that have to do with population and save them as a data frame called ‘pop’.

EXERCISE: Extract all columns that start with the letter ‘u’ (hint: ?select).

EXERCISE: Drop the lmic column.

CHALLENGE EXERCISE: Move the gdp column to the front of the data frame, move the tb column to the back, and drop the urb, urb.pov, and urban.pov.hc variables.