4 Sub-setting vectors

A very common task in data wrangling is filtering or ‘sub-setting’ down to a smaller set of potentially interesting values.

This short chapter is intended to give you a basic understanding of the operations that R performs when filtering and evaluating data using the functions introduced in the next chapter (Chapter 5). Note that you do not need to have the tidyverse package loaded to complete this chapter: we are using ‘base R’ functions.

First create a new .R file, and save it as ‘Week_2_tidyverse.R’ in the WEHI_tidyR_course folder on your Desktop.


[Image: The Beatles. Source: wegotthiscovered.com/wp-content/uploads/the-beatles.jpg]

To start, we will create a vector named ‘beatles’ containing the years of birth of The Beatles:

John Lennon: 1940, Ringo Starr: 1940, Paul McCartney: 1942 and George Harrison: 1943.

beatles <- c(1940,1940,1942,1943)

4.1 Subset by position

To home in on different values in this vector we can request them by their position, numbered 1 through 4. We give the position of the value we want in square brackets. To print the value in position 1:

beatles[1]
## [1] 1940

Importantly, we can subset the values at positions 2 and 3 by supplying a vector of numbers defining those positions.

beatles[c(2,3)]
## [1] 1940 1942

We can also request everything but the value at a certain position, using the minus sign:

beatles[-3]
## [1] 1940 1940 1943
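As a minimal sketch of the same idea, positions can also be given as a range, or several positions can be dropped at once. The vector is re-created here so that the code runs on its own; the `2:4` shorthand and `length()` are standard base R, but they are not used elsewhere in this chapter.

```r
# Re-create the vector so this example is self-contained
beatles <- c(1940, 1940, 1942, 1943)

# A range of positions: 2:4 is shorthand for c(2, 3, 4)
beatles[2:4]
## [1] 1940 1942 1943

# Drop several positions at once with a negative vector
beatles[-c(1, 2)]
## [1] 1942 1943

# length() gives the number of values, so this returns the last one
beatles[length(beatles)]
## [1] 1943
```

Note that positive and negative positions cannot be mixed in the same set of square brackets.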

4.2 Adding names

To add additional information to the values we can give each a name, supplied as a vector of words:

names(beatles) <- c('John','Ringo','Paul','George')

Notice how the beatles variable has now changed slightly in the Environment panel.

From now on, every time a value is returned from the beatles vector, the name of the Beatle associated with that value is also returned.

beatles[-3]
##   John  Ringo George 
##   1940   1940   1943
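Once a vector has names, those names can themselves be used inside the square brackets. This is a short sketch of that base R behaviour, which is not covered in the text above; the vector and its names are re-created so the code runs on its own.

```r
# Re-create the named vector so this example is self-contained
beatles <- c(1940, 1940, 1942, 1943)
names(beatles) <- c('John', 'Ringo', 'Paul', 'George')

# Request a value by its name instead of its position
beatles['Paul']
## Paul 
## 1942

# Several names can be supplied as a character vector
beatles[c('John', 'George')]
##   John George 
##   1940   1943
```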

4.3 Subset by condition

In addition to sub-setting a vector by the position of values, we can ‘pose questions’ to R about the set of values. R answers with TRUE or FALSE for each value.

‘Which Beatles were born in 1940?’ To code this we ask for values ‘exactly equal to’ 1940, using the == operator.

beatles == 1940
##   John  Ringo   Paul George 
##   TRUE   TRUE  FALSE  FALSE

R returns ‘TRUE’ for values that satisfy our ‘condition’, and FALSE for those that don’t.

‘Which member(s) were born before 1943?’ To code this we use the less-than sign, <.

beatles < 1943
##   John  Ringo   Paul George 
##   TRUE   TRUE   TRUE  FALSE

R is assessing the value at each position, and returning an answer to our conditional question.
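The other comparison operators work the same way, one position at a time. The sketch below uses `>=` (‘greater than or equal to’) and also counts the TRUE answers with `sum()`, which treats each TRUE as 1. Neither appears in the text above; they are included only as an illustration.

```r
# Re-create the named vector so this example is self-contained
beatles <- c(1940, 1940, 1942, 1943)
names(beatles) <- c('John', 'Ringo', 'Paul', 'George')

# 'Which Beatles were born in 1942 or later?'
beatles >= 1942
##   John  Ringo   Paul George 
##  FALSE  FALSE   TRUE   TRUE

# 'How many Beatles were born in 1940?' Each TRUE counts as 1
sum(beatles == 1940)
## [1] 2
```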

One way to subset this vector is to provide a vector of TRUE and FALSE values within the square brackets, in a similar manner to specifying the positions, above.

beatles[c(TRUE,FALSE,FALSE,TRUE)]
##   John George 
##   1940   1943

Note that the only values returned are in the ‘TRUE’ positions, in this case the values at position 1 and 4.

We can see that the process of sub-setting depends on the presence of ‘TRUE’ or ‘FALSE’ at each position along the vector.

In most cases when sub-setting data, we want the values themselves, rather than the TRUE/FALSE evaluations.
Now that we know that i) conditional requests return TRUE/FALSE values, and ii) TRUE/FALSE values are the basis of sub-setting vectors, we can replace the TRUE/FALSE vector in the brackets above with a conditional statement:

beatles[beatles > 1940]
##   Paul George 
##   1942   1943

To check this, try running just the code within the square brackets.
This is a good moment to mention that R evaluates code from the innermost brackets outwards. Here, the conditional statement is evaluated first and produces a vector of four TRUE/FALSE values. This logical vector is then used to sub-set the original vector, returning a subset of named numeric values.
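To make the two steps visible, the sketch below stores the result of the inner condition in its own variable before using it to sub-set. The variable name `born_after_1940` is made up for this illustration.

```r
# Re-create the named vector so this example is self-contained
beatles <- c(1940, 1940, 1942, 1943)
names(beatles) <- c('John', 'Ringo', 'Paul', 'George')

# Step 1: the inner condition is evaluated, giving a logical vector
# ('born_after_1940' is a hypothetical name used only here)
born_after_1940 <- beatles > 1940
born_after_1940
##   John  Ringo   Paul George 
##  FALSE  FALSE   TRUE   TRUE

# Step 2: the logical vector is used to sub-set the original vector
beatles[born_after_1940]
##   Paul George 
##   1942   1943
```

The result is the same as running `beatles[beatles > 1940]` in a single line.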

‘Which Beatles were not born in 1942?’ To answer this we use the != operator, meaning ‘not equal to’. The ! symbol ‘negates’, or inverts, the condition:

beatles[beatles != 1942]
##   John  Ringo George 
##   1940   1940   1943
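The ! symbol can also be placed in front of a whole condition to flip every TRUE to FALSE and vice versa. The sketch below is one way to see that this gives the same answer as !=; the extra round brackets are there so the comparison is clearly evaluated before it is inverted.

```r
# Re-create the named vector so this example is self-contained
beatles <- c(1940, 1940, 1942, 1943)
names(beatles) <- c('John', 'Ringo', 'Paul', 'George')

# Invert the answer to 'was this Beatle born in 1942?'
!(beatles == 1942)
##   John  Ringo   Paul George 
##   TRUE   TRUE  FALSE   TRUE

# Using the inverted condition to sub-set gives the same result as !=
beatles[!(beatles == 1942)]
##   John  Ringo George 
##   1940   1940   1943
```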

To get an even more succinct answer, we could request only the names associated with the numeric values:

names(beatles[beatles != 1942])
## [1] "John"   "Ringo"  "George"

You will use these types of conditional statements regularly in the next chapter.

4.4 Challenge

Create three different commands to return information about the Beatles born before 1943. You can use positional information, and/or conditional requests.







4.5 Possible solutions

beatles[beatles < 1943]
##  John Ringo  Paul 
##  1940  1940  1942
beatles[beatles != 1943]
##  John Ringo  Paul 
##  1940  1940  1942
beatles[-4]
##  John Ringo  Paul 
##  1940  1940  1942
beatles[c(1,2,3)]
##  John Ringo  Paul 
##  1940  1940  1942
beatles[names(beatles) != 'George']
##  John Ringo  Paul 
##  1940  1940  1942