4 Sub-setting vectors
A very common task in data wrangling is filtering or ‘sub-setting’ down to a smaller set of potentially interesting values.
This short chapter is intended to give you a basic understanding of the operations that R performs when filtering and evaluating data using the functions introduced in Chapter 5. Note that you do not need to load the tidyverse package to complete this chapter; we are using ‘base R’ functions only.
First create a new .R file, and save it as ‘Week_2_tidyverse.R’ in your Desktop WEHI_tidyR_course folder.
To start, we will create a vector named ‘beatles’ containing the years of birth of The Beatles:
John Lennon: 1940, Ringo Starr: 1940, Paul McCartney: 1942 and George Harrison: 1943.
beatles <- c(1940,1940,1942,1943)
4.1 Subset by position
To home in on different values in this vector we can request them by position, numbered 1 through 4. We give the position of the value we want in square brackets. To print the value in position 1:
beatles[1]
## [1] 1940
Importantly, we can subset for the values at positions 2 and 3 by including a vector of numbers defining those positions.
beatles[c(2,3)]
## [1] 1940 1942
We can also request everything but the value at a certain position, using the minus sign:
beatles[-3]
## [1] 1940 1940 1943
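As a small sketch that goes slightly beyond the examples above, positions can also be given as a range using the colon operator, and the minus sign can be applied to a whole vector of positions. The variable names first_two and without_ends are made up for illustration.

```r
# Years of birth of The Beatles, as defined earlier in this chapter
beatles <- c(1940, 1940, 1942, 1943)

# 1:2 is shorthand for c(1, 2): request positions 1 and 2
first_two <- beatles[1:2]
first_two
## [1] 1940 1940

# A minus sign in front of a vector drops all of those positions
without_ends <- beatles[-c(1, 4)]
without_ends
## [1] 1940 1942
```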
4.2 Adding names
To add additional information to the values we can give each a name, supplied as a vector of words:
names(beatles) <- c('John','Ringo','Paul','George')
Notice how the beatles variable has now changed slightly in the Environment panel.
From now on, every time a value is returned from the beatles vector, the name of the Beatle associated with that value is also returned.
beatles[-3]
## John Ringo George
## 1940 1940 1943
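Once a vector has names, base R also allows values to be requested by name rather than by position. The following is a minimal sketch of that idea; the variable names paul and john_george are only for illustration.

```r
# Recreate the named vector from this chapter
beatles <- c(1940, 1940, 1942, 1943)
names(beatles) <- c('John', 'Ringo', 'Paul', 'George')

# Request a single value by its name
paul <- beatles['Paul']
paul
## Paul
## 1942

# Request several values with a vector of names
john_george <- beatles[c('John', 'George')]
john_george
##   John George
##   1940   1943
```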
4.3 Subset by condition
In addition to sub-setting a vector by the position of values, we can ‘pose questions’ to R about the set of values, and R will answer each with TRUE or FALSE.
‘Which Beatles were born in 1940?’ To code this we ask for values ‘exactly equal to’ 1940, using the == sign.
beatles == 1940
## John Ringo Paul George
## TRUE TRUE FALSE FALSE
R returns ‘TRUE’ for values that satisfy our ‘condition’, and FALSE for those that don’t.
‘Which member(s) were born before 1943?’ To code this we use the less-than < sign.
beatles < 1943
## John Ringo Paul George
## TRUE TRUE TRUE FALSE
R is assessing the value at each position, and returning an answer to our conditional question.
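To make the link between these TRUE/FALSE answers and positions more concrete, the sketch below stores the result of a condition and inspects it with two base R functions not otherwise used in this chapter: which(), which reports the positions that are TRUE, and sum(), which counts them. The variable name born_before_1943 is made up for illustration.

```r
# Recreate the named vector from this chapter
beatles <- c(1940, 1940, 1942, 1943)
names(beatles) <- c('John', 'Ringo', 'Paul', 'George')

# The comparison is made at every position and gives one TRUE/FALSE per value
born_before_1943 <- beatles < 1943

# Positions at which the condition is TRUE
true_positions <- which(born_before_1943)

# Number of TRUE values (TRUE counts as 1, FALSE as 0)
n_true <- sum(born_before_1943)
```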
We can also subset this vector by directly providing a vector of TRUE and FALSE values within square brackets, in a similar manner to specifying the positions above.
beatles[c(TRUE,FALSE,FALSE,TRUE)]
## John George
## 1940 1943
Note that the only values returned are in the ‘TRUE’ positions, in this case the values at position 1 and 4.
We can see that the process of sub-setting depends on the presence of ‘TRUE’ or ‘FALSE’ at each position along the vector.
In most cases when sub-setting data, we want the values themselves, rather than the TRUE/FALSE evaluations.
Now that we know that i) conditional requests return TRUE/FALSE values, and ii) TRUE/FALSE values are the basis of sub-setting vectors, we can replace the TRUE/FALSE vector in the brackets above with a conditional statement:
beatles[beatles > 1940]
## Paul George
## 1942 1943
To check this, try running just the code within the square brackets.
It is timely to mention that in R, code is evaluated from the inside of brackets outwards. Here, the conditional statement is evaluated first and produces a vector of four TRUE/FALSE values. This logical vector is then used to sub-set the original vector, returning a subset of named numeric values.
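The inside-to-outside evaluation described above can be sketched by doing the two steps separately. The intermediate variable is_after_1940 and the name result are made up for illustration; they are not needed in practice.

```r
# Recreate the named vector from this chapter
beatles <- c(1940, 1940, 1942, 1943)
names(beatles) <- c('John', 'Ringo', 'Paul', 'George')

# Step 1: the code inside the square brackets gives a logical vector
is_after_1940 <- beatles > 1940
is_after_1940
##   John  Ringo   Paul George
##  FALSE  FALSE   TRUE   TRUE

# Step 2: the logical vector is used to sub-set the original vector
result <- beatles[is_after_1940]
result
##   Paul George
##   1942   1943

# The one-line version returns exactly the same thing
identical(result, beatles[beatles > 1940])
## [1] TRUE
```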
‘Which Beatles were not born in 1942?’ To answer this we use the != (‘not equal to’) sign, in which the ! symbol ‘negates’, or inverts, the condition:
beatles[beatles != 1942]
## John Ringo George
## 1940 1940 1943
To get an even more succinct answer, we could request only the names associated with the numeric values:
names( beatles[ beatles != 1942] )
## [1] "John" "Ringo" "George"
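The ! symbol used in != can also be placed in front of a whole condition to invert every TRUE/FALSE answer. As a minimal sketch, the two requests below should return the same Beatles; the names not_1942_a and not_1942_b are made up for illustration.

```r
# Recreate the named vector from this chapter
beatles <- c(1940, 1940, 1942, 1943)
names(beatles) <- c('John', 'Ringo', 'Paul', 'George')

# 'not equal to' 1942
not_1942_a <- beatles[beatles != 1942]

# Ask 'equal to 1942?', then flip each TRUE/FALSE answer with !
not_1942_b <- beatles[!(beatles == 1942)]

# Both approaches give the same named values
identical(not_1942_a, not_1942_b)
## [1] TRUE
```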
You will use these types of conditional statements regularly in the next chapter.
4.4 Challenge
Create three different commands to return information about the Beatles born before 1943. You can use positional information, and/or conditional requests.
4.5 Possible solutions
beatles[beatles < 1943]
## John Ringo Paul
## 1940 1940 1942
beatles[beatles != 1943]
## John Ringo Paul
## 1940 1940 1942
beatles[-4]
## John Ringo Paul
## 1940 1940 1942
beatles[c(1,2,3)]
## John Ringo Paul
## 1940 1940 1942
beatles[names(beatles) != 'George']
## John Ringo Paul
## 1940 1940 1942