4.2 Select a subset of observations
To limit your dataset to a subset of observations in base R, use brackets [ ] or subset(). With brackets you can subset based on row numbers, row names, or a logical expression. With subset(), you must use a logical expression. Selecting a subset of observations is called filtering.
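The three bracket approaches can be sketched side by side. This is a minimal illustration on a small, made-up data frame (the name toy, its row names, and its values are hypothetical and are not part of mydat):

```r
# Hypothetical toy data; row names and values are made up for illustration
toy <- data.frame(Age = c(70, 52, 66, 41),
                  Sex = c("F", "M", "F", "M"),
                  row.names = c("id01", "id02", "id03", "id04"))

# Row numbers: keep the first two rows
by_number <- toy[1:2, ]

# Row names: keep the rows named "id01" and "id03"
by_name <- toy[c("id01", "id03"), ]

# Logical expression: keep rows where Age is 65 or older
by_logical <- toy[toy$Age >= 65, ]

rownames(by_number)
## [1] "id01" "id02"
rownames(by_name)
## [1] "id01" "id03"
rownames(by_logical)
## [1] "id01" "id03"
```

In each case the comma after the row selector is required; leaving the column position empty keeps all columns.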
NOTE: Filtering allows you to implement inclusion and exclusion criteria. To exclude, either use the opposite logical statement or use the “not” operator !. For example, if you want to exclude those age 65 years or older, you could use either Age < 65 or !(Age >= 65).
# Number of rows in unfiltered data
nrow(mydat)
## [1] 530
# Keep only a subset of rows using row numbers
subdat <- mydat[1:10, ]
# Number of rows in filtered data
nrow(subdat)
## [1] 10
# Exclude using a minus sign
subdat <- mydat[-(1:10), ]
nrow(subdat)
## [1] 520
# Keep a subset of rows using a logical expression
subdat <- mydat[mydat$Age >= 65, ]
nrow(subdat)
## [1] 155
# The number of rows is the same as the sum of observations
# where the condition was TRUE
sum(mydat$Age >= 65, na.rm = TRUE)
## [1] 155
# To make it an exclusion, either reverse the logical
# expression or use !
subdat <- mydat[mydat$Age < 65, ]
nrow(subdat)
## [1] 375
subdat <- mydat[!(mydat$Age >= 65), ]
nrow(subdat)
## [1] 375
# Same thing but using subset
# Have to use a logical expression - will not work with row numbers
# Inclusion
subdat <- subset(mydat, subset = Age >= 65)
nrow(subdat)
## [1] 155
# Exclusion using !
subdat <- subset(mydat, subset = !(Age >= 65))
nrow(subdat)
## [1] 375
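In the examples above, brackets and subset() return the same number of rows, which is what you would expect if Age has no missing values among the rows of mydat. The two approaches can differ when the filtering variable contains NA. The following is a minimal sketch on a hypothetical toy data frame (toy and its values are made up for illustration, not taken from mydat):

```r
# Hypothetical toy data with one missing age
toy <- data.frame(ID  = 1:5,
                  Age = c(70, 52, NA, 66, 41))

# Brackets: an NA in the logical vector yields a row filled with NA
with_brackets <- toy[toy$Age >= 65, ]
nrow(with_brackets)
## [1] 3

# Wrapping the condition in which() keeps only rows where it is TRUE
with_which <- toy[which(toy$Age >= 65), ]
nrow(with_which)
## [1] 2

# subset() treats NA in the condition as FALSE
with_subset <- subset(toy, subset = Age >= 65)
nrow(with_subset)
## [1] 2
```

If your filtering variable may have missing values, comparing nrow() of the result with sum(condition, na.rm = TRUE), as done above for mydat, is a quick way to check that no all-NA rows were introduced.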
In the tidyverse, use filter().
# Number of rows in unfiltered data
nrow(mydat_tibble)
# Keep only a subset of rows using a logical expression
subdat <- mydat_tibble %>%
  filter(Age >= 65)
nrow(subdat)
# Exclusion using !
subdat <- mydat_tibble %>%
  filter(!(Age >= 65))
nrow(subdat)
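Since no output is shown for the filter() calls above, here is a self-contained sketch with known row counts. It assumes the dplyr package is installed, and toy_tibble and its values are hypothetical, made up for illustration only:

```r
library(dplyr)

# Hypothetical toy tibble with one missing age
toy_tibble <- tibble(ID  = 1:5,
                     Age = c(70, 52, NA, 66, 41),
                     Sex = c("F", "M", "F", "M", "F"))

# Inclusion: keep those age 65 or older
older <- toy_tibble %>%
  filter(Age >= 65)
nrow(older)
## [1] 2

# Exclusion using !; the row with missing Age is dropped here too,
# because filter() keeps only rows where the condition is TRUE
younger <- toy_tibble %>%
  filter(!(Age >= 65))
nrow(younger)
## [1] 2

# Conditions separated by commas are combined with "and"
older_women <- toy_tibble %>%
  filter(Age >= 65, Sex == "F")
nrow(older_women)
## [1] 1
```

Note that the inclusion and exclusion counts sum to 4 rather than 5: the observation with a missing Age satisfies neither condition, so it appears in neither subset.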