4.2 Select a subset of observations

To limit your dataset to a subset of observations in base R, use brackets [ ] or subset(). With brackets, you can subset by row number, row name, or logical expression. With subset(), you must use a logical expression. Selecting a subset of observations (rows) is called filtering.
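Subsetting by row name can be sketched with a small made-up example. The data frame toy and its row names are hypothetical, invented only for illustration; they are not part of the dataset used in this chapter.

```r
# Hypothetical toy data frame with row names (not the chapter's mydat)
toy <- data.frame(Age = c(70, 45, 66, 30),
                  row.names = c("id01", "id02", "id03", "id04"))

# Keep rows by name. The empty slot after the comma means "all columns";
# drop = FALSE keeps the result a data frame even though there is one column
by_name <- toy[c("id01", "id03"), , drop = FALSE]

nrow(by_name)
## [1] 2
rownames(by_name)
## [1] "id01" "id03"
```

Without drop = FALSE, subsetting a one-column data frame this way returns a plain vector, so nrow() would no longer apply.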

NOTE: Filtering allows you to implement inclusion and exclusion criteria. To exclude, either use the opposite logical expression or use the “not” operator !. For example, to exclude those aged 65 years or older, you could use either Age < 65 or !(Age >= 65).

# Number of rows in unfiltered data
nrow(mydat)
## [1] 530
# Keep only a subset of rows using row numbers
subdat <- mydat[1:10, ]

# Number of rows in filtered data
nrow(subdat)
## [1] 10
# Exclude using a minus sign
subdat <- mydat[-(1:10), ]
nrow(subdat)
## [1] 520
# Keep a subset of rows using a logical expression
subdat <- mydat[mydat$Age >= 65, ]

nrow(subdat)
## [1] 155
# The number of rows equals the number of observations
# where the condition is TRUE (this holds when Age has no missing values)
sum(mydat$Age >= 65, na.rm = TRUE)
## [1] 155
# To make it an exclusion, either reverse the logical
# expression or use !
subdat <- mydat[mydat$Age < 65, ]
nrow(subdat)
## [1] 375
subdat <- mydat[!(mydat$Age >= 65), ]
nrow(subdat)
## [1] 375
# Same thing but using subset
# Have to use a logical expression - will not work with row numbers
# Inclusion
subdat <- subset(mydat, subset = Age >= 65)
nrow(subdat)
## [1] 155
# Exclusion using !
subdat <- subset(mydat, subset = !(Age >= 65))
nrow(subdat)
## [1] 375
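One difference between brackets and subset() shows up when the filtering variable has missing values. The following is a minimal sketch using a hypothetical toy data frame (not the mydat used above, where the counts agree): with brackets, an NA in the logical expression yields a row of NAs, whereas subset() drops such rows.

```r
# Hypothetical toy data with one missing age (not the chapter's mydat)
toy <- data.frame(id = 1:5, Age = c(70, NA, 66, 30, 81))

# Brackets: the NA in the condition produces an extra row of NAs
with_brackets <- toy[toy$Age >= 65, ]
nrow(with_brackets)
## [1] 4

# subset(): rows where the condition is NA are dropped
with_subset <- subset(toy, subset = Age >= 65)
nrow(with_subset)
## [1] 3

# which() returns only the positions that are TRUE, so wrapping the
# condition in which() makes brackets match subset()
with_which <- toy[which(toy$Age >= 65), ]
nrow(with_which)
## [1] 3
```

If a row count from brackets is larger than the corresponding sum(..., na.rm = TRUE), missing values in the condition are a likely cause.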

In the tidyverse, use filter() from the dplyr package.

# Number of rows in unfiltered data
nrow(mydat_tibble)

# Keep only a subset of rows using a logical expression
subdat <- mydat_tibble %>% 
  filter(Age >= 65)
nrow(subdat)

# Exclusion using !
subdat <- mydat_tibble %>% 
  filter(!(Age >= 65))
nrow(subdat)
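As a self-contained sketch of the same ideas, the example below uses a hypothetical toy tibble (not mydat_tibble) so the row counts can be checked by eye. It also shows slice(), the dplyr function for selecting rows by position, since filter() accepts only logical conditions.

```r
library(dplyr)

# Hypothetical toy tibble with one missing age (not the chapter's mydat_tibble)
toy <- tibble(id = 1:5, Age = c(70, NA, 66, 30, 81))

# filter() keeps rows where the condition is TRUE; NA rows are dropped
older <- toy %>%
  filter(Age >= 65)
nrow(older)
## [1] 3

# Exclusion using !; the row with missing Age is dropped here too
younger <- toy %>%
  filter(!(Age >= 65))
nrow(younger)
## [1] 1

# slice() selects rows by position, like toy[1:2, ] in base R
first_two <- toy %>%
  slice(1:2)
nrow(first_two)
## [1] 2

# Negative positions exclude rows, like toy[-(1:2), ]
rest <- toy %>%
  slice(-(1:2))
nrow(rest)
## [1] 3
```

Note that the inclusion and exclusion counts sum to 4 rather than 5 here, because the row with a missing Age satisfies neither condition.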