A.3 Looking ahead and back
A lookahead specifies a pattern that must be matched but is not returned. It is written as a subexpression: the opening parenthesis is followed by `?=`, and the text to match follows the `=` sign, as in `(?=...)`. Some refer to this behaviour as “match but not consume”: a lookahead tests the text after what we actually want to extract, a lookbehind tests the text before it, and neither returns the text it tested.
In the following example, we only want to match “my homepage” when it is followed by `</title>`, and we do not want `</title>` in the results:
text <- c("<title>my homepage</title>", "<p>my homepage</p>")
str_extract(text, "my homepage(?=</title>)")
#> [1] "my homepage" NA
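# looking ahead (and back) must be written as a subexpression, i.e. inside parentheses;
# without them, e? is an optional "e" and = is matched as a literal character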
str_extract(text, "my homepage?=</title>")
#> [1] NA NA
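To make “match but not consume” concrete, the short sketch below contrasts the lookahead with a plain pattern that includes the closing tag. It assumes only that stringr is installed; the variable names `with_tag` and `without_tag` are illustrative.

```r
library(stringr)

text <- c("<title>my homepage</title>", "<p>my homepage</p>")

# Plain pattern: the closing tag is consumed, so it is part of the result
with_tag <- str_extract(text, "my homepage</title>")
with_tag
#> [1] "my homepage</title>" NA

# Lookahead: the closing tag must be present, but it is not returned
without_tag <- str_extract(text, "my homepage(?=</title>)")
without_tag
#> [1] "my homepage" NA
```

Both patterns match the same element of `text`; they differ only in whether the tag appears in the extracted string.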
Similarly, `?<=` is interpreted as the lookbehind operator, which specifies a pattern that must come before the text we actually want to extract. Following is an example. A database search lists products, and you need only the prices.
text <- c("ABC01: $23.45",
"HGG42: $5.31",
"CFMX1: $899.00",
"XTC99: $69.96",
"Total items found: 4")
# include the decimal point in the character class so the full price is returned
str_extract(text, "(?<=\\$)[0-9.]+")
#> [1] "23.45" "5.31" "899.00" "69.96" NA
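For comparison, the same prices can be pulled out without a lookbehind by matching the dollar sign and keeping only a capturing group. This is a sketch of an alternative, not a replacement for the approach above; it assumes prices may contain a decimal point (hence `[0-9.]+`), and the name `prices` is illustrative.

```r
library(stringr)

text <- c("ABC01: $23.45",
          "HGG42: $5.31",
          "CFMX1: $899.00",
          "XTC99: $69.96",
          "Total items found: 4")

# str_match() returns a matrix: column 1 is the full match ("$23.45"),
# column 2 is the first capturing group ("23.45")
prices <- str_match(text, "\\$([0-9.]+)")[, 2]
prices
#> [1] "23.45"  "5.31"   "899.00" "69.96"  NA
```

The lookbehind version keeps everything in a single `str_extract()` call, while the capturing group makes it explicit which part of the match is kept.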
Lookahead and lookbehind operations may be combined, as in the following example:
str_extract("<title>my homepage</title>", "(?<=<title>)my homepage(?=</title>)")
#> [1] "my homepage"
Additionally, `(?=)` and `(?<=)` are known as positive lookahead and positive lookbehind. A less commonly used version is the negative form of those two operators, which looks for text that does not match the specified pattern.
class | description |
---|---|
`(?=)` | positive lookahead |
`(?!)` | negative lookahead |
`(?<=)` | positive lookbehind |
`(?<!)` | negative lookbehind |
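As a quick illustration of the negative forms, the sketch below reuses the title example from earlier in this section and inverts both conditions. The names `not_followed` and `not_preceded` are illustrative.

```r
library(stringr)

text <- c("<title>my homepage</title>", "<p>my homepage</p>")

# Negative lookahead: "my homepage" NOT followed by </title>
not_followed <- str_extract(text, "my homepage(?!</title>)")
not_followed
#> [1] NA "my homepage"

# Negative lookbehind: "my homepage" NOT preceded by <title>
not_preceded <- str_extract(text, "(?<!<title>)my homepage")
not_preceded
#> [1] NA "my homepage"
```

In both cases the results are the mirror image of the positive forms: the string inside `<title>` is now rejected and the one inside `<p>` is returned.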
Suppose we want to extract just the quantities but not the prices in the following text: