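20 Regular expressions

20.1 Introduction

You learned the basics of regular expressions in Chapter 19, but regular expressions are really a small language of their own, so it is worth spending more time on them.

This chapter begins by extending your knowledge of patterns, covering six important topics (escaping, anchoring, character classes, shorthand classes, quantifiers, and alternation). Here we will focus mostly on the pattern language itself, not the functions that use it. In other words, we will mostly work with simple vectors, showing the results with str_view() and str_view_all(). You will learn how to apply regular expressions to data frames, either with tidyr functions or by combining dplyr and stringr functions. We will then show some handy strategies for creating complex patterns.

Next we will talk about the important concepts of “grouping” and “capturing”, which give you a new way to extract variables from strings using tidyr::separate_group(). Grouping also lets you use backreferences, which make it possible to match repeated patterns. We will finish with a discussion of the various “flags” that let you adjust how a regular expression behaves.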

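20.1.1 Prerequisites

This chapter uses the regular expressions provided by the stringr package.

It is worth noting that the regular expressions provided by stringr differ slightly from those in base R. This is because stringr is built on top of the stringi package, which in turn is built on the ICU engine, whereas base R functions (like gsub() and grepl()) use either the TRE engine or the PCRE engine. Fortunately, the basics of regular expressions are so well established that you are unlikely to encounter any differences when working with the patterns in this book. (We will point out the cases where a difference matters.) You only need to be aware of the difference when you start to rely on advanced features such as complex Unicode character ranges or special features that use the (?…) syntax. You can learn more about these advanced features in vignette("regular-expressions", package = "stringr"). https://www.regular-expressions.info/ is another useful reference. It is not specific to R, but it covers the advanced features of regular expressions and explains how they work under the hood.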
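
As a rough illustration of the point above, the sketch below runs the same basic pattern through stringr and through base R's grepl(), once with its default engine and once with perl = TRUE. The input vector x is made up for this example; for a simple pattern like this one, the three calls should agree.

```r
library(stringr)

# Toy input, invented for this illustration
x <- c("abc", "a.c", "bef")

# The same pattern (a literal "a.c") run through three engines
icu  <- str_detect(x, "a\\.c")           # stringr (ICU, via stringi)
tre  <- grepl("a\\.c", x)                # base R, default engine (TRE)
pcre <- grepl("a\\.c", x, perl = TRUE)   # base R, PCRE

icu    # FALSE TRUE FALSE
tre    # FALSE TRUE FALSE
pcre   # FALSE TRUE FALSE
```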

20.1.2 Exercises

  1. Explain why each of these strings fails to match a \: "\", "\\", "\\\".

  2. How would you match the sequence "'\?

  3. What patterns will the regular expression \..\..\.. match? How would you represent it as a string?

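20.2 Pattern language

You learned the basics of the regular expression pattern language in Chapter 19, and now it is time to dig into more of the details. First, we will start with escaping, which allows you to match characters that the pattern language otherwise treats specially. Next you will learn about anchors, which allow you to match the start or end of a string. Then you will learn about character classes and their shortcuts, which allow you to match any character from a set. We will finish with quantifiers, which control how many times a pattern can match, and alternation, which allows you to match either this or that.

The terms we use here are the technical names for each component. They are not always the most evocative of their purpose, but knowing the correct terms is very helpful when you want to search for more details later.

We will show how these patterns work using str_view() and str_view_all(), but remember that you can use them with any of the functions you learned about in Chapter 19, for example:

  • str_detect(x, pattern) returns a logical vector the same length as x, indicating whether each element matches the pattern (TRUE) or not (FALSE).
  • str_count(x, pattern) returns the number of times pattern matches in each element of x.
  • str_replace_all(x, pattern, replacement) replaces every instance of pattern with replacement.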
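
The three functions listed above can be tried on a small vector. This is only a minimal sketch; the vector x and the patterns are chosen for illustration.

```r
library(stringr)

# Toy input, invented for this illustration
x <- c("apple", "banana", "pear")

detected <- str_detect(x, "an")            # FALSE TRUE FALSE
counts   <- str_count(x, "a")              # 1 3 1
replaced <- str_replace_all(x, "a", "-")   # "-pple" "b-n-n-" "pe-r"
```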

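20.3 Escaping

In Chapter 19, you learned how to match a literal . by using fixed("."). But what if you want to match a literal . as part of a bigger regular expression? You need to use an escape, which tells the regular expression that you want to match the character exactly, rather than use its special behaviour. Like strings, regular expressions use the backslash, \, to escape special behaviour. So to match a ., you need the regular expression \.. Unfortunately this creates a problem. We use strings to represent regular expressions, and \ is also used as an escape symbol in strings. So to create the regular expression \. we need the string "\\.".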

# To create the regular expression \., we need to use \\.
dot <- "\\."

# But the expression itself only contains one:
str_view(dot)
  • \.
# And this tells R to look for an explicit .
str_view(c("abc", "a.c", "bef"), "a\\.c")
  • abc
  • a.c
  • bef

In this book, we will write a regular expression as \. and the string that represents that regular expression as "\\.".
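
To see the difference between the string and the regular expression it represents, it can help to inspect the string directly. The following is a small sketch; the test vector is invented for illustration.

```r
library(stringr)

dot <- "\\."

n_chars <- nchar(dot)   # 2: the string holds a backslash and a dot
writeLines(dot)         # prints the regular expression itself: \.

# Used as a pattern, it matches only a literal dot
found <- str_detect(c("a.c", "abc"), dot)   # TRUE FALSE
```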

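If \ is used as an escape character in regular expressions, how do you match a literal \? You need to escape it, creating the regular expression \\. To create that regular expression, you need to use a string, which also needs to escape \. That means that to match a literal \ you need to write "\\\\": you need four backslashes to match one!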

x <- "a\\b"
str_view(x)
  • a\b
str_view(x, "\\\\")
  • a\b

Alternatively, you might find it easier to use the raw strings you learned about in Chapter ??. That lets you avoid one layer of escaping:

str_view(x, r"(\\)")
  • a\b
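
As a quick check that the two spellings describe the same pattern, the sketch below compares the escaped string with the raw string. It assumes R 4.0.0 or later, where the r"(...)" raw string syntax is available.

```r
library(stringr)

x <- "a\\b"          # three characters: a, \, b

escaped <- "\\\\"    # ordinary string: four backslashes typed
raw     <- r"(\\)"   # raw string: two backslashes typed

same_pattern <- identical(escaped, raw)   # TRUE: both hold the regex \\
hits         <- str_count(x, raw)         # 1: the single backslash in x
```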

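The full set of characters with special meanings that need to be escaped is .^$\|*+?{}[](). In general, look at punctuation characters with suspicion; if your regular expression is not matching what you expect, check whether you have used any of these characters.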
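
A small sketch of why unescaped punctuation deserves suspicion: the same dot behaves very differently depending on whether it is escaped. The input string is made up for this example.

```r
library(stringr)

x <- "a.b.c"

n_any     <- str_count(x, ".")          # 5: an unescaped . matches every character
n_escaped <- str_count(x, "\\.")        # 2: only the literal dots
n_fixed   <- str_count(x, fixed("."))   # 2: fixed() turns off regex interpretation
```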

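20.4 Anchors

By default, regular expressions will match any part of a string. It is often useful to anchor the regular expression so that it matches from the start or end of the string. You can use:

  • ^ to match the start of the string.
  • $ to match the end of the string.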
x <- c("apple", "banana", "pear")
str_view(x, "a")  # match "a" anywhere
  • apple
  • banana
  • pear
str_view(x, "^a") # match "a" at start
  • apple
  • banana
  • pear
str_view(x, "a$") # match "a" at end
  • apple
  • banana
  • pear
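
Because the highlighting of str_view() is easy to lose in plain text, the same three patterns can also be checked with str_detect(). This is a minimal sketch using the vector from the example above.

```r
library(stringr)

x <- c("apple", "banana", "pear")

anywhere <- str_detect(x, "a")    # TRUE  TRUE  TRUE
at_start <- str_detect(x, "^a")   # TRUE  FALSE FALSE
at_end   <- str_detect(x, "a$")   # FALSE TRUE  FALSE
```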

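To remember which is which, try this mnemonic, which I learned from Evan Misshula: if you begin with power (^), you end up with money ($).

To force a regular expression to match the whole string, anchor it with both ^ and $: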

x <- c("apple pie", "apple", "apple cake")
str_view(x, "apple")
  • apple pie
  • apple
  • apple cake
str_view(x, "^apple$")
  • apple pie
  • apple
  • apple cake
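
The effect of anchoring at both ends can also be seen by keeping only the matching elements. The sketch below uses str_subset() on the same vector.

```r
library(stringr)

x <- c("apple pie", "apple", "apple cake")

partial <- str_subset(x, "apple")     # all three elements contain "apple"
whole   <- str_subset(x, "^apple$")   # only the element that is exactly "apple"
```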

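You can also match the boundary between words (i.e. the start or end of a word) with \b. I don't often use this in R, but I will sometimes use it when I'm doing a search in RStudio. It is handy when I want to find the name of a function that is a component of other functions. For example, I can search for \bsum\b to avoid matching summarize, summary, rowsum and so on.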

x <- c("summary(x)", "summarise(df)", "rowsum(x)", "sum(x)")
str_view(x, "sum")
  • summary(x)
  • summarise(df)
  • rowsum(x)
  • sum(x)
str_view_all(x, "\\bsum\\b")
  • summary(x)
  • summarise(df)
  • rowsum(x)
  • sum(x)
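
To make the difference explicit, the following sketch reports which elements match with and without the word boundaries.

```r
library(stringr)

x <- c("summary(x)", "summarise(df)", "rowsum(x)", "sum(x)")

loose  <- str_detect(x, "sum")          # TRUE  TRUE  TRUE  TRUE
strict <- str_detect(x, "\\bsum\\b")    # FALSE FALSE FALSE TRUE
```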

When used alone, these anchors produce a zero-width match:

str_view_all("abc", c("$", "^", "\\b"))
  • abc
  • abc
  • abc
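
Zero-width matches are hard to see because they contain no characters. One way to make them visible, sketched below, is to insert a marker at each match position with str_replace_all(); the marker characters are arbitrary.

```r
library(stringr)

at_start <- str_replace_all("abc", "^", ">")     # ">abc"
at_end   <- str_replace_all("abc", "$", "<")     # "abc<"
at_edges <- str_replace_all("abc", "\\b", "|")   # "|abc|"
```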

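20.4.1 Character classes

A character class, or character set, allows you to match any character in a set. The basic syntax lists the characters you would like to match inside of [], so [abc] will match a, b, or c. Inside of [] only -, ^, and \ have special meanings:

  • - defines a range. [a-z] matches any lowercase letter and [0-9] matches any digit.
  • ^ takes the inverse of the set. [^abc] matches anything except a, b, or c.
  • \ escapes special characters. [\^\-\]] matches ^, -, or ].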
str_view_all("abcd12345-!@#%.", c("[abc]", "[a-z]", "[^a-z0-9]"))
  • abcd12345-!@#%.
  • abcd12345-!@#%.
  • abcd12345-!@#%.
# You need an escape to match characters that are otherwise
# special inside of []
str_view_all("a-b-c", "[a\\-c]")
  • a-b-c
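
The same character classes can be checked by counting or extracting matches. The sketch below also contrasts the range [a-c] with the escaped form [a\-c].

```r
library(stringr)

x <- "abcd12345-!@#%."

n_abc   <- str_count(x, "[abc]")                   # 3: a, b, c
n_lower <- str_count(x, "[a-z]")                   # 4: a, b, c, d
others  <- str_extract_all(x, "[^a-z0-9]")[[1]]    # "-" "!" "@" "#" "%" "."

# Unescaped, - defines the range a to c; escaped, it is a literal hyphen
n_range   <- str_count("a-b-c", "[a-c]")      # 3: a, b, c
n_escaped <- str_count("a-b-c", "[a\\-c]")    # 4: a, -, -, c
```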

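Remember that regular expressions are case sensitive, so if you want to match any lowercase letter, uppercase letter, or digit, you need to write [a-zA-Z0-9].

20.4.2 Shorthand character classes

There are a few character classes that are used so commonly that they get their own shortcut. You have already seen ., which matches any character apart from a newline. There are three other particularly useful pairs:

  • \d: matches any digit;
    \D: matches anything that isn't a digit.
  • \s: matches any whitespace (e.g. space, tab, newline);
    \S: matches anything that isn't whitespace.
  • \w: matches any “word” character, i.e. letters and numbers (and the underscore);
    \W: matches any “non-word” character.

Remember, to create a regular expression containing \d or \s, you need to escape the \ for the string, so you will type "\\d" or "\\s". The following code demonstrates the different shortcuts with a selection of letters, numbers, and punctuation characters.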

str_view_all("abcd12345!@#%. ", "\\d+")
  • abcd12345!@#%. 
str_view_all("abcd12345!@#%. ", "\\D+")
  • abcd12345!@#%. 
str_view_all("abcd12345!@#%. ", "\\w+")
  • abcd12345!@#%. 
str_view_all("abcd12345!@#%. ", "\\W+")
  • abcd12345!@#%. 
str_view_all("abcd12345!@#%. ", "\\s+")
  • abcd12345!@#%. 
str_view_all("abcd12345!@#%. ", "\\S+")
  • abcd12345!@#%. 
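
Since the plain-text output above cannot show which characters were highlighted, the sketch below extracts the matches for each shortcut from the same string.

```r
library(stringr)

x <- "abcd12345!@#%. "

digits     <- str_extract_all(x, "\\d+")[[1]]   # "12345"
non_digits <- str_extract_all(x, "\\D+")[[1]]   # "abcd" "!@#%. "
word_chars <- str_extract_all(x, "\\w+")[[1]]   # "abcd12345"
non_word   <- str_extract_all(x, "\\W+")[[1]]   # "!@#%. "
spaces     <- str_extract_all(x, "\\s+")[[1]]   # " "
non_spaces <- str_extract_all(x, "\\S+")[[1]]   # "abcd12345!@#%."
```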

20.4.3 Quantifiers

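Quantifiers control how many times a pattern matches. In Chapter 19 you learned about ? (0 or 1 matches), + (1 or more matches), and * (0 or more matches). For example, colou?r will match both American and British spellings, \d+ will match one or more digits, and \s? will optionally match a single whitespace character.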
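
The three examples in this paragraph can be tried directly. This is a small sketch; the input strings are invented for illustration.

```r
library(stringr)

# ? makes the "u" optional
spelling <- str_detect(c("color", "colour", "colr"), "colou?r")   # TRUE TRUE FALSE

# + requires at least one digit
number <- str_extract("room 101", "\\d+")                         # "101"

# \s? allows zero or one whitespace character between a and b
spaced <- str_detect(c("ab", "a b", "a  b"), "^a\\s?b$")          # TRUE TRUE FALSE
```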

You can also specify the number of matches precisely:

  • {n}: exactly n
  • {n,}: n or more
  • {n,m}: between n and m

The following code shows how this works for a few simple examples, using \b to match the start or end of a word.

x <- " x xx xxx xxxx"
str_view_all(x, "\\bx{2}")
  •  x xx xxx xxxx
str_view_all(x, "\\bx{2,}")
  •  x xx xxx xxxx
str_view_all(x, "\\bx{1,3}")
  •  x xx xxx xxxx
str_view_all(x, "\\bx{2,3}")
  •  x xx xxx xxxx
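
To see exactly what each quantifier picks up, the sketch below extracts the matches from the same string. Note that a match is greedy and must begin at a word boundary, so the leftover x characters inside a longer word are not matched again.

```r
library(stringr)

x <- " x xx xxx xxxx"

exactly_2 <- str_extract_all(x, "\\bx{2}")[[1]]     # "xx" "xx" "xx"
at_least_2 <- str_extract_all(x, "\\bx{2,}")[[1]]   # "xx" "xxx" "xxxx"
from_1_to_3 <- str_extract_all(x, "\\bx{1,3}")[[1]] # "x" "xx" "xxx" "xxx"
from_2_to_3 <- str_extract_all(x, "\\bx{2,3}")[[1]] # "xx" "xxx" "xxx"
```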

20.4.4 Alternation and parentheses

You can use alternation to pick between one or more alternative patterns. Here are a few examples:

  • Match apple, pear, or banana: apple|pear|banana.
  • Match three word characters or two digits: \w{3}|\d{2}.
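
Both alternation patterns can be tried on short strings. This is only a sketch; the input strings are made up for this example.

```r
library(stringr)

fruit <- str_extract_all("apple pear kiwi banana", "apple|pear|banana")[[1]]
# "apple" "pear" "banana"

mixed <- str_extract_all("ab 12 abc 3", "\\w{3}|\\d{2}")[[1]]
# "12" "abc": "ab" is too short for \w{3}, and "3" is too short for \d{2}
```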

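20.4.5 Parentheses and operator precedence

What does ab+ match? Does it match “a” followed by one or more “b”s, or does it match “ab” repeated any number of times? What does ^a|b$ match? Does it match the complete string a or the complete string b, or does it match a string starting with a or a string ending with b? The answer to these questions is determined by operator precedence, similar to the PEMDAS or BEDMAS rules you might have learned in school for expressions like a + b * c.

You already know that a + b * c is equivalent to a + (b * c), not (a + b) * c, because * has higher precedence and + has lower precedence (so * is evaluated before +). In regular expressions, quantifiers have high precedence and alternation has low precedence. That means ab+ is equivalent to a(b+), and ^a|b$ is equivalent to (^a)|(b$). Just like with algebra, you can use parentheses to override the usual order (parentheses have the highest precedence of all).

Technically, escapes, character classes, and parentheses are all operators with high precedence. But they tend to work the way you expect, so they rarely cause confusion: you are unlikely to think that \(s|d) means (\s)|(\d).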
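
The precedence rules can be checked by comparing each pattern with its parenthesised alternative. The sketch below uses invented inputs.

```r
library(stringr)

# Quantifiers bind tightly: ab+ means a(b+), not (ab)+
s <- "abbb abab"
default_b <- str_extract_all(s, "ab+")[[1]]     # "abbb" "ab" "ab"
grouped_b <- str_extract_all(s, "(ab)+")[[1]]   # "ab" "abab"

# Alternation binds loosely: ^a|b$ means (^a)|(b$), not ^(a|b)$
x <- c("ab", "ba", "a", "b", "cab", "xyz")
default_alt <- str_detect(x, "^a|b$")     # TRUE FALSE TRUE TRUE TRUE  FALSE
grouped_alt <- str_detect(x, "^(a|b)$")   # FALSE FALSE TRUE TRUE FALSE FALSE
```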

20.4.6 Exercises

  1. How would you match the literal string "$^$"?

  2. Given the corpus of common words in stringr::words, create regular expressions that find all words that:

    1. Start with “y”.
    2. Don’t start with “y”.
    3. End with “x”.
    4. Are exactly three letters long. (Don’t cheat by using str_length()!)
    5. Have seven letters or more.

    Since words is long, you might want to use the match argument to str_view() to show only the matching or non-matching words.

  3. Create regular expressions that match the British or American spellings of the following words: grey/gray, modelling/modeling, summarize/summarise, aluminium/aluminum, defence/defense, analog/analogue, center/centre, sceptic/skeptic, aeroplane/airplane, arse/ass, doughnut/donut.

  4. What strings will $a match?

  5. Create a regular expression that will match telephone numbers as commonly written in your country.

  6. Write the equivalents of ?, +, * in {m,n} form.

  7. Describe in words what these regular expressions match: (read carefully to see if I’m using a regular expression or a string that defines a regular expression.)

    1. ^.*$
    2. "\\{.+\\}"
    3. \d{4}-\d{2}-\d{2}
    4. "\\\\{4}"
  8. Solve the beginner regexp crosswords at https://regexcrossword.com/challenges/beginner.

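20.5 Practice

To put these ideas into practice, we will solve a few problems using the words and sentences datasets included in stringr. words is a list of common English words and sentences is a set of simple sentences originally created for testing voice transmission.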

str_view(head(words))
  • a
  • able
  • about
  • absolute
  • accept
  • account
str_view(head(sentences))
  • The birch canoe slid on the smooth planks.
  • Glue the sheet to the dark blue background.
  • It's easy to tell the depth of a well.
  • These days a chicken leg is a rare dish.
  • Rice is often served in round bowls.
  • The juice of lemons makes fine punch.

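In the following three sections, we will practise the pattern elements while discussing three general techniques: checking your work by creating simple positive and negative controls, combining regular expressions with Boolean algebra, and creating complex patterns using string manipulation.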
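
As a minimal sketch of the first technique, the code below checks a pattern against a few strings that should match (positive controls) and a few that should not (negative controls). The pattern reuses \bsum\b from the anchors section, and the control strings are invented for this example.

```r
library(stringr)

pattern <- "\\bsum\\b"

# Hypothetical controls, chosen by hand for this illustration
positive <- c("sum(x)", "x + sum(y)")      # should all match
negative <- c("summary(x)", "rowsum(x)")   # should never match

all_positive_match <- all(str_detect(positive, pattern))    # TRUE
no_negative_match  <- !any(str_detect(negative, pattern))   # TRUE
```

If either check comes back FALSE, the pattern is too strict or too loose and needs another look before you run it on the full data.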

20.5.1 Check your work

First, let's find all the sentences that start with “The”. Using the ^ anchor alone is not enough:

str_view(sentences, "^The", match = TRUE)
  • The birch canoe slid on the smooth planks.
  • These days a chicken leg is a rare dish.
  • The juice of lemons makes fine punch.
  • The box was thrown beside the parked truck.
  • The hogs were fed chopped corn and garbage.
  • The boy was there when the sun rose.
  • The source of the huge river is the clear spring.
  • The soft cushion broke the man's fall.
  • The salt breeze came across from the sea.
  • The girl at the booth sold fifty bonds.
  • The small pup gnawed a hole in the sock.
  • The fish twisted and turned on the bent hook.
  • The swan dive was far short of perfect.
  • The beauty of the view stunned the young boy.
  • The colt reared and threw the tall rider.
  • The wrist was badly strained and hung limp.
  • The stray cat gave birth to kittens.
  • The young girl gave no clear response.
  • The meal was cooked before the bell rang.
  • The ship was torn apart on the sharp reef.
  • The wide road shimmered in the hot sun.
  • The lazy cow lay in the cool grass.
  • The rope will bind the seven books at once.
  • The friendly gang left the drug store.
  • The frosty air passed through the coat.
  • The crooked maze failed to fool the mouse.
  • The show was a flop from the very start.
  • The wagon moved on well oiled wheels.
  • The clock struck to mark the third period.
  • The set of china hit, the floor with a crash.
  • The dune rose from the edge of the water.
  • The two met while playing on the sand.
  • The ink stain dried on the finished page.
  • The walled town was seized without a fight.
  • The lease ran out in sixteen weeks.
  • The horn of the car woke the sleeping cop.
  • The heart beat strongly and with firm strokes.
  • The pearl was worn in a thin silver ring.
  • The fruit peel was cut in thick slices.
  • The Navy attacked the big task force.
  • There are more than two factors here.
  • The hat brim was wide and too droopy.
  • The lawyer tried to lose his case.
  • The grass curled around the fence post.
  • The slush lay deep along the street.
  • The fin was sharp and cut the clear water.
  • The play seems dull and quite stupid.
  • The term ended in late June that year.
  • The bill as paid every third week.
  • The pipe began to rust while new.
  • The ripe taste of cheese improves with age.
  • The hog crawled under the high fence.
  • The bark of the pine tree was shiny and dark.
  • The pennant waved when the wind blew.
  • The harder he tried the less he got done.
  • The boss ran the show with a watchful eye.
  • The cup cracked and spilled its contents.
  • The slang word for raw whiskey is booze.
  • The wharf could be seen at the farther shore.
  • The tiny girl took off her hat.
  • The glow deepened in the eyes of the sweet girl.
  • The young kid jumped the rusty gate.
  • The just claim got the right verdict.
  • These thistles bend in a high wind.
  • The tree top waved in a graceful way.
  • The spot on the blotter was made by green ink.
  • The cigar burned a hole in the desk top.
  • The empty flask stood on the tin tray.
  • The coffee stand is too high for the couch.
  • The urge to write short stories is rare.
  • The pencils have all been used.
  • The pirates seized the crew of the lost ship.
  • The sofa cushion is red and of light weight.
  • The jacket hung on the back of the wide chair.
  • The office paint was a dull sad tan.
  • The child almost hurt the small dog.
  • There was a sound of dry leaves outside.
  • The sky that morning was clear and bright blue.
  • The doctor cured him with these pills.
  • The new girl was fired today at noon.
  • They felt gay when the ship arrived in port.
  • The third act was dull and tired the players.
  • There the flood mark is ten inches.
  • The fruit of a fig tree is apple-shaped.
  • The paper box is full of thumb tacks.
  • The tongs lay beside the ice pail.
  • The petals fall with the next puff of wind.
  • They could laugh although they were sad.
  • The brown house was on fire to the attic.
  • The lure is used to catch trout and flounder.
  • The club rented the rink for the fifth night.
  • The hostess taught the new maid to serve.
  • The cement had dried when he moved it.
  • The loss of the second ship was hard to take.
  • The fly made its way along the wall.
  • The large house had hot water taps.
  • The doorknob was made of bright clean brass.
  • The wreck occurred by the bank on Main Street.
  • The lamp shone with a steady green flame.
  • They took the axe and the saw to the forest.
  • The ancient coin was quite dull and worn.
  • The shaky barn fell with a loud crash.
  • They are pushed back each time they attack.
  • They floated on the raft to sun their white backs.
  • The map had an X that meant nothing.
  • The play began as soon as we sat down.
  • The rush for funds reached its peak Tuesday.
  • The birch looked stark white and lonesome.
  • The box is held by a bright red snapper.
  • The first worm gets snapped early.
  • They are men who walk the middle of the road.
  • The prince ordered his head chopped off.
  • The houses are built of red clay bricks.
  • These pills do less good than others.
  • The dark pot hung in the front closet.
  • The train brought our hero to the big town.
  • The rude laugh filled the empty room.
  • The horse trotted around the field at a brisk pace.
  • The red tape bound the smuggled food.
  • The cold drizzle will halt the bond drive.
  • The junk yard had a mouldy smell.
  • The flint sputtered and lit a pine torch.
  • The shelves were bare of both jam or crackers.
  • The ridge on a smooth surface is a bump or flaw.
  • The mute muffled the high tones of the horn.
  • The gold ring fits only a pierced ear.
  • The old pan was covered with hard fudge.
  • The node on the stalk of wheat grew daily.
  • The heap of fallen leaves was set on fire.
  • The barrel of beer was a brew of malt and hops.
  • The plant grew large and green in the window.
  • The beam dropped down on the workmen's head.
  • The tube was blown and the tire flat and useless.
  • The last switch cannot be turned off.
  • The fight will end in just six minutes.
  • The store walls were lined with colored frocks.
  • The peace league met to discuss their plans.
  • The rise to fame of a person takes luck.
  • The quick fox jumped on the sleeping cat.
  • The nozzle of the fire hose was bright brass.
  • The purple tie was ten years old.
  • The crunch of feet in the snow was the only sound.
  • The copper bowl shone in the sun's rays.
  • The plush chair leaned against the wall.
  • The beach is dry and shallow at low tide.
  • The idea is to sew both edges straight.
  • The kitten chased the dog down the street.
  • The zones merge in the central part of town.
  • The vane on top of the pole revolved in the wind.
  • The clan gathered on each dull night.
  • The man went to the woods to gather sticks.
  • The dirt piles were lines along the road.
  • The logs fell and tumbled into the clear stream.
  • The thaw came early and freed the stream.
  • The key you designed will fit the lock.
  • The lake sparkled in the red hot sun.
  • The fur of cats goes by many names.
  • The drip of the rain made a pleasant sound.
  • The sun came up to light the eastern sky.
  • The stale smell of old beer lingers.
  • The desk was firm on the shaky floor.
  • The cone costs five cents on Mondays.
  • The list of names is carved around the base.
  • The sheep were led home by a dog.
  • The sense of smell is better than that of touch.
  • The news struck doubt into restless minds.
  • The sand drifts over the sill of the old house.
  • The point of the steel pen was bent and twisted.
  • There is a lag between thought and act.
  • The boy owed his pal thirty cents.
  • The chap slipped into the crowd and was lost.
  • The ramp led up to the wide highway.
  • The straw nest housed five robins.
  • The dry wax protects the deep scratch.
  • These coins will be needed to pay his debt.
  • The nag pulled the frail cart along.
  • The vamp of the shoe had a gold buckle.
  • The smell of burned rags itches my nose.
  • The marsh will freeze when cold enough.
  • They slice the sausage thin with a knife.
  • The bloom of the rose lasts a few days.
  • The man wore a feather in his felt hat.
  • The desk and both chairs were painted tan.
  • The couch cover and hall drapes were blue.
  • The stems of the tall glasses cracked and broke.
  • The wall phone rang loud and often.
  • The clothes dried on a thin wooden rack.
  • The cleat sank deeply into the soft turf.
  • The bills were mailed promptly on the tenth of the month.
  • The price is fair for a good antique clock.
  • The music played on while they talked.
  • The bunch of grapes was pressed into wine.
  • The hinge on the door creaked with old age.
  • The screen before the fire kept in the sparks.
  • The chair looked strong but had no bottom.
  • The kite flew wildly in the high wind.
  • The tin box held priceless stones.
  • The case was puzzling to the old and wise.
  • The bright lanterns were gay on the dark lawn.
  • The youth drove with zest, but little skill.
  • The way to save money is not to spend much.
  • The odor of spring makes young hearts jump.
  • They told wild tales to frighten him.
  • The three story house was built of stone.
  • Their eyelids droop for want. of sleep.
  • The sip of tea revives his tired friend.
  • There are many ways to do these things.
  • The work of the tailor is seen on each side.
  • The dusty bench stood by the stone wall.
  • The square wooden crate was packed to be shipped.
  • The water in this well is a source of good health.
  • The little tales they tell are false.
  • The door was barred, locked, and bolted as well.
  • The kite dipped and swayed, but stayed aloft.
  • The pleasant hours fly by much too soon.
  • The room was crowded with a wild mob.
  • The beetle droned in the hot June sun.
  • The black trunk fell from the landing.
  • The bank pressed for payment of the debt.
  • The theft of the pearl pin was kept secret.
  • The vast space stretched into the far distance.
  • The leaf drifts along with a slow spin.
  • The pencil was cut to be sharp at both ends.
  • The best method is to fix it in place with clips.
  • The small red neon lamp went out.
  • The fan whirled its round blades softly.
  • The line where the edges join was clean.
  • The child crawled into the dense grass.
  • The hilt. of the sword was carved with fine designs.
  • The pipe ran almost the length of the ditch.
  • The weight. of the package was seen on the high scale.
  • The green light in the brown box flickered.
  • The brass tube circled the high wall.
  • The lobes of her ears were pierced to hold rings.
  • They took their kids from the public school.
  • The cloud moved in a stately way and was gone.
  • There is a strong chance it will happen once more.
  • The duke left the park in a silver coach.
  • The ram scared the school children off.
  • The team with the best timing looks good.
  • The farmer swapped his horse for a brown ox.
  • The early phase of life moves fast.
  • The latch on the beck gate needed a nail.
  • The goose was brought straight from the old market.
  • The sink is the thing in which we pile dishes.
  • The facts don't always show who is right.
  • The loss of the cruiser was a blow to the fleet.
  • The square peg will settle in the round hole.
  • They sang the same tunes at each party.
  • The sky in the west is tinged with orange red.
  • The pods of peas ferment in bare fields.
  • The horse balked and threw the tall rider.
  • The hitch between the horse and cart broke.
  • The gold vase is both rare and costly.
  • The knife was hung inside its bright sheath.
  • The rarest spice comes from the far East.
  • The roof should be tilted at a sharp slant.
  • The mule trod the treadmill day and night.
  • The aim of the contest is to raise a great fund.
  • There is a fine hard tang in salty air.
  • The slab was hewn from heavy blocks of slate.
  • The poor boy missed the boat again.
  • The first part of the plan needs changing.
  • The good book informs of what we ought to know.
  • The mail comes in three batches per day.
  • The night shift men rate extra pay.
  • The red paper brightened the dim stage.
  • The steady drip is worse than a drenching rain.
  • The stitch will serve but needs to be shortened.
  • The gloss on top made it unfit to read.
  • The hail pattered on the burnt brown grass.
  • The store was jammed before the sale could start.
  • The pot boiled, but the contents failed to jell.
  • The baby puts his right foot in his mouth.
  • The bombs left most of the town in ruins.
  • The streets are narrow and full of sharp turns.
  • The pup jerked the leash as he saw a feline shape.
  • The big red apple fell to the ground.
  • The curtain rose and the show was on.
  • The young prince became heir to the throne.
  • The corner store was robbed last night.
  • The long journey home took a year.
  • The grass and bushes were wet with dew.
  • The blind man counted his old coins.

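That is because the pattern also matches sentences starting with words like They or These. We need to make sure that the “e” is the last letter in the word, which we can do by adding a word boundary: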

str_view(sentences, "^The\\b", match = TRUE)
  • The birch canoe slid on the smooth planks.
  • The juice of lemons makes fine punch.
  • The box was thrown beside the parked truck.
  • The hogs were fed chopped corn and garbage.
  • The boy was there when the sun rose.
  • The source of the huge river is the clear spring.
  • The soft cushion broke the man's fall.
  • The salt breeze came across from the sea.
  • The girl at the booth sold fifty bonds.
  • The small pup gnawed a hole in the sock.
  • The fish twisted and turned on the bent hook.
  • The swan dive was far short of perfect.
  • The beauty of the view stunned the young boy.
  • The colt reared and threw the tall rider.
  • The wrist was badly strained and hung limp.
  • The stray cat gave birth to kittens.
  • The young girl gave no clear response.
  • The meal was cooked before the bell rang.
  • The ship was torn apart on the sharp reef.
  • The wide road shimmered in the hot sun.
  • The lazy cow lay in the cool grass.
  • The rope will bind the seven books at once.
  • The friendly gang left the drug store.
  • The frosty air passed through the coat.
  • The crooked maze failed to fool the mouse.
  • The show was a flop from the very start.
  • The wagon moved on well oiled wheels.
  • The clock struck to mark the third period.
  • The set of china hit, the floor with a crash.
  • The dune rose from the edge of the water.
  • The two met while playing on the sand.
  • The ink stain dried on the finished page.
  • The walled town was seized without a fight.
  • The lease ran out in sixteen weeks.
  • The horn of the car woke the sleeping cop.
  • The heart beat strongly and with firm strokes.
  • The pearl was worn in a thin silver ring.
  • The fruit peel was cut in thick slices.
  • The Navy attacked the big task force.
  • The hat brim was wide and too droopy.
  • The lawyer tried to lose his case.
  • The grass curled around the fence post.
  • The slush lay deep along the street.
  • The fin was sharp and cut the clear water.
  • The play seems dull and quite stupid.
  • The term ended in late June that year.
  • The bill as paid every third week.
  • The pipe began to rust while new.
  • The ripe taste of cheese improves with age.
  • The hog crawled under the high fence.
  • The bark of the pine tree was shiny and dark.
  • The pennant waved when the wind blew.
  • The harder he tried the less he got done.
  • The boss ran the show with a watchful eye.
  • The cup cracked and spilled its contents.
  • The slang word for raw whiskey is booze.
  • The wharf could be seen at the farther shore.
  • The tiny girl took off her hat.
  • The glow deepened in the eyes of the sweet girl.
  • The young kid jumped the rusty gate.
  • The just claim got the right verdict.
  • The tree top waved in a graceful way.
  • The spot on the blotter was made by green ink.
  • The cigar burned a hole in the desk top.
  • The empty flask stood on the tin tray.
  • The coffee stand is too high for the couch.
  • The urge to write short stories is rare.
  • The pencils have all been used.
  • The pirates seized the crew of the lost ship.
  • The sofa cushion is red and of light weight.
  • The jacket hung on the back of the wide chair.
  • The office paint was a dull sad tan.
  • The child almost hurt the small dog.
  • The sky that morning was clear and bright blue.
  • The doctor cured him with these pills.
  • The new girl was fired today at noon.
  • The third act was dull and tired the players.
  • The fruit of a fig tree is apple-shaped.
  • The paper box is full of thumb tacks.
  • The tongs lay beside the ice pail.
  • The petals fall with the next puff of wind.
  • The brown house was on fire to the attic.
  • The lure is used to catch trout and flounder.
  • The club rented the rink for the fifth night.
  • The hostess taught the new maid to serve.
  • The cement had dried when he moved it.
  • The loss of the second ship was hard to take.
  • The fly made its way along the wall.
  • The large house had hot water taps.
  • The doorknob was made of bright clean brass.
  • The wreck occurred by the bank on Main Street.
  • The lamp shone with a steady green flame.
  • The ancient coin was quite dull and worn.
  • The shaky barn fell with a loud crash.
  • The map had an X that meant nothing.
  • The play began as soon as we sat down.
  • The rush for funds reached its peak Tuesday.
  • The birch looked stark white and lonesome.
  • The box is held by a bright red snapper.
  • The first worm gets snapped early.
  • The prince ordered his head chopped off.
  • The houses are built of red clay bricks.
  • The dark pot hung in the front closet.
  • The train brought our hero to the big town.
  • The rude laugh filled the empty room.
  • The horse trotted around the field at a brisk pace.
  • The red tape bound the smuggled food.
  • The cold drizzle will halt the bond drive.
  • The junk yard had a mouldy smell.
  • The flint sputtered and lit a pine torch.
  • The shelves were bare of both jam or crackers.
  • The ridge on a smooth surface is a bump or flaw.
  • The mute muffled the high tones of the horn.
  • The gold ring fits only a pierced ear.
  • The old pan was covered with hard fudge.
  • The node on the stalk of wheat grew daily.
  • The heap of fallen leaves was set on fire.
  • The barrel of beer was a brew of malt and hops.
  • The plant grew large and green in the window.
  • The beam dropped down on the workmen's head.
  • The tube was blown and the tire flat and useless.
  • The last switch cannot be turned off.
  • The fight will end in just six minutes.
  • The store walls were lined with colored frocks.
  • The peace league met to discuss their plans.
  • The rise to fame of a person takes luck.
  • The quick fox jumped on the sleeping cat.
  • The nozzle of the fire hose was bright brass.
  • The purple tie was ten years old.
  • The crunch of feet in the snow was the only sound.
  • The copper bowl shone in the sun's rays.
  • The plush chair leaned against the wall.
  • The beach is dry and shallow at low tide.
  • The idea is to sew both edges straight.
  • The kitten chased the dog down the street.
  • The zones merge in the central part of town.
  • The vane on top of the pole revolved in the wind.
  • The clan gathered on each dull night.
  • The man went to the woods to gather sticks.
  • The dirt piles were lines along the road.
  • The logs fell and tumbled into the clear stream.
  • The thaw came early and freed the stream.
  • The key you designed will fit the lock.
  • The lake sparkled in the red hot sun.
  • The fur of cats goes by many names.
  • The drip of the rain made a pleasant sound.
  • The sun came up to light the eastern sky.
  • The stale smell of old beer lingers.
  • The desk was firm on the shaky floor.
  • The cone costs five cents on Mondays.
  • The list of names is carved around the base.
  • The sheep were led home by a dog.
  • The sense of smell is better than that of touch.
  • The news struck doubt into restless minds.
  • The sand drifts over the sill of the old house.
  • The point of the steel pen was bent and twisted.
  • The boy owed his pal thirty cents.
  • The chap slipped into the crowd and was lost.
  • The ramp led up to the wide highway.
  • The straw nest housed five robins.
  • The dry wax protects the deep scratch.
  • The nag pulled the frail cart along.
  • The vamp of the shoe had a gold buckle.
  • The smell of burned rags itches my nose.
  • The marsh will freeze when cold enough.
  • The bloom of the rose lasts a few days.
  • The man wore a feather in his felt hat.
  • The desk and both chairs were painted tan.
  • The couch cover and hall drapes were blue.
  • The stems of the tall glasses cracked and broke.
  • The wall phone rang loud and often.
  • The clothes dried on a thin wooden rack.
  • The cleat sank deeply into the soft turf.
  • The bills were mailed promptly on the tenth of the month.
  • The price is fair for a good antique clock.
  • The music played on while they talked.
  • The bunch of grapes was pressed into wine.
  • The hinge on the door creaked with old age.
  • The screen before the fire kept in the sparks.
  • The chair looked strong but had no bottom.
  • The kite flew wildly in the high wind.
  • The tin box held priceless stones.
  • The case was puzzling to the old and wise.
  • The bright lanterns were gay on the dark lawn.
  • The youth drove with zest, but little skill.
  • The way to save money is not to spend much.
  • The odor of spring makes young hearts jump.
  • The three story house was built of stone.
  • The sip of tea revives his tired friend.
  • The work of the tailor is seen on each side.
  • The dusty bench stood by the stone wall.
  • The square wooden crate was packed to be shipped.
  • The water in this well is a source of good health.
  • The little tales they tell are false.
  • The door was barred, locked, and bolted as well.
  • The kite dipped and swayed, but stayed aloft.
  • The pleasant hours fly by much too soon.
  • The room was crowded with a wild mob.
  • The beetle droned in the hot June sun.
  • The black trunk fell from the landing.
  • The bank pressed for payment of the debt.
  • The theft of the pearl pin was kept secret.
  • The vast space stretched into the far distance.
  • The leaf drifts along with a slow spin.
  • The pencil was cut to be sharp at both ends.
  • The best method is to fix it in place with clips.
  • The small red neon lamp went out.
  • The fan whirled its round blades softly.
  • The line where the edges join was clean.
  • The child crawled into the dense grass.
  • The hilt. of the sword was carved with fine designs.
  • The pipe ran almost the length of the ditch.
  • The weight. of the package was seen on the high scale.
  • The green light in the brown box flickered.
  • The brass tube circled the high wall.
  • The lobes of her ears were pierced to hold rings.
  • The cloud moved in a stately way and was gone.
  • The duke left the park in a silver coach.
  • The ram scared the school children off.
  • The team with the best timing looks good.
  • The farmer swapped his horse for a brown ox.
  • The early phase of life moves fast.
  • The latch on the beck gate needed a nail.
  • The goose was brought straight from the old market.
  • The sink is the thing in which we pile dishes.
  • The facts don't always show who is right.
  • The loss of the cruiser was a blow to the fleet.
  • The square peg will settle in the round hole.
  • The sky in the west is tinged with orange red.
  • The pods of peas ferment in bare fields.
  • The horse balked and threw the tall rider.
  • The hitch between the horse and cart broke.
  • The gold vase is both rare and costly.
  • The knife was hung inside its bright sheath.
  • The rarest spice comes from the far East.
  • The roof should be tilted at a sharp slant.
  • The mule trod the treadmill day and night.
  • The aim of the contest is to raise a great fund.
  • The slab was hewn from heavy blocks of slate.
  • The poor boy missed the boat again.
  • The first part of the plan needs changing.
  • The good book informs of what we ought to know.
  • The mail comes in three batches per day.
  • The night shift men rate extra pay.
  • The red paper brightened the dim stage.
  • The steady drip is worse than a drenching rain.
  • The stitch will serve but needs to be shortened.
  • The gloss on top made it unfit to read.
  • The hail pattered on the burnt brown grass.
  • The store was jammed before the sale could start.
  • The pot boiled, but the contents failed to jell.
  • The baby puts his right foot in his mouth.
  • The bombs left most of the town in ruins.
  • The streets are narrow and full of sharp turns.
  • The pup jerked the leash as he saw a feline shape.
  • The big red apple fell to the ground.
  • The curtain rose and the show was on.
  • The young prince became heir to the throne.
  • The corner store was robbed last night.
  • The long journey home took a year.
  • The grass and bushes were wet with dew.
  • The blind man counted his old coins.

How would you find all sentences that start with a pronoun?

str_view(sentences, "^She|He|It|They\\b", match = TRUE)
  • It's easy to tell the depth of a well.
  • Help the woman get back to her feet.
  • Her purse was full of useless trash.
  • It snowed, rained, and hailed the same morning.
  • He ran half way to the hardware store.
  • He lay prone and hardly moved a limb.
  • He ordered peach pie with ice cream.
  • Hemp is a weed found in parts of the tropics.
  • It caught its hind paw in a rusty trap.
  • He said the same phrase thirty times.
  • He broke a new shoelace that day.
  • She sewed the torn coat quite neatly.
  • He knew the skill of the great young actress.
  • They felt gay when the ship arrived in port.
  • He carved a head from the round block of marble.
  • She has st smart way of wearing clothes.
  • They could laugh although they were sad.
  • He wrote his last novel there at the inn.
  • It is hard to erase blue or red ink.
  • They took the axe and the saw to the forest.
  • They are pushed back each time they attack.
  • He broke his ties with groups of former friends.
  • They floated on the raft to sun their white backs.
  • They are men who walk the middle of the road.
  • Hedge apples may stain your hands green.
  • She danced like a swan, tall and graceful.
  • It is late morning on the old wall clock.
  • He smoke a big pipe with strong contents.
  • He used the lathe to make brass objects.
  • It takes a good trap to capture a bear.
  • He took the lead and kept it the whole distance.
  • He crawled with care along the ledge.
  • It takes a lot of help to finish these.
  • He asks no person to vouch for him.
  • He wrote down a long list of items.
  • Heave the line over the port side.
  • It's a dense crowd in two distinct ways.
  • It takes heat to bring out the odor.
  • He takes the oath of office each March.
  • He picked up the dice for a second roll.
  • They slice the sausage thin with a knife.
  • He wheeled the bike past. the winding road.
  • He sent the figs, but kept the ripe cherries.
  • He offered proof in the form of a large chart.
  • They told wild tales to frighten him.
  • She was kind to sick old people.
  • She blushed when he gave her a white orchid.
  • He wrote his name boldly at the top of tile sheet.
  • It matters not if he reads these words or those.
  • She was waiting at my front lawn.
  • It is a band of steel three inches wide.
  • It was hidden from sight by a mass of leaves and shrubs.
  • He put his last cartridge into the gun and fired.
  • They took their kids from the public school.
  • Help the weak to preserve their strength.
  • He lent his coat to the tall gaunt stranger.
  • She flaps her cape as she parades the street.
  • It was done before the boy could see it.
  • They sang the same tunes at each party.
  • It was a bad error on the part of the new judge.
  • He sent the boy on a short errand.
  • She saw a cat in the neighbor's house.
  • She called his name many times.

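A quick look at the results shows some spurious matches, such as "Help", "Her", and "Hemp". That's because we forgot to use parentheses. Without them, the alternation splits the pattern into ^She, He, It, and They\b, so only She is anchored to the start of the string and only They must be followed by a word boundary: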

str_view(sentences, "^(She|He|It|They)\\b", match = TRUE)
  • It's easy to tell the depth of a well.
  • It snowed, rained, and hailed the same morning.
  • He ran half way to the hardware store.
  • He lay prone and hardly moved a limb.
  • He ordered peach pie with ice cream.
  • It caught its hind paw in a rusty trap.
  • He said the same phrase thirty times.
  • He broke a new shoelace that day.
  • She sewed the torn coat quite neatly.
  • He knew the skill of the great young actress.
  • They felt gay when the ship arrived in port.
  • He carved a head from the round block of marble.
  • She has st smart way of wearing clothes.
  • They could laugh although they were sad.
  • He wrote his last novel there at the inn.
  • It is hard to erase blue or red ink.
  • They took the axe and the saw to the forest.
  • They are pushed back each time they attack.
  • He broke his ties with groups of former friends.
  • They floated on the raft to sun their white backs.
  • They are men who walk the middle of the road.
  • She danced like a swan, tall and graceful.
  • It is late morning on the old wall clock.
  • He smoke a big pipe with strong contents.
  • He used the lathe to make brass objects.
  • It takes a good trap to capture a bear.
  • He took the lead and kept it the whole distance.
  • He crawled with care along the ledge.
  • It takes a lot of help to finish these.
  • He asks no person to vouch for him.
  • He wrote down a long list of items.
  • It's a dense crowd in two distinct ways.
  • It takes heat to bring out the odor.
  • He takes the oath of office each March.
  • He picked up the dice for a second roll.
  • They slice the sausage thin with a knife.
  • He wheeled the bike past. the winding road.
  • He sent the figs, but kept the ripe cherries.
  • He offered proof in the form of a large chart.
  • They told wild tales to frighten him.
  • She was kind to sick old people.
  • She blushed when he gave her a white orchid.
  • He wrote his name boldly at the top of tile sheet.
  • It matters not if he reads these words or those.
  • She was waiting at my front lawn.
  • It is a band of steel three inches wide.
  • It was hidden from sight by a mass of leaves and shrubs.
  • He put his last cartridge into the gun and fired.
  • They took their kids from the public school.
  • He lent his coat to the tall gaunt stranger.
  • She flaps her cape as she parades the street.
  • It was done before the boy could see it.
  • They sang the same tunes at each party.
  • It was a bad error on the part of the new judge.
  • He sent the boy on a short errand.
  • She saw a cat in the neighbor's house.
  • She called his name many times.

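You might wonder how you would spot such a mistake if it didn't show up in the first few matches. A good technique is to create a few positive and negative examples and use them to test that your pattern works as expected.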

pos <- c("He is a boy", "She had a good time")
neg <- c("Shells come from the sea", "Hadley said 'It's a great day'")
pattern <- "^(She|He|It|They)\\b"
str_detect(pos, pattern)
#> [1] TRUE TRUE
str_detect(neg, pattern)
#> [1] FALSE FALSE

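It's typically much easier to come up with positive examples than negative ones, because it takes a while before you're good enough with regular expressions to predict where your weaknesses are. Nevertheless, both kinds of example are useful: even if you don't get them right at once, you can slowly accumulate them as you work on the problem. If you later get more into programming and learn about unit tests, you can turn these examples into automated tests that ensure you never make the same mistake twice.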
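
As a rough sketch of what such an automated check could look like, the positive and negative examples can be wrapped in a small function. The helper name check_pattern() is hypothetical and not part of stringr; it only combines str_detect() with base R's stopifnot().

```r
library(stringr)

# Hypothetical helper: errors unless `pattern` matches every positive
# example and none of the negative examples.
check_pattern <- function(pattern, pos, neg) {
  stopifnot(
    all(str_detect(pos, pattern)),  # every positive example must match
    !any(str_detect(neg, pattern))  # no negative example may match
  )
  invisible(TRUE)
}

pos <- c("He is a boy", "She had a good time")
neg <- c("Shells come from the sea", "Hadley said 'It's a great day'")

# The parenthesised pattern passes silently.
check_pattern("^(She|He|It|They)\\b", pos, neg)

# The pattern without parentheses fails, because it matches both
# negative examples ("She" in "Shells", and the unanchored "It").
bad <- tryCatch(
  check_pattern("^She|He|It|They\\b", pos, neg),
  error = function(e) "failed"
)
bad
#> [1] "failed"
```

Every time a pattern misbehaves, the offending string can be added to pos or neg so the same mistake is caught the next time.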

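20.5.2 Boolean operations

Imagine we want to find words that contain only consonants. One technique is to create a character class that contains all letters except the vowels ([^aeiou]), allow it to match any number of letters ([^aeiou]+), and then force it to match the whole string by anchoring to the beginning and the end (^[^aeiou]+$):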

str_view(words, "^[^aeiou]+$", match = TRUE)
  • by
  • dry
  • fly
  • mrs
  • try
  • why

But you can make this problem a bit easier by flipping it around. Instead of looking for words that contain only consonants, we can look for words that don't contain any vowels:

words[!str_detect(words, "[aeiou]")]
#> [1] "by"  "dry" "fly" "mrs" "try" "why"

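This is a useful technique whenever you're dealing with logical combinations, particularly those involving “and” or “not”. For example, imagine you want to find all words that contain both “a” and “b”. There's no “and” operator built into regular expressions, so we have to tackle it by looking for all words that contain an “a” followed by a “b”, or a “b” followed by an “a”: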

words[str_detect(words, "a.*b|b.*a")]
#>  [1] "able"      "about"     "absolute"  "available" "baby"      "back"     
#>  [7] "bad"       "bag"       "balance"   "ball"      "bank"      "bar"      
#> [13] "base"      "basis"     "bear"      "beat"      "beauty"    "because"  
#> [19] "black"     "board"     "boat"      "break"     "brilliant" "britain"  
#> [25] "debate"    "husband"   "labour"    "maybe"     "probable"  "table"

It's simpler to combine the results of two calls to str_detect():

words[str_detect(words, "a") & str_detect(words, "b")]
#>  [1] "able"      "about"     "absolute"  "available" "baby"      "back"     
#>  [7] "bad"       "bag"       "balance"   "ball"      "bank"      "bar"      
#> [13] "base"      "basis"     "bear"      "beat"      "beauty"    "because"  
#> [19] "black"     "board"     "boat"      "break"     "brilliant" "britain"  
#> [25] "debate"    "husband"   "labour"    "maybe"     "probable"  "table"

What if we wanted to see whether there is a word that contains all five vowels? If we did it with patterns, we'd need to generate 5! (120) different patterns, one for each ordering of the vowels:

words[str_detect(words, "a.*e.*i.*o.*u")]
#> character(0)
# ...
words[str_detect(words, "u.*o.*i.*e.*a")]
#> character(0)

It's much simpler to combine five calls to str_detect():

words[
  str_detect(words, "a") &
  str_detect(words, "e") &
  str_detect(words, "i") &
  str_detect(words, "o") &
  str_detect(words, "u")
]
#> character(0)

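In general, if you get stuck trying to create a single regular expression that solves your problem, take a step back and think about whether you could break the problem down into smaller pieces, solving each one before moving on to the next.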
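
The same idea can be sketched for any number of required letters. The helper contains_all() below is a hypothetical name used only for illustration, and the input vector is made up; it runs one str_detect() call per letter and combines the logical vectors with `&`.

```r
library(stringr)

# Hypothetical helper: TRUE where a string contains every element of
# `letters_needed`.
contains_all <- function(x, letters_needed) {
  # One logical vector per letter; fixed() treats each letter literally.
  hits <- lapply(letters_needed, function(l) str_detect(x, fixed(l)))
  # Combine them element-wise with "and".
  Reduce(`&`, hits)
}

x <- c("education", "banana", "sequoia", "sky")
contains_all(x, c("a", "e", "i", "o", "u"))
#> [1]  TRUE FALSE  TRUE FALSE
```

This keeps each individual pattern trivial, and the combination logic lives in ordinary R code rather than in the regular expression.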

20.5.3 Creating a pattern with code

What if we wanted to find all sentences that mention a color? The basic idea is simple: we combine alternation with word boundaries.

str_view(sentences, "\\b(red|green|blue)\\b", match = TRUE)
  • Glue the sheet to the dark blue background.
  • Two blue fish swam in the tank.
  • A wisp of cloud hung in the blue air.
  • The spot on the blotter was made by green ink.
  • The sofa cushion is red and of light weight.
  • The sky that morning was clear and bright blue.
  • A blue crane is a tall wading bird.
  • It is hard to erase blue or red ink.
  • The lamp shone with a steady green flame.
  • The box is held by a bright red snapper.
  • The houses are built of red clay bricks.
  • The red tape bound the smuggled food.
  • Hedge apples may stain your hands green.
  • The plant grew large and green in the window.
  • Bathe and relax in the cool green grass.
  • The lake sparkled in the red hot sun.
  • Mark the spot with a sign painted red.
  • The couch cover and hall drapes were blue.
  • A man in a blue sweater sat at the desk.
  • The small red neon lamp went out.
  • Paint the sockets in the wall dull green.
  • Wake and rise, and step into the green outdoors.
  • The green light in the brown box flickered.
  • The sky in the west is tinged with orange red.
  • The red paper brightened the dim stage.
  • The big red apple fell to the ground.

But it would be tedious to construct this pattern by hand. Wouldn't it be nice if we could store the colors in a vector?

rgb <- c("red", "green", "blue")

Well, we can! We just need to create the pattern from this vector using str_c() and str_flatten():

str_c("\\b(", str_flatten(rgb, "|"), ")\\b")
#> [1] "\\b(red|green|blue)\\b"

We could make this pattern more comprehensive if we had a better list of colors. One place to start is the list of built-in colors that R can use for plots:

colors()[1:50]
#>  [1] "white"          "aliceblue"      "antiquewhite"   "antiquewhite1" 
#>  [5] "antiquewhite2"  "antiquewhite3"  "antiquewhite4"  "aquamarine"    
#>  [9] "aquamarine1"    "aquamarine2"    "aquamarine3"    "aquamarine4"   
#> [13] "azure"          "azure1"         "azure2"         "azure3"        
#> [17] "azure4"         "beige"          "bisque"         "bisque1"       
#> [21] "bisque2"        "bisque3"        "bisque4"        "black"         
#> [25] "blanchedalmond" "blue"           "blue1"          "blue2"         
#> [29] "blue3"          "blue4"          "blueviolet"     "brown"         
#> [33] "brown1"         "brown2"         "brown3"         "brown4"        
#> [37] "burlywood"      "burlywood1"     "burlywood2"     "burlywood3"    
#> [41] "burlywood4"     "cadetblue"      "cadetblue1"     "cadetblue2"    
#> [45] "cadetblue3"     "cadetblue4"     "chartreuse"     "chartreuse1"   
#> [49] "chartreuse2"    "chartreuse3"

But first, let's eliminate the numbered variants:

cols <- colors()
cols <- cols[!str_detect(cols, "\\d")]
cols
#>   [1] "white"                "aliceblue"            "antiquewhite"        
#>   [4] "aquamarine"           "azure"                "beige"               
#>   [7] "bisque"               "black"                "blanchedalmond"      
#>  [10] "blue"                 "blueviolet"           "brown"               
#>  [13] "burlywood"            "cadetblue"            "chartreuse"          
#>  [16] "chocolate"            "coral"                "cornflowerblue"      
#>  [19] "cornsilk"             "cyan"                 "darkblue"            
#>  [22] "darkcyan"             "darkgoldenrod"        "darkgray"            
#>  [25] "darkgreen"            "darkgrey"             "darkkhaki"           
#>  [28] "darkmagenta"          "darkolivegreen"       "darkorange"          
#>  [31] "darkorchid"           "darkred"              "darksalmon"          
#>  [34] "darkseagreen"         "darkslateblue"        "darkslategray"       
#>  [37] "darkslategrey"        "darkturquoise"        "darkviolet"          
#>  [40] "deeppink"             "deepskyblue"          "dimgray"             
#>  [43] "dimgrey"              "dodgerblue"           "firebrick"           
#>  [46] "floralwhite"          "forestgreen"          "gainsboro"           
#>  [49] "ghostwhite"           "gold"                 "goldenrod"           
#>  [52] "gray"                 "green"                "greenyellow"         
#>  [55] "grey"                 "honeydew"             "hotpink"             
#>  [58] "indianred"            "ivory"                "khaki"               
#>  [61] "lavender"             "lavenderblush"        "lawngreen"           
#>  [64] "lemonchiffon"         "lightblue"            "lightcoral"          
#>  [67] "lightcyan"            "lightgoldenrod"       "lightgoldenrodyellow"
#>  [70] "lightgray"            "lightgreen"           "lightgrey"           
#>  [73] "lightpink"            "lightsalmon"          "lightseagreen"       
#>  [76] "lightskyblue"         "lightslateblue"       "lightslategray"      
#>  [79] "lightslategrey"       "lightsteelblue"       "lightyellow"         
#>  [82] "limegreen"            "linen"                "magenta"             
#>  [85] "maroon"               "mediumaquamarine"     "mediumblue"          
#>  [88] "mediumorchid"         "mediumpurple"         "mediumseagreen"      
#>  [91] "mediumslateblue"      "mediumspringgreen"    "mediumturquoise"     
#>  [94] "mediumvioletred"      "midnightblue"         "mintcream"           
#>  [97] "mistyrose"            "moccasin"             "navajowhite"         
#> [100] "navy"                 "navyblue"             "oldlace"             
#> [103] "olivedrab"            "orange"               "orangered"           
#> [106] "orchid"               "palegoldenrod"        "palegreen"           
#> [109] "paleturquoise"        "palevioletred"        "papayawhip"          
#> [112] "peachpuff"            "peru"                 "pink"                
#> [115] "plum"                 "powderblue"           "purple"              
#> [118] "red"                  "rosybrown"            "royalblue"           
#> [121] "saddlebrown"          "salmon"               "sandybrown"          
#> [124] "seagreen"             "seashell"             "sienna"              
#> [127] "skyblue"              "slateblue"            "slategray"           
#> [130] "slategrey"            "snow"                 "springgreen"         
#> [133] "steelblue"            "tan"                  "thistle"             
#> [136] "tomato"               "turquoise"            "violet"              
#> [139] "violetred"            "wheat"                "whitesmoke"          
#> [142] "yellow"               "yellowgreen"

Then we can turn this into one giant pattern:

pattern <- str_c("\\b(", str_flatten(cols, "|"), ")\\b")
str_view(sentences, pattern, match = TRUE)
  • Glue the sheet to the dark blue background.
  • A rod is used to catch pink salmon.
  • Two blue fish swam in the tank.
  • Cars and buses stalled in snow drifts.
  • A wisp of cloud hung in the blue air.
  • Leaves turn brown and yellow in the fall.
  • The spot on the blotter was made by green ink.
  • Mud was spattered on the front of his white shirt.
  • The sofa cushion is red and of light weight.
  • The office paint was a dull sad tan.
  • The sky that morning was clear and bright blue.
  • The brown house was on fire to the attic.
  • A blue crane is a tall wading bird.
  • It is hard to erase blue or red ink.
  • A pencil with black lead writes best.
  • The lamp shone with a steady green flame.
  • Slash the gold cloth into fine ribbons.
  • They floated on the raft to sun their white backs.
  • The birch looked stark white and lonesome.
  • The box is held by a bright red snapper.
  • The houses are built of red clay bricks.
  • Tea served from the brown jug is tasty.
  • The red tape bound the smuggled food.
  • Look in the corner to find the tan shirt.
  • Hedge apples may stain your hands green.
  • The gold ring fits only a pierced ear.
  • The node on the stalk of wheat grew daily.
  • The plant grew large and green in the window.
  • The purple tie was ten years old.
  • The crunch of feet in the snow was the only sound.
  • Bathe and relax in the cool green grass.
  • A ripe plum is fit for a king's palate.
  • Feed the white mouse some flower seeds.
  • The lake sparkled in the red hot sun.
  • Mark the spot with a sign painted red.
  • A sash of gold silk will trim her dress.
  • Draw the chart with heavy black lines.
  • The vamp of the shoe had a gold buckle.
  • A gray mare walked before the colt.
  • The desk and both chairs were painted tan.
  • The couch cover and hall drapes were blue.
  • A man in a blue sweater sat at the desk.
  • She blushed when he gave her a white orchid.
  • The black trunk fell from the landing.
  • A thick coat of black paint covered all.
  • The small red neon lamp went out.
  • A brown leather bag hung from its strap.
  • A white silk jacket goes with any shoes.
  • Paint the sockets in the wall dull green.
  • Wake and rise, and step into the green outdoors.
  • The green light in the brown box flickered.
  • Jerk the cord, and out tumbles the gold.
  • The farmer swapped his horse for a brown ox.
  • Tear a thin sheet from the yellow pad.
  • The sky in the west is tinged with orange red.
  • The gold vase is both rare and costly.
  • Dots of light betrayed the black cat.
  • The red paper brightened the dim stage.
  • Dig deep in the earth for pirate's gold.
  • The hail pattered on the burnt brown grass.
  • The big red apple fell to the ground.
  • A gold ring will please most any girl.
  • A pink shell was found on the sandy beach.

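In this example, cols contains only letters (the numbered variants were removed above), so you don't need to worry about metacharacters. In general, however, whenever you create a pattern from existing strings, it's wise to run them through str_escape(), which automatically adds \ in front of special characters.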
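
A minimal sketch of why escaping matters, assuming a stringr version that provides str_escape() (1.5.0 or later). The search terms are made up for illustration and deliberately contain the metacharacters . and +.

```r
library(stringr)

# Made-up search terms that contain regex metacharacters.
terms <- c("a.b", "c+d")

# Without escaping, "." matches any character, so "axb" is a false positive.
unsafe <- str_c("(", str_flatten(terms, "|"), ")")
str_detect("axb", unsafe)
#> [1] TRUE

# With str_escape(), each term is matched literally.
safe <- str_c("(", str_flatten(str_escape(terms), "|"), ")")
str_detect(c("axb", "a.b", "c+d"), safe)
#> [1] FALSE  TRUE  TRUE
```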

20.5.4 Exercises

  1. Construct patterns to find evidence for and against the rule “i before e except after c”.
  2. colors() contains a number of modifiers like “lightgray” and “darkblue”. How could you automatically identify these modifiers? (Think about how you might detect and remove what is being modified).
  3. Create a regular expression that finds any use of a base R dataset. You can get a list of these datasets via a special use of the data() function: data(package = "datasets")$results[, "Item"]. Note that a number of old datasets are individual vectors; these contain the name of the grouping “data frame” in parentheses, so you’ll need to also strip these off.

20.6 Grouping and capturing

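As you learned earlier, parentheses are an important tool for controlling the order of operations in regular expressions, just as in mathematics. But they also have an important additional effect: they create capturing groups that let you use sub-components of the match. There are three main ways to use them:

  • To match a repeated pattern.
  • To include a matched pattern in the replacement.
  • To extract individual components of the match.

There's also a special form of parentheses that only affects operator precedence without creating a capturing group. All of these are described below.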

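20.6.1 Matching a repeated pattern

You can refer back to the text matched inside parentheses by using a back reference. Back references are usually numbered: \1 refers to the match contained in the first pair of parentheses, \2 to the second pair, and so on. For example, the following pattern finds all fruits that have a repeated pair of letters: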

str_view(fruit, "(..)\\1", match = TRUE)
  • banana
  • coconut
  • cucumber
  • jujube
  • papaya
  • salal berry

And this pattern finds all words that start and end with the same pair of letters:

str_view(words, "^(..).*\\1$", match = TRUE)
  • church
  • decide
  • photograph
  • require
  • sense
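
To see exactly which part of a string a back reference matched, str_extract() can be used alongside str_detect(). This is a small sketch on a made-up vector; it uses the same (..)\1 pattern as above.

```r
library(stringr)

x <- c("banana", "apple", "papaya")

# Does the string contain any pair of letters repeated immediately?
str_detect(x, "(..)\\1")
#> [1]  TRUE FALSE  TRUE

# Which text did the whole pattern (the pair plus its repeat) match?
str_extract(x, "(..)\\1")
#> [1] "anan" NA     "papa"
```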

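20.6.2 Replacing with the matched pattern

You can also use back references when replacing with str_replace() and str_replace_all(). The following code switches the order of the second and third words in sentences: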

sentences %>% 
  str_replace("(\\w+) (\\w+) (\\w+)", "\\1 \\3 \\2") %>% 
  head(5)
#> [1] "The canoe birch slid on the smooth planks." 
#> [2] "Glue sheet the to the dark blue background."
#> [3] "It's to easy tell the depth of a well."     
#> [4] "These a days chicken leg is a rare dish."   
#> [5] "Rice often is served in round bowls."
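
The same mechanism works on any text where the pieces need to be reordered. As a small sketch on made-up input, this swaps “last, first” names into “first last” order:

```r
library(stringr)

# Made-up example input in "last, first" form.
x <- c("Lovelace, Ada", "Turing, Alan")

# \\1 is the first group (last name), \\2 the second (first name).
str_replace(x, "(\\w+), (\\w+)", "\\2 \\1")
#> [1] "Ada Lovelace" "Alan Turing"
```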

You'll sometimes see people using str_replace() to extract a single match:

pattern <- "^.*the ([^ .,]+).*$"
sentences %>% 
  str_subset(pattern) %>% 
  str_replace(pattern, "\\1") %>% 
  head(10)
#>  [1] "smooth"  "dark"    "depth"   "parked"  "sun"     "clear"   "ball"   
#>  [8] "woman"   "evening" "man's"

But we think you're generally better off using str_match() or tidyr::separate_groups(), which you'll learn about next.

20.6.3 Extracting groups

stringr provides a lower-level function for extracting matches called str_match(). But it returns a matrix, so it isn't as easy to work with:

sentences %>% 
  str_match("the (\\w+) (\\w+)") %>% 
  head()
#>      [,1]                [,2]     [,3]    
#> [1,] "the smooth planks" "smooth" "planks"
#> [2,] "the sheet to"      "sheet"  "to"    
#> [3,] "the depth of"      "depth"  "of"    
#> [4,] NA                  NA       NA      
#> [5,] NA                  NA       NA      
#> [6,] NA                  NA       NA

Instead, we recommend using tidyr's separate_groups(), which creates a column for each capturing group.

20.6.4 Named groups
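
If you only have stringr and base R at hand, one way to make the matrix easier to work with is to name its columns and convert it to a data frame. This is a minimal sketch, not the tidyr approach recommended above; the input strings and the column names match, word1, and word2 are made up for illustration.

```r
library(stringr)

x <- c("on the smooth planks", "no match here", "to the dark blue background")

m <- str_match(x, "the (\\w+) (\\w+)")

# Column 1 is the full match; the remaining columns are the groups.
colnames(m) <- c("match", "word1", "word2")
df <- as.data.frame(m, stringsAsFactors = FALSE)

df$word1
#> [1] "smooth" NA       "dark"
df$word2
#> [1] "planks" NA       "blue"
```

Rows that don't match are kept as NA, so the result stays aligned with the input vector.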

If you have many groups, referring to them by position can get confusing. You can give a group a name with (?<name>…) and refer back to it with \k<name>.

str_view(words, "^(?<first>.).*\\k<first>$", match = TRUE)
  • america
  • area
  • dad
  • dead
  • depend
  • educate
  • else
  • encourage
  • engine
  • europe
  • evidence
  • example
  • excuse
  • exercise
  • expense
  • experience
  • eye
  • health
  • high
  • knock
  • level
  • local
  • nation
  • non
  • rather
  • refer
  • remember
  • serious
  • stairs
  • test
  • tonight
  • transport
  • treat
  • trust
  • window
  • yesterday

This feature also works well with comments = TRUE:

pattern <- regex(
  r"(
    ^           # start at the beginning of the string
    (?<first>.) # and match the <first> letter
    .*          # then match any other letters
    \k<first>$  # ensuring the last letter is the same as the <first>
  )", 
  comments = TRUE
)

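You can also use named groups as an alternative to the col_names argument to tidyr::separate_groups().

20.6.5 Non-capturing groups

Occasionally, you'll want to use parentheses without creating a matching group. You can create a non-capturing group with (?:).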

x <- c("a gray cat", "a grey dog")
str_match(x, "(gr(e|a)y)")
#>      [,1]   [,2]   [,3]
#> [1,] "gray" "gray" "a" 
#> [2,] "grey" "grey" "e"
str_match(x, "(gr(?:e|a)y)")
#>      [,1]   [,2]  
#> [1,] "gray" "gray"
#> [2,] "grey" "grey"

Typically, however, you'll find it easier to just ignore that result by setting the col_name to NA.

20.6.6 Exercises

  1. Describe, in words, what these expressions will match:

    1. (.)\1\1
    2. "(.)(.)\\2\\1"
    3. (..)\1
    4. "(.).\\1.\\1"
    5. "(.)(.)(.).*\\3\\2\\1"
  2. Construct regular expressions to match words that:

    1. Whose first letter is the same as the last letter, and whose second letter is the same as the second to last letter.
    2. Contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s.)

20.7 Flags

There are a number of settings, called flags, that you can use to control some of the details of the pattern language. In stringr, you supply them by passing an object created by regex() as the pattern instead of a plain string:

# The regular call:
str_view(fruit, "nana")
# Is shorthand for
str_view(fruit, regex("nana"))

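This is useful because it lets you pass additional arguments that control the details of the match. The most useful flag is probably ignore_case = TRUE, because it allows characters to match either their uppercase or lowercase forms: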

bananas <- c("banana", "Banana", "BANANA")
str_view(bananas, "banana")
  • banana
  • Banana
  • BANANA
str_view(bananas, regex("banana", ignore_case = TRUE))
  • banana
  • Banana
  • BANANA
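
Because the printed output above no longer shows which parts were highlighted, a quick check with str_detect() makes the difference explicit. This sketch reuses the same bananas vector:

```r
library(stringr)

bananas <- c("banana", "Banana", "BANANA")

# Case-sensitive by default: only the all-lowercase string matches.
str_detect(bananas, "banana")
#> [1]  TRUE FALSE FALSE

# With ignore_case = TRUE, all three match.
str_detect(bananas, regex("banana", ignore_case = TRUE))
#> [1] TRUE TRUE TRUE
```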

If you're doing a lot of work with multiline strings (i.e. strings that contain \n), multiline and dotall can also be useful. dotall = TRUE allows . to match everything, including \n:

x <- "Line 1\nLine 2\nLine 3"
str_view_all(x, ".L")
  • Line 1
    Line 2
    Line 3
str_view_all(x, regex(".L", dotall = TRUE))
  • Line 1
    Line 2
    Line 3

And multiline = TRUE makes ^ and $ match the start and end of each line rather than the start and end of the complete string:

x <- "Line 1\nLine 2\nLine 3"
str_view_all(x, "^Line")
  • Line 1
    Line 2
    Line 3
str_view_all(x, regex("^Line", multiline = TRUE))
  • Line 1
    Line 2
    Line 3
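
Counting matches with str_count() is a simple way to confirm what each flag changes, since the highlighting is not visible in the output above. A small sketch using the same string:

```r
library(stringr)

x <- "Line 1\nLine 2\nLine 3"

# By default "." does not match "\n", so no "L" has a matchable
# character in front of it.
str_count(x, ".L")
#> [1] 0
# With dotall = TRUE the two newlines before "Line 2" and "Line 3" match.
str_count(x, regex(".L", dotall = TRUE))
#> [1] 2

# By default "^" only matches at the very start of the string.
str_count(x, "^Line")
#> [1] 1
# With multiline = TRUE it matches at the start of every line.
str_count(x, regex("^Line", multiline = TRUE))
#> [1] 3
```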

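If you're writing a complicated regular expression and you're worried you might not understand it in the future, comments = TRUE can be extremely useful. It lets you use comments and whitespace to make complex regular expressions more understandable. Spaces and newlines are ignored, as is everything after #. (Note that we use a raw string here to minimize the number of escapes needed.)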

phone <- regex(r"(
  \(?     # optional opening parens
  (\d{3}) # area code
  [) -]?  # optional closing parens, space, or dash
  (\d{3}) # another three numbers
  [ -]?   # optional space or dash
  (\d{4}) # four more numbers
  )", comments = TRUE)
str_match("514-791-8141", phone)
#>      [,1]           [,2]  [,3]  [,4]  
#> [1,] "514-791-8141" "514" "791" "8141"

If you're using comments and want to match a space, newline, or #, you'll need to escape it:

str_view("x x #", regex("x #", comments = TRUE))
  • x x #
str_view("x x #", regex(r"(x\ \#)", comments = TRUE))
  • x x #
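
Extracting the match shows what each version of the pattern actually matched. A brief sketch using the same input:

```r
library(stringr)

# Unescaped: the space is ignored and "#" starts a comment,
# so the effective pattern is just "x".
str_extract("x x #", regex("x #", comments = TRUE))
#> [1] "x"

# Escaped: the space and "#" are matched literally.
str_extract("x x #", regex(r"(x\ \#)", comments = TRUE))
#> [1] "x #"
```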