Chapter 10 文字與字串資料處理

傳統上, 統計人員較少直接處裡文字或字串資料, 多數時候是由資料管理人元處理後, 轉換成數值資料, 然後交由統計人員進行後續分析. 由於大數據時代來臨包含者多樣性的資料型態, 統計人員必須必須直接處裡文字或字串資料的機會也越來越多. {R} 內有許多函數可以處理文字型態的資料物件或文字資料 (Character Data), 常用之文字函式有 paste(), substr(), substring(), grep(), gsub(), strsplit() 等. {R} 套件 stringr 有更多處理文字或字串資料函式.

10.1 文字與字串資料基礎

輸入文字遠比數字複雜, 必須考慮大小寫, 空格或 Tab, 單引號或雙引號, 特殊符號與字元等等. {R} 輸入特殊符號的顯示與實際想要輸入的特殊符號有些差別.

st1 <- "This is a book"
st1
## [1] "This is a book"
st2 <- 'To include a double "quote" inside a string, use single quotes'
st2
## [1] "To include a double \"quote\" inside a string, use single quotes"
st3 <- "To include a single 'quote' inside a string, use double quotes"
st3
## [1] "To include a single 'quote' inside a string, use double quotes"
double_quote <- "\"" # or '"'
double_quote
## [1] "\""
single_quote <- '\'' # or "'"
single_quote
## [1] "'"

類似情形, 若要輸入反斜線 \, 則須輸入連續 2 個反斜線: \\.

backslash <- c("\\")
backslash
## [1] "\\"

{R} 輸入特殊符號反斜線 \ 的顯示 "\\" 與實際想要輸入的單一個反斜線有些差別`. 若要呈現實際想要輸入的特殊符號, 可使用函式 writeLines().

x.char <- c("\"", "\\")
x.char
## [1] "\"" "\\"
writeLines(x.char)
## "
## \

利用指令 ?'"' 或 ?"'" 可以得到特殊符號的輸入方式.

\n newline
\r carriage return
\t tab
\b backspace
\a alert (bell)
\f form feed
\v vertical tab
\\ backslash \
\' ASCII apostrophe '
\" ASCII quotation mark "
\ $ASCII grave accent (backtick)$
\nnn character with given octal code (1, 2 or 3 digits)
\xnn character with given hex code (1 or 2 hex digits)
\unnnn Unicode character with given code (1–4 hex digits)
\Unnnnnnnn Unicode character with given code (1–8 hex digits)

x.utf <- "\u00b5"
x.utf
## [1] "μ"

10.2 套件 stringr

{R} base 有許多函式處理文字或字串, 但函式的引數並不具有一致性, 容易混淆, 因此 tidyverse 系列的套件 stringr 內的函式都以 str_ 為起始, 例如, str_length() 回傳文字向量內的文字字數.

library(stringr)
str_length(c("a", "Biostatistics", "Medical Statistics", "\'\b\t", NA))
## [1]  1 13 18  3 NA

10.3 合併文字或字串 str_c()

函式 str_c() 可以合併文字或字串, 類似 {R} base 函式 paste(). 使用引數 sep 設定合併的中間字元.

str_c("medical", "statistics")
## [1] "medicalstatistics"
str_c("medical", "statistics", sep = " ")
## [1] "medical statistics"
str_c("medical", "statistics", sep = "-")
## [1] "medical-statistics"
str_c("medical", "statistics", sep = " + ")
## [1] "medical + statistics"
str_c("|-", "medical", "statistics", "-|")
## [1] "|-medicalstatistics-|"

若是遇到缺失值 NA, 則仍回傳 NA, 若要改變遇到缺失值 NA, 回傳列印 NA, 可以使用加用函式 str_replace_na().

x.char <- c("bio", NA, "statistics")
str_c("pre-", x.char, "-end")
## [1] "pre-bio-end"        NA                   "pre-statistics-end"
str_c("pre-", str_replace_na(x.char), "-end")
## [1] "pre-bio-end"        "pre-NA-end"         "pre-statistics-end"

若要合併 2 個字串向量為單一字串, 可以使用引數 collpse.

char.vec <- c("I", "love", "biostatistics")
str_c(char.vec, collapse = ", ")
## [1] "I, love, biostatistics"
str_c(char.vec, collapse = "+")
## [1] "I+love+biostatistics"
str_c(char.vec, collapse = " ")
## [1] "I love biostatistics"
str_c(char.vec, sep = " ")
## [1] "I"             "love"          "biostatistics"
str_c("I", "love", "biostatistics", sep = " ")
## [1] "I love biostatistics"

10.4 取出文字或字串向量中的部分元素 str_sub()

函式 str_sub() 可以取出取出文字或字串向量中元素的部分文字.

str_sub(string, start = 1L, end = -1L)

引數 start 與 end 分別為向量中元素內文字之起始位置與結束位置. 即使向量長度不足仍會回傳.

char.vec <- c("I", "love", "medical", "statistics")
str_sub(char.vec, start = 1, end = 3)
## [1] "I"   "lov" "med" "sta"

10.5 語言設定地區與文字大小寫排序

不同地區的文字, 可能有類似文字, 在大小寫轉換實常會出現轉換錯誤, 若要確保大小寫轉換或排序正確, 套件 stringr 內的函式可改設定 {R} 程式使用文字的地區. 例如, 大小寫轉換函式 str_to_lower(),str_to_upper()或str_to_title()` 的使用.

char.vec <- c("I", "Love", "Medical", "Statistics")
str_to_upper(char.vec)
## [1] "I"          "LOVE"       "MEDICAL"    "STATISTICS"
str_to_lower(char.vec)
## [1] "i"          "love"       "medical"    "statistics"
str_to_title(str_to_upper(char.vec))
## [1] "I"          "Love"       "Medical"    "Statistics"
str_to_title(str_to_lower(char.vec))
## [1] "I"          "Love"       "Medical"    "Statistics"

{R} base 函式 sort() 與 order() 定 {R} 程式登入使用文字的地區

套件 stringr 內的函式 str_sort() 與 str_order(), 可以使用引數locale` 設定使用文字的地區.

veg.vec <- c("apple", "eggplant", "banana")
sort(veg.vec)
## [1] "apple"    "banana"   "eggplant"
order(veg.vec)
## [1] 1 3 2
str_sort(veg.vec, locale = "en")  # English
## [1] "apple"    "banana"   "eggplant"
str_sort(veg.vec, locale = "haw") # Hawaiian
## [1] "apple"    "eggplant" "banana"

10.6 移除空白, 加入空白, 截斷文字 `str_trim()` 與 `str_pad()`

套件 stringr 內的函式 str_trim() 與 str_pad() 可以對文字或字串向量內的首尾之空白 (white space) 移除, 或是加入.

str_trim(string, side = c("both", "left", "right"))
str_pad(string, width, side = c("left", "right", "both"), pad = " ")
str_trunc(string, width, side = c("right", "left", "center"), ellipsis = "...")

引數 both, left, right 分別處理在首尾二端, 左端, 右端之空白. width 為加入空白後字串的長度, pad 為替代加入空白的文字或符號.

veg.vec <- c("apple   ", "   eggplant   ", "   banana")
str_trim(veg.vec, side = c("both"))
## [1] "apple"    "eggplant" "banana"
str_trim(veg.vec, side = c("left"))
## [1] "apple   "    "eggplant   " "banana"
str_trim(veg.vec, side = c("right"))
## [1] "apple"       "   eggplant" "   banana"
veg.vec <- c("apple ", " eggplant ", " banana")
str_pad("a", width = 15, side = c("both"), pad = " ")
## [1] "       a       "
str_pad("a", width = 15, side = c("both"), pad = c("_"))
## [1] "_______a_______"
str_pad(veg.vec, width = 15, side = c("both"))
## [1] "    apple      " "   eggplant    " "     banana    "
str_pad(veg.vec, width = 15, side = c("left"))
## [1] "         apple " "      eggplant " "         banana"
str_pad(veg.vec, width = 15, side = c("right"))
## [1] "apple          " " eggplant      " " banana        "
str_pad(veg.vec, width = 15, side = c("both"), pad = c("_"))
## [1] "____apple _____" "__ eggplant ___" "____ banana____"
char.vec <- c("I love biostatistics")
str_trunc(char.vec, width = 10, side = c("center"))
## [1] "I lo...ics"
str_trunc(char.vec, width = 10, side = c("left"))
## [1] "...tistics"
str_trunc(char.vec, width = 10, side = c("right"))
## [1] "I love ..."

10.7 尋找特定形式文字或字串

文字或字串處理中一項重要的工作是尋找特定形式文字或字串 (pattern), 然後進行 detect (偵測), locate (確認位置), extract (取出), match (配對), replace (替代置換) 與 split (分割).

10.7.1 偵測函式 `str_detect()`

套件 stringr 內的函式 str_detect() 偵測字串向量是否包含特定形式文字, 回傳邏輯向量. 這與 {R} base 函式 grep(pattern, x) 類似. 函式 str_count() 計算字串內配對成功的次數.

str_detect(string, pattern, negate = FALSE)
str_count(string, pattern = "")

引數 pattern 定義所要尋找特定形式的文字, 若 negate = TRUE 同時回傳沒有配對成功的邏輯向量.

char.vec <- c("statistics", "biostatistics", 
              "probability", "distribution")
str_detect(char.vec, pattern = "statistics", negate = FALSE)
## [1]  TRUE  TRUE FALSE FALSE
str_detect(char.vec, pattern = "statistics", negate = TRUE)
## [1] FALSE FALSE  TRUE  TRUE
str_detect(char.vec, pattern = "ti", negate = FALSE)
## [1]  TRUE  TRUE FALSE  TRUE
str_detect(char.vec, pattern = "function", negate = FALSE)
## [1] FALSE FALSE FALSE FALSE
#
str_count(char.vec, pattern = "ti")
## [1] 2 2 0 1
str_count(char.vec, pattern = "b")
## [1] 0 1 2 1

10.7.2 確認位置函式 `str_detect()`

函式 str_locate() 尋找配對成功的字串之第 1 次位置, 回傳矩陣, 包含起始以末端的位置. 這與 {R} base 函式 regexpr() 與 gregexpr() 類似.

str_locate(string, pattern)
str_locate_all(string, pattern)

另外函式 str_locate_all() 尋找配對成功的字串之所有位置, 回傳列表.

char.vec <- c("statistics", "biostatistics", 
              "probability", "distribution")
str_locate(char.vec, pattern = "ti")
##      start end
## [1,]     4   5
## [2,]     7   8
## [3,]    NA  NA
## [4,]     9  10
str_locate_all(char.vec, pattern = "ti")
## [[1]]
##      start end
## [1,]     4   5
## [2,]     7   8
## 
## [[2]]
##      start end
## [1,]     7   8
## [2,]    10  11
## 
## [[3]]
##      start end
## 
## [[4]]
##      start end
## [1,]     9  10

10.7.3 確認索引函式 `str_subset()` 與 `str_which()`

函式 str_subset() 尋找字串向量內配對成功的之第 1 次的元素內容, 而函式 str_which() 尋找字串向量內配對成功的之第 1 次索引 (index).

str_subset(string, pattern, negate = FALSE)
str_which(string, pattern, negate = FALSE)

若引數 negate = TRUE 回傳沒有配對成功的元素內容或索引. 函式 str_subset() 與函式 x[str_detect(x, pattern)] 類似功能, 等同於 R base 函式 grep(pattern, x, value = TRUE). 而函式 str_which() 與函式 which(str_detect(x, pattern)) 類似功能, 等同於 R base 函式 grep(pattern, x), 如同函式 str_detect() 同於 R base 函式 grepl(pattern, x).

char.vec <- c("statistics", "biostatistics", 
              "probability", "distribution")
str_subset(char.vec, pattern = "ti")
## [1] "statistics"    "biostatistics" "distribution"
str_which(char.vec, pattern = "ti")
## [1] 1 2 4

10.7.4 取出函式 `str_extract()`

函式 str_extract() 尋找配對成功的字串之第 1 次位置, 回傳字串向量.

str_extract(string, pattern)
str_extract_all(string, pattern, simplify = FALSE)

另外函式 str_extract_all() 尋找配對成功的字串之所有位置, 回傳所有字串向量形成列表. 引數 simplify = TRUE 簡化成文字矩陣.

char.vec <- c("statistics", "biostatistics", 
              "probability", "distribution")
str_extract(char.vec, pattern = "ti")
## [1] "ti" "ti" NA   "ti"
str_extract_all(char.vec, pattern = "ti")
## [[1]]
## [1] "ti" "ti"
## 
## [[2]]
## [1] "ti" "ti"
## 
## [[3]]
## character(0)
## 
## [[4]]
## [1] "ti"

10.7.5 配對函式 `str_match()`

函式 str_match() 使用在群組尋找特定形式文字或字串, 若尋到找配對成功的字串之第 1 次位置, 回傳文字矩陣,第一欄位為完全配對成功的文字, 其餘欄位為群組內個別配對成功的文字.

str_match(string, pattern)
str_match_all(string, pattern)

另外函式 str_match_all() 尋找配對成功的字串之所有位置.

char.vec <- c("statistics", "biostatistics", 
              "probability", "distribution")
str_match(char.vec, pattern = "(a|ti)")
##      [,1] [,2]
## [1,] "a"  "a" 
## [2,] "a"  "a" 
## [3,] "a"  "a" 
## [4,] "ti" "ti"
str_match_all(char.vec, pattern = "(a|ti)")
## [[1]]
##      [,1] [,2]
## [1,] "a"  "a" 
## [2,] "ti" "ti"
## [3,] "ti" "ti"
## 
## [[2]]
##      [,1] [,2]
## [1,] "a"  "a" 
## [2,] "ti" "ti"
## [3,] "ti" "ti"
## 
## [[3]]
##      [,1] [,2]
## [1,] "a"  "a" 
## 
## [[4]]
##      [,1] [,2]
## [1,] "ti" "ti"

10.7.6 替代置換函式 `str_replace()`

函式 str_match() 使用在群組尋找特定形式文字或字串, 若尋找到配對成功的字串之第 1 次位置, 則使用其他特定字串替代置換.

str_replace(string, pattern, replacement)
str_replace_all(string, pattern, replacement)

引數 replacement 設定新的替代字串置換原有尋找特定形式文字或字串. 另外函式 str_replace_all() 尋找配對成功的字串之所有位置, 同時使用其他特定字串替代置換.

char.vec <- c("statistics", "biostatistics", 
              "probability", "distribution")
str_replace(char.vec, pattern = "ti", replacement = "--")
## [1] "sta--stics"    "biosta--stics" "probability"   "distribu--on"
str_replace_all(char.vec, pattern = "b", replacement = "+++")
## [1] "statistics"      "+++iostatistics" "pro+++a+++ility" "distri+++ution"

10.7.7 分割函式 `str_split()`

函式 str_split() 使用在群組尋找特定形式文字或字串, 若尋找到配對成功的字串之第 1 次位置, 則從特定形式文字或字串分割字串向量, 回傳分割結果為列表物件.

str_split(string, pattern, n = Inf, simplify = FALSE)
str_split_fixed(string, pattern, n)
str_split_n(string, pattern, n)

其中引數 n 設定回傳物件的數目, simplify = TRUE 回傳物件簡化成文字矩陣. 另外函式 str_split_fixed() 回傳物件簡化成文字矩陣且欄位 (column) 數目為 n. str_split_n() 回傳物件簡化成文字向量, 長度為 n.

char.vec <- c("a b c", "d e", "bio-statistics required-courses")
str_split(char.vec, pattern = " ", n = Inf, simplify = FALSE)
## [[1]]
## [1] "a" "b" "c"
## 
## [[2]]
## [1] "d" "e"
## 
## [[3]]
## [1] "bio-statistics"   "required-courses"
str_split(char.vec, pattern = " ", n = Inf, simplify = TRUE)
##      [,1]             [,2]               [,3]
## [1,] "a"              "b"                "c" 
## [2,] "d"              "e"                ""  
## [3,] "bio-statistics" "required-courses" ""
str_split_fixed(char.vec, pattern = " ", n = 2)
##      [,1]             [,2]              
## [1,] "a"              "b c"             
## [2,] "d"              "e"               
## [3,] "bio-statistics" "required-courses"
str_split_fixed(char.vec, pattern = "-", n = 2)
##      [,1]    [,2]                         
## [1,] "a b c" ""                           
## [2,] "d e"   ""                           
## [3,] "bio"   "statistics required-courses"

10.8 群組尋找特定形式的文字與字串

有些時候在尋找特定形式的文字與字串, 須要尋找不只一種特定的形式, 此時須藉由 alternate, anchor 與 look around 概念處理. 例如, 同時尋找 b 或 ti, 可以輸入 b|ti.

char.vec <- c("statistics", "biostatistics", 
              "probability", "distribution")
str_replace(char.vec, pattern = "b|ti", replacement = "--")
## [1] "sta--stics"     "--iostatistics" "pro--ability"   "distri--ution"
str_replace_all(char.vec, pattern = "b|ti", replacement = "+++")
## [1] "sta+++s+++cs"      "+++iosta+++s+++cs" "pro+++a+++ility"   "distri+++u+++on"

anchor 起始符號 ^ 可以尋找字串的起始具有特定形式, 尾端符號 $ 可以尋找字串的尾端具有特定形式. 例如, ^b, 尋找字串的起始具有 b, 或 n$, 尋找字串的尾端具 n.

char.vec <- c("statistics", "biostatistics", 
              "probability", "distribution")
str_replace(char.vec, pattern = "^b", replacement = "--")
## [1] "statistics"     "--iostatistics" "probability"    "distribution"
str_replace_all(char.vec, pattern = "n$", replacement = "+++")
## [1] "statistics"     "biostatistics"  "probability"    "distributio+++"

有些時候需要尋找字串前後具有特定形式的文字與字串, 例如, 尋找在 ti 之前的字元, 在 p 之後的字元等等. 使用小括號 () 代表特定形式的前後順序. 輸入 a(?=c) 表示在 a 之後有 c 字元, 輸入 a(?!c) 表示在 a 之後無 c 字元, 輸入 (?<=b)a 表示在 a 之前有 b 字元, 輸入 (?<!b)a 表示在 a 之前無 b 字元.

char.vec <- c("statistics", "biostatistics", 
              "probability", "distribution")
str_replace(char.vec, pattern = "t(?=i)", replacement = "--")
## [1] "sta--istics"    "biosta--istics" "probability"    "distribu--ion"
str_replace(char.vec, pattern = "t(?!i)", replacement = "--")
## [1] "s--atistics"    "bios--atistics" "probabili--y"   "dis--ribution"
str_replace(char.vec, pattern = "(?<=i)o", replacement = "--")
## [1] "statistics"     "bi--statistics" "probability"    "distributi--n"
str_replace_all(char.vec, pattern = "(?<=t)i", replacement = "--")
## [1] "stat--st--cs"    "biostat--st--cs" "probability"     "distribut--on"
str_replace(char.vec, pattern = "(?<!t)i", replacement = "--")
## [1] "statistics"     "b--ostatistics" "probab--lity"   "d--stribution"
str_replace_all(char.vec, pattern = "(?<!t)i", replacement = "--")
## [1] "statistics"     "b--ostatistics" "probab--l--ty"  "d--str--bution"

10.9 尋找連續重覆特定形式的文字與字串

一個字串可能不只一個特定形式的文字與字串連續重覆出現, 套件 stringr 尋找特定形式的文字與字串, 可以合併考量連續重覆出現次數. 其中 {} 內不可有空格. stringr 在群組 () 之後加上 \\1, \\2, … 等, 如同可以設定尋找連續重覆出現次數.

stringr 輸入	意義
a?	zero or one
a*	zero or more
a+	one or more
a{n}	exactly n
a{n, }	n or more
a{n, m}	between n and m

x.vec <- c(".a.aa.aaa.aaaa")
str_replace(x.vec, pattern = "a?", replacement = "-")
## [1] "-.a.aa.aaa.aaaa"
str_replace(x.vec, pattern = "a*", replacement = "-")
## [1] "-.a.aa.aaa.aaaa"
str_replace(x.vec, pattern = "a+", replacement = "-")
## [1] ".-.aa.aaa.aaaa"
str_replace(x.vec, pattern = "a{2}", replacement = "-")
## [1] ".a.-.aaa.aaaa"
str_replace(x.vec, pattern = "a{2,}", replacement = "-")
## [1] ".a.-.aaa.aaaa"
str_replace(x.vec, pattern = "a{2,3}", replacement = "-")
## [1] ".a.-.aaa.aaaa"
char.vec <- c("statistics", "biostatistics", 
              "probability", "distribution")
str_replace(char.vec, pattern = "i?", replacement = "-")
## [1] "-statistics"    "-biostatistics" "-probability"   "-distribution"
str_replace(char.vec, pattern = "i*", replacement = "-")
## [1] "-statistics"    "-biostatistics" "-probability"   "-distribution"
str_replace(char.vec, pattern = "i+", replacement = "-")
## [1] "stat-stics"    "b-ostatistics" "probab-lity"   "d-stribution"
str_replace(char.vec, pattern = "i{2}", replacement = "-")
## [1] "statistics"    "biostatistics" "probability"   "distribution"
str_replace(char.vec, pattern = "i{2}", replacement = "-")
## [1] "statistics"    "biostatistics" "probability"   "distribution"
str_replace(char.vec, pattern = "i{2,3}", replacement = "-")
## [1] "statistics"    "biostatistics" "probability"   "distribution"

10.10 正規表示文字與字串 (萬用字元)

{R} 尋找特定形式的文字與字串, 可以使用程式語言通用的正規表示 (regular expression), 在使用套件 stringr 輸入時有些差異, 以下表摘要說明.

stringr 輸入	正規表示	真實的文字與字串
\\.	\.	.
\\!	\!	!
\\?	\?	?
\\\\	\\	\
\\(	\(	(
\\)	\)	)
\\{	\{	{
\\}	\}	}
\\n	\n	new line (return)
\t	\t	tab
\\s	\s	any whitespace (\S for non-whitespaces)
\\d	\d	any digit (\D for non-digits)
\\w	\w	any word character (\W for non-word chars)
\\b	\b	word boundaries
\\k	\k	k = integer, repeated number
[:digit:]		digits
[:alpha:]		letters
[:lower:]		lowercase letters
[:upper:]		uppercase letters
[:alnum:]		letters and numbers
[:punct:]		punctuation
[:graph:]		letters, numbers, and punctuation
[:space:]		space characters (i.e. \s)
[:blank:]		space and tab (but not new line)
.		every character except a new line

Table 3: 正規表示文字與字串 (萬用字元)

char.vec <- c("statistics.123", "biostatistics.a.b.c", 
              "probability.a.c", "distribution.a c")
str_replace(char.vec, pattern = ".i.", replacement = "-")
## [1] "sta-tics.123"      "-statistics.a.b.c" "proba-ity.a.c"     "-tribution.a c"
str_replace_all(char.vec, pattern = ".i.", replacement = "-")
## [1] "sta--s.123"    "-sta--s.a.b.c" "proba-ity.a.c" "-t-u-n.a c"
str_replace(char.vec, pattern = "y\\.a", replacement = "-")
## [1] "statistics.123"      "biostatistics.a.b.c" "probabilit-.c"      
## [4] "distribution.a c"
str_replace(char.vec, pattern = "a[.]c", replacement = "-")
## [1] "statistics.123"      "biostatistics.a.b.c" "probability.-"      
## [4] "distribution.a c"
str_replace(char.vec, pattern = "a[ ]", replacement = "-")
## [1] "statistics.123"      "biostatistics.a.b.c" "probability.a.c"    
## [4] "distribution.-c"
str_replace(char.vec, pattern = "b[ab]+", replacement = "-")
## [1] "statistics.123"      "biostatistics.a.b.c" "pro-ility.a.c"      
## [4] "distribution.a c"
y.vec <- c("set", "sat", "sit", "sout")
str_replace(y.vec, pattern = "s(a|i)t", replacement = "-")
## [1] "set"  "-"    "-"    "sout"
fruits.vec <- c("banana", "coconut", "cucumber", "jujube", "papaya", "berry")
str_replace(fruits.vec, pattern = "(..)\\1", replacement = "-")
## [1] "b-a"   "-nut"  "-mber" "-be"   "-ya"   "berry"
str_replace(fruits.vec, pattern = "(.)(.)\\2\\1", replacement = "-")
## [1] "banana"   "coconut"  "cucumber" "jujube"   "papaya"   "berry"
z.vec <- c("3 house", "4 cars", "5 dogs")
str_replace_all(z.vec, c("3" = "three", "4" = "four", "5" = "five"))
## [1] "three house" "four cars"   "five dogs"
sent.vec <- sentences[1:5]
sent.vec
## [1] "The birch canoe slid on the smooth planks." 
## [2] "Glue the sheet to the dark blue background."
## [3] "It's easy to tell the depth of a well."     
## [4] "These days a chicken leg is a rare dish."   
## [5] "Rice is often served in round bowls."
sent.vec %>% str_subset(pattern = "(a|the) ([^ ]+)") %>%
  str_extract(pattern = "(a|the) ([^ ]+)")
## [1] "the smooth" "the sheet"  "the depth"  "a chicken"
sent.vec %>% str_subset(pattern = "(a|the) ([^ ]+)") %>%
  str_match(pattern = "(a|the) ([^ ]+)")
##      [,1]         [,2]  [,3]     
## [1,] "the smooth" "the" "smooth" 
## [2,] "the sheet"  "the" "sheet"  
## [3,] "the depth"  "the" "depth"  
## [4,] "a chicken"  "a"   "chicken"
sent.vec %>% 
  str_replace("([^ ]+) ([^ ]+) ([^ ]+)", "\\1 \\3 \\2")
## [1] "The canoe birch slid on the smooth planks." 
## [2] "Glue sheet the to the dark blue background."
## [3] "It's to easy tell the depth of a well."     
## [4] "These a days chicken leg is a rare dish."   
## [5] "Rice often is served in round bowls."

Chapter 10 文字與字串資料處理

10.1 文字與字串資料基礎

10.2 套件 stringr

10.3 合併文字或字串 str_c()

10.4 取出文字或字串向量中的部分元素 str_sub()

10.5 語言設定地區與文字大小寫排序

10.6 移除空白, 加入空白, 截斷文字 str_trim() 與 str_pad()

10.7 尋找特定形式文字或字串

10.7.1 偵測函式 str_detect()

10.7.2 確認位置函式 str_detect()

10.7.3 確認索引函式 str_subset() 與 str_which()

10.7.4 取出函式 str_extract()

10.7.5 配對函式 str_match()

10.7.6 替代置換函式 str_replace()

10.7.7 分割函式 str_split()