Chapter 5 Regular Expressions

Douglas Bates: If you really want to be cautious you could use an octal representation like sep="\\007" to get a character that is very unlikely to occur in a factor level.

Ed L. Cashin: I definitely want to be cautious. Instead of the bell character I think I’ll use the field separator character, "\\034", just because this is the first time I’ve been able to use it for it’s intended purpose! ;)

Douglas Bates: Yes, but with "\\034" you don’t get to make obscure James Bond references :-)

— Douglas Bates and Ed L. Cashin R-help (April 2004)

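For background, see the Wikipedia article on regular expressions, which is also a reasonable starting point for learning them.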

# "Toxic chicken soup" (cynical anti-inspirational quotes), a corpus for text analysis
# https://github.com/egotong/nows/blob/master/soul.sql

R has three built-in matching modes:

fixed = TRUE: literal (exact) matching.
perl = TRUE: Perl regular expressions.
fixed = FALSE, perl = FALSE: POSIX 1003.2 extended regular expressions (the default).
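
The three modes can be compared on a small vector. This is only a minimal sketch: the vector x and the patterns are made up for illustration, and the lookahead is used simply as an example of a Perl-style construct.

```r
# Hypothetical input, chosen so that the modes give different answers
x <- c("a.b", "axb", "ab")

# fixed = TRUE: "." is an ordinary character, so only "a.b" matches
m_fixed <- grepl(".", x, fixed = TRUE)
print(m_fixed)  # TRUE FALSE FALSE

# Default (extended) mode: "." is a metacharacter matching any one character
m_ere <- grepl("a.b", x)
print(m_ere)    # TRUE TRUE FALSE

# perl = TRUE: the lookahead (?=...) asks for an "a" followed by a literal dot
m_perl <- grepl("a(?=\\.)", x, perl = TRUE)
print(m_perl)   # TRUE FALSE FALSE
```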

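Do not tie yourself to a single solution. For cleaning data with regular expressions, Base R provides one set of functions and stringr provides another. There are also productivity tools such as the RStudio addin regexplain, and the RVerbalExpressions package, which helps you build regular expressions.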

A few terms need to be singled out and explained:

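literal character strings: strings that are matched character by character, exactly as written
metacharacters: characters that have a special meaning inside a pattern
extended regular expressions: referred to below as the default regular expressions
character class: a set of characters, for example [abc]
Perl-like regular expressions: referred to below as Perl regular expressions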
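
A short sketch may help tie these terms to behavior. The vector x is a made-up example; the calls use the default (extended) mode unless fixed = TRUE is given.

```r
# Hypothetical strings for illustrating the terms above
x <- c("abc", "a.c", "xyz")

# Metacharacter: in a regular expression "." matches any single character
meta <- grepl("a.c", x)
print(meta)          # TRUE TRUE FALSE

# Literal character string: with fixed = TRUE the same pattern matches only "a.c"
literal <- grepl("a.c", x, fixed = TRUE)
print(literal)       # FALSE TRUE FALSE

# Character class: [abc] matches any one of the characters a, b or c
cls <- grepl("[abc]", x)
print(cls)           # TRUE TRUE FALSE

# Inside a character class "." is an ordinary character
dot_in_class <- grepl("[.]", x)
print(dot_in_class)  # FALSE TRUE FALSE
```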

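Nothing below considers the case perl = TRUE. R provides two flavors of regular expressions: extended (the default) and Perl-like. As an introduction, this chapter covers only the former. To enable the latter, set the option perl = TRUE in a function such as grep; they are referred to throughout as Perl regular expressions¹⁵.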

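Regular expression is commonly abbreviated to regexp, which explains the names of the functions regexpr and gregexpr. Enter ?regex in the console to see which regular expressions R supports; it is no exaggeration to say that this help page is worth reading dozens of times. The R functions that accept regular expressions are grep, grepl, sub, gsub, regexpr, gregexpr, regexec and strsplit. The functions apropos, browseEnv, help.search, list.files and ls use regular expressions through grep, and they all use extended regular expressions.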
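
The functions listed above can be tried side by side on one small input. This is a minimal sketch: the vector x and the pattern are invented for illustration, and regmatches (a base R helper not mentioned above) is used only to pull out the matched text.

```r
# Hypothetical input and pattern: one or more digits
x   <- c("apple 12", "banana", "cherry 7 and 42")
pat <- "[0-9]+"

idx       <- grep(pat, x)         # indices of matching elements: 1 3
hit       <- grepl(pat, x)        # logical vector: TRUE FALSE TRUE
first_sub <- sub(pat, "N", x)     # replace only the first match in each element
all_sub   <- gsub(pat, "N", x)    # replace every match in each element
pos       <- regexpr(pat, x)      # start of the first match, -1 if there is none
every     <- regmatches(x, gregexpr(pat, x))  # all matches, one list entry per element
parts     <- strsplit("a1b22c", pat)          # split on the pattern

# regexec also reports parenthesized sub-expressions
groups <- regmatches(x[1], regexec("([a-z]+) ([0-9]+)", x[1]))[[1]]

print(first_sub)        # "apple N" "banana" "cherry N and 42"
print(all_sub)          # "apple N" "banana" "cherry N and N"
print(as.integer(pos))  # 7 -1 8
print(every[[3]])       # "7" "42"
print(parts[[1]])       # "a" "b" "c"
print(groups)           # "apple 12" "apple" "12"
```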

grep(pattern, x, ignore.case = FALSE, perl = FALSE, value = FALSE,
     fixed = FALSE, useBytes = FALSE, invert = FALSE)
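
As a rough illustration of the arguments in this signature, the sketch below varies ignore.case, value and invert on a made-up vector named fruits.

```r
# Hypothetical input; the pattern "^a" means "starts with a lowercase a"
fruits <- c("Apple", "banana", "cherry", "apricot")

default_idx <- grep("^a", fruits)                      # case-sensitive, returns indices
nocase_idx  <- grep("^a", fruits, ignore.case = TRUE)  # also matches "Apple"
nocase_val  <- grep("^a", fruits, ignore.case = TRUE,
                    value = TRUE)                      # return the elements themselves
inverted    <- grep("^a", fruits, ignore.case = TRUE,
                    value = TRUE, invert = TRUE)       # elements that do NOT match

print(default_idx)  # 4
print(nocase_idx)   # 1 4
print(nocase_val)   # "Apple" "apricot"
print(inverted)     # "banana" "cherry"
```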

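The content of a pattern can be printed with the function cat. Note that a backslash must be written twice inside an R string, because the backslash \ is itself the escape character; otherwise R reports an error (for example, "\." is an unrecognized escape).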

cat("\\") # the backslash \ is the escape character

## \

cat("\\.")

## \.

cat("\\\n") # note that \n is a newline

## \
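
The cat output above shows what the regular expression engine actually receives. The following sketch, using a made-up Windows-style path, illustrates the consequence: a literal dot needs two backslashes in R source, and a literal backslash needs four unless fixed = TRUE is used.

```r
# "\\." is a two-character string: a backslash followed by a dot
n_chars <- nchar("\\.")
print(n_chars)   # 2

# The engine sees \. and so matches only a literal dot
dot <- grepl("\\.", c("a.b", "ab"))
print(dot)       # TRUE FALSE

# Hypothetical path containing one backslash (7 characters: C : \ t e m p)
path <- "C:\\temp"

# Regex mode: "\\\\" in source is \\ for the engine, i.e. one literal backslash
bs_regex <- grepl("\\\\", path)
# Fixed mode: no regex escaping, so "\\" (a single backslash) is enough
bs_fixed <- grepl("\\", path, fixed = TRUE)
# Replace the backslash with a forward slash
slashed <- gsub("\\\\", "/", path)

print(bs_regex)  # TRUE
print(bs_fixed)  # TRUE
print(slashed)   # "C:/temp"
```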

¹⁵ A recommended path for learning regular expressions is described on the Capital of Statistics (统计之都) forum: https://d.cosx.org/d/420410
