3  Data Cleaning

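Extracting useful information from unstructured or semi-structured data usually requires some data cleaning, and regular expressions are among the most important tools for the job. R ships with a built-in set of functions for working with them; see ?regex for details.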

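This chapter uses the metadata of R packages on CRAN as the material for data cleaning. Because the cleaning here mainly serves text analysis, five fields are selected from the metadata: Package, Maintainer, Title, Description, and Authors@R.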

pdb <- readRDS(file = "data/cran-package-db-20241231.rds")
pdb <- subset(
  x = pdb, subset = !duplicated(Package),
  select = c("Package", "Maintainer", "Title", "Description", "Authors@R")
)

3.1 Regular Expressions

For simplicity, consider a few fields of the Rcpp package.

pdb[pdb$Package == "Rcpp","Description"] 
[1] "The 'Rcpp' package provides R functions as well as C++ classes which\n offer a seamless integration of R and C++. Many R data types and objects can be\n mapped back and forth to C++ equivalents which facilitates both writing of new\n code as well as easier integration of third-party libraries. Documentation\n about 'Rcpp' is provided by several vignettes included in this package, via the\n 'Rcpp Gallery' site at <https://gallery.rcpp.org>, the paper by Eddelbuettel and\n Francois (2011, <doi:10.18637/jss.v040.i08>), the book by Eddelbuettel (2013,\n <doi:10.1007/978-1-4614-6868-4>) and the paper by Eddelbuettel and Balamuta (2018,\n <doi:10.1080/00031305.2017.1375990>); see 'citation(\"Rcpp\")' for details."

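3.1.1 Quantifiers

The Description field contains several newline characters \n and several pairs of parentheses, which serve as targets for searching and replacing.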

grep(pattern = "\n", x = pdb[pdb$Package == "Rcpp", "Description"], fixed = TRUE)
[1] 1
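As a minimal sketch of how quantifiers behave, the example below contrasts the greedy `.*` with the lazy `.*?` and shows a bounded quantifier `{4}`. The string `s` is a made-up illustration, not a record from the CRAN data.

```r
# Illustrative string (an assumption, not taken from the package database)
s <- "see (2011, <doi:a>) and (2013, <doi:b>)"

# Greedy: .* matches as much as possible, so the match runs to the last ">"
greedy <- regmatches(s, regexpr("<.*>", s))

# Lazy: .*? matches as little as possible, so the match stops at the first ">"
lazy <- regmatches(s, regexpr("<.*?>", s, perl = TRUE))

# Bounded: exactly four consecutive digits, all occurrences
years <- regmatches(s, gregexpr("[0-9]{4}", s))[[1]]

print(greedy)
print(lazy)
print(years)
```

The lazy form is what the extraction examples later in this chapter rely on to stop at the nearest closing delimiter.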

3.1.2 Concatenation

3.1.3 Assertions

Lookahead / lookbehind
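A small sketch of lookaround assertions, assuming the PCRE engine is selected with perl = TRUE (R's default regex engine does not support lookahead or lookbehind). The string `s` is an abbreviated, illustrative excerpt in the style of the Rcpp description.

```r
# Illustrative excerpt (an assumption for this sketch)
s <- "Francois (2011, <doi:10.18637/jss.v040.i08>), Eddelbuettel (2013, <doi:10.1007/978-1-4614-6868-4>)"

# (?<=<doi:) is a lookbehind: the match must be preceded by "<doi:"
# (?=>)      is a lookahead:  the match must be followed by ">"
# Neither assertion consumes characters, so the delimiters stay out of the result.
dois <- regmatches(s, gregexpr("(?<=<doi:)[^>]+(?=>)", s, perl = TRUE))[[1]]
print(dois)
```

Note that PCRE requires a lookbehind to have a fixed length, which the literal "<doi:" satisfies.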

3.1.4 Backreferences
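As a hedged illustration of backreferences, the sketch below uses \\1 inside a pattern to match a repeated word, and \\1 and \\2 inside a replacement to reorder captured groups. Both input strings are invented for the example.

```r
# \\1 in the pattern refers back to whatever the first group matched,
# so "(\\w+) \\1" matches a word immediately followed by itself.
s <- "the the paper by by Eddelbuettel"
dedup <- gsub("\\b(\\w+) \\1\\b", "\\1", s, perl = TRUE)
print(dedup)

# In the replacement, \\1 and \\2 reinsert the captured groups, here swapped.
swapped <- sub("(\\w+), (\\w+)", "\\2 \\1", "Eddelbuettel, Dirk")
print(swapped)
```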

3.1.5 Named Capture
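A minimal sketch of named capture, assuming perl = TRUE. With PCRE, regexpr() attaches the attributes capture.start, capture.length, and capture.names to its result, and these can be combined with substring() to pull out each named group. The group names year and doi and the string `s` are choices made for this illustration.

```r
# Illustrative string (an assumption for this sketch)
s <- "Francois (2011, <doi:10.18637/jss.v040.i08>)"

# (?<name>...) defines a named group; requires perl = TRUE
m <- regexpr("\\((?<year>[0-9]{4}), <doi:(?<doi>[^>]+)>\\)", s, perl = TRUE)

# Start positions and lengths of each captured group
st  <- as.vector(attr(m, "capture.start"))
len <- as.vector(attr(m, "capture.length"))

# Cut the groups out of the string and label them with the group names
fields <- substring(s, st, st + len - 1)
names(fields) <- attr(m, "capture.names")
print(fields)
```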

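3.2 String Operations

3.2.1 Searching

grep() and grepl() are a pair of string-matching functions that report whether a pattern matches each string. The former returns an integer vector holding the indices of the matching elements; the latter returns a logical vector with one value per element.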

grep(pattern = "\n", x = pdb[pdb$Package == "Rcpp", "Description"], fixed = TRUE)
[1] 1
# With value = TRUE, return the matching strings themselves instead of their indices
grep(pattern = "\n", x = pdb[pdb$Package == "Rcpp", "Description"], value = TRUE, fixed = TRUE)
[1] "The 'Rcpp' package provides R functions as well as C++ classes which\n offer a seamless integration of R and C++. Many R data types and objects can be\n mapped back and forth to C++ equivalents which facilitates both writing of new\n code as well as easier integration of third-party libraries. Documentation\n about 'Rcpp' is provided by several vignettes included in this package, via the\n 'Rcpp Gallery' site at <https://gallery.rcpp.org>, the paper by Eddelbuettel and\n Francois (2011, <doi:10.18637/jss.v040.i08>), the book by Eddelbuettel (2013,\n <doi:10.1007/978-1-4614-6868-4>) and the paper by Eddelbuettel and Balamuta (2018,\n <doi:10.1080/00031305.2017.1375990>); see 'citation(\"Rcpp\")' for details."
# grepv() is equivalent to grep(..., value = TRUE); it requires R 4.5.0 or later
grepv(pattern = "\n", x = pdb[pdb$Package == "Rcpp", "Description"], fixed = TRUE)
[1] "The 'Rcpp' package provides R functions as well as C++ classes which\n offer a seamless integration of R and C++. Many R data types and objects can be\n mapped back and forth to C++ equivalents which facilitates both writing of new\n code as well as easier integration of third-party libraries. Documentation\n about 'Rcpp' is provided by several vignettes included in this package, via the\n 'Rcpp Gallery' site at <https://gallery.rcpp.org>, the paper by Eddelbuettel and\n Francois (2011, <doi:10.18637/jss.v040.i08>), the book by Eddelbuettel (2013,\n <doi:10.1007/978-1-4614-6868-4>) and the paper by Eddelbuettel and Balamuta (2018,\n <doi:10.1080/00031305.2017.1375990>); see 'citation(\"Rcpp\")' for details."
# grepl() returns a logical vector indicating which elements match
grepl(pattern = "\n", x = pdb[pdb$Package == "Rcpp", "Description"], fixed = TRUE)
[1] TRUE
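With a single input string the difference between the two functions is hard to see. The sketch below uses a small invented vector `descs` (not part of the CRAN data) to show that grep() returns only the positions of the matches, while grepl() returns one logical value per input element.

```r
# Illustrative vector; the second element contains no newline
descs <- c("line one\n line two", "single line", "a\n b\n c")

idx  <- grep("\n", descs, fixed = TRUE)   # integer positions of matching elements
flag <- grepl("\n", descs, fixed = TRUE)  # one TRUE/FALSE per element

print(idx)
print(flag)

# Either result can be used for subsetting and selects the same elements
print(identical(descs[idx], descs[flag]))
```

Because grepl() always returns a vector as long as its input, it is usually the more convenient choice for filtering rows of a data frame.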

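3.2.2 Replacement

sub() and gsub() are a pair of functions for replacing strings. The former replaces only the first match, whereas the latter replaces all matches.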

sub(pattern = "\n", x = pdb[pdb$Package == "Rcpp", "Description"], replacement = "", fixed = TRUE)
[1] "The 'Rcpp' package provides R functions as well as C++ classes which offer a seamless integration of R and C++. Many R data types and objects can be\n mapped back and forth to C++ equivalents which facilitates both writing of new\n code as well as easier integration of third-party libraries. Documentation\n about 'Rcpp' is provided by several vignettes included in this package, via the\n 'Rcpp Gallery' site at <https://gallery.rcpp.org>, the paper by Eddelbuettel and\n Francois (2011, <doi:10.18637/jss.v040.i08>), the book by Eddelbuettel (2013,\n <doi:10.1007/978-1-4614-6868-4>) and the paper by Eddelbuettel and Balamuta (2018,\n <doi:10.1080/00031305.2017.1375990>); see 'citation(\"Rcpp\")' for details."
# Remove all newline characters
gsub(pattern = "\n", x = pdb[pdb$Package == "Rcpp", "Description"], replacement = "", fixed = TRUE)
[1] "The 'Rcpp' package provides R functions as well as C++ classes which offer a seamless integration of R and C++. Many R data types and objects can be mapped back and forth to C++ equivalents which facilitates both writing of new code as well as easier integration of third-party libraries. Documentation about 'Rcpp' is provided by several vignettes included in this package, via the 'Rcpp Gallery' site at <https://gallery.rcpp.org>, the paper by Eddelbuettel and Francois (2011, <doi:10.18637/jss.v040.i08>), the book by Eddelbuettel (2013, <doi:10.1007/978-1-4614-6868-4>) and the paper by Eddelbuettel and Balamuta (2018, <doi:10.1080/00031305.2017.1375990>); see 'citation(\"Rcpp\")' for details."
# Remove all single quotes
gsub(pattern = "'", x = pdb[pdb$Package == "Rcpp", "Description"], replacement = "", fixed = TRUE)
[1] "The Rcpp package provides R functions as well as C++ classes which\n offer a seamless integration of R and C++. Many R data types and objects can be\n mapped back and forth to C++ equivalents which facilitates both writing of new\n code as well as easier integration of third-party libraries. Documentation\n about Rcpp is provided by several vignettes included in this package, via the\n Rcpp Gallery site at <https://gallery.rcpp.org>, the paper by Eddelbuettel and\n Francois (2011, <doi:10.18637/jss.v040.i08>), the book by Eddelbuettel (2013,\n <doi:10.1007/978-1-4614-6868-4>) and the paper by Eddelbuettel and Balamuta (2018,\n <doi:10.1080/00031305.2017.1375990>); see citation(\"Rcpp\") for details."
# Remove all double quotes
gsub(pattern = '\"', x = pdb[pdb$Package == "Rcpp", "Description"], replacement = "", fixed = TRUE)
[1] "The 'Rcpp' package provides R functions as well as C++ classes which\n offer a seamless integration of R and C++. Many R data types and objects can be\n mapped back and forth to C++ equivalents which facilitates both writing of new\n code as well as easier integration of third-party libraries. Documentation\n about 'Rcpp' is provided by several vignettes included in this package, via the\n 'Rcpp Gallery' site at <https://gallery.rcpp.org>, the paper by Eddelbuettel and\n Francois (2011, <doi:10.18637/jss.v040.i08>), the book by Eddelbuettel (2013,\n <doi:10.1007/978-1-4614-6868-4>) and the paper by Eddelbuettel and Balamuta (2018,\n <doi:10.1080/00031305.2017.1375990>); see 'citation(Rcpp)' for details."
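The same contrast can be seen on a shorter string. The sketch below uses an invented excerpt `s` and also shows one possible alternative, collapsing every run of whitespace into a single space with the regular expression \\s+; that variant is a suggestion for illustration, not the approach used above.

```r
# Illustrative excerpt with two line breaks, each followed by a space
s <- "classes which\n offer a seamless\n integration"

first_only <- sub("\n", "", s, fixed = TRUE)    # replaces the first newline only
every      <- gsub("\n", "", s, fixed = TRUE)   # replaces all newlines

# Alternative: treat any run of whitespace (newline, spaces, tabs) as one space
squashed <- gsub("\\s+", " ", s)

print(first_only)
print(every)
print(squashed)
```

Deleting only "\n" works here because each line break is followed by a space; the \\s+ variant does not depend on that.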

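3.2.3 Extraction

The functions sub() and gsub() above are a pair for replacing strings; regexpr() and gregexpr() are another pair, used for extracting strings. They return the positions of matches, which regmatches() then turns into the matched substrings. The g prefix in gregexpr() has the same meaning as in gsub(): the operation is global, that is, it applies to all matches rather than only the first.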

# Strip newlines and double quotes from the description first
x <- gsub(pattern = "\n", x = pdb[pdb$Package == "Rcpp", "Description"], replacement = "", fixed = TRUE)
x <- gsub(pattern = '\"', x = x, replacement = "", fixed = TRUE)
# Extract the first URL enclosed in angle brackets
str_extract <- function(text, pattern, ...) regmatches(text, regexpr(pattern, text, ...))
str_extract(text = x, pattern = "(<.*?>)", perl = TRUE)
[1] "<https://gallery.rcpp.org>"
# Extract the first parenthesized span
str_extract(text = x, pattern = "(\\(.*?\\))", perl = TRUE)
[1] "(2011, <doi:10.18637/jss.v040.i08>)"
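One caveat when applying such a helper to a whole column: regmatches() combined with regexpr() silently drops elements that have no match, so the result can be shorter than the input. The sketch below shows a possible length-preserving variant; the name str_extract_na is hypothetical and the input vector is invented for the example.

```r
# Hypothetical helper: like str_extract() above, but returns NA where nothing matches
str_extract_na <- function(text, pattern, ...) {
  m <- regexpr(pattern, text, ...)
  out <- rep(NA_character_, length(text))
  hit <- !is.na(m) & m != -1          # elements with a match
  out[hit] <- regmatches(text, m)     # regmatches() returns only the matched ones
  out
}

# Illustrative input: the second element has no angle-bracket link
res <- str_extract_na(
  c("see <https://gallery.rcpp.org>", "no link here"),
  pattern = "<.*?>", perl = TRUE
)
print(res)
```

Keeping the output aligned with the input makes it safe to assign the result back as a new column of the data frame.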

The Description field contains several parenthesized DOI links; extract all of them.

str_extract_g <- function(text, pattern, ...) regmatches(text, gregexpr(pattern, text, ...))
str_extract_g(text = x, pattern = "(\\(.*?\\))", perl = TRUE)
[[1]]
[1] "(2011, <doi:10.18637/jss.v040.i08>)"        
[2] "(2013, <doi:10.1007/978-1-4614-6868-4>)"    
[3] "(2018, <doi:10.1080/00031305.2017.1375990>)"
[4] "(Rcpp)"
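The fourth result, "(Rcpp)", is not a citation; it is left over from citation("Rcpp") after the double quotes were removed. A tighter pattern can exclude it. The sketch below assumes every citation has the form "(year, <doi:...>)" and uses a shortened, illustrative string `x_demo` rather than the full description.

```r
# Shortened illustrative string in the style of the cleaned description
x_demo <- paste0(
  "Francois (2011, <doi:10.18637/jss.v040.i08>), the book by Eddelbuettel ",
  "(2013, <doi:10.1007/978-1-4614-6868-4>); see citation(Rcpp) for details."
)

# Require a four-digit year and a <doi:...> link inside the parentheses
pat  <- "\\([0-9]{4}, <doi:[^>]+>\\)"
refs <- regmatches(x_demo, gregexpr(pat, x_demo))[[1]]
print(refs)
```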