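3 Data Cleaning
Extracting useful information from unstructured or semi-structured data usually requires some data cleaning, and regular expressions are among the most important tools for the job. R ships with a set of built-in functions for working with them; see ?regex for details.
This chapter uses the metadata of R packages on CRAN as the data to clean. Because the cleaning here mainly serves text analysis, five fields are selected from the metadata: Package, Maintainer, Title, Description, and Authors@R.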
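The code later in this chapter refers to a data frame named pdb, but its creation is not shown. One plausible source, assumed here rather than confirmed by the text, is tools::CRAN_package_db(), which needs network access. The sketch below therefore selects the five fields from a tiny hypothetical stand-in (pdb_demo, with made-up packages) so that it runs offline.

```r
# Assumption: the real metadata could be fetched with
#   pdb <- tools::CRAN_package_db()   # requires network access
# pdb_demo is a hypothetical stand-in with the same five fields.
pdb_demo <- data.frame(
  Package = c("pkgA", "pkgB"),
  Maintainer = c("Jane Doe <jane@example.org>", "John Roe <john@example.org>"),
  Title = c("A Demo Package", "Another Demo Package"),
  Description = c("First line\n second line.", "Only one line."),
  `Authors@R` = c('person("Jane", "Doe")', NA),
  check.names = FALSE,       # keep the "@" in the column name
  stringsAsFactors = FALSE
)

# Keep only the fields used for text analysis
fields <- c("Package", "Maintainer", "Title", "Description", "Authors@R")
meta <- pdb_demo[, fields]
str(meta)
```

Setting check.names = FALSE matters here because the default would rewrite Authors@R into a syntactically valid name.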
3.1 Regular Expressions
For simplicity, consider a few fields of the Rcpp package, starting with Description.
pdb[pdb$Package == "Rcpp", "Description"]
[1] "The 'Rcpp' package provides R functions as well as C++ classes which\n offer a seamless integration of R and C++. Many R data types and objects can be\n mapped back and forth to C++ equivalents which facilitates both writing of new\n code as well as easier integration of third-party libraries. Documentation\n about 'Rcpp' is provided by several vignettes included in this package, via the\n 'Rcpp Gallery' site at <https://gallery.rcpp.org>, the paper by Eddelbuettel and\n Francois (2011, <doi:10.18637/jss.v040.i08>), the book by Eddelbuettel (2013,\n <doi:10.1007/978-1-4614-6868-4>) and the paper by Eddelbuettel and Balamuta (2018,\n <doi:10.1080/00031305.2017.1375990>); see 'citation(\"Rcpp\")' for details."
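3.1.1 Quantifiers
The Description field contains several newline characters (\n) and several pairs of parentheses, which makes it a good target for searching and replacing.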
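As a minimal sketch of how quantifiers behave, the example below uses a short made-up string (desc) that imitates the Description field. It illustrates `+` (one or more), `{4}` (exactly four), and the difference between the greedy `*` and the lazy `*?`.

```r
# Hypothetical short string imitating the Description field
desc <- "classes which\n offer (2011, <doi:10.1/a>) and\n (2013, <doi:10.2/b>)"

# "+": one or more. Collapse a newline plus the spaces after it into one space.
flat <- gsub(pattern = "\n +", replacement = " ", x = desc)

# "{4}": exactly four digits, which picks out the years
years <- regmatches(flat, gregexpr("[0-9]{4}", flat))[[1]]

# "*?": lazy, stops at the first closing parenthesis
lazy <- regmatches(flat, regexpr("\\(.*?\\)", flat, perl = TRUE))

# "*": greedy, runs on to the last closing parenthesis
greedy <- regmatches(flat, regexpr("\\(.*\\)", flat))

print(years)
print(lazy)
print(greedy)
```

The lazy form is what keeps each parenthesized group separate when a string contains more than one of them.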
3.1.2 Concatenation
3.1.3 Assertions
Lookahead / lookbehind
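Lookahead and lookbehind match a position without consuming characters, so the surrounding delimiters stay out of the result. The sketch below applies them to a fragment of the Rcpp description. In R they require perl = TRUE; the variable names are illustrative only.

```r
# Fragment of the Rcpp Description, newlines already removed
x <- "Francois (2011, <doi:10.18637/jss.v040.i08>), Eddelbuettel (2013, <doi:10.1007/978-1-4614-6868-4>)"

# Lookbehind (?<=<doi:) and lookahead (?=>):
# match the DOI itself, without the enclosing "<doi:" and ">"
dois <- regmatches(x, gregexpr("(?<=<doi:)[^>]+(?=>)", x, perl = TRUE))[[1]]

# Lookahead only: four digits that are followed by ", <doi:"
years <- regmatches(x, gregexpr("[0-9]{4}(?=, <doi:)", x, perl = TRUE))[[1]]

print(dois)
print(years)
```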
3.1.4 Backreferences
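A backreference such as \\1 reuses the text captured by a parenthesized group. As a small sketch on strings taken from or modeled on the Rcpp description, the first call strips single quotes while keeping the quoted text, and the second reorders two captured groups.

```r
x <- "Documentation about 'Rcpp' is provided via the 'Rcpp Gallery' site"

# \\1 in the replacement stands for whatever the first group captured
unquoted <- gsub(pattern = "'([^']+)'", replacement = "\\1", x = x)

# Two groups, swapped in the replacement
ref <- sub(pattern = "([A-Za-z]+) \\(([0-9]{4})\\)",
           replacement = "\\2: \\1",
           x = "Eddelbuettel (2013)")

print(unquoted)
print(ref)
```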
3.1.5 Named Capture
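With perl = TRUE, regexpr() records the position of each capture group in the attributes capture.start, capture.length, and capture.names. The sketch below, in which the group names year and doi are chosen purely for illustration, uses them to pull labeled pieces out of one citation.

```r
x <- "the paper by Eddelbuettel and Francois (2011, <doi:10.18637/jss.v040.i08>)"

# (?<name>...) defines a named group
m <- regexpr("\\((?<year>[0-9]{4}), <doi:(?<doi>[^>]+)>\\)", x, perl = TRUE)

starts <- attr(m, "capture.start")
lens <- attr(m, "capture.length")

# Cut each group out of the original string and label it with its name
parts <- substring(x, starts, starts + lens - 1)
names(parts) <- attr(m, "capture.names")

print(parts)
```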
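3.2 String Operations
3.2.1 Searching
grep() and grepl() are a pair of string-matching functions. grep() returns an integer vector holding the indices of the elements that match, whereas grepl() returns a logical vector indicating, for each element, whether it matches.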
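# Return the indices of the matching elements
grep(pattern = "\n", x = pdb[pdb$Package == "Rcpp", "Description"], fixed = TRUE)
[1] 1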
# If the pattern matches, return the original string instead of its index
grep(pattern = "\n", x = pdb[pdb$Package == "Rcpp", "Description"], value = TRUE, fixed = TRUE)
[1] "The 'Rcpp' package provides R functions as well as C++ classes which\n offer a seamless integration of R and C++. Many R data types and objects can be\n mapped back and forth to C++ equivalents which facilitates both writing of new\n code as well as easier integration of third-party libraries. Documentation\n about 'Rcpp' is provided by several vignettes included in this package, via the\n 'Rcpp Gallery' site at <https://gallery.rcpp.org>, the paper by Eddelbuettel and\n Francois (2011, <doi:10.18637/jss.v040.i08>), the book by Eddelbuettel (2013,\n <doi:10.1007/978-1-4614-6868-4>) and the paper by Eddelbuettel and Balamuta (2018,\n <doi:10.1080/00031305.2017.1375990>); see 'citation(\"Rcpp\")' for details."
# Equivalent to grep(..., value = TRUE); grepv() requires R >= 4.5.0
grepv(pattern = "\n", x = pdb[pdb$Package == "Rcpp", "Description"], fixed = TRUE)
[1] "The 'Rcpp' package provides R functions as well as C++ classes which\n offer a seamless integration of R and C++. Many R data types and objects can be\n mapped back and forth to C++ equivalents which facilitates both writing of new\n code as well as easier integration of third-party libraries. Documentation\n about 'Rcpp' is provided by several vignettes included in this package, via the\n 'Rcpp Gallery' site at <https://gallery.rcpp.org>, the paper by Eddelbuettel and\n Francois (2011, <doi:10.18637/jss.v040.i08>), the book by Eddelbuettel (2013,\n <doi:10.1007/978-1-4614-6868-4>) and the paper by Eddelbuettel and Balamuta (2018,\n <doi:10.1080/00031305.2017.1375990>); see 'citation(\"Rcpp\")' for details."
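# Return TRUE or FALSE for each element
grepl(pattern = "\n", x = pdb[pdb$Package == "Rcpp", "Description"], fixed = TRUE)
[1] TRUE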
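With a single string the difference between the two return types is easy to miss. The following sketch uses a hypothetical three-element vector (descs) to show the indices, the logical flags, and the matching values side by side.

```r
# Hypothetical descriptions; the first and third contain a newline
descs <- c("First line\n second line.", "Only one line.", "Two\n lines.")

idx <- grep("\n", descs, fixed = TRUE)                # indices of matching elements
hit <- grepl("\n", descs, fixed = TRUE)               # one logical value per element
val <- grep("\n", descs, fixed = TRUE, value = TRUE)  # the matching elements themselves

print(idx)
print(hit)
print(val)
```

The logical vector from grepl() has the same length as the input, which makes it convenient for subsetting rows of a data frame.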
3.2.2 Replacing
sub() and gsub() are a pair of string-replacement functions. sub() replaces only the first match, whereas gsub() replaces every match.
# Replace only the first newline
sub(pattern = "\n", x = pdb[pdb$Package == "Rcpp", "Description"], replacement = "", fixed = TRUE)
[1] "The 'Rcpp' package provides R functions as well as C++ classes which offer a seamless integration of R and C++. Many R data types and objects can be\n mapped back and forth to C++ equivalents which facilitates both writing of new\n code as well as easier integration of third-party libraries. Documentation\n about 'Rcpp' is provided by several vignettes included in this package, via the\n 'Rcpp Gallery' site at <https://gallery.rcpp.org>, the paper by Eddelbuettel and\n Francois (2011, <doi:10.18637/jss.v040.i08>), the book by Eddelbuettel (2013,\n <doi:10.1007/978-1-4614-6868-4>) and the paper by Eddelbuettel and Balamuta (2018,\n <doi:10.1080/00031305.2017.1375990>); see 'citation(\"Rcpp\")' for details."
# Remove all newlines
gsub(pattern = "\n", x = pdb[pdb$Package == "Rcpp", "Description"], replacement = "", fixed = TRUE)
[1] "The 'Rcpp' package provides R functions as well as C++ classes which offer a seamless integration of R and C++. Many R data types and objects can be mapped back and forth to C++ equivalents which facilitates both writing of new code as well as easier integration of third-party libraries. Documentation about 'Rcpp' is provided by several vignettes included in this package, via the 'Rcpp Gallery' site at <https://gallery.rcpp.org>, the paper by Eddelbuettel and Francois (2011, <doi:10.18637/jss.v040.i08>), the book by Eddelbuettel (2013, <doi:10.1007/978-1-4614-6868-4>) and the paper by Eddelbuettel and Balamuta (2018, <doi:10.1080/00031305.2017.1375990>); see 'citation(\"Rcpp\")' for details."
# Remove all single quotes
gsub(pattern = "'", x = pdb[pdb$Package == "Rcpp", "Description"], replacement = "", fixed = TRUE)
[1] "The Rcpp package provides R functions as well as C++ classes which\n offer a seamless integration of R and C++. Many R data types and objects can be\n mapped back and forth to C++ equivalents which facilitates both writing of new\n code as well as easier integration of third-party libraries. Documentation\n about Rcpp is provided by several vignettes included in this package, via the\n Rcpp Gallery site at <https://gallery.rcpp.org>, the paper by Eddelbuettel and\n Francois (2011, <doi:10.18637/jss.v040.i08>), the book by Eddelbuettel (2013,\n <doi:10.1007/978-1-4614-6868-4>) and the paper by Eddelbuettel and Balamuta (2018,\n <doi:10.1080/00031305.2017.1375990>); see citation(\"Rcpp\") for details."
# Remove all double quotes
gsub(pattern = '\"', x = pdb[pdb$Package == "Rcpp", "Description"], replacement = "", fixed = TRUE)
[1] "The 'Rcpp' package provides R functions as well as C++ classes which\n offer a seamless integration of R and C++. Many R data types and objects can be\n mapped back and forth to C++ equivalents which facilitates both writing of new\n code as well as easier integration of third-party libraries. Documentation\n about 'Rcpp' is provided by several vignettes included in this package, via the\n 'Rcpp Gallery' site at <https://gallery.rcpp.org>, the paper by Eddelbuettel and\n Francois (2011, <doi:10.18637/jss.v040.i08>), the book by Eddelbuettel (2013,\n <doi:10.1007/978-1-4614-6868-4>) and the paper by Eddelbuettel and Balamuta (2018,\n <doi:10.1080/00031305.2017.1375990>); see 'citation(Rcpp)' for details."
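The three replacements above each start again from the raw field. One way to combine them, sketched here with a hypothetical helper named clean_description() that is not part of the original text, is to feed each result into the next call.

```r
# Hypothetical helper: strip newlines, single quotes, and double quotes in turn
clean_description <- function(x) {
  x <- gsub("\n", "", x, fixed = TRUE)
  x <- gsub("'", "", x, fixed = TRUE)
  gsub('"', "", x, fixed = TRUE)
}

# sub() touches only the first newline, gsub() all of them
s <- "which\n offer\n mapped"
first_only <- sub("\n", "", s, fixed = TRUE)
all_hits <- gsub("\n", "", s, fixed = TRUE)

cleaned <- clean_description("see 'citation(\"Rcpp\")' for\n details.")
print(first_only)
print(all_hits)
print(cleaned)
```

Using fixed = TRUE treats each pattern as a literal string, which avoids any need to escape regular-expression metacharacters.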
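3.2.3 Extracting
The functions sub() and gsub() above are a pair for replacing strings. regexpr() and gregexpr() are another pair, used to locate matches, which regmatches() can then extract. The letter g prefixed to regexpr() has the same meaning as in gsub(): it denotes a global operation, that is, all matches rather than only the first.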
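# Strip newlines and double quotes first
x <- gsub(pattern = "\n", x = pdb[pdb$Package == "Rcpp", "Description"], replacement = "", fixed = TRUE)
x <- gsub(pattern = '\"', x = x, replacement = "", fixed = TRUE)
# Helper: extract the first match of pattern in text
str_extract <- function(text, pattern, ...) regmatches(text, regexpr(pattern, text, ...))
# Extract the URL
str_extract(text = x, pattern = "(<.*?>)", perl = TRUE)
[1] "<https://gallery.rcpp.org>"
# Extract the first parenthesized group
str_extract(text = x, pattern = "(\\(.*?\\))", perl = TRUE)
[1] "(2011, <doi:10.18637/jss.v040.i08>)"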
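The Description field contains several DOI links wrapped in parentheses; the following extracts all of them.
# Helper: extract every match of pattern in text
str_extract_g <- function(text, pattern, ...) regmatches(text, gregexpr(pattern, text, ...))
str_extract_g(text = x, pattern = "(\\(.*?\\))", perl = TRUE)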
[[1]]
[1] "(2011, <doi:10.18637/jss.v040.i08>)"
[2] "(2013, <doi:10.1007/978-1-4614-6868-4>)"
[3] "(2018, <doi:10.1080/00031305.2017.1375990>)"
[4] "(Rcpp)"
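It can help to look at what regexpr() and gregexpr() return before regmatches() is applied. The sketch below uses a hypothetical two-element vector (txt), one element of which has no match, to show that regmatches() silently drops non-matching elements in the regexpr() case but keeps an empty entry in the gregexpr() case.

```r
# Hypothetical input: the second element contains no parentheses
txt <- c("see (2011, <doi:10.18637/jss.v040.i08>) and (2013)",
         "no parentheses here")

m <- regexpr("\\(.*?\\)", txt, perl = TRUE)
pos <- as.integer(m)            # start of the first match; -1 means no match
len <- attr(m, "match.length")  # length of each match; -1 means no match

# With regexpr(): a character vector, non-matching elements are dropped
first <- regmatches(txt, m)

# With gregexpr(): a list with one entry per input element
all_m <- regmatches(txt, gregexpr("\\(.*?\\)", txt, perl = TRUE))

print(pos)
print(first)
print(all_m)
```

Because the first form can be shorter than its input, the list form is the safer choice when results must stay aligned with the rows of a data frame.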