附录 C R 指南

R 语言 (Ihaka and Gentleman 1996) 是一个统计计算和绘图的环境,以下各个节不介绍具体 R 包函数用法和参数设置,重点在历史发展趋势脉络,详细介绍去见《现代统计图形》的相应章节。R 语言的目标在于统计计算和绘图,设计优势在数据结构、图形语法、动态文档和交互图形

C.1 编程指南

R 与其它语言的异同,降低编程门槛

I’d like to prefix all these solutions with ‘Here’s how to do it, but don’t actually do it you crazy fool’. It’s on a par with redefining pi, or redefining ‘+’. And then redefining ‘<-’. These techniques have their proper place, and that would be in the currently non-existent obfuscated R contest. No, the R-ish (iRish?) way is to index vectors from 1. That’s what the R gods intended!

— Barry Rowlingson32

C.1.1 泛型函数

如果要让下标从 0 开始的话,我们需要在现有的向量类型 vector 上定义新的向量类型 vector0,在其上并且实现索引运算 [ 和赋值修改元素的运算 [<-

# https://stat.ethz.ch/pipermail/r-help/2004-March/048682.html
as.vector0 <- function(x) structure(x, class = "vector0") # 创建一种新的数据结构 vector0
as.vector.vector0 <- function(x) unclass(x)
"[.vector0" <- function(x, i) as.vector0(as.vector.vector0(x)[i + 1]) # 索引操作
"[<-.vector0" <- function(x, i, value) { # 赋值操作
  x <- as.vector.vector0(x)
  x[i + 1] <- value
  as.vector0(x)
}
print.vector0 <- function(x) print(as.vector.vector0(x)) # 实现 print 方法

举个例子看看

1:10 # 是一个内置的现有向量类型 vector
##  [1]  1  2  3  4  5  6  7  8  9 10
x <- as.vector0(1:10) # 转化为新建的 vector0 类型
x[0:4] <- 100 * x[0:4] # 对 x 的元素替换修改
x
##  [1] 100 200 300 400 500   6   7   8   9  10

C.1.2 函数源码

methods(predict)
##  [1] predict.ar*                predict.Arima*            
##  [3] predict.arima0*            predict.glm               
##  [5] predict.HoltWinters*       predict.lm                
##  [7] predict.loess*             predict.mlm*              
##  [9] predict.nls*               predict.poly*             
## [11] predict.ppr*               predict.prcomp*           
## [13] predict.princomp*          predict.smooth.spline*    
## [15] predict.smooth.spline.fit* predict.StructTS*         
## see '?methods' for accessing help and source code

stats 包里找不到这个函数

ls("package:stats", all.names = TRUE, pattern = "predict.poly")
## character(0)
predict.poly
## Error in eval(expr, envir, enclos): object 'predict.poly' not found

可见函数 predict.poly() 默认没有导出

stats:::predict.poly
## function (object, newdata, ...) 
## {
##     if (missing(newdata)) 
##         object
##     else if (is.null(attr(object, "coefs"))) 
##         poly(newdata, degree = max(attr(object, "degree")), raw = TRUE, 
##             simple = TRUE)
##     else poly(newdata, degree = max(attr(object, "degree")), 
##         coefs = attr(object, "coefs"), simple = TRUE)
## }
## <bytecode: 0x2786408>
## <environment: namespace:stats>

或者

getAnywhere(predict.poly)
## A single object matching 'predict.poly' was found
## It was found in the following places
##   registered S3 method for predict from namespace stats
##   namespace:stats
## with value
## 
## function (object, newdata, ...) 
## {
##     if (missing(newdata)) 
##         object
##     else if (is.null(attr(object, "coefs"))) 
##         poly(newdata, degree = max(attr(object, "degree")), raw = TRUE, 
##             simple = TRUE)
##     else poly(newdata, degree = max(attr(object, "degree")), 
##         coefs = attr(object, "coefs"), simple = TRUE)
## }
## <bytecode: 0x2786408>
## <environment: namespace:stats>
getAnywhere("predict.poly")$where
## [1] "registered S3 method for predict from namespace stats"
## [2] "namespace:stats"

函数参数个数

sort(names(formals(read.table)))
##  [1] "allowEscapes"     "as.is"            "blank.lines.skip"
##  [4] "check.names"      "col.names"        "colClasses"      
##  [7] "comment.char"     "dec"              "encoding"        
## [10] "file"             "fileEncoding"     "fill"            
## [13] "flush"            "header"           "na.strings"      
## [16] "nrows"            "numerals"         "quote"           
## [19] "row.names"        "sep"              "skip"            
## [22] "skipNul"          "stringsAsFactors" "strip.white"     
## [25] "text"

C.1.3 apply 函数族

函数 输入 输出
apply 矩阵、数据框 向量
lapply 向量、列表 列表
sapply 向量、列表 向量、矩阵
mapply 多个向量 列表
tapply 数据框、数组 向量
vapply 列表 矩阵
eapply 列表 列表
rapply 嵌套列表 嵌套列表

除此之外,还有 dendrapply() 专门处理层次聚类或分类回归树型结构, 而函数 kernapply() 用于时间序列的平滑处理

# Reproduce example 10.4.3 from Brockwell and Davis (1991) [@Brockwell_1991_Time]
spectrum(sunspot.year, kernel = kernel("daniell", c(11,7,3)), log = "no")
太阳黑子的频谱

图 C.1: 太阳黑子的频谱

将函数应用到多个向量,返回一个列表,生成四组服从正态分布 \(\mathcal{N}(\mu_i,\sigma_i)\) 的随机数,它们的均值和方差依次是 \(\mu_i = \sigma_i = 1 \ldots 4\)

means <- 1:4
sds <- 1:4
set.seed(2020)
samples <- mapply(rnorm, mean = means, sd = sds, MoreArgs = list(n = 50), SIMPLIFY = FALSE)
samples
## [[1]]
##  [1]  1.37697212  1.30154837 -0.09802317 -0.13040590 -1.79653432
##  [6]  1.72057350  1.93912102  0.77062225  2.75913135  1.11736679
## [11]  0.14687718  1.90925918  2.19637296  0.62841610  0.87673977
## [16]  2.80004312  2.70399588 -2.03876461 -1.28897495  1.05830349
## [21]  3.17436525  2.09818265  1.31822032  0.92685244  1.83426874
## [26]  1.19875064  2.29784138  1.93671831  0.85256681  1.11043199
## [31]  0.18749534  0.25629783  2.09534507  3.43537371  1.38811847
## [36]  1.29062767  0.71440171  1.07601472  0.43970140  1.44718837
## [41]  1.90850113  0.49494040  0.69899599  0.27396402 -0.18007703
## [46]  1.25307471  0.62928870  1.02217956  1.66004412  1.48879364
## 
## [[2]]
##  [1]  1.62242017  3.20271904  0.65247989  2.95210048  2.23750646
##  [6]  2.24245257  1.62790641 -0.65654238  0.86615410  3.15766787
## [11]  5.81807446  2.50151409 -1.19663012  8.40326349  3.91047075
## [16]  2.73728923  3.84583814  1.58895731  2.18593340  2.33652436
## [21]  3.59167825  5.29201121 -1.43384863  1.36331379  0.19172006
## [26]  0.59201441 -1.55619892  0.55548967  2.09230842  2.48731604
## [31]  3.25666261  1.95072284  6.62830662  2.35442051 -0.04882953
## [36]  6.54936260 -1.77811333  4.18790320  5.69233405  3.04206535
## [41] -1.06592422 -1.87872994  2.97383308  4.49047338  1.56545315
## [46]  0.44081377  2.69774900  3.36344850  0.93707723  0.64521316
## 
## [[3]]
##  [1] -2.18635182  0.02621703  1.24348331  4.15056524  5.23999476
##  [6]  0.21473726  1.98547111  7.63534205  3.79952663  3.89860180
## [11]  2.03159394  7.30604230  6.01958161 -2.15824091  3.89676139
## [16]  0.52582309 -0.59876950 -0.82916135  2.63046580  9.49782660
## [21]  2.06315108  4.10231768  6.80831778 -3.79456152 -0.86954973
## [26]  3.56365781  5.25492857  8.35404401  7.52481521  0.14051487
## [31]  3.31026822  1.18331851  2.70719812  2.62965568 -0.13403855
## [36]  2.77538898  8.28040600 -1.29635876 10.98660308 -0.87357460
## [41]  3.04530532  2.88095828  9.57432097 -2.92931265  4.39099699
## [46]  2.21336353 -0.40714352  3.63389844  3.29828378 -6.17005096
## 
## [[4]]
##  [1]  2.6114534 -3.7865046  3.1356934 -1.7967859  5.3817362
##  [6]  4.7528096 -0.5114111  4.2018014  1.2692036  6.5922153
## [11]  6.4414579  1.9493723  7.0176236  4.7524199 -3.7131425
## [16]  8.9432854  5.3131093  0.5803420 -3.5827529  7.3867697
## [21]  8.9761004  4.9450398  0.9572258  7.0888838  6.6462270
## [26] -2.3749335 -4.7613877 -0.7070158  8.2000205  3.9771877
## [31]  2.8959507  7.3940852 -0.1957386 -2.9343453 13.6565563
## [36]  3.2174329  7.7071478  1.1461455 -1.1989669  7.5371856
## [41]  8.8025661 -0.6813591  7.0458875  7.4803610  1.0910292
## [46]  6.5064829 -0.3657709  1.9356219  4.0677359  6.6439628

我们借用图C.2来看一下 mapply 的效果,多组随机数生成非常有助于快速模拟。

par(mfrow = c(2, 2), mar = c(2, 2, 2, 2))
invisible(lapply(samples, function(x) {
  plot(x, pch = 16, col = "grey")
  abline(h = mean(x), lwd = 2, col = "darkorange")
}))
 lapply 函数

图 C.2: lapply 函数

分别计算每个样本的平均值

sapply(samples, mean)
## [1] 1.125622 2.184323 2.731533 3.432760

分别计算每个样本的1,2,3 分位点

lapply(samples, quantile, probs = 1:3/4)
## [[1]]
##       25%       50%       75% 
## 0.6286342 1.1580587 1.8899430 
## 
## [[2]]
##       25%       50%       75% 
## 0.6470298 2.2399795 3.2431767 
## 
## [[3]]
##        25%        50%        75% 
## 0.05479149 2.74129355 4.33088906 
## 
## [[4]]
##          25%          50%          75% 
## -0.001718463  4.022461769  6.924774426

仅用 sapply() 函数替换上面的 lapply(),我们可以得到一个矩阵,值得注意的是函数 quantile()fivenum() 算出来的结果有一些差异

sapply(samples, quantile, probs = 1:3/4)
##          [,1]      [,2]       [,3]         [,4]
## 25% 0.6286342 0.6470298 0.05479149 -0.001718463
## 50% 1.1580587 2.2399795 2.74129355  4.022461769
## 75% 1.8899430 3.2431767 4.33088906  6.924774426
vapply(samples, fivenum, c(Min. = 0, "1st Qu." = 0, Median = 0, "3rd Qu." = 0, Max. = 0))
##               [,1]       [,2]        [,3]       [,4]
## Min.    -2.0387646 -1.8787299 -6.17005096 -4.7613877
## 1st Qu.  0.6284161  0.6452132  0.02621703 -0.1957386
## Median   1.1580587  2.2399795  2.74129355  4.0224618
## 3rd Qu.  1.9085011  3.2566626  4.39099699  7.0176236
## Max.     3.4353737  8.4032635 10.98660308 13.6565563

以数据集 presidents 为例,它是一个 ts 对象类型的时间序列数据,记录了 1945 年至 1974 年每个季度美国总统的支持率,这组数据中存在缺失值,以 NA 表示。支持率的变化趋势见图 C.3

plot(presidents)
1945-1974美国总统的支持率

图 C.3: 1945-1974美国总统的支持率

计算这 30 年每个季度的平均支持率

tapply(presidents, cycle(presidents), mean, na.rm = TRUE)
##        1        2        3        4 
## 58.44828 56.43333 57.22222 53.07143

cycle() 函数计算序列中每个观察值在周期中的位置,presidents 的周期为 4,根据位置划分组,然后分组求平均,也可以化作如下计算步骤,虽然看起来复杂,但是数据操作的过程很清晰,不再看起来像是一个黑箱。

# Base R
cbind(expand.grid(quarter = c("Qtr1", "Qtr2", "Qtr3", "Qtr4"), year = 1945:1974), rate = as.vector(presidents)) %>%
  reshape(., v.names = "rate", idvar = "year", timevar = "quarter", direction = "wide", sep = "") %>%
  `colnames<-`(., gsub(pattern = "(rate)", x = colnames(.), replacement =  "")) %>% 
  `[`(., -1) %>% 
  apply(., 2, mean, na.rm = TRUE)
##     Qtr1     Qtr2     Qtr3     Qtr4 
## 58.44828 56.43333 57.22222 53.07143

tapply 函数来做分组求和

# 一个变量分组求和
tapply(warpbreaks$breaks, warpbreaks[, 3, drop = FALSE], sum)
## tension
##   L   M   H 
## 655 475 390
# 两个变量分组计数
with(warpbreaks, table(wool, tension))
##     tension
## wool L M H
##    A 9 9 9
##    B 9 9 9
# 两个变量分组求和
aggregate(breaks ~ wool + tension, data = warpbreaks,  sum) %>% 
  reshape(., v.names = "breaks", idvar = "wool", timevar = "tension", direction = "wide", sep = "") %>% 
  `colnames<-`(., gsub(pattern = "(breaks)", x = colnames(.), replacement =  ""))
##   wool   L   M   H
## 1    A 401 216 221
## 2    B 254 259 169

C.2 案例学习

C.2.1 格式化输出

formatC(round(runif(1, 1e8, 1e9)), digits = 10, big.mark = ",")
## [1] "309,554,026"
# Sys.setlocale(locale = "C") # 如果是 Windows 系统,必须先设置,否则转化结果是 NA
as.Date(paste("1990-January", 1, sep = "-"), format = "%Y-%B-%d")
## [1] "1990-01-01"

获取当日零点

format(as.POSIXlt(Sys.Date()), "%Y-%m-%d %H:%M:%S")
## [1] "2020-06-27 00:00:00"
表 C.1: 日期表格
Code Meaning Code Meaning
%a Abbreviated weekday %A Full weekday
%b Abbreviated month %B Full month
%c Locale-specific date and time %d Decimal date
%H Decimal hours (24 hour) %I Decimal hours (12 hour)
%j Decimal day of the year %m Decimal month
%M Decimal minute %p Locale-specific AM/PM
%S Decimal second %U Decimal week of the year (starting on Sunday)
%w Decimal Weekday (0=Sunday) %W Decimal week of the year (starting on Monday)
%x Locale-specific Date %X Locale-specific Time
%y 2-digit year %Y 4-digit year
%z Offset from GMT %Z Time zone (character)

C.2.2 斐波那契数列

# 递归 Recall
fibonacci <- function(n) {
  if (n <= 2) {
    if (n >= 0) 1 else 0
  } else {
    Recall(n - 1) + Recall(n - 2)
  }
}
fibonacci(10) # 55
## [1] 55

C.2.3 字符串加密

字符串编码加密, openssl 包提供了 sha1 函数33

library(openssl)
encode_mobile <- function(phone_number) paste("*", paste(toupper(sha1(sha1(charToRaw(paste(phone_number, "$1$mobile$", sep = ""))))), collapse = ""), sep = "")
# 随意模拟两个手机号
mobile_vec <- c("18601013453", "13811674545")
sapply(mobile_vec, encode_mobile)
##                                 18601013453 
## "*B1D46D1D62C7280137F0E14249EE500865247B7B" 
##                                 13811674545 
## "*0554DA6E403491F58F1567DF2EDEB19186B77173"

C.2.4 DOI 引用文献

根据文章 DOI 生成引用 knitcitations

https://d.cosx.org/d/421286-r-markdown/26 https://doi.org/10.1145/3313831.3376466

library(knitcitations)
citep(x ='10.1145/3313831.3376466')

表格中引用

knitr::kable(data.frame(citation = c("[@xie2019]", "[@xie2015]")), format = 'pandoc')
citation
(Xie 2019)
(Xie 2015)

C.3 编译环境

sessionInfo()
## R version 3.6.3 (2020-02-29)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: CentOS Linux 8 (Core)
## 
## Matrix products: default
## BLAS:   /opt/R/R-3.6.3/lib64/R/lib/libRblas.so
## LAPACK: /opt/R/R-3.6.3/lib64/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods  
## [7] base     
## 
## other attached packages:
## [1] openssl_1.4.1 magrittr_1.5 
## 
## loaded via a namespace (and not attached):
##  [1] compiler_3.6.3  bookdown_0.20   htmltools_0.5.0 tools_3.6.3    
##  [5] yaml_2.2.1      stringi_1.4.6   rmarkdown_2.3   highr_0.8      
##  [9] knitr_1.29      stringr_1.4.0   digest_0.6.25   xfun_0.15      
## [13] rlang_0.4.6     evaluate_0.14   askpass_1.1

参考文献

Ihaka, Ross, and Robert Gentleman. 1996. “R: A Language for Data Analysis and Graphics.” Journal of Computational and Graphical Statistics 5 (3): 299–314.

Xie, Yihui. 2015. Dynamic Documents with R and Knitr. 2nd ed. Boca Raton, Florida: Chapman; Hall/CRC. https://yihui.org/knitr/.

Xie, Yihui. 2019. “TinyTeX: A Lightweight, Cross-Platform, and Easy-to-Maintain Latex Distribution Based on TeX Live.” TUGboat, no. 1: 30–32. https://tug.org/TUGboat/Contents/contents40-1.html.