13 Basic: define function

如果说常用的 function 是一把工具，那么知道怎么 define (定义) your own function 就是可以给自己打造一把趁手的工具。

本章将以定义 function 为目标，来介绍所需要的知识。

13.1 Basic structure of function definition

function(arglist) {
  expr
  return(value)
}

例如which()：

function (x, arr.ind = FALSE, useNames = TRUE) 
{
    wh <- .Internal(which(x))
    if (isTRUE(arr.ind) && !is.null(d <- dim(x))) 
        arrayInd(wh, d, dimnames(x), useNames = useNames)
    else wh
}
<bytecode: 0x0000016e83f94820>
<environment: namespace:base>

如果是自定义 function，则只需执行x <- value创建一个 function object，其中该 function 的基本结构为value，x为该 function 的名字

name_fun <- function(arglist) {
  expr
  return(value)
}

例如：

mean_byme <- function(x) {
  mean_x <- sum(x)/length(x)
  return(mean_x)
}

可以通过 Snippets 快速输入 function 的基本结构：

通过之前的学习可以知道，只要把x <- value的代码执行，就会把产生的 object 存到 environment 里去。同样的，只要把 define function 的所有代码从上到下逐行执行，就会把创建的 function object 存到 environment 里去，然后就可以和正常的 function 一样调用了。

在调用 function 的时候，可以将arglist部分视作执行了赋值的操作，赋值的结果是创建了一个只存在于 function 内部的 object，该 object 的 name 为 function 作者提供的 argument name，value 为 function 的调用者提供的 argument value：

toy_fun <- function(arg1) {
  print(arg1)
  return(NULL)
}
obj_tmp <- toy_fun(arg1 = c(1, 2, 3))

#> [1] 1 2 3

# when using toy_fun, an temporary object named "arg1" with value = c(1, 2, 3) is created

13.2 A few caveats

return会直接终止所在的 function 并返回结果；

return_larger_than_two <- function(x) {
  for (i in seq_along(x)) {
    if (x[i] > 2) return(x[i])
    print(x[i])
  }
}
test <- return_larger_than_two(1:4)

#> [1] 1
#> [1] 2

test

#> [1] 3

如果不写return，默认返回最后一行代码的执行结果；

mean_byme_with_return <- function(x) {
  mean_x <- sum(x)/length(x)
  return(mean_x)
}
test <- mean_byme_with_return(c(1, 2, 3))
test

#> [1] 2

# 等价于
mean_byme_without_return <- function(x) {
  mean_x <- sum(x)/length(x)
}
test <- mean_byme_without_return(c(1, 2, 3))
test

#> [1] 2

如果expr非常简单，{}可以省略不写，

mean_byme <- function(x) mean_x <- sum(x)/length(x)

从 4.1.0 以上版本的 R 开始，支持使用\替代function前缀，上述代码可以进一步简写为：

mean_byme <- \(x) sum(x)/length(x)

这种写法常用于 anonymous function，使代码更加简洁。

如果自定义 function 时，不提供 function 名，就相当于创建了一个 anonymous function，该 function 所在的代码被执行多少次，这个 function 就执行多少次，执行完毕就“销声匿迹”，好比是一次性的 function。

(\(x) x + 1)(1)

#> [1] 2

for (i in 1:3) {
  print((\(x) x + 1)(i))
}

#> [1] 2
#> [1] 3
#> [1] 4

anonymous function 通常并不像上面的例子中的第一行展示的一样单独使用，而是搭配诸如apply()等高阶 function 使用（详见 the apply family 14 ），例如：

m1 <- matrix(1:4, nrow = 2, ncol = 2)
m1

#>      [,1] [,2]
#> [1,]    1    3
#> [2,]    2    4

apply(m1, MARGIN = 1, FUN = \(x) x + 1)

#>      [,1] [,2]
#> [1,]    2    3
#> [2,]    4    5

13.3 Argument

必选 argument

function 是针对某个或某些 objects 的一组操作，一般都会有必选 argument。例如mean(x)，就必须要提供x，否则就会报错：

mean()

Error in mean.default(): argument "x" is missing, with no default

由于 R 是以统计分析见长，而统计分析必然离不开数据，所以，R 中相当数量的 function 都会将数据作为其必选 argument，对应的 argument 通常命名为x、data或其他类似意思的单词。此外，argument 一般都会在所有的 arguments 中处于相对靠前的位置。这些规律都会帮助理解本章中的 evaluation 小节（详见13.4）给出的建议。

可选 argument

除了必选 argument 以外的所有 arguments，就是可选 argument，通常这些 arguments 都预先提供了默认值。例如：

mean(x, trim = 0, na.rm = FALSE, ...)，其中na.rm是可选 argument，其值是一个长度为 1 的 logical vector，默认是FALSE，表示运算时NA不会被忽略。

mean(c(1, 2, NA))

#> [1] NA

mean(c(1, 2, NA), na.rm = TRUE)

#> [1] 1.5

13.4 Evaluation

执行 function 的基本形式是fun(arglist)，但根据13.3中的内容可以知道，每一个 argument 都有对应的 name，所以，在执行 function 的时候，匹配 argument 的方式有两种，argument name 或位置（没提供 argument name 时），例如：

matrix(data = NA, nrow = 1, ncol = 1)

# when arument names are available, 
#  positions of argument will not affect evaluation
matrix(nrow = 2, data = 1, ncol = 2)

#>      [,1] [,2]
#> [1,]    1    1
#> [2,]    1    1

# when argument names are unavailable, 
#   arguments value will be evaluated according to their positions
matrix(2, 1, 2)

#>      [,1] [,2]
#> [1,]    2    2

因此，在执行 function 时，建议如下：

略写必选 argument（如数据）的 name。因为这些 arguments 的位置往往都是在最前面。省略 argument name，按顺序写它们的argument value 即可。
提供可选 argument 的 name。因为这些 arguments 的位置都往往靠后，且要用的可选 argument 可能会随着具体情况而变化，不一定会按照顺序来，如果不提供 argument ，直接按照位置来匹配会容易出现匹配错误。

例如：paste (..., sep = " ", collapse = NULL, recycle0 = FALSE)

paste(c(1, 2, 3), collapse = "_")

[1] "1_2_3"

paste(c(1, 2, 3), "_")

[1] "1 _" "2 _" "3 _"

13.5 Environment (optional)

Function 在执行的时候，会生成一个该 function 专属的独立 environment，这就使得在执行时有一些需要注意的细节。

所有在 function 内部定义的 object 都是临时局部变量，在 function 执行结束后会和该 function 专属的独立 environment 一起被“销毁”；

x <- 1
my_fun <- \(y) y + 1
my_fun(x)

#> [1] 2

#> Error: object 'y' not found

所有没有在 function 内部定义的 object， R 都会到上一级的 environment 里面找，没找到就再上一级，直到 Global Environment；

x <- 1
my_fun <- \() x + 1
my_fun()

#> [1] 2

同名 objects：
- function内外部定义有同名 objects，例如a，执行 function 时使用的是临时局部变量a，且不会改变外部 environment 中a的值。
- 定义了和某 function 同名的 object，通常 R 会根据实际情况来自动区分；

a <- 1
my_fun <- function() {
  a <- 2
  b <- 1
  return(a + b)
}
my_fun()

#> [1] 3

#> [1] 1

my_fun <- function() {
  mean <- c(1, 2, 3)
  mean(mean)
  return(mean)
}
my_fun()

#> [1] 1 2 3

上述两种同名的情况，虽然看起来不会造成什么严重的问题，但是极其容易让人混淆，请注意避免。

在 function 内部更改上一级 environment 中 object 的 value。

<<-

a <- 1
my_fun <- function() {
  rnd <- runif(1)
  print(rnd)
  if (rnd > 0.5) a <<- 0
  return(a)
}
my_fun()

#> [1] 0.3421004

#> [1] 1

13.6 Be an attentive coder (optional)

在定义 function 时，一个具备用户思维的码农会将各种可能的情况考虑得比较完善，并针对不同的情况给出清晰的提示。相反，如果考虑不够完善，就会在用户使用时出现令人摸不着头脑的错误，比如，

item_par <- c(alphaj = 1, betaj1 = 0, betaj2 = 0.5)
response <- 1
catR::thetaEst(it = item_par, x = response, model = "GRM")

#> Error in it[ind, ]: incorrect number of dimensions

根本的原因在于thetaEst源码的第 25 行，直接默认it为支持使用考虑维度的方式 subsetting 的结构类型（如 matrix），所以一但it是诸如 vector 这种不可以根据维度来 subsetting 的 object，就会直接报错

function (it, x, model = NULL, D = 1, method = "BM", priorDist = "norm", 
  priorPar = c(0, 1), weight = "Huber", tuCo = 1, range = c(-4, 
    4), parInt = c(-4, 4, 33), constantPatt = NULL, current.th = 0, 
  bRange = c(-2, 2)) 
{
  constantPattern <- function(t) ifelse(sum(t) == 0 | sum(t) == 
    length(t), TRUE, FALSE)
  METHOD <- NULL
  if (!constantPattern(x) | !is.null(model) | is.null(constantPatt)) 
    METHOD <- method
  else {
    if (sum(constantPatt == c("BM", "EAP", "WL")) == 1) 
      METHOD <- constantPatt
    else {
      if (sum(x) == 0) 
        res <- switch(constantPatt, fixed4 = current.th - 
          0.4, fixed7 = current.th - 0.7, var = 0.5 * 
          (current.th + bRange[1]))
      else res <- switch(constantPatt, fixed4 = current.th + 
        0.4, fixed7 = current.th + 0.7, var = 0.5 * 
        (current.th + bRange[2]))
    }
  }
  ind <- which(!is.na(x))
  it <- it[ind, ]

那么如何才是考虑较为完善？以定义一个寻找众数的 function 为例：

get_mode <- function(x) {
  fre_tab <- table(x)
  return(as.numeric(names(fre_tab)[fre_tab == max(fre_tab)]))
}
a <- sample(10, size = 10, replace = TRUE)
a

#>  [1]  6  4  5  1  2  8 10  9 10  5

get_mode(a)

#> [1]  5 10

通常都是从 argument 的角度考虑。

考虑 argument 的 structure type

get_mode <- function(x) {
  if (is.expression(x)) stop("argument 'x' can not be an expression")
  if (is.list(x) & !is.data.frame(x)) {
    x <- unlist(x)
    warning("argument 'x' is a list and has been expanded into a vector before loacting its mode")
  }
  if (is.data.frame(x)) x <- as.matrix(x)
  fre_tab <- table(x)
  return(as.numeric(names(fre_tab)[fre_tab == max(fre_tab)]))
}
get_mode(expression(1, 2))

#> Error in get_mode(expression(1, 2)): argument 'x' can not be an expression

get_mode(list(1, 2, 3))

#> Warning in get_mode(list(1, 2, 3)): argument 'x' is a list
#> and has been expanded into a vector before loacting its
#> mode

#> [1] 1 2 3

get_mode(data.frame(x = c(1, 2, 3), y = c(2, 3, 4)))

#> [1] 2 3

考虑 argument 的 element type

get_mode <- function(x) {
  if (is.expression(x)) stop("argument 'x' can not be an expression")
  if (is.list(x) & !is.data.frame(x)) {
    x <- unlist(x)
    warning("argument 'x' is a list and has been expanded into a vector before loacting its mode")
  }
  if (is.data.frame(x)) x <- as.matrix(x)
  if (!is.numeric(x) & !is.logical(x)) {
    warning("argument 'x' is not numeric or logical: returning NA")
    return(NA)
  }
  fre_tab <- table(x)
  return(as.numeric(names(fre_tab)[fre_tab == max(fre_tab)]))
}
get_mode(expression(1, 2))

#> Error in get_mode(expression(1, 2)): argument 'x' can not be an expression

get_mode(list(1, 2, 3))

#> Warning in get_mode(list(1, 2, 3)): argument 'x' is a list
#> and has been expanded into a vector before loacting its
#> mode

#> [1] 1 2 3

get_mode(c("a", "a", "b"))

#> Warning in get_mode(c("a", "a", "b")): argument 'x' is not
#> numeric or logical: returning NA

#> [1] NA

考虑操作的完备性

get_mode <- function(x, margin = 0) {
  if (is.expression(x)) stop("argument 'x' can not be an expression")
  if (is.list(x) & !is.data.frame(x)) {
    x <- unlist(x)
    warning("argument 'x' is a list and has been expanded into a vector before loacting its mode")
  }
  if (is.data.frame(x)) x <- as.matrix(x)
  if (!is.numeric(x) & !is.logical(x)) {
    warning("argument 'x' is not numeric or logical: returning NA")
    return(NA)
  }
  if ((margin == 1 | margin == 2) & length(dim(x)) == 2) {
    n_mode <- ifelse(margin == 1, nrow(x), ncol(x))
    all_modes <- vector("list", n_mode)
    for (r in 1:n_mode) {
      fre_tab <- table(if (margin == 1) x[r, ] else x[, r])
      all_modes[[r]] <- as.numeric(names(fre_tab)[fre_tab == max(fre_tab)])
    }
    return(all_modes)
  } else if ((margin == 1 | margin == 2) & length(dim(x)) > 2) {
    stop("the dimensionality of argument 'x' must not exceed 2 when locating mode rowwise or colwise") 
  } else {
    fre_tab <- table(x)
    return(as.numeric(names(fre_tab)[fre_tab == max(fre_tab)]))
  }
}
a <- matrix(sample(12, size = 15, replace = TRUE), nrow = 3, ncol = 5)
a

#>      [,1] [,2] [,3] [,4] [,5]
#> [1,]    9   11    9   11   11
#> [2,]   10   11    3    2    9
#> [3,]    5   10   12    1    7

get_mode(a, margin = 1)

#> [[1]]
#> [1] 11
#> 
#> [[2]]
#> [1]  2  3  9 10 11
#> 
#> [[3]]
#> [1]  1  5  7 10 12

get_mode(a, margin = 2)

#> [[1]]
#> [1]  5  9 10
#> 
#> [[2]]
#> [1] 11
#> 
#> [[3]]
#> [1]  3  9 12
#> 
#> [[4]]
#> [1]  1  2 11
#> 
#> [[5]]
#> [1]  7  9 11

a1 <- array(sample(16, size = 16, replace = TRUE), dim = c(2, 4, 2))
get_mode(a1, margin = 1)

#> Error in get_mode(a1, margin = 1): the dimensionality of argument 'x' must not exceed 2 when locating mode rowwise or colwise

减少冗余代码

mode_bytable <- function(input) {
  fre_tab <- table(input)
  return(as.numeric(names(fre_tab)[fre_tab == max(fre_tab)]))
}
    
get_mode <- function(x, margin = 0) {
  if (is.expression(x)) stop("argument 'x' can not be an expression")
  if (is.list(x)) {
    x <- unlist(x)
    warning("argument 'x' is a list and has been expanded into a vector before loacting its mode")
  }
  if (is.data.frame(x)) x <- as.matrix(x)
  if (!is.numeric(x) & !is.logical(x)) {
    warning("argument 'x' is not numeric or logical: returning NA")
    return(NA)
  }
  if ((margin == 1 | margin == 2) & length(dim(x)) == 2) {
    n_mode <- ifelse(margin == 1, nrow(x), ncol(x))
    all_modes <- vector("list", n_mode)
    for (r in 1:n_mode) {
      all_modes[[r]] <- mode_bytable(if (margin == 1) x[r, ] else x[, r])
    }
  } else if ((margin == 1 | margin == 2) & length(dim(x)) > 2) {
    stop("the dimensionality of argument 'x' must not exceed 2 when computing mode rowwise or colwise")
  } else {
    all_modes <- mode_bytable(x)
  }
  return(all_modes)
}
a <- matrix(sample(12, size = 15, replace = TRUE), nrow = 3, ncol = 5)
a

#>      [,1] [,2] [,3] [,4] [,5]
#> [1,]    9    2    2    1    1
#> [2,]   10    8   12    2   11
#> [3,]    4    4    2    4    4

get_mode(a)

#> [1] 2 4

get_mode(a, margin = 1)

#> [[1]]
#> [1] 1 2
#> 
#> [[2]]
#> [1]  2  8 10 11 12
#> 
#> [[3]]
#> [1] 4

13.7 Be a curious coder (optional)

如果要提高编程的水平，一个重要的方式就是多读高手的代码，特别是一些包中的源码。

已经加载的包中的 function

mean # 在 Console 输出

function (x, ...) 
UseMethod("mean")
<bytecode: 0x0000016eff65b158>
<environment: namespace:base>

使用View(mean)更方便。

未加载的包中的 function

已经安装，但是没有加载的包中的 function，可以使用::来查看，例如：

openxlsx::read.xlsx

#> function (xlsxFile, sheet, startRow = 1, colNames = TRUE, rowNames = FALSE, 
#>     detectDates = FALSE, skipEmptyRows = TRUE, skipEmptyCols = TRUE, 
#>     rows = NULL, cols = NULL, check.names = FALSE, sep.names = ".", 
#>     namedRegion = NULL, na.strings = "NA", fillMergedCells = FALSE) 
#> {
#>     UseMethod("read.xlsx", xlsxFile)
#> }
#> <bytecode: 0x0000016e88028760>
#> <environment: namespace:openxlsx>

::也可以用来在不加载整个包的情况下使用包中的指定 function，例如：

openxlsx::read.xlsx("1.xlsx")

更多具体的查看源码的方式可以参考：

冷知识：

To understand computations in R, two slogans are helpful:

Everything that exists is an object.

Everything that happens is a function call.

— John Chambers

`+`

#> function (e1, e2)  .Primitive("+")

13.8 Recap

自定义 function 基本结构是：

name_fun <- function(arglist) {
  expr
  return(value)
}

\(arglist) expr是4.1.0以上版本的 R 支持的 function 基本结构简便写法；
function 的 argument 包括必选 argument 和可选 argument，可选 argument 通常都会有默认值；
使用 function 时，可以考虑略写必选 argument 的 name，而提供可选 argument 的 name；
function 在执行的时候会生成一个该 function 专属的独立 environment；
尽量避免同名的情况，包括不同 environment 中的变量同名或 function 和变量同名。

12 Basic: control flow

14 Basic: apply family