1. Collect
2. Clean
3. Analyze

## 15.1 Prerequisite

### 15.1.1 Tidy data

1. Every variable gets its own column (the most important rule);
2. Every observation gets its own row;
3. Every value gets its own cell.
• Example of tidy data

```
  X.       Name Type.1 Type.2 Total HP Attack Defense
1  1  Bulbasaur  Grass Poison   318 45     49      49
2  2    Ivysaur  Grass Poison   405 60     62      63
3  3   Venusaur  Grass Poison   525 80     82      83
4  4 Charmander   Fire          309 39     52      43
5  5 Charmeleon   Fire          405 58     64      58
6  6  Charizard   Fire Flying   534 78     84      78
  Sp..Atk Sp..Def Speed Stage Legendary
1      65      65    45     1     FALSE
2      80      80    60     2     FALSE
3     100     100    80     3     FALSE
4      60      50    65     1     FALSE
5      80      65    80     2     FALSE
6     109      85   100     3     FALSE
```
• Example of non-tidy data (the same counts shown first as a wide matrix, then tidied into a tibble)

```
                      A B  C D  E F G
virtual_imaginary    13 9 12 7 10 8 9
no_virtual_imaginary  4 2 10 8  6 6 4
```

```
# A tibble: 14 × 3
   participant group                count
   <chr>       <chr>                <int>
 1 A           virtual_imaginary       13
 2 A           no_virtual_imaginary     4
 3 B           virtual_imaginary        9
 4 B           no_virtual_imaginary     2
 5 C           virtual_imaginary       12
 6 C           no_virtual_imaginary    10
 7 D           virtual_imaginary        7
 8 D           no_virtual_imaginary     8
 9 E           virtual_imaginary       10
10 E           no_virtual_imaginary     6
11 F           virtual_imaginary        8
12 F           no_virtual_imaginary     6
13 G           virtual_imaginary        9
14 G           no_virtual_imaginary     4
```
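One practical payoff of the tidy layout is that per-group summaries become one-liners. A minimal base-R sketch, using a small made-up data frame in the same shape as the tidy tibble above (the values here are toy data for illustration only):

```r
# Toy tidy data: one row per participant-by-group observation
tidy <- data.frame(
  participant = rep(c("A", "B"), each = 2),
  group = rep(c("virtual_imaginary", "no_virtual_imaginary"), times = 2),
  count = c(13, 4, 9, 2)
)
# Because each variable has its own column, grouping is just a formula
aggregate(count ~ group, data = tidy, FUN = mean)
```

With the wide matrix, the same summary would require manual row indexing; with tidy data it is a single declarative call.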

### 15.1.2 Pipe operator

#### 15.1.2.1 %>%

`%>%` first appeared in the magrittr package, written by Stefan Milton Bache.

#### 15.1.2.2 What is %>%

The most basic usage of `%>%` is `x %>% fun(...)`, which is essentially equivalent to `fun(x, ...)`: it feeds `x` into `fun` as the first argument by default and then evaluates `fun`.

```
vec <- c(1, 2, 3)
vec %>% mean
[1] 2
mean(vec)
[1] 2
```
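Additional arguments simply follow the piped-in first argument. A minimal sketch (assuming the magrittr package is installed and attached; the vector here is made up for illustration):

```r
library(magrittr)  # provides %>% (assumed to be installed)

# Extra arguments are appended after the piped-in first argument,
# so this is the same as mean(c(1, 5, NA), na.rm = TRUE)
c(1, 5, NA) %>% mean(na.rm = TRUE)
```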

#### 15.1.2.3 Placeholder

When the piped object should go somewhere other than the first argument position, magrittr provides the placeholder `.`:

```
df <- data.frame(x = c(1, 2, 3), y = c(4, 5, 6))
df %>%
  ncol %>%
  seq(from = 1, to = .)
[1] 1 2
```

#### 15.1.2.4 Why %>%

```
car_data <-
  transform(aggregate(. ~ cyl,
                      data = subset(datasets::mtcars, hp > 100), # mtcars[mtcars$hp > 100, ]
                      FUN = function(x) round(mean(x), 2)),
            kpl = mpg * 0.4251)
car_data
  cyl   mpg   disp     hp drat   wt  qsec   vs   am gear
1   4 25.90 108.05 111.00 3.94 2.15 17.75 1.00 1.00 4.50
2   6 19.74 183.31 122.29 3.59 3.12 17.98 0.57 0.43 3.86
3   8 15.10 353.10 209.21 3.23 4.00 16.77 0.00 0.14 3.29
  carb       kpl
1 2.00 11.010090
2 3.43  8.391474
3 3.50  6.419010
```

To read this code, the first thing you have to do is work out from the parentheses which part runs first: the inner `aggregate()` executes before the outer `transform()`. On top of that, the arguments of `aggregate()` are themselves the results of function calls, with functions nested inside functions, which makes the code even harder to follow.

```
car_data <- datasets::mtcars %>%
  subset(hp > 100) %>%
  aggregate(. ~ cyl, data = ., FUN = . %>% mean %>% round(2)) %>%
  # the second dot is a placeholder, the 1st and 3rd ones are the dot notation of aggregate
  transform(kpl = mpg * 0.4251)
car_data
  cyl   mpg   disp     hp drat   wt  qsec   vs   am gear
1   4 25.90 108.05 111.00 3.94 2.15 17.75 1.00 1.00 4.50
2   6 19.74 183.31 122.29 3.59 3.12 17.98 0.57 0.43 3.86
3   8 15.10 353.10 209.21 3.23 4.00 16.77 0.00 0.14 3.29
  carb       kpl
1 2.00 11.010090
2 3.43  8.391474
3 3.50  6.419010
```

Rewritten with `%>%`, the code becomes clearer both semantically and structurally. Read from top to bottom, it says:

1. take the rows of `mtcars` where `hp > 100`;
2. pass the result to `aggregate()`, computing the means grouped by `cyl` and rounding to 2 decimal places;
3. transform the result by multiplying the `mpg` column by 0.4251 and storing it in a new column `kpl`.

The cognitive load of reading this version is far lower than that of the original. The example also shows where `%>%` shines: sequential operations on the same object.

#### 15.1.2.5 Native pipe operator

As the magrittr package grew more and more influential and `%>%` gained ever wider use, the R core team added a native pipe operator, `|>`, in R 4.1.0; since then, piping no longer requires loading magrittr.

Are there differences between `|>` and `%>%` worth knowing about? Yes, seven in total, of which the first four come up most often.

1. With `|>`, the function call on the right-hand side must have explicit `()`:

```
vec <- c(1, 2, 3)
vec %>% mean
[1] 2
vec %>% mean() # equivalent
[1] 2
vec |> mean()
[1] 2
```

2. The placeholder of `|>` is `_`:

```
df <- data.frame(x = c(1, 2, 3), y = c(4, 5, 6))
df %>%
  ncol %>%
  seq(from = 1, to = .)
[1] 1 2
df |>
  ncol() |>
  seq(from = 1, to = _)
[1] 1 2
```
3. The `_` of `|>` must be used with a named argument:

```
letters[1:3] %>% paste(., collapse = " and ")
[1] "a and b and c"
letters[1:3] |> paste(_, collapse = " and ")
Error: pipe placeholder can only be used as a named argument (<text>:1:17)
```

As a result, using `_` for a flexible, unnamed argument such as `...` takes a little trickery:

```
# make up an argument with an unused name in paste
letters[1:3] |> paste(x = _, collapse = " and ")
[1] "a and b and c"
# use an anonymous function
letters[1:3] |> (\(x) paste(x, collapse = " and "))()
[1] "a and b and c"
```

4. The `_` in a `|>` call may only appear once on the right-hand side:

```
letters[1:3] %>% paste(., .)
[1] "a a" "b b" "c c"
letters[1:3] |> paste(x = _, y = _)
Error: pipe placeholder may only appear once (<text>:1:17)
```

5. The `_` of `|>` is not allowed inside nested function calls:

```
# magrittr supports nested function calls wrapped by braces
1:2 %>% {sqrt(sum(.))}
[1] 1.732051
# use an anonymous function for nested calls
1:2 %>% (\(x) sqrt(sum(x)))
[1] 1.732051
```

6. `|>` is stricter about the right-hand side being an ordinary function call:

```
mtcars %>% `$`(cyl)
 [1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4
[29] 8 6 8 4
mtcars |> (`$`)(cyl)
 [1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4
[29] 8 6 8 4
mtcars |> base::`$`(cyl)
 [1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4
[29] 8 6 8 4
fun <- `$`
mtcars |> fun(cyl)
 [1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4
[29] 8 6 8 4
```

7. `|>` does not create an additional environment when evaluating:

```
x <- 1
"x" %>% assign(4)
x
[1] 1
# just like a function, %>% generates an additional
# independent env when evaluating, everything therefore
# happens in this environment, and changes will not be
# passed to the parent environments
"x" |> assign(4)
x
[1] 4
# |> leads to direct assignment of x in the global
# environment because there is no additional env
"x" |> (\(y) assign(y, 3))()
x
[1] 4
# an anonymous function acts in a way similar to %>%,
# since a function call always generates an additional env
```

See also: What are the differences between R’s new native pipe |> and the magrittr pipe %>%?
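One underlying reason for several of these differences: `|>` is implemented as a syntax transformation at parse time, whereas `%>%` is an ordinary function evaluated at run time. A small base-R sketch (R ≥ 4.1; the vector is made up for illustration):

```r
# The parser rewrites the native pipe into a plain call before anything
# is evaluated, which is why no extra environment is ever involved
e <- quote(c(1, 2, 3) |> mean())
e
# the stored expression is simply mean(c(1, 2, 3))
eval(e)
```

Inspecting `quote()` output like this is a handy way to see what the native pipe actually does to your code.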
## 15.2 Create data set

Creating the data set means turning the collected data into one complete data set. Depending on the situation, this can take several forms, such as manual entry and merging.

### 15.2.1 Import

When paper questionnaires are collected offline, the first task is to type the data into the computer. Because of the large amount of data and the long entry sessions, data-entry workers easily make mistakes, so the usual practice is to have two people independently enter the same batch of questionnaires and then cross-check. The goal in this scenario is therefore to compare whether the two data sets are completely identical, and if not, where they differ.

Before comparing, first make sure the two data sets agree in their general setup: the number and order of participants, the number and order of items, how missing data are coded, whether reverse scoring has been applied, and so on. All of these can be agreed upon before data entry begins.

```
head(data_dirty_01)
  gender ethnicity first_name last_name age P1 P2 P3 P4 P5
1      1         1     Samara     Lizer  20  3  3  5  4  3
2      1         2        Nur       Sem  29  1  2  2  2  3
3      1         2       Juhi      Ware  28  1  1  3  2  3
4      0         2     Joshua      Vang  21  3  2  2  2  3
5      0         2      Ronak   Hartley  20  3  4  3  3  5
6      0         2      Jacob   Solongo  26  3  2  3  1  2
  P6 P7 P8 P9 P10_att P11 P12 P13 P14 P15 P16 W17 W18 W19
1  4  4  5  4       3   5   3   5   3   3   4   4   4   3
2  3  1  2  3       3   1   2   3   2   3   3   2   2   2
3  1  3  2  1       3   1   1   1   1   1   3   3   2   2
4  2  3  3  2       3   2   2   3   2   4   3   2   3   4
5  3  3  4  4       3   4   4   4   2   4   4   5   3   4
6  3  2  2  3       3   3   4   1   3   2   2   2   3   3
  W20 W21 W22  W23 W24 W25 W26 W27 W28 W29 W30 W31 W32_att
1   5   3   3    3   4   5   4   2   5   4   3   4       5
2   2   3   3    2   3   3   3   2   2   2   1   1       5
3   3   2   3 9999   4   2   3   3   1   4   2   3       5
4   2   4   3    3   4   3   4   3   3   3   4   3       5
5   3   4   4    3   4   5   4   4   3   3   4   4       5
6   3   2   3    2   3   2   2   2   4   3   3   2       5
  g33 g34 g35 g36 g37 g38 g39_att g40 g41 g42 g43 g44 g45
1   4   4   2   3   4   3       3   4   3   4   4   4   3
2   5   5   4   3   4   3       3   3   5   3   3   5   5
3   2   2   3   3   3   4       3   3   3   3   2   3   2
4   2   2   2   1   3   3       3   1   3   2   2   4   3
5   3   3   4   3   4   3       3   4   3   4   3   3   3
6   3   2   3   3   2   3       3   3   3   2   2   2   4
  g46 g47 g48 S49 S50 S51 S52 S53 S54 S55 S56 S57 S58 S59
1   4   3   3   3   3   3   3   2   3   4   3   4   3   4
2   5   5   4   3   3   4   3   3   3   2   3   4   3   2
3   2   4   3   1   3   1   2   2   1   1   2   1   2   2
4   1   3   1   1   1   1   2   1   2   1   1   1   1   2
5   2   2   3   5   4   4   4   4   5   4   4   3   4   5
6   2   3   3   2   2   2   3   2   3   4   3   3   3   4
  S60 S61_recode S62 S63 S64 J65 J66 J67 J68 J69 J70 J71
1   2          3   3   5   4   4   4   4   4   3   1   4
2   2          3   3   3   2   3   3   3   3   3   1   3
3   1          5   2   1   2   3   3   3   4   2   3   3
4   2          5   2   1   3   2   2   1   2   2   3   4
5   3          1   3   4   5   3   3   3   3   4   3   2
6   2          3   2   1   3   4   3   2   3   2   2   2
  J72 J73 J74 J75 J76 J77 J78 J79 J80 time
1   4   4   3   4   4   3   3   5   3 1632
2   3   2   2   2   2   2   2   2   2 1572
3   3   3   4   2   1   5   3   3   3 1616
4   2   3   4   2   3   4   2   3   4 1609
5   4   2   3   2   3   3   3   3   3 1639
6   2   3   3   3   3   2   2   3   3  215
head(data_dirty_02)
  gender ethnicity first_name last_name age P1 P2 P3 P4 P5
1      1         1     Samara     Lizer  20  3  3  5  4  3
2      1         2        Nur       Sem  29  1  2  2  2  3
3      1         2       Juhi      Ware  28  1  1  3  2  3
4      0         2     Joshua      Vang  21  3  2  2  2  3
5      0         2      Ronak   Hartley  20  3  4  3  3  5
6      0         2      Jacob   Solongo  26  3  2  3  1  2
  P6 P7 P8 P9 P10_att P11 P12 P13 P14 P15 P16 W17 W18 W19
1  4  4  5  4       3   5   3   5   3   3   4   4   4   3
2  3  1  2  3       3   1   2   3   2   3   3   2   2   2
3  1  3  2  1       3   1   1   1   1   1   3   3   2   2
4  2  3  3  2       3   2   2   3   2   4   3   2   3   4
5  3  3  4  4       3   4   4   4   2   4   4   5   3   4
6  3  2  2  3       3   3   4   1   3   2   2   2   3   3
  W20 W21 W22 W23 W24 W25 W26 W27 W28 W29 W30 W31 W32_att
1   5   3   3   3   4   5   4   2   5   4   3   4       5
2   2   3   3   2   3   3   3   2   2   2   1   1       5
3   3   2   3 999   4   2   3   3   1   4   2   3       5
4   2   4   3   3   4   3   4   3   3   3   4   3       5
5   3   4   4   3   4   5   4   4   3   3   4   4       5
6   3   2   3   2   3   2   2   2   4   3   3   2       5
  g33 g34 g35 g36 g37 g38 g39_att g40 g41 g42 g43 g44 g45
1   4   4   2   3   4   3       3   4   3   4   4   4   3
2   5   5   4   3   4   3       3   3   5   3   3   5   5
3   2   2   3   3   3   4       3   3   3   3   2   3   2
4   2   2   2   1   3   3       3   1   3   2   2   4   3
5   3   3   4   3   4   3       3   4   3   4   3   3   3
6   3   2   3   3   2   3       3   3   3   2   2   2   4
  g46 g47 g48 S49 S50 S51 S52 S53 S54 S55 S56 S57 S58 S59
1   4   3   3   3   3   3   3   2   3   4   3   4   3   4
2   5   5   4   3   3   4   3   3   3   2   3   4   3   2
3   2   4   3   1   3   1   2   2   1   1   2   1   2   2
4   1   3   1   1   1   1   2   1   2   1   1   1   1   2
5   2   2   3   5   4   4   4   4   5   4   4   3   4   5
6   2   3   3   2   2   2   3   2   3   4   3   3   3   4
  S60 S61_recode S62 S63 S64 J65 J66 J67 J68 J69 J70 J71
1   2          3   3   5   4   4   4   4   4   3   1   4
2   2          3   3   3   2   3   3   3   3   3   1   3
3   1          5   2   1   2   3   3   3   4   2   3   3
4   2          5   2   1   3   2   2   1   2   2   3   4
5   3          1   3   4   5   3   3   3   3   4   3   2
6   2          3   2   1   3   4   3   2   3   2   2   2
  J72 J73 J74 J75 J76 J77 J78 J79 J80 time
1   4   4   3   4   4   3   3   5   3 1632
2   3   2   2   2   2   2   2   2   2 1572
3   3   3   4   2   1   5   3   3   3 1616
4   2   3   4   2   3   4   2   3   4 1609
5   4   2   3   2   3   3   3   3   3 1639
6   2   3   3   3   3   2   2   3   3  215
```

1. Are the two data sets exactly identical?

```
identical(data_dirty_01, data_dirty_02)
[1] FALSE
```
2. Where do they differ?

If the two data files record exactly the same content but merely differ in element type, `identical()` will also return `FALSE`, so the first thing to check is whether there is any difference in element type.

```
# check whether element types are the problem
elem_type_data01 <- sapply(data_dirty_01, typeof)
elem_type_data01
     gender   ethnicity  first_name   last_name         age
  "integer"   "integer" "character" "character"   "integer"
         P1          P2          P3          P4          P5
   "double"    "double"    "double"    "double"    "double"
         P6          P7          P8          P9     P10_att
   "double"    "double"    "double"    "double"    "double"
        P11         P12         P13         P14         P15
   "double"    "double"    "double"    "double"    "double"
        P16         W17         W18         W19         W20
   "double"    "double"    "double"    "double"    "double"
        W21         W22         W23         W24         W25
   "double"    "double"    "double"    "double"    "double"
        W26         W27         W28         W29         W30
   "double"    "double"    "double"    "double"    "double"
        W31     W32_att         g33         g34         g35
   "double"    "double"    "double"    "double"    "double"
        g36         g37         g38     g39_att         g40
   "double"    "double"    "double"    "double"    "double"
        g41         g42         g43         g44         g45
   "double"    "double"    "double"    "double"    "double"
        g46         g47         g48         S49         S50
   "double"    "double"    "double"    "double"    "double"
        S51         S52         S53         S54         S55
   "double"    "double"    "double"    "double"    "double"
        S56         S57         S58         S59         S60
   "double"    "double"    "double"    "double"    "double"
 S61_recode         S62         S63         S64         J65
   "double"    "double"    "double"    "double"    "double"
        J66         J67         J68         J69         J70
   "double"    "double"    "double"    "double"    "double"
        J71         J72         J73         J74         J75
   "double"    "double"    "double"    "double"    "double"
        J76         J77         J78         J79         J80
   "double"    "double"    "double"    "double"    "double"
       time
   "double"
elem_type_data02 <- sapply(data_dirty_02, typeof)
identical(elem_type_data01, elem_type_data02)
[1] TRUE
```

Having confirmed that the element types are the same, we know the difference must lie in the data content itself. The next step is to locate it. Note that `which(..., arr.ind = TRUE)` already returns the positions as a row/column matrix, so no extra index arithmetic with `%%` and `%/%` is needed:

```
locations <- which(data_dirty_01 != data_dirty_02, arr.ind = TRUE)
locations <- locations[order(locations[, "row"]), , drop = FALSE]
head(locations)
```

Let's verify that the located cells really differ.
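The cell-locating step in miniature, on hypothetical toy data (not the questionnaire data above): comparing two data frames yields a logical matrix, and `which()` with `arr.ind = TRUE` turns the `TRUE` positions directly into (row, column) pairs.

```r
# Two toy data frames that differ in exactly one cell
a <- data.frame(x = c(1, 2, 3), y = c(4, 5, 6))
b <- data.frame(x = c(1, 2, 3), y = c(4, 99, 6))   # y[2] altered

# a != b is a logical matrix; arr.ind = TRUE converts the linear
# indices of the TRUE entries into a row/col matrix directly
locations <- which(a != b, arr.ind = TRUE)
locations  # one mismatch: row 2, column 2 (5 vs 99)
```

The resulting matrix can be used as-is to inspect or report the conflicting cells of either data set.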
```
data_dirty_01[3, ]
  gender ethnicity first_name last_name age P1 P2 P3 P4 P5
3      1         2       Juhi      Ware  28  1  1  3  2  3
  P6 P7 P8 P9 P10_att P11 P12 P13 P14 P15 P16 W17 W18 W19
3  1  3  2  1       3   1   1   1   1   1   3   3   2   2
  W20 W21 W22  W23 W24 W25 W26 W27 W28 W29 W30 W31 W32_att
3   3   2   3 9999   4   2   3   3   1   4   2   3       5
  g33 g34 g35 g36 g37 g38 g39_att g40 g41 g42 g43 g44 g45
3   2   2   3   3   3   4       3   3   3   3   2   3   2
  g46 g47 g48 S49 S50 S51 S52 S53 S54 S55 S56 S57 S58 S59
3   2   4   3   1   3   1   2   2   1   1   2   1   2   2
  S60 S61_recode S62 S63 S64 J65 J66 J67 J68 J69 J70 J71
3   1          5   2   1   2   3   3   3   4   2   3   3
  J72 J73 J74 J75 J76 J77 J78 J79 J80 time
3   3   3   4   2   1   5   3   3   3 1616
data_dirty_02[3, ]
  gender ethnicity first_name last_name age P1 P2 P3 P4 P5
3      1         2       Juhi      Ware  28  1  1  3  2  3
  P6 P7 P8 P9 P10_att P11 P12 P13 P14 P15 P16 W17 W18 W19
3  1  3  2  1       3   1   1   1   1   1   3   3   2   2
  W20 W21 W22 W23 W24 W25 W26 W27 W28 W29 W30 W31 W32_att
3   3   2   3 999   4   2   3   3   1   4   2   3       5
  g33 g34 g35 g36 g37 g38 g39_att g40 g41 g42 g43 g44 g45
3   2   2   3   3   3   4       3   3   3   3   2   3   2
  g46 g47 g48 S49 S50 S51 S52 S53 S54 S55 S56 S57 S58 S59
3   2   4   3   1   3   1   2   2   1   1   2   1   2   2
  S60 S61_recode S62 S63 S64 J65 J66 J67 J68 J69 J70 J71
3   1          5   2   1   2   3   3   3   4   2   3   3
  J72 J73 J74 J75 J76 J77 J78 J79 J80 time
3   3   3   4   2   1   5   3   3   3 1616
```

### 15.2.2 Merge

When data are collected on different devices, or at different time points, you often end up with several files that record the same variables but hold different parts of the data; these files then need to be merged into one complete data set.
1. Same variables, different participants

When the same variables were measured once each on different participants, the resulting files only need to be matched on the variables: extract the variables the files have in common, then stack them vertically.

```
head(data_difsub_idevar_01)
  id humility egalitarian_belief stereotyping var_a var_b
1  1        3                  5            2     3     5
2  2        3                  3            1     2     3
3  3        2                  1            5     5     4
4  4        2                  2            5     4     2
5  5        3                  1            1     2     3
6  6        5                  2            1     1     2
  var_c
1     2
2     1
3     5
4     3
5     4
6     5
head(data_difsub_idevar_02)
  id humility egalitarian_belief stereotyping var_d var_e
1 101        1                  1            4     1     4
2 102        3                  1            4     5     4
3 103        4                  1            2     3     2
4 104        2                  2            3     4     4
5 105        5                  2            1     3     5
6 106        2                  4            4     2     4
  var_f
1     5
2     5
3     4
4     3
5     3
6     1
name_common_col <- intersect(names(data_difsub_idevar_01),
                             names(data_difsub_idevar_02))
name_common_col
[1] "id"                 "humility"
[3] "egalitarian_belief" "stereotyping"
data_merged <- rbind(
  data_difsub_idevar_01[name_common_col],
  data_difsub_idevar_02[name_common_col]
)
head(data_merged)
  id humility egalitarian_belief stereotyping
1  1        3                  5            2
2  2        3                  3            1
3  3        2                  1            5
4  4        2                  2            5
5  5        3                  1            1
6  6        5                  2            1
```

2. Same participants, different variables

When the same group of participants was measured on several different variables, or on the same variables at different time points, multiple data files may also result; these are merged using the participants' identifying information as the key.

```
head(data_idesub_diffvar_01)
  id humility_t1 egalitarian_belief_t1 stereotyping_t1
1  1           3                     2               5
2  2           5                     1               3
3  3           1                     3               1
4  4           3                     3               3
5  5           1                     5               3
6  6           4                     3               3
head(data_idesub_diffvar_02)
  id humility_t2 egalitarian_belief_t2 stereotyping_t2
1  1           3                     5               5
2  2           2                     2               1
3  3           4                     1               1
4  4           3                     3               1
5  5           2                     1               1
6  6           1                     2               5
data_merged <- merge(data_idesub_diffvar_01, data_idesub_diffvar_02)
head(data_merged)
  id humility_t1 egalitarian_belief_t1 stereotyping_t1
1  1           3                     2               5
2  2           5                     1               3
3  3           1                     3               1
4  4           3                     3               3
5  5           1                     5               3
6  6           4                     3               3
  humility_t2 egalitarian_belief_t2 stereotyping_t2
1           3                     5               5
2           2                     2               1
3           4                     1               1
4           3                     3               1
5           2                     1               1
6           1                     2               5
```

Extra arguments of `merge()` cover further needs.

```
# the extra rows in x (data_idesub_diffvar_01) that are not in
# y (data_idesub_diffvar_02) will be preserved
data_merged <- merge(data_idesub_diffvar_01, data_idesub_diffvar_02,
                     all.x = TRUE)
tail(data_merged)
     id humility_t1 egalitarian_belief_t1 stereotyping_t1
105 105           2                     4               4
106 106           1                     2               2
107 107           5                     5               2
108 108           4                     5               4
109 109           5                     1               2
110 110           2                     1               2
    humility_t2 egalitarian_belief_t2 stereotyping_t2
105          NA                    NA              NA
106          NA                    NA              NA
107          NA                    NA              NA
108          NA                    NA              NA
109          NA                    NA              NA
110          NA                    NA              NA
```

## 15.3 Convert into tidy data

Once the data set has been created, it may well not yet be tidy data. Two situations are most common:

1. Wider format: one variable's data span several columns;
2. Longer format: one observation spans several rows.

Both situations require reshaping; the first requires converting the data from wider format to longer format. In psychological data analysis the first situation is by far the most common, so this section covers only that case.

### 15.3.1 Wider to Longer

Repeated-measures data are a typical wider format. For example, in a \(2 \times 3\) repeated-measures design,

```
  id A1B1 A2B1 A1B2 A2B2 A1B3 A2B3
1  1    3    5    4    2    1    1
2  4    1    2    4    4    5    2
3  3    2    4    5    4    3    4
4  2    3    3    1    5    3    4
5  3    1    2    4    5    3    5
6  5    4    3    1    5    5    6
```

the experimental treatment has six conditions, but these are really the combinations of the two variables A and B spread across six columns. A wider-to-longer conversion is therefore needed: collapse the six data columns into two and recover the original grouping variables A and B. This calls for `pivot_longer()` from the tidyr package.

```
# the most frequently used arguments of pivot_longer
pivot_longer(
  data,
  cols,
  names_to = "name",
  names_sep = NULL,
  names_pattern = NULL,
  values_to = "value"
)
```

• cols: the original columns to be converted into the longer format;
• names_to: a character vector; the names of the original columns will be stored in the cells of one or more new columns, whose name(s) this argument sets.
  • If names_to has a length greater than 1, several new columns are created. In that case names_sep or names_pattern must be supplied to specify how the original column names are distributed over the cells of the new columns.
  • NA: the corresponding new column is discarded;
  • .value: the names of the new columns come from a component of the original column names, and their cells store the corresponding original cells' data; this setting overrides values_to.
• names_sep: specifies how the original column names are split; either a numeric vector (split by position) or a single string (similar to the sep argument of read.table(): an ordinary character such as "_", or a regular expression such as "\\.").
• names_pattern: a regular expression; the original column names are split according to its match groups and used to create the new columns.
• values_to: a string; the data in the original columns' cells are gathered into a single new column (the value column), whose name this argument sets. If names_to contains .value (names_sep or names_pattern must then be given), this setting is ignored and the names of the new value columns come from the specified part of the original column names.
1. Basic usage: the four essential arguments, pivot_longer(data, cols, names_to, values_to)

```
data_virtual_imaginary <- read.table(
  "F:/Nutstore backup/R/codes/RBA/data/untidy data virtual imaginary.txt"
)
head(data_virtual_imaginary)
                      A B  C D  E F G
virtual_imaginary    13 9 12 7 10 8 9
no_virtual_imaginary  4 2 10 8  6 6 4
data_virtual_imaginary |>
  t() |>
  as.data.frame(row.names = 1:ncol(data_virtual_imaginary)) |>
  cbind(participant = LETTERS[1:ncol(data_virtual_imaginary)]) |>
  pivot_longer(cols = c("virtual_imaginary", "no_virtual_imaginary"),
               names_to = "group",
               values_to = "count") |>
  dplyr::select(participant, group, count)
# A tibble: 14 × 3
   participant group                count
   <chr>       <chr>                <int>
 1 A           virtual_imaginary       13
 2 A           no_virtual_imaginary     4
 3 B           virtual_imaginary        9
 4 B           no_virtual_imaginary     2
 5 C           virtual_imaginary       12
 6 C           no_virtual_imaginary    10
 7 D           virtual_imaginary        7
 8 D           no_virtual_imaginary     8
 9 E           virtual_imaginary       10
10 E           no_virtual_imaginary     6
11 F           virtual_imaginary        8
12 F           no_virtual_imaginary     6
13 G           virtual_imaginary        9
14 G           no_virtual_imaginary     4
```
2. Extra argument: names_sep

```
# numeric
head(data_repeated_2f)
  id A1B1 A2B1 A1B2 A2B2 A1B3 A2B3
1  1    3    5    4    2    1    1
2  4    1    2    4    4    5    2
3  3    2    4    5    4    3    4
4  2    3    3    1    5    3    4
5  3    1    2    4    5    3    5
6  5    4    3    1    5    5    6
data_repeated_2f |>
  pivot_longer(cols = 2:7,
               names_to = c("A", "B"),
               names_sep = 2,
               values_to = "score")
# A tibble: 600 × 4
      id A     B     score
   <int> <chr> <chr> <int>
 1     1 A1    B1        3
 2     1 A2    B1        5
 3     1 A1    B2        4
 4     1 A2    B2        2
 5     1 A1    B3        1
 6     1 A2    B3        1
 7     4 A1    B1        1
 8     4 A2    B1        2
 9     4 A1    B2        4
10     4 A2    B2        4
# ℹ 590 more rows
```

```
# single string
data_repeated_2f_underline <- data_repeated_2f
names(data_repeated_2f_underline)[2:7] <- c("A1_B1", "A2_B1", "A1_B2",
                                            "A2_B2", "A1_B3", "A2_B3")
head(data_repeated_2f_underline)
  id A1_B1 A2_B1 A1_B2 A2_B2 A1_B3 A2_B3
1  1     3     5     4     2     1     1
2  4     1     2     4     4     5     2
3  3     2     4     5     4     3     4
4  2     3     3     1     5     3     4
5  3     1     2     4     5     3     5
6  5     4     3     1     5     5     6
data_repeated_2f_underline |>
  pivot_longer(cols = 2:7,
               names_to = c("A", "B"),
               names_sep = "_",
               values_to = "score")
# A tibble: 600 × 4
      id A     B     score
   <int> <chr> <chr> <int>
 1     1 A1    B1        3
 2     1 A2    B1        5
 3     1 A1    B2        4
 4     1 A2    B2        2
 5     1 A1    B3        1
 6     1 A2    B3        1
 7     4 A1    B1        1
 8     4 A2    B1        2
 9     4 A1    B2        4
10     4 A2    B2        4
# ℹ 590 more rows
```

3. Extra argument: names_pattern

```
head(data_repeated_2f)
  id A1B1 A2B1 A1B2 A2B2 A1B3 A2B3
1  1    3    5    4    2    1    1
2  4    1    2    4    4    5    2
3  3    2    4    5    4    3    4
4  2    3    3    1    5    3    4
5  3    1    2    4    5    3    5
6  5    4    3    1    5    5    6
data_repeated_2f |>
  pivot_longer(cols = 2:7,
               names_to = c("A", "B"),
               names_pattern = "([A-Z][0-9])([A-Z][0-9])",
               values_to = "score")
# A tibble: 600 × 4
      id A     B     score
   <int> <chr> <chr> <int>
 1     1 A1    B1        3
 2     1 A2    B1        5
 3     1 A1    B2        4
 4     1 A2    B2        2
 5     1 A1    B3        1
 6     1 A2    B3        1
 7     4 A1    B1        1
 8     4 A2    B1        2
 9     4 A1    B2        4
10     4 A2    B2        4
# ℹ 590 more rows
```

4. Special uses of names_to

• NA: will discard the corresponding component of the column name.
```
head(data_repeated_2f)
  id A1B1 A2B1 A1B2 A2B2 A1B3 A2B3
1  1    3    5    4    2    1    1
2  4    1    2    4    4    5    2
3  3    2    4    5    4    3    4
4  2    3    3    1    5    3    4
5  3    1    2    4    5    3    5
6  5    4    3    1    5    5    6
data_repeated_2f |>
  pivot_longer(cols = 2:7,
               names_to = c(NA, "B"),
               names_pattern = "([A-Z][0-9])([A-Z][0-9])",
               values_to = "score")
# A tibble: 600 × 3
      id B     score
   <int> <chr> <int>
 1     1 B1        3
 2     1 B1        5
 3     1 B2        4
 4     1 B2        2
 5     1 B3        1
 6     1 B3        1
 7     4 B1        1
 8     4 B1        2
 9     4 B2        4
10     4 B2        4
# ℹ 590 more rows
```

• .value: indicates that the corresponding component of the column name defines the name of the output column containing the cell values, overriding values_to entirely.

```
data_participant_id_in_col <- sapply(1:9, \(x) sample(5, 3, replace = T)) |>
  as.data.frame() |>
  setNames(c("var1_t1", "var2_t1", "var3_t1",
             "var1_t2", "var2_t2", "var3_t2",
             "var1_t3", "var2_t3", "var3_t3"))
data_participant_id_in_col
  var1_t1 var2_t1 var3_t1 var1_t2 var2_t2 var3_t2 var1_t3
1       2       1       2       2       3       5       2
2       3       4       2       1       3       5       2
3       5       3       1       3       1       5       2
  var2_t3 var3_t3
1       5       3
2       5       4
3       1       3
# the component of the column name (i.e. "var1") defines the name of the output column
data_participant_id_in_col |>
  pivot_longer(cols = 1:ncol(data_participant_id_in_col),
               names_to = c(".value", "time_point"),
               names_sep = "_")
# A tibble: 9 × 4
  time_point  var1  var2  var3
  <chr>      <int> <int> <int>
1 t1             2     1     2
2 t2             2     3     5
3 t3             2     5     3
4 t1             3     4     2
5 t2             1     3     5
6 t3             2     5     4
7 t1             5     3     1
8 t2             3     1     5
9 t3             2     1     3
```

For more advanced uses of `pivot_longer()`, see `vignette("pivot")`.

## 15.4 Reverse coding

Reverse coding is a technique frequently used in survey studies.

### 15.4.1 What is reverse coding

One common validation technique for survey items is to rephrase a “positive” item in a “negative” way. When done properly, this can be used to check whether respondents are giving consistent answers. For example, consider the following two items concerning “extraversion”:

1. I see myself as someone who is talkative.

• Disagree strongly
• Disagree a little
• Neither agree nor disagree
• Agree a little
• Agree strongly
2. I see myself as someone who tends to be quiet.

• Disagree strongly
• Disagree a little
• Neither agree nor disagree
• Agree a little
• Agree strongly

For question 1, “Agree strongly” corresponds to the “most extraverted” option, and “Disagree strongly” corresponds to the “least extraverted” option. However, for question 2, “Disagree strongly” corresponds to the “most extraverted” option, and “Agree strongly” corresponds to the “least extraverted” option. We say that question 2 is reverse-coded.

### 15.4.2 How to do reverse coding

Recode manually.

```
df <- data.frame(x = sample(5, 10, replace = TRUE))
df
    x
1   3
2   5
3   4
4   1
5   5
6   1
7   4
8   5
9   4
10  1
df[df == 1] <- 50
df[df == 2] <- 40
df[df == 3] <- 30
df[df == 4] <- 20
df[df == 5] <- 10
df <- df/10
df
    x
1   3
2   1
3   2
4   5
5   1
6   5
7   2
8   1
9   2
10  5
```

Since recoding Likert-scale survey data repeats the same operation, it can be converted into a for loop:

```
df <- data.frame(x = sample(5, 10, replace = TRUE))
df
    x
1   1
2   5
3   5
4   3
5   4
6   1
7   1
8   5
9   1
10  2
scale_point <- 5
coding_rule <- seq(scale_point * 10, 10, by = -10)
for (i in 1:scale_point) {
  df[df == i] <- coding_rule[i]
}
df <- df/10
df
    x
1   5
2   1
3   1
4   3
5   2
6   5
7   5
8   1
9   5
10  4
```

What is its apply()/sapply()/lapply() substitute?

```
df <- data.frame(x = sample(5, 10, replace = TRUE))
scale_point <- 5
coding_rule <- seq(scale_point * 10, 10, by = -10)
# sapply(1:scale_point, \(x) {
#   df[df == x] <<- coding_rule[x]
#   return(NULL)
# })
# df <- df / 10
# a more straightforward solution by Shusheng Liu
df <- sapply(df, \(i) coding_rule[i]) / 10
```

The code above illustrates the basic idea of recoding. However, a more flexible function is available, and reinventing the wheel is not recommended: car::recode.

```
recode(var, recodes)
```

• var: numeric vector, character vector, or factor.
• recodes: character string of recode specifications (see below). Recode specifications appear in a character string, separated by default by semicolons (see the examples below), each of the form input=output.
Several recode specifications are supported:

• single value: for example, 0=NA.
• vector of values: for example, c(7, 8, 9) = 'high'.
• range of values: for example, 7:9 = 'C'. Note: : is not the R sequence operator. In addition, you may not use : with the c function within a recode specification, so for example c(1, 3, 5:7) will cause an error. The : is the default value of the recode interval operator; a non-default value may be specified.
• else: everything that does not fit a previous specification. For example, else = NA. Note that else matches all otherwise unspecified values on input, including NA, and if present should appear last among the recode specifications.

Character data and factor levels on the left-hand side of a recode specification must be quoted. Thus, e.g., c(a, b, c) = 'low' is not allowed, and should be c('a', 'b', 'c') = 'low'. Similarly, the colon is reserved for numeric data, and, e.g., c('a':'c') = 'low' is not allowed. If the var argument is a character variable with (some) values that are character representations of numbers, or a factor with (some) levels that are numbers (e.g., '12' or '-2'), then these too must be quoted and cannot be used with colons (e.g., '15':'19' = '15 to 19' is not allowed, and could be specified as c('15', '16', '17', '18', '19') = '15 to 19', assuming that all values are the character representation of whole numbers).

```
x <- rep(1:3, 3)
x
[1] 1 2 3 1 2 3 1 2 3
car::recode(x, "c(1, 2) = 'A'; else = 'B'")
[1] "A" "A" "B" "A" "A" "B" "A" "A" "B"
```

Reverse-code a 5-point Likert variable.
```
df <- data.frame(
  x = sample(5, 10, replace = TRUE),
  y = sample(5, 10, replace = TRUE)
)
df$x
 [1] 1 5 1 4 5 5 3 5 1 3
# reverse code x
df$x <- car::recode(df$x, "1=5;2=4;3=3;4=2;5=1")
df$x
 [1] 5 1 5 2 1 1 3 1 5 3
# generate recode specifications with paste
# df$x <- car::recode(df$x, paste(1:5, 5:1, sep = "=", collapse = ";"))
# df <- 6 - df
# by Yuchen Liu, effective but less flexible than recode
```

Reverse-code multiple variables simultaneously with a self-defined function.

```
# author: Wenzheng Lin
# collaborator: Yujun Li
recode_vars <- function(data, max = 5, min = 1, vars = names(data)) {
  rule <- paste(min:max, max:min, sep = "=", collapse = ";")
  if (!is.data.frame(data) & is.vector(data)) {
    # recode a single variable
    output <- car::recode(data, rule)
    return(output)
  }
  # recode some of the variables
  output <- as.data.frame(apply(data[vars], 2, \(x) car::recode(x, rule)))
  if (ncol(data) != length(vars)) {
    vars_rest <- !names(data) %in% vars
    output <- data.frame(data[, vars_rest], output)
    names(output)[1:(ncol(output) - length(vars))] <- names(data)[vars_rest]
    output <- output[names(data)]
  }
  return(output)
}
df <- data.frame(
  x = sample(7, 10, replace = TRUE),
  y = sample(7, 10, replace = TRUE),
  z = sample(7, 10, replace = TRUE)
)
head(df)
  x y z
1 6 7 2
2 3 4 4
3 3 6 2
4 3 7 2
5 5 4 4
6 4 6 4
df_recode_xy <- recode_vars(df, max = 7, vars = c("x", "y"))
head(df_recode_xy)
  x y z
1 2 1 2
2 5 4 4
3 5 2 2
4 5 1 2
5 3 4 4
6 4 2 4
df_recode_all <- recode_vars(df, max = 7)
head(df_recode_all)
  x y z
1 2 1 6
2 5 4 4
3 5 2 6
4 5 1 6
5 3 4 4
6 4 2 4
```

## 15.5 Aberrant response

Aberrant responses come in many kinds; common ones include:

1. Validity measure/attention check,
2. Quantitative unrealistic response,
3. Speededness,
4. Repeated observations,
5. Recursive responses,
6. Outlier,
7. Missing data.
The core of handling aberrant responses is to produce, through some specific operation, a logical object in which `TRUE` marks an aberrant response and `FALSE` the opposite. Using that logical object as a subsetting index then extracts all of the detected aberrant responses; whether to delete them outright or treat them in some other way is straightforward from there.

### 15.5.1 Validity measure/attention check, unrealistic response, speededness

These three types can be handled the same way: use relational + logical operators to generate a logical object as an index, then subset with that index.

#### 15.5.1.1 Validity measure/attention check

Too Good to Be True: Bots and Bad Data From Mechanical Turk

The attention checks consisted of the following:

1. embedded on the Grit scale—“Select ‘Somewhat like me’ for this statement,”
2. embedded on the Beck Depression Inventory—“1 – Select this option,” and
3. embedded on the Borderline Personality Inventory—“Select ‘Yes’ for this statement.”

A more detailed introduction on attention checks: Prolific’s Attention and Comprehension Check Policy

```
df <- data.frame(
  x = sample(1:5, 10, replace = T, prob = c(0.6, 0.1, 0.1, 0.1, 0.1)),
  y = sample(5, 10, replace = T),
  z = sample(5, 10, replace = T)
)
head(df)
  x y z
1 1 2 4
2 5 5 4
3 1 1 5
4 3 3 2
5 5 4 2
6 1 5 4
# assume x is the attention check item and participants were requested to select 1
```
```
id_fail_atten_check <- (1:nrow(df))[df$x != 1]
id_fail_atten_check
[1] 2 4 5
df_filtered <- df[-id_fail_atten_check, ]
```

#### 15.5.1.2 Quantitative unrealistic response

What counts as unrealistic has to be judged from the specific content of the questionnaire. For example, a participant may report an average daily phone-use time of 25 hours, or a questionnaire may contain several items asking about phone-use time whose answers add up to more than 24 hours; either is clearly unrealistic. Cases like these, where a numeric criterion (e.g., > 24) can decide whether a response is unrealistic, are called quantitative unrealistic responses. Text-based, qualitative unrealistic responses have to be checked manually.

#### 15.5.1.3 Speededness

When a participant's response time is abnormally short, the record can be treated as speeded and there is reason to regard it as an aberrant response. The judgment is usually based on the total response time. For example, if a questionnaire is expected to take 50 min, the lower bound might be set at 20 min, and any total time below it flagged as speeded. If per-item response times were recorded, see the treatment of response times in the recursive-responses section.

### 15.5.2 Repeated observations

Sometimes during manual data entry the same questionnaire gets entered twice or even more times, producing repeated observations; detect them with `duplicated()`.

```
df <- data.frame(
  x = sample(1:5, 10, replace = T, prob = c(0.6, 0.1, 0.1, 0.1, 0.1)),
  y = sample(5, 10, replace = T),
  z = sample(5, 10, replace = T)
)
head(df)
  x y z
1 5 4 1
2 3 5 4
3 1 4 4
4 5 2 4
5 5 5 4
6 1 5 5
df <- df[sample(nrow(df), 15, replace = T), ]
df
    x y z
2   3 5 4
9   5 4 5
4   5 2 4
10  1 1 2
1   5 4 1
2.1 3 5 4
9.1 5 4 5
3   1 4 4
1.1 5 4 1
6   1 5 5
9.2 5 4 5
2.2 3 5 4
7   3 1 3
1.2 5 4 1
9.3 5 4 5
id_dup_row <- duplicated(df)
id_dup_row
 [1] FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE  TRUE
[10] FALSE  TRUE  TRUE FALSE  TRUE  TRUE
df[!id_dup_row, ]
   x y z
2  3 5 4
9  5 4 5
4  5 2 4
10 1 1 2
1  5 4 1
3  1 4 4
6  1 5 5
7  3 1 3
```

### 15.5.3 Recursive responses

When a questionnaire is long and a participant loses interest and patience, the participant may fall into low motivation, which can eventually drive them to answer carelessly and produce patterned, repetitive answers such as 345345345. Handling such cases is controversial, however. Without strong corroborating evidence (such as accompanying abnormally short response times), it is generally advised not to treat such answer strings as aberrant responses, because it cannot be ruled out that a normally responding participant happened to choose a recursive pattern like 345345345345. The recommended practice is therefore to first locate the participants with recursive responses and then double-check using their per-item response times.
1. Locate recursive responses

A possible approach:

• use paste() to concatenate each participant's answers into a single string, then use grep() to search for a given recursive pattern.

```
head(df_all)
  var1 var2 var3 var4 var5 var6 var7 var8 var9 var10 var11
1    1    5    3    3    3    4    5    5    1     3     5
2    3    3    4    4    1    2    1    1    3     5     2
3    3    4    3    1    4    5    1    3    1     2     3
4    5    4    1    3    5    5    1    5    3     2     3
5    2    3    4    4    2    2    2    1    5     4     5
6    2    1    1    2    3    2    3    4    5     2     2
  var12 var13 var14 var15 var16 var17 var18 var19 var20
1     2     2     4     1     3     4     2     4     2
2     1     5     1     4     2     5     2     5     3
3     1     2     3     1     2     3     1     2     3
4     2     5     5     1     2     2     2     1     5
5     3     1     2     1     3     2     2     2     2
6     1     2     2     3     1     1     5     2     1
    rt_var1   rt_var2   rt_var3   rt_var4   rt_var5
1  5222.713  8772.040  1696.600  1701.902 12042.126
2 13151.900 18339.086 11419.242 17095.191 14977.905
3  7778.153  6339.928  7087.107 16611.145  8593.895
4  9565.369  5096.424 10139.362 18054.408  7974.312
5  8613.145 11495.983 16376.831  8111.641  9193.016
6  5213.714  7653.672  5157.204  6819.726  7384.415
    rt_var6   rt_var7   rt_var8    rt_var9   rt_var10
1  9829.149  8097.522 12789.335 10677.9951  4937.9442
2  1115.291  1467.144 11723.196  9850.0672 13820.8817
3 18875.407  2421.836 10641.922   157.4386   393.3905
4  9235.695 16773.862  5889.464 14123.9528 10475.4325
5 16830.824 14951.592  7819.244 19998.8858 15402.9620
6 10789.580  1914.283  6471.676 19028.5202  1228.3531
    rt_var11   rt_var12   rt_var13   rt_var14   rt_var15
1 12982.1337 10921.8568  7940.0550 13356.9291  7416.2221
2 15475.1756 12132.0654  9886.2839  2029.1908 13760.8445
3   317.1343   316.9625   381.2607   360.5937   404.4828
4  8097.1048  7833.4123 10143.5020  6903.1930 17040.7080
5  7450.0287 18656.8629 18700.0247  9392.1637  2702.3892
6 14426.5087 11353.4421  1487.5482  9286.3959  6687.3900
   rt_var16   rt_var17   rt_var18   rt_var19   rt_var20
1 13945.427  1370.3845  8664.8276 16522.7978 14275.0080
2 10509.150 12981.4090 12077.0342 15434.2804 18099.1148
3   212.315   493.6364   485.4686   111.2361   398.8267
4  1882.642 10607.8031 17938.9437 19786.3327  5845.9509
5  5973.787  7927.9637 19011.2234 19908.2588  6864.7616
6  4773.808  6886.6003  8995.4045  7107.3114 15189.9099
df_char <- apply(df_all[, 1:20], 1, \(x) paste(x, collapse = ""))
df_char
 [1] "15333455135224134242" "33441211352151425253"
 [3] "34314513123123123123" "54135515323255122215"
 [5] "23442221545312132222" "21123234522122311521"
 [7] "54223253243222354245" "31232545251315331345"
 [9] "53345325211315212554" "35142143132234442214"
grep("123123123123", df_char)
[1] 3
```

2. Check response times

Possible approaches: a t test (more informative) or a Kolmogorov–Smirnov test.

```
df_char <- apply(df_all[, 1:20], 1, \(x) paste(x, collapse = ""))
id_recur_res <- grep("123123123123", df_char)
results_match <- regexec("123123123123", df_char)
start_recur_res <- results_match[[3]][1] + 20
end_recur_res <- start_recur_res + attributes(results_match[[3]])$match.length - 1
rt_normal_res <- df_all[id_recur_res, 21:(start_recur_res - 1)]
rt_recur_res <- df_all[id_recur_res, start_recur_res:end_recur_res]
t.test(rt_normal_res, rt_recur_res)

	Welch Two Sample t-test

data:  rt_normal_res and rt_recur_res
t = 4.8974, df = 7.0046, p-value = 0.001755
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
  4891.772 14023.452
sample estimates:
mean of x mean of y
9793.6742  336.0622

ks.test(as.numeric(rt_recur_res), as.numeric(rt_normal_res))

	Exact two-sample Kolmogorov-Smirnov test

data:  as.numeric(rt_recur_res) and as.numeric(rt_normal_res)
D = 1, p-value = 1.588e-05
alternative hypothesis: two-sided
```

### 15.5.4 Outlier

Outliers in statistics are data values that differ considerably from the bulk of a given data set. Outliers can strongly bias statistical inference.
```r
cor_matrix <- matrix(c(1, 0, 0, 1), nrow = 2, ncol = 2)
data <- MASS::mvrnorm(1000, mu = rep(0, 2), Sigma = cor_matrix)
plot(data)
cor.test(data[, 1], data[, 2])
```

```
	Pearson's product-moment correlation

data:  data[, 1] and data[, 2]
t = -0.35719, df = 998, p-value = 0.721
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.07324757  0.05072280
sample estimates:
        cor 
-0.01130583 
```

```r
# add outliers
n_outlier <- 20
id_outlier <- order(rowSums(data), decreasing = TRUE)[1:n_outlier]
data[id_outlier, ] <- data[id_outlier, ] +
  matrix(runif(n_outlier * 2, 2, 3), nrow = n_outlier, ncol = 2)
plot(data)
cor.test(data[, 1], data[, 2])
```

```
	Pearson's product-moment correlation

data:  data[, 1] and data[, 2]
t = 6.8605, df = 998, p-value = 1.203e-11
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.1522271 0.2706498
sample estimates:
      cor 
0.2122174 
```

#### 15.5.4.1 Mahalanobis Distance

$$D^2_i = (\boldsymbol{x}_i - \boldsymbol{\hat{\mu}})' \boldsymbol{S}^{-1} (\boldsymbol{x}_i - \boldsymbol{\hat{\mu}}).$$

$D^2$ theoretically follows a $\chi^2$ distribution with degrees of freedom ($df$) equal to the number of variables.
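As a quick sanity check on the formula, $D^2_i$ can be computed by hand and compared against base R's `mahalanobis()`. This is an illustrative sketch on fresh simulated data (the names `X`, `mu_hat`, `S_inv` are made up here), not part of the running example above.

```r
# Verify the Mahalanobis formula term by term on simulated data
set.seed(123)
X <- matrix(rnorm(200), ncol = 2)            # 100 bivariate observations
mu_hat <- colMeans(X)                        # estimated mean vector
S_inv  <- solve(cov(X))                      # inverse of the sample covariance S

# D^2_i = (x_i - mu_hat)' S^{-1} (x_i - mu_hat), computed row by row
d2_manual  <- apply(X, 1, function(x) t(x - mu_hat) %*% S_inv %*% (x - mu_hat))
d2_builtin <- mahalanobis(X, center = mu_hat, cov = cov(X))
all.equal(as.numeric(d2_manual), d2_builtin)   # TRUE
```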
```r
dis_mh <- mahalanobis(data, center = colMeans(data), cov = cov(data))
id_outlier_mh <- which(dis_mh > qchisq(p = 1 - 0.05, df = ncol(data)))
id_outlier_mh
```

```
 [1]  22  68  76  84 120 151 172 185 205 207 243 247 266 267
[15] 270 271 297 310 327 347 360 370 397 403 420 425 432 482
[29] 489 493 504 579 627 672 702 705 745 813 829 832 868 905
[43] 943
```

```r
id_outlier %in% id_outlier_mh
```

```
 [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[12] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
```

```r
cor.test(data[-id_outlier_mh, 1], data[-id_outlier_mh, 2])
```

```
	Pearson's product-moment correlation

data:  data[-id_outlier_mh, 1] and data[-id_outlier_mh, 2]
t = -0.40694, df = 955, p-value = 0.6841
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.07647432  0.05024608
sample estimates:
        cor 
-0.01316699 
```

#### 15.5.4.2 Minimum Covariance Determinant

Clearly, the Mahalanobis distance has an inflated type-I error rate: in the previous example it flags far more than the 20 planted outliers (43 observations). This is because the Mahalanobis distance relies on the sample mean vector and covariance matrix, which are themselves already biased by the outliers. It makes more sense to plug more accurately estimated means and covariances into the Mahalanobis distance. This is the idea behind the Minimum Covariance Determinant (MCD), which estimates the mean and covariance matrix from the most central subset of the data.

```r
mean_cov_75 <- MASS::cov.mcd(data, quantile.used = nrow(data) * .75)
mh_75 <- mahalanobis(data, mean_cov_75$center, mean_cov_75$cov)
id_outlier_mh_75 <- which(mh_75 > qchisq(p = 1 - 0.05, df = ncol(data)))
id_outlier_mh_75
```

```
 [1]  22  25  56  65  68  76  84  94 120 136 146 151 172 178
[15] 185 204 205 207 213 215 243 247 250 266 267 270 271 272
[29] 297 310 327 347 360 370 397 403 420 425 432 446 448 478
[43] 482 489 493 504 533 579 627 635 672 702 705 745 774 792
[57] 798 813 829 832 868 877 900 905 943 955 963 973 990
```

```r
id_outlier %in% id_outlier_mh_75
```

```
 [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[12] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
```

```r
cor.test(data[-id_outlier_mh_75, 1], data[-id_outlier_mh_75, 2])
```

```
	Pearson's product-moment correlation

data:  data[-id_outlier_mh_75, 1] and data[-id_outlier_mh_75, 2]
t = -0.93584, df = 929, p-value = 0.3496
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.09475286  0.03362729
sample estimates:
        cor 
-0.03068936 
```

Although the type-I error rate of the MCD-based Mahalanobis distance is still inflated, it arguably performs better than the traditional version. Methods for dealing with outliers abound, and none of them is perfect. Readers interested in further details on outlier detection are referred to Leys et al. (2018).
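Part of that inflation is mechanical rather than a flaw of either method: with a $1 - 0.05$ $\chi^2$ cutoff, about 5% of perfectly clean observations exceed the threshold by construction. A small simulation makes this visible; the data here are freshly simulated clean bivariate normals, and the stricter 0.999 cutoff is shown only for illustration, not as a recommendation from the text.

```r
# With clean multivariate-normal data (no true outliers), a 95% chi-square
# cutoff still flags roughly 5% of observations; a stricter cutoff flags fewer.
set.seed(42)
clean <- matrix(rnorm(2000), ncol = 2)       # 1000 clean observations
d2 <- mahalanobis(clean, colMeans(clean), cov(clean))
mean(d2 > qchisq(0.95,  df = 2))             # roughly 0.05
mean(d2 > qchisq(0.999, df = 2))             # close to 0
```

In practice this is why the choice of cutoff quantile matters as much as the choice between the classical and MCD-based distance.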

Leys, C., Klein, O., Dominicy, Y., & Ley, C. (2018). Detecting multivariate outliers: Use a robust variant of Mahalanobis distance. Journal of Experimental Social Psychology, 74, 150-156.

## 15.6 Optimising dplyr