16 Advanced: Data cleaning
Working with data is a central part of scientific research in psychology, and one of the most widely used data types is survey data. Processing survey data generally involves three steps:
- Collection
- Cleaning
- Analysis
Collection can be done online or offline, and each form has its pros and cons. Online collection is convenient: it can reach a larger pool of participants, and the data are recorded automatically as soon as a participant finishes the survey. Its drawback is that, with no experimenter present, participants may be less motivated while answering, which can lower data quality. Offline collection alleviates this problem to some extent, but it adds a data-entry step and therefore more work.
Once the survey data have been collected, the first thing to do is data cleaning. The logic is simple: just as you wash fruit before eating it to remove the dirt and impurities on its surface, you first remove the "impurities" from a new batch of data so that only clean data remain.
There is no precise, unified definition of data cleaning. What kind of data should be cleaned out, and what workflow cleaning should follow, are answered differently in different applications. In addition, data cleaning is on the whole rather tedious and fiddly, often even more so than the statistical analysis itself.
Data cleaning matters because a data set containing too many uncleaned impurities carries a low proportion of useful information and a high proportion of noise, which ultimately distorts the conclusions drawn from the data. Yet most applied statistics and data analysis textbooks skip data cleaning entirely and present data that have already been cleaned, so readers are left to handle cleaning by experience when they face real problems. This chapter therefore focuses on survey data, the most common data type in psychology, and walks through the typical data-cleaning needs for such data and how to address them.
16.1 Prerequisite
16.1.1 Translation(*)
If you use a questionnaire developed by local researchers, or one you developed yourself, and you plan to publish in an international journal, the original items need to be translated, using accurate and conventional academic wording. Recommended English wording for Likert-type response scales is illustrated below.
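For instance (an illustrative set based on the agreement anchors that appear in the extraversion items later in this chapter, not a prescriptive standard), the labels of a 5-point agreement scale can be written as:
# suggested English anchors for a 5-point Likert agreement scale
likert_5_anchors <- c(
  "Disagree strongly",
  "Disagree a little",
  "Neither agree nor disagree",
  "Agree a little",
  "Agree strongly"
)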
16.1.2 Tidy data
Data cleaning removes the invalid parts of the data and reorganizes what remains into a clear form, so that statistical analyses can be applied to the cleaned data directly. It therefore helps to know what form most statistical functions in R expect, that is, what counts as clean and tidy data. The Tidyverse notion of tidy data essentially follows the principle of "one row per observation, one column per variable" and consists of three rules:
- every variable has its own column (the most important rule);
- every observation has its own row;
- every value has its own cell.
- Tidy data example
X. Name Type.1 Type.2 Total HP Attack Defense
1 1 Bulbasaur Grass Poison 318 45 49 49
2 2 Ivysaur Grass Poison 405 60 62 63
3 3 Venusaur Grass Poison 525 80 82 83
4 4 Charmander Fire 309 39 52 43
5 5 Charmeleon Fire 405 58 64 58
6 6 Charizard Fire Flying 534 78 84 78
Sp..Atk Sp..Def Speed Stage Legendary
1 65 65 45 1 FALSE
2 80 80 60 2 FALSE
3 100 100 80 3 FALSE
4 60 50 65 1 FALSE
5 80 65 80 2 FALSE
6 109 85 100 3 FALSE
- Non-tidy data example
For example, Psychological Statistics (2nd ed.), Example 6-11: researchers studied the effect of visual imagery on visual perception. Participants searched briefly presented pictures for a small blue arrow; in half of the trials they were also asked to construct a mental image (e.g., imagining a volcanic eruption) while searching. The results are shown below. Is the number of errors significantly higher in the imagery condition than in the no-imagery condition?
A B C D E F G
virtual_imaginary 13 9 12 7 10 8 9
no_virtual_imaginary 4 2 10 8 6 6 4
If the data shown above are to be organized as tidy data, they should look like this:
# A tibble: 14 × 3
participant group count
<chr> <chr> <int>
1 A virtual_imaginary 13
2 A no_virtual_imaginary 4
3 B virtual_imaginary 9
4 B no_virtual_imaginary 2
5 C virtual_imaginary 12
6 C no_virtual_imaginary 10
7 D virtual_imaginary 7
8 D no_virtual_imaginary 8
9 E virtual_imaginary 10
10 E no_virtual_imaginary 6
11 F virtual_imaginary 8
12 F no_virtual_imaginary 6
13 G virtual_imaginary 9
14 G no_virtual_imaginary 4
Converting data into tidy form is therefore usually part of data cleaning. Note that data not organized as tidy data are not "wrong"; tidy form is simply easier to understand and to analyze.
16.1.3 Pipe operator
16.1.3.2 What is %>%
The most basic usage of %>% is x %>% fun(...), which is essentially equivalent to fun(x, ...): x is passed to fun, by default as its first argument, and fun is then evaluated.
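A minimal sketch (assuming magrittr, or a package that re-exports it, is loaded):
library(magrittr)
c(1, 5, 3) %>% sort()                   # same as sort(c(1, 5, 3))
c(1, 5, 3) %>% sort(decreasing = TRUE)  # extra arguments are appended after x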
16.1.3.3 Placeholder
When the x on the left of %>% should be passed to the fun on the right, but not as the first argument, a placeholder is needed; the placeholder that works with %>% is the dot (.). How should the placeholder be understood? It simply saves a seat for the x that will be passed over later: once x arrives, it takes that argument position, and fun is then evaluated.
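A minimal sketch of the placeholder:
10 %>% seq(from = 1, to = .)      # same as seq(from = 1, to = 10)
"-" %>% paste("a", "b", sep = .)  # same as paste("a", "b", sep = "-")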
16.1.3.4 Why %>%
Given that 1:3 %>% mean is equivalent to mean(1:3), why develop %>% at all? To make code more concise and easier to understand.
car_data <- transform(aggregate(
. ~ cyl,
data = subset(datasets::mtcars, hp > 100), # mtcars[mtcars$hp > 100, ]
FUN = function(x) round(mean(x), 2)
), kpl = mpg*0.4251)
car_data
cyl mpg disp hp drat wt qsec vs am gear
1 4 25.90 108.05 111.00 3.94 2.15 17.75 1.00 1.00 4.50
2 6 19.74 183.31 122.29 3.59 3.12 17.98 0.57 0.43 3.86
3 8 15.10 353.10 209.21 3.23 4.00 16.77 0.00 0.14 3.29
carb kpl
1 2.00 11.010090
2 3.43 8.391474
3 3.50 6.419010
To read the preceding code, you first have to work out from the parentheses which part runs first: the inner aggregate() is evaluated before the outer transform(). On top of that, the arguments of aggregate() are themselves results of function calls, with functions nested inside functions, which makes the code even harder to follow.
car_data <-
datasets::mtcars %>%
subset(hp > 100) %>%
aggregate(. ~ cyl, data = ., FUN = . %>% mean %>% round(2)) %>% # the 1st dot is aggregate's formula shorthand, the 2nd is the pipe placeholder, and the 3rd starts a magrittr functional sequence
transform(kpl = mpg * 0.4251)
car_data
cyl mpg disp hp drat wt qsec vs am gear
1 4 25.90 108.05 111.00 3.94 2.15 17.75 1.00 1.00 4.50
2 6 19.74 183.31 122.29 3.59 3.12 17.98 0.57 0.43 3.86
3 8 15.10 353.10 209.21 3.23 4.00 16.77 0.00 0.14 3.29
carb kpl
1 2.00 11.010090
2 3.43 8.391474
3 3.50 6.419010
Rewritten with %>%, the code becomes clearer both semantically and structurally. Read from top to bottom, it says:
- take the rows of mtcars where hp > 100;
- pass the result to aggregate(), compute group means by cyl, and round the results to 2 decimal places;
- transform the result by multiplying the mpg column by 0.4251 and storing the product in a new column kpl.
Such code imposes far less cognitive load than the original. The example also shows that %>% is best suited to a sequence of operations on the same object.
16.1.3.5 Native pipe operator
As the magrittr package became more influential and more and more people used %>%, the R core team added a native pipe operator, |>, in R 4.1.0, so an extra package no longer has to be loaded just for piping.
Are there differences between |> and %>% worth knowing about? Yes, seven in total, of which the first four come up most often.
- With |>, the function called on the right-hand side must be written with explicit ();
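For example (a small sketch; the exact error message may differ across R versions):
1:3 %>% mean     # magrittr accepts a bare function name
1:3 |> mean()    # the native pipe requires explicit parentheses
# 1:3 |> mean    # this does not even parse: the RHS of |> must be a function call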
- The placeholder used with |> is _;
df <- data.frame(x = c(1, 2, 3), y = c(4, 5, 6))
df %>%
ncol %>%
seq(from = 1, to = .)
[1] 1 2
df |>
ncol() |>
seq(from = 1, to = _)
[1] 1 2
- With |>, _ must be used as a named argument;
letters[1:3] |> paste(_, collapse = " and ")
Error in paste("_", collapse = " and "): pipe placeholder can only be used as a named argument (<input>:1:17)
As a result, if you want to use _ for a flexible, unnamed argument such as ..., a small workaround is needed:
# make up an argument with unused name in paste
letters[1:3] |> paste(x = _, collapse = " and ")
[1] "a and b and c"
# use anonymous function
letters[1:3] |> (\(x) paste(x, collapse = " and "))()
[1] "a and b and c"
- Within one |> call, _ may appear only once on the right-hand side;
letters[1:3] |> paste(x = _, y = _)
Error in paste(x = "_", y = "_"): pipe placeholder may only appear once (<input>:1:17)
- With |>, _ cannot be used inside nested function calls (see the sketch below);
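A small sketch of this restriction, building on the paste() example above (the commented line is assumed to error because _ sits inside a nested call; giving the nested step its own pipe stage works around it):
# letters[1:3] |> paste(x = toupper(_), collapse = " and ")   # not allowed: _ is nested in toupper()
letters[1:3] |> toupper() |> paste(x = _, collapse = " and ")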
- When using $ with |>, you must make it more explicit that $ is a function;
mtcars %>% `$`(cyl)
[1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4
[29] 8 6 8 4
mtcars |> (`$`)(cyl)
[1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4
[29] 8 6 8 4
mtcars |> base::`$`(cyl)
[1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4
[29] 8 6 8 4
fun <- `$`
mtcars |> fun(cyl)
[1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4
[29] 8 6 8 4
- |> does not create an additional environment when it is evaluated.
x <- 1
"x" %>% assign(4)
x
[1] 1
# just like a function call, %>% generates an additional
# independent env when evaluating; everything therefore
# happens in this environment, and the change is not passed
# to the parent environments
"x" |> assign(4)
x
[1] 4
# |> leads to direct assignment of x in the global
# environment because no additional env is created
"x" |> (\(y) assign(y, 3))()
x
[1] 4
# an anonymous function acts in a way similar to %>%,
# since a function call always generates an additional env
What are the differences between R’s new native pipe |> and the magrittr pipe %>%?
16.2 Create data set
Creating the data set means assembling the collected data into one complete data set. Depending on the situation, this may involve manual entry, merging, and so on.
16.2.1 Import
When paper questionnaires are collected offline, the first task is to type the data into the computer. Because the data volume is large and entry sessions are long, entry errors are easy to make, so a common practice is to have two people independently enter the same batch of questionnaires and then cross-check. The task in this scenario is therefore to compare the two data sets: are they exactly the same, and if not, where do they differ?
Before the comparison starts, the two data sets must agree in their overall structure: the number and order of participants, the number and order of items, how missing data are coded, whether reverse-scored items have been handled, and so on. All of this should be agreed on before data entry so that both entries follow the same conventions.
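A quick structural check along these lines can be run before the cell-by-cell comparison (a sketch, assuming the two entry files have been read in as data_dirty_01 and data_dirty_02):
dim(data_dirty_01) == dim(data_dirty_02)               # same numbers of rows and columns?
identical(names(data_dirty_01), names(data_dirty_02))  # same variables in the same order?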
head(data_dirty_01)
gender ethnicity first_name last_name age P1 P2 P3 P4 P5
1 1 1 Samara Lizer 20 3 3 5 4 3
2 1 2 Nur Sem 29 1 2 2 2 3
3 1 2 Juhi Ware 28 1 1 3 2 3
4 0 2 Joshua Vang 21 3 2 2 2 3
5 0 2 Ronak Hartley 20 3 4 3 3 5
6 0 2 Jacob Solongo 26 3 2 3 1 2
P6 P7 P8 P9 P10_att P11 P12 P13 P14 P15 P16 W17 W18 W19
1 4 4 5 4 3 5 3 5 3 3 4 4 4 3
2 3 1 2 3 3 1 2 3 2 3 3 2 2 2
3 1 3 2 1 3 1 1 1 1 1 3 3 2 2
4 2 3 3 2 3 2 2 3 2 4 3 2 3 4
5 3 3 4 4 3 4 4 4 2 4 4 5 3 4
6 3 2 2 3 3 3 4 1 3 2 2 2 3 3
W20 W21 W22 W23 W24 W25 W26 W27 W28 W29 W30 W31 W32_att
1 5 3 3 3 4 5 4 2 5 4 3 4 5
2 2 3 3 2 3 3 3 2 2 2 1 1 5
3 3 2 3 9999 4 2 3 3 1 4 2 3 5
4 2 4 3 3 4 3 4 3 3 3 4 3 5
5 3 4 4 3 4 5 4 4 3 3 4 4 5
6 3 2 3 2 3 2 2 2 4 3 3 2 5
g33 g34 g35 g36 g37 g38 g39_att g40 g41 g42 g43 g44 g45
1 4 4 2 3 4 3 3 4 3 4 4 4 3
2 5 5 4 3 4 3 3 3 5 3 3 5 5
3 2 2 3 3 3 4 3 3 3 3 2 3 2
4 2 2 2 1 3 3 3 1 3 2 2 4 3
5 3 3 4 3 4 3 3 4 3 4 3 3 3
6 3 2 3 3 2 3 3 3 3 2 2 2 4
g46 g47 g48 S49 S50 S51 S52 S53 S54 S55 S56 S57 S58 S59
1 4 3 3 3 3 3 3 2 3 4 3 4 3 4
2 5 5 4 3 3 4 3 3 3 2 3 4 3 2
3 2 4 3 1 3 1 2 2 1 1 2 1 2 2
4 1 3 1 1 1 1 2 1 2 1 1 1 1 2
5 2 2 3 5 4 4 4 4 5 4 4 3 4 5
6 2 3 3 2 2 2 3 2 3 4 3 3 3 4
S60 S61_recode S62 S63 S64 J65 J66 J67 J68 J69 J70 J71
1 2 3 3 5 4 4 4 4 4 3 1 4
2 2 3 3 3 2 3 3 3 3 3 1 3
3 1 5 2 1 2 3 3 3 4 2 3 3
4 2 5 2 1 3 2 2 1 2 2 3 4
5 3 1 3 4 5 3 3 3 3 4 3 2
6 2 3 2 1 3 4 3 2 3 2 2 2
J72 J73 J74 J75 J76 J77 J78 J79 J80 time
1 4 4 3 4 4 3 3 5 3 1632
2 3 2 2 2 2 2 2 2 2 1572
3 3 3 4 2 1 5 3 3 3 1616
4 2 3 4 2 3 4 2 3 4 1609
5 4 2 3 2 3 3 3 3 3 1639
6 2 3 3 3 3 2 2 3 3 215
head(data_dirty_02)
gender ethnicity first_name last_name age P1 P2 P3 P4 P5
1 1 1 Samara Lizer 20 3 3 5 4 3
2 1 2 Nur Sem 29 1 2 2 2 3
3 1 2 Juhi Ware 28 1 1 3 2 3
4 0 2 Joshua Vang 21 3 2 2 2 3
5 0 2 Ronak Hartley 20 3 4 3 3 5
6 0 2 Jacob Solongo 26 3 2 3 1 2
P6 P7 P8 P9 P10_att P11 P12 P13 P14 P15 P16 W17 W18 W19
1 4 4 5 4 3 5 3 5 3 3 4 4 4 3
2 3 1 2 3 3 1 2 3 2 3 3 2 2 2
3 1 3 2 1 3 1 1 1 1 1 3 3 2 2
4 2 3 3 2 3 2 2 3 2 4 3 2 3 4
5 3 3 4 4 3 4 4 4 2 4 4 5 3 4
6 3 2 2 3 3 3 4 1 3 2 2 2 3 3
W20 W21 W22 W23 W24 W25 W26 W27 W28 W29 W30 W31 W32_att
1 5 3 3 3 4 5 4 2 5 4 3 4 5
2 2 3 3 2 3 3 3 2 2 2 1 1 5
3 3 2 3 999 4 2 3 3 1 4 2 3 5
4 2 4 3 3 4 3 4 3 3 3 4 3 5
5 3 4 4 3 4 5 4 4 3 3 4 4 5
6 3 2 3 2 3 2 2 2 4 3 3 2 5
g33 g34 g35 g36 g37 g38 g39_att g40 g41 g42 g43 g44 g45
1 4 4 2 3 4 3 3 4 3 4 4 4 3
2 5 5 4 3 4 3 3 3 5 3 3 5 5
3 2 2 3 3 3 4 3 3 3 3 2 3 2
4 2 2 2 1 3 3 3 1 3 2 2 4 3
5 3 3 4 3 4 3 3 4 3 4 3 3 3
6 3 2 3 3 2 3 3 3 3 2 2 2 4
g46 g47 g48 S49 S50 S51 S52 S53 S54 S55 S56 S57 S58 S59
1 4 3 3 3 3 3 3 2 3 4 3 4 3 4
2 5 5 4 3 3 4 3 3 3 2 3 4 3 2
3 2 4 3 1 3 1 2 2 1 1 2 1 2 2
4 1 3 1 1 1 1 2 1 2 1 1 1 1 2
5 2 2 3 5 4 4 4 4 5 4 4 3 4 5
6 2 3 3 2 2 2 3 2 3 4 3 3 3 4
S60 S61_recode S62 S63 S64 J65 J66 J67 J68 J69 J70 J71
1 2 3 3 5 4 4 4 4 4 3 1 4
2 2 3 3 3 2 3 3 3 3 3 1 3
3 1 5 2 1 2 3 3 3 4 2 3 3
4 2 5 2 1 3 2 2 1 2 2 3 4
5 3 1 3 4 5 3 3 3 3 4 3 2
6 2 3 2 1 3 4 3 2 3 2 2 2
J72 J73 J74 J75 J76 J77 J78 J79 J80 time
1 4 4 3 4 4 3 3 5 3 1632
2 3 2 2 2 2 2 2 2 2 1572
3 3 3 4 2 1 5 3 3 3 1616
4 2 3 4 2 3 4 2 3 4 1609
5 4 2 3 2 3 3 3 3 3 1639
6 2 3 3 3 3 2 2 3 3 215
- Are the two files exactly the same?
- If not, where do they differ?
If the two files record exactly the same content but differ in element type, identical() still returns FALSE, so the first thing to check is whether the element types differ.
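An overall check can be run first (a sketch; with the data used here it should return FALSE, since the two entries differ in at least one cell):
identical(data_dirty_01, data_dirty_02)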
# check whether any differences are merely due to element types
elem_type_data01 <- sapply(data_dirty_01, typeof)
elem_type_data01
gender ethnicity first_name last_name age
"integer" "integer" "character" "character" "integer"
P1 P2 P3 P4 P5
"double" "double" "double" "double" "double"
P6 P7 P8 P9 P10_att
"double" "double" "double" "double" "double"
P11 P12 P13 P14 P15
"double" "double" "double" "double" "double"
P16 W17 W18 W19 W20
"double" "double" "double" "double" "double"
W21 W22 W23 W24 W25
"double" "double" "double" "double" "double"
W26 W27 W28 W29 W30
"double" "double" "double" "double" "double"
W31 W32_att g33 g34 g35
"double" "double" "double" "double" "double"
g36 g37 g38 g39_att g40
"double" "double" "double" "double" "double"
g41 g42 g43 g44 g45
"double" "double" "double" "double" "double"
g46 g47 g48 S49 S50
"double" "double" "double" "double" "double"
S51 S52 S53 S54 S55
"double" "double" "double" "double" "double"
S56 S57 S58 S59 S60
"double" "double" "double" "double" "double"
S61_recode S62 S63 S64 J65
"double" "double" "double" "double" "double"
J66 J67 J68 J69 J70
"double" "double" "double" "double" "double"
J71 J72 J73 J74 J75
"double" "double" "double" "double" "double"
J76 J77 J78 J79 J80
"double" "double" "double" "double" "double"
time
"double"
elem_type_data02 <- sapply(data_dirty_02, typeof)
identical(elem_type_data01, elem_type_data02)
[1] TRUE
Having confirmed that the element types are the same, we can conclude that the two files genuinely differ in content. The next step is to locate the differences.
locations <- which(data_dirty_01 != data_dirty_02, arr.ind = TRUE)
# arr.ind = TRUE returns a two-column matrix (row, col) with one row per
# mismatching cell, so no manual index arithmetic is needed
locations <- locations[order(locations[, "row"]), , drop = FALSE]
head(locations)
# one of the mismatches is in row 3, column W23, which is verified below
Check whether the flagged cells really differ.
data_dirty_01[3, ]
gender ethnicity first_name last_name age P1 P2 P3 P4 P5
3 1 2 Juhi Ware 28 1 1 3 2 3
P6 P7 P8 P9 P10_att P11 P12 P13 P14 P15 P16 W17 W18 W19
3 1 3 2 1 3 1 1 1 1 1 3 3 2 2
W20 W21 W22 W23 W24 W25 W26 W27 W28 W29 W30 W31 W32_att
3 3 2 3 9999 4 2 3 3 1 4 2 3 5
g33 g34 g35 g36 g37 g38 g39_att g40 g41 g42 g43 g44 g45
3 2 2 3 3 3 4 3 3 3 3 2 3 2
g46 g47 g48 S49 S50 S51 S52 S53 S54 S55 S56 S57 S58 S59
3 2 4 3 1 3 1 2 2 1 1 2 1 2 2
S60 S61_recode S62 S63 S64 J65 J66 J67 J68 J69 J70 J71
3 1 5 2 1 2 3 3 3 4 2 3 3
J72 J73 J74 J75 J76 J77 J78 J79 J80 time
3 3 3 4 2 1 5 3 3 3 1616
data_dirty_02[3, ]
gender ethnicity first_name last_name age P1 P2 P3 P4 P5
3 1 2 Juhi Ware 28 1 1 3 2 3
P6 P7 P8 P9 P10_att P11 P12 P13 P14 P15 P16 W17 W18 W19
3 1 3 2 1 3 1 1 1 1 1 3 3 2 2
W20 W21 W22 W23 W24 W25 W26 W27 W28 W29 W30 W31 W32_att
3 3 2 3 999 4 2 3 3 1 4 2 3 5
g33 g34 g35 g36 g37 g38 g39_att g40 g41 g42 g43 g44 g45
3 2 2 3 3 3 4 3 3 3 3 2 3 2
g46 g47 g48 S49 S50 S51 S52 S53 S54 S55 S56 S57 S58 S59
3 2 4 3 1 3 1 2 2 1 1 2 1 2 2
S60 S61_recode S62 S63 S64 J65 J66 J67 J68 J69 J70 J71
3 1 5 2 1 2 3 3 3 4 2 3 3
J72 J73 J74 J75 J76 J77 J78 J79 J80 time
3 3 3 4 2 1 5 3 3 3 1616
16.2.2 Merge
When data are collected on different devices, or at different time points, you often end up with several files that record the same variables but different observations, and these files have to be merged into one complete data set.
- Different participants, same variables
If the same variables were each measured once, but on different participants, merging only requires matching the variable information: extract the variables shared by all files and then stack the files row-wise.
head(data_difsub_idevar_01)
id humility egalitarian_belief stereotyping var_a var_b
1 1 3 5 2 3 5
2 2 3 3 1 2 3
3 3 2 1 5 5 4
4 4 2 2 5 4 2
5 5 3 1 1 2 3
6 6 5 2 1 1 2
var_c
1 2
2 1
3 5
4 3
5 4
6 5
head(data_difsub_idevar_02)
id humility egalitarian_belief stereotyping var_d var_e
1 101 1 1 4 1 4
2 102 3 1 4 5 4
3 103 4 1 2 3 2
4 104 2 2 3 4 4
5 105 5 2 1 3 5
6 106 2 4 4 2 4
var_f
1 5
2 5
3 4
4 3
5 3
6 1
name_common_col <- intersect(names(data_difsub_idevar_01), names(data_difsub_idevar_02))
name_common_col
[1] "id" "humility"
[3] "egalitarian_belief" "stereotyping"
data_merged <-
rbind(
data_difsub_idevar_01[name_common_col],
data_difsub_idevar_02[name_common_col]
)
head(data_merged)
id humility egalitarian_belief stereotyping
1 1 3 5 2
2 2 3 3 1
3 3 2 1 5
4 4 2 2 5
5 5 3 1 1
6 6 5 2 1
- Same participants, different variables
When the same participants were measured on different variables, or on the same variables at different time points, several data files may also result; these are merged by matching the participants' identifying information.
head(data_idesub_diffvar_01)
id humility_t1 egalitarian_belief_t1 stereotyping_t1
1 1 3 2 5
2 2 5 1 3
3 3 1 3 1
4 4 3 3 3
5 5 1 5 3
6 6 4 3 3
head(data_idesub_diffvar_02)
id humility_t2 egalitarian_belief_t2 stereotyping_t2
1 1 3 5 5
2 2 2 2 1
3 3 4 1 1
4 4 3 3 1
5 5 2 1 1
6 6 1 2 5
data_merged <- merge(data_idesub_diffvar_01, data_idesub_diffvar_02)
head(data_merged)
id humility_t1 egalitarian_belief_t1 stereotyping_t1
1 1 3 2 5
2 2 5 1 3
3 3 1 3 1
4 4 3 3 3
5 5 1 5 3
6 6 4 3 3
humility_t2 egalitarian_belief_t2 stereotyping_t2
1 3 5 5
2 2 2 1
3 4 1 1
4 3 3 1
5 2 1 1
6 1 2 5
Additional arguments of merge() cover further needs.
# the extra rows in x (data_idesub_diffvar_01) that are not in
# y (data_idesub_diffvar_02) will be retained
data_merged <- merge(data_idesub_diffvar_01, data_idesub_diffvar_02, all.x = TRUE)
tail(data_merged)
id humility_t1 egalitarian_belief_t1 stereotyping_t1
105 105 2 4 4
106 106 1 2 2
107 107 5 5 2
108 108 4 5 4
109 109 5 1 2
110 110 2 1 2
humility_t2 egalitarian_belief_t2 stereotyping_t2
105 NA NA NA
106 NA NA NA
107 NA NA NA
108 NA NA NA
109 NA NA NA
110 NA NA NA
16.3 Convert into tidy data
Once the data set has been created, it is quite likely not yet in tidy form. Two situations are most common:
- Wider format: one variable is spread across several columns;
- Longer format: one observation is spread across several rows.
Both situations require reshaping the data; the former calls for a wider-to-longer transformation. Since the wider format is by far the most common in psychological data analysis, this section covers only that case.
16.3.1 Wider to Longer
Repeated-measures data are a typical wider format. Take a \(2\times3\) repeated-measures design, for example:
id A1B1 A2B1 A1B2 A2B2 A1B3 A2B3
1 1 3 5 4 2 1 1
2 4 1 2 4 4 5 2
3 3 3 2 4 5 4 3
4 2 3 3 1 5 3 4
5 3 1 2 4 5 3 5
6 5 4 3 1 5 5 6
The design crosses factors A and B into 6 treatment combinations, but those combinations are spread across 6 columns instead of being stored in the two grouping variables. A wider-to-longer transformation is therefore needed: collapse the 6 score columns into a single value column and recover the grouping variables A and B as columns of their own, which is what pivot_longer() from the tidyr package does.
# the most frequently used arguments of pivot_longer
pivot_longer(
data,
cols,
names_to = "name",
names_sep = NULL,
names_pattern = NULL,
values_to = "value"
)
- cols: the original columns to pivot into the longer format;
- names_to: a character vector; the names of the original columns are stored in the cells of one or more new columns, whose name(s) are set by this argument.
  - If names_to has length greater than 1, several new columns are created; in that case names_sep or names_pattern must be supplied to specify how the original column names are split across the new columns. Two components have special meanings:
    - NA: the corresponding component is discarded;
    - .value: the corresponding component of the original column names supplies the names of the new value columns, whose cells hold the values of the corresponding original columns; this overrides values_to.
- names_sep: how to split the original column names, either a numeric vector (split by position) or a single string (similar to the sep argument of read.table(); it can be a plain character such as "_" or a regular expression such as "\\.").
- names_pattern: a regular expression; the original column names are split into the new columns according to its matching groups.
- values_to: a string; the values stored in the cells of the original columns go into a new value column whose name is set by this argument. If names_to contains .value (which requires names_sep or names_pattern), this argument is ignored and the value columns are named after the specified part of the original column names.
- Basic usage: the four required arguments, pivot_longer(data, cols, names_to, values_to)
data_virtual_imaginary <- read.table(
"F:/Nutstore backup/R/codes/RBA/data/untidy data virtual imaginary.txt"
)
head(data_virtual_imaginary)
A B C D E F G
virtual_imaginary 13 9 12 7 10 8 9
no_virtual_imaginary 4 2 10 8 6 6 4
data_virtual_imaginary |>
t() |>
as.data.frame(row.names = 1:ncol(data_virtual_imaginary)) |>
cbind(participant = LETTERS[1:ncol(data_virtual_imaginary)]) |>
pivot_longer(
cols = c("virtual_imaginary", "no_virtual_imaginary"),
names_to = "group",
values_to = "count"
) |>
dplyr::select(participant, group, count)
# A tibble: 14 × 3
participant group count
<chr> <chr> <int>
1 A virtual_imaginary 13
2 A no_virtual_imaginary 4
3 B virtual_imaginary 9
4 B no_virtual_imaginary 2
5 C virtual_imaginary 12
6 C no_virtual_imaginary 10
7 D virtual_imaginary 7
8 D no_virtual_imaginary 8
9 E virtual_imaginary 10
10 E no_virtual_imaginary 6
11 F virtual_imaginary 8
12 F no_virtual_imaginary 6
13 G virtual_imaginary 9
14 G no_virtual_imaginary 4
- Extra argument: names_sep
# numeric
head(data_repeated_2f)
id A1B1 A2B1 A1B2 A2B2 A1B3 A2B3
1 1 3 5 4 2 1 1
2 4 1 2 4 4 5 2
3 3 3 2 4 5 4 3
4 2 3 3 1 5 3 4
5 3 1 2 4 5 3 5
6 5 4 3 1 5 5 6
data_repeated_2f |>
pivot_longer(
cols = 2:7,
names_to = c("A", "B"),
names_sep = 2,
values_to = "score"
)
# A tibble: 600 × 4
id A B score
<int> <chr> <chr> <int>
1 1 A1 B1 3
2 1 A2 B1 5
3 1 A1 B2 4
4 1 A2 B2 2
5 1 A1 B3 1
6 1 A2 B3 1
7 4 A1 B1 1
8 4 A2 B1 2
9 4 A1 B2 4
10 4 A2 B2 4
# ℹ 590 more rows
# single string
data_repeated_2f_underline <- data_repeated_2f
names(data_repeated_2f_underline)[2:7] <- c("A1_B1", "A2_B1", "A1_B2", "A2_B2", "A1_B3", "A2_B3")
head(data_repeated_2f_underline)
id A1_B1 A2_B1 A1_B2 A2_B2 A1_B3 A2_B3
1 1 3 5 4 2 1 1
2 4 1 2 4 4 5 2
3 3 3 2 4 5 4 3
4 2 3 3 1 5 3 4
5 3 1 2 4 5 3 5
6 5 4 3 1 5 5 6
data_repeated_2f_underline |>
pivot_longer(
cols = 2:7,
names_to = c("A", "B"),
names_sep = "_",
values_to = "score"
)
# A tibble: 600 × 4
id A B score
<int> <chr> <chr> <int>
1 1 A1 B1 3
2 1 A2 B1 5
3 1 A1 B2 4
4 1 A2 B2 2
5 1 A1 B3 1
6 1 A2 B3 1
7 4 A1 B1 1
8 4 A2 B1 2
9 4 A1 B2 4
10 4 A2 B2 4
# ℹ 590 more rows
- Extra argument: names_pattern
head(data_repeated_2f)
id A1B1 A2B1 A1B2 A2B2 A1B3 A2B3
1 1 3 5 4 2 1 1
2 4 1 2 4 4 5 2
3 3 3 2 4 5 4 3
4 2 3 3 1 5 3 4
5 3 1 2 4 5 3 5
6 5 4 3 1 5 5 6
data_repeated_2f |>
pivot_longer(
cols = 2:7,
names_to = c("A", "B"),
names_pattern = "([A-Z][0-9])([A-Z][0-9])",
values_to = "score"
)
# A tibble: 600 × 4
id A B score
<int> <chr> <chr> <int>
1 1 A1 B1 3
2 1 A2 B1 5
3 1 A1 B2 4
4 1 A2 B2 2
5 1 A1 B3 1
6 1 A2 B3 1
7 4 A1 B1 1
8 4 A2 B1 2
9 4 A1 B2 4
10 4 A2 B2 4
# ℹ 590 more rows
- Special uses of names_to
- NA: will discard the corresponding component of the column name.
head(data_repeated_2f)
id A1B1 A2B1 A1B2 A2B2 A1B3 A2B3
1 1 3 5 4 2 1 1
2 4 1 2 4 4 5 2
3 3 3 2 4 5 4 3
4 2 3 3 1 5 3 4
5 3 1 2 4 5 3 5
6 5 4 3 1 5 5 6
data_repeated_2f |>
pivot_longer(
cols = 2:7,
names_to = c(NA, "B"),
names_pattern = "([A-Z][0-9])([A-Z][0-9])",
values_to = "score"
)
# A tibble: 600 × 3
id B score
<int> <chr> <int>
1 1 B1 3
2 1 B1 5
3 1 B2 4
4 1 B2 2
5 1 B3 1
6 1 B3 1
7 4 B1 1
8 4 B1 2
9 4 B2 4
10 4 B2 4
# ℹ 590 more rows
- .value: indicates that the corresponding component of the column name defines the name of the output column containing the cell values, overriding values_to entirely.
data_participant_id_in_col <-
  sapply(1:9, \(x) sample(5, size = 3, replace = TRUE)) |>
  as.data.frame() |>
  setNames(c(
    "var1_t1", "var2_t1", "var3_t1",
    "var1_t2", "var2_t2", "var3_t2",
    "var1_t3", "var2_t3", "var3_t3"
  ))
data_participant_id_in_col
var1_t1 var2_t1 var3_t1 var1_t2 var2_t2 var3_t2 var1_t3
1 2 1 2 2 3 5 2
2 3 4 2 1 3 5 2
3 5 3 1 3 1 5 2
var2_t3 var3_t3
1 5 3
2 5 4
3 1 3
# the ".value" component of the column names (e.g. "var1") defines the names of the output value columns
data_participant_id_in_col |>
  pivot_longer(
    cols = 1:ncol(data_participant_id_in_col),
    names_to = c(".value", "time_point"),
    names_sep = "_"
  )
# A tibble: 9 × 4
time_point var1 var2 var3
<chr> <int> <int> <int>
1 t1 2 1 2
2 t2 2 3 5
3 t3 2 5 3
4 t1 3 4 2
5 t2 1 3 5
6 t3 2 5 4
7 t1 5 3 1
8 t2 3 1 5
9 t3 2 1 3
More advanced uses of pivot_longer() are covered in vignette("pivot").
16.4 Reverse coding
Reverse coding is a technique frequently used in survey research.
16.4.1 What is reverse coding
One common validation technique for survey items is to rephrase a “positive” item in a “negative” way. When done properly, this can be used to check if respondents are giving consistent answers.
For example, consider the following two items concerning “extraversion”:
- I see myself as someone who is talkative.
- Disagree strongly
- Disagree a little
- Neither agree nor disagree
- Agree a little
- Agree strongly
- I see myself as someone who tends to be quiet.
- Disagree strongly
- Disagree a little
- Neither agree nor disagree
- Agree a little
- Agree strongly
For question 1, “agree strongly” corresponds to the “most extraverted” option, and “disagree strongly” corresponds to the “least extraverted” option. However, for question 2, “disagree strongly” corresponds to the “most extraverted” option, and “agree strongly” corresponds to the “least extraverted” option. We say that question 2 is reverse-coded.
16.4.2 How to do reverse coding
Manually recode.
df <- data.frame(x = sample(5, size = 10, replace = TRUE))
df
x
1 3
2 5
3 4
4 1
5 5
6 1
7 4
8 5
9 4
10 1
df[df == 1] <- 50
df[df == 2] <- 40
df[df == 3] <- 30
df[df == 4] <- 20
df[df == 5] <- 10
df <- df/10
df
x
1 3
2 1
3 2
4 5
5 1
6 5
7 2
8 1
9 2
10 5
Since recoding Likert-scale survey data means repeating the same operation for every scale point, the recoding can be written as a for loop,
df <- data.frame(x = sample(5, size = 10, replace = TRUE))
df
x
1 1
2 5
3 5
4 3
5 4
6 1
7 1
8 5
9 1
10 2
scale_point <- 5
# multiply by 10 first so that values which have already been recoded
# are not matched and recoded again in later iterations
coding_rule <- seq(scale_point * 10, 10, by = -10)
for (i in 1:scale_point) {
  df[df == i] <- coding_rule[i]
}
df <- df/10
df
x
1 5
2 1
3 1
4 3
5 2
6 5
7 5
8 1
9 5
10 4
What is its apply()/sapply()/lapply() substitute?
df <- data.frame(x = sample(5, size = 10, replace = TRUE))
scale_point <- 5
coding_rule <- seq(scale_point * 10, 10, by = -10)
# sapply(1:scale_point, \(x) {
#   df[df == x] <<- coding_rule[x]
#   return(NULL)
# })
# df <- df / 10
# a more straightforward solution by Shusheng Liu
df <- sapply(df, \(i) coding_rule[i]) / 10
The code above illustrates the basic idea of recoding. However, a more flexible function, car::recode(), is already available, so there is no need to reinvent the wheel:
recode(var, recodes)
-
var
: numeric vector, character vector, or factor. -
recodes
: character string of recode specifications: see below.
Recode specifications appear in a character string, separated by default by semicolons (see the examples below), each of the form input=output.
Several recode specifications are supported:
- single value: For example, 0=NA.
- vector of values: For example, c(7, 8, 9) = "high".
- range of values: For example, 7:9 = "C". Note that : is not the R sequence operator here. In addition, you may not use : with the c function within a recode specification, so for example c(1, 3, 5:7) will cause an error. The colon is the default interval operator in recode specifications; a non-default value may be specified.
- else: everything that does not fit a previous specification. For example, else = NA. Note that else matches all otherwise unspecified values on input, including NA, and if present should appear last among the recode specifications.
Character data and factor levels on the left-hand side of a recode specification must be quoted. Thus, e.g., c(a, b, c) = "low"
is not allowed, and should be c("a", "b", "c") = "low"
. Similarly, the colon is reserved for numeric data, and, e.g., c("a":"c") = "low"
is not allowed. If the var
argument is a character variable with (some) values that are character representations of numbers, or a factor with (some) levels that are numbers (e.g., "12"
or "-2"
), then these too must be quoted and cannot be used with colons (e.g., "15":"19" = "15 to 19"
is not allowed, and could be specified as c("15", "16", "17", "18", "19") = "15 to 19"
, assuming that all values are the character representation of whole numbers).
x <- rep(1:3, times = 3)
x
[1] 1 2 3 1 2 3 1 2 3
car::recode(x, 'c(1, 2) = "A"; else = "B"')
[1] "A" "A" "B" "A" "A" "B" "A" "A" "B"
# don't omit the double quotes when recoding numeric values into character ones or vice versa,
# otherwise an error occurs
car::recode(x, "c(1, 2) = A; else = B")
Error in car::recode(x, "c(1, 2) = A; else = B"):
in recode term: else = B
message: Error in eval(parse(text = strsplit(term, to.value)[[1]][2])) :
object 'B' not found
Reverse-code a Likert 5-point variable.
df <-
data.frame(
x = sample(5, size = 10, replace = TRUE),
y = sample(5, size = 10, replace = TRUE)
)
df$x
[1] 1 5 1 4 5 5 3 5 1 3
# reverse code x
df$x <- car::recode(df$x, recodes = "1=5;2=4;3=3;4=2;5=1")
df$x
[1] 5 1 5 2 1 1 3 1 5 3
# generate recode specifications with paste
# df$x <- car::recode(df$x, paste(1:5, 5:1, sep = "=", collapse = ";"))
# df$x <- 6 - df$x # By Yuchen Liu: effective but less flexible than recode
Reverse-code multiple variables simultaneously with a self-defined function.
# author: Wenzheng Lin
# collaborator: Yujun Li
recode_vars <- function(
data,
max = 5,
min = 1,
vars = names(data)
){
rule <- paste(min:max, max:min, sep = "=", collapse = ";")
if (!is.data.frame(data) & is.vector(data)) {
# recode single variable
output <- car::recode(data, rule)
return(output)
}
output <- as.data.frame(apply(data[vars], 2, \(x) car::recode(x, rule)))
# recode some variables
if (ncol(data) != length(vars)) {
vars_rest <- !names(data) %in% vars
output <- data.frame(data[, vars_rest], output)
names(output)[1:(ncol(output) - length(vars))] <- names(data)[vars_rest]
output <- output[names(data)]
}
return(output)
}
df <-
data.frame(
x = sample(7, size = 10, replace = TRUE),
y = sample(7, size = 10, replace = TRUE),
z = sample(7, size = 10, replace = TRUE)
)
head(df)
x y z
1 6 7 2
2 3 4 4
3 3 6 2
4 3 7 2
5 5 4 4
6 4 6 4
df_recode_xy <- recode_vars(df, max = 7, vars = c("x", "y"))
head(df_recode_xy)
x y z
1 2 1 2
2 5 4 4
3 5 2 2
4 5 1 2
5 3 4 4
6 4 2 4
df_recode_all <- recode_vars(df, max = 7)
head(df_recode_all)
x y z
1 2 1 6
2 5 4 4
3 5 2 6
4 5 1 6
5 3 4 4
6 4 2 4
16.5 Aberrant response
There are many kinds of aberrant responses; common ones include:
- Validity measure/attention check,
- Quantitative unrealistic response,
- Speedness,
- Repeated observations,
- Recursive responses,
- Outlier,
- Missing data.
The core of handling aberrant responses is to produce, through some specific operation, a logical object in which TRUE marks an aberrant response and FALSE a normal one. Using that logical object as a subsetting index then extracts all the aberrant responses that were found; whether they are deleted outright or handled in some other way afterwards is straightforward.
16.5.1 Validity measure/attention check, unrealistic response, speedness
All three types can be handled in the same way: use relational and logical operators to build a logical object, and then use that object as an index for subsetting.
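A minimal sketch of the idea with made-up logical vectors (the names are illustrative only):
failed_check <- c(FALSE, TRUE, FALSE, FALSE)   # TRUE = failed the attention check
too_fast <- c(FALSE, FALSE, TRUE, FALSE)       # TRUE = unrealistically short response time
is_aberrant <- failed_check | too_fast
# df[!is_aberrant, ] would then keep only the normal responses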
16.5.1.1 Validity measure/attention check
Too Good to Be True: Bots and Bad Data From Mechanical Turk
The attention checks consisted of the following:
- embedded on the Grit scale—“Select ‘Somewhat like me’ for this statement,”
- embedded on the Beck Depression Inventory—“1 – Select this option,” and
- embedded on the Borderline Personality Inventory—“Select ‘Yes’ for this statement.”
A more detailed introduction to attention checks: Prolific’s Attention and Comprehension Check Policy
df <- data.frame(
x = sample(
1:5,
size = 10,
replace = TRUE,
prob = c(0.6, 0.1, 0.1, 0.1, 0.1)
),
y = sample(5, size = 10, replace = TRUE),
z = sample(5, size = 10, replace = TRUE)
)
head(df)
x y z
1 1 2 4
2 5 5 4
3 1 1 5
4 3 3 2
5 5 4 2
6 1 5 4
# assume x is the attention check item and participant is requested to select 1.
id_fail_atten_check <- (1:nrow(df))[df$x != 1]
id_fail_atten_check
[1] 2 4 5
df_filtered <- df[-id_fail_atten_check, ]
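One caveat about the negative index used above: if nobody fails the check, id_fail_atten_check is empty and df[-integer(0), ] drops every row. Logical subsetting avoids this edge case (a sketch):
df_filtered <- df[df$x == 1, ]   # safe even when all participants pass the check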
16.5.1.2 Quantitative unrealistic response
What counts as an unrealistic response depends on the content of the questionnaire. For example, if participants are asked about their average daily phone use and someone reports 25 hours, or if several items in the same questionnaire ask about phone-use time and one participant's answers add up to more than 24 hours, the responses are clearly unrealistic. Cases like these, where a numeric criterion (e.g., > 24) can be used to decide whether a response is realistic, are called quantitative unrealistic responses; qualitative unrealistic responses in free-text answers have to be checked manually.
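A sketch of filtering by such a numeric criterion (survey and phone_hours are hypothetical names used only for illustration):
is_unrealistic <- survey$phone_hours > 24   # hypothetical daily phone-use item, in hours
survey_checked <- survey[!is_unrealistic, ]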
16.5.2 Repeated observations
Sometimes the same questionnaire is entered twice or more during manual data entry, which produces repeated observations; these can be detected with duplicated().
df <- data.frame(
x = sample(
1:5,
size = 10,
replace = TRUE,
prob = c(0.6, 0.1, 0.1, 0.1, 0.1)
),
y = sample(5, size = 10, replace = TRUE),
z = sample(5, size = 10, replace = TRUE)
)
head(df)
x y z
1 5 4 1
2 3 5 4
3 1 4 4
4 5 2 4
5 5 5 4
6 1 5 5
df <- df[sample(nrow(df), size = 15, replace = TRUE), ]
df
x y z
2 3 5 4
9 5 4 5
4 5 2 4
10 1 1 2
1 5 4 1
2.1 3 5 4
9.1 5 4 5
3 1 4 4
1.1 5 4 1
6 1 5 5
9.2 5 4 5
2.2 3 5 4
7 3 1 3
1.2 5 4 1
9.3 5 4 5
id_dup_row <- duplicated(df)
id_dup_row
[1] FALSE FALSE FALSE FALSE FALSE TRUE TRUE FALSE TRUE
[10] FALSE TRUE TRUE FALSE TRUE TRUE
df[!id_dup_row, ]
x y z
2 3 5 4
9 5 4 5
4 5 2 4
10 1 1 2
1 5 4 1
3 1 4 4
6 1 5 5
7 3 1 3
16.5.3 Recursive responses
When a questionnaire is long and participants lose interest and patience, they may slip into a low-motivation state and start answering arbitrarily, producing patterned, recursive responses such as 345345345. Handling such cases is controversial: without strong corroborating evidence (e.g., unusually short response times), it is generally not advisable to treat them as aberrant, because one cannot rule out that a participant answering honestly happened to produce a pattern like 345345345345. The recommended practice is therefore to locate participants with recursive responses and then double-check them against their per-item response times.
- Locate recursive responses
A possible approach:
head(df_all)
var1 var2 var3 var4 var5 var6 var7 var8 var9 var10 var11
1 1 5 3 3 3 4 5 5 1 3 5
2 3 3 4 4 1 2 1 1 3 5 2
3 3 4 3 1 4 5 1 3 1 2 3
4 5 4 1 3 5 5 1 5 3 2 3
5 2 3 4 4 2 2 2 1 5 4 5
6 2 1 1 2 3 2 3 4 5 2 2
var12 var13 var14 var15 var16 var17 var18 var19 var20
1 2 2 4 1 3 4 2 4 2
2 1 5 1 4 2 5 2 5 3
3 1 2 3 1 2 3 1 2 3
4 2 5 5 1 2 2 2 1 5
5 3 1 2 1 3 2 2 2 2
6 1 2 2 3 1 1 5 2 1
rt_var1 rt_var2 rt_var3 rt_var4 rt_var5
1 5222.713 8772.040 1696.600 1701.902 12042.126
2 13151.900 18339.086 11419.242 17095.191 14977.905
3 7778.153 6339.928 7087.107 16611.145 8593.895
4 9565.369 5096.424 10139.362 18054.408 7974.312
5 8613.145 11495.983 16376.831 8111.641 9193.016
6 5213.714 7653.672 5157.204 6819.726 7384.415
rt_var6 rt_var7 rt_var8 rt_var9 rt_var10
1 9829.149 8097.522 12789.335 10677.9951 4937.9442
2 1115.291 1467.144 11723.196 9850.0672 13820.8817
3 18875.407 2421.836 10641.922 157.4386 393.3905
4 9235.695 16773.862 5889.464 14123.9528 10475.4325
5 16830.824 14951.592 7819.244 19998.8858 15402.9620
6 10789.580 1914.283 6471.676 19028.5202 1228.3531
rt_var11 rt_var12 rt_var13 rt_var14 rt_var15
1 12982.1337 10921.8568 7940.0550 13356.9291 7416.2221
2 15475.1756 12132.0654 9886.2839 2029.1908 13760.8445
3 317.1343 316.9625 381.2607 360.5937 404.4828
4 8097.1048 7833.4123 10143.5020 6903.1930 17040.7080
5 7450.0287 18656.8629 18700.0247 9392.1637 2702.3892
6 14426.5087 11353.4421 1487.5482 9286.3959 6687.3900
rt_var16 rt_var17 rt_var18 rt_var19 rt_var20
1 13945.427 1370.3845 8664.8276 16522.7978 14275.0080
2 10509.150 12981.4090 12077.0342 15434.2804 18099.1148
3 212.315 493.6364 485.4686 111.2361 398.8267
4 1882.642 10607.8031 17938.9437 19786.3327 5845.9509
5 5973.787 7927.9637 19011.2234 19908.2588 6864.7616
6 4773.808 6886.6003 8995.4045 7107.3114 15189.9099
df_char <- apply(df_all[, 1:20], 1, \(x) paste(x, collapse = ""))
df_char
[1] "15333455135224134242" "33441211352151425253"
[3] "34314513123123123123" "54135515323255122215"
[5] "23442221545312132222" "21123234522122311521"
[7] "54223253243222354245" "31232545251315331345"
[9] "53345325211315212554" "35142143132234442214"
grep("123123123123", df_char)
[1] 3
- Check response times
Possible approaches: a t test (more informative) or a Kolmogorov–Smirnov test.
df_char <- apply(df_all[, 1:20], 1, \(x) paste(x, collapse = ""))
id_recur_res <- grep("123123123123", df_char)
results_match <- regexec("123123123123", df_char)
# columns 21:40 hold the response times of items 1:20, so the match position
# is shifted by 20 to index the corresponding response-time columns
start_recur_res <- results_match[[id_recur_res]][1] + 20
end_recur_res <- start_recur_res + attributes(results_match[[id_recur_res]])$match.length - 1
rt_normal_res <- df_all[id_recur_res, 21:(start_recur_res - 1)]
rt_recur_res <- df_all[id_recur_res, start_recur_res:end_recur_res]
t.test(rt_normal_res, rt_recur_res)
Welch Two Sample t-test
data: rt_normal_res and rt_recur_res
t = 4.8974, df = 7.0046, p-value = 0.001755
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
4891.772 14023.452
sample estimates:
mean of x mean of y
9793.6742 336.0622
ks.test(as.numeric(rt_recur_res), as.numeric(rt_normal_res))
Exact two-sample Kolmogorov-Smirnov test
data: as.numeric(rt_recur_res) and as.numeric(rt_normal_res)
D = 1, p-value = 1.588e-05
alternative hypothesis: two-sided
16.5.4 Outlier
Outliers are data values that differ considerably from the bulk of a given data set. They can strongly bias statistical inference.
cor_matrix <- matrix(c(1, 0, 0, 1), nrow = 2, ncol = 2)
data <- MASS::mvrnorm(1000, mu = rep(0, 2), Sigma = cor_matrix)
plot(data)
cor.test(data[, 1], data[, 2])
Pearson's product-moment correlation
data: data[, 1] and data[, 2]
t = -0.35719, df = 998, p-value = 0.721
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.07324757 0.05072280
sample estimates:
cor
-0.01130583
# add_outliers
n_outlier <- 20
id_outlier <- order(rowSums(data), decreasing = T)[1:n_outlier]
data[id_outlier, ] <- data[id_outlier, ] + matrix(
runif(n_outlier * 2, 2, 3),
nrow = n_outlier, ncol = 2
)
plot(data)
cor.test(data[, 1], data[, 2])
Pearson's product-moment correlation
data: data[, 1] and data[, 2]
t = 6.8605, df = 998, p-value = 1.203e-11
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.1522271 0.2706498
sample estimates:
cor
0.2122174
16.5.4.1 Mahalanobis Distance
\[D^2_i=(\boldsymbol{x}_i-\boldsymbol{\hat{\mu}})'\boldsymbol{S}^{-1}(\boldsymbol{x}_i-\boldsymbol{\hat{\mu}}).\]
\(D^2\) theoretically follows a \(\chi^2\) distribution with degrees of freedom (\(df\)) equal to the number of variables.
dis_mh <- mahalanobis(data, center = colMeans(data), cov = cov(data))
id_outlier_mh <- which(dis_mh > qchisq(p = 1 - 0.05, df = ncol(data)))
id_outlier_mh
[1] 22 68 76 84 120 151 172 185 205 207 243 247 266 267
[15] 270 271 297 310 327 347 360 370 397 403 420 425 432 482
[29] 489 493 504 579 627 672 702 705 745 813 829 832 868 905
[43] 943
id_outlier %in% id_outlier_mh
[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[12] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
cor.test(data[-id_outlier_mh, 1], data[-id_outlier_mh, 2])
Pearson's product-moment correlation
data: data[-id_outlier_mh, 1] and data[-id_outlier_mh, 2]
t = -0.40694, df = 955, p-value = 0.6841
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.07647432 0.05024608
sample estimates:
cor
-0.01316699
16.5.4.2 Minimum Covariance Determinant
It is clear that the Mahalanobis distance has an inflated type-I error rate: it detects more than 20 outliers in the previous example. This is because it relies on the sample mean vector and covariance matrix, which are themselves already biased by the outliers. It makes more sense to base the distance on more accurately estimated means and covariances. This is the idea behind the Minimum Covariance Determinant (MCD), which estimates the mean and covariance matrix from the most central subset of the data.
mean_cov_75 <- MASS::cov.mcd(data, quantile.used = nrow(data)*.75)
mh_75 <- mahalanobis(data, mean_cov_75$center, mean_cov_75$cov)
id_outlier_mh_75 <- which(mh_75 > qchisq(p = 1 - 0.05, df = ncol(data)))
id_outlier_mh_75
[1] 22 25 56 65 68 76 84 94 120 136 146 151 172 178
[15] 185 204 205 207 213 215 243 247 250 266 267 270 271 272
[29] 297 310 327 347 360 370 397 403 420 425 432 446 448 478
[43] 482 489 493 504 533 579 627 635 672 702 705 745 774 792
[57] 798 813 829 832 868 877 900 905 943 955 963 973 990
id_outlier %in% id_outlier_mh_75
[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[12] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
cor.test(data[-id_outlier_mh_75, 1], data[-id_outlier_mh_75, 2])
Pearson's product-moment correlation
data: data[-id_outlier_mh_75, 1] and data[-id_outlier_mh_75, 2]
t = -0.93584, df = 929, p-value = 0.3496
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.09475286 0.03362729
sample estimates:
cor
-0.03068936
Although the type-I error rate of the MCD-based Mahalanobis distance is still inflated, it arguably performs better than the traditional one. Methods for dealing with outliers abound, and none of them is perfect. Readers interested in more details on outlier detection are referred to Leys et al. (2018).
Leys, C., Klein, O., Dominicy, Y., & Ley, C. (2018). Detecting multivariate outliers: Use a robust variant of Mahalanobis distance. Journal of Experimental Social Psychology, 74, 150-156.