11 Data Collection
11.1 Survey sampling
The following package accompanies the textbook Sampling Algorithms by Yves Tillé.
require(sampling)
11.1.1 Sampling according to prescribed inclusion probabilities
We assume that the inclusion probabilities are all equal, unless otherwise specified. Note that these inclusion probabilities sum to the (expected) sample size.
= 1e2 # population size
N = 20 # desired sample size
n = n*rep(1/N, N) # inclusion probabilities Pi
Each sampling design that follows is in the form of a binary vector with 1 indicating that the individual is to be sampled.
Systematic sampling This sampling design consists in sampling at a regular interval until the desired sample size is achieved.
UPsystematic(Pi)
[1] 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1
[38] 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0
[75] 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0
Poisson sampling This sampling scheme amounts to sampling each individual according to its inclusion probability, independently of everything else. This design has the property that the desired sample size is not always achieved. (Rather, the sample size is random, having the Poisson distribution with mean the target sample size.)
UPpoisson(Pi)
[1] 1 0 1 0 1 0 1 0 1 0 0 0 1 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0
[38] 0 0 0 0 0 1 0 0 0 1 1 0 0 0 1 0 0 0 1 1 0 0 1 0 0 1 0 1 0 1 0 0 0 0 0 1 1
[75] 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 1 0 0 0
Conditional Poisson sampling (CPS) This sampling scheme is Poisson sampling conditioned on the sample being of the desired size. This can be implemented by rejection sampling (repeatedly rejecting a Poisson sample until it has the desired size), but there are faster methods. (CPS is known to maximize the entropy given the inclusion probabilities, therefore the name of the function below.)
UPmaxentropy(Pi)
[1] 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 1 0 1 1 0 1 1 0 0 1 0 0 0 0 0 1 0
[38] 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 1 0 0 1 1 0 0 0 0 0
[75] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 1 0 1 0
11.1.2 Cluster sampling
We use the following dataset regarding Swiss municipalities in the year 2003, where clusters can be naturally defined. In particular, we use the canton (CT) as clustering variable. (There are 26 cantons in total so that CT takes that many different values.)
data(swissmunicipalities)
= cluster(swissmunicipalities, "CT", size = 5, description = TRUE) # we select 5 cantons cl
Number of selected clusters: 5
Number of units in the population and number of selected units: 2896 334
= getdata(swissmunicipalities, cl) # municipalities to be surveyed plan
11.1.3 Stratified sampling
We use the same dataset and this time use the region (REG) as the stratification variable. (The function, as implemented, requires the ordering of the dataset according to the values of the stratification variable. This is an implementation flaw.)
= swissmunicipalities[order(swissmunicipalities$REG), ]
Data = strata(Data, "REG", size = rep(20, 7), description = TRUE) # we select 20 municipalities in each of the 7 regions st
Stratum 1
Population total and number of selected units: 589 20
Stratum 2
Population total and number of selected units: 913 20
Stratum 3
Population total and number of selected units: 321 20
Stratum 4
Population total and number of selected units: 171 20
Stratum 5
Population total and number of selected units: 471 20
Stratum 6
Population total and number of selected units: 186 20
Stratum 7
Population total and number of selected units: 245 20
Number of strata 7
Total number of selected units 140
= getdata(swissmunicipalities, st) # municipalities to be surveyed plan
11.2 Experimental design
Experimental designs were perhaps first developed in the context of agricultural experiments. So it is fitting that use the following package, which gathers a very large number of statistical tools for agricultural research
require(agricolae)
# require(knitr)
The “plots” below stand for the experimental units. (These would be individuals in clinical trials.) The “book” is short for “field book”, and specifies the treatment assignments.
Completely randomized design
= c("A", "B", "C") # treatments
treat = c(13, 13, 14) # number of replicates per treatment
r = design.crd(treat, r)
outdesign = outdesign$book # field book (treatment assignments)
book subset(book, select = c("plots", "treat"))
plots treat
1 101 B
2 102 B
3 103 B
4 104 A
5 105 A
6 106 B
7 107 A
8 108 A
9 109 C
10 110 A
11 111 C
12 112 C
13 113 B
14 114 C
15 115 A
16 116 A
17 117 C
18 118 C
19 119 C
20 120 B
21 121 B
22 122 A
23 123 A
24 124 C
25 125 A
26 126 A
27 127 A
28 128 C
29 129 C
30 130 B
31 131 B
32 132 B
33 133 B
34 134 B
35 135 C
36 136 B
37 137 C
38 138 C
39 139 C
40 140 A
Randomized complete block design
= c("A", "B", "C", "D", "E")
treat = 4 # number of blocks
r = design.rcbd(treat, r)
outdesign = outdesign$book
book book
plots block treat
1 101 1 A
2 102 1 B
3 103 1 D
4 104 1 C
5 105 1 E
6 201 2 E
7 202 2 D
8 203 2 B
9 204 2 A
10 205 2 C
11 301 3 C
12 302 3 A
13 303 3 E
14 304 3 B
15 305 3 D
16 401 4 C
17 402 4 D
18 403 4 A
19 404 4 B
20 405 4 E
Matched-pairs design This is a special case, with blocks of size 2 representing the pairs.
= c("A", "B") # two treatments
treat = 10 # number of pairs
r = design.rcbd(treat, r)
outdesign = outdesign$book
book names(book) = c("subject", "pair", "treat")
book
subject pair treat
1 101 1 B
2 102 1 A
3 201 2 B
4 202 2 A
5 301 3 A
6 302 3 B
7 401 4 A
8 402 4 B
9 501 5 A
10 502 5 B
11 601 6 B
12 602 6 A
13 701 7 A
14 702 7 B
15 801 8 A
16 802 8 B
17 901 9 A
18 902 9 B
19 1001 10 A
20 1002 10 B
Balanced incomplete block design The number of blocks is determined automatically so that the design is balanced.
= c("A", "B", "C", "D", "E" )
treat = 4 # number of units per block
k = design.bib(treat, k) outdesign
Parameters BIB
==============
Lambda : 3
treatmeans : 5
Block size : 4
Blocks : 5
Replication: 4
Efficiency factor 0.9375
<<< Book >>>
= outdesign$book
book book
plots block treat
1 101 1 A
2 102 1 B
3 103 1 C
4 104 1 D
5 201 2 D
6 202 2 E
7 203 2 A
8 204 2 B
9 301 3 D
10 302 3 E
11 303 3 C
12 304 3 A
13 401 4 E
14 402 4 C
15 403 4 D
16 404 4 B
17 501 5 E
18 502 5 B
19 503 5 C
20 504 5 A
Split plot design
= c("A", "B", "C", "D") # treatment at the whole plot level
treat1 = c("a", "b", "c") # treatment at the subplot level
treat2 = design.split(treat1, treat2, r = 1)
outdesign = outdesign$book
book book
plots splots block treat1 treat2
1 101 1 1 A a
2 101 2 1 A b
3 101 3 1 A c
4 102 1 1 B c
5 102 2 1 B a
6 102 3 1 B b
7 103 1 1 C b
8 103 2 1 C a
9 103 3 1 C c
10 104 1 1 D c
11 104 2 1 D b
12 104 3 1 D a
13 105 1 2 C b
14 105 2 2 C c
15 105 3 2 C a
16 106 1 2 B a
17 106 2 2 B c
18 106 3 2 B b
19 107 1 2 A a
20 107 2 2 A b
21 107 3 2 A c
22 108 1 2 D c
23 108 2 2 D b
24 108 3 2 D a
25 109 1 1 D b
26 109 2 1 D a
27 109 3 1 D c
28 110 1 1 A a
29 110 2 1 A b
30 110 3 1 A c
31 111 1 1 B c
32 111 2 1 B a
33 111 3 1 B b
34 112 1 1 C b
35 112 2 1 C a
36 112 3 1 C c
There seems to be a bug in that even when \(r = 1\) there appears to be 3 replicates. To remedy the situation, we only keep 1 out of 3 replications.
= length(treat1) * length(treat2)
number_splots 1:number_splots, -3] book[
plots splots treat1 treat2
1 101 1 A a
2 101 2 A b
3 101 3 A c
4 102 1 B c
5 102 2 B a
6 102 3 B b
7 103 1 C b
8 103 2 C a
9 103 3 C c
10 104 1 D c
11 104 2 D b
12 104 3 D a
Latin square design This design can be used for simple repeated measures design or a crossover design. (There are packages dedicated to crossover designs.)
= c("A", "B", "C", "D")
treat = design.lsd(treat)
outdesign = outdesign$book[, -1]
book names(book) = c("subject", "period", "treat") # for a crossover design
book
subject period treat
1 1 1 A
2 1 2 B
3 1 3 C
4 1 4 D
5 2 1 B
6 2 2 C
7 2 3 D
8 2 4 A
9 3 1 C
10 3 2 D
11 3 3 A
12 3 4 B
13 4 1 D
14 4 2 A
15 4 3 B
16 4 4 C
# or in matrix form
= matrix(book$treat, byrow = TRUE, 4, 4)
book_matrix rownames(book_matrix) = paste("subject", 1:4)
colnames(book_matrix) = paste("period", 1:4)
prmatrix(book_matrix, quote = FALSE)
period 1 period 2 period 3 period 4
subject 1 A B C D
subject 2 B C D A
subject 3 C D A B
subject 4 D A B C
11.3 Observational studies
We look at how to perform a matching in R. Such a matching can be used in a matched-pairs experimental design or in the analysis of an observational study.
require(optmatch)
We look at the following dataset on women of Pima Indian heritage. (The Pima Indians in the US state of Arizona are known to have among the highest rates of type 2 diabetes in the world.) We merge the training and test sets into a single dataset with a total of 532 observations.
require(MASS)
= rbind(Pima.tr, Pima.te) Pima
In a case-control framework, we might want to understand what distinguishes women with diabetes (cases) and women without (controls).
= which(Pima$type == "Yes") # 177 cases
cases = which(Pima$type == "No") # 355 controls controls
We match one-to-one according to age. (This is a particularly simple problem as we are matching on a single, one-dimensional variable.)
We first build the matrix of differences in age between cases and controls
= abs(outer(Pima$age[cases], Pima$age[controls], "-"))
D rownames(D) = cases
colnames(D) = controls
We then performing the matching
= pairmatch(D)
M = print(M, grouped = TRUE) # each row corresponds to a pair (case, control) given in the 2nd column P
Group Members
1.1 10, 57
1.10 118, 413
1.100 344, 531
1.101 345, 532
1.102 35, 479
1.103 355, 499
1.104 356, 441
1.105 357, 474
1.106 362, 382
1.107 371, 296
1.108 372, 449
1.109 377, 408
1.11 120, 176
1.110 378, 192
1.111 380, 165
1.112 384, 410
1.113 385, 395
1.114 386, 448
1.115 390, 461
1.116 391, 456
1.117 392, 159
1.118 396, 393
1.119 398, 369
1.12 123, 308
1.120 399, 513
1.121 41, 212
1.122 412, 135
1.123 415, 476
1.124 421, 419
1.125 428, 22
1.126 430, 340
1.127 431, 324
1.128 438, 467
1.129 439, 132
1.13 125, 432
1.130 442, 277
1.131 443, 106
1.132 457, 163
1.133 458, 133
1.134 465, 492
1.135 468, 29
1.136 470, 347
1.137 472, 101
1.138 478, 333
1.139 481, 485
1.14 13, 411
1.140 482, 379
1.141 484, 440
1.142 487, 403
1.143 488, 82
1.144 49, 316
1.145 493, 354
1.146 497, 376
1.147 498, 202
1.148 50, 528
1.149 501, 447
1.15 130, 346
1.150 503, 112
1.151 505, 23
1.152 506, 312
1.153 512, 310
1.154 517, 274
1.155 522, 285
1.156 523, 3
1.157 526, 469
1.158 527, 420
1.159 529, 520
1.16 131, 226
1.160 53, 179
1.161 6, 220
1.162 60, 405
1.163 61, 264
1.164 66, 495
1.165 67, 15
1.166 69, 435
1.167 71, 169
1.168 72, 287
1.169 73, 9
1.17 14, 168
1.170 75, 404
1.171 76, 515
1.172 79, 471
1.173 83, 241
1.174 84, 182
1.175 87, 180
1.176 93, 417
1.177 96, 80
1.18 141, 215
1.19 142, 450
1.2 100, 315
1.20 148, 504
1.21 152, 46
1.22 153, 425
1.23 154, 294
1.24 156, 273
1.25 157, 358
1.26 160, 166
1.27 161, 359
1.28 167, 363
1.29 171, 406
1.3 102, 305
1.30 173, 63
1.31 174, 124
1.32 175, 288
1.33 18, 352
1.34 184, 261
1.35 186, 302
1.36 187, 227
1.37 188, 244
1.38 19, 185
1.39 190, 129
1.4 104, 209
1.40 193, 486
1.41 197, 426
1.42 2, 530
1.43 200, 509
1.44 201, 521
1.45 204, 518
1.46 205, 507
1.47 206, 491
1.48 207, 140
1.49 210, 373
1.5 108, 351
1.50 216, 150
1.51 217, 275
1.52 218, 116
1.53 221, 36
1.54 222, 466
1.55 231, 111
1.56 234, 172
1.57 243, 423
1.58 248, 145
1.59 253, 178
1.6 11, 290
1.60 255, 510
1.61 257, 242
1.62 258, 323
1.63 259, 88
1.64 26, 103
1.65 260, 381
1.66 268, 92
1.67 270, 284
1.68 272, 214
1.69 276, 251
1.7 113, 496
1.70 278, 266
1.71 279, 293
1.72 28, 508
1.73 280, 208
1.74 281, 524
1.75 282, 400
1.76 286, 292
1.77 289, 460
1.78 291, 110
1.79 295, 463
1.8 114, 490
1.80 298, 245
1.81 300, 151
1.82 301, 299
1.83 304, 230
1.84 306, 64
1.85 311, 342
1.86 313, 12
1.87 317, 366
1.88 320, 409
1.89 325, 329
1.9 117, 239
1.90 328, 433
1.91 33, 43
1.92 330, 444
1.93 332, 368
1.94 334, 489
1.95 335, 402
1.96 336, 360
1.97 337, 483
1.98 339, 427
1.99 341, 502