Chapter 5 Programming with R—An introduction

“Good programmers write good code, great programmers steal great code.”

R is a comprehensive and powerful programming language. In this section we briefly summarize how to use R for introductory programming, including writing and executing functions.

5.1 Basic programming

Basic data science programming in R is a sequence of R expressions that are collected and executed as a batch job. The R expressions are saved in an ASCII text file with a .R extension. In RStudio, from the main menu, select File, New File, and then R Script. The script file will be saved with a .R extension and can be edited and executed within RStudio. Alternatively, we can edit the file using our favorite text editor (e.g., GNU Emacs).

What are the characteristics of good R programming?

  • Use a good text editor (or RStudio) for programming
  • Organize batch jobs into numbered sequential files (e.g., job01_task.R)
  • Avoid graphical menu-driven approaches

First, use a good text editor (or RStudio) for programming. Each R expression will span one or more lines. Although one could write and submit each line at the R console, this approach is inefficient and not recommended. Instead, type the expressions into your favorite text editor and save the file with a .R extension. Then, selected expressions or the whole file can be executed in R. Use the text editor that comes with R, or a text editor customized to work with R (e.g., RStudio, GNU Emacs with ESS).

Second, we organize batch jobs into sequential files. Data analysis is a series of tasks involving data entry, checking, cleaning, analysis, and reporting. Although data analysts are primarily involved in analysis and reporting, they may be involved in earlier phases of data preparation. Regardless of the stage of involvement, data analysts should organize, conduct, and document their analytic tasks and batch jobs in chronological order. For example, batch jobs might be named as follows: job01_cleaning.R, job02_recoding.R, job03_descriptive.R, job04_logistic.R, etc. Program file names have two components: jobs represent major tasks and are always numbered in chronological order (job01_*.R, job02_*.R, etc.), and a brief descriptor is appended to the first component of the file name (job01_recode_data.R, job02_bivariate_analysis.R).

If one needs to repeat parts of a previous job, then add new jobs rather than editing old ones. This way our analysis can always be reviewed, replicated, and audited in exactly the order it was conducted, including corrections. We avoid editing earlier jobs; if we do edit a previous job, then we must rerun all subsequent jobs in chronological order to double-check our work.

Third, we avoid graphical, menu-driven approaches. While tempting, menu-driven work cannot be documented, replicated, or audited. The best approach is to collect R expressions into batch jobs and run them using the source function.
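
For example, a completed batch job can be executed from the R console or from another script:

source("job01_cleaning.R")  # runs every expression in the file, in order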

Fourth, while preliminary and core analyses should be conducted with script (.R) files, the final report and analyses should be compiled with R Markdown (.Rmd) files. For now, we focus on script files.

More recently, data scientists have been using R Markdown or notebook-style documents to conduct and document analyses. This is fine, but be sure to follow the principles above.

5.2 Intermediate programming

The next level of R programming involves (1) implementing control flow (decision points); (2) implementing dependencies in calculations or data manipulation; and (3) improving execution efficiency.

5.2.1 Control statements

Control flow involves one or more decision points. A simple decision point goes like this: if a condition is TRUE, do {this} and then continue; if it is FALSE, do not do {this} and then continue. When R continues, the next R expression can be any valid expression, including another decision point.

5.2.1.1 The if function

We use the if function to implement single decision points:

if(TRUE) { execute these R expressions }

If the condition is FALSE, R skips the bracketed expressions and continues executing subsequent lines. Study this example:

x <- c(1, 2, NA, 4, 5)
y <- c(1, 2, 3, 4, 5)
if(any(is.na(x))) {x[is.na(x)] <- 999}
x
#> [1]   1   2 999   4   5
if(any(is.na(y))) {y[is.na(y)] <- 999}
y
#> [1] 1 2 3 4 5

The first if condition evaluated to TRUE and the missing value was replaced. The second if condition evaluated to FALSE and the bracketed expressions were not evaluated.

5.2.1.2 The else function

Up to now the if condition had only one possible response. For two mutually exclusive responses, we add the else function:

if(TRUE) {
  execute these R expressions
} else {
  execute these R expressions
}

Here is an example:

x <- c(1, 2, NA, 4, 5)
y <- c(1, 2, 3, 4, 5)
if (any(is.na(x))) {
    x[is.na(x)] <- 999
    cat("NAs replaced\n")
} else {
    cat("No missing values; no replacement\n")
}
#> NAs replaced
x
#> [1]   1   2 999   4   5
if (any(is.na(y))) {
    y[is.na(y)] <- 999
    cat("NAs replaced\n")
} else {
    cat("No missing values; no replacement\n")
}
#> No missing values; no replacement
y
#> [1] 1 2 3 4 5

Therefore, use the if and else combination when one needs to evaluate exactly one of two possible collections of R expressions.

If one needs to evaluate at most one of two possible collections of R expressions, then use the following pattern:

if(TRUE) {
  execute these R expressions
} else if(TRUE) {
  execute these R expressions
}

The if and else functions can be combined to achieve any desired control flow.
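
For example, here is a minimal sketch of a three-way decision (the cutoff values are arbitrary):

x <- 37
if (x < 0) {
  cat("negative\n")
} else if (x < 100) {
  cat("between 0 and 99\n")
} else {
  cat("100 or more\n")
}
#> between 0 and 99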

5.2.1.3 The “short circuit” logical operators

The “short circuit” && and || logical operators are used for control flow in if functions. If logical vectors are provided, only the first element of each vector is used. In contrast, for element-wise comparisons of two or more vectors, use the & and | operators, not the && and || operators. For comparisons inside an if function, use the && and || operators.

Suppose we want to square the elements of a numeric vector but not if it is a matrix.

x <- 1:5
y <- matrix(1:4, 2, 2)
if(!is.matrix(x) && is.numeric(x) && is.vector(x)) {
  x^2
} else cat("Either not numeric, not a vector, or is a matrix\n")
#> [1]  1  4  9 16 25
if(!is.matrix(y) && is.numeric(y) && is.vector(y)) {
  y^2
} else cat("Either not numeric, not a vector, or is a matrix\n")
#> Either not numeric, not a vector, or is a matrix

The && and || operators are called “short circuit” operators because not all of their arguments may be evaluated: moving from left to right, R evaluates only enough arguments to determine whether the whole condition is TRUE or FALSE. This can save considerable time if some of the arguments are complicated functions that require significant computing time to evaluate. In the previous example, because !is.matrix(y) evaluated to FALSE, it was not necessary to evaluate is.numeric(y) or is.vector(y).
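
A common use of short circuiting is guarding against a condition that would otherwise fail. In this sketch (our own example), the length check is FALSE, so the comparison with the nonexistent first element is never attempted:

x <- numeric(0)  # an empty vector
if (length(x) > 0 && x[1] > 5) {
  cat("first element exceeds 5\n")
} else cat("empty vector or small first element\n")
#> empty vector or small first element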

5.2.2 Vectorized approach

An important advantage of R is the availability of functions that perform vectorized calculations. For example, suppose we wish to sum across the columns of a matrix (that is, compute row totals). Here is one approach:

tab <- matrix(1:12, 3, 4)
(rowtot <- tab[,1] + tab[,2] + tab[,3] + tab[,4])
#> [1] 22 26 30

However, this can be accomplished more concisely using the apply function:

(rowtot2 <- apply(tab, 1, sum))
#> [1] 22 26 30

In general, we want to use these types of functions (e.g., apply, tapply, sweep, outer) because they have been optimized to perform vectorized calculations.
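
For instance, tapply applies a function within the groups defined by a second vector, with no explicit loop (a small made-up example):

ages <- c(23, 41, 35, 52, 19)
grp  <- c("F", "M", "F", "M", "F")
tapply(ages, grp, mean)  # mean age within each group
#>        F        M 
#> 25.66667 46.50000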

5.2.2.1 The ifelse function

The ifelse function is a vectorized, element-wise implementation of the if and else functions. We demonstrate using the practical example of recoding a 2-level variable.

sex <- c("M", NA, "F", "F", NA, "M", "F", "M") 
(sex2 <- ifelse(sex=="M", "Male", "Female")) # binary recode
#> [1] "Male"   NA       "Female" "Female" NA       "Male"  
#> [7] "Female" "Male"

If an element of sex contains “M” (TRUE), it is recoded to “Male” in sex2; otherwise (FALSE), it is recoded to “Female”. Missing values remain NA because the condition evaluates to NA. This assumes that there are only “M”s and “F”s in the data vector.
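
If other codes might appear, a more defensive sketch recodes only the values we recognize and leaves everything else as NA:

(sex3 <- ifelse(sex == "M", "Male",
         ifelse(sex == "F", "Female", NA)))
#> [1] "Male"   NA       "Female" "Female" NA       "Male"  
#> [7] "Female" "Male"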

5.2.3 Looping

Looping is a common programming approach that is discouraged in R because it is often inefficient. It is much better to perform vectorized calculations using existing functions. For example, suppose we want to sum a numeric vector.

x <- c(2,4,6,8,10)
index <- 1:length(x) # create vector of sequential integers
xsum <- 0
for (i in index){
  xsum <- xsum + x[i] 
}
xsum
#> [1] 30

A much better approach is to use the sum function:

sum(x)
#> [1] 30

Unless it is absolutely necessary, we avoid looping.

Looping is necessary when (1) there is no R function to conduct a vectorized calculation, or (2) the result for an element depends on the result for a preceding element that may not be known beforehand (e.g., when it is the result of a random process).
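
For example, in a random walk each position depends on the position before it, which is only known once the random step is drawn, so a loop is a natural fit (a minimal sketch; set.seed makes it reproducible):

set.seed(1)  # for reproducibility
steps <- 10
walk <- numeric(steps)
for (i in 2:steps) {
  walk[i] <- walk[i - 1] + sample(c(-1, 1), 1)  # depends on previous element
}
walk  # the simulated path (values depend on the seed)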

5.2.3.1 The for function

The previous example was a for loop. Here is the syntax:

for (i in somevector) {
  do some calculation with the ith element of somevector
}

In the for loop, R iterates over somevector and uses the ith element either directly or indirectly (e.g., to index another vector). Here we use the vector directly:

for(i in 1:3){
  cat(i^2, "\n")
}
#> 1 
#> 4 
#> 9

The letters object contains the lowercase English alphabet. Here we use an integer vector to index letters:

letters # display letters
#>  [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o"
#> [16] "p" "q" "r" "s" "t" "u" "v" "w" "x" "y" "z"
for(i in 1:3){
  cat(letters[i],"\n")
}
#> a 
#> b 
#> c

The object somevector can be any vector:

kids <- c("Tomasito", "Luisito", "Angelita")
for (i in kids) {print(i)}
#> [1] "Tomasito"
#> [1] "Luisito"
#> [1] "Angelita"

5.2.3.2 The while function

The while function will continue to evaluate a collection of R expressions while a condition is true. Here is the syntax:

while(TRUE) {
  execute these R expressions
}

Here is a trivial example:

x <- 1 ; z <- 0
while(z < 5){
  show(z)
  z <- z + x
}
#> [1] 0
#> [1] 1
#> [1] 2
#> [1] 3
#> [1] 4

The while function is useful for iterative algorithms, such as optimization routines, that converge to a numerical value.
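
As a sketch of that idea, here is Newton's method for a square root (our own example): the loop keeps improving a guess until it is close enough.

x <- 16; guess <- 1
while (abs(guess^2 - x) > 1e-8) {
  guess <- (guess + x/guess) / 2  # Newton update for sqrt(x)
}
guess
#> [1] 4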


5.2.3.3 The break and next functions

The break expression will break out of a for or while loop if a condition is met, and transfers control to the first statement outside of the inner-most loop. Here is the general syntax:

for (i in somevector) {
  do some calculation with the ith element of somevector
  if(TRUE) break
}

The next expression halts the processing of the current iteration and advances the looping index. Here is the general syntax:

for (i in somevector) {
  do some calculation with the ith element of somevector
  if(TRUE) next
}

Both break and next apply only to the innermost of nested loops.
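
Here is a small concrete sketch combining both: next skips the odd numbers, and break stops the loop entirely once the value exceeds 6.

for (i in 1:10) {
  if (i %% 2 == 1) next  # skip odd numbers
  if (i > 6) break       # stop the loop entirely
  cat(i, "\n")
}
#> 2 
#> 4 
#> 6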

5.2.3.4 The double for loop

In the next example we nest two for loops to generate a multiplication table for two integer vectors.

x <- 11:15
y <- 16:25
mtab <- matrix(NA, nrow = length(x), ncol = length(y))
rownames(mtab) <- x
colnames(mtab) <- y
for (i in 1:length(x)) {
    for (j in 1:length(y)) {
        mtab[i, j] <- x[i] * y[j]
    }
}
mtab
#>     16  17  18  19  20  21  22  23  24  25
#> 11 176 187 198 209 220 231 242 253 264 275
#> 12 192 204 216 228 240 252 264 276 288 300
#> 13 208 221 234 247 260 273 286 299 312 325
#> 14 224 238 252 266 280 294 308 322 336 350
#> 15 240 255 270 285 300 315 330 345 360 375

Notice that the outer function is a more efficient way of conducting a calculation over all combinations of two numeric vectors.

x <- 11:15; y <- 16:25
mtab2 <- outer(x, y, '*')
rownames(mtab2) <- x; colnames(mtab2) <- y
mtab2
#>     16  17  18  19  20  21  22  23  24  25
#> 11 176 187 198 209 220 231 242 253 264 275
#> 12 192 204 216 228 240 252 264 276 288 300
#> 13 208 221 234 247 260 273 286 299 312 325
#> 14 224 238 252 266 280 294 308 322 336 350
#> 15 240 255 270 285 300 315 330 345 360 375

5.3 Writing R functions

Writing R functions is a great way to create “commands” to customize analyses or repeated operations. Here is the general analysis approach:

  1. Prepare inputs
  2. Do calculations
  3. Collect results

For example, suppose we are writing R code to calculate the odds ratio from a \(2 \times 2\) table with the appropriate format. For this we will use the Oswego outbreak data set.

#### Prepare inputs
odatcsv <- read.csv('~/data/oswego/oswego2.csv') 
head(odatcsv)
#>   ID Age Sex   MealDate MealTime Ill OnsetDate TimeOnset
#> 1  2  52   F 04/18/1940 20:00:00   Y 4/19/1940  00:30:00
#> 2  3  65   M 04/18/1940 18:30:00   Y 4/19/1940  00:30:00
#> 3  4  59   F 04/18/1940 18:30:00   Y 4/19/1940  00:30:00
#> 4  6  63   F 04/18/1940 19:30:00   Y 4/18/1940  22:30:00
#> 5  7  70   M 04/18/1940 19:30:00   Y 4/18/1940  22:30:00
#> 6  8  40   F 04/18/1940 19:30:00   Y 4/19/1940  02:00:00
#>   Baked.Ham Spinach Mashed.Potatoes Cabbage.Salad Jello Rolls
#> 1         Y       Y               Y             N     N     Y
#> 2         Y       Y               Y             Y     N     N
#> 3         Y       Y               N             N     N     N
#> 4         Y       Y               N             Y     Y     N
#> 5         Y       Y               Y             N     Y     Y
#> 6         N       N               N             N     N     N
#>   Brown.Bread Milk Coffee Water Cakes Van.Ice.Cream
#> 1           N    N      Y     N     N             Y
#> 2           N    N      Y     N     N             Y
#> 3           N    N      Y     N     Y             Y
#> 4           N    N      N     Y     N             Y
#> 5           Y    N      Y     Y     N             Y
#> 6           N    N      N     N     N             Y
#>   Choc.Ice.Cream Fruit.Salad
#> 1              N           N
#> 2              Y           N
#> 3              Y           N
#> 4              N           N
#> 5              N           N
#> 6              Y           N
tab1 = xtabs(~ Ill + Spinach, data = odatcsv)
tab1
#>    Spinach
#> Ill  N  Y
#>   N 12 17
#>   Y 20 26
aa = tab1[1, 1]
bb = tab1[1, 2]
cc = tab1[2, 1]
dd = tab1[2, 2]

#### Do calculations
crossprod.OR = (aa*dd)/(bb*cc)

#### Collect results
list(data = tab1, odds.ratio = crossprod.OR)
#> $data
#>    Spinach
#> Ill  N  Y
#>   N 12 17
#>   Y 20 26
#> 
#> $odds.ratio
#> [1] 0.9176471

Now that we are familiar with what it takes to calculate an odds ratio from a 2-way table, we can convert the code into a function and load it at the R console. Here is the new function:

myOR <- function(x){
  #### Prepare input
  #### x = 2x2 table amenable to cross-product
  aa = x[1, 1]
  bb = x[1, 2]
  cc = x[2, 1]
  dd = x[2, 2]

  #### Do calculations
  crossprod.OR = (aa*dd)/(bb*cc)

  #### Collect results
  list(data = x, odds.ratio = crossprod.OR)
}

Now we can test the function:

tab.test = xtabs(~ Ill + Spinach, data = odatcsv)
myOR(tab.test)
#> $data
#>    Spinach
#> Ill  N  Y
#>   N 12 17
#>   Y 20 26
#> 
#> $odds.ratio
#> [1] 0.9176471

5.3.1 Arguments with default values

Now suppose we wish to add calculating a 95% confidence interval to this function. We will use the following normal approximation standard error formula for an odds ratio:

\[ SE[\log(OR)] = \sqrt{\frac{1}{A_1} + \frac{1}{B_1} + \frac{1}{A_0} + \frac{1}{B_0}} \]

And here is the \((1-\alpha)100\%\) confidence interval:

\[ OR_L, OR_U = \exp\lbrace \log(OR) \pm Z_{\alpha/2} SE[\log(OR)] \rbrace \]

Here is the improved function:

myOR2 <- function(x, conf.level){
  #### Prepare input
  #### x = 2x2 table amenable to cross-product
  if(missing(conf.level)) stop("Must specify confidence level")
  aa = x[1, 1]
  bb = x[1, 2]
  cc = x[2, 1]
  dd = x[2, 2]

  Z <- qnorm((1 + conf.level)/2)
  
  #### Do calculations
  logOR <- log((aa*dd)/(bb*cc))
  SE.logOR <- sqrt(1/aa + 1/bb + 1/cc + 1/dd)
  OR <- exp(logOR)
  CI <- exp(logOR + c(-1, 1)*Z*SE.logOR)

  #### Collect results
  list(data = x, odds.ratio = OR, conf.int = CI)
}

Notice that conf.level is a new argument with no default value. If a user forgets to specify a confidence level, the following line handles this possibility:

if(missing(conf.level)) stop("Must specify confidence level")

Now we test this function:

tab.test <- xtabs(~ Ill + Spinach, data = odatcsv)
myOR2(tab.test)
#> Error in myOR2(tab.test): Must specify confidence level
myOR2(tab.test, 0.95)
#> $data
#>    Spinach
#> Ill  N  Y
#>   N 12 17
#>   Y 20 26
#> 
#> $odds.ratio
#> [1] 0.9176471
#> 
#> $conf.int
#> [1] 0.3580184 2.3520471

If an argument has a usual value, then specify it as the argument’s default value:

myOR3 <- function(x, conf.level = 0.95){
  #### Prepare input
  #### x = 2x2 table amenable to cross-product
  aa = x[1, 1]
  bb = x[1, 2]
  cc = x[2, 1]
  dd = x[2, 2]
  Z <- qnorm((1 + conf.level)/2)
  
  #### Do calculations
  logOR <- log((aa*dd)/(bb*cc))
  SE.logOR <- sqrt(1/aa + 1/bb + 1/cc + 1/dd)
  OR <- exp(logOR)
  CI <- exp(logOR + c(-1, 1)*Z*SE.logOR)

  #### Collect results
  list(data = x, odds.ratio = OR, conf.level = conf.level, conf.int = CI)
}

We test our new function:

tab.test <- xtabs(~ Ill + Spinach, data = odatcsv)
myOR3(tab.test)
#> $data
#>    Spinach
#> Ill  N  Y
#>   N 12 17
#>   Y 20 26
#> 
#> $odds.ratio
#> [1] 0.9176471
#> 
#> $conf.level
#> [1] 0.95
#> 
#> $conf.int
#> [1] 0.3580184 2.3520471
myOR3(tab.test, 0.90)
#> $data
#>    Spinach
#> Ill  N  Y
#>   N 12 17
#>   Y 20 26
#> 
#> $odds.ratio
#> [1] 0.9176471
#> 
#> $conf.level
#> [1] 0.9
#> 
#> $conf.int
#> [1] 0.4165094 2.0217459

5.3.2 Passing optional arguments using the ... function

On occasion we will have a function nested inside one of our functions and we need to be able to pass optional arguments to this nested function. This commonly occurs when we write functions for customized graphics but only wish to specify some arguments for the nested function and leave the remaining arguments optional. For example, consider this function:

myplot <- function(x, y, type = "b", ...){
  plot(x, y, type = type, ...)
}

When using myplot, one only needs to provide the x and y arguments. The type argument has been set to a default value of “b”. The ... function will pass any optional arguments to the nested plot function. Of course, the optional arguments must be valid arguments for the plot function.
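
For example, col and main are not named in myplot's definition, yet they pass through ... to the plot function (the values here are arbitrary):

myplot(1:10, (1:10)^2, col = "red", main = "Passing options via ...")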

5.4 Lexical scoping

The variables which occur in the body of a function can be divided into three classes: formal parameters, local variables, and free variables. The formal parameters of a function are those occurring in the argument list of the function. Their values are determined by the process of binding the actual function arguments to the formal parameters. Local variables are those whose values are determined by the evaluation of expressions in the body of the function. Variables which are not formal parameters or local variables are called free variables. Free variables become local variables if they are assigned to. Consider the following function definition.

f <- function(x){
  y <- 2 * x
  print(x)
  print(y)
  print(z)
}

In this function, x is a formal parameter, y is a local variable, and z is a free variable. In R the value of a free variable is resolved by first looking in the environment in which the function was defined. This is called lexical scoping. If the free variable is not defined there, R looks in the parent environment. For the function f this would be the global environment (workspace).

To understand the implications of lexical scope consider the following:

rm(list = ls())
ls()
#> character(0)

f <- function(x){
    y <- 2 * x
    print(x)
    print(y)
    print(z)
}

f(5)
#> [1] 5
#> [1] 10
#> Error in print(z) : object 'z' not found

z <- 99
f(5)
#> [1] 5
#> [1] 10
#> [1] 99

In the f function, z is a free variable. The first time f is executed, z is not defined inside the function. R looks in the enclosing environment, does not find a value for z, and reports an error. However, once an object z is created in the global environment, R is able to find it and uses its value.

To summarize, from Roger Peng (CITE):

  1. "If the value of a symbol is not found in the environment in which a function was defined, then the search is continued in the parent environment.
  2. The search continues down the sequence of parent environments until we hit the top-level environment; this usually the global environment (workspace) or the namspace of a package.
  3. After the top-level environment, the search continues down the search list until we hit the empty environment."

Lexical scoping is convenient because it allows nested functions with free variables to run, provided the variable has been defined in an enclosing environment. This convenience becomes obvious when one writes many programs. However, there is a danger: an unintended free variable may find an unintended value in a parent environment. This may go undetected because no error is reported. This can happen when there are many objects in the workspace left over from previous sessions. A good habit is to clear the workspace of all objects at the beginning of every session.

Here is another example from the R introductory manual. Consider a function called cube.

cube <- function(n) {
  sq <- function() {n * n}
  n * sq()
}

The variable n in the function sq is not an argument to that function. Therefore it is a free variable, and the scoping rules must be used to ascertain the value that is to be associated with it. Under lexical scoping (as in R), it is the parameter to the function cube, since that is the active binding for the variable n at the time the function sq was defined.

cube(2)
#> [1] 8

Here is a variation that demonstrates lexical scoping but with an unexpected numerical answer.

n <- 3
sq <- function() {n * n}
cube <- function(n) {
  n * sq()
}
cube(2)
#> [1] 18

Why does cube(2) return 18? The sq function has a free variable n. When sq is evaluated inside cube, it must find a value for its free variable n. Under lexical scoping, sq looks in the environment in which it was defined. The sq function was defined in the global environment, so it looks there, finds n = 3, and uses it.

5.4.1 Functions with a mutable state

Lexical scoping can be used to create a function with a mutable state. Consider the following function that simulates a bank savings account balance with deposits and withdrawals.

open.account <- function(total) {
    list(deposit = function(amount) {
        if (amount <= 0) stop("Deposits must be positive!\n")
        total <<- total + amount
        cat(amount, "deposited.  Your balance is", total, "\n\n")
    }, withdraw = function(amount) {
        if (amount > total) stop("You don't have that much money!\n")
        total <<- total - amount
        cat(amount, "withdrawn.  Your balance is", total, "\n\n")
    }, balance = function() {
        cat("Your balance is", total, "\n\n")
    })
}

The total is a formal parameter of the function open.account, which returns a list of three functions: deposit, withdraw, and balance. Inside these functions, total is a free variable; each finds its value in the enclosing environment (the evaluation environment of open.account), one level up. The special assignment operator, “<<-”, makes the assignment in that enclosing environment, updating the value of total.

Let’s open some savings accounts and make deposits and withdrawals.

luis <- open.account(500)
angela <- open.account(100)
luis$withdraw(75)
#> 75 withdrawn.  Your balance is 425
angela$deposit(250)
#> 250 deposited.  Your balance is 350
luis$balance()
#> Your balance is 425
angela$balance()
#> Your balance is 350

5.5 Recursive functions

A function that calls itself is a recursive function. Recursive functions break a problem down into smaller steps that apply the same technique. For example, consider the factorial of 5:

\[ 5! = 5 \times 4! = 5 \times 4 \times 3! = 5 \times 4 \times 3 \times 2! = 5 \times 4 \times 3 \times 2 \times 1! \]

We can write a recursive function for factorial calculations:

recursive.factorial <- function(x) {
    if (x == 0) {
        return(1)
    } else {
        return(x * recursive.factorial(x - 1))
    }
}
recursive.factorial(5)
#> [1] 120
factorial(5)  # compare to native R function
#> [1] 120

5.6 Debugging and profiling R code

TODO: pending

5.7 Example: Bootstrap of risk ratio data

Consider the data in Table 5.1, which counts coronary heart disease (CHD) events for men with Type A behavior and men with Type B behavior [17]. There is no formula for calculating an exact confidence interval for a risk ratio. When the distribution of a statistical estimate is not known, we can use the bootstrap method. Essentially, this involves the following steps:

  1. Simulate risk ratio estimates using the data
  2. Estimate the mean of the risk ratios
  3. Choose appropriate quantile values as confidence limits

TABLE 5.1: Counts of coronary heart disease (CHD) events for Type A and Type B subjects

                    CHD
Exposure      Yes      No   Total
Type A        178    1411    1589
Type B         79    1486    1565

rr.boot <- function(x1, x0, conf.level = 0.95, replicates = 5000){
  #### x1 = c(number cases among exposed, number of exposed)
  #### x0 = c(number cases among nonexposed, number of nonexposed)
  #### prepare input 
  n1 <- x1[2]; n0 <- x0[2]
  p1 <- x1[1]/n1 # risk among exposed
  p0 <- x0[1]/n0 # risk among unexposed
    
  #### do calculations
  RR <- p1/p0 # maximum likelihood estimate of risk ratio
  r1 <- rbinom(replicates, n1, p1)/n1
  r0 <- rbinom(replicates, n0, p0)/n0
  rrboot <- r1/r0
  rrbar <- mean(rrboot)
  alpha <- 1 - conf.level
  ci <- quantile(rrboot, c(alpha/2, 1-alpha/2))

  #### collect results
  list(x1 = x1, x0 = x0,
       risks = c(p1 = p1, p0 = p0),
       risk.ratio = RR,
       rrboot.mean = rrbar,
       conf.int = unname(ci),
       conf.level = conf.level,
       replicates = replicates)
}

#### test function
xx1 <- c(178, 1589) # CHD among Type A
xx0 <- c(79, 1565) # CHD among Type B
rr.boot(xx1, xx0)
#> $x1
#> [1]  178 1589
#> 
#> $x0
#> [1]   79 1565
#> 
#> $risks
#>         p1         p0 
#> 0.11202014 0.05047923 
#> 
#> $risk.ratio
#> [1] 2.219133
#> 
#> $rrboot.mean
#> [1] 2.25108
#> 
#> $conf.int
#> [1] 1.736260 2.909937
#> 
#> $conf.level
#> [1] 0.95
#> 
#> $replicates
#> [1] 5000

5.8 Exercises

Probability distributions are either continuous (e.g., normal distribution) or discrete (e.g., binomial distribution). In R, we deal with four aspects of probability distributions (Table 5.2).

TABLE 5.2: Examples of probability distribution functions in R

Aspect of a probability distribution     Normal   Binomial
probability density function             dnorm    dbinom
cumulative distribution function         pnorm    pbinom
quantile (“percentile”) function         qnorm    qbinom
random sample from distribution          rnorm    rbinom

For example, to get the quantile value from a standard normal distribution (Z value) that corresponds to a 95% confidence interval we use the qnorm function.

conf.level <- 0.95 
qnorm((1 + conf.level)/2)
#> [1] 1.959964
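
The d and p forms answer the other common questions: dbinom gives the probability of exactly k events, and pbinom gives the cumulative probability of k or fewer events. A tiny example with two fair coin flips:

dbinom(1, size = 2, prob = 0.5)  # P(exactly 1 success in 2 trials)
#> [1] 0.5
pbinom(1, size = 2, prob = 0.5)  # P(1 or fewer successes)
#> [1] 0.75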

For a second example, suppose we wish to simulate a binomial distribution. Out of 350 patients who adhered to a new drug treatment, 273 recovered. The response proportion was 273/350, or 0.78 (78%). We can use the rbinom function to simulate this binomial process. Let’s simulate it 10 times (i.e., 10 replicates).

replicates <- 10
num_of_events <- 273
num_at_risk <- 350
risk <- num_of_events/num_at_risk
sim_events <- rbinom(n = replicates, size = num_at_risk, prob = risk)
sim_events
#>  [1] 275 275 277 280 288 284 268 282 282 266
sim_risks <- sim_events/num_at_risk
round(sim_risks,3)
#>  [1] 0.786 0.786 0.791 0.800 0.823 0.811 0.766 0.806 0.806 0.760

Bootstrapping is a resampling method to estimate a confidence interval. For example, we could simulate 5000 replicates and use the quantile function to estimate a 0.95 confidence interval.

replicates <- 5000
num_of_events <- 273
num_at_risk <- 350
risk <- num_of_events/num_at_risk
sim_events <- rbinom(n = replicates, size = num_at_risk, prob = risk)
conf.level <- 0.95
alpha <- 1 - conf.level
## 0.95 CI for binomial counts
quantile(sim_events, c(alpha/2, 1-alpha/2))
#>  2.5% 97.5% 
#>   257   288
## 0.95 CI for binomial proportions
quantile(sim_events, c(alpha/2, 1-alpha/2))/num_at_risk
#>      2.5%     97.5% 
#> 0.7342857 0.8228571
Exercise 5.1 (Bootstrap a confidence interval) In the same study above, 350 patients did not take the new drug treatment, and 289 recovered. Calculate the bootstrap confidence interval for this proportion (289/350).
Exercise 5.2 (Function for bootstrapping a confidence interval) Adapt your code from the previous problem to write and test a function for calculating the bootstrap confidence interval for a binomial proportion.

For example, if your previous code was

z <- 4
z^3    

then, your function and test might be

mycube <- function(x) {
    x^3
}
mycube(4)  # test
#> [1] 64
Exercise 5.3 (Calculating risks and risk ratio) Consider Table 5.3 for cohort (risk) data.

TABLE 5.3: \(2 \times 2\) table for cohort (risk) data

                      Exposed
Disease         Yes        No      Total
Yes            \(a\)     \(b\)    \(M_1\)
No             \(c\)     \(d\)    \(M_0\)
Total         \(N_1\)   \(N_0\)    \(T\)

The risk ratio is calculated by

\[ RR = \frac{a/N_1}{b/N_0} = \frac{R_1}{R_0}, \]

and the standard error is calculated by

\[ SE[\log(RR)] = \sqrt{\frac{1}{a}-\frac{1}{N_1}+\frac{1}{b}-\frac{1}{N_0}}. \]

Consider an observational study where 700 patients were given access to a new drug for an ailment. A total of 350 patients chose to take the drug and 350 patients did not. Table 5.4 displays the results of this study.

TABLE 5.4: Study where 700 patients were given access to a new drug treatment

                  Treatment
Recovered      Yes     No   Total
Yes            273    289     562
No              77     61     138
Total          350    350     700

Using the results from Table 5.4 create and test a function that:

  • Calculates the risks \(R_1\) and \(R_0\).
  • Calculates the risk ratio (\(RR\)).
  • Calculates 95% confidence intervals.
  • Uses the fisher.test function (Fisher’s Exact Test) to calculate the two-sided \(p\)-value of the null hypothesis of no association. (Hint: see epitable function in Exercise 5.5)
  • Collects and returns all the results as a ‘list’.
Exercise 5.4 (Evaluating drug treatment) Consider an observational study where 700 patients were given access to a new drug for an ailment. A total of 350 patients chose to take the drug and 350 patients did not. Table 5.4 displays the results of this study.

Using the read.csv function import the data frame for Table 5.4. Use the code below.

drugrx.df <- read.csv('~/data/exported/jp-drugrx-p2.txt')

Analyze the drugrx.df data frame using the xtabs function, your function from the previous problem, and any R functions you wish. Is the drug treatment effective? Is there a difference in drug treatment effectiveness comparing men and women?

Exercise 5.5 (Oswego outbreak investigation) Study this clean version of the Oswego data set and import it as a data frame: ~/data/oswego/oswego2.txt. Remember, this data set is available from https://github.com/taragonmd/data. Locate the data set and select the “Raw” tab. Copy the URL and use it with the read.csv function.

Import this clean version.

oswego2 <- read.csv("~/data/oswego/oswego2.txt", strip.white = TRUE)
food.item <- c("Baked.ham", "Spinach", "Mashed.potato", "Cabbage.salad", 
    "Jello", "Rolls", "Brown.bread", "Milk", "Coffee", "Water", "Cakes", 
    "Vanilla.ice.cream", "Chocolate.ice.cream", "Fruit.salad")
names(oswego2)[9:22] <- food.item
str(oswego2)
#> 'data.frame':    75 obs. of  22 variables:
#>  $ ID                 : int  2 3 4 6 7 8 9 10 14 16 ...
#>  $ Age                : int  52 65 59 63 70 40 15 33 10 32 ...
#>  $ Sex                : Factor w/ 2 levels "F","M": 1 2 1 1 2 1 1 1 2 1 ...
#>  $ MealDate           : Factor w/ 1 level "04/18/1940": 1 1 1 1 1 1 1 1 1 1 ...
#>  $ MealTime           : Factor w/ 6 levels "11:00:00","18:30:00",..: 5 2 2 4 4 4 6 3 4 NA ...
#>  $ Ill                : Factor w/ 2 levels "N","Y": 2 2 2 2 2 2 2 2 2 2 ...
#>  $ OnsetDate          : Factor w/ 2 levels "4/18/1940","4/19/1940": 2 2 2 1 1 2 2 1 2 2 ...
#>  $ OnsetTime          : Factor w/ 17 levels "00:00:00","00:30:00",..: 2 2 2 15 15 4 3 16 4 7 ...
#>  $ Baked.ham          : Factor w/ 2 levels "N","Y": 2 2 2 2 2 1 1 2 1 2 ...
#>  $ Spinach            : Factor w/ 2 levels "N","Y": 2 2 2 2 2 1 1 2 1 2 ...
#>  $ Mashed.potato      : Factor w/ 2 levels "N","Y": 2 2 1 1 2 1 1 2 1 1 ...
#>  $ Cabbage.salad      : Factor w/ 2 levels "N","Y": 1 2 1 2 1 1 1 1 1 1 ...
#>  $ Jello              : Factor w/ 2 levels "N","Y": 1 1 1 2 2 1 1 1 1 1 ...
#>  $ Rolls              : Factor w/ 2 levels "N","Y": 2 1 1 1 2 1 1 2 1 2 ...
#>  $ Brown.bread        : Factor w/ 2 levels "N","Y": 1 1 1 1 2 1 1 2 1 1 ...
#>  $ Milk               : Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 1 1 ...
#>  $ Coffee             : Factor w/ 2 levels "N","Y": 2 2 2 1 2 1 1 1 1 2 ...
#>  $ Water              : Factor w/ 2 levels "N","Y": 1 1 1 2 2 1 1 2 1 1 ...
#>  $ Cakes              : Factor w/ 2 levels "N","Y": 1 1 2 1 1 1 2 1 1 2 ...
#>  $ Vanilla.ice.cream  : Factor w/ 2 levels "N","Y": 2 2 2 2 2 2 1 2 2 2 ...
#>  $ Chocolate.ice.cream: Factor w/ 2 levels "N","Y": 1 2 2 1 1 2 2 2 2 2 ...
#>  $ Fruit.salad        : Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 1 1 ...

Study the following epitable function, especially the use of the ... argument to pass options to the fisher.test function.

epitable <- function(exposure, outcome, digits = 3, ...){
    # Updated 2016-10-16
    # 2x2 table from 'xtabs' function
    # row names are exposure status
    # col names are illness outcome
    # digits = significant digits (default = 3) 
    # ... = options to fisher.test function
    x <- xtabs(~ exposure + outcome, na.action=na.pass)
    rowtot <- apply(x, 1, sum)
    rowdist <- sweep(x, 1, rowtot, '/')
    risk0 <- rowdist['N','Y']
    risk1 <- rowdist['Y','Y']
    rr <- risk1/risk0
    or <- (risk1/(1-risk1)) / (risk0/(1-risk0))
    ft <- fisher.test(x, ...)
    pv <- ft$p.value
    results <- signif(c(risk0, risk1, rr, or, pv), digits = digits)
    names(results) <- c('R_0', 'R_1', 'RR', 'OR', 'p.value')
    list(data = x, results = results)
}

Here is an example of testing one food item:

epitable(oswego2$Baked.ham, oswego2$Ill)
#> $data
#>         outcome
#> exposure  N  Y
#>        N 12 17
#>        Y 17 29
#> 
#> $results
#>     R_0     R_1      RR      OR p.value 
#>   0.586   0.630   1.080   1.200   0.809

Use the epitable function to test each food item to see if it is associated with developing illness. Construct a table with the following column headings:

  • Food item
  • \(R_0\)
  • \(R_1\)
  • \(RR\) (risk ratio)
  • \(OR\) (odds ratio)
  • Fisher \(p\) value

Discuss your findings.

Exercise 5.6 (Generate outbreak investigation results table) Write a for loop to automate the creation of the outbreak table from previous exercise.

Hint provided:

outbreak <- matrix(NA, length(food.item), 5)
rownames(outbreak) <- food.item
colnames(outbreak) <- c('R_0','R_1','RR','OR','P value')
for(i in 1:length(food.item)) {
  outbreak[i,] <- epitable(oswego2[,food.item[i]], oswego2$Ill)$results
}
outbreak
Exercise 5.7 (Comparing lapply and sapply) Study the following use of lapply and sapply. Explain how both functions work and how they differ.
myepitab <- function(x) {
    epitable(x, oswego2$Ill)$results
}
lapply(oswego2[, food.item], myepitab)
#> $Baked.ham
#>     R_0     R_1      RR      OR p.value 
#>   0.586   0.630   1.080   1.200   0.809 
#> 
#> $Spinach
#>     R_0     R_1      RR      OR p.value 
#>   0.625   0.605   0.967   0.918   1.000 
#> 
#> $Mashed.potato
#>     R_0     R_1      RR      OR p.value 
#>   0.622   0.622   1.000   1.000   1.000 
#> 
#> $Cabbage.salad
#>     R_0     R_1      RR      OR p.value 
#>   0.596   0.643   1.080   1.220   0.808 
#> 
#> $Jello
#>     R_0     R_1      RR      OR p.value 
#>   0.577   0.696   1.210   1.680   0.442 
#> 
#> $Rolls
#>     R_0     R_1      RR      OR p.value 
#>   0.658   0.568   0.863   0.682   0.482 
#> 
#> $Brown.bread
#>     R_0     R_1      RR      OR p.value 
#>   0.583   0.667   1.140   1.430   0.622 
#> 
#> $Milk
#>     R_0     R_1      RR      OR p.value 
#>   0.620   0.500   0.807   0.614   0.638 
#> 
#> $Coffee
#>     R_0     R_1      RR      OR p.value 
#>   0.614   0.613   0.999   0.997   1.000 
#> 
#> $Water
#>     R_0     R_1      RR      OR p.value 
#>   0.647   0.542   0.837   0.645   0.450 
#> 
#> $Cakes
#>     R_0     R_1      RR      OR p.value 
#>   0.543   0.675   1.240   1.750   0.342 
#> 
#> $Vanilla.ice.cream
#>      R_0      R_1       RR       OR  p.value 
#> 1.43e-01 7.96e-01 5.57e+00 2.35e+01 2.60e-07 
#> 
#> $Chocolate.ice.cream
#>     R_0     R_1      RR      OR p.value 
#>  0.7410  0.5320  0.7180  0.3980  0.0891 
#> 
#> $Fruit.salad
#>     R_0     R_1      RR      OR p.value 
#>   0.609   0.667   1.100   1.290   1.000
t(sapply(oswego2[, food.item], myepitab))  # transpose output
#>                       R_0   R_1    RR     OR  p.value
#> Baked.ham           0.586 0.630 1.080  1.200 8.09e-01
#> Spinach             0.625 0.605 0.967  0.918 1.00e+00
#> Mashed.potato       0.622 0.622 1.000  1.000 1.00e+00
#> Cabbage.salad       0.596 0.643 1.080  1.220 8.08e-01
#> Jello               0.577 0.696 1.210  1.680 4.42e-01
#> Rolls               0.658 0.568 0.863  0.682 4.82e-01
#> Brown.bread         0.583 0.667 1.140  1.430 6.22e-01
#> Milk                0.620 0.500 0.807  0.614 6.38e-01
#> Coffee              0.614 0.613 0.999  0.997 1.00e+00
#> Water               0.647 0.542 0.837  0.645 4.50e-01
#> Cakes               0.543 0.675 1.240  1.750 3.42e-01
#> Vanilla.ice.cream   0.143 0.796 5.570 23.500 2.60e-07
#> Chocolate.ice.cream 0.741 0.532 0.718  0.398 8.91e-02
#> Fruit.salad         0.609 0.667 1.100  1.290 1.00e+00
Exercise 5.8 (Lexical scoping) Study the following R code and explain the final answer for g(2):
a <- 1
b <- 2
f <- function(x) {
    a * x + b
}
g <- function(x) {
    a <- 2
    b <- 1
    f(x)
}
g(2)
#> [1] 4
Exercise 5.9 (More lexical scoping) Study the following R code and explain the final answer for g(2):
g <- function(x) {
    a <- 2
    b <- 1
    f <- function(x) {
        a * x + b
    }
    f(x)
}
g(2)
#> [1] 5

References

17. Selvin S. Modern applied biostatistical methods using S-Plus. New York: Oxford University Press; 1998.