Chapter 5 Programming and R functions

“Good programmers write good code, great programmers borrow good code.”

R is a comprehensive and powerful programming language. In this section we briefly summarize how to use R for introductory programming, including writing and executing functions.

5.1 Basic programming

Basic data science programming in R is just a list of R expressions, that are collected and executed as a batch job. The collection of R expressions are saved in an ASCII text file with a .R extension. In RStudio, from the main menu, select File, New, and then R Script. This script file will be saved with an .R extension. This script file can be edited and executed within RStudio. Alternatively, we can edit this file using our favorite text editor (e.g., GNU Emacs).

What are the characteristics of good R programming?

  • Use a good text editor (or RStudio) for programming
  • Organize batch jobs into numbered sequential files (e.g., job01.R)
  • Avoid graphical menu-driven approaches

First, use a good text editor (or RStudio) for programming. Each R expression will span one or more lines. Although one could write and submit each line at the R console, this approach is inefficient and not recommended. Instead, type the expressions into your favorite text editor and save with a .R extension. Then, selected expressions or the whole file can be executed in R. Use the text editor that comes with R, or text editors customized to work with R (e.g., RStudio, Emacs with ESS).

Second, we organize batch jobs into sequential files. Data analysis is a series of tasks involving data entry, checking, cleaning, analysis, and reporting. Although data analysts are primarily involved in analysis and reporting, they may be involved in earlier phases of data preparation. Regardless of stage of involvement, data analysts should organize, conduct, and document their analytics tasks and batch jobs in chronological order. For example, batch jobs might be named as follows: job01-cleaning.R, job02-recoding.R, job03-descriptive.R, job04-logistic.R, etc. Naming the program file has two components: ‘jobs’ represent major tasks and are always numbered in chronological order (job01-*.R, job02-*.R, etc.); and a brief descriptor can be appended to the first component of the file name (job01-recode-data.R, job02-bivariate-analysis.R).

If one needs to repeat parts of a previous job, then add new jobs, not edit old jobs. This way our analysis can always be reviewed, replicated, and audited exactly in the order it was conducted. We avoid editing earlier jobs. If we edit previous jobs, then we must rerun all subsequent jobs in chronological order.

Third, we avoid graphical, menu-driven approaches. While this is a tempting approach, our work cannot be documented, replicated, and audited. The best approach is to collect R expressions into batch jobs and run them using the source function.

Fourth, while preliminary and core analyses should be conducted with script (.R) files, the final report and analyses should be compiled with Rmarkdown (.Rmd) files. For now, we focus on script files.

5.2 Intermediate programming

The next level of R programming involves (1) implementing control flow (decision points); (2) implementing dependencies in calculations or data manipulation; and (3) improving execution efficiency,

5.2.1 Control statements

Control flow involves one or more decision points. A simple decision point goes like this: if a condition is TRUE, do {this} and then continue; if it is FALSE, do not do {this} and then continue. When R continues, the next R expression can be any valid expression, including another decision point.

5.2.1.1 The if function

We use the if function to implement single decision points:

if(TRUE) { execute these R expressions }

If the condition if FALSE, R skips the bracketed expression and continues executing subsequent lines. Study this example:

x <- c(1, 2, NA, 4, 5)
y <- c(1, 2, 3, 4, 5)
if(any(is.na(x))) {x[is.na(x)] <- 999}
x
## [1]   1   2 999   4   5
if(any(is.na(y))) {y[is.na(y)] <- 999}
y
## [1] 1 2 3 4 5

The first if condition evaluated to TRUE and the missing value was replaced. The second if condition evaluated to FALSE and the bracketed expressions were not evaluated.

5.2.1.2 The else function

Up to now the if condition had only one possible response. If there are two, mutually exclusive possible responses, add one else statement:

if(TRUE) {
  execute these R expressions
} else {
  execute these R expressions
}

Here is an example:

x <- c(1, 2, NA, 4, 5)
y <- c(1, 2, 3, 4, 5)
if (any(is.na(x))) {
    x[is.na(x)] <- 999
    cat("NAs replaced\n")
} else {
    cat("No missing values; no replacement\n")
}
## NAs replaced
x
## [1]   1   2 999   4   5
if (any(is.na(y))) {
    y[is.na(y)] <- 999
    cat("NAs replaced\n")
} else {
    cat("No missing values; no replacement\n")
}
## No missing values; no replacement
y
## [1] 1 2 3 4 5

Therefore, use the if and else combination if one needs to evaluate one of two possible collection of R expressions.

If one needs to evaluate possibly one of two possible collection of R expressions then use the following pattern:

if(TRUE) {
  execute these R expressions
} else if(TRUE) {
  execute these R expressions
}

The if and else functions can be combined to achieve any desired control flow.

5.2.1.3 The “short circuit” logical operators

The “short circuit” && and || logical operators are used for control flow in if functions. If logical vectors are provided, only the first element of each vector is used. Therefore, for element-wise comparisons of two or more vectors, use the & and | operators but not the && and || operators.21 For if function comparisons use the && and || operators.

Suppose we want to square the elements of a numeric vector but not if it is a matrix.

x <- 1:5
y <- matrix(1:4, 2, 2)
if(is.numeric(x) && !is.matrix(x)) {
  x^2
} else cat("Either not numeric or is a matrix\n")
## [1]  1  4  9 16 25
if(is.numeric(y) && !is.matrix(y)) {
  y^2
} else cat("Either not numeric or is a matrix\n")
## Either not numeric or is a matrix

The && and || operators are called “short circuit” operators because not all its arguments may be evaluated: moving from left to right, only sufficient arguments are evaluated to determine if the if function should return TRUE or FALSE. This can save considerable time if some the arguments are complicated functions that require significant computing time to evaluate to either TRUE or FALSE. In the previous example, because !is.matrix(y) evaluates to FALSE, it was not necessary to evaluate is.numeric(y).

5.2.2 Vectorized approach

An important advantage of R is the availability of functions that perform vectorized calculations. For example, suppose we wish to add to columns of a matrix. Here is one approach:

(tab <- matrix(1:12, 3, 4))
##      [,1] [,2] [,3] [,4]
## [1,]    1    4    7   10
## [2,]    2    5    8   11
## [3,]    3    6    9   12
(colsum <- tab[,1] + tab[,2] + tab[,3] + tab[,4])
## [1] 22 26 30
(colsum2 <- apply(tab, 1, sum))
## [1] 22 26 30

However, this can be accomplished more efficiently using the apply function:

(colsum2 <- apply(tab, 1, sum))
## [1] 22 26 30

In general, we want to use these types of functions (e.g., tapply, sweep, outer, mean, etc.) because they have been optimized to perform vectorized calculations.

5.2.2.1 The ifelse function

The ifelse function is a vectorized element-wise implementation of the if and else functions. We demonstrate using the practical example of recoding a 2-level variable.

sex <- c("M", NA, "F", "F", NA, "M", "F", "M") 
(sex2 <- ifelse(sex=="M", "Male", "Female")) # binary recode
## [1] "Male"   NA       "Female" "Female" NA       "Male"   "Female" "Male"

If an element of sex contains “M” (TRUE), it is recoded to “Male” (in sex2), otherwise (FALSE) it is recoded to “Female”. This assumes that there are only “M”s and “F”s in the data vector.

5.2.3 Looping

Looping is a common programming approach that is discouraged in R because it is inefficient. It is much better to conduct vectorized calculations using existing functions. For example, suppose we want to sum a numeric vector.

x <- c(2,4,6,8,10)
index <- 1:length(x) # create vector of sequential integers
xsum <- 0
for (i in index){
  xsum <- xsum + x[i] 
}
xsum
## [1] 30

A much better approach is to use the sum function:

sum(x)
## [1] 30

Unless it is absolutely necessary, we avoid looping.

Looping is necessary when (1) there is no R function to conduct a vectorized calculation, and (2) when the result of an element depends on the result of a preceding element which may not be known beforehand (e.g., when it is the result of a random process).

5.2.3.1 The for function

The previous example was a for loop. Here is the syntax:

for (i in somevector}{
  do some calcuation with ith element of somevector
}

In the for function R loops and uses the ith element of somevector either directly or indirectly (e.g., indexing another vector). Here is using the vector directly:

for(i in 1:3){
  cat(i^2, "\n")
}
## 1 
## 4 
## 9

The letters object contain the American English alphabet. Here we use an integer vector for indexing letters:

letters # display letters
##  [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q"
## [18] "r" "s" "t" "u" "v" "w" "x" "y" "z"
for(i in 1:3){
  cat(letters[i],"\n")
}
## a 
## b 
## c

Somevector can be any vector:

kids <- c("Tomasito", "Luisito", "Angelita")
for (i in kids) {print(i)}
## [1] "Tomasito"
## [1] "Luisito"
## [1] "Angelita"

5.2.3.2 The while function

The while function will continue to evaluate a collection of R expressions while a condition is true. Here is the syntax:

while(TRUE) {
  execute these R expressions
}

Here is a trivial example:

x <- 1 ; z <- 0
while(z < 5){
  show(z)
  z <- z + x
}
## [1] 0
## [1] 1
## [1] 2
## [1] 3
## [1] 4

The while function is used for optimization functions that are converging to a numerical value.

%TODO

5.2.3.3 The break and next functions

The break expression will break out of a for or while loop if a condition is met, and transfers control to the first statement outside of the inner-most loop. Here is the general syntax:

for (i in somevector}{
  do some calcuation with ith element of somevector
  if(TRUE) break
}

The next expression halts the processing of the current iteration and advances the looping index. Here is the general syntax:

for (i in somevector}{
  do some calcuation with ith element of somevector
  if(TRUE) next
}

Both break and next apply only to the innermost of nested loops.

5.2.3.4 The double for loop

In the next example we nest two for loop to generate a multiplication table for two integer vectors.

x <- 11:15
y <- 16:25
mtab <- matrix(NA, nrow = length(x), ncol = length(y))
rownames(mtab) <- x
colnames(mtab) <- y
for (i in 1:length(x)) {
    for (j in 1:length(y)) {
        mtab[i, j] <- x[i] * y[j]
    }
}
mtab
##     16  17  18  19  20  21  22  23  24  25
## 11 176 187 198 209 220 231 242 253 264 275
## 12 192 204 216 228 240 252 264 276 288 300
## 13 208 221 234 247 260 273 286 299 312 325
## 14 224 238 252 266 280 294 308 322 336 350
## 15 240 255 270 285 300 315 330 345 360 375

Notice that the outer function is a more efficient way of conducting a calculation using all combinations for two numeric vectors.

x <- 11:15; y <- 16:25
mtab2 <- outer(x, y, '*')
rownames(mtab2) <- x; colnames(mtab2) <- y
mtab2
##     16  17  18  19  20  21  22  23  24  25
## 11 176 187 198 209 220 231 242 253 264 275
## 12 192 204 216 228 240 252 264 276 288 300
## 13 208 221 234 247 260 273 286 299 312 325
## 14 224 238 252 266 280 294 308 322 336 350
## 15 240 255 270 285 300 315 330 345 360 375

5.3 Writing R functions

Writing R functions is a great way to create “commands” to customize analyses or repeated operations. Here is the general analysis approach:

  1. Prepare inputs
  2. Do calculations
  3. Collect results

For example, suppose we are writing R code to calculate the odds ratio from a \(2 \times 2\) table with the appropriate format. For this we will use the Oswego outbreak data set.

#### Prepare inputs
odatcsv <- read.csv('~/git/phds/data/oswego/oswego2.csv') 
head(odatcsv)
##   ID Age Sex   MealDate MealTime Ill OnsetDate TimeOnset Baked.Ham Spinach
## 1  2  52   F 04/18/1940 20:00:00   Y 4/19/1940  00:30:00         Y       Y
## 2  3  65   M 04/18/1940 18:30:00   Y 4/19/1940  00:30:00         Y       Y
## 3  4  59   F 04/18/1940 18:30:00   Y 4/19/1940  00:30:00         Y       Y
## 4  6  63   F 04/18/1940 19:30:00   Y 4/18/1940  22:30:00         Y       Y
## 5  7  70   M 04/18/1940 19:30:00   Y 4/18/1940  22:30:00         Y       Y
## 6  8  40   F 04/18/1940 19:30:00   Y 4/19/1940  02:00:00         N       N
##   Mashed.Potatoes Cabbage.Salad Jello Rolls Brown.Bread Milk Coffee Water
## 1               Y             N     N     Y           N    N      Y     N
## 2               Y             Y     N     N           N    N      Y     N
## 3               N             N     N     N           N    N      Y     N
## 4               N             Y     Y     N           N    N      N     Y
## 5               Y             N     Y     Y           Y    N      Y     Y
## 6               N             N     N     N           N    N      N     N
##   Cakes Vanilla.Ice.Cream Chocolate.Ice.Cream Fruit.Salad
## 1     N                 Y                   N           N
## 2     N                 Y                   Y           N
## 3     Y                 Y                   Y           N
## 4     N                 Y                   N           N
## 5     N                 Y                   N           N
## 6     N                 Y                   Y           N
tab1 = xtabs(~ Ill + Spinach, data = odatcsv)
tab1
##    Spinach
## Ill  N  Y
##   N 12 17
##   Y 20 26
aa = tab1[1, 1]
bb = tab1[1, 2]
cc = tab1[2, 1]
dd = tab1[2, 2]

#### Do calculations
crossprod.OR = (aa*dd)/(bb*cc)

#### Collect results
list(data = tab1, odds.ratio = crossprod.OR)
## $data
##    Spinach
## Ill  N  Y
##   N 12 17
##   Y 20 26
## 
## $odds.ratio
## [1] 0.9176471

Now that we are familiar of what it takes to calculate an odds ratio from a 2-way table we can convert the code into a function and load it at the R console. Here is new function:

myOR <- function(x){
  #### Prepare input
  #### x = 2x2 table amenable to cross-product
  aa = x[1, 1]
  bb = x[1, 2]
  cc = x[2, 1]
  dd = x[2, 2]

  #### Do calculations
  crossprod.OR = (aa*dd)/(bb*cc)

  #### Collect results
  list(data = x, odds.ratio = crossprod.OR)
}

Now we can test the function:

tab.test = xtabs(~ Ill + Spinach, data = odatcsv)
myOR(tab.test)
## $data
##    Spinach
## Ill  N  Y
##   N 12 17
##   Y 20 26
## 
## $odds.ratio
## [1] 0.9176471

5.3.1 Arguments with default values

Now suppose we wish to add calculating a 95% confidence interval to this function. We will use the following normal approximation standard error formula for an odds ratio:

\[ SE[\log(OR)] = \sqrt{\frac{1}{A_1} + \frac{1}{B_1} + \frac{1}{A_0} + \frac{1}{B_0}} \]

And here is the (\(1-\alpha\))% confidence interval:

\[ OR_L, OR_U = \exp\lbrace log(OR) \pm Z_{\alpha /2} SE[\log(OR)] \rbrace \]

Here is the improved function:

myOR2 <- function(x, conf.level){
  #### Prepare input
  #### x = 2x2 table amenable to cross-product
  if(missing(conf.level)) stop("Must specify confidence level")
  aa = x[1, 1]
  bb = x[1, 2]
  cc = x[2, 1]
  dd = x[2, 2]

  Z <- qnorm((1 + conf.level)/2)
  
  #### Do calculations
  logOR <- log((aa*dd)/(bb*cc))
  SE.logOR <- sqrt(1/aa + 1/bb + 1/cc + 1/dd)
  OR <- exp(logOR)
  CI <- exp(logOR + c(-1, 1)*Z*SE.logOR)

  #### Collect results
  list(data = x, odds.ratio = OR, conf.int = CI)
}

Notice that conf.level is a new argument, but with no default value. If a user forgets to specify a default value, the following line handles this possibility:

if(missing(conf.level)) stop("Must specify confidence level")

Now we test this function:

tab.test <- xtabs(~ Ill + Spinach, data = odatcsv)
myOR2(tab.test)

## Error in myOR2(tab.test) (from #4) : Must specify confidence level
myOR2(tab.test, 0.95)
## $data
##    Spinach
## Ill  N  Y
##   N 12 17
##   Y 20 26
## 
## $odds.ratio
## [1] 0.9176471
## 
## $conf.int
## [1] 0.3580184 2.3520471

If an argument has a usual value, then specify this as an argument default value:

myOR3 <- function(x, conf.level = 0.95){
  #### Prepare input
  #### x = 2x2 table amenable to cross-product
  aa = x[1, 1]
  bb = x[1, 2]
  cc = x[2, 1]
  dd = x[2, 2]
  Z <- qnorm((1 + conf.level)/2)
  
  #### Do calculations
  logOR <- log((aa*dd)/(bb*cc))
  SE.logOR <- sqrt(1/aa + 1/bb + 1/cc + 1/dd)
  OR <- exp(logOR)
  CI <- exp(logOR + c(-1, 1)*Z*SE.logOR)

  #### Collect results
  list(data = x, odds.ratio = OR, conf.int = CI)
}

We test our new function:

tab.test <- xtabs(~ Ill + Spinach, data = odatcsv)
myOR3(tab.test)
## $data
##    Spinach
## Ill  N  Y
##   N 12 17
##   Y 20 26
## 
## $odds.ratio
## [1] 0.9176471
## 
## $conf.int
## [1] 0.3580184 2.3520471
myOR3(tab.test, 0.90)
## $data
##    Spinach
## Ill  N  Y
##   N 12 17
##   Y 20 26
## 
## $odds.ratio
## [1] 0.9176471
## 
## $conf.int
## [1] 0.4165094 2.0217459

5.3.2 Passing optional arguments using the ... function

On occasion we will have a function nested inside one of our functions and we need to be able to pass optional arguments to this nested function. This commonly occurs when we write functions for customized graphics but only wish to specify some arguments for the nested function and leave the remaining arguments optional. For example, consider this function:

myplot <- function(x, y, type = "b", ...){
  plot(x, y, type = type, ...)
}

When using myplot one only needs to provide x and y arguments. The type option has been set to a default value of “b”. The ... function will pass any optional arguments to the nested plot function. Of course, the optional arguments must be valid options for plot function.

5.4 Lexical scoping

The variables which occur in the body of a function can be divided into three classes; formal parameters, local variables and free variables. The formal parameters of a function are those occurring in the argument list of the function. Their values are determined by the process of binding the actual function arguments to the formal parameters. Local variables are those whose values are determined by the evaluation of expressions in the body of the functions. Variables which are not formal parameters or local variables are called free variables. Free variables become local variables if they are assigned to. Consider the following function definition.

f <- function(x){
  y <- 2 * x
  print(x)
  print(y)
  print(z)
}

In this function, x is a formal parameter, y is a local variable and z is a free variable. In R the value of free variable is resolved by first looking in the environment in which the function was defined. This is called lexical scope. If the free variable is not defined there, R looks in the parent environment. For the function f this would be the global environment (workspace).

To understand the implications of lexical scope consider the following:

rm(list = ls())
ls()

## character(0)

f <- function(x){
    y <- 2 * x
    print(x)
    print(y)
    print(z)
}

f(5)

## [1] 5
## [1] 10
## Error in print(z) : object 'z' not found

z <- 99
f(5)

## [1] 5
## [1] 10
## [1] 99

In the f function z is a free variable. The first time f is executed z is not defined in the function. R looks in the enclosing environment and does not find a value for z and reports an error. However, when a object z is created in the global environment, R is able to find it and uses it.

To summarize, from Roger Peng (CITE):

  1. “If the value of a symbol is not found in the environment in which a function was defined, then the search is continued in the parent environment.
  2. The search continues down the sequence of parent environments until we hit the top-level environment; this usually the global environment (workspace) or the namspace of a package.
  3. After the top-level environment, the search continues down the search list until we hit the empty environment."

Lexical scoping is convenient because it allows nested functions with free variables to run provided the variable has been defined in an enclosing environment. This convenience becomes obvious when one writes many programs. However, there is a danger: an unintended free variable many find an unintended value in an parent environment. This may go undetected because no error is reported. This can happen when there are many objects in the workspace from previous sessions. A good habit is to clear the workspace of all objects at the beginning of every session.

Here is another example from the R introductory manual.22 Consider a function called cube.

cube <- function(n) {
  sq <- function() {n * n}
  n * sq()
}

The variable n in the function sq is not an argument to that function. Therefore it is a free variable and the scoping rules must be used to ascertain the value that is to be associated with it. Under lexical scope (e.g., in R) it is the parameter to the function cube since that is the active binding for the variable n at the time the function sq was defined.

cube(2)
## [1] 8

Here is a variation that demonstrates lexical scoping but with an unexpected numerical answer.

n <- 3
sq <- function() {n * n}
cube <- function(n) {
  n * sq()
}
cube(2)
## [1] 18

Why does cube(2) = 18? The sq function has a free variable n. When sq is evaluated inside of cube, sq must find a value for its free variable n. Under lexical scoping, sq looks in the parent environment when sq was defined. The sq function was defined in the global environment, so it looks, finds, and uses n = 3.

5.4.1 Functions with a mutatable state

Lexical scoping can be used to create a function with a mutable state. Consider the following function that simulates a banking savings account balance with deposits and withdrawals.23

open.account <- function(total) {
    list(deposit = function(amount) {
        if (amount <= 0) stop("Deposits must be positive!\n")
        total <<- total + amount
        cat(amount, "deposited.  Your balance is", total, "\n\n")
    }, withdraw = function(amount) {
        if (amount > total) stop("You don't have that much money!\n")
        total <<- total - amount
        cat(amount, "withdrawn.  Your balance is", total, "\n\n")
    }, balance = function() {
        cat("Your balance is", total, "\n\n")
    })
}

The first total is a formal parameter of the function open.account. There is a list with three functions: deposit, withdraw, and balance. Inside these functions, the first total is a free variable; they find their values in the enclosing environment’s formal parameter (total) which is one level up. The special assignment operator, “<<-”, makes an assignment to the enclosing environment, one level up, updating the value of the formal parameter target.

Let’s open some savings accounts and make deposits and withdrawals.

luis <- open.account(500)
angela <- open.account(100)
luis$withdraw(75)
## 75 withdrawn.  Your balance is 425
angela$deposit(250)
## 250 deposited.  Your balance is 350
luis$balance()
## Your balance is 425
angela$balance()
## Your balance is 350

5.5 Recursive functions

A function that calls itself is a recursive function. Recursive functions break a problem down into small steps that require the same methodologic technique. For example, consider the factorial of 5:

\[ 5! = 5 \times 4! = 5 \times 4 \times 3! = 5 \times 4 \times 3 \times 2! = = 5 \times 4 \times 3 \times 2 \times 1! \]

We can write a recursive function for factorial calculations:

recursive.factorial <- function(x) {
    if (x == 0) {
        return(1)
    } else {
        return(x * recursive.factorial(x - 1))
    }
}
recursive.factorial(5)
## [1] 120
factorial(5)  # compare to native R function
## [1] 120

5.6 Debugging and profiling R code

TODO: pending

5.7 Example: Bootstrap of risk ratio data

Consider the following data (Table 5.1) with the number of coronary heart disease events comparing men with Type A behavior to men with Type B behavior (Selvin 1998). There is no formula for calculating an exact confidence interval for a risk ratio. When the distribution for a statistical estimate is not known we can use the bootstrap method. Essentially, this involves the following steps:

  1. Simulate risk ratio estimates using the data
  2. Estimate the mean of the risk ratios
  3. Choose appropriate quantile values as confidence limits
Table 5.1: Counts of coronary heart disease (CHD) events for Type A and Type B subjects
CHD
Exposure Yes No Total
Type A 178 1411 1589
Type B 79 1486 1565
rr.boot <- function(x1, x0, conf.level = 0.95, replicates = 5000){
  #### x1 = c(number cases among exposed, number of exposed)
  #### x0 = c(number cases among nonexposed, number of nonexposed)
  #### prepare input 
  n1 <- x1[2]; n0 <- x0[2]
  p1 <- x1[1]/n1 # risk among exposed
  p0 <- x0[1]/n0 # risk among unexposed
  #### calculate
  RR <- p1/p0 # maximum likelihood estimate of risk ratio
  r1 <- rbinom(replicates, n1, p1)/n1
  r0 <- rbinom(replicates, n0, p0)/n0
  rrboot <- r1/r0
  rrbar <- mean(rrboot)
  alpha <- 1 - conf.level
  ci <- quantile(rrboot, c(alpha/2, 1-alpha/2))
  #### collect
  list(x = x,
       risks = c(p1 = p1, p0 = p0),
       risk.ratio = RR,
       rrboot.mean = rrbar,
       conf.int = unname(ci),
       conf.level = conf.level,
       replicates = replicates)
}

#### test function
xx1 <- c(178, 1589) # CHD among Type A
xx0 <- c(79, 1565) # CHD among Type B
rr.boot(xx1, xx0)
## $x
## [1] 11 12 13 14 15
## 
## $risks
##         p1         p0 
## 0.11202014 0.05047923 
## 
## $risk.ratio
## [1] 2.219133
## 
## $rrboot.mean
## [1] 2.251069
## 
## $conf.int
## [1] 1.738612 2.897633
## 
## $conf.level
## [1] 0.95
## 
## $replicates
## [1] 5000

5.8 Exercises

5.8.1

Consider Table 5.2 for cohort (risk) data.

Table 5.2: \(2 \times 2\) table for cohort (risk) data
Exposed
Disease Yes No Total
Yes \(a\) \(b\) \(M_1\)
No \(c\) \(d\) \(M_0\)
Total \(N_1\) \(N_0\) \(T\)

The risk ratio is calculated by

\[ RR = \frac{a/N_1}{b/N_0} = \frac{R_1}{R_0}, \]

and the standard error is calculated by

\[ SE(ln(RR)) = \sqrt{\frac{1}{a}-\frac{1}{N_1}+\frac{1}{b}-\frac{1}{N_0}}. \]

Table 5.3 displays the results of a study where 700 patients were given access to a new drug treatment.

Table 5.3: Results of a study where 700 patients were given access to a new drug treatment
Treatment
Recovered Yes No Total
Yes 273 289 562
No 77 61 138
Total 350 350 700

Using the results from Table 5.3 create and test a function that:

  • Calculates the risks \(R_1\) and \(R_0\).
  • Calculates the risk ratio (\(RR\)).
  • Calculates 95% confidence intervals.
  • Uses the fisher.test function (Fisher’s Exact Test) to calculate the two-sided \(p\)-value of the null hypothesis of no association.
  • Collects and returns all the results as a ‘list’.

5.8.2

Using the read.csv function import the data frame for Table 5.3. Use the code below.

drugrx.df <- read.csv('~/git/phds/data/jp-drugrx-p2.txt')

Analyze the jp-drugrx-p2.df data frame using the xtab function, your function from the previous problem, and any R functions you wish. Is the drug treatment effective? Is there a difference in drug treatment effectiveness comparing men and women?

5.8.3

Study this clean version of the Oswego data set and import it as a data frame: ~/git/phds/data/oswego/oswego2.txt

Import this clean version.

oswego2 <- read.csv("~/git/phds/data/oswego/oswego2.txt", strip.white = TRUE)
food.item <- c("Baked.ham", "Spinach", "Mashed.potato", "Cabbage.salad", "Jello", 
    "Rolls", "Brown.bread", "Milk", "Coffee", "Water", "Cakes", "Vanilla.ice.cream", 
    "Chocolate.ice.cream", "Fruit.salad")
names(oswego2)[9:22] <- food.item
str(oswego2)
## 'data.frame':    75 obs. of  22 variables:
##  $ ID                 : int  2 3 4 6 7 8 9 10 14 16 ...
##  $ Age                : int  52 65 59 63 70 40 15 33 10 32 ...
##  $ Sex                : Factor w/ 2 levels "F","M": 1 2 1 1 2 1 1 1 2 1 ...
##  $ MealDate           : Factor w/ 1 level "04/18/1940": 1 1 1 1 1 1 1 1 1 1 ...
##  $ MealTime           : Factor w/ 6 levels "11:00:00","18:30:00",..: 5 2 2 4 4 4 6 3 4 NA ...
##  $ Ill                : Factor w/ 2 levels "N","Y": 2 2 2 2 2 2 2 2 2 2 ...
##  $ OnsetDate          : Factor w/ 2 levels "4/18/1940","4/19/1940": 2 2 2 1 1 2 2 1 2 2 ...
##  $ OnsetTime          : Factor w/ 17 levels "00:00:00","00:30:00",..: 2 2 2 15 15 4 3 16 4 7 ...
##  $ Baked.ham          : Factor w/ 2 levels "N","Y": 2 2 2 2 2 1 1 2 1 2 ...
##  $ Spinach            : Factor w/ 2 levels "N","Y": 2 2 2 2 2 1 1 2 1 2 ...
##  $ Mashed.potato      : Factor w/ 2 levels "N","Y": 2 2 1 1 2 1 1 2 1 1 ...
##  $ Cabbage.salad      : Factor w/ 2 levels "N","Y": 1 2 1 2 1 1 1 1 1 1 ...
##  $ Jello              : Factor w/ 2 levels "N","Y": 1 1 1 2 2 1 1 1 1 1 ...
##  $ Rolls              : Factor w/ 2 levels "N","Y": 2 1 1 1 2 1 1 2 1 2 ...
##  $ Brown.bread        : Factor w/ 2 levels "N","Y": 1 1 1 1 2 1 1 2 1 1 ...
##  $ Milk               : Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Coffee             : Factor w/ 2 levels "N","Y": 2 2 2 1 2 1 1 1 1 2 ...
##  $ Water              : Factor w/ 2 levels "N","Y": 1 1 1 2 2 1 1 2 1 1 ...
##  $ Cakes              : Factor w/ 2 levels "N","Y": 1 1 2 1 1 1 2 1 1 2 ...
##  $ Vanilla.ice.cream  : Factor w/ 2 levels "N","Y": 2 2 2 2 2 2 1 2 2 2 ...
##  $ Chocolate.ice.cream: Factor w/ 2 levels "N","Y": 1 2 2 1 1 2 2 2 2 2 ...
##  $ Fruit.salad        : Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 1 1 ...

Study the following epitable function:

epitable <- function(exposure, outcome, ...){
  #### Updated 2016-10-16
  #### 2x2 table from 'xtabs' function
  #### row names are exposure status
  #### col names are illness outcome
  #### ... = options to fisher.test functions
  x <- xtabs(~ exposure + outcome, na.action=na.pass)
  rowtot <- apply(x, 1, sum)
  rowdist <- sweep(x, 1, rowtot, '/')
  risk0 <- rowdist['N','Y']
  risk1 <- rowdist['Y','Y']
  rr <- risk1/risk0
  or <- (risk1/(1-risk1)) / (risk0/(1-risk0))
  ft <- fisher.test(x, ...)
  pv <- ft$p.value
  results <- unname(round(c(risk0, risk1, rr, or, pv), 2))
  names(results) <- c('R0', 'R1', 'RR', 'OR', 'p.value')
  list(data = x, results = results)
}

Here is an example of testing one food item:

epitable(oswego2$Baked.ham, oswego2$Ill)
## $data
##         outcome
## exposure  N  Y
##        N 12 17
##        Y 17 29
## 
## $results
##      R0      R1      RR      OR p.value 
##    0.59    0.63    1.08    1.20    0.81

Use the epitable function to test each food item to see if it is associated with developing illness. Construct a table with the following column headings:

  • Food item
  • \(R_0\)
  • \(R_1\)
  • \(RR\) (risk ratio)
  • \(OR\) (odds ratio)
  • Fisher \(p\) value

Discuss your findings.

5.8.4

Write a for loop to automate the creation of the outbreak table from previous exercise.

5.8.4.1 Hint

outbreak <- matrix(NA, length(food.item), 5)
rownames(outbreak) <- food.item
colnames(outbreak) <- c('R.0','R.1','RR','OR','P value')
for(i in 1:length(food.item)) {
  outbreak[i,] <- epitable(oswego2[,food.item[i]], oswego2$Ill)$results
}
outbreak

5.8.5

Study the following use of lapply and sapply. Explain how both functions work, and what are their differences.

lapply(oswego2[,food.item], function(x) epitable(x, oswego2$Ill)$results)
## $Baked.ham
##      R0      R1      RR      OR p.value 
##    0.59    0.63    1.08    1.20    0.81 
## 
## $Spinach
##      R0      R1      RR      OR p.value 
##    0.62    0.60    0.97    0.92    1.00 
## 
## $Mashed.potato
##      R0      R1      RR      OR p.value 
##    0.62    0.62    1.00    1.00    1.00 
## 
## $Cabbage.salad
##      R0      R1      RR      OR p.value 
##    0.60    0.64    1.08    1.22    0.81 
## 
## $Jello
##      R0      R1      RR      OR p.value 
##    0.58    0.70    1.21    1.68    0.44 
## 
## $Rolls
##      R0      R1      RR      OR p.value 
##    0.66    0.57    0.86    0.68    0.48 
## 
## $Brown.bread
##      R0      R1      RR      OR p.value 
##    0.58    0.67    1.14    1.43    0.62 
## 
## $Milk
##      R0      R1      RR      OR p.value 
##    0.62    0.50    0.81    0.61    0.64 
## 
## $Coffee
##      R0      R1      RR      OR p.value 
##    0.61    0.61    1.00    1.00    1.00 
## 
## $Water
##      R0      R1      RR      OR p.value 
##    0.65    0.54    0.84    0.64    0.45 
## 
## $Cakes
##      R0      R1      RR      OR p.value 
##    0.54    0.68    1.24    1.75    0.34 
## 
## $Vanilla.ice.cream
##      R0      R1      RR      OR p.value 
##    0.14    0.80    5.57   23.45    0.00 
## 
## $Chocolate.ice.cream
##      R0      R1      RR      OR p.value 
##    0.74    0.53    0.72    0.40    0.09 
## 
## $Fruit.salad
##      R0      R1      RR      OR p.value 
##    0.61    0.67    1.10    1.29    1.00
t(sapply(oswego2[,food.item], function(x) epitable(x, oswego2$Ill)$results))
##                       R0   R1   RR    OR p.value
## Baked.ham           0.59 0.63 1.08  1.20    0.81
## Spinach             0.62 0.60 0.97  0.92    1.00
## Mashed.potato       0.62 0.62 1.00  1.00    1.00
## Cabbage.salad       0.60 0.64 1.08  1.22    0.81
## Jello               0.58 0.70 1.21  1.68    0.44
## Rolls               0.66 0.57 0.86  0.68    0.48
## Brown.bread         0.58 0.67 1.14  1.43    0.62
## Milk                0.62 0.50 0.81  0.61    0.64
## Coffee              0.61 0.61 1.00  1.00    1.00
## Water               0.65 0.54 0.84  0.64    0.45
## Cakes               0.54 0.68 1.24  1.75    0.34
## Vanilla.ice.cream   0.14 0.80 5.57 23.45    0.00
## Chocolate.ice.cream 0.74 0.53 0.72  0.40    0.09
## Fruit.salad         0.61 0.67 1.10  1.29    1.00

5.8.6

Study the following R code and explain the final answer for g(2):

a = 1
b = 2
f <- function(x) {
    a * x + b
}
g <- function(x) {
    a = 2
    b = 1
    f(x)
}
g(2)
## [1] 4

5.8.7

Study the following R code and explain the final answer for g(2):

g <- function(x) {
    a = 2
    b = 1
    f <- function(x) {
        a * x + b
    }
    f(x)
}
g(2)
## [1] 5

Bibliography

Selvin, S. 1998. Modern Applied Biostatistical Methods Using S-Plus. New York: Oxford University Press.