Chapter 5 Programming and R functions
“Good programmers write good code, great programmers borrow good code.”
R is a comprehensive and powerful programming language. In this section we briefly summarize how to use R for introductory programming, including writing and executing functions.
5.1 Basic programming
Basic data science programming in R is just a list of R expressions, that are collected and executed as a batch job. The collection of R expressions are saved in an ASCII text file with a .R
extension. In RStudio, from the main menu, select File
, New
, and then R Script
. This script file will be saved with an .R
extension. This script file can be edited and executed within RStudio. Alternatively, we can edit this file using our favorite text editor (e.g., GNU Emacs).
What are the characteristics of good R programming?
- Use a good text editor (or RStudio) for programming
- Organize batch jobs into numbered sequential files (e.g., job01.R)
- Avoid graphical menu-driven approaches
First, use a good text editor (or RStudio) for programming. Each R expression will span one or more lines. Although one could write and submit each line at the R console, this approach is inefficient and not recommended. Instead, type the expressions into your favorite text editor and save with a .R
extension. Then, selected expressions or the whole file can be executed in R. Use the text editor that comes with R, or text editors customized to work with R (e.g., RStudio, Emacs with ESS).
Second, we organize batch jobs into sequential files. Data analysis is a series of tasks involving data entry, checking, cleaning, analysis, and reporting. Although data analysts are primarily involved in analysis and reporting, they may be involved in earlier phases of data preparation. Regardless of stage of involvement, data analysts should organize, conduct, and document their analytics tasks and batch jobs in chronological order. For example, batch jobs might be named as follows: job01-cleaning.R
, job02-recoding.R
, job03-descriptive.R
, job04-logistic.R
, etc. Naming the program file has two components: ‘jobs’ represent major tasks and are always numbered in chronological order (job01-*.R
, job02-*.R
, etc.); and a brief descriptor can be appended to the first component of the file name (job01-recode-data.R
, job02-bivariate-analysis.R
).
If one needs to repeat parts of a previous job, then add new jobs, not edit old jobs. This way our analysis can always be reviewed, replicated, and audited exactly in the order it was conducted. We avoid editing earlier jobs. If we edit previous jobs, then we must rerun all subsequent jobs in chronological order.
Third, we avoid graphical, menu-driven approaches. While this is a tempting approach, our work cannot be documented, replicated, and audited. The best approach is to collect R expressions into batch jobs and run them using the source
function.
Fourth, while preliminary and core analyses should be conducted with script (.R
) files, the final report and analyses should be compiled with Rmarkdown (.Rmd
) files. For now, we focus on script files.
5.2 Intermediate programming
The next level of R programming involves (1) implementing control flow (decision points); (2) implementing dependencies in calculations or data manipulation; and (3) improving execution efficiency,
5.2.1 Control statements
Control flow involves one or more decision points. A simple decision point goes like this: if a condition is TRUE
, do {this} and then continue; if it is FALSE
, do not do {this} and then continue. When R continues, the next R expression can be any valid expression, including another decision point.
5.2.1.1 The if
function
We use the if
function to implement single decision points:
if(TRUE) { execute these R expressions }
If the condition if FALSE
, R skips the bracketed expression and continues executing subsequent lines. Study this example:
x <- c(1, 2, NA, 4, 5)
y <- c(1, 2, 3, 4, 5)
if(any(is.na(x))) {x[is.na(x)] <- 999}
x
## [1] 1 2 999 4 5
if(any(is.na(y))) {y[is.na(y)] <- 999}
y
## [1] 1 2 3 4 5
The first if
condition evaluated to TRUE
and the missing value was replaced. The second if
condition evaluated to FALSE
and the bracketed expressions were not evaluated.
5.2.1.2 The else
function
Up to now the if
condition had only one possible response. If there are two, mutually exclusive possible responses, add one else
statement:
if(TRUE) {
execute these R expressions
} else {
execute these R expressions
}
Here is an example:
x <- c(1, 2, NA, 4, 5)
y <- c(1, 2, 3, 4, 5)
if (any(is.na(x))) {
x[is.na(x)] <- 999
cat("NAs replaced\n")
} else {
cat("No missing values; no replacement\n")
}
## NAs replaced
x
## [1] 1 2 999 4 5
if (any(is.na(y))) {
y[is.na(y)] <- 999
cat("NAs replaced\n")
} else {
cat("No missing values; no replacement\n")
}
## No missing values; no replacement
y
## [1] 1 2 3 4 5
Therefore, use the if
and else
combination if one needs to evaluate one of two possible collection of R expressions.
If one needs to evaluate possibly one of two possible collection of R expressions then use the following pattern:
if(TRUE) {
execute these R expressions
} else if(TRUE) {
execute these R expressions
}
The if
and else
functions can be combined to achieve any desired control flow.
5.2.1.3 The “short circuit” logical operators
The “short circuit” &&
and ||
logical operators are used for control flow in if
functions. If logical vectors are provided, only the first element of each vector is used. Therefore, for element-wise comparisons of two or more vectors, use the &
and |
operators but not the &&
and ||
operators.21 For if
function comparisons use the &&
and ||
operators.
Suppose we want to square the elements of a numeric vector but not if it is a matrix.
x <- 1:5
y <- matrix(1:4, 2, 2)
if(is.numeric(x) && !is.matrix(x)) {
x^2
} else cat("Either not numeric or is a matrix\n")
## [1] 1 4 9 16 25
if(is.numeric(y) && !is.matrix(y)) {
y^2
} else cat("Either not numeric or is a matrix\n")
## Either not numeric or is a matrix
The &&
and ||
operators are called “short circuit” operators because not all its arguments may be evaluated: moving from left to right, only sufficient arguments are evaluated to determine if the if
function should return TRUE or FALSE. This can save considerable time if some the arguments are complicated functions that require significant computing time to evaluate to either TRUE or FALSE. In the previous example, because !is.matrix(y)
evaluates to FALSE, it was not necessary to evaluate is.numeric(y)
.
5.2.2 Vectorized approach
An important advantage of R is the availability of functions that perform vectorized calculations. For example, suppose we wish to add to columns of a matrix. Here is one approach:
(tab <- matrix(1:12, 3, 4))
## [,1] [,2] [,3] [,4]
## [1,] 1 4 7 10
## [2,] 2 5 8 11
## [3,] 3 6 9 12
(colsum <- tab[,1] + tab[,2] + tab[,3] + tab[,4])
## [1] 22 26 30
(colsum2 <- apply(tab, 1, sum))
## [1] 22 26 30
However, this can be accomplished more efficiently using the apply
function:
(colsum2 <- apply(tab, 1, sum))
## [1] 22 26 30
In general, we want to use these types of functions (e.g., tapply
, sweep
, outer
, mean
, etc.) because they have been optimized to perform vectorized calculations.
5.2.2.1 The ifelse
function
The ifelse
function is a vectorized element-wise implementation of the if
and else
functions. We demonstrate using the practical example of recoding a 2-level variable.
sex <- c("M", NA, "F", "F", NA, "M", "F", "M")
(sex2 <- ifelse(sex=="M", "Male", "Female")) # binary recode
## [1] "Male" NA "Female" "Female" NA "Male" "Female" "Male"
If an element of sex
contains “M” (TRUE), it is recoded to “Male” (in sex2
), otherwise (FALSE) it is recoded to “Female”. This assumes that there are only “M”s and “F”s in the data vector.
5.2.3 Looping
Looping is a common programming approach that is discouraged in R because it is inefficient. It is much better to conduct vectorized calculations using existing functions. For example, suppose we want to sum a numeric vector.
x <- c(2,4,6,8,10)
index <- 1:length(x) # create vector of sequential integers
xsum <- 0
for (i in index){
xsum <- xsum + x[i]
}
xsum
## [1] 30
A much better approach is to use the sum
function:
sum(x)
## [1] 30
Unless it is absolutely necessary, we avoid looping.
Looping is necessary when (1) there is no R function to conduct a vectorized calculation, and (2) when the result of an element depends on the result of a preceding element which may not be known beforehand (e.g., when it is the result of a random process).
5.2.3.1 The for
function
The previous example was a for
loop. Here is the syntax:
for (i in somevector}{
do some calcuation with ith element of somevector
}
In the for
function R loops and uses the ith element of somevector either directly or indirectly (e.g., indexing another vector). Here is using the vector directly:
for(i in 1:3){
cat(i^2, "\n")
}
## 1
## 4
## 9
The letters
object contain the American English alphabet. Here we use an integer vector for indexing letters
:
letters # display letters
## [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q"
## [18] "r" "s" "t" "u" "v" "w" "x" "y" "z"
for(i in 1:3){
cat(letters[i],"\n")
}
## a
## b
## c
Somevector can be any vector:
kids <- c("Tomasito", "Luisito", "Angelita")
for (i in kids) {print(i)}
## [1] "Tomasito"
## [1] "Luisito"
## [1] "Angelita"
5.2.3.2 The while
function
The while
function will continue to evaluate a collection of R expressions while a condition is true. Here is the syntax:
while(TRUE) {
execute these R expressions
}
Here is a trivial example:
x <- 1 ; z <- 0
while(z < 5){
show(z)
z <- z + x
}
## [1] 0
## [1] 1
## [1] 2
## [1] 3
## [1] 4
The while
function is used for optimization functions that are converging to a numerical value.
%TODO
5.2.3.3 The break
and next
functions
The break
expression will break out of a for
or while
loop if a condition is met, and transfers control to the first statement outside of the inner-most loop. Here is the general syntax:
for (i in somevector}{
do some calcuation with ith element of somevector
if(TRUE) break
}
The next
expression halts the processing of the current iteration and advances the looping index. Here is the general syntax:
for (i in somevector}{
do some calcuation with ith element of somevector
if(TRUE) next
}
Both break
and next
apply only to the innermost of nested loops.
5.2.3.4 The double for
loop
In the next example we nest two for
loop to generate a multiplication table for two integer vectors.
x <- 11:15
y <- 16:25
mtab <- matrix(NA, nrow = length(x), ncol = length(y))
rownames(mtab) <- x
colnames(mtab) <- y
for (i in 1:length(x)) {
for (j in 1:length(y)) {
mtab[i, j] <- x[i] * y[j]
}
}
mtab
## 16 17 18 19 20 21 22 23 24 25
## 11 176 187 198 209 220 231 242 253 264 275
## 12 192 204 216 228 240 252 264 276 288 300
## 13 208 221 234 247 260 273 286 299 312 325
## 14 224 238 252 266 280 294 308 322 336 350
## 15 240 255 270 285 300 315 330 345 360 375
Notice that the outer
function is a more efficient way of conducting a calculation using all combinations for two numeric vectors.
x <- 11:15; y <- 16:25
mtab2 <- outer(x, y, '*')
rownames(mtab2) <- x; colnames(mtab2) <- y
mtab2
## 16 17 18 19 20 21 22 23 24 25
## 11 176 187 198 209 220 231 242 253 264 275
## 12 192 204 216 228 240 252 264 276 288 300
## 13 208 221 234 247 260 273 286 299 312 325
## 14 224 238 252 266 280 294 308 322 336 350
## 15 240 255 270 285 300 315 330 345 360 375
5.3 Writing R functions
Writing R functions is a great way to create “commands” to customize analyses or repeated operations. Here is the general analysis approach:
- Prepare inputs
- Do calculations
- Collect results
For example, suppose we are writing R code to calculate the odds ratio from a \(2 \times 2\) table with the appropriate format. For this we will use the Oswego outbreak data set.
#### Prepare inputs
odatcsv <- read.csv('~/git/phds/data/oswego/oswego2.csv')
head(odatcsv)
## ID Age Sex MealDate MealTime Ill OnsetDate TimeOnset Baked.Ham Spinach
## 1 2 52 F 04/18/1940 20:00:00 Y 4/19/1940 00:30:00 Y Y
## 2 3 65 M 04/18/1940 18:30:00 Y 4/19/1940 00:30:00 Y Y
## 3 4 59 F 04/18/1940 18:30:00 Y 4/19/1940 00:30:00 Y Y
## 4 6 63 F 04/18/1940 19:30:00 Y 4/18/1940 22:30:00 Y Y
## 5 7 70 M 04/18/1940 19:30:00 Y 4/18/1940 22:30:00 Y Y
## 6 8 40 F 04/18/1940 19:30:00 Y 4/19/1940 02:00:00 N N
## Mashed.Potatoes Cabbage.Salad Jello Rolls Brown.Bread Milk Coffee Water
## 1 Y N N Y N N Y N
## 2 Y Y N N N N Y N
## 3 N N N N N N Y N
## 4 N Y Y N N N N Y
## 5 Y N Y Y Y N Y Y
## 6 N N N N N N N N
## Cakes Vanilla.Ice.Cream Chocolate.Ice.Cream Fruit.Salad
## 1 N Y N N
## 2 N Y Y N
## 3 Y Y Y N
## 4 N Y N N
## 5 N Y N N
## 6 N Y Y N
tab1 = xtabs(~ Ill + Spinach, data = odatcsv)
tab1
## Spinach
## Ill N Y
## N 12 17
## Y 20 26
aa = tab1[1, 1]
bb = tab1[1, 2]
cc = tab1[2, 1]
dd = tab1[2, 2]
#### Do calculations
crossprod.OR = (aa*dd)/(bb*cc)
#### Collect results
list(data = tab1, odds.ratio = crossprod.OR)
## $data
## Spinach
## Ill N Y
## N 12 17
## Y 20 26
##
## $odds.ratio
## [1] 0.9176471
Now that we are familiar of what it takes to calculate an odds ratio from a 2-way table we can convert the code into a function and load it at the R console. Here is new function:
myOR <- function(x){
#### Prepare input
#### x = 2x2 table amenable to cross-product
aa = x[1, 1]
bb = x[1, 2]
cc = x[2, 1]
dd = x[2, 2]
#### Do calculations
crossprod.OR = (aa*dd)/(bb*cc)
#### Collect results
list(data = x, odds.ratio = crossprod.OR)
}
Now we can test the function:
tab.test = xtabs(~ Ill + Spinach, data = odatcsv)
myOR(tab.test)
## $data
## Spinach
## Ill N Y
## N 12 17
## Y 20 26
##
## $odds.ratio
## [1] 0.9176471
5.3.1 Arguments with default values
Now suppose we wish to add calculating a 95% confidence interval to this function. We will use the following normal approximation standard error formula for an odds ratio:
\[ SE[\log(OR)] = \sqrt{\frac{1}{A_1} + \frac{1}{B_1} + \frac{1}{A_0} + \frac{1}{B_0}} \]
And here is the (\(1-\alpha\))% confidence interval:
\[ OR_L, OR_U = \exp\lbrace log(OR) \pm Z_{\alpha /2} SE[\log(OR)] \rbrace \]
Here is the improved function:
myOR2 <- function(x, conf.level){
#### Prepare input
#### x = 2x2 table amenable to cross-product
if(missing(conf.level)) stop("Must specify confidence level")
aa = x[1, 1]
bb = x[1, 2]
cc = x[2, 1]
dd = x[2, 2]
Z <- qnorm((1 + conf.level)/2)
#### Do calculations
logOR <- log((aa*dd)/(bb*cc))
SE.logOR <- sqrt(1/aa + 1/bb + 1/cc + 1/dd)
OR <- exp(logOR)
CI <- exp(logOR + c(-1, 1)*Z*SE.logOR)
#### Collect results
list(data = x, odds.ratio = OR, conf.int = CI)
}
Notice that conf.level
is a new argument, but with no default value. If a user forgets to specify a default value, the following line handles this possibility:
if(missing(conf.level)) stop("Must specify confidence level")
Now we test this function:
tab.test <- xtabs(~ Ill + Spinach, data = odatcsv)
myOR2(tab.test)
## Error in myOR2(tab.test) (from #4) : Must specify confidence level
myOR2(tab.test, 0.95)
## $data
## Spinach
## Ill N Y
## N 12 17
## Y 20 26
##
## $odds.ratio
## [1] 0.9176471
##
## $conf.int
## [1] 0.3580184 2.3520471
If an argument has a usual value, then specify this as an argument default value:
myOR3 <- function(x, conf.level = 0.95){
#### Prepare input
#### x = 2x2 table amenable to cross-product
aa = x[1, 1]
bb = x[1, 2]
cc = x[2, 1]
dd = x[2, 2]
Z <- qnorm((1 + conf.level)/2)
#### Do calculations
logOR <- log((aa*dd)/(bb*cc))
SE.logOR <- sqrt(1/aa + 1/bb + 1/cc + 1/dd)
OR <- exp(logOR)
CI <- exp(logOR + c(-1, 1)*Z*SE.logOR)
#### Collect results
list(data = x, odds.ratio = OR, conf.int = CI)
}
We test our new function:
tab.test <- xtabs(~ Ill + Spinach, data = odatcsv)
myOR3(tab.test)
## $data
## Spinach
## Ill N Y
## N 12 17
## Y 20 26
##
## $odds.ratio
## [1] 0.9176471
##
## $conf.int
## [1] 0.3580184 2.3520471
myOR3(tab.test, 0.90)
## $data
## Spinach
## Ill N Y
## N 12 17
## Y 20 26
##
## $odds.ratio
## [1] 0.9176471
##
## $conf.int
## [1] 0.4165094 2.0217459
5.3.2 Passing optional arguments using the ...
function
On occasion we will have a function nested inside one of our functions and we need to be able to pass optional arguments to this nested function. This commonly occurs when we write functions for customized graphics but only wish to specify some arguments for the nested function and leave the remaining arguments optional. For example, consider this function:
myplot <- function(x, y, type = "b", ...){
plot(x, y, type = type, ...)
}
When using myplot
one only needs to provide x
and y
arguments. The type
option has been set to a default value of “b”. The ...
function will pass any optional arguments to the nested plot
function. Of course, the optional arguments must be valid options for plot
function.
5.4 Lexical scoping
The variables which occur in the body of a function can be divided into three classes; formal parameters, local variables and free variables. The formal parameters of a function are those occurring in the argument list of the function. Their values are determined by the process of binding the actual function arguments to the formal parameters. Local variables are those whose values are determined by the evaluation of expressions in the body of the functions. Variables which are not formal parameters or local variables are called free variables. Free variables become local variables if they are assigned to. Consider the following function definition.
f <- function(x){
y <- 2 * x
print(x)
print(y)
print(z)
}
In this function, x
is a formal parameter, y
is a local variable and z
is a free variable. In R the value of free variable is resolved by first looking in the environment in which the function was defined. This is called lexical scope. If the free variable is not defined there, R looks in the parent environment. For the function f
this would be the global environment (workspace).
To understand the implications of lexical scope consider the following:
rm(list = ls())
ls()
## character(0)
f <- function(x){
y <- 2 * x
print(x)
print(y)
print(z)
}
f(5)
## [1] 5
## [1] 10
## Error in print(z) : object 'z' not found
z <- 99
f(5)
## [1] 5
## [1] 10
## [1] 99
In the f
function z
is a free variable. The first time f
is executed z
is not defined in the function. R looks in the enclosing environment and does not find a value for z
and reports an error. However, when a object z
is created in the global environment, R is able to find it and uses it.
To summarize, from Roger Peng (CITE):
- “If the value of a symbol is not found in the environment in which a function was defined, then the search is continued in the parent environment.
- The search continues down the sequence of parent environments until we hit the top-level environment; this usually the global environment (workspace) or the namspace of a package.
- After the top-level environment, the search continues down the search list until we hit the empty environment."
Lexical scoping is convenient because it allows nested functions with free variables to run provided the variable has been defined in an enclosing environment. This convenience becomes obvious when one writes many programs. However, there is a danger: an unintended free variable many find an unintended value in an parent environment. This may go undetected because no error is reported. This can happen when there are many objects in the workspace from previous sessions. A good habit is to clear the workspace of all objects at the beginning of every session.
Here is another example from the R introductory manual.22 Consider a function called cube
.
cube <- function(n) {
sq <- function() {n * n}
n * sq()
}
The variable n
in the function sq
is not an argument to that function. Therefore it is a free variable and the scoping rules must be used to ascertain the value that is to be associated with it. Under lexical scope (e.g., in R) it is the parameter to the function cube since that is the active binding for the variable n
at the time the function sq
was defined.
cube(2)
## [1] 8
Here is a variation that demonstrates lexical scoping but with an unexpected numerical answer.
n <- 3
sq <- function() {n * n}
cube <- function(n) {
n * sq()
}
cube(2)
## [1] 18
Why does cube(2) = 18
? The sq
function has a free variable n
. When sq
is evaluated inside of cube
, sq
must find a value for its free variable n
. Under lexical scoping, sq
looks in the parent environment when sq
was defined. The sq
function was defined in the global environment, so it looks, finds, and uses n = 3
.
5.4.1 Functions with a mutatable state
Lexical scoping can be used to create a function with a mutable state. Consider the following function that simulates a banking savings account balance with deposits and withdrawals.23
open.account <- function(total) {
list(deposit = function(amount) {
if (amount <= 0) stop("Deposits must be positive!\n")
total <<- total + amount
cat(amount, "deposited. Your balance is", total, "\n\n")
}, withdraw = function(amount) {
if (amount > total) stop("You don't have that much money!\n")
total <<- total - amount
cat(amount, "withdrawn. Your balance is", total, "\n\n")
}, balance = function() {
cat("Your balance is", total, "\n\n")
})
}
The first total
is a formal parameter of the function open.account
. There is a list with three functions: deposit
, withdraw
, and balance
. Inside these functions, the first total
is a free variable; they find their values in the enclosing environment’s formal parameter (total
) which is one level up. The special assignment operator, “<<-
”, makes an assignment to the enclosing environment, one level up, updating the value of the formal parameter target
.
Let’s open some savings accounts and make deposits and withdrawals.
luis <- open.account(500)
angela <- open.account(100)
luis$withdraw(75)
## 75 withdrawn. Your balance is 425
angela$deposit(250)
## 250 deposited. Your balance is 350
luis$balance()
## Your balance is 425
angela$balance()
## Your balance is 350
5.5 Recursive functions
A function that calls itself is a recursive function. Recursive functions break a problem down into small steps that require the same methodologic technique. For example, consider the factorial of 5:
\[ 5! = 5 \times 4! = 5 \times 4 \times 3! = 5 \times 4 \times 3 \times 2! = = 5 \times 4 \times 3 \times 2 \times 1! \]
We can write a recursive function for factorial calculations:
recursive.factorial <- function(x) {
if (x == 0) {
return(1)
} else {
return(x * recursive.factorial(x - 1))
}
}
recursive.factorial(5)
## [1] 120
factorial(5) # compare to native R function
## [1] 120
5.6 Debugging and profiling R code
TODO: pending
5.7 Example: Bootstrap of risk ratio data
Consider the following data (Table 5.1) with the number of coronary heart disease events comparing men with Type A behavior to men with Type B behavior (Selvin 1998). There is no formula for calculating an exact confidence interval for a risk ratio. When the distribution for a statistical estimate is not known we can use the bootstrap method. Essentially, this involves the following steps:
- Simulate risk ratio estimates using the data
- Estimate the mean of the risk ratios
- Choose appropriate quantile values as confidence limits
CHD | |||
Exposure | Yes | No | Total |
Type A | 178 | 1411 | 1589 |
Type B | 79 | 1486 | 1565 |
rr.boot <- function(x1, x0, conf.level = 0.95, replicates = 5000){
#### x1 = c(number cases among exposed, number of exposed)
#### x0 = c(number cases among nonexposed, number of nonexposed)
#### prepare input
n1 <- x1[2]; n0 <- x0[2]
p1 <- x1[1]/n1 # risk among exposed
p0 <- x0[1]/n0 # risk among unexposed
#### calculate
RR <- p1/p0 # maximum likelihood estimate of risk ratio
r1 <- rbinom(replicates, n1, p1)/n1
r0 <- rbinom(replicates, n0, p0)/n0
rrboot <- r1/r0
rrbar <- mean(rrboot)
alpha <- 1 - conf.level
ci <- quantile(rrboot, c(alpha/2, 1-alpha/2))
#### collect
list(x = x,
risks = c(p1 = p1, p0 = p0),
risk.ratio = RR,
rrboot.mean = rrbar,
conf.int = unname(ci),
conf.level = conf.level,
replicates = replicates)
}
#### test function
xx1 <- c(178, 1589) # CHD among Type A
xx0 <- c(79, 1565) # CHD among Type B
rr.boot(xx1, xx0)
## $x
## [1] 11 12 13 14 15
##
## $risks
## p1 p0
## 0.11202014 0.05047923
##
## $risk.ratio
## [1] 2.219133
##
## $rrboot.mean
## [1] 2.251069
##
## $conf.int
## [1] 1.738612 2.897633
##
## $conf.level
## [1] 0.95
##
## $replicates
## [1] 5000
5.8 Exercises
5.8.1
Consider Table 5.2 for cohort (risk) data.
Exposed | |||
Disease | Yes | No | Total |
Yes | \(a\) | \(b\) | \(M_1\) |
No | \(c\) | \(d\) | \(M_0\) |
Total | \(N_1\) | \(N_0\) | \(T\) |
The risk ratio is calculated by
\[ RR = \frac{a/N_1}{b/N_0} = \frac{R_1}{R_0}, \]
and the standard error is calculated by
\[ SE(ln(RR)) = \sqrt{\frac{1}{a}-\frac{1}{N_1}+\frac{1}{b}-\frac{1}{N_0}}. \]
Table 5.3 displays the results of a study where 700 patients were given access to a new drug treatment.
Treatment | |||
Recovered | Yes | No | Total |
Yes | 273 | 289 | 562 |
No | 77 | 61 | 138 |
Total | 350 | 350 | 700 |
Using the results from Table 5.3 create and test a function that:
- Calculates the risks \(R_1\) and \(R_0\).
- Calculates the risk ratio (\(RR\)).
- Calculates 95% confidence intervals.
- Uses the
fisher.test
function (Fisher’s Exact Test) to calculate the two-sided \(p\)-value of the null hypothesis of no association. - Collects and returns all the results as a ‘list’.
5.8.2
Using the read.csv
function import the data frame for Table 5.3. Use the code below.
drugrx.df <- read.csv('~/git/phds/data/jp-drugrx-p2.txt')
Analyze the jp-drugrx-p2.df
data frame using the xtab
function, your function from the previous problem, and any R functions you wish. Is the drug treatment effective? Is there a difference in drug treatment effectiveness comparing men and women?
5.8.3
Study this clean version of the Oswego data set and import it as a data frame: ~/git/phds/data/oswego/oswego2.txt
Import this clean version.
oswego2 <- read.csv("~/git/phds/data/oswego/oswego2.txt", strip.white = TRUE)
food.item <- c("Baked.ham", "Spinach", "Mashed.potato", "Cabbage.salad", "Jello",
"Rolls", "Brown.bread", "Milk", "Coffee", "Water", "Cakes", "Vanilla.ice.cream",
"Chocolate.ice.cream", "Fruit.salad")
names(oswego2)[9:22] <- food.item
str(oswego2)
## 'data.frame': 75 obs. of 22 variables:
## $ ID : int 2 3 4 6 7 8 9 10 14 16 ...
## $ Age : int 52 65 59 63 70 40 15 33 10 32 ...
## $ Sex : Factor w/ 2 levels "F","M": 1 2 1 1 2 1 1 1 2 1 ...
## $ MealDate : Factor w/ 1 level "04/18/1940": 1 1 1 1 1 1 1 1 1 1 ...
## $ MealTime : Factor w/ 6 levels "11:00:00","18:30:00",..: 5 2 2 4 4 4 6 3 4 NA ...
## $ Ill : Factor w/ 2 levels "N","Y": 2 2 2 2 2 2 2 2 2 2 ...
## $ OnsetDate : Factor w/ 2 levels "4/18/1940","4/19/1940": 2 2 2 1 1 2 2 1 2 2 ...
## $ OnsetTime : Factor w/ 17 levels "00:00:00","00:30:00",..: 2 2 2 15 15 4 3 16 4 7 ...
## $ Baked.ham : Factor w/ 2 levels "N","Y": 2 2 2 2 2 1 1 2 1 2 ...
## $ Spinach : Factor w/ 2 levels "N","Y": 2 2 2 2 2 1 1 2 1 2 ...
## $ Mashed.potato : Factor w/ 2 levels "N","Y": 2 2 1 1 2 1 1 2 1 1 ...
## $ Cabbage.salad : Factor w/ 2 levels "N","Y": 1 2 1 2 1 1 1 1 1 1 ...
## $ Jello : Factor w/ 2 levels "N","Y": 1 1 1 2 2 1 1 1 1 1 ...
## $ Rolls : Factor w/ 2 levels "N","Y": 2 1 1 1 2 1 1 2 1 2 ...
## $ Brown.bread : Factor w/ 2 levels "N","Y": 1 1 1 1 2 1 1 2 1 1 ...
## $ Milk : Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 1 1 ...
## $ Coffee : Factor w/ 2 levels "N","Y": 2 2 2 1 2 1 1 1 1 2 ...
## $ Water : Factor w/ 2 levels "N","Y": 1 1 1 2 2 1 1 2 1 1 ...
## $ Cakes : Factor w/ 2 levels "N","Y": 1 1 2 1 1 1 2 1 1 2 ...
## $ Vanilla.ice.cream : Factor w/ 2 levels "N","Y": 2 2 2 2 2 2 1 2 2 2 ...
## $ Chocolate.ice.cream: Factor w/ 2 levels "N","Y": 1 2 2 1 1 2 2 2 2 2 ...
## $ Fruit.salad : Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 1 1 ...
Study the following epitable
function:
epitable <- function(exposure, outcome, ...){
#### Updated 2016-10-16
#### 2x2 table from 'xtabs' function
#### row names are exposure status
#### col names are illness outcome
#### ... = options to fisher.test functions
x <- xtabs(~ exposure + outcome, na.action=na.pass)
rowtot <- apply(x, 1, sum)
rowdist <- sweep(x, 1, rowtot, '/')
risk0 <- rowdist['N','Y']
risk1 <- rowdist['Y','Y']
rr <- risk1/risk0
or <- (risk1/(1-risk1)) / (risk0/(1-risk0))
ft <- fisher.test(x, ...)
pv <- ft$p.value
results <- unname(round(c(risk0, risk1, rr, or, pv), 2))
names(results) <- c('R0', 'R1', 'RR', 'OR', 'p.value')
list(data = x, results = results)
}
Here is an example of testing one food item:
epitable(oswego2$Baked.ham, oswego2$Ill)
## $data
## outcome
## exposure N Y
## N 12 17
## Y 17 29
##
## $results
## R0 R1 RR OR p.value
## 0.59 0.63 1.08 1.20 0.81
Use the epitable
function to test each food item to see if it is associated with developing illness. Construct a table with the following column headings:
- Food item
- \(R_0\)
- \(R_1\)
- \(RR\) (risk ratio)
- \(OR\) (odds ratio)
- Fisher \(p\) value
Discuss your findings.
5.8.4
Write a for
loop to automate the creation of the outbreak table from previous exercise.
5.8.4.1 Hint
outbreak <- matrix(NA, length(food.item), 5)
rownames(outbreak) <- food.item
colnames(outbreak) <- c('R.0','R.1','RR','OR','P value')
for(i in 1:length(food.item)) {
outbreak[i,] <- epitable(oswego2[,food.item[i]], oswego2$Ill)$results
}
outbreak
5.8.5
Study the following use of lapply
and sapply
. Explain how both functions work, and what are their differences.
lapply(oswego2[,food.item], function(x) epitable(x, oswego2$Ill)$results)
## $Baked.ham
## R0 R1 RR OR p.value
## 0.59 0.63 1.08 1.20 0.81
##
## $Spinach
## R0 R1 RR OR p.value
## 0.62 0.60 0.97 0.92 1.00
##
## $Mashed.potato
## R0 R1 RR OR p.value
## 0.62 0.62 1.00 1.00 1.00
##
## $Cabbage.salad
## R0 R1 RR OR p.value
## 0.60 0.64 1.08 1.22 0.81
##
## $Jello
## R0 R1 RR OR p.value
## 0.58 0.70 1.21 1.68 0.44
##
## $Rolls
## R0 R1 RR OR p.value
## 0.66 0.57 0.86 0.68 0.48
##
## $Brown.bread
## R0 R1 RR OR p.value
## 0.58 0.67 1.14 1.43 0.62
##
## $Milk
## R0 R1 RR OR p.value
## 0.62 0.50 0.81 0.61 0.64
##
## $Coffee
## R0 R1 RR OR p.value
## 0.61 0.61 1.00 1.00 1.00
##
## $Water
## R0 R1 RR OR p.value
## 0.65 0.54 0.84 0.64 0.45
##
## $Cakes
## R0 R1 RR OR p.value
## 0.54 0.68 1.24 1.75 0.34
##
## $Vanilla.ice.cream
## R0 R1 RR OR p.value
## 0.14 0.80 5.57 23.45 0.00
##
## $Chocolate.ice.cream
## R0 R1 RR OR p.value
## 0.74 0.53 0.72 0.40 0.09
##
## $Fruit.salad
## R0 R1 RR OR p.value
## 0.61 0.67 1.10 1.29 1.00
t(sapply(oswego2[,food.item], function(x) epitable(x, oswego2$Ill)$results))
## R0 R1 RR OR p.value
## Baked.ham 0.59 0.63 1.08 1.20 0.81
## Spinach 0.62 0.60 0.97 0.92 1.00
## Mashed.potato 0.62 0.62 1.00 1.00 1.00
## Cabbage.salad 0.60 0.64 1.08 1.22 0.81
## Jello 0.58 0.70 1.21 1.68 0.44
## Rolls 0.66 0.57 0.86 0.68 0.48
## Brown.bread 0.58 0.67 1.14 1.43 0.62
## Milk 0.62 0.50 0.81 0.61 0.64
## Coffee 0.61 0.61 1.00 1.00 1.00
## Water 0.65 0.54 0.84 0.64 0.45
## Cakes 0.54 0.68 1.24 1.75 0.34
## Vanilla.ice.cream 0.14 0.80 5.57 23.45 0.00
## Chocolate.ice.cream 0.74 0.53 0.72 0.40 0.09
## Fruit.salad 0.61 0.67 1.10 1.29 1.00
5.8.6
Study the following R code and explain the final answer for g(2)
:
a = 1
b = 2
f <- function(x) {
a * x + b
}
g <- function(x) {
a = 2
b = 1
f(x)
}
g(2)
## [1] 4
5.8.7
Study the following R code and explain the final answer for g(2)
:
g <- function(x) {
a = 2
b = 1
f <- function(x) {
a * x + b
}
f(x)
}
g(2)
## [1] 5
Bibliography
Selvin, S. 1998. Modern Applied Biostatistical Methods Using S-Plus. New York: Oxford University Press.