Chapter 5 Programming with R—An introduction
“Good programmers write good code, great programmers steal great code.”
R is a comprehensive and powerful programming language. In this section we briefly summarize how to use R for introductory programming, including writing and executing functions.
5.1 Basic programming
Basic data science programming in R is just a list of R expressions,
that are collected and executed as a batch job. The collection of R
expressions are saved in an ASCII text file with a .R
extension. In
RStudio, from the main menu, select File
, New
, and then R Script
. This script file will be saved with an .R
extension. This
script file can be edited and executed within RStudio. Alternatively,
we can edit this file using our favorite text editor (e.g., GNU
Emacs).
What are the characteristics of good R programming?
- Use a good text editor (or RStudio) for programming
- Organize batch jobs into numbered sequential files (e.g., job01_task.R)
- Avoid graphical menu-driven approaches
First, use a good text editor (or RStudio) for programming. Each R
expression will span one or more lines. Although one could write and
submit each line at the R console, this approach is inefficient and
not recommended. Instead, type the expressions into your favorite
text editor and save with a .R
extension. Then, selected
expressions or the whole file can be executed in R. Use the text
editor that comes with R, or text editors customized to work with R
(e.g., RStudio, GNU Emacs with ESS34).
Second, we organize batch jobs into sequential files. Data analysis
is a series of tasks involving data entry, checking, cleaning,
analysis, and reporting. Although data analysts are primarily
involved in analysis and reporting, they may be involved in earlier
phases of data preparation. Regardless of stage of involvement, data
analysts should organize, conduct, and document their analytic tasks
and batch jobs in chronological order. For example, batch jobs might
be named as follows: job01_cleaning.R
, job02_recoding.R
,
job03_descriptive.R
, job04_logistic.R
, etc. Naming the program
file has two components: jobs
represent major tasks and are always
numbered in chronological order (job01_*.R
, job02_*.R
, etc.); and
a brief descriptor can be appended to the first component of the file
name (job01_recode_data.R
, job02_bivariate_analysis.R
).
If one needs to repeat parts of a previous job, then add new jobs, not edit old jobs. This way our analysis can always be reviewed, replicated, and audited exactly in the order it was conducted, including corrections. We avoid editing earlier jobs. If we edit previous jobs, then we must rerun all subsequent jobs in chronological order to double check our work.
Third, we avoid graphical, menu-driven approaches. While this is a
tempting approach, our work cannot be documented, replicated, and
audited. The best approach is to collect R expressions into batch
jobs and run them using the source
function.
Fourth, while preliminary and core analyses should be conducted with
script (.R
) files, the final report and analyses should be compiled
with Rmarkdown (.Rmd
) files. For now, we focus on script files.
More recently, data scientists are using R Markdown or notebook style of conducting and documenting analyses. This is fine but be sure to follow the principles above.
5.2 Intermediate programming
The next level of R programming involves (1) implementing control flow (decision points); (2) implementing dependencies in calculations or data manipulation; and (3) improving execution efficiency,
5.2.1 Control statements
Control flow involves one or more decision points. A simple decision
point goes like this: if a condition is TRUE
, do {this} and then
continue; if it is FALSE
, do not do {this} and then continue.
When R continues, the next R expression can be any valid expression,
including another decision point.
5.2.1.1 The if
function
We use the if
function to implement single decision points:
If the condition if FALSE
, R skips the bracketed expression and
continues executing subsequent lines. Study this example:
#> [1] 1 2 999 4 5
#> [1] 1 2 3 4 5
The first if
condition evaluated to TRUE
and the missing value was
replaced. The second if
condition evaluated to FALSE
and the
bracketed expressions were not evaluated.
5.2.1.2 The else
function
Up to now the if
condition had only one possible response. For two,
mutually exclusive responses we add the else
function
Here is an example:
x <- c(1, 2, NA, 4, 5)
y <- c(1, 2, 3, 4, 5)
if (any(is.na(x))) {
x[is.na(x)] <- 999
cat("NAs replaced\n")
} else {
cat("No missing values; no replacement\n")
}
#> NAs replaced
#> [1] 1 2 999 4 5
if (any(is.na(y))) {
y[is.na(y)] <- 999
cat("NAs replaced\n")
} else {
cat("No missing values; no replacement\n")
}
#> No missing values; no replacement
#> [1] 1 2 3 4 5
Therefore, use the if
and else
combination if one needs to
evaluate one of two possible collection of R expressions.
If one needs to evaluate possibly one of two possible collection of R expressions then use the following pattern:
The if
and else
functions can be combined to achieve
any desired control flow.
5.2.1.3 The “short circuit” logical operators
The “short circuit” &&
and ||
logical operators are used for
control flow in if
functions. If logical vectors are provided, only
the first element of each vector is used. In constrast, for
element-wise comparisons of two or more vectors, use the &
and |
operators but not the &&
and ||
operators.35 For if
function comparisons use the &&
and ||
operators.
Suppose we want to square the elements of a numeric vector but not if it is a matrix.
x <- 1:5
y <- matrix(1:4, 2, 2)
if(!is.matrix(x) && is.numeric(x) && is.vector(x)) {
x^2
} else cat("Either not numeric, not a vector, or is a matrix\n")
#> [1] 1 4 9 16 25
if(!is.matrix(y) && is.numeric(y) && is.vector(y)) {
y^2
} else cat("Either not numeric, not a vector, or is a matrix\n")
#> Either not numeric, not a vector, or is a matrix
The &&
and ||
operators are called “short circuit” operators
because not all its arguments may be evaluated: moving from left to
right, only sufficient arguments are evaluated to determine if the
if
function should return TRUE or FALSE. This can save considerable
time if some the arguments are complicated functions that require
significant computing time to evaluate to either TRUE or FALSE. In
the previous example, because !is.matrix(y)
evaluates to FALSE, it
was not necessary to evaluate is.numeric(y)
or is.vector(y)
.
5.2.2 Vectorized approach
An important advantage of R is the availability of functions that perform vectorized calculations. For example, suppose we wish to add to columns of a matrix. Here is one approach:
#> [1] 22 26 30
However, this can be accomplished more efficiently using the
apply
function:
#> [1] 22 26 30
In general, we want to use these types of functions (e.g., tapply
,
sweep
, outer
, mean
, etc.) because they have been optimized to
perform vectorized calculations.
5.2.2.1 The ifelse
function
The ifelse
function is a vectorized, element-wise
implementation of the if
and else
functions. We
demonstrate using the practical example of recoding a 2-level
variable.
sex <- c("M", NA, "F", "F", NA, "M", "F", "M")
(sex2 <- ifelse(sex=="M", "Male", "Female")) # binary recode
#> [1] "Male" NA "Female" "Female" NA "Male"
#> [7] "Female" "Male"
If an element of sex
contains “M” (TRUE), it is recoded to
“Male” (in sex2
), otherwise (FALSE) it is recoded to
“Female”. This assumes that there are only “M”s and “F”s in the data
vector.
5.2.3 Looping
Looping is a common programming approach that is discouraged in R because it is inefficient. It is much better to conduct vectorized calculations using existing functions. For example, suppose we want to sum a numeric vector.
x <- c(2,4,6,8,10)
index <- 1:length(x) # create vector of sequential integers
xsum <- 0
for (i in index){
xsum <- xsum + x[i]
}
xsum
#> [1] 30
A much better approach is to use the sum
function:
#> [1] 30
Unless it is absolutely necessary, we avoid looping.
Looping is necessary when (1) there is no R function to conduct a vectorized calculation, and (2) when the result of an element depends on the result of a preceding element which may not be known beforehand (e.g., when it is the result of a random process).
5.2.3.1 The for
function
The previous example was a for
loop. Here is the syntax:
In the for
function R loops and uses the ith element of somevector
either directly or indirectly (e.g., indexing another vector). Here
is using the vector directly:
#> 1
#> 4
#> 9
The letters
object contain the American English alphabet. Here we
use an integer vector for indexing letters
:
#> [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o"
#> [16] "p" "q" "r" "s" "t" "u" "v" "w" "x" "y" "z"
#> a
#> b
#> c
Somevector can be any vector:
#> [1] "Tomasito"
#> [1] "Luisito"
#> [1] "Angelita"
5.2.3.2 The while
function
The while
function will continue to evaluate a collection of
R expressions while a condition is true. Here is the syntax:
Here is a trivial example:
#> [1] 0
#> [1] 1
#> [1] 2
#> [1] 3
#> [1] 4
The while
function is used for optimization functions that
are converging to a numerical value.
5.2.3.3 The break
and next
functions
The break
expression will break out of a for
or
while
loop if a condition is met, and transfers control to
the first statement outside of the inner-most loop. Here is the
general syntax:
The next
expression halts the processing of the current
iteration and advances the looping index. Here is the
general syntax:
Both break
and next
apply only to the innermost of
nested loops.
5.2.3.4 The double for
loop
In the next example we nest two for
loop to generate a
multiplication table for two integer vectors.
x <- 11:15
y <- 16:25
mtab <- matrix(NA, nrow = length(x), ncol = length(y))
rownames(mtab) <- x
colnames(mtab) <- y
for (i in 1:length(x)) {
for (j in 1:length(y)) {
mtab[i, j] <- x[i] * y[j]
}
}
mtab
#> 16 17 18 19 20 21 22 23 24 25
#> 11 176 187 198 209 220 231 242 253 264 275
#> 12 192 204 216 228 240 252 264 276 288 300
#> 13 208 221 234 247 260 273 286 299 312 325
#> 14 224 238 252 266 280 294 308 322 336 350
#> 15 240 255 270 285 300 315 330 345 360 375
Notice that the outer
function is a more efficient way of conducting
a calculation using all combinations for two numeric vectors.
#> 16 17 18 19 20 21 22 23 24 25
#> 11 176 187 198 209 220 231 242 253 264 275
#> 12 192 204 216 228 240 252 264 276 288 300
#> 13 208 221 234 247 260 273 286 299 312 325
#> 14 224 238 252 266 280 294 308 322 336 350
#> 15 240 255 270 285 300 315 330 345 360 375
5.3 Writing R functions
Writing R functions is a great way to create “commands” to customize analyses or repeated operations. Here is the general analysis approach:
- Prepare inputs
- Do calculations
- Collect results
For example, suppose we are writing R code to calculate the odds ratio from a \(2 \times 2\) table with the appropriate format. For this we will use the Oswego outbreak data set.
#> ID Age Sex MealDate MealTime Ill OnsetDate TimeOnset
#> 1 2 52 F 04/18/1940 20:00:00 Y 4/19/1940 00:30:00
#> 2 3 65 M 04/18/1940 18:30:00 Y 4/19/1940 00:30:00
#> 3 4 59 F 04/18/1940 18:30:00 Y 4/19/1940 00:30:00
#> 4 6 63 F 04/18/1940 19:30:00 Y 4/18/1940 22:30:00
#> 5 7 70 M 04/18/1940 19:30:00 Y 4/18/1940 22:30:00
#> 6 8 40 F 04/18/1940 19:30:00 Y 4/19/1940 02:00:00
#> Baked.Ham Spinach Mashed.Potatoes Cabbage.Salad Jello Rolls
#> 1 Y Y Y N N Y
#> 2 Y Y Y Y N N
#> 3 Y Y N N N N
#> 4 Y Y N Y Y N
#> 5 Y Y Y N Y Y
#> 6 N N N N N N
#> Brown.Bread Milk Coffee Water Cakes Van.Ice.Cream
#> 1 N N Y N N Y
#> 2 N N Y N N Y
#> 3 N N Y N Y Y
#> 4 N N N Y N Y
#> 5 Y N Y Y N Y
#> 6 N N N N N Y
#> Choc.Ice.Cream Fruit.Salad
#> 1 N N
#> 2 Y N
#> 3 Y N
#> 4 N N
#> 5 N N
#> 6 Y N
#> Spinach
#> Ill N Y
#> N 12 17
#> Y 20 26
aa = tab1[1, 1]
bb = tab1[1, 2]
cc = tab1[2, 1]
dd = tab1[2, 2]
#### Do calculations
crossprod.OR = (aa*dd)/(bb*cc)
#### Collect results
list(data = tab1, odds.ratio = crossprod.OR)
#> $data
#> Spinach
#> Ill N Y
#> N 12 17
#> Y 20 26
#>
#> $odds.ratio
#> [1] 0.9176471
Now that we are familiar of what it takes to calculate an odds ratio from a 2-way table we can convert the code into a function and load it at the R console. Here is new function:
myOR <- function(x){
#### Prepare input
#### x = 2x2 table amenable to cross-product
aa = x[1, 1]
bb = x[1, 2]
cc = x[2, 1]
dd = x[2, 2]
#### Do calculations
crossprod.OR = (aa*dd)/(bb*cc)
#### Collect results
list(data = x, odds.ratio = crossprod.OR)
}
Now we can test the function:
#> $data
#> Spinach
#> Ill N Y
#> N 12 17
#> Y 20 26
#>
#> $odds.ratio
#> [1] 0.9176471
5.3.1 Arguments with default values
Now suppose we wish to add calculating a 95% confidence interval to this function. We will use the following normal approximation standard error formula for an odds ratio:
\[ SE[\log(OR)] = \sqrt{\frac{1}{A_1} + \frac{1}{B_1} + \frac{1}{A_0} + \frac{1}{B_0}} \]
And here is the (\(1-\alpha\))% confidence interval:
\[ OR_L, OR_U = \exp\lbrace log(OR) \pm Z_{\alpha /2} SE[\log(OR)] \rbrace \]
Here is the improved function:
myOR2 <- function(x, conf.level){
#### Prepare input
#### x = 2x2 table amenable to cross-product
if(missing(conf.level)) stop("Must specify confidence level")
aa = x[1, 1]
bb = x[1, 2]
cc = x[2, 1]
dd = x[2, 2]
Z <- qnorm((1 + conf.level)/2)
#### Do calculations
logOR <- log((aa*dd)/(bb*cc))
SE.logOR <- sqrt(1/aa + 1/bb + 1/cc + 1/dd)
OR <- exp(logOR)
CI <- exp(logOR + c(-1, 1)*Z*SE.logOR)
#### Collect results
list(data = x, odds.ratio = OR, conf.int = CI)
}
Notice that conf.level
is a new argument, but with no default value.
If a user forgets to specify a default value, the following line
handles this possibility:
Now we test this function:
tab.test <- xtabs(~ Ill + Spinach, data = odatcsv)
myOR2(tab.test)
## Error in myOR2(tab.test) (from #4) : Must specify confidence level
#> $data
#> Spinach
#> Ill N Y
#> N 12 17
#> Y 20 26
#>
#> $odds.ratio
#> [1] 0.9176471
#>
#> $conf.int
#> [1] 0.3580184 2.3520471
If an argument has a usual value, then specify this as an argument default value:
myOR3 <- function(x, conf.level = 0.95){
#### Prepare input
#### x = 2x2 table amenable to cross-product
aa = x[1, 1]
bb = x[1, 2]
cc = x[2, 1]
dd = x[2, 2]
Z <- qnorm((1 + conf.level)/2)
#### Do calculations
logOR <- log((aa*dd)/(bb*cc))
SE.logOR <- sqrt(1/aa + 1/bb + 1/cc + 1/dd)
OR <- exp(logOR)
CI <- exp(logOR + c(-1, 1)*Z*SE.logOR)
#### Collect results
list(data = x, odds.ratio = OR, conf.level = conf.level, conf.int = CI)
}
We test our new function:
#> $data
#> Spinach
#> Ill N Y
#> N 12 17
#> Y 20 26
#>
#> $odds.ratio
#> [1] 0.9176471
#>
#> $conf.level
#> [1] 0.95
#>
#> $conf.int
#> [1] 0.3580184 2.3520471
#> $data
#> Spinach
#> Ill N Y
#> N 12 17
#> Y 20 26
#>
#> $odds.ratio
#> [1] 0.9176471
#>
#> $conf.level
#> [1] 0.9
#>
#> $conf.int
#> [1] 0.4165094 2.0217459
5.3.2 Passing optional arguments using the ...
function
On occasion we will have a function nested inside one of our functions and we need to be able to pass optional arguments to this nested function. This commonly occurs when we write functions for customized graphics but only wish to specify some arguments for the nested function and leave the remaining arguments optional. For example, consider this function:
When using myplot
one only needs to provide x
and y
arguments. The type
option has been set to a default value of
“b”. The ...
function will pass any optional arguments to the nested
plot
function. Of course, the optional arguments must be valid
options for plot
function.
5.4 Lexical scoping
The variables which occur in the body of a function can be divided into three classes; formal parameters, local variables and free variables. The formal parameters of a function are those occurring in the argument list of the function. Their values are determined by the process of binding the actual function arguments to the formal parameters. Local variables are those whose values are determined by the evaluation of expressions in the body of the functions. Variables which are not formal parameters or local variables are called free variables. Free variables become local variables if they are assigned to. Consider the following function definition.
In this function, x
is a formal parameter, y
is a local
variable and z
is a free variable. In R the value of free
variable is resolved by first looking in the environment in which the
function was defined. This is called lexical scope. If the free
variable is not defined there, R looks in the parent environment. For
the function f
this would be the global environment (workspace).
To understand the implications of lexical scope consider the following:
rm(list = ls())
ls()
## character(0)
f <- function(x){
y <- 2 * x
print(x)
print(y)
print(z)
}
f(5)
## [1] 5
## [1] 10
## Error in print(z) : object 'z' not found
z <- 99
f(5)
## [1] 5
## [1] 10
## [1] 99
In the f
function z
is a free variable. The first time f
is
executed z
is not defined in the function. R looks in the enclosing
environment and does not find a value for z
and reports an error.
However, when a object z
is created in the global environment, R is
able to find it and uses it.
To summarize, from Roger Peng (CITE):
- "If the value of a symbol is not found in the environment in which a function was defined, then the search is continued in the parent environment.
- The search continues down the sequence of parent environments until we hit the top-level environment; this usually the global environment (workspace) or the namspace of a package.
- After the top-level environment, the search continues down the search list until we hit the empty environment."
Lexical scoping is convenient because it allows nested functions with free variables to run provided the variable has been defined in an enclosing environment. This convenience becomes obvious when one writes many programs. However, there is a danger: an unintended free variable many find an unintended value in an parent environment. This may go undetected because no error is reported. This can happen when there are many objects in the workspace from previous sessions. A good habit is to clear the workspace of all objects at the beginning of every session.
Here is another example from the R introductory
manual.36
Consider a function called cube
.
The variable n
in the function sq
is not an argument to that
function. Therefore it is a free variable and the scoping rules must
be used to ascertain the value that is to be associated with it.
Under lexical scope (e.g., in R) it is the parameter to the function
cube since that is the active binding for the variable n
at the time
the function sq
was defined.
#> [1] 8
Here is a variation that demonstrates lexical scoping but with an unexpected numerical answer.
#> [1] 18
Why does cube(2) = 18
? The sq
function has a free variable
n
. When sq
is evaluated inside of cube
, sq
must find a value
for its free variable n
. Under lexical scoping, sq
looks in the
parent environment when sq
was defined. The sq
function was
defined in the global environment, so it looks, finds, and uses n = 3
.
5.4.1 Functions with a mutatable state
Lexical scoping can be used to create a function with a mutable state. Consider the following function that simulates a banking savings account balance with deposits and withdrawals.37
open.account <- function(total) {
list(deposit = function(amount) {
if (amount <= 0) stop("Deposits must be positive!\n")
total <<- total + amount
cat(amount, "deposited. Your balance is", total, "\n\n")
}, withdraw = function(amount) {
if (amount > total) stop("You don't have that much money!\n")
total <<- total - amount
cat(amount, "withdrawn. Your balance is", total, "\n\n")
}, balance = function() {
cat("Your balance is", total, "\n\n")
})
}
The first total
is a formal parameter of the function
open.account
. There is a list with three functions: deposit
,
withdraw
, and balance
. Inside these functions, the first total
is a free variable; they find their values in the enclosing
environment’s formal parameter (total
) which is one level up. The
special assignment operator, “<<-
”, makes an assignment to the
enclosing environment, one level up, updating the value of the formal
parameter total
.
Let’s open some savings accounts and make deposits and withdrawals.
#> 75 withdrawn. Your balance is 425
#> 250 deposited. Your balance is 350
#> Your balance is 425
#> Your balance is 350
5.5 Recursive functions
A function that calls itself is a recursive function. Recursive functions break a problem down into small steps that require the same methodologic technique. For example, consider the factorial of 5:
\[ 5! = 5 \times 4! = 5 \times 4 \times 3! = 5 \times 4 \times 3 \times 2! = = 5 \times 4 \times 3 \times 2 \times 1! \]
We can write a recursive function for factorial calculations:
recursive.factorial <- function(x) {
if (x == 0) {
return(1)
} else {
return(x * recursive.factorial(x - 1))
}
}
recursive.factorial(5)
#> [1] 120
#> [1] 120
5.6 Debugging and profiling R code
TODO: pending
5.7 Example: Bootstrap of risk ratio data
Consider the following data (Table 5.1) with the number of coronary heart disease events comparing men with Type A behavior to men with Type B behavior [17]. There is no formula for calculating an exact confidence interval for a risk ratio. When the distribution for a statistical estimate is not known we can use the bootstrap method. Essentially, this involves the following steps:
- Simulate risk ratio estimates using the data
- Estimate the mean of the risk ratios
- Choose appropriate quantile values as confidence limits
CHD | |||
Exposure | Yes | No | Total |
Type A | 178 | 1411 | 1589 |
Type B | 79 | 1486 | 1565 |
rr.boot <- function(x1, x0, conf.level = 0.95, replicates = 5000){
#### x1 = c(number cases among exposed, number of exposed)
#### x0 = c(number cases among nonexposed, number of nonexposed)
#### prepare input
n1 <- x1[2]; n0 <- x0[2]
p1 <- x1[1]/n1 # risk among exposed
p0 <- x0[1]/n0 # risk among unexposed
#### do calculations
RR <- p1/p0 # maximum likelihood estimate of risk ratio
r1 <- rbinom(replicates, n1, p1)/n1
r0 <- rbinom(replicates, n0, p0)/n0
rrboot <- r1/r0
rrbar <- mean(rrboot)
alpha <- 1 - conf.level
ci <- quantile(rrboot, c(alpha/2, 1-alpha/2))
#### collect results
list(x = x,
risks = c(p1 = p1, p0 = p0),
risk.ratio = RR,
rrboot.mean = rrbar,
conf.int = unname(ci),
conf.level = conf.level,
replicates = replicates)
}
#### test function
xx1 <- c(178, 1589) # CHD among Type A
xx0 <- c(79, 1565) # CHD among Type B
rr.boot(xx1, xx0)
#> $x
#> [1] 11 12 13 14 15
#>
#> $risks
#> p1 p0
#> 0.11202014 0.05047923
#>
#> $risk.ratio
#> [1] 2.219133
#>
#> $rrboot.mean
#> [1] 2.25108
#>
#> $conf.int
#> [1] 1.736260 2.909937
#>
#> $conf.level
#> [1] 0.95
#>
#> $replicates
#> [1] 5000
5.8 Exercises
Probability distributions are either continuous (e.g., normal distribution) or discrete (e.g. binomial distribution). In R, we deal with four aspects of probability distributions (Table 5.2).
Aspects of a probability distribution | Normal | Binomial |
---|---|---|
probability density function | pnorm |
pbinom |
cumulative distribution function | dnorm |
dbinom |
quantile (“percentile”) function | qnorm |
qbinom |
random sample from distribution | rnorm |
rbinom |
For example, to get the quantile value from a standard normal
distribution (Z value) that corresponds to a 95% confidence interval
we use the qnorm
function.
#> [1] 1.959964
For a second example, suppose we wish to simulate a binomial
distribution. Out of 350 patients that adhered to a new drug
treatment, 273 recovered. The response proportion was 273/350 or 0.78
(78%). We can use the rbinom
function to simulate this binomial
process. Let’s simulate this 10 times (i.e., 10 replicates).
replicates <- 10
num_of_events <- 273
num_at_risk <- 350
risk <- num_of_events/num_at_risk
sim_events <- rbinom(n = replicates, size = num_at_risk, prob = risk)
sim_events
#> [1] 275 275 277 280 288 284 268 282 282 266
#> [1] 0.786 0.786 0.791 0.800 0.823 0.811 0.766 0.806 0.806 0.760
Bootstrapping is a resampling method to estimate a confidence
interval. For example, we could simulate 5000 replicates and use the
quantile
function to estimate a 0.95 confidence interval.
replicates <- 5000
num_of_events <- 273
num_at_risk <- 350
risk <- num_of_events/num_at_risk
sim_events <- rbinom(n = replicates, size = num_at_risk, prob = risk)
conf.level <- 0.95
alpha <- 1 - conf.level
## 0.95 CI for binomial counts
quantile(sim_events, c(alpha/2, 1-alpha/2))
#> 2.5% 97.5%
#> 257 288
#> 2.5% 97.5%
#> 0.7342857 0.8228571
For example, if your previous code was
then, your function and test might be
#> [1] 64
Exposed | |||
Disease | Yes | No | Total |
Yes | \(a\) | \(b\) | \(M_1\) |
No | \(c\) | \(d\) | \(M_0\) |
Total | \(N_1\) | \(N_0\) | \(T\) |
The risk ratio is calculated by
\[ RR = \frac{a/N_1}{b/N_0} = \frac{R_1}{R_0}, \]
and the standard error is calculated by
\[ SE(ln(RR)) = \sqrt{\frac{1}{a}-\frac{1}{N_1}+\frac{1}{b}-\frac{1}{N_0}}. \]
Consider an observational study where 700 patients were given access to a new drug for an ailment. A total of 350 patients chose to take the drug and 350 patients did not. Table 5.4 displays the results of this study.
Treatment | |||
Recovered | Yes | No | Total |
Yes | 273 | 289 | 562 |
No | 77 | 61 | 138 |
Total | 350 | 350 | 700 |
Using the results from Table 5.4 create and test a function that:
- Calculates the risks \(R_1\) and \(R_0\).
- Calculates the risk ratio (\(RR\)).
- Calculates 95% confidence intervals.
- Uses the
fisher.test
function (Fisher’s Exact Test) to calculate the two-sided \(p\)-value of the null hypothesis of no association. (Hint: seeepitable
function in Exercise 5.5) - Collects and returns all the results as a ‘list’.
Using the read.csv
function import the data frame for Table
5.4. Use the code below.
Analyze the drugrx.df
data frame using the xtab
function, your
function from the previous problem, and any R functions you wish. Is
the drug treatment effective? Is there a difference in drug treatment
effectiveness comparing men and women?
~/data/oswego/oswego2.txt
. Remember this data set is
available from https://github.com/taragonmd/data. Locate the data
set and select the “Raw” tab. Copy the URL and use with the read.csv
function.
Import this clean version.
oswego2 <- read.csv("~/data/oswego/oswego2.txt", strip.white = TRUE)
food.item <- c("Baked.ham", "Spinach", "Mashed.potato", "Cabbage.salad",
"Jello", "Rolls", "Brown.bread", "Milk", "Coffee", "Water", "Cakes",
"Vanilla.ice.cream", "Chocolate.ice.cream", "Fruit.salad")
names(oswego2)[9:22] <- food.item
str(oswego2)
#> 'data.frame': 75 obs. of 22 variables:
#> $ ID : int 2 3 4 6 7 8 9 10 14 16 ...
#> $ Age : int 52 65 59 63 70 40 15 33 10 32 ...
#> $ Sex : Factor w/ 2 levels "F","M": 1 2 1 1 2 1 1 1 2 1 ...
#> $ MealDate : Factor w/ 1 level "04/18/1940": 1 1 1 1 1 1 1 1 1 1 ...
#> $ MealTime : Factor w/ 6 levels "11:00:00","18:30:00",..: 5 2 2 4 4 4 6 3 4 NA ...
#> $ Ill : Factor w/ 2 levels "N","Y": 2 2 2 2 2 2 2 2 2 2 ...
#> $ OnsetDate : Factor w/ 2 levels "4/18/1940","4/19/1940": 2 2 2 1 1 2 2 1 2 2 ...
#> $ OnsetTime : Factor w/ 17 levels "00:00:00","00:30:00",..: 2 2 2 15 15 4 3 16 4 7 ...
#> $ Baked.ham : Factor w/ 2 levels "N","Y": 2 2 2 2 2 1 1 2 1 2 ...
#> $ Spinach : Factor w/ 2 levels "N","Y": 2 2 2 2 2 1 1 2 1 2 ...
#> $ Mashed.potato : Factor w/ 2 levels "N","Y": 2 2 1 1 2 1 1 2 1 1 ...
#> $ Cabbage.salad : Factor w/ 2 levels "N","Y": 1 2 1 2 1 1 1 1 1 1 ...
#> $ Jello : Factor w/ 2 levels "N","Y": 1 1 1 2 2 1 1 1 1 1 ...
#> $ Rolls : Factor w/ 2 levels "N","Y": 2 1 1 1 2 1 1 2 1 2 ...
#> $ Brown.bread : Factor w/ 2 levels "N","Y": 1 1 1 1 2 1 1 2 1 1 ...
#> $ Milk : Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 1 1 ...
#> $ Coffee : Factor w/ 2 levels "N","Y": 2 2 2 1 2 1 1 1 1 2 ...
#> $ Water : Factor w/ 2 levels "N","Y": 1 1 1 2 2 1 1 2 1 1 ...
#> $ Cakes : Factor w/ 2 levels "N","Y": 1 1 2 1 1 1 2 1 1 2 ...
#> $ Vanilla.ice.cream : Factor w/ 2 levels "N","Y": 2 2 2 2 2 2 1 2 2 2 ...
#> $ Chocolate.ice.cream: Factor w/ 2 levels "N","Y": 1 2 2 1 1 2 2 2 2 2 ...
#> $ Fruit.salad : Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 1 1 ...
Study the following epitable
function, especially the use of the
...
argument to past options to the fisher.test
function.
epitable <- function(exposure, outcome, digits = 3, ...){
# Updated 2016-10-16
# 2x2 table from 'xtabs' function
# row names are exposure status
# col names are illness outcome
# digits = significant digits (default = 3)
# ... = options to fisher.test function
x <- xtabs(~ exposure + outcome, na.action=na.pass)
rowtot <- apply(x, 1, sum)
rowdist <- sweep(x, 1, rowtot, '/')
risk0 <- rowdist['N','Y']
risk1 <- rowdist['Y','Y']
rr <- risk1/risk0
or <- (risk1/(1-risk1)) / (risk0/(1-risk0))
ft <- fisher.test(x, ...)
pv <- ft$p.value
results <- signif(c(risk0, risk1, rr, or, pv), digits = digits)
names(results) <- c('R_0', 'R_1', 'RR', 'OR', 'p.value')
list(data = x, results = results)
}
Here is an example of testing one food item:
#> $data
#> outcome
#> exposure N Y
#> N 12 17
#> Y 17 29
#>
#> $results
#> R_0 R_1 RR OR p.value
#> 0.586 0.630 1.080 1.200 0.809
Use the epitable
function to test each food item to see if it is
associated with developing illness. Construct a table with the
following column headings:
- Food item
- \(R_0\)
- \(R_1\)
- \(RR\) (risk ratio)
- \(OR\) (odds ratio)
- Fisher \(p\) value
Discuss your findings.
for
loop to automate the creation of the outbreak table from
previous exercise.
Hint provided:
outbreak <- matrix(NA, length(food.item), 5)
rownames(outbreak) <- food.item
colnames(outbreak) <- c('R_0','R_1','RR','OR','P value')
for(i in 1:length(food.item)) {
outbreak[i,] <- epitable(oswego2[,food.item[i]], oswego2$Ill)$results
}
outbreak
lapply
and sapply
. Explain how both
functions work, and what are their differences.
#> $Baked.ham
#> R_0 R_1 RR OR p.value
#> 0.586 0.630 1.080 1.200 0.809
#>
#> $Spinach
#> R_0 R_1 RR OR p.value
#> 0.625 0.605 0.967 0.918 1.000
#>
#> $Mashed.potato
#> R_0 R_1 RR OR p.value
#> 0.622 0.622 1.000 1.000 1.000
#>
#> $Cabbage.salad
#> R_0 R_1 RR OR p.value
#> 0.596 0.643 1.080 1.220 0.808
#>
#> $Jello
#> R_0 R_1 RR OR p.value
#> 0.577 0.696 1.210 1.680 0.442
#>
#> $Rolls
#> R_0 R_1 RR OR p.value
#> 0.658 0.568 0.863 0.682 0.482
#>
#> $Brown.bread
#> R_0 R_1 RR OR p.value
#> 0.583 0.667 1.140 1.430 0.622
#>
#> $Milk
#> R_0 R_1 RR OR p.value
#> 0.620 0.500 0.807 0.614 0.638
#>
#> $Coffee
#> R_0 R_1 RR OR p.value
#> 0.614 0.613 0.999 0.997 1.000
#>
#> $Water
#> R_0 R_1 RR OR p.value
#> 0.647 0.542 0.837 0.645 0.450
#>
#> $Cakes
#> R_0 R_1 RR OR p.value
#> 0.543 0.675 1.240 1.750 0.342
#>
#> $Vanilla.ice.cream
#> R_0 R_1 RR OR p.value
#> 1.43e-01 7.96e-01 5.57e+00 2.35e+01 2.60e-07
#>
#> $Chocolate.ice.cream
#> R_0 R_1 RR OR p.value
#> 0.7410 0.5320 0.7180 0.3980 0.0891
#>
#> $Fruit.salad
#> R_0 R_1 RR OR p.value
#> 0.609 0.667 1.100 1.290 1.000
#> R_0 R_1 RR OR p.value
#> Baked.ham 0.586 0.630 1.080 1.200 8.09e-01
#> Spinach 0.625 0.605 0.967 0.918 1.00e+00
#> Mashed.potato 0.622 0.622 1.000 1.000 1.00e+00
#> Cabbage.salad 0.596 0.643 1.080 1.220 8.08e-01
#> Jello 0.577 0.696 1.210 1.680 4.42e-01
#> Rolls 0.658 0.568 0.863 0.682 4.82e-01
#> Brown.bread 0.583 0.667 1.140 1.430 6.22e-01
#> Milk 0.620 0.500 0.807 0.614 6.38e-01
#> Coffee 0.614 0.613 0.999 0.997 1.00e+00
#> Water 0.647 0.542 0.837 0.645 4.50e-01
#> Cakes 0.543 0.675 1.240 1.750 3.42e-01
#> Vanilla.ice.cream 0.143 0.796 5.570 23.500 2.60e-07
#> Chocolate.ice.cream 0.741 0.532 0.718 0.398 8.91e-02
#> Fruit.salad 0.609 0.667 1.100 1.290 1.00e+00
g(2)
:
#> [1] 4
g(2)
:
#> [1] 5
References
17. Selvin S. Modern applied biostatistical methods using s-plus. New York: Oxford University Press; 1998.