# Chapter 5 Programming with R—An introduction

*“Good programmers write good code, great programmers steal great code.”*

R is a comprehensive and powerful programming language. In this section we briefly summarize how to use R for introductory programming, including writing and executing functions.

## 5.1 Basic programming

*Basic* data science programming in R is just a list of R expressions
that are collected and executed as a batch job. The collection of R
expressions is saved in an ASCII text file with a `.R` extension. In
RStudio, from the main menu, select `File`, `New`, and then `R Script`.
This script file will be saved with an `.R` extension. This
script file can be edited and executed within RStudio. Alternatively,
we can edit this file using our favorite text editor (e.g., GNU
Emacs).

What are the characteristics of good R programming?

- Use a good text editor (or RStudio) for programming
- Organize batch jobs into numbered sequential files (e.g., job01_task.R)
- Avoid graphical menu-driven approaches

First, use a good text editor (or RStudio) for programming. Each R
expression will span one or more lines. Although one could write and
submit each line at the R console, this approach is inefficient and
not recommended. Instead, type the expressions into your favorite
text editor and save with a `.R` extension. Then, selected
expressions or the whole file can be executed in R. Use the text
editor that comes with R, or a text editor customized to work with R
(e.g., RStudio, GNU Emacs with ESS^{34}).

Second, we organize batch jobs into sequential files. Data analysis
is a series of tasks involving data entry, checking, cleaning,
analysis, and reporting. Although data analysts are primarily
involved in analysis and reporting, they may be involved in earlier
phases of data preparation. Regardless of the stage of involvement,
data analysts should organize, conduct, and document their analytic
tasks and batch jobs in chronological order. For example, batch jobs
might be named as follows: `job01_cleaning.R`, `job02_recoding.R`,
`job03_descriptive.R`, `job04_logistic.R`, etc. Program file names
have two components: `jobs` represent major tasks and are always
numbered in chronological order (`job01_*.R`, `job02_*.R`, etc.), and
a brief descriptor is appended to the first component of the file
name (`job01_recode_data.R`, `job02_bivariate_analysis.R`).

If one needs to repeat parts of a previous job, add new jobs rather
than editing old jobs. This way our analysis can always be reviewed,
replicated, and audited exactly in the order it was conducted,
including corrections. We avoid editing earlier jobs; if we do edit a
previous job, then we must *rerun all subsequent jobs in chronological
order* to double-check our work.

Third, we avoid graphical, menu-driven approaches. While this
approach is tempting, our work cannot be documented, replicated, and
audited. The best approach is to collect R expressions into batch
jobs and run them using the `source` function.

Fourth, while preliminary and core analyses should be conducted with
script (`.R`) files, the final report and analyses should be compiled
with R Markdown (`.Rmd`) files. For now, we focus on script files.

More recently, data scientists have been using R Markdown or notebook-style documents for conducting and documenting analyses. This is fine, but be sure to follow the principles above.

## 5.2 Intermediate programming

The next level of R programming involves (1) implementing control flow (decision points); (2) implementing dependencies in calculations or data manipulation; and (3) improving execution efficiency.

### 5.2.1 Control statements

Control flow involves one or more decision points. A simple decision
point goes like this: if a condition is `TRUE`, do {*this*} and then
continue; if it is `FALSE`, do not do {*this*} and then continue.
When R continues, the next R expression can be any valid expression,
including another decision point.

#### 5.2.1.1 The `if` function

We use the `if` function to implement single decision points. If the
condition is `FALSE`, R skips the bracketed expression and
continues executing subsequent lines. Study this example:
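A minimal sketch consistent with the two outputs below (the vectors `x` and `y` are assumptions inferred from the printed values):

```r
## general form: if (condition) { R expressions }
x <- c(1, 2, NA, 4, 5)
if (any(is.na(x))) {      # condition is TRUE
  x[is.na(x)] <- 999      # replace missing values
}
x

y <- c(1, 2, 3, 4, 5)
if (any(is.na(y))) {      # condition is FALSE; bracketed code is skipped
  y[is.na(y)] <- 999
}
y
```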

`#> [1] 1 2 999 4 5`

`#> [1] 1 2 3 4 5`

The first `if` condition evaluated to `TRUE` and the missing value
was replaced. The second `if` condition evaluated to `FALSE` and the
bracketed expressions were not evaluated.

#### 5.2.1.2 The `else` function

Up to now the `if` condition had only one possible response. For two
mutually exclusive responses, we add the `else` function. Here is an
example:

```
x <- c(1, 2, NA, 4, 5)
y <- c(1, 2, 3, 4, 5)
if (any(is.na(x))) {
  x[is.na(x)] <- 999
  cat("NAs replaced\n")
} else {
  cat("No missing values; no replacement\n")
}
x
```

`#> NAs replaced`

`#> [1] 1 2 999 4 5`

```
if (any(is.na(y))) {
  y[is.na(y)] <- 999
  cat("NAs replaced\n")
} else {
  cat("No missing values; no replacement\n")
}
y
```

`#> No missing values; no replacement`

`#> [1] 1 2 3 4 5`

Therefore, use the `if` and `else` combination if one needs to
evaluate exactly one of two possible collections of R expressions.

If one needs to evaluate *possibly* one of two possible collections
of R expressions, then use the following pattern:
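The pattern itself is a chained `if`/`else if` in which possibly neither branch runs; a minimal sketch (the conditions and values are placeholders):

```r
x <- 5
msg <- NULL
if (x < 0) {
  msg <- "negative"
} else if (x > 10) {
  msg <- "greater than 10"
}
msg   # NULL here: neither condition was TRUE, so neither branch ran
```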

The `if` and `else` functions can be combined to achieve any desired
control flow.

#### 5.2.1.3 The “short circuit” logical operators

The “short circuit” `&&` and `||` logical operators are used for
control flow in `if` functions. If logical vectors are provided, only
the first element of each vector is used. In contrast, for
element-wise comparisons of two or more vectors, use the `&` and `|`
operators but not the `&&` and `||` operators.^{35} For `if` function
comparisons, use the `&&` and `||` operators.

Suppose we want to square the elements of a numeric vector but not if it is a matrix.

```
x <- 1:5
y <- matrix(1:4, 2, 2)
if (!is.matrix(x) && is.numeric(x) && is.vector(x)) {
  x^2
} else cat("Either not numeric, not a vector, or is a matrix\n")
```

`#> [1] 1 4 9 16 25`

```
if (!is.matrix(y) && is.numeric(y) && is.vector(y)) {
  y^2
} else cat("Either not numeric, not a vector, or is a matrix\n")
```

`#> Either not numeric, not a vector, or is a matrix`

The `&&` and `||` operators are called “short circuit” operators
because not all of their arguments may be evaluated: moving from left
to right, only enough arguments are evaluated to determine whether
the `if` condition is `TRUE` or `FALSE`. This can save considerable
time if some of the arguments are complicated functions that require
significant computing time to evaluate to either `TRUE` or `FALSE`.
In the previous example, because `!is.matrix(y)` evaluates to
`FALSE`, it was not necessary to evaluate `is.numeric(y)` or
`is.vector(y)`.

### 5.2.2 Vectorized approach

An important advantage of R is the availability of functions that perform vectorized calculations. For example, suppose we wish to add the columns of a matrix. Here is one approach:
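A sketch of the first approach, assuming the matrix `matrix(1:12, 3, 4)` (whose element-wise column addition matches the output shown):

```r
x <- matrix(1:12, nrow = 3, ncol = 4)
x[, 1] + x[, 2] + x[, 3] + x[, 4]  # add the columns element-wise
```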

`#> [1] 22 26 30`

However, this can be accomplished more efficiently using the `apply`
function:
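A sketch of the `apply` call, using the same assumed matrix:

```r
x <- matrix(1:12, nrow = 3, ncol = 4)  # assumed input (matches the output shown)
apply(x, 1, sum)  # apply sum() over the first (row) dimension
```

In practice, `rowSums(x)` is an even faster base R equivalent.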

`#> [1] 22 26 30`

In general, we want to use these types of functions (e.g., `tapply`,
`sweep`, `outer`, `mean`, etc.) because they have been optimized to
perform vectorized calculations.

#### 5.2.2.1 The `ifelse` function

The `ifelse` function is a vectorized, element-wise implementation of
the `if` and `else` functions. We demonstrate using the practical
example of recoding a 2-level variable.

```
sex <- c("M", NA, "F", "F", NA, "M", "F", "M")
(sex2 <- ifelse(sex=="M", "Male", "Female")) # binary recode
```

```
#> [1] "Male" NA "Female" "Female" NA "Male"
#> [7] "Female" "Male"
```

If an element of `sex` contains “M” (`TRUE`), it is recoded to “Male”
(in `sex2`); otherwise (`FALSE`) it is recoded to “Female”. Missing
(`NA`) elements remain `NA`. This assumes that there are only “M”s
and “F”s in the data vector.

### 5.2.3 Looping

Looping is a common programming approach that is discouraged in R because it is inefficient. It is much better to conduct vectorized calculations using existing functions. For example, suppose we want to sum a numeric vector.

```
x <- c(2, 4, 6, 8, 10)
index <- 1:length(x) # create vector of sequential integers
xsum <- 0
for (i in index) {
  xsum <- xsum + x[i]
}
xsum
```

`#> [1] 30`

A much better approach is to use the `sum` function:
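Using the vector `x` from above, the call is simply:

```r
x <- c(2, 4, 6, 8, 10)
sum(x)  # vectorized; no explicit loop needed
```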

`#> [1] 30`

Unless it is absolutely necessary, we avoid looping.

Looping is necessary when (1) there is no R function to conduct a vectorized calculation, and (2) when the result of an element depends on the result of a preceding element which may not be known beforehand (e.g., when it is the result of a random process).

#### 5.2.3.1 The `for` function

The previous example was a `for` loop. Here is the syntax:
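Schematically (with a placeholder vector standing in for *somevector*):

```r
somevector <- c(10, 20, 30)  # placeholder vector
for (i in somevector) {
  ## R expressions using i, the current element of somevector
  print(i)
}
```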

In the `for` function, R loops over *somevector* and uses the ith
element either directly or indirectly (e.g., indexing another
vector). Here is using the vector directly:
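A sketch consistent with the output below (the output lacks the `[1]` prefix, suggesting `cat` rather than `print` was used):

```r
for (i in c(1, 2, 3)) {
  cat(i^2, "\n")  # use each element directly
}
```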

```
#> 1
#> 4
#> 9
```

The `letters` object contains the lower-case English alphabet. Here
we use an integer vector for indexing `letters`:
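A sketch consistent with the two outputs below (printing `letters`, then looping over an integer index):

```r
letters            # built-in lower-case alphabet
for (i in 1:3) {
  cat(letters[i], "\n")  # use i indirectly, to index letters
}
```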

```
#> [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o"
#> [16] "p" "q" "r" "s" "t" "u" "v" "w" "x" "y" "z"
```

```
#> a
#> b
#> c
```

*Somevector* can be any vector:
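A sketch consistent with the output below, using a character vector:

```r
for (name in c("Tomasito", "Luisito", "Angelita")) {
  print(name)  # the loop variable takes each character value in turn
}
```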

```
#> [1] "Tomasito"
#> [1] "Luisito"
#> [1] "Angelita"
```

#### 5.2.3.2 The `while` function

The `while` function will continue to evaluate a collection of R
expressions while a condition is true. Here is a trivial example:
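A sketch consistent with the output below:

```r
## general form: while (condition) { R expressions }
i <- 0
while (i < 5) {
  print(i)
  i <- i + 1   # without this update the loop would never terminate
}
```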

```
#> [1] 0
#> [1] 1
#> [1] 2
#> [1] 3
#> [1] 4
```

The `while` function is often used in optimization routines that
iterate until they converge to a numerical value.

#### 5.2.3.3 The `break` and `next` functions

The `break` expression will break out of a `for` or `while` loop if a
condition is met, and transfers control to the first statement
outside of the inner-most loop. Here is the general syntax:
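A runnable sketch of the pattern (the condition is a placeholder):

```r
for (i in 1:10) {
  if (i > 3) break  # exit the loop when the condition is met
  print(i)
}
```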

The `next` expression halts the processing of the current iteration
and advances the looping index. Here is the general syntax:
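A runnable sketch of the pattern:

```r
for (i in 1:5) {
  if (i == 3) next  # skip this iteration and advance the index
  print(i)
}
```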

Both `break` and `next` apply only to the innermost of nested loops.

#### 5.2.3.4 The double `for` loop

In the next example we nest two `for` loops to generate a
multiplication table for two integer vectors.

```
x <- 11:15
y <- 16:25
mtab <- matrix(NA, nrow = length(x), ncol = length(y))
rownames(mtab) <- x
colnames(mtab) <- y
for (i in 1:length(x)) {
  for (j in 1:length(y)) {
    mtab[i, j] <- x[i] * y[j]
  }
}
mtab
```

```
#> 16 17 18 19 20 21 22 23 24 25
#> 11 176 187 198 209 220 231 242 253 264 275
#> 12 192 204 216 228 240 252 264 276 288 300
#> 13 208 221 234 247 260 273 286 299 312 325
#> 14 224 238 252 266 280 294 308 322 336 350
#> 15 240 255 270 285 300 315 330 345 360 375
```

Notice that the `outer` function is a more efficient way of
conducting a calculation using all combinations of elements from two
numeric vectors.
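A sketch of the `outer` call (the `dimnames` assignment is an assumption, added so the printed labels match the output shown):

```r
x <- 11:15
y <- 16:25
mtab2 <- outer(x, y)           # all pairwise products in one call
dimnames(mtab2) <- list(x, y)  # label rows and columns (assumed)
mtab2
```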

```
#> 16 17 18 19 20 21 22 23 24 25
#> 11 176 187 198 209 220 231 242 253 264 275
#> 12 192 204 216 228 240 252 264 276 288 300
#> 13 208 221 234 247 260 273 286 299 312 325
#> 14 224 238 252 266 280 294 308 322 336 350
#> 15 240 255 270 285 300 315 330 345 360 375
```

## 5.3 Writing R functions

Writing R functions is a great way to create “commands” to customize analyses or repeated operations. Here is the general analysis approach:

- Prepare *inputs*
- Do *calculations*
- Collect *results*

For example, suppose we are writing R code to calculate the odds ratio from a \(2 \times 2\) table with the appropriate format. For this we will use the Oswego outbreak data set.

```
#> ID Age Sex MealDate MealTime Ill OnsetDate TimeOnset
#> 1 2 52 F 04/18/1940 20:00:00 Y 4/19/1940 00:30:00
#> 2 3 65 M 04/18/1940 18:30:00 Y 4/19/1940 00:30:00
#> 3 4 59 F 04/18/1940 18:30:00 Y 4/19/1940 00:30:00
#> 4 6 63 F 04/18/1940 19:30:00 Y 4/18/1940 22:30:00
#> 5 7 70 M 04/18/1940 19:30:00 Y 4/18/1940 22:30:00
#> 6 8 40 F 04/18/1940 19:30:00 Y 4/19/1940 02:00:00
#> Baked.Ham Spinach Mashed.Potatoes Cabbage.Salad Jello Rolls
#> 1 Y Y Y N N Y
#> 2 Y Y Y Y N N
#> 3 Y Y N N N N
#> 4 Y Y N Y Y N
#> 5 Y Y Y N Y Y
#> 6 N N N N N N
#> Brown.Bread Milk Coffee Water Cakes Van.Ice.Cream
#> 1 N N Y N N Y
#> 2 N N Y N N Y
#> 3 N N Y N Y Y
#> 4 N N N Y N Y
#> 5 Y N Y Y N Y
#> 6 N N N N N Y
#> Choc.Ice.Cream Fruit.Salad
#> 1 N N
#> 2 Y N
#> 3 Y N
#> 4 N N
#> 5 N N
#> 6 Y N
```

```
#> Spinach
#> Ill N Y
#> N 12 17
#> Y 20 26
```

```
#### Prepare input
tab1 <- xtabs(~ Ill + Spinach, data = odatcsv) # 2x2 table shown above
aa = tab1[1, 1]
bb = tab1[1, 2]
cc = tab1[2, 1]
dd = tab1[2, 2]
#### Do calculations
crossprod.OR = (aa*dd)/(bb*cc)
#### Collect results
list(data = tab1, odds.ratio = crossprod.OR)
```

```
#> $data
#> Spinach
#> Ill N Y
#> N 12 17
#> Y 20 26
#>
#> $odds.ratio
#> [1] 0.9176471
```

Now that we are familiar with what it takes to calculate an odds ratio from a 2-way table, we can convert the code into a function and load it at the R console. Here is the new function:

```
myOR <- function(x){
  #### Prepare input
  #### x = 2x2 table amenable to cross-product
  aa = x[1, 1]
  bb = x[1, 2]
  cc = x[2, 1]
  dd = x[2, 2]
  #### Do calculations
  crossprod.OR = (aa*dd)/(bb*cc)
  #### Collect results
  list(data = x, odds.ratio = crossprod.OR)
}
```

Now we can test the function:
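With `myOR` loaded, the test call uses the `tab1` table created earlier (reconstructed here from the printed counts so the sketch is self-contained):

```r
tab1 <- as.table(matrix(c(12, 20, 17, 26), 2, 2,
                        dimnames = list(Ill = c("N", "Y"),
                                        Spinach = c("N", "Y"))))
myOR(tab1)
```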

```
#> $data
#> Spinach
#> Ill N Y
#> N 12 17
#> Y 20 26
#>
#> $odds.ratio
#> [1] 0.9176471
```

### 5.3.1 Arguments with default values

Now suppose we wish to add calculating a 95% confidence interval to this function. We will use the following normal approximation standard error formula for an odds ratio:

\[ SE[\log(OR)] = \sqrt{\frac{1}{A_1} + \frac{1}{B_1} + \frac{1}{A_0} + \frac{1}{B_0}} \]

And here is the \(100(1-\alpha)\%\) confidence interval:

\[ OR_L, OR_U = \exp\lbrace \log(OR) \pm Z_{\alpha/2} SE[\log(OR)] \rbrace \]

Here is the improved function:

```
myOR2 <- function(x, conf.level){
  #### Prepare input
  #### x = 2x2 table amenable to cross-product
  if(missing(conf.level)) stop("Must specify confidence level")
  aa = x[1, 1]
  bb = x[1, 2]
  cc = x[2, 1]
  dd = x[2, 2]
  Z <- qnorm((1 + conf.level)/2)
  #### Do calculations
  logOR <- log((aa*dd)/(bb*cc))
  SE.logOR <- sqrt(1/aa + 1/bb + 1/cc + 1/dd)
  OR <- exp(logOR)
  CI <- exp(logOR + c(-1, 1)*Z*SE.logOR)
  #### Collect results
  list(data = x, odds.ratio = OR, conf.int = CI)
}
```

Notice that `conf.level` is a new argument, but with no default
value. If a user forgets to specify a confidence level, the line
`if(missing(conf.level)) stop("Must specify confidence level")`
handles this possibility.

Now we test this function:

```
tab.test <- xtabs(~ Ill + Spinach, data = odatcsv)
myOR2(tab.test)
## Error in myOR2(tab.test) (from #4) : Must specify confidence level
```
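The second test, which produced the output below, presumably supplied the confidence level:

```r
myOR2(tab.test, conf.level = 0.95)
```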

```
#> $data
#> Spinach
#> Ill N Y
#> N 12 17
#> Y 20 26
#>
#> $odds.ratio
#> [1] 0.9176471
#>
#> $conf.int
#> [1] 0.3580184 2.3520471
```

If an argument has a usual value, then specify it as the argument's default value:

```
myOR3 <- function(x, conf.level = 0.95){
  #### Prepare input
  #### x = 2x2 table amenable to cross-product
  aa = x[1, 1]
  bb = x[1, 2]
  cc = x[2, 1]
  dd = x[2, 2]
  Z <- qnorm((1 + conf.level)/2)
  #### Do calculations
  logOR <- log((aa*dd)/(bb*cc))
  SE.logOR <- sqrt(1/aa + 1/bb + 1/cc + 1/dd)
  OR <- exp(logOR)
  CI <- exp(logOR + c(-1, 1)*Z*SE.logOR)
  #### Collect results
  list(data = x, odds.ratio = OR, conf.level = conf.level, conf.int = CI)
}
```

We test our new function:
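The two calls that produced the outputs below are presumably, first with the default and then with a 90% level:

```r
myOR3(tab.test)                    # default 95% confidence interval
myOR3(tab.test, conf.level = 0.9)  # 90% confidence interval
```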

```
#> $data
#> Spinach
#> Ill N Y
#> N 12 17
#> Y 20 26
#>
#> $odds.ratio
#> [1] 0.9176471
#>
#> $conf.level
#> [1] 0.95
#>
#> $conf.int
#> [1] 0.3580184 2.3520471
```

```
#> $data
#> Spinach
#> Ill N Y
#> N 12 17
#> Y 20 26
#>
#> $odds.ratio
#> [1] 0.9176471
#>
#> $conf.level
#> [1] 0.9
#>
#> $conf.int
#> [1] 0.4165094 2.0217459
```

### 5.3.2 Passing optional arguments using the `...` argument

On occasion we will have a function nested inside one of our functions and we need to be able to pass optional arguments to this nested function. This commonly occurs when we write functions for customized graphics but only wish to specify some arguments for the nested function and leave the remaining arguments optional. For example, consider this function:
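A sketch consistent with the description that follows (the name `myplot` and the `type = "b"` default come from the text; the body is an assumption):

```r
myplot <- function(x, y, type = "b", ...) {
  plot(x, y, type = type, ...)  # ... forwards any optional arguments to plot()
}
```

For example, `myplot(1:10, (1:10)^2, col = "red", main = "Squares")` passes `col` and `main` through `...` to `plot`.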

When using `myplot` one only needs to provide the `x` and `y`
arguments. The `type` option has been set to a default value of “b”.
The `...` argument will pass any optional arguments to the nested
`plot` function. Of course, the optional arguments must be valid
options for the `plot` function.

## 5.4 Lexical scoping

The variables which occur in the body of a function can be divided
into three classes: formal parameters, local variables, and free
variables. The *formal parameters* of a function are those occurring
in the argument list of the function. Their values are determined by
the process of binding the actual function arguments to the formal
parameters. *Local variables* are those whose values are determined
by the evaluation of expressions in the body of the function.
Variables which are not formal parameters or local variables are
called *free variables*. Free variables become local variables if
they are assigned to. Consider the following function definition.
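From the discussion that follows, the definition is the function `f` used later in this section:

```r
f <- function(x) {   # x is a formal parameter
  y <- 2 * x         # y is a local variable
  print(x)
  print(y)
  print(z)           # z is a free variable
}
```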

In this function, `x` is a *formal* parameter, `y` is a *local*
variable, and `z` is a *free* variable. In R the value of a free
variable is resolved by first looking in the environment in which the
function was defined. This is called *lexical scope*. If the free
variable is not defined there, R looks in the parent environment. For
the function `f` this would be the global environment (workspace).

To understand the implications of lexical scope consider the following:

```
rm(list = ls())
ls()
## character(0)
f <- function(x){
  y <- 2 * x
  print(x)
  print(y)
  print(z)
}
f(5)
## [1] 5
## [1] 10
## Error in print(z) : object 'z' not found
z <- 99
f(5)
## [1] 5
## [1] 10
## [1] 99
```

In the `f` function, `z` is a free variable. The first time `f` is
executed, `z` is not defined in the function. R looks in the
enclosing environment, does not find a value for `z`, and reports an
error. However, when an object `z` is created in the global
environment, R is able to find it and uses it.

To summarize, from Roger Peng (CITE):

- "If the value of a symbol is not found in the environment in which a function was defined, then the search is continued in the parent environment.
- The search continues down the sequence of parent environments until we hit the top-level environment; this is usually the global environment (workspace) or the namespace of a package.
- After the top-level environment, the search continues down the search list until we hit the empty environment."

Lexical scoping is convenient because it allows nested functions with free variables to run, provided the variable has been defined in an enclosing environment. This convenience becomes obvious when one writes many programs. However, there is a danger: an unintended free variable may find an unintended value in a parent environment. This may go undetected because no error is reported. This can happen when there are many objects in the workspace from previous sessions. A good habit is to clear the workspace of all objects at the beginning of every session.

Here is another example from the R introductory manual.^{36}
Consider a function called `cube`.
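The definition, as given in the R manual example this discussion follows:

```r
cube <- function(n) {
  sq <- function() n * n  # n is a free variable inside sq
  n * sq()
}
cube(2)
```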

The variable `n` in the function `sq` is not an argument to that
function. Therefore it is a free variable and the scoping rules must
be used to ascertain the value that is to be associated with it.
Under lexical scope (e.g., in R) it is the parameter to the function
`cube`, since that is the active binding for the variable `n` at the
time the function `sq` was defined.

`#> [1] 8`

Here is a variation that demonstrates lexical scoping but with an unexpected numerical answer.
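A sketch consistent with the explanation that follows (with `sq` defined in the global environment, where `n = 3`):

```r
n <- 3
sq <- function() n * n       # free variable n is bound in the global environment
cube <- function(n) n * sq() # sq() ignores cube's n and uses the global n
cube(2)
```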

`#> [1] 18`

Why does `cube(2)` return 18? The `sq` function has a free variable
`n`. When `sq` is evaluated inside of `cube`, `sq` must find a value
for its free variable `n`. Under lexical scoping, `sq` looks in the
environment where `sq` was defined. The `sq` function was defined in
the global environment, so it looks, finds, and uses `n = 3`.

### 5.4.1 Functions with a mutable state

Lexical scoping can be used to create a function with a *mutable
state*. Consider the following function that simulates a banking
savings account balance with deposits and
withdrawals.^{37}

```
open.account <- function(total) {
  list(
    deposit = function(amount) {
      if (amount <= 0) stop("Deposits must be positive!\n")
      total <<- total + amount
      cat(amount, "deposited. Your balance is", total, "\n\n")
    },
    withdraw = function(amount) {
      if (amount > total) stop("You don't have that much money!\n")
      total <<- total - amount
      cat(amount, "withdrawn. Your balance is", total, "\n\n")
    },
    balance = function() {
      cat("Your balance is", total, "\n\n")
    }
  )
}
```

The first `total` is a formal parameter of the function
`open.account`. The function returns a list with three functions:
`deposit`, `withdraw`, and `balance`. Inside these functions, `total`
is a free variable; it finds its value in the enclosing environment’s
formal parameter (`total`), one level up. The special assignment
operator, “`<<-`”, makes an assignment to the enclosing environment,
one level up, updating the value of the formal parameter `total`.

Let’s open some savings accounts and make deposits and withdrawals.
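A sketch consistent with the printed balances below (the account names, and the opening amounts of 500 and 100, are assumptions inferred from the outputs):

```r
ross <- open.account(500)
robert <- open.account(100)
ross$withdraw(75)    # 500 - 75  = 425
robert$deposit(250)  # 100 + 250 = 350
ross$balance()
robert$balance()
```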

`#> 75 withdrawn. Your balance is 425`

`#> 250 deposited. Your balance is 350`

`#> Your balance is 425`

`#> Your balance is 350`

## 5.5 Recursive functions

A function that calls itself is a recursive function. Recursive functions break a problem down into small steps that require the same methodologic technique. For example, consider the factorial of 5:

\[ 5! = 5 \times 4! = 5 \times 4 \times 3! = 5 \times 4 \times 3 \times 2! = 5 \times 4 \times 3 \times 2 \times 1! \]

We can write a recursive function for factorial calculations:

```
recursive.factorial <- function(x) {
  if (x == 0) {
    return(1)
  } else {
    return(x * recursive.factorial(x - 1))
  }
}
recursive.factorial(5)
```

`#> [1] 120`
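The second `[1] 120` below was presumably produced by base R's built-in `factorial` function, which agrees with our recursive version:

```r
factorial(5)
#> [1] 120
```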

`#> [1] 120`

## 5.6 Debugging and profiling R code

TODO: pending

## 5.7 Example: Bootstrap of risk ratio data

Consider the following data (Table 5.1) with the number of coronary heart disease events comparing men with Type A behavior to men with Type B behavior [17]. There is no formula for calculating an exact confidence interval for a risk ratio. When the distribution for a statistical estimate is not known we can use the bootstrap method. Essentially, this involves the following steps:

- Simulate risk ratio estimates using the data
- Estimate the mean of the risk ratios
- Choose appropriate quantile values as confidence limits

Exposure | CHD: Yes | CHD: No | Total
---|---|---|---
Type A | 178 | 1411 | 1589
Type B | 79 | 1486 | 1565

```
rr.boot <- function(x1, x0, conf.level = 0.95, replicates = 5000){
  #### x1 = c(number cases among exposed, number of exposed)
  #### x0 = c(number cases among nonexposed, number of nonexposed)
  #### prepare input
  n1 <- x1[2]; n0 <- x0[2]
  p1 <- x1[1]/n1 # risk among exposed
  p0 <- x0[1]/n0 # risk among unexposed
  #### do calculations
  RR <- p1/p0 # maximum likelihood estimate of risk ratio
  r1 <- rbinom(replicates, n1, p1)/n1
  r0 <- rbinom(replicates, n0, p0)/n0
  rrboot <- r1/r0
  rrbar <- mean(rrboot)
  alpha <- 1 - conf.level
  ci <- quantile(rrboot, c(alpha/2, 1-alpha/2))
  #### collect results
  list(x = c(x1, x0),
       risks = c(p1 = p1, p0 = p0),
       risk.ratio = RR,
       rrboot.mean = rrbar,
       conf.int = unname(ci),
       conf.level = conf.level,
       replicates = replicates)
}
#### test function
xx1 <- c(178, 1589) # CHD among Type A
xx0 <- c(79, 1565) # CHD among Type B
rr.boot(xx1, xx0)
```

```
#> $x
#> [1]  178 1589   79 1565
#>
#> $risks
#> p1 p0
#> 0.11202014 0.05047923
#>
#> $risk.ratio
#> [1] 2.219133
#>
#> $rrboot.mean
#> [1] 2.251769
#>
#> $conf.int
#> [1] 1.745941 2.919514
#>
#> $conf.level
#> [1] 0.95
#>
#> $replicates
#> [1] 5000
```

## 5.8 Exercises

Probability distributions are either continuous (e.g., normal distribution) or discrete (e.g. binomial distribution). In R, we deal with four aspects of probability distributions (Table 5.2).

Aspect of a probability distribution | Normal | Binomial
---|---|---
probability density function | `dnorm` | `dbinom`
cumulative distribution function | `pnorm` | `pbinom`
quantile (“percentile”) function | `qnorm` | `qbinom`
random sample from distribution | `rnorm` | `rbinom`

For example, to get the quantile value from a standard normal
distribution (Z value) that corresponds to a 95% confidence interval,
we use the `qnorm` function.
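For a 95% confidence interval, the upper quantile is:

```r
qnorm((1 + 0.95)/2)  # = qnorm(0.975)
#> [1] 1.959964
```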

`#> [1] 1.959964`

For a second example, suppose we wish to simulate a binomial
distribution. Out of 350 patients that adhered to a new drug
treatment, 273 recovered. The response proportion was 273/350 or 0.78
(78%). We can use the `rbinom` function to simulate this binomial
process. Let’s simulate this 10 times (i.e., 10 replicates).

```
replicates <- 10
num_of_events <- 273
num_at_risk <- 350
risk <- num_of_events/num_at_risk
sim_events <- rbinom(n = replicates, size = num_at_risk, prob = risk)
sim_events
```

`#> [1] 277 280 270 264 268 263 254 289 257 273`
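The second output below (the simulated proportions) was presumably produced by dividing the simulated counts by the number at risk, using the objects from the block above:

```r
round(sim_events / num_at_risk, 3)
```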

`#> [1] 0.791 0.800 0.771 0.754 0.766 0.751 0.726 0.826 0.734 0.780`

Bootstrapping is a resampling method to estimate a confidence
interval. For example, we could simulate 5000 replicates and use the
`quantile` function to estimate a 0.95 confidence interval.

```
replicates <- 5000
num_of_events <- 273 #recovered
num_at_risk <- 350
risk <- num_of_events/num_at_risk
sim_events <- rbinom(n = replicates, size = num_at_risk, prob = risk)
conf.level <- 0.95
alpha <- 1 - conf.level
## 0.95 CI for binomial counts
quantile(sim_events, c(alpha/2, 1-alpha/2))
```

```
#> 2.5% 97.5%
#> 258 288
```
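The second output (a confidence interval for the proportion rather than the count) was presumably obtained by dividing the count quantiles by the number at risk, using the objects from the block above:

```r
## 0.95 CI for the binomial proportion
quantile(sim_events, c(alpha/2, 1 - alpha/2)) / num_at_risk
```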

```
#> 2.5% 97.5%
#> 0.7371429 0.8228571
```

**Exercise 5.1 (Bootstrap a confidence interval)** In the same study above, 350 patients did not take the new drug treatment, and 289 recovered. Calculate the bootstrap confidence interval for this proportion (289/350).

**Exercise 5.2 (Function for bootstrapping a confidence interval)** Adapt your code from the previous problem to write and test a function for calculating the bootstrap confidence interval for a binomial proportion.

For example, if your previous code was

then, your function and test might be

`#> [1] 64`

**Exercise 5.3 (Calculating risks and risk ratio)** Consider Table 5.3 for cohort (risk) data.

Outcome | Exposed: Yes | Exposed: No | Total
---|---|---|---
Yes | \(a\) | \(b\) | \(M_1\)
No | \(c\) | \(d\) | \(M_0\)
Total | \(N_1\) | \(N_0\) | \(T\)

The risk ratio is calculated by

\[ RR = \frac{a/N_1}{b/N_0} = \frac{R_1}{R_0}, \]

and the standard error is calculated by

\[ SE[\log(RR)] = \sqrt{\frac{1}{a}-\frac{1}{N_1}+\frac{1}{b}-\frac{1}{N_0}}. \]

Consider an observational study where 700 patients were given access to a new drug for an ailment. A total of 350 patients chose to take the drug and 350 patients did not. Table 5.4 displays the results of this study.

Recovered | Treatment: Yes | Treatment: No | Total
---|---|---|---
Yes | 273 | 289 | 562
No | 77 | 61 | 138
Total | 350 | 350 | 700

Using the results from Table 5.4 create and test a function that:

- Calculates the risks \(R_1\) and \(R_0\).
- Calculates the risk ratio (\(RR\)).
- Calculates 95% confidence intervals.
- Uses the `fisher.test` function (Fisher’s Exact Test) to calculate the two-sided \(p\)-value of the null hypothesis of no association. (Hint: see the `epitable` function in Exercise 5.5.)
- Collects and returns all the results as a list.

**Exercise 5.4 (Evaluating drug treatment)** Consider an observational study where 700 patients were given access to a new drug for an ailment. A total of 350 patients chose to take the drug and 350 patients did not. Table 5.4 displays the results of this study.

Using the `read.csv` function, import the data frame for Table 5.4.
Use the code below.

Analyze the `drugrx` data frame using the `xtabs` function, your
function from the previous problem, and any R functions you wish. Is
the drug treatment effective? Is there a difference in drug treatment
effectiveness comparing men and women?

**Exercise 5.5 (Oswego outbreak investigation)** Study this clean
version of the Oswego data set and import it as a data frame:
`~/data/oswego/oswego2.txt`. Remember this data set is available from
https://github.com/taragonmd/data. Locate the data set and select the
“Raw” tab. Copy the URL and use it with the `read.csv` function.
Import this clean version.

```
oswego2 <- read.csv("~/data/oswego/oswego2.txt", strip.white = TRUE)
food.item <- c("Baked.ham", "Spinach", "Mashed.potato", "Cabbage.salad",
"Jello", "Rolls", "Brown.bread", "Milk", "Coffee", "Water", "Cakes",
"Vanilla.ice.cream", "Chocolate.ice.cream", "Fruit.salad")
names(oswego2)[9:22] <- food.item
str(oswego2)
```

```
#> 'data.frame': 75 obs. of 22 variables:
#> $ ID : int 2 3 4 6 7 8 9 10 14 16 ...
#> $ Age : int 52 65 59 63 70 40 15 33 10 32 ...
#> $ Sex : Factor w/ 2 levels "F","M": 1 2 1 1 2 1 1 1 2 1 ...
#> $ MealDate : Factor w/ 1 level "04/18/1940": 1 1 1 1 1 1 1 1 1 1 ...
#> $ MealTime : Factor w/ 6 levels "11:00:00","18:30:00",..: 5 2 2 4 4 4 6 3 4 NA ...
#> $ Ill : Factor w/ 2 levels "N","Y": 2 2 2 2 2 2 2 2 2 2 ...
#> $ OnsetDate : Factor w/ 2 levels "4/18/1940","4/19/1940": 2 2 2 1 1 2 2 1 2 2 ...
#> $ OnsetTime : Factor w/ 17 levels "00:00:00","00:30:00",..: 2 2 2 15 15 4 3 16 4 7 ...
#> $ Baked.ham : Factor w/ 2 levels "N","Y": 2 2 2 2 2 1 1 2 1 2 ...
#> $ Spinach : Factor w/ 2 levels "N","Y": 2 2 2 2 2 1 1 2 1 2 ...
#> $ Mashed.potato : Factor w/ 2 levels "N","Y": 2 2 1 1 2 1 1 2 1 1 ...
#> $ Cabbage.salad : Factor w/ 2 levels "N","Y": 1 2 1 2 1 1 1 1 1 1 ...
#> $ Jello : Factor w/ 2 levels "N","Y": 1 1 1 2 2 1 1 1 1 1 ...
#> $ Rolls : Factor w/ 2 levels "N","Y": 2 1 1 1 2 1 1 2 1 2 ...
#> $ Brown.bread : Factor w/ 2 levels "N","Y": 1 1 1 1 2 1 1 2 1 1 ...
#> $ Milk : Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 1 1 ...
#> $ Coffee : Factor w/ 2 levels "N","Y": 2 2 2 1 2 1 1 1 1 2 ...
#> $ Water : Factor w/ 2 levels "N","Y": 1 1 1 2 2 1 1 2 1 1 ...
#> $ Cakes : Factor w/ 2 levels "N","Y": 1 1 2 1 1 1 2 1 1 2 ...
#> $ Vanilla.ice.cream : Factor w/ 2 levels "N","Y": 2 2 2 2 2 2 1 2 2 2 ...
#> $ Chocolate.ice.cream: Factor w/ 2 levels "N","Y": 1 2 2 1 1 2 2 2 2 2 ...
#> $ Fruit.salad : Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 1 1 ...
```

Study the following `epitable` function, especially the use of the
`...` argument to pass options to the `fisher.test` function.

```
epitable <- function(exposure, outcome, digits = 3, ...){
  # Updated 2016-10-16
  # 2x2 table from 'xtabs' function
  # row names are exposure status
  # col names are illness outcome
  # digits = significant digits (default = 3)
  # ... = options to fisher.test function
  x <- xtabs(~ exposure + outcome, na.action = na.pass)
  rowtot <- apply(x, 1, sum)
  rowdist <- sweep(x, 1, rowtot, '/')
  risk0 <- rowdist['N','Y']
  risk1 <- rowdist['Y','Y']
  rr <- risk1/risk0
  or <- (risk1/(1-risk1)) / (risk0/(1-risk0))
  ft <- fisher.test(x, ...)
  pv <- ft$p.value
  results <- signif(c(risk0, risk1, rr, or, pv), digits = digits)
  names(results) <- c('R_0', 'R_1', 'RR', 'OR', 'p.value')
  list(data = x, results = results)
}
```

Here is an example of testing one food item:
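The call itself is not shown; the printed results match the `Baked.ham` row of the table in Exercise 5.7, so (with `oswego2` imported as above) it was presumably:

```r
epitable(oswego2$Baked.ham, oswego2$Ill)
```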

```
#> $data
#> outcome
#> exposure N Y
#> N 12 17
#> Y 17 29
#>
#> $results
#> R_0 R_1 RR OR p.value
#> 0.586 0.630 1.080 1.200 0.809
```

Use the `epitable` function to test each food item to see if it is
associated with developing illness. Construct a table with the
following column headings:

- Food item
- \(R_0\)
- \(R_1\)
- \(RR\) (risk ratio)
- \(OR\) (odds ratio)
- Fisher \(p\) value

Discuss your findings.

**Exercise 5.6 (Generate outbreak investigation results table)**
Write a `for` loop to automate the creation of the outbreak table
from the previous exercise. A hint is provided:

```
outbreak <- matrix(NA, length(food.item), 5)
rownames(outbreak) <- food.item
colnames(outbreak) <- c('R_0', 'R_1', 'RR', 'OR', 'P value')
for (i in 1:length(food.item)) {
  outbreak[i, ] <- epitable(oswego2[, food.item[i]], oswego2$Ill)$results
}
outbreak
```

**Exercise 5.7 (Comparing lapply and sapply)** Study the following
use of `lapply` and `sapply`. Explain how both functions work, and
what their differences are.
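A sketch consistent with the two outputs below, assuming `oswego2` and `food.item` from Exercise 5.5 (the transpose is an assumption, since the matrix output has food items as rows):

```r
## lapply returns a list, one element per food item
lapply(oswego2[, food.item], function(f) epitable(f, oswego2$Ill)$results)
## sapply simplifies to a matrix; t() puts food items in rows
t(sapply(oswego2[, food.item], function(f) epitable(f, oswego2$Ill)$results))
```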
```
#> $Baked.ham
#> R_0 R_1 RR OR p.value
#> 0.586 0.630 1.080 1.200 0.809
#>
#> $Spinach
#> R_0 R_1 RR OR p.value
#> 0.625 0.605 0.967 0.918 1.000
#>
#> $Mashed.potato
#> R_0 R_1 RR OR p.value
#> 0.622 0.622 1.000 1.000 1.000
#>
#> $Cabbage.salad
#> R_0 R_1 RR OR p.value
#> 0.596 0.643 1.080 1.220 0.808
#>
#> $Jello
#> R_0 R_1 RR OR p.value
#> 0.577 0.696 1.210 1.680 0.442
#>
#> $Rolls
#> R_0 R_1 RR OR p.value
#> 0.658 0.568 0.863 0.682 0.482
#>
#> $Brown.bread
#> R_0 R_1 RR OR p.value
#> 0.583 0.667 1.140 1.430 0.622
#>
#> $Milk
#> R_0 R_1 RR OR p.value
#> 0.620 0.500 0.807 0.614 0.638
#>
#> $Coffee
#> R_0 R_1 RR OR p.value
#> 0.614 0.613 0.999 0.997 1.000
#>
#> $Water
#> R_0 R_1 RR OR p.value
#> 0.647 0.542 0.837 0.645 0.450
#>
#> $Cakes
#> R_0 R_1 RR OR p.value
#> 0.543 0.675 1.240 1.750 0.342
#>
#> $Vanilla.ice.cream
#> R_0 R_1 RR OR p.value
#> 1.43e-01 7.96e-01 5.57e+00 2.35e+01 2.60e-07
#>
#> $Chocolate.ice.cream
#> R_0 R_1 RR OR p.value
#> 0.7410 0.5320 0.7180 0.3980 0.0891
#>
#> $Fruit.salad
#> R_0 R_1 RR OR p.value
#> 0.609 0.667 1.100 1.290 1.000
```

```
#> R_0 R_1 RR OR p.value
#> Baked.ham 0.586 0.630 1.080 1.200 8.09e-01
#> Spinach 0.625 0.605 0.967 0.918 1.00e+00
#> Mashed.potato 0.622 0.622 1.000 1.000 1.00e+00
#> Cabbage.salad 0.596 0.643 1.080 1.220 8.08e-01
#> Jello 0.577 0.696 1.210 1.680 4.42e-01
#> Rolls 0.658 0.568 0.863 0.682 4.82e-01
#> Brown.bread 0.583 0.667 1.140 1.430 6.22e-01
#> Milk 0.620 0.500 0.807 0.614 6.38e-01
#> Coffee 0.614 0.613 0.999 0.997 1.00e+00
#> Water 0.647 0.542 0.837 0.645 4.50e-01
#> Cakes 0.543 0.675 1.240 1.750 3.42e-01
#> Vanilla.ice.cream 0.143 0.796 5.570 23.500 2.60e-07
#> Chocolate.ice.cream 0.741 0.532 0.718 0.398 8.91e-02
#> Fruit.salad 0.609 0.667 1.100 1.290 1.00e+00
```

**Exercise 5.8 (Lexical scoping)** Study the following R code and
explain the final answer for `g(2)`:

`#> [1] 4`

**Exercise 5.9 (More lexical scoping)** Study the following R code
and explain the final answer for `g(2)`:

`#> [1] 5`

### References

17. Selvin S. Modern applied biostatistical methods using s-plus. New York: Oxford University Press; 1998.