Population Health Data Science with R

3.0.1

From the RStudio main menu, select ‘File’ > ‘New Project’ > ‘New Directory’ > ‘Empty Project’. Name the new directory ph251d-homework. Use R to display the file path to the working directory.
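
As a minimal sketch, base R's getwd function returns the current working directory. The comparison with the project name assumes the project was created and opened as described above.

```r
# Display the full path of the current working directory
wd <- getwd()
wd

# Last component of the path; this would be "ph251d-homework"
# only if the RStudio project described above is open (an assumption)
basename(wd)
```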

3.0.2

Recreate Table 3.1 using any combination of the matrix, cbind, rbind, dimnames, or names functions.

TABLE 3.1: Risk of Death in a 20-year Period Among Women in Whickham, England, According to Smoking Status at the Beginning of the Period (Rothman 2012)
Vital Status	Smoking
	Yes	No
Dead	139	230
Alive	443	502

3.0.2.1 Solution

tab <- matrix(c(139, 443, 230, 502), nrow = 2, ncol = 2,
        dimnames = list("Vital Status" = c("Dead", "Alive"),
                        Smoking = c("Yes", "No"))); tab
#### equivalent
tab <- matrix(c(139, 443, 230, 502), 2, 2)
dimnames(tab) <- list("Vital Status" = c("Dead", "Alive"),
                  Smoking = c("Yes", "No")); tab
#### equivalent
tab <- matrix(c(139, 443, 230, 502), 2, 2)
rownames(tab) <- c("Dead", "Alive")
colnames(tab) <- c("Yes", "No")
names(dimnames(tab)) <- c("Vital Status", "Smoking"); tab
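
Since the exercise also lists rbind, one more possible variant is sketched here. The object name tab_rb is illustrative and not part of the original solution.

```r
# Build the table row by row; the argument names become the row names
tab_rb <- rbind(Dead = c(139, 230), Alive = c(443, 502))
colnames(tab_rb) <- c("Yes", "No")
names(dimnames(tab_rb)) <- c("Vital Status", "Smoking")
tab_rb
```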

3.0.3

Starting with the 2x2 matrix object we created for Table 3.1, and using any combination of the apply, cbind, rbind, names, and dimnames functions, recreate Table 3.2.

TABLE 3.2: Risk of Death in a 20-year Period Among Women in Whickham, England, According to Smoking Status at the Beginning of the Period
Vital Status	Smoking
	Yes	No	Total
Dead	139	230	369
Alive	443	502	945
Total	582	732	1314

3.0.3.1 Solution

Using the tab object from the previous solution, study and practice the following R code to recreate Table 3.2.

rowt <- apply(tab, 1, sum)
tab2 <- cbind(tab, Total = rowt)
colt <- apply(tab2, 2, sum)
tab2 <- rbind(tab2, Total = colt)
names(dimnames(tab2)) <- c("Vital Status", "Smoking"); tab2
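
As a quick check of the margins, the base R functions rowSums and colSums give the same totals as apply with sum. This is only a sketch; the names row_totals, col_totals, and grand_total are illustrative.

```r
# Recreate the 2x2 table so that this example is self-contained
tab <- matrix(c(139, 443, 230, 502), 2, 2,
              dimnames = list("Vital Status" = c("Dead", "Alive"),
                              Smoking = c("Yes", "No")))
row_totals <- rowSums(tab)   # Dead = 369, Alive = 945
col_totals <- colSums(tab)   # Yes = 582, No = 732
grand_total <- sum(tab)      # 1314
row_totals; col_totals; grand_total
```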

3.0.4

Using the \(2 \times 2\) data from Table 3.1, use the sweep and apply functions to calculate row marginal, column marginal, and joint distributions (i.e., three tables).

3.0.4.1 Solution

Study and execute the following R code:

rowt <- apply(tab, 1, sum)               # row distrib
rowd <- sweep(tab, 1, rowt, "/"); rowd
colt <- apply(tab, 2, sum)               # col distrib
cold <- sweep(tab, 2, colt, "/"); cold
jtd <- tab/sum(tab); jtd                 # joint distrib
distr <- list(row.distribution = rowd, col.distribution = cold, 
              joint.distribution = jtd); distr
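
For comparison, base R's prop.table function should give the same three distributions without an explicit sweep. This is a sketch of an alternative, not the approach the exercise asks for; the names rowd2, cold2, and jtd2 are illustrative.

```r
# Recreate the 2x2 table so that this example is self-contained
tab <- matrix(c(139, 443, 230, 502), 2, 2,
              dimnames = list("Vital Status" = c("Dead", "Alive"),
                              Smoking = c("Yes", "No")))
rowd2 <- prop.table(tab, 1)   # each row sums to 1
cold2 <- prop.table(tab, 2)   # each column sums to 1
jtd2 <- prop.table(tab)       # all cells sum to 1
rowd2; cold2; jtd2
```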

3.0.5

Using the data from the previous problems, recreate Table 3.3 and interpret the results.

TABLE 3.3: Risk Ratio and Odds Ratio of Death in a 20-year Period Among Women in Whickham, England, According to Smoking Status at the Beginning of the Period
	Smoking
	Yes	No
Risk	0.24	0.31
Risk Ratio	0.76	1.00
Odds	0.31	0.46
Odds Ratio	0.68	1.00

3.0.5.1 Solution

Using the tab2 object from the previous solution, study and practice the following R code to recreate Table 3.3. Note that the column distributions could also have been used.

risk <- tab2[1, 1:2]/tab2[3, 1:2]
risk.ratio <- risk/risk[2]
odds <- risk/(1 - risk)
odds.ratio <- odds/odds[2]
rbind(risk, risk.ratio, odds, odds.ratio) # no rounding
round(rbind(risk, risk.ratio, odds, odds.ratio), 2)

Interpretation: The risk of death among non-smokers (0.31) is higher than the risk of death among smokers (0.24), giving a risk ratio of 0.76. This counterintuitive result suggests that there may be some confounding.
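
The remark that the column distributions could also have been used can be sketched as follows. The object name tab3 is illustrative; the table is rebuilt here so the example runs on its own.

```r
# Recreate the 2x2 table so that this example is self-contained
tab <- matrix(c(139, 443, 230, 502), 2, 2,
              dimnames = list("Vital Status" = c("Dead", "Alive"),
                              Smoking = c("Yes", "No")))
cold <- sweep(tab, 2, colSums(tab), "/")   # column distribution
risk <- cold["Dead", ]                     # proportion dead in each smoking group
odds <- cold["Dead", ] / cold["Alive", ]   # dead:alive ratio in each smoking group
tab3 <- round(rbind(risk, risk.ratio = risk / risk["No"],
                    odds, odds.ratio = odds / odds["No"]), 2)
tab3
```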

3.0.6

Read in the Whickham, England data using the R code below. Stratified by age category, calculate the risk of death comparing smokers to non-smokers. Show your results. What is your interpretation?

wdat <- read.table("~/git/phds/data/whickham.txt", sep = ",", header = TRUE)
str(wdat)
xtabs(~Vital.Status + Age + Smoking, data = wdat)

3.0.6.1 Solution

Implement the analysis below:

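wdat <- read.csv("~/git/phds/data/whickham.txt", header = TRUE)
str(wdat)
#### 3-way table of counts: Vital.Status x Age x Smoking
wdat.vas <- xtabs(~Vital.Status + Age + Smoking, data = wdat)
wdat.vas
#### totals within each Age x Smoking stratum
wdat.tot.vas <- apply(wdat.vas, c(2, 3), sum)
#### divide each count by its stratum total to get proportions
wdat.risk.vas <- sweep(wdat.vas, c(2, 3), wdat.tot.vas, "/")
round(wdat.risk.vas, 2)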

Interpretation: Within age strata, the risk of death is not larger among non-smokers; in fact, it is larger among smokers in the older age groups.
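
To compare smokers with non-smokers within each age stratum, the age-specific risks can be turned into risk ratios. The sketch below shows only the mechanics on a small array of hypothetical counts; these are not the Whickham data, and the level names "Young" and "Old" as well as the object names toy, tot, and rr are made up for illustration.

```r
# Hypothetical counts (NOT the Whickham data), laid out like the xtabs result:
# Vital.Status x Age x Smoking
toy <- array(c(2, 98, 10, 40,    # Smoking = Yes: Dead, Alive for Young, then Old
               1, 99,  8, 42),   # Smoking = No:  Dead, Alive for Young, then Old
             dim = c(2, 2, 2),
             dimnames = list(Vital.Status = c("Dead", "Alive"),
                             Age = c("Young", "Old"),
                             Smoking = c("Yes", "No")))
tot <- apply(toy, c(2, 3), sum)                     # totals per Age x Smoking stratum
risk <- sweep(toy, c(2, 3), tot, "/")["Dead", , ]   # risk of death, Age x Smoking
rr <- risk[, "Yes"] / risk[, "No"]                  # age-specific risk ratios
round(risk, 2); round(rr, 2)
```

With the real data, the same three steps would apply to wdat.vas, provided its first dimension has a level named "Dead" (an assumption about the file's coding).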