Chapter 3 Working with lists and data frames

3.1 A list is a collection of like or unlike data objects

3.1.1 Understanding lists

Up to now, we have been working with atomic data objects (vector, matrix, array). In contrast, lists, data frames, and functions are recursive data objects. Recursive data objects have more flexibility in combining diverse data objects into one object. A list provides the most flexibility. Think of a list object as a collection of “bins” that can contain any R object. Lists are very useful for collecting results of an analysis or a function into one data object where all its contents are readily accessible by indexing.

Schematic representation of a list of length four.

Figure 3.1: Schematic representation of a list of length four.

Figure 3.1 is a schematic representation of a list of length four. The first bin [1] contains a smiling face [[1]], the second bin [2] contains a flower [[2]], the third bin [3] contains a lightning bolt [[3]], and the fourth bin [[4]] contains a heart [[4]]. When indexing a list object, single brackets [\(\cdot\)] indexes the bin, and double brackets [[\(\cdot\)]] indexes the bin contents. If the bin has a name, then $name also indexes the contents.

For example, using the UGDP clinical trial data, suppose we perform Fisher’s exact test for testing the null hypothesis of independence of rows and columns in a contingency table with fixed marginals.

udat <- read.csv("~/git/phds/data/ugdp.txt")
tab <- xtabs(~Status+Treatment, data = udat)[,2:1]; tab
##           Treatment
## Status     Tolbutamide Placebo
##   Death             30      21
##   Survivor         174     184
ftab <- fisher.test(tab); ftab
## 
##  Fisher's Exact Test for Count Data
## 
## data:  tab
## p-value = 0.1813
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.8013768 2.8872863
## sample estimates:
## odds ratio 
##   1.509142

The default display only shows partial results. The total results are stored in the object ftab. Let’s evaluate the structure of ftab and extract some results:

str(ftab)
## List of 7
##  $ p.value    : num 0.181
##  $ conf.int   : atomic [1:2] 0.801 2.887
##   ..- attr(*, "conf.level")= num 0.95
##  $ estimate   : Named num 1.51
##   ..- attr(*, "names")= chr "odds ratio"
##  $ null.value : Named num 1
##   ..- attr(*, "names")= chr "odds ratio"
##  $ alternative: chr "two.sided"
##  $ method     : chr "Fisher's Exact Test for Count Data"
##  $ data.name  : chr "tab"
##  - attr(*, "class")= chr "htest"
ftab$estimate
## odds ratio 
##   1.509142
ftab$conf.int
## [1] 0.8013768 2.8872863
## attr(,"conf.level")
## [1] 0.95
ftab$p.value
## [1] 0.1812576

Using the str function to evaluate the structure of an output object is a common method employed to extract additional results for display or further analysis. In this case, ftab was a list with 7 bins, each with a name.

3.1.2 Creating lists

To create a list directly, use the list function. A list is a convenient method to save results in our customized functions. Table 3.1 summarizes how to create lists.

Table 3.1: Common ways of creating a list
Function Description** Try these examples
list Creates list object x <- 1:3
y <- matrix(c("a","c","b","d"), 2,2)
z <- c("Pedro", "Paulo", "Maria")
mm <- list(x, y, z); mm
data.frame List in tabular format where each “bin” has a vector of same length x <- data.frame(id = 1:3, sex = c("M", "F", "T")); x
mode(x)
class(x)
as.data.frame Coerces data object into a data frame x <- matrix(1:6, 2, 3); x
y <- as.data.frame(x); y
read.table Reads ASCII text file into data frame object TODO
read.csv TODO
read.delim TODO
read.fmf TODO
vector Creates empty list of length \(n\) vector("list", 2)
as.list Coercion into list object list(1:2) # compare to as.list
as.list(1:2)

For example, here’s a function to calculate an odds ratio from a \(2 \times 2\) table:

orcalc <- function(x){
    or <- (x[1,1] * x[2,2]) /(x[1,2] * x[2,1])
    pval <- fisher.test(x)$p.value
    list(data = x, odds.ratio = or, p.value = pval)
}

The orcalc function has been loaded in R, and now we run the function on the UGDP data.

tab # display 2x2 table
##           Treatment
## Status     Tolbutamide Placebo
##   Death             30      21
##   Survivor         174     184
orcalc(tab) # run function
## $data
##           Treatment
## Status     Tolbutamide Placebo
##   Death             30      21
##   Survivor         174     184
## 
## $odds.ratio
## [1] 1.510673
## 
## $p.value
## [1] 0.1812576

For additional practice, study and implement the examples in Table 3.1.

3.1.3 Naming lists

Table 3.2: Common ways of naming lists
Function Comment Try these examples
names Name after creation of list z <- list(rnorm(20), 'Luis', 1:3); z
names(z) <- c('bin1', 'bin2', 'bin3'); z
Name at creation of list z <- list(bin1 = rnorm(20), bin2 = 'Luis', bin3 = 1:3)
names(z) # without assignment returns names

The components (or ‘bins’) of a list can be unnamed or named. Components of a list can be named at the time the list is created or later using the names function. For practice, try the examples in Table 3.2.

3.1.4 Indexing lists

Table 3.3: Common ways of indexing lists
Indexing Try these examples
By position z <- list(bin1 = rnorm(20), bin2 = 'Luis', bin3 = 1:3)
z[1] # indexes 'bin' # 1
z[[1]] # indexes *contents* of 'bin' #1
By name (if exists) z$bin1
z$bin2
By logical vector num <- sapply(z, is.numeric); num
z[num]

If list components (bins) are unnamed, we can index the list by bin position with single or double brackets. The single brackets [\(\cdot\)] indexes one or more bins, and the double brackets indexes contents of single bins only.

mylist1 <- list(1:5, matrix(1:4,2,2), c('Juan', 'Wilfredo'))
mylist1[c(1, 3)] # index bins 1 and 3
## [[1]]
## [1] 1 2 3 4 5
## 
## [[2]]
## [1] "Juan"     "Wilfredo"
mylist1[[3]]     # index contents of 3rd bin
## [1] "Juan"     "Wilfredo"

When list bins are named, we can index the bin contents by name. Using the matched case-control study infert data set, we will conduct a conditional logistic regression analysis to determine if spontaneous and induced abortions are independently associated with infertility. For this we’ll need to load the survival package which contains the clogit function.

library(survival)
data(infert)
names(infert)  # display field names
## [1] "education"      "age"            "parity"         "induced"       
## [5] "case"           "spontaneous"    "stratum"        "pooled.stratum"
mod1 <- clogit(case ~ spontaneous + induced + strata(stratum),
               data = infert)
mod1           # default display of model
## Call:
## clogit(case ~ spontaneous + induced + strata(stratum), data = infert)
## 
##              coef exp(coef) se(coef)    z       p
## spontaneous 1.986     7.285    0.352 5.63 1.8e-08
## induced     1.409     4.092    0.361 3.91 9.4e-05
## 
## Likelihood ratio test=53.1  on 2 df, p=2.87e-12
## n= 248, number of events= 83
is.list(mod1)
## [1] TRUE
names(mod1)    # names of list components
##  [1] "coefficients"      "var"               "loglik"           
##  [4] "score"             "iter"              "linear.predictors"
##  [7] "residuals"         "means"             "method"           
## [10] "n"                 "nevent"            "terms"            
## [13] "assign"            "wald.test"         "y"                
## [16] "formula"           "xlevels"           "call"             
## [19] "userCall"
mod1$coeff
## spontaneous     induced 
##    1.985876    1.409012

The stratum option is used to specify the field that identifies the cases and their matched controls. The mod1 object has a default display (shown). However, it is a list. We can use str to evaluate the list structure. Instead names was used to display the list component names. Sometimes it is more convenience to display the names for the list rather than the complete structure.

The summary function applied to a regression model object creates a list object with more detailed results. This too has a default display, or we can index list components by name.

summod1 <- summary(mod1)
summod1        # default display of summary results
## Call:
## coxph(formula = Surv(rep(1, 248L), case) ~ spontaneous + induced + 
##     strata(stratum), data = infert, method = "exact")
## 
##   n= 248, number of events= 83 
## 
##               coef exp(coef) se(coef)     z Pr(>|z|)    
## spontaneous 1.9859    7.2854   0.3524 5.635 1.75e-08 ***
## induced     1.4090    4.0919   0.3607 3.906 9.38e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##             exp(coef) exp(-coef) lower .95 upper .95
## spontaneous     7.285     0.1373     3.651    14.536
## induced         4.092     0.2444     2.018     8.298
## 
## Rsquare= 0.193   (max possible= 0.519 )
## Likelihood ratio test= 53.15  on 2 df,   p=2.869e-12
## Wald test            = 31.84  on 2 df,   p=1.221e-07
## Score (logrank) test = 48.44  on 2 df,   p=3.032e-11
names(summod1) # names of list components
##  [1] "call"         "fail"         "na.action"    "n"           
##  [5] "loglik"       "nevent"       "coefficients" "conf.int"    
##  [9] "logtest"      "sctest"       "rsq"          "waldtest"    
## [13] "used.robust"
summod1$coef
##                 coef exp(coef)  se(coef)        z     Pr(>|z|)
## spontaneous 1.985876  7.285423 0.3524435 5.634592 1.754734e-08
## induced     1.409012  4.091909 0.3607124 3.906191 9.376245e-05

3.1.5 Replacing lists components

Table 3.4: Common ways of replacing list components
Replacing Try these examples
By position z <- list(bin1 = rnorm(20), bin2 = "Luis", bin3 = 1:3)
z[1] <- list(c(2, 3, 4)) # replaces "bin" contents
z[[1]] <- c(2, 3, 4) \# replaces "bin" contents
By name (if exists) z$bin2 <- c("Tomas", "Luis", "Angela"); z
# replace name of specific "bin"
names(z)[2] <- "mykids"; z
By logical num <- sapply(z, is.numeric); num
z[num]<- list(rnorm(10), rnorm(10)); z

Replacing list components is accomplished by combining indexing with assignment. And of course, we can index by position, name, or logical. Remember, if it can be indexed, it can be replaced. Study and practice the examples in Table 3.4.

3.1.6 Operating on lists

Because lists can have complex structural components, there are not many operations we will want to do on lists. When we want to apply a function to each component (bin) of a list, we use the lapply or sapply function. These functions are identical except that sapply “simplifies” the final result, if possible.

Table 3.5: Common ways of operating on a list
Function Description Try these examples
lapply Applies function to each component of a list and returns a list x <- list(1:5, 6:10); x
lapply(x, mean)
sapply Applies a function to each component of a list and simplifies sapply(x, mean)
do.call Calls and applies a function to the list do.call(rbind, x)
mapply Applies a function to the first elements of each argument, the second elements, the third elements, and so on. x <- list(1:4, 1:4); x
y <- list(4, rep(4, 4)); y
mapply(rep, x, y, SIMPLIFY=FALSE)
mapply(rep, x, y)

The do.call function applies a function to the entire list using each each component as an argument. For example, consider a list where each bin contains a vector and we want to cbind the vectors.

mylist <- list(vec1 = 1:5, vec2 = 6:10, vec3 = 11:15)
cbind(mylist)          # will not work
##      mylist   
## vec1 Integer,5
## vec2 Integer,5
## vec3 Integer,5
do.call(cbind, mylist) # works
##      vec1 vec2 vec3
## [1,]    1    6   11
## [2,]    2    7   12
## [3,]    3    8   13
## [4,]    4    9   14
## [5,]    5   10   15

For additional practice, study and implements the examples in Table 3.5.

3.2 A data frame is a list in a 2-dimensional tabular form

A data frame is a list in 2-dimensional tabular form. Each list component (bin) is a data field of equal length. A data frame is a list that behaves like a matrix. Anything that can be done with lists can be done with data frames. Many operations that can be done with a matrix can be done with a data frame.

3.2.1 Understanding data frames and factors

Epidemiologists are familiar with tabular data sets where each row is a record and each column is a field. A record can be data collected on individuals or groups. We usually refer to the field name as a variable (e.g., age, gender, ethnicity). Fields can contain numeric or character data. In R, these types of data sets are handled by data frames. Each column of a data frame is usually either a factor or numeric vector, although it can have complex, character, or logical vectors. Data frames have the functionality of matrices and lists. For example, here is the first 10 rows of the infert data set, a matched case-control study published in 1976 that evaluated whether infertility was associated with prior spontaneous or induced abortions.

data(infert)
str(infert)
## 'data.frame':    248 obs. of  8 variables:
##  $ education     : Factor w/ 3 levels "0-5yrs","6-11yrs",..: 1 1 1 1 2 2 2 2 2 2 ...
##  $ age           : num  26 42 39 34 35 36 23 32 21 28 ...
##  $ parity        : num  6 1 6 4 3 4 1 2 1 2 ...
##  $ induced       : num  1 1 2 2 1 2 0 0 0 0 ...
##  $ case          : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ spontaneous   : num  2 0 0 0 1 1 0 0 1 0 ...
##  $ stratum       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ pooled.stratum: num  3 1 4 2 32 36 6 22 5 19 ...
infert[1:5, 1:6]
##   education age parity induced case spontaneous
## 1    0-5yrs  26      6       1    1           2
## 2    0-5yrs  42      1       1    1           0
## 3    0-5yrs  39      6       2    1           0
## 4    0-5yrs  34      4       2    1           0
## 5   6-11yrs  35      3       1    1           1

The fields are obviously vectors. Let’s explore a few of these vectors to see what we can learn about their structure in R.

infert$age[1:10]        # first 10 of age variable
##  [1] 26 42 39 34 35 36 23 32 21 28
mode(infert$age)
## [1] "numeric"
class(infert$age)
## [1] "numeric"
infert$stratum[1:10]    # first 10 of stratum variable
##  [1]  1  2  3  4  5  6  7  8  9 10
mode(infert$stratum)
## [1] "numeric"
class(infert$stratum)
## [1] "integer"
infert$education[1:6]   # first 10 of education variable
## [1] 0-5yrs  0-5yrs  0-5yrs  0-5yrs  6-11yrs 6-11yrs
## Levels: 0-5yrs 6-11yrs 12+ yrs
mode(infert$education)
## [1] "numeric"
class(infert$education)
## [1] "factor"

What have we learned so far? In the infert data frame, age is a vector of mode “numeric” and class “numeric,” stratum is a vector of mode `numeric'' and class "integer," andeducation` is a vector of mode “numeric” and class “factor.” The numeric vectors are straightforward and easy to understand. However, a factor, R’s representation of categorical data, is a bit more complicated.

Contrary to intuition, a factor is a numeric vector, not a character vector, although it may have been created from a character vector (shown later). To see the “true” education factor use the unclass function:

z <- unclass(infert$education); z[1:10]
##  [1] 1 1 1 1 2 2 2 2 2 2

Now let’s create a factor from a character vector and then unclass it:

cointoss <- sample(c('Head','Tail'), 100, replace = TRUE)
cointoss[1:7]
## [1] "Head" "Tail" "Tail" "Head" "Tail" "Tail" "Tail"
fct <- factor(cointoss); fct[1:7]
## [1] Head Tail Tail Head Tail Tail Tail
## Levels: Head Tail
unclass(fct)[1:7]
## [1] 1 2 2 1 2 2 2

Notice that we can still recover the original character vector using the as.character function:

as.character(cointoss)[1:7]
## [1] "Head" "Tail" "Tail" "Head" "Tail" "Tail" "Tail"
as.character(fct)[1:7]
## [1] "Head" "Tail" "Tail" "Head" "Tail" "Tail" "Tail"

Okay, let’s create an ordered factor; that is, levels of a categorical variable that have natural ordering. For this we set ordered = TRUE in the factor function:

samp <- sample(c('Low', 'Medium', 'High'), 100, replace = TRUE)
ofac1 <- factor(samp, ordered = TRUE)
ofac1[1:7]
## [1] Medium Medium Low    Low    High   Low    Low   
## Levels: High < Low < Medium
table(ofac1) # levels and labels not in natural order
## ofac1
##   High    Low Medium 
##     32     38     30

However, notice that the ordering was done in alphabetical order which is not what we want. To correct this, use the levels option in the factor function:

ofac2 <- factor(samp, levels = c('Low', 'Medium', 'High'), ordered = TRUE)
ofac2[1:7]
## [1] Medium Medium Low    Low    High   Low    Low   
## Levels: Low < Medium < High
table(ofac2)
## ofac2
##    Low Medium   High 
##     38     30     32

Great—this is exactly what we want! For review, Table 3.6 summarizes the variable types in epidemiology and how they are represented in R. Factors (unordered and ordered) are used to represent nominal and ordinal categorical data. The infert data set contains nominal factors and the esoph data set contains ordinal factors.

Table 3.6: Variable types in data and their representations in R data frames
R variable type Examples R mode R class Examples
Numeric
Continuous 3.45, 2/3 numeric numeric infert$age
Discrete 1, 2, 3, 4, … numeric integer infert$stratum
Categorical
Nominal male vs. female numeric factor infert$education
Ordinal low < medium < high numeric ordered factor esoph$agegp

3.2.2 Creating data frames

In the creation of data frames, character vectors (usually representing categorical data) are converted to factors (mode numeric, class factor), and numeric vectors are converted to numeric vectors of class numeric or class integer. Common ways of creating data frames are summarized in Table 3.7. Factors can be created directly from vectors.

wt <- c(59.5, 61.4, 45.2)
age <- c(11, 9, 6)
gender <- c('Male', 'Male', 'Female')
df <- data.frame(age, gender, wt); df
##   age gender   wt
## 1  11   Male 59.5
## 2   9   Male 61.4
## 3   6 Female 45.2
str(df)
## 'data.frame':    3 obs. of  3 variables:
##  $ age   : num  11 9 6
##  $ gender: Factor w/ 2 levels "Female","Male": 2 2 1
##  $ wt    : num  59.5 61.4 45.2
table(df$gender)
## 
## Female   Male 
##      1      2

Notice that the character vector was automatically converted to a factor. In this example, possible values of gender were derived from the data vector. However, if possible values of gender included “transgender” (although none were recorded), we can specify this using using levels option.

df$gender <- factor(df$gender, levels = c('Male', 'Female', 'Transgender'))
table(df$gender)
## 
##        Male      Female Transgender 
##           2           1           0

Factors allow us to keep track of possible values in categorical data even if specific values were not recorded in a given data set.

Table 3.7: Common ways of creating data frames
Function Description Try these examples
data.frame Data frames are of mode list x <- data.frame(id=1:2, sex=c('M','F'))
mode(x); x
Convert fully labeled array data(Titanic) # load data
data.frame(Titanic)
as.data.frame Coerces data object into a data frame x <- matrix(1:6, 2, 3); x
as.data.frame(x)
read.table Reads ASCII text file into data frame see help(read.table)
expand.grid Creates data frame from all combinations of provided vectors see help(expand.grid)

Sometimes we want to convert an array into a data frame for analysis or sharing. Consider the Titanic array data that come with R. We can convert an array to a data frame (first 6 rows shown):

data(Titanic)                          # load data 
Titanic[,,'Child',]                    # display chidren
## , , Survived = No
## 
##       Sex
## Class  Male Female
##   1st     0      0
##   2nd     0      0
##   3rd    35     17
##   Crew    0      0
## 
## , , Survived = Yes
## 
##       Sex
## Class  Male Female
##   1st     5      1
##   2nd    11     13
##   3rd    13     14
##   Crew    0      0
dft <- data.frame(Titanic); dft[1:6, ] # convert and display 
##   Class    Sex   Age Survived Freq
## 1   1st   Male Child       No    0
## 2   2nd   Male Child       No    0
## 3   3rd   Male Child       No   35
## 4  Crew   Male Child       No    0
## 5   1st Female Child       No    0
## 6   2nd Female Child       No    0

3.2.3 Naming data frames

Table 3.8: Common ways of naming data frames
Function Try these examples
names x<- data.frame(var1 = 1:3, var2 = c('M', 'F', 'F')); x
names(x) <- c('Subjno', 'Sex');
row.names row.names(x) <- c('Sub 1', 'Sub 2', 'Sub 3'); x

Everything that applies to naming list components (Table 3.2) also applies to naming data frame components (Table 3.8). In general, we may be interested in renaming variables (fields) or row names of a data frame, or renaming the levels (possible values) for a given factor (categorical variable). For example, consider the Oswego data set.

odat <- read.table('~/git/phds/data/oswego.txt', sep = '', header = TRUE, na.strings = '.') 
odat[1:5,1:8]              #Display partial data frame 
##   id age sex meal.time ill onset.date onset.time baked.ham
## 1  2  52   F   8:00 PM   Y       4/19   12:30 AM         Y
## 2  3  65   M   6:30 PM   Y       4/19   12:30 AM         Y
## 3  4  59   F   6:30 PM   Y       4/19   12:30 AM         Y
## 4  6  63   F   7:30 PM   Y       4/18   10:30 PM         Y
## 5  7  70   M   7:30 PM   Y       4/18   10:30 PM         Y
names(odat)[3] <- 'Gender' #Rename 'sex' to 'Gender'
table(odat$Gender)         #Display 'Gender' distribution
## 
##  F  M 
## 44 31
levels(odat$Gender)        #Display 'Gender' levels 
## [1] "F" "M"
 # Replace 'Gender' level labels
levels(odat$Gender) <- c('Female', 'Male')
levels(odat$Gender)        #Display new 'Gender' levels
## [1] "Female" "Male"
table(odat$Gender)         #Confirm distribution is same
## 
## Female   Male 
##     44     31
odat[1:5, 1:7]             #Display partial data frame 
##   id age Gender meal.time ill onset.date onset.time
## 1  2  52 Female   8:00 PM   Y       4/19   12:30 AM
## 2  3  65   Male   6:30 PM   Y       4/19   12:30 AM
## 3  4  59 Female   6:30 PM   Y       4/19   12:30 AM
## 4  6  63 Female   7:30 PM   Y       4/18   10:30 PM
## 5  7  70   Male   7:30 PM   Y       4/18   10:30 PM

On occasion, we might be interested in renaming the row names. Currently, the Oswego data set has default integer values from 1 to 75 as the row names.

row.names(odat)[1:10]
##  [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10"

We can change the row names by assigning a new character vector.

row.names(odat) <- sample(101:199, size = nrow(odat))
odat[1:5, 1:7]
##     id age Gender meal.time ill onset.date onset.time
## 102  2  52 Female   8:00 PM   Y       4/19   12:30 AM
## 122  3  65   Male   6:30 PM   Y       4/19   12:30 AM
## 153  4  59 Female   6:30 PM   Y       4/19   12:30 AM
## 162  6  63 Female   7:30 PM   Y       4/18   10:30 PM
## 166  7  70   Male   7:30 PM   Y       4/18   10:30 PM

3.2.4 Indexing data frames

Table 3.9: Common ways of indexing data frames
Indexing Try these examples
By position data(infert)
infert[1:5, 1:3]
By name infert[1:5, c("education", "age", "parity")]
By logical agelt30 <- infert$age < 30 # create logical vector
infert[agelt30, c("education", "induced", "parity")]
# can also use 'subset' function
subset(infert, agelt30, c("education", "induced", "parity"))

Indexing a data frame is similar to indexing a matrix or a list: we can index by position, by name, or by logical vector. Consider, for example, the 2004 California West ile virus human disease surveillance data. Suppose we are interested in summarizing the Los Angeles cases with neuroinvasive disease (“WNND”).

wdat <- read.csv("~/git/phds/data/wnv/wnv2004fin.txt")
str(wdat)
## 'data.frame':    779 obs. of  8 variables:
##  $ id         : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ county     : Factor w/ 23 levels "Butte","Fresno",..: 14 14 14 14 14 14 14 14 8 12 ...
##  $ age        : int  40 64 19 12 12 17 61 74 71 26 ...
##  $ sex        : Factor w/ 2 levels "F","M": 1 1 2 2 2 2 2 1 2 2 ...
##  $ syndrome   : Factor w/ 3 levels "Unknown","WNF",..: 2 2 2 2 2 2 3 3 2 3 ...
##  $ date.onset : Factor w/ 130 levels "2004-05-14","2004-05-16",..: 3 4 4 2 1 5 7 10 7 9 ...
##  $ date.tested: Factor w/ 104 levels "2004-06-02","2004-06-16",..: 1 2 2 2 2 3 4 5 6 6 ...
##  $ death      : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
levels(wdat$county)  # Review levels of 'county' variable
##  [1] "Butte"          "Fresno"         "Glenn"          "Imperial"      
##  [5] "Kern"           "Lake"           "Lassen"         "Los Angeles"   
##  [9] "Merced"         "Orange"         "Placer"         "Riverside"     
## [13] "Sacramento"     "San Bernardino" "San Diego"      "San Joaquin"   
## [17] "Santa Clara"    "Shasta"         "Sn Luis Obispo" "Tehama"        
## [21] "Tulare"         "Ventura"        "Yolo"
levels(wdat$syndrome)  # Review levels of 'syndrome' variable
## [1] "Unknown" "WNF"     "WNND"
myrows <- wdat$county=="Los Angeles" & wdat$syndrome=="WNND"
mycols <- c("id", "county", "age", "sex", "syndrome", "death")
wnv.la <- wdat[myrows, mycols]
wnv.la[1:6, ]
##    id      county age sex syndrome death
## 25 25 Los Angeles  70   M     WNND    No
## 26 26 Los Angeles  59   M     WNND    No
## 27 27 Los Angeles  59   M     WNND    No
## 47 47 Los Angeles  57   M     WNND    No
## 48 48 Los Angeles  60   M     WNND    No
## 49 49 Los Angeles  34   M     WNND    No

In this example, the data frame rows were indexed by logical vector, and the columns were indexed by names. We emphasize this method because it only requires application of previously learned principles that always work with R objects.

An alternative method is to use the subset function. The first argument specifies the data frame, the second argument is a Boolean operation that evaluates to a logical vector, and the third argument specifies what variables (or range of variables) to include or exclude.

wnv.sf2 <- subset(wdat, county == "Los Angeles" & 
                  syndrome == "WNND", select = c(id, county,
                                 age, sex, syndrome, death))
wnv.sf2[1:6,]
##    id      county age sex syndrome death
## 25 25 Los Angeles  70   M     WNND    No
## 26 26 Los Angeles  59   M     WNND    No
## 27 27 Los Angeles  59   M     WNND    No
## 47 47 Los Angeles  57   M     WNND    No
## 48 48 Los Angeles  60   M     WNND    No
## 49 49 Los Angeles  34   M     WNND    No

This example is equivalent but specifies range of variables using the : function:

wnv.sf3 <- subset(wdat, county == "Los Angeles" 
                  & syndrome == "WNND", 
                  select = c(id:syndrome, death))
wnv.sf3[1:6,]
##    id      county age sex syndrome death
## 25 25 Los Angeles  70   M     WNND    No
## 26 26 Los Angeles  59   M     WNND    No
## 27 27 Los Angeles  59   M     WNND    No
## 47 47 Los Angeles  57   M     WNND    No
## 48 48 Los Angeles  60   M     WNND    No
## 49 49 Los Angeles  34   M     WNND    No

This example is equivalent but specifies variables to exclude using the - function:

wnv.sf4 <- subset(wdat, county == "Los Angeles" 
                  & syndrome == "WNND", 
                  select = -c(date.onset, date.tested))
wnv.sf4[1:6, ]
##    id      county age sex syndrome death
## 25 25 Los Angeles  70   M     WNND    No
## 26 26 Los Angeles  59   M     WNND    No
## 27 27 Los Angeles  59   M     WNND    No
## 47 47 Los Angeles  57   M     WNND    No
## 48 48 Los Angeles  60   M     WNND    No
## 49 49 Los Angeles  34   M     WNND    No

The subset function offers some conveniences such as the ability to specify a range of fields to include using the : function, and to specify a group of fields to exclude using the - function.

3.2.5 Replacing data frame components

Table 3.10: Common ways of replacing data frame components
Replacing Try these examples
By position data(infert)
infert[1:4, 1:2]
infert[1:4, 2] <- c(NA, 45, NA, 23)
infert[1:4, 1:2]
By name names(infert)
infert[1:4, c("education", "age")]
infert[1:4, c("age")] <- c(NA, 45, NA, 23)
infert[1:4, c("education", "age")]
By logical table(infert$parity)
# change values of 5 or 6 to missing (NA)
infert$parity[infert$parity == 5 | infert$parity == 6] <- NA
table(infert$parity)
table(infert$parity, exclude = NULL)

With data frames, as with all R data objects, anything that can be indexed can be replaced. We already saw some examples of replacing names. For practice, study and implement the examples in Table 3.10.

3.2.6 Operating on data frames

Table 3.11: Common ways of operating on a data frame
Function Description Try these examples
‘tapply’ Apply a function to strata of a vector data(infert)
args(tapply) # display argument
tapply(infert$age, infert$education, mean, na.rm = TRUE)
‘lapply’ Apply a function to each component of list lapply(infert[,1:3], table)
‘sapply’ Apply a function to each component of list, and simplify sapply(infert[,c("age", "parity")], mean, na.rm = TRUE)
‘aggregate’ Split data into subsets, computes summary statistics for each aggregate(infert[, c("age", "parity")],
by = list(Education = infert$education,
Induced = infert$induced), mean)
‘mapply’ Apply a function to first elements of each argument, the second elements, the third elements, and so on df <- data.frame(var1 = 1:4, var2 = 4:1)
mapply("*", df$var1, df$var2)
mapply(c, df$var1, df$var2)
mapply(c, df$var1, df$var2, SIMPLIFY = F)

A data frame is of mode list, and functions that operate on components of a list will work with data frames. For example, consider the year California population estimates for the year 2000.

capop <- read.csv("~/git/phds/data/calpop/CalCounties2000.txt")
str(capop)
## 'data.frame':    11918 obs. of  11 variables:
##  $ County          : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Year            : int  2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 ...
##  $ Sex             : Factor w/ 2 levels "F","M": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Age             : int  0 1 2 3 4 5 6 7 8 9 ...
##  $ White           : int  2751 2673 2654 2646 2780 2826 2871 2942 3087 3070 ...
##  $ Hispanic        : int  2910 2837 2770 2798 2762 2738 2872 2837 2756 2603 ...
##  $ Asian           : int  1921 1820 1930 1998 2029 1998 1984 1993 1957 2031 ...
##  $ Pacific.Islander: int  63 70 64 64 85 93 78 94 85 83 ...
##  $ Black           : int  1346 1343 1442 1455 1558 1632 1657 1732 1817 1819 ...
##  $ American.Indian : int  32 37 37 39 38 45 53 51 43 44 ...
##  $ Multirace       : int  711 574 579 588 575 576 548 514 545 537 ...
capop[1:5, 1:8]
##   County Year Sex Age White Hispanic Asian Pacific.Islander
## 1      1 2000   F   0  2751     2910  1921               63
## 2      1 2000   F   1  2673     2837  1820               70
## 3      1 2000   F   2  2654     2770  1930               64
## 4      1 2000   F   3  2646     2798  1998               64
## 5      1 2000   F   4  2780     2762  2029               85

Now, suppose we want to assess the range of the numeric fields. If we treat the data frame as a list, both lapply or sapply works:

lapply(capop[-3], range)
## $County
## [1]  1 59
## 
## $Year
## [1] 2000 2000
## 
## $Age
## [1]   0 100
## 
## $White
## [1]      0 148246
## 
## $Hispanic
## [1]      0 127415
## 
## $Asian
## [1]     0 35240
## 
## $Pacific.Islander
## [1]    0 1066
## 
## $Black
## [1]     0 21865
## 
## $American.Indian
## [1]    0 1833
## 
## $Multirace
## [1]     0 11299

However, if we treat the data frame as a matrix, apply also works:

apply(capop[,-3], 2, range)
##      County Year Age  White Hispanic Asian Pacific.Islander Black
## [1,]      1 2000   0      0        0     0                0     0
## [2,]     59 2000 100 148246   127415 35240             1066 21865
##      American.Indian Multirace
## [1,]               0         0
## [2,]            1833     11299

Some R functions, such as summary, will summarize every variable in a data frame without having to use lapply or sapply.

summary(capop[,1:7])
##      County        Year      Sex           Age          White       
##  Min.   : 1   Min.   :2000   F:5959   Min.   :  0   Min.   :     0  
##  1st Qu.:15   1st Qu.:2000   M:5959   1st Qu.: 25   1st Qu.:   100  
##  Median :30   Median :2000            Median : 50   Median :   385  
##  Mean   :30   Mean   :2000            Mean   : 50   Mean   :  2693  
##  3rd Qu.:45   3rd Qu.:2000            3rd Qu.: 75   3rd Qu.:  1442  
##  Max.   :59   Max.   :2000            Max.   :100   Max.   :148246  
##     Hispanic            Asian        
##  Min.   :     0.0   Min.   :    0.0  
##  1st Qu.:    10.0   1st Qu.:    2.0  
##  Median :    76.0   Median :   16.0  
##  Mean   :  1859.9   Mean   :  628.7  
##  3rd Qu.:   589.8   3rd Qu.:  139.0  
##  Max.   :127415.0   Max.   :35240.0

3.2.6.1 The aggregate function

The aggregate function is almost identical to the tapply function. Recall that tapply allows us to apply a function to a vector that is stratified by one or more fields; for example, calculating mean age (1 field) stratified by sex and ethnicity (2 fields). In contrast, aggregate allows us to apply a function to a group of fields that are stratified by one or more fields; for example, calculating the mean weight and height (2 fields) stratified by sex and ethnicity (2 fields):

sex <- c("M", "M", "M", "M", "F", "F", "F", "F")
eth <- c("W", "W", "B", "B", "W", "W", "B", "B")
wgt <- c(140, 150, 150, 160, 120, 130, 130, 140)
hgt <- c( 60,  70,  70,  80,  40,  50,  50,  60)
df <- data.frame(sex, eth, wgt, hgt)
aggregate(df[, 3:4], by = list(Gender = df$sex, Ethnicity = df$eth), FUN = mean)
##   Gender Ethnicity wgt hgt
## 1      F         B 135  55
## 2      M         B 155  75
## 3      F         W 125  45
## 4      M         W 145  65

For another example, in the capop data frame, we notice that the variable age goes from 0 to 100 by 1-year intervals. It will be useful to aggregate ethnic-specific population estimates into larger age categories. More specifically, we want to calculate the sum of ethnic-specific population estimates (6 fields) stratified by age category, sex, and year (3 fields). We will create a new 7-level age category field commonly used by the National Center for Health Statistics. Naturally, we use the aggregate function:

to.keep <- c("White", "Hispanic", "Asian", "Pacific.Islander", 
             "Black", "American.Indian", "Multirace") 
age.nchs7 <- c(0, 1, 5, 15, 25, 45, 65, 101)
capop$agecat7 <- cut(capop$Age, breaks = age.nchs7, right = FALSE)
capop7 <- aggregate(capop[,to.keep], by = list(Age = capop$agecat7, Sex = capop$Sex,
                   Year = capop$Year), FUN = sum) 
levels(capop7$Age)[7] <- "65+"
capop7[1:14, 1:6]
##        Age Sex Year   White Hispanic   Asian
## 1    [0,1)   F 2000  151238   231822   41758
## 2    [1,5)   F 2000  627554   929222  172296
## 3   [5,15)   F 2000 1849860  2249146  482094
## 4  [15,25)   F 2000 1737534  1893896  545692
## 5  [25,45)   F 2000 4720500  3484732 1335912
## 6  [45,65)   F 2000 4204180  1470124  890078
## 7      65+   F 2000 2943684   559730  417132
## 8    [0,1)   M 2000  159360   243170   43930
## 9    [1,5)   M 2000  662386   968136  182746
## 10  [5,15)   M 2000 1958466  2350768  515148
## 11 [15,25)   M 2000 1850710  2161736  558628
## 12 [25,45)   M 2000 4930388  3843792 1229216
## 13 [45,65)   M 2000 4149666  1375098  768022
## 14     65+   M 2000 2150452   404598  309932

3.3 Managing data objects

Table 3.12: Common ways of managing data objects
Function Description Try these examples
‘ls’, ‘objects’ List objects ls()
objects() # equivalent
‘rm’, ‘remove’ Remove object(s) yy <- 1:5; ls()
rm(yy); ls() # remove objects in workspace
rm(list = ls()) # Don't unless really sure
‘save.image’ Saves current workspace save.image()
‘save’ Writes R objects to external file. x <- runif(20)
‘load’ Loads R objects previously saved y <- list(a=1, b=TRUE, c="oops")
save(x, y, file = "c:/temp/xy.Rdata")
rm(x,y); x; y
load(file = "c:/temp/xy.Rdata")
x; y

When we work in R we have a workspace. Think of the workspace as our desktop that contains the “objects” (data and tools) we use to conduct our work. To view all the objects in our workspace use ls() or objects() (Table 3.12). To view objects that match a pattern use the pattern option:

ls(pattern = "obj")
## [1] "obj1" "obj2" "obj3" "obj4" "obj5"

The rm or remove functions will remove workspace objects.

rm(obj2, obj4); ls(pattern = "obj")
## [1] "obj1" "obj3" "obj5"

To remove all data objects use the following expression with extreme caution: rm(list = ls())

However, the object names may not be sufficiently descriptive to know what these objects contain. To assess R objects in our workspace we use the functions summarized in Table 3.13. In general, we never go wrong using the str, mode, and class functions.

Objects created in the workspace are available during the R session. Upon closing the R session, R asks whether to save the workspace. To save the objects without exiting an R session, use save.image(). The save.image function is actually a special case of the save function:

save(list = ls(all = TRUE), file = ".RData")

The save function saves an R object as an external file. This file can be loaded using the load function.

x <- 1:5; x
## [1] 1 2 3 4 5
save(x, file = "~/temp/x"); rm(x)
load(file = "~/temp/x"); x
## [1] 1 2 3 4 5
Table 3.13: Assessing and coercing data objects
Query data type Coerce to data type
is.vector as.vector
is.matrix as.matrix
is.array as.array
is.list as.list
is.data.frame as.data.frame
is.factor as.factor
is.ordered as.ordered
is.table as.table
is.numeric as.numeric
is.integer as.integer
is.character as.character
is.logical as.logical
is.function as.function
is.null as.null
is.na n/a
is.nan n/a
is.finite n/a
is.infinite n/a

Table 3.13 provides more functions for conducting specific object queries and for coercing one object type into another. For example, a vector is not a matrix.

is.matrix(1:3)
## [1] FALSE

However, a vector can be coerced into a matrix.

x <- as.matrix(1:3); x
##      [,1]
## [1,]    1
## [2,]    2
## [3,]    3
is.matrix(x)
## [1] TRUE

A common use would be to coerce a factor into a character vector.

sex <- factor(c("M", "M", "M", "M", "F", "F", "F")); sex
## [1] M M M M F F F
## Levels: F M
unclass(sex)      # does not coerce into character vector
## [1] 2 2 2 2 1 1 1
## attr(,"levels")
## [1] "F" "M"
as.character(sex) # yes, works
## [1] "M" "M" "M" "M" "F" "F" "F"

In R, missing values are represented by the value NA (“not available”). The is.na function evaluates an object and returns a logical vector indicating which positions contain NA. The !is.na version returns positions that do not contain NA.

x <- c(12, 34, NA, 56, 89)
is.na(x)
## [1] FALSE FALSE  TRUE FALSE FALSE
!is.na(x)
## [1]  TRUE  TRUE FALSE  TRUE  TRUE

We can use is.na to replace missing values.

x[is.na(x)] <- 999; x
## [1]  12  34 999  56  89

In R, NaN represents “not a number” and Inf represent an infinite value. Therefore, we can use is.nan and is.infinite to assess which positions contain NaN and Inf, respectively.

x <- c(0, 3, 0, -6)
y <- c(4, 0, 0, 0)
z <- x/y
z
## [1]    0  Inf  NaN -Inf
is.nan(z)
## [1] FALSE FALSE  TRUE FALSE
is.infinite(z)
## [1] FALSE  TRUE FALSE  TRUE

3.4 Managing our workspace

Our workspace is like a desktop that contains the “objects” (data and tools) we use to conduct our work. Use the getwd function to list the file path to the workspace file .RData.

getwd()
## [1] "/Users/tja/Dropbox/tja/Rproj/home"

Use the setwd function to set up a new workspace location. A (.RData) file will automatically be created there.

setwd("~/Dropbox/tja/Rproj/panflu")

This is one method to manage multiple workspaces for one’s projects.

Use the search function to list the packages, environments, or data frames attached and available.

search()[1:5] # Linux; showing first five
## [1] ".GlobalEnv"       "package:foreign"  "package:survival"
## [4] "ESSR"             "package:stats"

The global environment .GlobalEnv is our workspace. The searchpaths function list the full paths:

searchpaths()[1:5] # showing first five
## [1] ".GlobalEnv"                                                             
## [2] "/Library/Frameworks/R.framework/Versions/3.4/Resources/library/foreign" 
## [3] "/Library/Frameworks/R.framework/Versions/3.4/Resources/library/survival"
## [4] "ESSR"                                                                   
## [5] "/Library/Frameworks/R.framework/Versions/3.4/Resources/library/stats"

3.5 Exercises

3.5.1

Use the read.table function to read in the syphilis. Evaluate structure of data frame. Do not attach the data frame (yet). Create a 3-dimensional array using both the table or xtabs function. Now attach the data frame using the attach function. Create the same 3-dimensional array using both the table or xtabs function.

std89c = read.table("~/git/phds/data/syphilis89c.txt", sep = ",", header = TRUE)

3.5.1.1 Solution

First, look at the data set at [~/git/phds/data/syphilis89c.txt]. Then read in and try the following:

std <- read.table("~/git/phds/data/syphilis89c.txt", head = TRUE, sep = ",")
str(std)
head(std)
lapply(std, table)

Creating 3-D array without attaching std data frame.

table(std$Race, std$Age, std$Sex)
xtabs(~ Race + Age + Sex, data = std)

Now repeat attaching std data frame using attach function. Study the differences.

attach(std)
table(Race, Age, Sex)
xtabs(~ Race + Age + Sex)
detach(std)

3.5.2

Use the apply function to get marginal totals for the syphilis 3-dimensional array.

3.5.2.1 Solution

Here we use the apply function to get marginal totals for the syphilis 3-dimensional array.

tab.ars <- xtabs(~ Age + Race + Sex, data = std)
#### 2-D tables
tab.ar <- apply(tab.ars, c(1, 2), sum); tab.ar
tab.as <- apply(tab.ars, c(1, 3), sum); tab.as
tab.rs <- apply(tab.ars, c(2, 3), sum); tab.rs

#### 1-D tables
tab.a <- apply(tab.ars, 1, sum); tab.a
tab.r <- apply(tab.ars, 2, sum); tab.r
tab.s <- apply(tab.ars, 3, sum); tab.s

3.5.3

Use the sweep and apply functions to get marginal and joint distributions for a 3-D array.

3.5.3.1 Solution

For this syphilis data example, we’ll choose one 3-D array.

tab.ars <- xtabs(~ Age + Race + Sex, data = std)
rowt <- apply(tab.ars, c(1, 3), sum) # row distrib
rowd <- sweep(tab.ars, c(1, 3), rowt, "/"); rowd
apply(rowd, c(1, 3), sum) # confirm
colt <- apply(tab.ars, c(2, 3), sum) #  col distrib
cold <- sweep(tab.ars, c(2, 3), colt, "/"); cold
apply(cold, c(2, 3), sum) # confirm
jtt <- apply(tab.ars, 3, sum) # joint distrib
jtd <- sweep(tab.ars, 3, jtt, "/"); jtd
apply(jtd, 3, sum) # confirm
distr <- list(rowd, cold, jtd); distr 

3.5.4

Review and read in the group-level, tabular data set of primary and secondary syphilis cases in the United States in 1989. Use the rep function on the data frame fields to recreate the individual-level data frame with over 40,000 observations.

std89b = read.table("~/git/phds/data/syphilis89b.txt", sep = ",", 
                     header = TRUE)

3.5.4.1 Solution

It is a good idea to understand how the function works with two vectors:

rep(4:6, 1:3)
## [1] 4 5 5 6 6 6

We can see that the second vector determines the frequency of the first vector elements. Now use this understanding with the syphilis data.

sdat89b <- read.csv("~/git/phds/data/syphilis89b.txt")
str(sdat89b); sdat89b[1:10,] # display 10 rows
Sex <- rep(sdat89b$Sex, sdat89b$Freq)
Race <- rep(sdat89b$Race, sdat89b$Freq)
Age <- rep(sdat89b$Age, sdat89b$Freq)
sdat89.df <- data.frame(Sex, Race, Age)
str(sdat89.df)

3.5.5

Working with population estimates can be challenging because of the amount of data manipulation. Study the 2000 population estimates for California Counties: [~/git/phds/data/calpop/CalCounties2000.txt]. Now, study and implement this R code. For each expression or group of expressions, explain in words what the R code is doing. Be sure to display intermediate objects to understand each step.

#### 1 Read county names
cty <- scan("~/git/phds/data/calpop/calcounty.txt", what="")

#### 2 Read county population estimates
calpop = read.csv("~/git/phds/data/calpop/CalCounties2000.txt", header = T)

#### 2 Replace county number with county name
for(i in 1:length(cty)){
    calpop$County[calpop$County==i] <- cty[i]
}

#### 3 Discretize age into categories
calpop$Agecat <- cut(calpop$Age, c(0,20,45,65,100), include.lowest = TRUE, right = FALSE) 

#### 4 Create combined API category
calpop$AsianPI <- calpop$Asian + calpop$Pacific.Islander

#### 5 Shorten selected ethnic labels
names(calpop)[c(6, 9, 10)] = c("Latino", "AfrAmer", "AmerInd")

#### 6 Index Bay Area Counties
baindex <- calpop$County=="Alameda" | calpop$County=="San Francisco" 
bapop <- calpop[baindex,]
bapop

#### 7 Labels for later use 
agelabs <- names(table(bapop$Agecat))
sexlabs <- c("Female", "Male")
racen <- c("White", "AfrAmer", "AsianPI", "Latino", "Multirace", "AmerInd")
ctylabs <- names(table(bapop$County))

#### 8 Aggregate 
bapop2 <- aggregate(bapop[,racen], list(Agecat = bapop$Agecat, 
                   Sex = bapop$Sex, County = bapop$County), sum)
bapop2

#### 9 Temp matrix of counts
tmp <- as.matrix(cbind(bapop2[1:4,racen], bapop2[5:8,racen],
             bapop2[9:12,racen], bapop2[13:16,racen]))

#### 10 Convert into final array
bapop3 <- array(tmp, c(4, 6, 2, 2))
dimnames(bapop3) <- list(agelabs, racen, sexlabs, ctylabs)
bapop3