Chapter 4 Managing epidemiologic data in R
4.1 Entering and importing data
There are many ways of getting our data into R for analysis. In the section that follows we review how to enter the Unversity Group Diabetes Program data (4.1) as well as the original data from a comma-delimited text file. We will use the following approaches:
- Entering data at the command prompt
- Importing data from a file
- Importing data using an URL
\(Age<55\) | \(Age\ge 55\) | Combined | ||||||
---|---|---|---|---|---|---|---|---|
Tolbutamide | Placebo | Tolbutamide | Placebo | Tolbutamide | Placebo | |||
Deaths | 8 | 5 | 22 | 16 | 30 | 21 | ||
Survivors | 98 | 115 | 76 | 69 | 174 | 184 |
4.1.1 Entering data
We review four methods. For Methods 1 and 2, data are entered directly at the R console prompt . Method 3 uses the same R expressions and data as Methods 1 and 2, but they are entered into a text editor (R script editor), saved as a text file with a .R
extension (e.g., job02.R
), and then executed from the R console prompt using the source
function. Alternatively, the R expressions and data can be copied and pasted into R. And, for Method 4 we use RStudio’s spreadsheet editor (least preferred).
4.1.1.1 Method 1: Enter data at the console prompt
For review, a convenient way to enter data at the command prompt is to use the c
function:
#### enter data for a vector
vec1 <- c(8, 98, 5, 115); vec1;
## [1] 8 98 5 115
vec2 <- c(22, 76, 16, 69); vec2
## [1] 22 76 16 69
#### enter data for a matrix
mtx1 <- matrix(vec1, 2, 2); mtx1
## [,1] [,2]
## [1,] 8 5
## [2,] 98 115
#### enter data for an array and sum across strata
udat <- array(c(vec1, vec2), c(2, 2, 2)); udat
## , , 1
##
## [,1] [,2]
## [1,] 8 5
## [2,] 98 115
##
## , , 2
##
## [,1] [,2]
## [1,] 22 16
## [2,] 76 69
udat.tot <- apply(udat, c(1, 2), sum); udat.tot
## [,1] [,2]
## [1,] 30 21
## [2,] 174 184
#### enter a list
x <- list(crude.data = udat.tot, stratified.data = udat)
x$crude.data
## [,1] [,2]
## [1,] 30 21
## [2,] 174 184
x$stratified
## , , 1
##
## [,1] [,2]
## [1,] 8 5
## [2,] 98 115
##
## , , 2
##
## [,1] [,2]
## [1,] 22 16
## [2,] 76 69
#### enter simple data frame
subjname <- c('Pedro', 'Paulo', 'Maria')
subjno <- 1:length(subjname)
age <- c(34, 56, 56)
sex <- c('Male', 'Male', 'Female')
dat <- data.frame(subjno, subjname, age, sex); dat
## subjno subjname age sex
## 1 1 Pedro 34 Male
## 2 2 Paulo 56 Male
## 3 3 Maria 56 Female
#### enter a simple function
odds.ratio <- function(aa, bb, cc, dd){(aa*dd)/(bb*cc)}
odds.ratio(30, 174, 21, 184)
## [1] 1.510673
4.1.1.2 Method 2: Enter data at the console prompt using scan
function
Method 2 is identical to Method 1 except one uses the scan
function. It does not matter if we enter the numbers on different lines, it will still be a vector. Remember that we must press the Enter key twice after we have entered the last number.
udat.tot <- scan()
## 1: 30 174
## 3: 21 184
## 5:
## Read 4 items
udat.tot
## [1] 30 174 21 184
To read in a matrix at the command prompt combine the matrix
and scan
functions. Again, it does not matter on what lines we enter the data, as long as they are in the correct order because the matrix
function reads data in column-wise.
udat.tot <- matrix(scan(), 2, 2)
## 1: 30 174 21 184
## 5:
## Read 4 items
udat.tot
## [,1] [,2]
## [1,] 30 21
## [2,] 174 184
To read in an array at the command prompt combine the array
and scan
functions. Again, it does not matter on what lines we enter the data, as long as they are in the correct order because the array function reads the numbers column-wise. In this example we include the dimnames
argument.
udat <- array(scan(), dim = c(2, 2, 2),
dimnames = list(Vital.Status = c('Dead','Survived'),
Treatment = c('Tolbutamide', 'Placebo'),
Age.Group = c('<55', '55+')))
## 1: 8 98 5 115 22 76 16 69
## 9:
## Read 8 items
udat
## , , Age.Group = <55
##
## Treatment
## Vital.Status Tolbutamide Placebo
## Dead 8 5
## Survived 98 115
##
## , , Age.Group = 55+
##
## Treatment
## Vital.Status Tolbutamide Placebo
## Dead 22 16
## Survived 76 69
To read in a list of vectors of the same length (“fields”) at the command prompt combine the list
and scan
function. We will need to specify the type of data that will go into each “bin” or “field.” This is done by specifying the what
argument as a list. This list must be values that are either logical, integer, numeric, or character. For example, for a character vector we can use any expression, say x, that would evaluate to TRUE
for is.character(
x)
. For brevity, use ""
for character, 0
for numeric, 1:2
for integer, and T
or F
for logical. Study this example:
#### list includes field names
dat <- scan("", what = list(id = 1:2, name = "", age = 0,
sex = "", dead = TRUE))
## 1: 3 'John Paul' 84.5 Male F
## 2: 4 'Jane Doe' 34.5 Female T
## 3:
## Read 2 records
dat
## $id
## [1] 3 4
##
## $name
## [1] 'John Paul' 'Jane Doe'
##
## $age
## [1] 84.5 34.5
##
## $sex
## [1] 'Male' 'Female'
##
## $dead
## [1] FALSE TRUE
To read in a data frame at the command prompt combine the data.frame
, scan
, and list
functions.
dat <- data.frame(scan('', what = list(id = 1:2, name = "", age = 0, sex = "",
dead = TRUE)) )
## 1: 3 'John Paul' 84.5 Male F
## 2: 4 'Jane Doe' 34.5 Female T
## 3:
## Read 2 records
dat
## id name age sex dead
## 1 3 John Paul 84.5 Male FALSE
## 2 4 Jane Doe 34.5 Female TRUE
4.1.1.3 Method 3: Execute script (.R) files using the source
function
Method 3 uses the same R expressions and data as Methods 1 and 2, but they are entered into the R script (text) editor, saved as an ASCII text file with a .R
extension (e.g., job01.R
), and then executed from the command prompt using the source
function. Alternatively, the R expressions and data can be copied and pasted into R. For example, the following expressions are in a R script editor and saved to a file named job01.R
.
x <- 1:10; x
One can copy and paste this code into R at the commmand prompt
x <- 1:10; x
## [1] 1 2 3 4 5 6 7 8 9 10
However, if we execute the code using the source
function, it will only display to the screen those objects that are printed using the show
or print
function. Here is the text editor code again, but including show
.
x <- 1:10; show(x)
Now, source job01.R
using source
at the command prompt.
source('~/Rproj/home/job01.R')
## [1] 1 2 3 4 5 6 7 8 9 10
In general, we highly recommend using RStudio’s script editor for all your work. The script file (e.g., job01.R
) created with the editor facilitates documenting, reviewing, and debugging our code; replicating our analytic steps; and auditing by external reviewers.
4.1.1.4 Method 4: Use a spreadsheet editor (optional read)
Method 4 uses R’s spreadsheet editor. This is not a preferred method because we like the original data to be in a text editor or read in from a data file. We will be using the data.entry
and edit
functions. The data.entry
function allows editing of an existing object and automatically saves the changes to the original object name. In contrast, the edit
function allows editing of an existing object but it will not save the changes to the original object name; we must explicitly assign it to a new object name (even if it is the original name).
To enter a vector we need to initialize a vector and then use the data.entry
function (4.1).
x <- numeric(10) # initialize vector with zeros
x
## [1] 0 0 0 0 0 0 0 0 0 0
data.entry(x) # enter numbers, then close window
x
## [1] 34 56 34 56 45 34 23 67 87 99
However, the edit
function applied to a vector does not open a spreadsheet. Try the edit
function and see what happens.
xnew <- edit(numeric(10)) #Edit number, then close window
To enter data into a spreadsheet matrix, first initialize a matrix and then use the data.entry
or edit
function. Notice that the editor added default column names. However, to add our own column names just click on the column heading with our mouse pointer (unfortunately we cannot give row names).
xnew <- matrix(numeric(4),2,2)
data.entry(xnew)
xnew <- edit(xnew) #equivalent
#### open spreadsheet editor in one step
xnew <- edit(matrix(numeric(4),2,2)); xnew
## col1 col2
## [1,] 11 33
## [2,] 22 44
Arrays and nontabular lists cannot be entered using a spreadsheet editor. Hence, we begin to see the limitations of spreadsheet-type approach to data entry. One type of list, the data frame, can be entered using the edit
function.
To enter a data frame use the edit
function. However, we do not need to initialize a data frame (unlike with a matrix). Again, click on the column headings to enter column names.
df <- edit(data.frame()) # spreadsheet screen not shown
df
## mykids age
## 1 Tomasito 17
## 2 Luisito 16
## 3 Angelita 13
When using the edit
function to create a new data frame we must assign it an object name to save the data frame. Later we will see that when we edit an existing data object we can use the edit
or fix
function. The fix
function differs in that fix(
data_object)
saves our edits directly back to data_object
without the need to make a new assignment.
mypower <- function(x, n){x^n}
fix(mypower) # edits saved to 'mypower' object
mypower <- edit(mypower) # equivalent
4.1.2 Importing data from a file
4.1.2.1 Reading an ASCII text data file
In this section we review how to read the following types of text data files:
- Comma-separated variable (csv) data file (\(\pm\) headers and \(\pm\) row names)
- Fixed width formatted data file (\(\pm\) headers and \(\pm\) row names)
Here is the University Group Diabetes Program randomized clinical trial text data file that is comma-delimited, and includes row names and a header (ugdp.txt).10 The header is the first line that contains the column (field) names. The row names is the first column that starts on the second line and uniquely identifies each row. Notice that the row names do not have a column name associated with it. A data file can come with either row names or header, neither, or both. Our preference is to work with data files that have a header and data values that are self-explanatory. Even without a data dictionary one can still make sense out of this data set.
Status,Treatment,Agegrp
1,Dead,Tolbutamide,<55
2,Dead,Tolbutamide,<55
...
408,Survived,Placebo,55+
409,Survived,Placebo,55+
Notice that the header row has 3 items (field names), and the second row has 4 items. This is because the row names start in the second row and have no column name. This data file can be read in using the read.table
function, and R figures out that the first column are row names.11 Here we read in and display the first four lines of this data set:
udat1 <- read.table('~/git/phds/data/ugdp.txt', header = TRUE, sep = ",")
udat1[1:4,] # display first 4 lines
## Status Treatment Agegrp
## 1 Death Tolbutamide <55
## 2 Death Tolbutamide <55
## 3 Death Tolbutamide <55
## 4 Death Tolbutamide <55
Next (below) is the same data (ugdp2.txt
) but without a header and without row names:
Dead,Tolbutamide,<55
Dead,Tolbutamide,<55
...
Survived,Placebo,55+
Survived,Placebo,55+
And here is how we read this comma-delimited data set (ugdp2.txt
):
cnames <- c('Status', 'Treatment', 'Agegrp')
udat2 <- read.table('~/git/phds/data/ugdp2.txt', header = FALSE,
sep = ",", col.names = cnames)
udat2[1:4,] # display first 4 lines
## Status Treatment Agegrp
## 1 Dead Tolbutamide <55
## 2 Dead Tolbutamide <55
## 3 Dead Tolbutamide <55
## 4 Dead Tolbutamide <55
Here is the same data (ugdp3.txt
) as a fix formatted file. In this file, columns 1 to 8 are for field #1, columns 9 to 19 are for field #2, and columns 20 to 22 are for field #3. This type of data file is more compact. One needs a data dictionary to know which columns contain which fields.
Dead Tolbutamide<55
...
Dead Tolbutamide55+
...
Dead Placebo <55
...
SurvivedTolbutamide<55
...
SurvivedPlacebo <55
...
SurvivedPlacebo 55+
This data file would be read in using the read.fwf
function. Because the field widths are fixed, we must strip the white space using the strip.white
option.
cnames <- c('Status', 'Treatment', 'Agegrp')
udat3 <- read.fwf('~/git/phds/data/ugdp3.txt', width = c(8, 11, 3),
col.names = cnames, strip.white = TRUE)
udat3[1:4,]
## Status Treatment Agegrp
## 1 Dead Tolbutamide <55
## 2 Dead Tolbutamide <55
## 3 Dead Tolbutamide <55
## 4 Dead Tolbutamide <55
Finally, here is the same data file as a fixed width formatted file but with integer codes (ugdp4.txt
). In this file, column 1 is for field #1, column 2 is for field #2, and column 3 is for field #3. This type of text data file is the most compact, however, one needs a data dictionary to make sense of all the 1s and 2s.
121
121
...
122
122
...
211
211
...
212
212
Here is how this data file would be read in using the read.fwf
function.
cnames <- c('Status', 'Treatment', 'Agegrp')
udat4 <- read.fwf('~/git/phds/data/ugdp4.txt', width = c(1, 1, 1), col.names = cnames)
udat4[1:4,]
## Status Treatment Agegrp
## 1 1 2 1
## 2 1 2 1
## 3 1 2 1
## 4 1 2 1
R has other functions for reading text data files (read.csv
, read.csv2
, read.delim
, read.delim2
). In general, read.table
is the function used most commonly for reading in data files.
4.1.2.2 Reading data from a binary format (e.g., Stata)
To read data that comes in a binary or proprietary format load the foreign
package using the library
function. To review available functions in the the foreign
package try help(package = foreign)
. For example, here we read in the infert
data set which is also available as a Stata data file.
library(foreign)
idat <- read.dta('~/git/phds/data/infert.dta')
idat[1:4, 1:7]
## id education age parity induced case spontaneous
## 1 1 0 26 6 1 1 2
## 2 2 0 42 1 1 1 0
## 3 3 0 39 6 2 1 0
## 4 4 0 34 4 2 1 0
4.1.3 Importing data using a URL
As we have already seen, text data files can be read directly off a web server into R using read.table
or equivalent function.
4.2 Editing data
In the ideal setting, our data has already been checked, errors corrected, and ready to be analyzed. Post-collection data editing can be minimized by good design and data collection. However, we may still need to make corrections or changes in data values.
4.2.1 Text editor
Figure 4.2 displays West Nile virus (WNV) infection surveillance data in the GNU Emacs text editor.12 In RStudio, this is equivalent to the script editor (left upper window), and the console (left lower window).
4.2.2 The data.entry
, edit
, or fix
functions
For vector and matrices we can use the data.entry
function to edit these data object elements. For data frames and functions use the edit
or fix
functions. Remember that changes made with the edit
function are not saved unless we assign it to the original or new object name. In contrast, changes made with the fix
function are saved back to the original data object name. Therefore, be careful when we use the fix
function because we may unintentionally overwrite data.
Now let’s read in the WNV surveillance raw data as a data frame. Then, using the fix
function, we will edit the first three records where the value for the syndome variable is “Unknown” and change it to NA for missing. We will also change “.” to NA.
wd <- read.table('~/git/phds/data/wnv/wnv2004raw.txt',
header = TRUE, sep = ",", as.is = TRUE)
wd[wd$syndrome=='Unknown',][1:3, -7] # before edits
## id county age sex syndrome date.onset death
## 128 128 Los Angeles 81 M Unknown 07/28/2004 .
## 129 129 Riverside 44 F Unknown 07/25/2004 .
## 133 133 Los Angeles 36 M Unknown 08/04/2004 No
fix(wd) # opens R data editor and edits made -- not shown
wd[c(128, 129, 133), -7] # after edits (3 records)
## id county age sex syndrome date.onset death
## 128 128 Los Angeles 81 M NA 07/28/2004 NA
## 129 129 Riverside 44 F NA 07/25/2004 NA
## 133 133 Los Angeles 36 M NA 08/04/2004 No
First, notice that in the read.table
function as.is=TRUE
. This means the data is read in without R making any changes to it. In other words, character vectors are not automatically converted to factors. We set the option because we knew we were going to edit and make corrections to the data set, and create factors later. In this example, we manually changed the missing values “Unknown” to NA (R’s representation of missing values). However, the manual approach is very inefficient. A better approach is to specify which values in the data frame should be converted to NA. In the read.table
function we can set the option na.string = c('Unknown', '.')
, converting the character strings “Unknown” and “.” into NA. Let’s replace the missing values with NAs upon reading the data file.
wd <- read.table('~/git/phds/data/wnv/wnv2004raw.txt', header = TRUE,
sep = ",", as.is = TRUE, na.string=c('Unknown', '.'))
wd[c(128, 129, 133),] # verify change
## id county age sex syndrome date.onset date.tested death
## 128 128 Los Angeles 81 M <NA> 07/28/2004 08/11/2004 <NA>
## 129 129 Riverside 44 F <NA> 07/25/2004 08/11/2004 <NA>
## 133 133 Los Angeles 36 M <NA> 08/04/2004 08/11/2004 No
4.2.3 Vectorized approach
How do we make these and other changes after the data set has been read into R? Although using R’s spreadsheet function is convenient, we do not recommend it because manual editing is inefficient, our work cannot be replicated and audited, and documentation is poor. Instead use R’s vectorized approach. Let’s look at the distribution of responses for each variable to assess what needs to be “cleaned up,” in addition to converting missing values to NA. We use the following code to read in (again) and evaluate the raw data.
wd <- read.table('~/git/phds/data/wnv/wnv2004raw.txt',
header = TRUE, sep = ",", as.is = TRUE)
str(wd)
## 'data.frame': 779 obs. of 8 variables:
## $ id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ county : chr "San Bernardino" "San Bernardino" "San Bernardino" "San Bernardino" ...
## $ age : chr "40" "64" "19" "12" ...
## $ sex : chr "F" "F" "M" "M" ...
## $ syndrome : chr "WNF" "WNF" "WNF" "WNF" ...
## $ date.onset : chr "05/19/2004" "05/22/2004" "05/22/2004" "05/16/2004" ...
## $ date.tested: chr "06/02/2004" "06/16/2004" "06/16/2004" "06/16/2004" ...
## $ death : chr "No" "No" "No" "No" ...
lapply(wd, table)[c(2:5, 8)] # apply 'table' function to list display selected
## $county
##
## Butte Fresno Glenn Imperial Kern
## 7 11 3 1 59
## Lake Lassen Los Angeles Merced Orange
## 1 1 306 1 62
## Placer Riverside Sacramento San Bernardino San Diego
## 1 109 3 187 2
## San Joaquin Santa Clara Shasta Sn Luis Obispo Tehama
## 2 1 5 1 10
## Tulare Ventura Yolo
## 3 2 1
##
## $age
##
## . 1 10 11 12 13 14 15 16 17 18 19 2 20 21 22 23 24 25 26 27 28 29 30 31
## 6 1 1 1 3 2 3 3 1 4 6 5 1 4 2 3 6 8 3 9 4 8 1 8 6
## 32 33 34 35 36 37 38 39 4 40 41 42 43 44 45 46 47 48 49 5 50 51 52 53 54
## 3 6 15 15 8 8 5 7 1 9 17 15 19 11 14 23 25 19 22 2 14 16 24 17 14
## 55 56 57 58 59 6 60 61 62 63 64 65 66 67 68 69 7 70 71 72 73 74 75 76 77
## 17 13 16 8 17 2 29 13 15 7 12 10 7 8 8 14 2 12 10 9 11 12 7 10 7
## 78 79 8 80 81 82 83 84 85 86 87 88 89 9 91 93 94
## 7 7 1 6 11 10 5 6 4 2 2 1 6 1 4 1 1
##
## $sex
##
## . F M
## 2 294 483
##
## $syndrome
##
## Unknown WNF WNND
## 105 391 283
##
## $death
##
## . No Yes
## 66 686 27
What did we learn? First, there are 779 observations and 781 id’s; therefore, three observations were removed from the original data set. Second, we see that the variables age, sex, syndrome, and death have missing values that need to be converted to NAs. This can be done one field at a time, or for the whole data frame in one step. Here is the R code:
wd$age[wd$age=='.'] <- NA
wd$sex[wd$sex=='.'] <- NA
wd$syndrome[wd$syndrome == 'Unknown'] <- NA
wd$death[wd$death=='.'] <- NA
table(wd$death) # verify
##
## No Yes
## 686 27
table(wd$death, exclude = NULL) # show missing NAs
##
## No Yes <NA>
## 686 27 66
Note that the NA replacement could have been done globally like this:
wd[wd=='.' | wd=='Unknown'] <- NA
We also notice that the entry for one of the counties, San Luis Obispo, was misspelled (Sn Luis Obispo
). We can use replacement to correct this:
wd$County[wd$county=='Sn Luis Obispo'] <- 'San Luis Obispo'
4.2.4 Text processing
On occasion, we will need to process and manipulate character vectors using a vectorized approach. For example, suppose we need to convert a character vector of dates from “mm/dd/yy” to “yyyy-mm-dd.”13 We’ll start by using the substr
function. This function extracts characters from a character vector based on position.
bd <- c('07/17/96','12/09/00','11/07/97')
mon <- substr(bd, start = 1, stop = 2); mon
## [1] "07" "12" "11"
day <- substr(bd, 4, 5); day
## [1] "17" "09" "07"
yr <- as.numeric(substr(bd, 7, 8)); yr
## [1] 96 0 97
yr2 <- ifelse(yr <= 19, yr + 2000, yr + 1900); yr2
## [1] 1996 2000 1997
bdfin <- paste(yr2, '-', mon, '-', day, sep=''); bdfin
## [1] "1996-07-17" "2000-12-09" "1997-11-07"
In this example, we needed to convert “00” to “2000”, and “96” and “97” to “1996” and “1997”, respectively. The trick here was to coerce the character vector into a numeric vector so that 1900 or 2000 could be added to it. Using the ifelse
function, for values \(\le 19\) (arbitrarily chosen), 2000 was added, otherwise 1900 was added. The paste
function was used to paste back the components into a new vector with the standard date format.
The substr
function can also be used to replace characters in a character vector. Remember, if it can be indexed, it can be replaced.
bd
## [1] "07/17/96" "12/09/00" "11/07/97"
substr(bd, 3, 3) <- '-'; substr(bd, 6, 6) <- '-'; bd
## [1] "07-17-96" "12-09-00" "11-07-97"
Function | Description | Try these examples |
---|---|---|
nchar |
Returns the number of characters in each element of a character vector | x = c('a', 'ab', 'abc', 'abcd') |
nchar(x) |
||
substr |
Extract or replace substrings in a character vector | #### extraction |
mon = substr(some.dates, 1, 2); mon |
||
day = substr(some.dates, 4, 5); day |
||
yr = substr(some.dates, 7, 8); yr |
||
#### replacement |
||
mdy = paste(mon, day, yr); mdy |
||
substr(mdy, 3, 3) = '/' |
||
substr(mdy, 6, 6) = '/'; mdy |
||
paste |
Concatenate vectors after converting to character | rd = paste(mon, '/', day, '/', yr, sep="") |
rd |
||
strsplit |
Split the elements of a character vector into substrings | some.dates = c('10/02/70', '02/04/67') |
some.dates |
||
strsplit(some.dates, '/') |
4.3 Sorting data
The sort
function sorts a vector as expected:
x <- sample(1:10, 10); x
## [1] 10 9 2 1 6 7 3 4 5 8
sort(x)
## [1] 1 2 3 4 5 6 7 8 9 10
sort(x, decreasing = TRUE) # reverse sort
## [1] 10 9 8 7 6 5 4 3 2 1
rev(sort(x)) # reverse sort
## [1] 10 9 8 7 6 5 4 3 2 1
However, if we want to sort one vector based on the ordering of elements from another vector, use the order
function. The order
function generates an indexing/repositioning vector. Study the following example:
x <- c(15, 7, 20, 16, 8)
sort(x) # sorts as expected
## [1] 7 8 15 16 20
y <- c(6, 18, 14, 9, 3)
order(y) # integer vector specifying position change
## [1] 5 1 4 3 2
x[order(y)] # sort elements of x based on order(y)
## [1] 8 15 16 20 7
Based on this we can see that sort(x)
is just x[order(x)]
.
Now let us see how to use the order
function for data frames. First, we create a small data set.
sex <- rep(c('Male', 'Female'), c(4, 4))
ethnicity <- rep(c('White', 'Black', 'Latino', 'Asian'), 2)
age <- c(57, 93, 7, 65, 38, 27, 66, 72)
dat <- data.frame(age, sex, ethnicity)
dat <- dat[sample(1:8, 8), ] # randomly order rows
dat
## age sex ethnicity
## 4 65 Male Asian
## 3 7 Male Latino
## 6 27 Female Black
## 8 72 Female Asian
## 2 93 Male Black
## 5 38 Female White
## 1 57 Male White
## 7 66 Female Latino
Okay, now we will sort the data frame based on the ordering of one field, and then the ordering of two fields:
dat[order(dat$age),] # sort based on 1 variable
## age sex ethnicity
## 3 7 Male Latino
## 6 27 Female Black
## 5 38 Female White
## 1 57 Male White
## 4 65 Male Asian
## 7 66 Female Latino
## 8 72 Female Asian
## 2 93 Male Black
dat[order(dat$sex, dat$age),] # sort based on 2 variables
## age sex ethnicity
## 6 27 Female Black
## 5 38 Female White
## 7 66 Female Latino
## 8 72 Female Asian
## 3 7 Male Latino
## 1 57 Male White
## 4 65 Male Asian
## 2 93 Male Black
4.4 Indexing (subsetting) data
In R, we use indexing to subset (or extract) parts of a data object. For this section, please load the well known Oswego foodborne illness data frame using:
odat <- read.table('~/git/phds/data/oswego/oswego.txt',
header = TRUE, as.is = TRUE, sep = "")
names(odat) # display field names
## [1] "id" "age" "sex"
## [4] "meal.time" "ill" "onset.date"
## [7] "onset.time" "baked.ham" "spinach"
## [10] "mashed.potato" "cabbage.salad" "jello"
## [13] "rolls" "brown.bread" "milk"
## [16] "coffee" "water" "cakes"
## [19] "vanilla.ice.cream" "chocolate.ice.cream" "fruit.salad"
4.4.1 Indexing
Now, we practice indexing rows from this data frame. First, we create a new data set that contains only cases. To index the rows with cases we need to generate a logical vector that is TRUE
for every value of odat$ill
that is equivalent to'' `'Y'`. For
is equivalent to’’ we use the ==
relational operator.
cases <- odat$ill=='Y' # logical vector to index cases
odat.ca <- odat[cases, ]
odat.ca[1:4, 1:7] # display part of data set
## id age sex meal.time ill onset.date onset.time
## 1 2 52 F 8:00 PM Y 4/19 12:30 AM
## 2 3 65 M 6:30 PM Y 4/19 12:30 AM
## 3 4 59 F 6:30 PM Y 4/19 12:30 AM
## 4 6 63 F 7:30 PM Y 4/18 10:30 PM
It is very important to understand what we just did: we extracted the rows with cases by indexing the data frame with a logical vector.
Now, we combine relational operators with logical operators to extract rows based on multiple criteria. Let’s create a data set with female cases, age less than the median age, and consumed vanilla ice cream.
fem.cases.vic <- odat$ill == 'Y' & odat$sex == 'F' &
odat$vanilla.ice.cream == 'Y' & odat$age < median(odat$age)
odat.fcv <- odat[fem.cases.vic, ]
odat.fcv[ , c(1:6, 19)]
## id age sex meal.time ill onset.date vanilla.ice.cream
## 8 10 33 F 7:00 PM Y 4/18 Y
## 10 16 32 F <NA> Y 4/19 Y
## 13 20 33 F <NA> Y 4/18 Y
## 14 21 13 F 10:00 PM Y 4/19 Y
## 18 27 15 F 10:00 PM Y 4/19 Y
## 23 36 35 F <NA> Y 4/18 Y
## 31 48 20 F 7:00 PM Y 4/19 Y
## 37 58 12 F 10:00 PM Y 4/19 Y
## 40 65 17 F 10:00 PM Y 4/19 Y
## 41 66 8 F <NA> Y 4/19 Y
## 42 70 21 F <NA> Y 4/19 Y
## 44 72 18 F 7:30 PM Y 4/19 Y
In summary, we see that indexing rows of a data frame consists of using relational operators (<, >, <=, >=, ==, !=
) and logical operators (&, |, !
) to generate a logical vector for indexing the appropriate rows.
4.4.2 Using the subset
function
Indexing a data frame using the subset
function is equivalent to using logical vectors to index the data frame. In general, we prefer indexing because it is generalizable to indexing any R data object. However, the subset
function is a convenient alternative for data frames. Again, let’s create data set with female cases, age \(<\) median, and ate vanilla ice cream.
odat.fcv <- subset(odat, subset = {ill == 'Y' & sex == 'F' &
vanilla.ice.cream == 'Y' & age < median(odat$age)},
select = c(id:onset.date, vanilla.ice.cream))
odat.fcv
## id age sex meal.time ill onset.date vanilla.ice.cream
## 8 10 33 F 7:00 PM Y 4/18 Y
## 10 16 32 F <NA> Y 4/19 Y
## 13 20 33 F <NA> Y 4/18 Y
## 14 21 13 F 10:00 PM Y 4/19 Y
## 18 27 15 F 10:00 PM Y 4/19 Y
## 23 36 35 F <NA> Y 4/18 Y
## 31 48 20 F 7:00 PM Y 4/19 Y
## 37 58 12 F 10:00 PM Y 4/19 Y
## 40 65 17 F 10:00 PM Y 4/19 Y
## 41 66 8 F <NA> Y 4/19 Y
## 42 70 21 F <NA> Y 4/19 Y
## 44 72 18 F 7:30 PM Y 4/19 Y
In the subset
function, the first argument is the data frame object name, the second argument (also called subset
) evaluates to a logical vector, and third argument (called select
) specifies the fields to keep. In the second argument, subset = {...}
, the curly brackets are included for convenience to group the logical and relational operations. In the select
argument, using the :
operator, we can specify a range of fields to keep.
4.5 Transforming data
Transforming fields in a data frame is very common. The most common transformations include the following:
- Numerical transformation of a numeric vector
- Discretizing a numeric vector into categories or levels (“categorical variable”)
- Recoding factor levels (categorical variables)
For each of these, we must decide whether the newly created vector should be a new field in the data frame, overwrite the original field in the data frame, or not be a field in the data frame (but rather a vector object in the R workspace). For the examples that follow load the well known Oswego foodborne illness dataset:
odat <- read.table('~/git/phds/data/oswego/oswego.txt',
header = TRUE, as.is = TRUE, sep = "")
4.5.1 Numerical transformation
In this example we “center” the age
field by substracting the mean age from each age.
summary(odat$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.00 16.50 36.00 36.81 57.50 77.00
#### create new field in same data frame
odat$age.centered <- odat$age - mean(odat$age)
summary(odat$age.centered)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -33.8133 -20.3133 -0.8133 0.0000 20.6867 40.1867
#### create new vector in workspace; data frame unchanged
age.centered <- odat$age - mean(odat$age)
summary(age.centered)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -33.8133 -20.3133 -0.8133 0.0000 20.6867 40.1867
For convenience, the transform
function facilitates the transformation of numeric vectors in a data frame. The transform
function comes in handy when we plan on transforming many fields: we do not need to specify the data frame each time we refer to a field name. For example, the following lines are are equivalent. Both add a new transformed field to the data frame.
odat$age.centered <- odat$age - mean(odat$age)
odat <- transform(odat, age.centered = age - mean(age))
TODO
4.5.2 Recoding vector values
4.5.2.1 Recoding using replacement
4.5.2.2 Recoding using a lookup table
4.5.3 Creating categorical variables (factors)
Now, reload the Oswego data set to recover the original odat$age
field. We are going to create a new field with the following six age categories (in years): <5
, 5 to 14
, 15 to 24
, 25 to 44
, 45 to 64
, and 65+
. We will demonstrate this using several methods.
4.5.3.1 Using cut
function (preferred method)
Here we use the cut
function.
agecat <- cut(odat$age, breaks = c(0, 5, 15, 25, 45, 65, 100))
table(agecat)
## agecat
## (0,5] (5,15] (15,25] (25,45] (45,65] (65,100]
## 1 17 12 17 22 6
Note that the cut
function generated a factor with 7 levels for each interval. The notation (15, 25]
means that the interval is open on the left boundary (>15
) and closed on the right boundary (\(\le 25\)). However, in the United States, for age categories, the age boundaries are closed on the left and open on the right: [a, b)
. To change this we set the option right = FALSE
agecat <- cut(odat$age, breaks = c(0, 5, 15, 25, 45, 65, 100), right = FALSE)
table(agecat)
## agecat
## [0,5) [5,15) [15,25) [25,45) [45,65) [65,100)
## 1 14 13 18 20 9
Okay, this looks good, but we can add labels since our readers may not be familiar with open and closed interval notation \([a, b)\).
agelabs <- c('<5', '5-14', '15-24', '25-44', '45-64', '65+')
agecat <- cut(odat$age, breaks = c(0, 5, 15, 25, 45,
65, 100), right = FALSE, labels = agelabs)
table(Illness = odat$ill, 'Age category' = agecat) # display
## Age category
## Illness <5 5-14 15-24 25-44 45-64 65+
## N 0 8 5 8 5 3
## Y 1 6 8 10 15 6
4.5.3.2 Using indexing and assignment (replacement)
The cut
function is the preferred method to create a categorical variable. However, suppose one does not know about the cut
function. Applying basic R concepts always works!
agecat <- odat$age
agecat[odat$age < 5] <- 1
agecat[odat$age >= 5 & odat$age < 15] <- 2
agecat[odat$age >= 15 & odat$age < 25] <- 3
agecat[odat$age >= 25 & odat$age < 45] <- 4
agecat[odat$age >= 45 & odat$age < 65] <- 5
agecat[odat$age >= 65] <- 6
#### create factor
agelabs <- c('<5', '5-14', '15-24', '25-44', '45-64', '65+')
agecat <- factor(agecat, levels = 1:6, labels = agelabs)
table(Illness = odat$ill, 'Age category' = agecat) # display
## Age category
## Illness <5 5-14 15-24 25-44 45-64 65+
## N 0 8 5 8 5 3
## Y 1 6 8 10 15 6
In these previous examples, notice that agecat
is a factor object in the workspace and not a field in the odat
data frame.
4.5.4 Recoding factor levels (categorical variables)
In the previous example the categorical variable was a numeric vector (1, 2, 3, 4, 5, 6) that was converted to a factor and provided labels (‘<5’, ‘5-14’, …). In fact, categorical variables are often represented by integers (for example, 0 = no, 1 = yes; or 0 = non-case, 1 = case) and provided labels. Often, ASCII text data files are integer codes that require a data dictionary to convert these integers into categorical variables in a statistical package. In R, keeping track of integer codes for categorical variables is unnecessary. Therefore, re-coding the underlying integer codes is also unnecessary; however, if we feel the need to do so, here’s how:
#### Create categorical variable
ethlabs <- c('White', 'Black', 'Latino', 'Asian')
ethnicity <- sample(ethlabs, 100, replace = TRUE)
ethnicity <- factor(ethnicity, levels = ethlabs)
table(ethnicity) # display frequency
## ethnicity
## White Black Latino Asian
## 22 29 29 20
The levels
option allowed us to determine the display order, and the first level becomes the reference level in statistical models. To display the underlying numeric code use unclass
function which preserves the levels attribute.14
ethcode <- unclass(ethnicity)
table(ethcode) # display frequency
## ethcode
## 1 2 3 4
## 22 29 29 20
levels(ethcode) # display levels
## [1] "White" "Black" "Latino" "Asian"
To recover the original factor,
eth.orig <- factor(ethcode, labels = levels(ethcode))
table(eth.orig) # display frequency
## eth.orig
## White Black Latino Asian
## 22 29 29 20
Although we can extract the integer code, why would we need to do so? One is tempted to use the integer codes as a way to share data sets. However, we recommend not using the integer codes, but rather just provide the data in its native format.15 This way, the raw data is more interpretable and eliminates the intermediate step of needing to label the integer code. Also, if the data dictionary is lost or not provided, the raw data is still interpretable.
In R, we can re-label the levels using the levels
function and assigning to it a character vector of new labels. Make sure the order of the new labels corresponds to the order of the factor levels.
levels(ethnicity) # display levels
## [1] "White" "Black" "Latino" "Asian"
eth2 <- ethnicity # create duplicate
levels(eth2) <- c('Caucasion', 'Afr. American', 'Hispanic', 'Asian') # assign new levels
table(eth2) # display frequency with new levels
## eth2
## Caucasion Afr. American Hispanic Asian
## 22 29 29 20
In R, we can re-order and re-label at the same time using the levels
function and assigning to it a list. The list
function is necessary to assure the re-ordering.
eth3 <- ethnicity # create duplicate
levels(eth3) <- list(Hispanic = 'Latino', Asian = 'Asian',
Caucasion = 'White', 'Afr. American' = 'Black')
table(eth3) # display frequency with new levels & labels
## eth3
## Hispanic Asian Caucasion Afr. American
## 29 20 22 29
To re-order without re-labeling use the list
function
table(ethnicity)
## ethnicity
## White Black Latino Asian
## 22 29 29 20
eth4 <- ethnicity
levels(eth4) <- list(Latino = 'Latino', Asian = 'Asian',
White = 'White', Black = 'Black')
table(eth4)
## eth4
## Latino Asian White Black
## 29 20 22 29
In R, we can sort the factor levels by using the factor
function in one of two ways:
table(ethnicity)
## ethnicity
## White Black Latino Asian
## 22 29 29 20
eth5a <- factor(ethnicity, sort(levels(ethnicity)))
table(eth5a)
## eth5a
## Asian Black Latino White
## 20 29 29 22
eth5b <- factor(as.character(ethnicity))
table(eth5b)
## eth5b
## Asian Black Latino White
## 20 29 29 22
In the first example, we assigned to the levels
argument the sorted level names. In the second example, we started from scratch by coercing the original factor into a character vector which is then ordered alphabetically by default.
4.5.4.1 Setting factor reference level
The first level of a factor is the reference level for some statistical models (e.g., logistic regression). To set a different reference level use the relevel
function.%
levels(ethnicity)
## [1] "White" "Black" "Latino" "Asian"
eth6 <- relevel(ethnicity, ref = 'Asian')
levels(eth6)
## [1] "Asian" "White" "Black" "Latino"
As we can see, there is tremendous flexibility in dealing with factors without the need to ``re-code’’ categorical variables. This approach facilitates reviewing our work and minimizes errors.
4.5.5 Use factors instead of dummy variables
Factor | Dummy | variable | codes |
---|---|---|---|
Ethnicity | Asian | Black | Latino |
White | 0 | 0 | 0 |
Asian | 1 | 0 | 0 |
Black | 0 | 1 | 0 |
Latino | 0 | 0 | 1 |
A nonordered factor (nominal categorical variable) with \(k\) levels can also be represented with \(k-1\) dummy variables. For example, the ethnicity
factor has four levels: white, Asian, black, and Latino. Ethnicity can also be represented using 3 dichotomous variables, each coded 0 or 1. For example, using white as the reference group, the dummy variables would be asian
, black
, and latino
(see Table 4.3). The values of those three dummy variables (0 or 1) are sufficient to represents one of four possible ethnic categories. Dummy variables can be used in statistical models. However, in R, it is unnecessary to create dummy variables, just create a factor with the desired number of levels and set the reference level.
4.5.6 Conditionally transforming the elements of a vector
We can conditionally transform the elements of a vector using the ifelse
function. This function works as follows:
ifelse(test,
if test = TRUE do this,
else do this)
x <- sample(c('M', 'F'), 10, replace = TRUE); x
## [1] "M" "M" "M" "M" "M" "M" "F" "M" "F" "M"
y <- ifelse(x=='M', 'Male', 'Female'); y
## [1] "Male" "Male" "Male" "Male" "Male" "Male" "Female"
## [8] "Male" "Female" "Male"
Function | Description | Try these examples |
---|---|---|
<- |
Transforming a vector and assigning it to a new data frame variable name | dat <- data.frame(id=1:3, x=c(0.5,1,2)) |
dat$logx <- log(x) # creates new field |
||
dat |
||
transform | Transform one or more variables from a data frame | dat <- data.frame(id=1:3, x=c(0.5,1,2)) |
dat <- transform(dat, logx = log(x)) |
||
dat |
||
cut | Creates a factor by dividing the range of a numeric vector into intervals | age <- sample(1:100, 500, replace = TRUE) |
#### cut into 2 intervals |
||
agecut <- cut(age, 2, right = FALSE) |
||
table(agecut) |
||
#### cut using specified intervals |
||
agecut2 <- cut(age, c(0, 50 100),right = FALSE, include.lowest = TRUE) |
||
table(agecut2) |
||
levels | Gives access to the levels attribute of a factor | sex <- sample(c('M','F','T'),500, replace=T) |
sex <- factor(sex) |
||
table(sex) |
||
#### relabel each level; use same order |
||
levels(sex) <- c('Female', 'Male', 'Transgender') |
||
table(sex) |
||
#### relabel/recombine |
||
levels(sex) <- c('Female', 'Male', 'Male') |
||
table(sex) |
||
#### reorder and/or relabel |
||
levels(sex) <- list ('Men' = 'Male', 'Women' = 'Female') |
||
table(sex) |
||
relevel | Set the reference level for a factor | sex2 <- relevel(sex, ref = 'Women') |
table(sex2) |
||
ifelse | Conditionally operate on elements of a vector based on a test | age <- sample(1:100, 1000, replace = TRUE) |
agecat <- ifelse(age<=50, '<=50', '>50') |
||
table(agecat) |
4.6 Merging data
In general, R’s strength is not data management but rather data analysis. Because R can access and operate on multiple objects in the workspace it is generally not necessary to merge data objects into one data object in order to conduct analyses. On occasion, it may be necessary to merge two data frames into one data frames
Data frames that contain data on individual subjects are generally of two types: (1) each row contains data collected on one and only one individual, or (2) multiple rows contain repeated measurements on individuals. The latter approach is more efficient at storing data. For example, here are two approaches to collecting multiple telephone numbers for two individuals.
tab1
## name wphone fphone mphone
## 1 Tomas Aragon 643-4935 643-2926 847-9139
## 2 Wayne Enanoria 643-4934 <NA> <NA>
tab2
## name telephone teletype
## 1 Tomas Aragon 643-4935 Work
## 2 Tomas Aragon 643-2926 Fax
## 3 Tomas Aragon 847-9139 Mobile
## 4 Wayne Enanoria 643-4934 Work
The first approach is represented by tab1
, and the second approach by tab2
.16 Data is more efficiently stored in tab2
, and adding new types of telephone numbers only requires assigning a new value (e.g., Pager) to the teletype
field.
tab2 # pager added to Tomas Aragon
## name telephone teletype
## 1 Tomas Aragon 643-4935 Work
## 2 Tomas Aragon 643-2926 Fax
## 3 Tomas Aragon 847-9139 Mobile
## 4 Wayne Enanoria 643-4934 Work
## 5 Tomas Aragon 719-1234 Pager
In both these data frames, an indexing field identifies an unique individual that is associated with each row. In this case, the name
column is the indexing field for both data frames.
Now, let’s look at an example of two related data frames that are linked by an indexing field. The first data frame contains telephone numbers for 5 employees and fname
is the indexing field. The second data frame contains email addresses for 3 employees and name
is the indexing field.
phone
## fname phonenum phonetype
## 1 Tomas 643-4935 work
## 2 Tomas 847-9139 mobile
## 3 Tomas 643-4926 fax
## 4 Chris 643-3932 work
## 5 Chris 643-4926 fax
## 6 Wayne 643-4934 work
## 7 Wayne 643-4926 fax
## 8 Ray 643-4933 work
## 9 Ray 643-4926 fax
## 10 Diana 643-3931 work
email
## name mail mailtype
## 1 Tomas aragon@berkeley.edu Work
## 2 Tomas aragon@medepi.net Personal
## 3 Wayne enanoria@berkeley.edu Work
## 4 Wayne enanoria@idready.org Work
## 5 Chris cvsiador@berkeley.edu Work
## 6 Chris cvsiador@yahoo.com Personal
To merge these two data frames use the merge
function. The by.x
and by.y
options identify the indexing fields.
dat <- merge(phone, email, by.x='fname', by.y='name')
dat
## fname phonenum phonetype mail mailtype
## 1 Chris 643-3932 work cvsiador@berkeley.edu Work
## 2 Chris 643-3932 work cvsiador@yahoo.com Personal
## 3 Chris 643-4926 fax cvsiador@berkeley.edu Work
## 4 Chris 643-4926 fax cvsiador@yahoo.com Personal
## 5 Tomas 643-4926 fax aragon@berkeley.edu Work
## 6 Tomas 643-4926 fax aragon@medepi.net Personal
## 7 Tomas 643-4935 work aragon@berkeley.edu Work
## 8 Tomas 643-4935 work aragon@medepi.net Personal
## 9 Tomas 847-9139 mobile aragon@berkeley.edu Work
## 10 Tomas 847-9139 mobile aragon@medepi.net Personal
## 11 Wayne 643-4926 fax enanoria@berkeley.edu Work
## 12 Wayne 643-4926 fax enanoria@idready.org Work
## 13 Wayne 643-4934 work enanoria@berkeley.edu Work
## 14 Wayne 643-4934 work enanoria@idready.org Work
By default, R selects the rows from the two data frames that is based on the intersection of the indexing fields (by.x
, by.y
). To merge the union of the indexing fields, set all=TRUE
:
dat <- merge(phone, email, by.x='fname', by.y='name', all=TRUE)
dat
## fname phonenum phonetype mail mailtype
## 1 Chris 643-3932 work cvsiador@berkeley.edu Work
## 2 Chris 643-3932 work cvsiador@yahoo.com Personal
## 3 Chris 643-4926 fax cvsiador@berkeley.edu Work
## 4 Chris 643-4926 fax cvsiador@yahoo.com Personal
## 5 Diana 643-3931 work <NA> <NA>
## 6 Ray 643-4926 fax <NA> <NA>
## 7 Ray 643-4933 work <NA> <NA>
## 8 Tomas 643-4926 fax aragon@berkeley.edu Work
## 9 Tomas 643-4926 fax aragon@medepi.net Personal
## 10 Tomas 643-4935 work aragon@berkeley.edu Work
## 11 Tomas 643-4935 work aragon@medepi.net Personal
## 12 Tomas 847-9139 mobile aragon@berkeley.edu Work
## 13 Tomas 847-9139 mobile aragon@medepi.net Personal
## 14 Wayne 643-4926 fax enanoria@berkeley.edu Work
## 15 Wayne 643-4926 fax enanoria@idready.org Work
## 16 Wayne 643-4934 work enanoria@berkeley.edu Work
## 17 Wayne 643-4934 work enanoria@idready.org Work
%%TODO: MAKE TABLE OF MERGE, STACK, reshape
To `reshape'' tabular data look up and study the
reshapeand
stack` functions.
4.7 Executing commands from, and directing output to, a file
4.7.1 The source
function
The source
function is used to evaluate R expressions that have beenn collected in a R script file. For example, consider the contents this script file (outer-multiplication-table.R
):
i <- 1:10
x <- outer(i, i, '*')
show(x)
Here the source
function is used at the R console to “source” (evaluate and execute) the outer-multiplication-table.R
file:
source('~/git/phds/example/outer-multiplication-table.R')
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 1 2 3 4 5 6 7 8 9 10
## [2,] 2 4 6 8 10 12 14 16 18 20
## [3,] 3 6 9 12 15 18 21 24 27 30
## [4,] 4 8 12 16 20 24 28 32 36 40
## [5,] 5 10 15 20 25 30 35 40 45 50
## [6,] 6 12 18 24 30 36 42 48 54 60
## [7,] 7 14 21 28 35 42 49 56 63 70
## [8,] 8 16 24 32 40 48 56 64 72 80
## [9,] 9 18 27 36 45 54 63 72 81 90
## [10,] 10 20 30 40 50 60 70 80 90 100
Nothing is printed to the console unless we explicitly use the the show
(or print
) function. This enables us to view only the results we want to review.
An alternative approach is to print everything to the console as if the R commands were being enter directly at the command prompt. For this we do not need to use the show
function in the source file; however, we must set the echo
option to TRUE in the source
function. Here is the edited source file (chap03.R)
i <- 1:5
x <- outer(i, i, '*')
x
Here the source
function is used with the option echo = TRUE
:
source('~/git/phds/example/outer-multiplication-table2.R', echo = TRUE)
##
## > i <- 1:10
##
## > x <- outer(i, i, '*')
##
## > x
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 1 2 3 4 5 6 7 8 9 10
## [2,] 2 4 6 8 10 12 14 16 18 20
## [3,] 3 6 9 12 15 18 21 24 27 30
## [4,] 4 8 12 16 20 24 28 32 36 40
## [5,] 5 10 15 20 25 30 35 40 45 50
## [6,] 6 12 18 24 30 36 42 48 54 60
## [7,] 7 14 21 28 35 42 49 56 63 70
## [8,] 8 16 24 32 40 48 56 64 72 80
## [9,] 9 18 27 36 45 54 63 72 81 90
## [10,] 10 20 30 40 50 60 70 80 90 100
4.7.2 The sink
and capture.output
functions
To send console output to a separate file use the sink
or capture.output
functions. To work correctly, the sink
function is used twice: first, to open a connection to an output file and, second, to close the connection. By default, sink
creates a new output file. To append to an existing file set the option append = TRUE
. Consider the following script file (sink-outer.R
):
i <- 1:5
x <- outer(i, i, '*')
sink('~/git/phds/example/sink-outer.log') # open connection
cat('Here are results of outer function', fill = TRUE)
show(x)
sink() # close connection
Now the script file (sink-outer.R
) is sourced:
source('~/git/phds/example/sink-outer.R')
Notice that nothing was printed to the console because sink
sent it to the output file. Here are the contents of sink-outer.log
.
Here are results of outer function
[,1] [,2] [,3] [,4] [,5]
[1,] 1 2 3 4 5
[2,] 2 4 6 8 10
[3,] 3 6 9 12 15
[4,] 4 8 12 16 20
[5,] 5 10 15 20 25
The first sink
opened a connection and created the output file (sink-outer.log
). The cat
and show
functions printed results to output file. The second sink
closed the connection.
Alternatively, as before, if we use the echo = TRUE
option in the source
function, everything is either printed to the console or output file. The sink
connection determines what is printed to the output file. At the R console, the source
option is set to echo = TRUE
:
source('~/git/phds/example/sink-outer.R', echo = TRUE)
##
## > i <- 1:5
##
## > x <- outer(i, i, '*')
##
## > sink('~/git/phds/example/sink-outer.log') # open connection
Notice that nothing appears after sink
. That is because the output was redirected to the new output file sink-outer.log
, shown here:
> cat('Here are results of outer function', fill = TRUE)
Here are results of outer function
> show(x)
[,1] [,2] [,3] [,4] [,5]
[1,] 1 2 3 4 5
[2,] 2 4 6 8 10
[3,] 3 6 9 12 15
[4,] 4 8 12 16 20
[5,] 5 10 15 20 25
> sink() # close connection
The sink
and capture.output
functions accomplish the same task: sending results to an output file. The sink
function works in pairs: opening and closing the connection to the output file. In contrast, capture.output
function appears once and only prints the last object to the output file. Here is the edited script file (capture-output.R
) using capture.output
instead of sink
:
i <- 1:5
x <- outer(i, i, '*')
capture.output({
x # display x
}, file = '~/git/phds/example/capture-output.log')
Now the script file is sourced at the R console:
source('~/git/phds/example/capture-output.R')
And, here are the contents of capture-output.log
:
[,1] [,2] [,3] [,4] [,5]
[1,] 1 2 3 4 5
[2,] 2 4 6 8 10
[3,] 3 6 9 12 15
[4,] 4 8 12 16 20
[5,] 5 10 15 20 25
Even though the capture.output
function can run several R expressions (between the curly brackets), only the final object (x
) was printed to the output file. This would be true even if the echo
option was not set to TRUE in the source
function.
In summary, the source
function evaluates and executes R expression collected in a script file. Setting the echo
option to TRUE prints expressions and results to the console as if the commands were directly entered at the command prompt. The sink
and capture.output
functions direct results to an output file.
4.8 Working with missing and “not available” values
In R, missing values are represented by NA, but not all NAs represent missing values—some are just “not available.” NAs can appear in any data object. The NA can represent a true missing value, or it can result from an operation to which a value is “not available.” Here are three vectors that contain true missing values.
x <- c(2, 4, NA, 5); x
## [1] 2 4 NA 5
y <- c("M", NA, "M", "F"); y
## [1] "M" NA "M" "F"
z <- c(F, NA, F, T); z
## [1] FALSE NA FALSE TRUE
However, elementary numerical operations on objects that contain NA return a single NA (“not available”). In this instance, R is saying “An answer is ‘not available’ until you tell R what to do with the NAs in the data object.” To remove NAs for a calculation specify the na.rm
(“NA remove”) option.
sum(x) # answer not available
## [1] NA
mean(x) # answer not available
## [1] NA
sum(x, na.rm = TRUE) # better
## [1] 11
mean(x, na.rm = TRUE) # better
## [1] 3.666667
Here are more examples where NA means an answer is not available:
as.numeric(c("4", "six")) # Inappropriate coercion
## Warning: NAs introduced by coercion
## [1] 4 NA
c(1:5)[7] # Indexing out of range
## [1] NA
df <- data.frame(v1 = 1:3, v2 = 4:6)
df[4, ] # There is no 4th row
## v1 v2
## NA NA NA
df["4th",] # Indexing with non-existing name
## v1 v2
## NA NA NA
NA + 8 # Operations with NAs
## [1] NA
var(55) # Variance of a single number
## [1] NA
In general, these “not available” NAs indicate a missing information issue that must or should be addressed.
4.8.1 Testing, indexing, replacing, and recoding
Regardless of the source of the NAs—missing or not available values—using the is.na
function, we can generate a logical vector to identify which positions contain or do not contain NAs. This logical vector can be used index the original or another vector. Values that can be indexed can be replaced
x <- c(10, NA, 33, NA, 57)
y <- c(NA, 24, NA, 47, NA)
is.na(x) # generate logical vector
## [1] FALSE TRUE FALSE TRUE FALSE
which(is.na(x)) # which positions are NA?
## [1] 2 4
x[!is.na(x)] # index original vector
## [1] 10 33 57
y[is.na(x)] # index other vector
## [1] 24 47
x[is.na(x)] <- 999; x # replacement
## [1] 10 999 33 999 57
For a vector, recoding missing values to NA is accomplished using replacement.
x <- c(1, -99, 3, -88, 5)
x[x==-99 | x==-88] <- NA; x # recode vector
## [1] 1 NA 3 NA 5
For a matrix, we can recode missing values to NA by using replacement one column at a time, or globablly like a vector (this is because a matrix is just a reshaped vector).
m <- m2 <- matrix (c(1, -99, 3, 4, -88, 5), 2, 3)
m[m[,1]==-99, 1] <- NA # Replacement one column at a time
m[m[,3]==-88, 3] <- NA; m
## [,1] [,2] [,3]
## [1,] 1 3 NA
## [2,] NA 4 5
m2[m2==-99 | m2==-88] <- NA; m2 # Global replacement
## [,1] [,2] [,3]
## [1,] 1 3 NA
## [2,] NA 4 5
Likewise, for a data frame, we can recode missing values to NA by using replacement one field at a time, or globablly like a vector.
fname <- c("Tom", "Unknown", "Jerry")
age <- c(56, 34, -999)
z1 <- z2 <- data.frame(fname, age); z1
## fname age
## 1 Tom 56
## 2 Unknown 34
## 3 Jerry -999
z1$fname[z1$fname=="Unknown"] <- NA # Replacement
z1$age[z1$age==-999] <- NA; z1
## fname age
## 1 Tom 56
## 2 <NA> 34
## 3 Jerry NA
z2[z2=="Unknown" | z2==-999] <- NA; z2 # Global replacement
## fname age
## 1 Tom 56
## 2 <NA> 34
## 3 Jerry NA
4.8.2 Importing missing values with the read.table
function
When importing ASCII data files using the read.table
function, use the na.strings
option to specify what characters are to be converted to NA. The default setting is na.strings = "NA"
. Blank fields (\(\textvisiblespace\)) are also considered to be missing values in logical, integer, numeric, and complex fields. For example, suppose the data set contains 999, 888, .
, and \(\textvisiblespace\) (space) to represent missing values, then import the data like this:
mydat <- read.table("dataset.txt", na.strings = c(999, 888, ".", ""))
If a number, say 999, represents a missing value in one field but a valid value in another field, then import the data using the as.is = TRUE
option. Then replace the missing values in the data frame one field at a time, and convert categorical fields to factors.
4.8.3 Working with NA values in data frames and factors
There are several functions for working with NA values in data frames. First, the na.fail
function tests whether a data frame contains any NA values, returning an error message if it contains NAs.
## name gender age
## 1 Jose M 34
## 2 Ana F NA
## 3 Roberto M 22
## 4 Isabel <NA> 18
## 5 Jen F 34
gender <- c("M", "F", "M", NA, "F")
age <- c(34, NA, 22, 18, 34)
df <- data.frame(name, gender, age); df
## name gender age
## 1 Jose M 34
## 2 Ana F NA
## 3 Roberto M 22
## 4 Isabel <NA> 18
## 5 Jen F 34
na.fail(df) # NAs in data frame --- returns error
## Error in na.fail.default(df) : missing values in object
na.fail(df[c(1, 3, 5),]) # no NAs in data frame
## name gender age
## 1 Jose M 34
## 3 Roberto M 22
## 5 Jen F 34
Both na.omit
and na.exclude
remove row observations for any field that contain NAs. The na.exclude
differs from na.omit
only in the class of the na.action'' attribute of the result, which is
exclude’’ (see help for details).
na.omit(df)
## name gender age
## 1 Jose M 34
## 3 Roberto M 22
## 5 Jen F 34
na.exclude(df)
## name gender age
## 1 Jose M 34
## 3 Roberto M 22
## 5 Jen F 34
The complete.cases
function returns a logical vector for observations that are ``complete’’ (i.e., do not contain NAs).
complete.cases(df)
## [1] TRUE FALSE TRUE FALSE TRUE
df[complete.cases(df),] # equivalent to na.omit
## name gender age
## 1 Jose M 34
## 3 Roberto M 22
## 5 Jen F 34
4.8.3.1 NA values in factors
By default, factor levels do not include NA. To include NA as a factor level, use the factor
function, setting the exclude
option to NULL. Including NA as a factor level enables counting and displaying the number of NAs in tables, and analyzing NA values in statistical models.
## [1] M F M <NA> F
## Levels: F M
## gender.na
## F M <NA>
## 2 2 1
df$gender
## [1] M F M <NA> F
## Levels: F M
xtabs(~gender, data = df)
## Error in na.fail.default(list(gender = c(2L, 1L, 2L, NA, 1L))) :
## missing values in object
df$gender.na <- factor(df$gender, exclude = NULL)
xtabs(~gender.na, data = df)
## gender.na
## F M <NA>
## 2 2 1
%%% HERE
4.8.3.2 Indexing data frames that contain NAs
Using the original data frame (that can contain NAs), we can index sujects with ages less than 25.
df$age # age field
## [1] 34 NA 22 18 34
df[df$age<25, ] # index ages < 25
## name gender age gender.na
## NA <NA> <NA> NA <NA>
## 3 Roberto M 22 M
## 4 Isabel <NA> 18 <NA>
The row that corresponds to the age that is missing (NA) has been converted to NAs (`not available'') by R. To remove this uninformative row we use the
is.na` function.
df[df$age<25 & !is.na(df$age), ]
## name gender age gender.na
## 3 Roberto M 22 M
## 4 Isabel <NA> 18 <NA>
This differs from the na.omit
, na.exclude
, and complete.cases
functions that remove all missing values from the data frame first.
4.8.4 Viewing number of missing values in tables
By default, NAs are not tabulated in tables produced by the table
and xtabs
functions. The table
function can tabulate character vectors and factors. The xtabs
function only works with fields in a data frame. To tabulate NAs in character vectors using the table
function, set the exclude
function to NULL in the table
function.
df$gender.chr <- as.character(df$gender)
df$gender.chr
## [1] "M" "F" "M" NA "F"
table(df$gender.chr)
##
## F M
## 2 2
table(df$gender.chr, exclude = NULL)
##
## F M <NA>
## 2 2 1
However, this will not work with factors: we must change the factor levels first.
table(df$gender) #does not tabulate NAs
##
## F M
## 2 2
table(df$gender, exclude = NULL) #does not work
##
## F M <NA>
## 2 2 1
df$gender.na <- factor(df$gender, exclude = NULL) #works
table(df$gender.na)
##
## F M <NA>
## 2 2 1
Finally, whereas the exclude
option works on character vectors tabulated with table
function, it does not work on character vectors or factors tabulated with the xtabs
function. In a data frame, we must convert the character vector to a factor (setting the exclude
option to NULL), then the xtabs
functions tabulates the NA values.
xtabs(~gender, data=df, exclude=NULL) # does not work
## Error in na.fail.default(list(gender = c(2L, 1L, 2L, NA, 1L))) :
## missing values in object
xtabs(~gender.chr, data=df, exclude=NULL) # still does not work
## Error in na.fail.default(list(gender.chr = c("M", "F", "M", NA, "F"))) :
## missing values in object
df$gender.na <- factor(df$gender, exclude = NULL) #works
xtabs(~gender.na, data = df)
## gender.na
## F M <NA>
## 2 2 1
4.8.5 Setting default NA behaviors in statistical models
Statistical models, for example the glm
function for generalized linear models, have default NA behaviors that can be reset locally using the na.action
option in the glm
function, or reset globally using the na.action
option setting in the options
function.
options("na.action") # display global setting
## $na.action
## [1] "na.fail"
options(na.action="na.fail") # reset global setting
options("na.action")
## $na.action
## [1] "na.fail"
By default, na.action
is set to `na.omit'' in the
optionsfunction. Globally (inside the
optionsfunction) or locally (inside a statistical function),
na.action` can be set to the following:
- “na.fail”
- “na.omit”
- “na.exclude”
- “na.pass”
With “na.fail,” a function that calls na.action
will return an error if the data object contains NAs. Both “na.omit” and “na.exclude” will remove row observations from a data frame that contain NAs.17 With na.pass
, the data object is returned unchanged.
4.8.6 Working with finite, infinite, and NaN numbers
In R, some numerical operations result in negative infinity, positive infinity, or an indeterminate value (NAN for “not a number”). To assess whether values are finite or infinite, use the is.finite
or is.infinite
functions, respectively. To assess where a value is NAN, use the is.nan
function. While is.na
can identify NANs, is.nan
cannot identify NAs.
x <- c(-2:2)/c(2, 0, 0, 0, 2); x
## [1] -1 -Inf NaN Inf 1
is.infinite(x)
## [1] FALSE TRUE FALSE TRUE FALSE
x[is.infinite(x)]
## [1] -Inf Inf
is.finite(x)
## [1] TRUE FALSE FALSE FALSE TRUE
x[is.finite(x)]
## [1] -1 1
is.nan(x)
## [1] FALSE FALSE TRUE FALSE FALSE
x[is.nan(x)]
## [1] NaN
is.na(x) # does index NAN
## [1] FALSE FALSE TRUE FALSE FALSE
x[is.na(x)]
## [1] NaN
x[is.nan(x)] <- NA; x
## [1] -1 -Inf NA Inf 1
is.nan(x) # does not index NA
## [1] FALSE FALSE FALSE FALSE FALSE
4.9 Working with dates and times
There are 60 seconds in 1 minute, 60 minutes in 1 hour, 24 hours in 1 day, 7 days in 1 week, and 365 days in 1 year (except every 4th year we have a leap year with 366 days). Although this seems straightforward, doing numerical calculations with these time measures is not. Fortunately, computers make this much easier. Functions to deal with dates are available in the base
, chron
, and survival
packages.
Summarized in Figure 4.3 is the relationship between recorded data (calendar dates and times) and their representation in R as date-time class objects (Date
, POSIXlt
, POSIXct
). The as.Date
function converts a calendar date into a Date class object. The strptime
function converts a calendar date and time into a date-time class object (POSIXlt, POSIXct). The as.POSIXlt
and as.POSIXct
functions convert date-time class objects into POSIXlt
and POSIXct
, respectively.
The format
function converts date-time objects into human legible character data such as dates, days, weeks, months, times, etc. These functions are discussed in more detail in the paragraphs that follow.
4.9.1 Date functions in the base
package
4.9.1.1 The as.Date
function
Let’s start with simple date calculations. The as.Date
function in R converts calendar dates (e.g., 11/2/1949) into a Date
objects—a numeric vector of class Date
. The numeric information is the number of days since January 1, 1970—also called Julian dates. However, because calendar date data can come in a variety of formats, we need to specify the format so that as.Date
does the correct conversion. Study the following analysis carefully.
bdays <- c("11/2/1959", "1/1/1970")
bdays # character vector
## [1] "11/2/1959" "1/1/1970"
#### convert to Julian dates
bdays.julian <- as.Date(bdays, format = "%m/%d/%Y")
bdays.julian # numeric vector of class data
## [1] "1959-11-02" "1970-01-01"
#### display Julian dates
as.numeric(bdays.julian)
## [1] -3713 0
#### calculate age as of today's date
date.today <- Sys.Date()
date.today
## [1] "2017-08-26"
age <- (date.today - bdays.julian)/365.25
age
## Time differences in days
## [1] 57.81520 47.64956
#### the display of 'days' is not correct
#### truncate number to get "age"
age2 <- trunc(as.numeric(age))
age2
## [1] 57 47
#### create date frame
bd <- data.frame(Birthday = bdays, Standard = bdays.julian,
Julian = as.numeric(bdays.julian), Age = age2)
bd
## Birthday Standard Julian Age
## 1 11/2/1959 1959-11-02 -3713 57
## 2 1/1/1970 1970-01-01 0 47
Notice that although bdays.julian
appears like a character vector, it is actually a numeric vector of class date
. Use the class
and mode
functions to verify this.
To summarize, as.Date
converted the character vector of calendar dates into Julian dates (days since 1970-01-01) are displayed in a standard format (yyyy-mm-dd). The Julian dates can be used in numerical calculations. To see the Julian dates use as.numeric
or julian
function. Because the calendar dates to be converted can come in a diversity of formats (e.g., November 2, 1959; 11-02-59; 11-02-1959; 02Nov59), one must specify the format
option in as.Date
. Below are selected format options; for a complete list see help(strptime)
.
Abbreviation | Description |
---|---|
%a |
Abbreviated weekday name. |
%A |
Full weekday name. |
%b |
Abbreviated month name. |
%B |
Full month name. |
%d |
Day of the month as decimal number (01-31) |
%j |
Day of year as decimal number (001-366). |
%m |
Month as decimal number (01-12). |
%U |
Week of the year as decimal number (00-53) using the first Sunday as day 1 of week 1. |
%w |
Weekday as decimal number (0-6, Sunday is 0). |
%W |
Week of the year as decimal number (00-53) using the first Monday as day 1 of week 1. |
%y |
Year without century (00-99). If you use this on input, which century you get is system-specific. So don’t! Often values up to 69 (or 68) are prefixed by 20 and 70-99 by 19. |
%Y |
Year with century. |
Here are some examples of converting dates with different formats:
as.Date("November 2, 1959", format = "%B %d, %Y")
## [1] "1959-11-02"
as.Date("11/2/1959", format = "%m/%d/%Y")
## [1] "1959-11-02"
as.Date("11/2/59", format = "%m/%d/%y") # caution using 2-digit year
## [1] "2059-11-02"
as.Date("02Nov1959", format = "%d%b%Y")
## [1] "1959-11-02"
as.Date("02Nov59", format = "%d%b%y") # caution using 2-digit year
## [1] "2059-11-02"
as.Date("1959-11-02") # standard format does not require format option
## [1] "1959-11-02"
Notice how Julian dates can be used like any integer:
as.Date("2004-01-15"):as.Date("2004-01-23")
## [1] 12432 12433 12434 12435 12436 12437 12438 12439 12440
seq(as.Date("2004-01-15"), as.Date("2004-01-18"), by = 1)
## [1] "2004-01-15" "2004-01-16" "2004-01-17" "2004-01-18"
4.9.1.2 The weekdays
, months
, quarters
, julian
functions
Use the weekdays
, months
, quarters
, or julian
functions to extract information from Date
and other date-time objects in R.
mydates <- c("2004-01-15","2004-04-15","2004-10-15")
mydates <- as.Date(mydates)
weekdays(mydates)
## [1] "Thursday" "Thursday" "Friday"
months(mydates)
## [1] "January" "April" "October"
quarters(mydates)
## [1] "Q1" "Q2" "Q4"
julian(mydates)
## [1] 12432 12523 12706
## attr(,"origin")
## [1] "1970-01-01"
4.9.1.3 The strptime
function
So far we have worked with calendar dates; however, we also need to be able to work with times of the day. Whereas as.Date
only works with calendar dates, the strptime
function will accept data in the form of calendar dates and times of the day (HH:MM:SS, where H = hour, M = minutes, S = seconds). For example, let’s look at the Oswego foodborne ill outbreak that occurred in 1940. The source of the outbreak was attributed to the church supper that was served on April 18, 1940. The food was available for consumption from 6 pm to 11 pm. The onset of symptoms occurred on April 18th and 19th. The meal consumption times and the illness onset times were recorded.
odat <- read.table("~/git/phds/data/oswego/oswego.txt",
sep = "", header = TRUE, as.is = TRUE)
str(odat)
## 'data.frame': 75 obs. of 21 variables:
## $ id : int 2 3 4 6 7 8 9 10 14 16 ...
## $ age : int 52 65 59 63 70 40 15 33 10 32 ...
## $ sex : chr "F" "M" "F" "F" ...
## $ meal.time : chr "8:00 PM" "6:30 PM" "6:30 PM" "7:30 PM" ...
## $ ill : chr "Y" "Y" "Y" "Y" ...
## $ onset.date : chr "4/19" "4/19" "4/19" "4/18" ...
## $ onset.time : chr "12:30 AM" "12:30 AM" "12:30 AM" "10:30 PM" ...
## $ baked.ham : chr "Y" "Y" "Y" "Y" ...
## $ spinach : chr "Y" "Y" "Y" "Y" ...
## $ mashed.potato : chr "Y" "Y" "N" "N" ...
## $ cabbage.salad : chr "N" "Y" "N" "Y" ...
## $ jello : chr "N" "N" "N" "Y" ...
## $ rolls : chr "Y" "N" "N" "N" ...
## $ brown.bread : chr "N" "N" "N" "N" ...
## $ milk : chr "N" "N" "N" "N" ...
## $ coffee : chr "Y" "Y" "Y" "N" ...
## $ water : chr "N" "N" "N" "Y" ...
## $ cakes : chr "N" "N" "Y" "N" ...
## $ vanilla.ice.cream : chr "Y" "Y" "Y" "Y" ...
## $ chocolate.ice.cream: chr "N" "Y" "Y" "N" ...
## $ fruit.salad : chr "N" "N" "N" "N" ...
To calculate the incubation period, for ill individuals, we need to subtract the meal consumption times (occurring on 4/18) from the illness onset times (occurring on 4/18 and 4/19). Therefore, we need two date-time objects to do this arithmetic. First, let’s create a date-time object for the meal times:
#### look at existing data for meals
odat$meal.time[1:5]
## [1] "8:00 PM" "6:30 PM" "6:30 PM" "7:30 PM" "7:30 PM"
#### create character vector with meal date and time
mdt <- paste("4/18/1940", odat$meal.time)
mdt[1:4]
## [1] "4/18/1940 8:00 PM" "4/18/1940 6:30 PM" "4/18/1940 6:30 PM"
## [4] "4/18/1940 7:30 PM"
#### convert into standard date and time
meal.dt <- strptime(mdt, format = "%m/%d/%Y %I:%M %p")
meal.dt[1:4]
## [1] "1940-04-18 20:00:00 PST" "1940-04-18 18:30:00 PST"
## [3] "1940-04-18 18:30:00 PST" "1940-04-18 19:30:00 PST"
#### look at existing data for illness onset
odat$onset.date[1:4]
## [1] "4/19" "4/19" "4/19" "4/18"
odat$onset.time[1:4]
## [1] "12:30 AM" "12:30 AM" "12:30 AM" "10:30 PM"
#### create vector with onset date and time
odt <- paste(paste(odat$onset.date, "/1940", sep=""), odat$onset.time)
odt[1:4]
## [1] "4/19/1940 12:30 AM" "4/19/1940 12:30 AM" "4/19/1940 12:30 AM"
## [4] "4/18/1940 10:30 PM"
#### convert into standard date and time
onset.dt <- strptime(odt, "%m/%d/%Y %I:%M %p")
onset.dt[1:4]
## [1] "1940-04-19 00:30:00 PST" "1940-04-19 00:30:00 PST"
## [3] "1940-04-19 00:30:00 PST" "1940-04-18 22:30:00 PST"
#### calculate incubation period
incub.period <- onset.dt - meal.dt
incub.period
## Time differences in hours
## [1] 4.5 6.0 6.0 3.0 3.0 6.5 3.0 4.0 6.5 NA NA NA NA 3.0 NA NA NA
## [18] 3.0 NA NA 3.0 3.0 NA NA 3.0 NA NA NA NA NA 6.0 NA 4.0 NA
## [35] NA NA 3.0 7.0 4.0 3.0 NA NA 5.5 4.5 NA NA NA NA NA NA NA
## [52] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [69] NA NA NA NA NA NA NA
mean(incub.period, na.rm = T)
## Time difference of 4.295455 hours
median(incub.period, na.rm = T)
## Time difference of 4 hours
#### try 'as.numeric' on 'incub.period'
median(as.numeric(incub.period), na.rm = T)
## [1] 4
To summarize, we used strptime
to convert the meal consumption date and times and illness onset dates and times into date-time objects (meal.dt
and onset.dt
) that can be used to calculate the incubation periods by simple subtraction (and assigned name incub.period
).
Notice that incub.period
is an atomic object of class difftime
:
str(incub.period)
## Class 'difftime' atomic [1:75] 4.5 6 6 3 3 6.5 3 4 6.5 NA ...
## ..- attr(*, "units")= chr "hours"
This is why we had trouble calculating the median (which should not be the case). We got around this problem by coercion using as.numeric
:
as.numeric(incub.period)[20] # show 20 values
## [1] NA
Now, what kind of objects were created by the strptime
function?
str(meal.dt)
## POSIXlt[1:75], format: "1940-04-18 20:00:00" "1940-04-18 18:30:00" ...
str(onset.dt)
## POSIXlt[1:75], format: "1940-04-19 00:30:00" "1940-04-19 00:30:00" ...
The strptime
function produces a named list of class POSIXlt
. POSIX stands for Portable Operating System Interface,'' and
lt’’ stands for “legible time.”18
4.9.1.4 The POSIXlt
and POSIXct
functions
The POSIXlt list contains the date-time data in human readable forms. The named list contains the following vectors:
Names | Values: description |
---|---|
‘sec’ | 0-61: seconds |
‘min’ | 0-59: minutes |
‘hour’ | 0-23: hours |
‘mday’ | 1-31: day of the month |
‘mon’ | 0-11: months after the first of the year. |
‘year’ | Years since 1900. |
‘wday’ | 0-6 day of the week, starting on Sunday. |
‘yday’ | 0-365: day of the year. |
‘isdst’ | Daylight savings time flag. Positive if in force, zero if not, negative if unknown. |
Let’s examine the onset.dt
object we created from the Oswego data.
is.list(onset.dt)
## [1] TRUE
names(onset.dt)
## NULL
onset.dt$min[10]
## [1] 30
onset.dt$hour[10]
## [1] 10
onset.dt$mday[10]
## [1] 19
onset.dt$mon[10]
## [1] 3
onset.dt$year[10]
## [1] 40
onset.dt$wday[10]
## [1] 5
onset.dt$yday[10]
## [1] 109
The POSIXlt list contains useful date-time information; however, it is not in a convenient form for storing in a data frame. Using as.POSIXct
we can convert it to a `continuous time'' object that contains the number of seconds since 1970-01-01 00:00:00.
as.POSIXlt` coerces a date-time object to POSIXlt.
onset.dt.ct <- as.POSIXct(onset.dt)[10]
onset.dt.ct[1:5]
## [1] "1940-04-19 10:30:00 PST" NA
## [3] NA NA
## [5] NA
as.numeric(onset.dt.ct[1:5])
## [1] -937287000 NA NA NA NA
4.9.1.5 The format
function
Whereas the strptime
function converts a character vector of date-time information into a date-time object, the format
function converts a date-time object into a character vector. The format
function gives us great flexibility in converting date-time objects into numerous outputs (e.g., day of the week, week of the year, day of the year, month of the year, year). Selected date-time format options are listed on TODO-REF, for a complete list see help(strptime)
.
For example, in public health, reportable communicable diseases are often reported by “disease week” (this could be week of reporting or week of symptom onset). This information is easily extracted from R date-time objects. For weeks starting on Sunday use the %U
option in the format
function, and for weeks starting on Monday use the %W
option.
decjan <- seq(as.Date("2003-12-15"), as.Date("2004-01-15"), by =1)
decjan
## [1] "2003-12-15" "2003-12-16" "2003-12-17" "2003-12-18" "2003-12-19"
## [6] "2003-12-20" "2003-12-21" "2003-12-22" "2003-12-23" "2003-12-24"
## [11] "2003-12-25" "2003-12-26" "2003-12-27" "2003-12-28" "2003-12-29"
## [16] "2003-12-30" "2003-12-31" "2004-01-01" "2004-01-02" "2004-01-03"
## [21] "2004-01-04" "2004-01-05" "2004-01-06" "2004-01-07" "2004-01-08"
## [26] "2004-01-09" "2004-01-10" "2004-01-11" "2004-01-12" "2004-01-13"
## [31] "2004-01-14" "2004-01-15"
disease.week <- format(decjan, "%U")
disease.week
## [1] "50" "50" "50" "50" "50" "50" "51" "51" "51" "51" "51" "51" "51" "52"
## [15] "52" "52" "52" "00" "00" "00" "01" "01" "01" "01" "01" "01" "01" "02"
## [29] "02" "02" "02" "02"
4.9.2 Date functions in the chron
and survival
packages
The chron
and survival
packages have customized functions for dealing with dates. Both packages come with the default R installation. To learn more about date and time classes read R News, Volume 4/1, June 2004.19
4.10 Exporting data objects
Function | Description | Try these examples | |
---|---|---|---|
Export to Generic ASCII text file | |||
write.table | Write tabular data as a data frame to an ASCII text file; read file back in using read.table function |
write.table(infert,"infert.dat") |
|
write.csv | write.csv(infert,"infert.csv") |
||
write | Write matrix elements to an ASCII text file | x <- matrix(1:4, 2, 2) |
|
write(t(x), "x.txt") |
|||
Export to R ASCII text file | |||
dump | “Dumps” list of R objects as R code to an ASCII text file; read back in using source function |
dump("Titanic", "titanic.R") |
|
dput | Writes an R object as R code (but without the object name) to the console, or an ASCII text file; read file back in using dget function |
dput(Titanic, "titanic.R") |
|
Export to R binary file | |||
save | “Saves” list of R objects as binary filename .Rdata file; read back in using load function |
save(Titanic, "titanic.Rdata") |
|
Export to non-R ASCII text file | |||
write.foreign | From foreign package: writes text files (SPSS, Stata, SAS) and code to read them |
write.foreign(infert, datafile="infert.dat", |
|
codefile="infert.txt", package = "SPSS") |
|||
Export to non-R binary file | |||
write.dbf | From foreign package: writes DBF files |
write.dbf(infert, “infert.dbf”) | |
write.dta | From foreign package: writes files in Stata binary format |
write.dta(infert, “infert.dta”) |
On occassion, we need to export R data objects. This can be done in several ways, depending on our needs: - Generic ASCII text file - R ASCII text file - R binary file - Non-R ASCII text files - Non-R binary file
4.10.1 Exporting to a generic ASCII text file
4.10.1.1 The write.table
function
We use the write.table
function to exports a data frame as a tabular ASCII text file which can be read by most statistical packages. If the object is not a data frame, it converts to a data frame before exporting it. Therefore, this function only make sense with tabular data objects. Here are the default arguments:
args(write.table)
## function (x, file = "", append = FALSE, quote = TRUE, sep = " ",
## eol = "\\n", na = "NA", dec = ".", row.names = TRUE, col.names = TRUE,
## qmethod = c("escape", "double"), fileEncoding = "")
## NULL
The 1st argument will be the data frame name (e.g., infert), the 2nd will be the name for the output file (e.g., infert.dat), the sep
argument is set to be space-delimited, and the row.names
argument is set to TRUE
.
The following code:
write.table(infert,"~/git/phds/example/infert.dat")
produces this ASCII text file:
"education" "age" "parity" "induced" "case" ...
"1" "0-5yrs" 26 6 1 1 2 1 3
"2" "0-5yrs" 42 1 1 1 0 2 1
"3" "0-5yrs" 39 6 2 1 0 3 4
"4" "0-5yrs" 34 4 2 1 0 4 2
"5" "6-11yrs" 35 3 1 1 1 5 32
...
Because row.names=TRUE
, the number field names in the header (row 1) will one less that the number of columns (starting with row 2). The default row names is a character vector of integers. The following code:
write.table(infert,"~/git/phds/example/infert.csv", sep=",", row.names=FALSE)
produces a commna-delimited ASCII text file without row names:
"education","age","parity","induced","case", ...
"0-5yrs",26,6,1,1,2,1,3
"0-5yrs",42,1,1,1,0,2,1
"0-5yrs",39,6,2,1,0,3,4
"0-5yrs",34,4,2,1,0,4,2
"6-11yrs",35,3,1,1,1,5,32
...
Note that the write.csv
function produces a comma-delimited data file by default.
4.10.1.2 The write
function
The write
function writes the contents of a matrix in a columnwise order to an ASCII text file. To get the same appearance as the matrix, we must transpose the matrix and specify the number of columns (the default is 5). If we set file=""
, then the output is written to the screen:
infert.tab1 <- xtabs(~case+parity,data=infert)
infert.tab1
## parity
## case 1 2 3 4 5 6
## 0 66 54 24 12 4 5
## 1 33 27 12 6 2 3
write(infert.tab1, file="") # not what we want
## 66 33 54 27 24
## 12 12 6 4 2
## 5 3
#### much better
write(t(infert.tab1), file="", ncol=ncol(infert.tab1))
## 66 54 24 12 4 5
## 33 27 12 6 2 3
To read the raw data back into R, we would use the scan
function. For example, if the data had been written to data.txt
, then the following code reads the data back into R:
matrix(scan("data.txt"), ncol=6, byrow=TRUE)
## Read 12 items
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 66 54 24 12 4 5
## [2,] 33 27 12 6 2 3
Of course, all the labeling was lost.
4.10.2 Exporting to R ASCII text file
Data objects can also be exported as R code in an ASCII text file using the dump
and dput
functions. This has advantages for complex R objects (e.g., arrays, lists) that do not have simple tabular structures, and the R code makes the raw data human legible.
4.10.2.1 The dump
function
The dump
function exports multiple objects as R code as the next example illustrates:
infert.tab1 <- xtabs(~case+parity,data=infert)
infert.tab2 <- xtabs(~education+parity+case,data=infert)
infert.tab1 #display matrix
## parity
## case 1 2 3 4 5 6
## 0 66 54 24 12 4 5
## 1 33 27 12 6 2 3
infert.tab2 #display array
## , , case = 0
##
## parity
## education 1 2 3 4 5 6
## 0-5yrs 2 0 0 2 0 4
## 6-11yrs 28 28 14 8 2 0
## 12+ yrs 36 26 10 2 2 1
##
## , , case = 1
##
## parity
## education 1 2 3 4 5 6
## 0-5yrs 1 0 0 1 0 2
## 6-11yrs 14 14 7 4 1 0
## 12+ yrs 18 13 5 1 1 1
dump(c("infert.tab1", "infert.tab2"),
file = "~/git/phds/example/infert_tab.R") #export
The dump
function produced the following R code in the infert_tab.R
text file:
`infert.tab1` <-
structure(c(66, 33, 54, 27, 24, 12, 12, 6, 4, 2, 5, 3),
.Dim = c(2L, 6L), .Dimnames = structure(list(case = c("0", "1"),
parity = c("1", "2", "3", "4", "5", "6")), .Names = c("case",
"parity")), class = c("xtabs", "table"), call = quote(xtabs(
formula = ~case + parity, data = infert)))
`infert.tab2` <-
structure(c(2, 28, 36, 0, 28, 26, 0, 14, 10, 2, 8, 2, 0, 2,
2, 4, 0, 1, 1, 14, 18, 0, 14, 13, 0, 7, 5, 1, 4, 1, 0, 1, 1,
2, 0, 1), .Dim = c(3L, 6L, 2L), .Dimnames = structure(list(
education = c("0-5yrs", "6-11yrs", "12+ yrs"), parity = c("1",
"2", "3", "4", "5", "6"), case = c("0", "1")), .Names =
c("education", "parity", "case")), class = c("xtabs", "table"),
call = quote(xtabs(formula = ~education + parity + case,
data = infert)))
Notice that the 1st argument was a character vector of object names. The infert_tab.R
file can be run in R using the source
function to recreate all the objects in the workspace.
4.10.2.2 The dput
function
The dput
function is similar to the dump
function except that the object name is not written. By default, the dput
function prints to the screen:
dput(infert.tab1)
## structure(c(66L, 33L, 54L, 27L, 24L, 12L, 12L, 6L, 4L, 2L, 5L,
## 3L), .Dim = c(2L, 6L), .Dimnames = structure(list(case = c("0",
## "1"), parity = c("1", "2", "3", "4", "5", "6")), .Names = c("case",
## "parity")), class = c("xtabs", "table"), call = xtabs(formula = ~case +
## parity, data = infert))
structure(c(66, 33, 54, 27, 24, 12, 12, 6, 4, 2, 5, 3),
.Dim = c(2L, 6L), .Dimnames = structure(list(case = c("0", "1"),
parity = c("1", "2", "3", "4", "5", "6")), .Names = c("case",
"parity")), class = c("xtabs", "table"), call = quote(xtabs(
formula = ~case + parity, data = infert)))
## parity
## case 1 2 3 4 5 6
## 0 66 54 24 12 4 5
## 1 33 27 12 6 2 3
To export to an ASCII text file, give a new file name as the second argument, similar to dump
. To get back the R code use the dget
function:
dput(infert.tab1, "~/git/phds/example/infert_tab1.R")
dget("~/git/phds/example/infert_tab1.R")
## parity
## case 1 2 3 4 5 6
## 0 66 54 24 12 4 5
## 1 33 27 12 6 2 3
4.10.3 Exporting to R binary file
4.10.3.1 The save
function
The save
function exports R data objects to binary file (filename.RData
) which is the most effient, compact method to export objects. The first argument(s) can be the names of the objects to save followed by the output file name, or list with a character vector of object names followed by the output file name. Here is an example of the first option:
x <- 1:5; y <- x^3
save(x, y, file="~/git/phds/example/xy.RData")
rm(x, y)
load(file="~/git/phds/example/xy.RData")
x; y
## [1] 1 2 3 4 5
## [1] 1 8 27 64 125
Notice that we used the load
function to load the binary file back into the workspace.
Now here is an example of the second option using a list:
x <- 1:5; y <- x^3
save(list=c("x", "y"), file="~/git/phds/example/xy.RData")
rm(x, y)
load(file="~/git/phds/example/xy.RData")
x; y
## [1] 1 2 3 4 5
## [1] 1 8 27 64 125
In fact, the save.image
function we use to save the entire workspace is just the following:
save(list = ls(all=TRUE), file = ".RData")
4.10.4 Exporting to non-R ASCII text and binary files
The foreign
package contains functions for exporting R data frames to non-R ASCII text and binary files. The write.foreign
function write two ASCII text files: the first file is the data file, and the second file is the code file for reading the data file. The code file contains either SPSS, Stata, or SAS programming code. The write.dbf
function writes a data frame to a binary DBF file, which can be read back into R using the read.dbf
function. Finally, the write.dta
function writes a data frame to a binary Stata file, which can be read back into R using the read.dta
function.
4.11 Working with regular expressions
A regular expression is a special text string for describing a search pattern which can be used for searching text strings, indexing data objects, and replacing object elements. For example, we applied Global Burden of Disease methods to evaluate causes of premature deaths in San Francisco TODO-cite[Aragon2008_18402698]. Using regular expressions we were able to efficiently code over 14,000 death records, with over 900 ICD-10 cause of death codes, into 117 mutually exclusive cause of death categories. Without regular expressions, this local study would have been prohibitively tedious.
A regular expression is built up from specifying one character at a time. Using this approach, we cover the following:
- Single character: matching a single character;
- Character class: matching a single character from among a list of characters;
- Concatenation: combining single characters into a new match pattern;
- Repetition: specifying how many times a single character or match pattern might be repeated;
- Alternation: a regular expression may be matched from among two or more regular expressions; and
- Metacharacters: special characters that require special treatment.
4.11.1 Single characters
The search pattern is built up from specifying one character at a time. For example, the pattern "x"
looks for the letter x in a text string. Next, consider a character vector of text strings. We can use the grep
function to search for a pattern in this data vector.
vec1 <- c("x", "xa bc", "abc", "ax bc", "ab xc", "ab cx", "z rox z", "y xo y")
grep("x", vec1)
## [1] 1 2 4 5 6 7 8
vec1[grep("x", vec1)] # index by position
## [1] "x" "xa bc" "ax bc" "ab xc" "ab cx" "z rox z" "y xo y"
grep("x", vec1, value = TRUE) # equivalent
## [1] "x" "xa bc" "ax bc" "ab xc" "ab cx" "z rox z" "y xo y"
The grep
function returned an integer vector indicating the positions in the data vector that contain a match. We used this integer vector to index by position.
The caret ^
matches the empty string at the beginning of a line. Therefore, to match this pattern at the beginning of a line we add the ^
character to the regular expression:
grep("^x", vec1, value = TRUE)
## [1] "x" "xa bc"
The $
character matches the empty string at the end of a line. Therefore, to match this pattern at the end of a line we add the $
character to the regular expression:
grep("x$", vec1, value = TRUE)
## [1] "x" "ab cx"
The ^
and $
characters are examples of metacharacters (more on these later).
The symbol \b matches the empty string at the edge of a word, and \B matches the empty string provided it is not at the edge of a word. Study these examples:
grep("\\bx", vec1, value = TRUE) # find 'x' at beginning of a word
## [1] "x" "xa bc" "ab xc" "y xo y"
grep("x\\b", vec1, value = TRUE) # find 'x' at end of a word
## [1] "x" "ax bc" "ab cx" "z rox z"
grep("\\bx\\b", vec1, value = TRUE) # find 'x' at beginning and end of a word
## [1] "x"
grep("\\Bx", vec1, value = TRUE) # find 'x' not at left edge of word
## [1] "ax bc" "ab cx" "z rox z"
grep("x\\B", vec1, value = TRUE) # find 'x' not at right edge of word
## [1] "xa bc" "ab xc" "y xo y"
grep("\\Bx\\B", vec1, value = TRUE) # find 'x' not at right and left edge of word
## character(0)
The period (.
) matches any single character, including a space.
grep(".bc", vec1, value = TRUE)
## [1] "xa bc" "abc" "ax bc"
Regular expression | Description | Example |
---|---|---|
. |
Any single character | "a.c" |
^ |
At beginning of line | "^F" |
$ |
At end of line | "F$" |
\b |
At edge of a word | "\\bF" , "F\\b" |
\B |
Not at edge of a word | "\\BF" , "F\\B" |
4.11.2 Character class
A character class is a list of characters enclosed by square brackets [
and ]
which matches any single character in that list. For example, "[fhr]"
will match the single character f
, h
, or r
. This can be combined with metacharacters for more specificity; for example, "^[fhr]"
will match the single character f
, h
, or r
at the beginning of a line.
vec2 <- c("fat", "bar", "rat", "elf", "mach", "hat")
grep("^[fhr]", vec2, value = TRUE)
## [1] "fat" "rat" "hat"
As already shown, ^
character matches the empty string at the beginning of a line. However, when ^
is the first character in a character class list, it matches any character not in the list. For example, "^[^fhr]"
will match any single character at the beginning of a line except f
, h
, or r
.
vec2 <- c("fat", "bar", "rat", "elf", "mach", "hat")
grep("^[^fhr]", vec2, value = TRUE)
## [1] "bar" "elf" "mach"
Character classes can be specified as a range of possible characters. For example, [0-9]
matches a single digit with possible values from 0 to 9, [A-Z]
matches a single letter with possible values from A to Z, and [a-z]
matches a single letter from a to z. The pattern [0-9A-Za-z]
matches any single alphanumeric character. Table 4.8 displays additional selected character class regular expessions.
Regular expression | Description | Alternative |
---|---|---|
\d |
Digits | [0-9] |
\D |
Non-digits | [^0-9] |
\w |
Word character | [A-Za-z0-9_] |
\W |
Not work character | [^A-Za-z0-9_] |
\s |
Space | |
\S |
Non-space |
wordvec <- c('hello', '?', '1234', ' ', '_', '\n', '&')
grep('\\w', wordvec, value = TRUE)
## [1] "hello" "1234" "_"
grep('\\W', wordvec, value = TRUE)
## [1] "?" " " "\n" "&"
For convenience, certain character classes are predefined and their interpretation depend on locale.20 For example, to match a single lower case letter we place [:lower:]
inside square brackets like this: "[[:lower:]]"
, which is equivalent to "[a-z]"
. Table 4.9 lists predefined character classes. This is very convenient for matching punctuation characters and multiple types of spaces (e.g., tab, newline, carriage return). The [:punct:]
character class matches any of the following punctuation characters:
! " # $ % & ’ ( ) * + , - . / : ; < = > ? @ [ ] ^ _ ` { | } ~
Predefined | Description | Alternative |
---|---|---|
[[:lower:]] | Lower-case letters in the current locale | [a-z] |
[[:upper:]] | Upper-case letters in the current locale | [A-Z] |
[[:alpha:]] | Alphabetic characters ([[:lower:][:upper:]] ) |
[A-Za-z] |
[[:digit:]] | Digits | [0-9] |
[[:alnum:]] | Alphanumeric characters ([[:alpha:][:digit:]] ) |
[A-Za-z0-9] |
[[:punct:]] | Punctuation characters: | |
[[:blank:]] | Blank characters: space, tab, and non-breaking space. | |
[[:cntrl:]] | Control characters. In ASCII, octal codes 000 through 037, and 177. | |
[[:space:]] | Space characters: tab, newline, vertical tab, form feed, carriage return, and space | |
[[:graph:]] | Graphical characters | [[:alnum:][:punct:]] |
[[:print:]] | Printable characters | [[:alnum:][:punct:][:space:]] |
[[:xdigit:]] | Hexadecimal digits: | [0-9A-Fa-f] |
In this final example, "^.[^a].+"
will match any first character, followed by any character except a, and followed by any character one or more times:
grep("^.[^a].+", vec2, value = TRUE)
## [1] "elf"
Combining single character matches is call concatenation.
4.11.3 Concatenation
Single characters (including character classes) can be concatenated; for example, the pattern "ˆ[fhr]at$"
will match the single, isolated words fat, hat, or rat.
vec3 <- c("fat", "bar", "rat", "fat boy", "elf", "mach", "hat")
grep("^[fhr]at$", vec3, value = TRUE)
## [1] "fat" "rat" "hat"
The concatenation "[ct]a[br]"
will match the pattern that starts with c or t, followed by a, and followed by b or r.
vec4 <- c("cab", "carat", "tar", "bar", "tab", "batboy", "care")
grep("[ct]a[br]", vec4, value = TRUE)
## [1] "cab" "carat" "tar" "tab" "care"
To match single, 3-letter words use "\\b[ct]a[br]\\b"
.
grep("\\b[ct]a[br]\\b", vec4, value = TRUE)
## [1] "cab" "tar" "tab"
The period (.
) is another metacharacter: it matches any single character. For example, "f.t"
matches the pattern f
+ any character + t
.
vec5 <- c("fate", "rat", "fit", "bat", "futbol")
grep("f.t", vec5, , value = TRUE)
## [1] "fate" "fit" "futbol"
4.11.4 Repetition
Repetition quantifier | Description |
---|---|
? |
Preceding pattern is optional and will be matched at most once |
* |
Preceding pattern will be matched zero or more times |
+ |
Preceding pattern will be matched one or more times |
{ n} |
Preceding pattern is matched exactly n times |
{ n,} |
Preceding pattern is matched n or more times |
{ n, m} |
Preceding pattern is matched at least n times, but not more than m times |
Regular expressions (so far: single characters, character classes, and concatenations) can be qualified by whether a pattern can repeat (Table 4.10). For example, the pattern "^[fF].+t$"
matches single, isolated words that start with f or F, followed by 1 or more of any character, and ending with t.
vec6 <- c("fat", "fate", "feat", "bat", "Fahrenheit", "bat", "foot")
grep("^[fF].+t$", vec6, value = TRUE)
## [1] "fat" "feat" "Fahrenheit" "foot"
Repetition quantifiers gives us great flexibility to specify how often preceding patterns can repeat.
4.11.5 Alternation
Two or more regular expressions (so far: single characters, character classes, concatenations, and repetitions) may be joined by the infix operator |
. The resulting regular expression can match the pattern of any subexpression. For example, the World Health Organization (WHO) Global Burden of Disease (GBD) Study used International Classification of Diseases, 10th Revision (ICD-10) codes (ref). The GBD Study ICD-10 codes for hepatitis B are the following:
B16
,B16.0
,B16.1
,B16.2
,B16.3
,B16.4
,B16.5
,B16.7
,B16.8
,B16.9
,B17
,B17.0
,B17.2
,B17.8
,B18
,B18.0
,B18.1
,B18.8
,B18.9
Notice that B16 and B16.0 are not the same ICD-10 code! The GBD Study methods were used to study causes of death in San Francisco, California (ref). Underlying causes of death were obtained from the State of California, Center for Health Statistics. The ICD-10 code field did not have periods so that the hepatitis B codes were the following.
B16
,B160
,B161
,B162
,B163
,B164
,B165
,B167
,B168
,B169
,B17
,B170
,B172
,B178
,B18
,B180
,B181
,B188
,B189
To match the pattern of ICD-10 codes representing hepatitis B, the following regular expression was used (without spaces):
"^B16[0-9]?$|^B17[0,2,8]?$|^B18[0,1,8,9]?$"
This regular expression matches ˆB16[0-9]?$
or ˆB17[0,2,8]?$
or ˆB18[0,1,8,9]?$
. Similar to the first and third pattern, the second regular expression, ˆB17[0,2,8]?$
, matches B17
, B170
, B172
, or B178
as isolated text strings.
To see how this works, we can match each subexpression individually and then as an alternation:
hepb <- c("B16", "B160", "B161", "B162", "B163", "B164", "B165",
"B167", "B168", "B169", "B17", "B170", "B172", "B178",
"B18", "B180", "B181", "B188", "B189")
grep("^B16[0-9]?$", hepb, value = TRUE) # match 1st subexpression
## [1] "B16" "B160" "B161" "B162" "B163" "B164" "B165" "B167" "B168" "B169"
grep("^B17[0,2,8]?$", hepb, value = TRUE) # match 2nd subexpression
## [1] "B17" "B170" "B172" "B178"
grep("^B18[0,1,8,9]?$", hepb, value = TRUE) # match 3rd subexpression
## [1] "B18" "B180" "B181" "B188" "B189"
#### match any subexpression
grep("^B16[0-9]?$|^B17[0,2,8]?$|^B18[0,1,8,9]?$", hepb, value = TRUE)
## [1] "B16" "B160" "B161" "B162" "B163" "B164" "B165" "B167" "B168" "B169"
## [11] "B17" "B170" "B172" "B178" "B18" "B180" "B181" "B188" "B189"
A natural use for these pattern matches is for indexing and replacement. We illustrate this using the 2nd subexpression.
#### indexing by position
hepb[grep("^B17[0,2,8]?$", hepb)]
## [1] "B17" "B170" "B172" "B178"
#### replacement by position
hepb[grep("^B17[0,2,8]?$", hepb)] <- "HBV"
hepb
## [1] "B16" "B160" "B161" "B162" "B163" "B164" "B165" "B167" "B168" "B169"
## [11] "HBV" "HBV" "HBV" "HBV" "B18" "B180" "B181" "B188" "B189"
Using regular expression alternations allowed us to efficiently code over 14,000 death records, with over 900 ICD-10 cause of death codes, into 117 mutually exclusive cause of death categories for our San Francisco study. Suppose sfdat
was the data frame with San Francisco deaths for 2003–2004. Then the following code would tabulate the deaths caused by hepatatis B:
sfdat$hepb <- rep("No", nrow(sfdat)) #new field
get.hepb <- grep("^B16[0-9]?$|^B17[0,2,8]?$|^B18[0,1,8,9]?$", sfdat$icd10)
sfdat$hepb[get.hepb] <- "Yes"
table(sfdat$hepb)
Therefore, in San Francisco, during the period 2003-2004, there were 23 deaths caused by hepatitis B. Without regular expressions, this mortality analysis would have been prohibitively tedious.
In this next example we use regular expressions to correct misspelled data. Suppose we have a data vector containing my first name (“Tomas”), but sometimes misspelled. We want to locate the most common misspellings and correct them:
tdat <- c("Tom", "Thomas", "Tomas", "Tommy", "tomas")
misspelled <- grep("^[Tt]omm?y?$|^[Tt]homas$|^tomas$", tdat)
misspelled
## [1] 1 2 4 5
tdat[misspelled] <- "Tomas"
tdat
## [1] "Tomas" "Tomas" "Tomas" "Tomas" "Tomas"
4.11.6 Repetition > Concatenation > Alternation
Repetition takes precedence over concatenation, which in turn takes precedence over alternation. A whole subexpression may be enclosed in parentheses to override these precedence rules. For example, consider how the following regular expression changes when parentheses are used give concatenation precedence over repetition:
vec7 <- c("Tommy", "Tomas", "Tomtom")
grep("[Tt]om{2,}", vec7, value=TRUE) # repetition takes precedence
## [1] "Tommy"
grep("([Tt]om){2,}", vec7, value=TRUE) # concatenation takes precedence
## [1] "Tomtom"
Recall that {2,}
means repeat the previous 2 or more times.
4.11.7 Metacharacters
Char. | Description | Example | Literal search |
---|---|---|---|
^ |
Matches empty string at beginning of line | "^my car" |
See p. TODO |
When 1st character in character class, matches any single character not in the list | "[^abc]" |
See p. TODO | |
$ |
Matches empty string at end of line | "my car$" |
"[$]" |
[ |
Character class | "[[]" |
|
. |
Matches any single character | "p.t" |
"[.]" |
? |
Repetition quantifier (Table 4.10) | ".?" |
"[?]" |
* |
Repetition quantifier (Table 4.10) | ".*" |
"[*]" |
+ |
Repetition quantifier (Table 4.10) | ".+" |
"[+]" |
( ) |
Grouping subexpressions | "([Tt]om){2,}" |
"[(]" or "[)]" |
\(\mid\) | Join subexpresions, any of which can be matched | "Tomas \(\mid\)Luis" |
"[ \(\mid\)]" |
#### (1)
edat <- read.table("~/git/phds/data/evans.txt", header = TRUE, sep = "")
str(edat)
xtabs(~chd, data = edat)
edat$chd2 <- factor(edat$chd, levels = 0:1, labels = c("No", "Yes"))
xtabs(~chd2, data = edat) # test
xtabs(~cat, data = edat)
edat$cat2 <- factor(edat$cat, levels = 0:1, labels = c("Normal", "High"))
xtabs(~cat2, data = edat) # test
xtabs(~smk, data = edat)
edat$smk2 <- factor(edat$smk, levels = 0:1, labels = c("Never", "Ever"))
xtabs(~smk2, data = edat) # test
xtabs(~ecg, data = edat)
edat$ecg2 <- factor(edat$ecg, levels = 0:1, labels = c("Normal", "Abnormal"))
xtabs(~ecg2, data = edat) # test
xtabs(~hpt, data = edat)
edat$hpt2 <- factor(edat$hpt, levels = 0:1, labels = c("No", "Yes"))
xtabs(~hpt2, data = edat) # test
#### (2)
quantile(edat$age)
edat$age4 <- cut(edat$age, quantile(edat$age), right = FALSE, include.lowest = TRUE)
xtabs(~age4, data = edat) # test
#### (3)
hptnew <- rep(NA, nrow(edat))
normal <- edat$sbp < 120 & edat$dbp < 80
hptnew[normal] <- 1
prehyp <- (edat$sbp >= 120 & edat$sbp < 140) | (edat$dbp >= 80 & edat$dbp < 90)
hptnew[prehyp] <- 2
stage1 <- (edat$sbp >= 140 & edat$sbp < 160) | (edat$dbp >= 90 & edat$dbp < 100)
hptnew[stage1] <- 3
stage2 <- edat$sbp >= 160 | edat$dbp >= 100
hptnew[stage2] <- 4
edat$hpt4 <- factor(hptnew, levels = 1:4,
labels=c("Normal", "PreHTN", "HTN.Stage1", "HTN.Stage2"))
xtabs(~hpt4, data = edat) # test
#### (4)
xtabs(~hpt2 + hpt4, data = edat)
4.11.8
Review the California 2004 surveillance data on human West Nile virus cases available at ~/git/phds/data/wnv/wnv2004raw.txt. Read in the data setting as.is = TRUE
, taking into account missing values (use na.strings
option). Convert the calendar dates into the international standard format. Using the write.table
function export the data as an ASCII text file.
4.11.8.1 Solution
wdat <- read.table("~/git/phds/data/wnv/wnv2004raw.txt",
header = TRUE, sep = ",", as.is = TRUE, na.strings = c(".","Unknown"))
str(wdat)
wdat$date.onset2 <- as.Date(wdat$date.onset, format = "%m/%d/%Y")
wdat$date.tested2 <- as.Date(wdat$date.tested, format = "%m/%d/%Y")
write.table(wdat, "~/git/phds/example/wnvdat.txt",
sep = ",", row.names = FALSE)
wdat.test <- read.csv("~/git/phds/example/wnvdat.txt")
head(wdat.test)
str(wdat.test)
4.11.9
On April 19, 1940, the local health officer in the village of Lycoming, Oswego County, New York, reported the occurrence of an outbreak of acute gastrointestinal illness to the District Health Officer in Syracuse. Dr. A. M. Rubin, epidemiologist-in-training, was assigned to conduct an investigation. (See Appendix 16 for data dictionary.)
When Dr. Rubin arrived in the field, he learned from the health officer that all persons known to be ill had attended a church supper held on the previous evening, April 18. Family members who did not attend the church supper did not become ill. Accordingly, Dr. Rubin focused the investigation on the supper. He completed Interviews with 75 of the 80 persons known to have attended, collecting information about the occurrence and time of onset of symptoms, and foods consumed. Of the 75 persons interviewed, 46 persons reported gastrointestinal illness.
qThe onset of illness in all cases was acute, characterized chiefly by nausea, vomiting, diarrhea, and abdominal pain. None of the ill persons reported having an elevated temperature; all recovered within 24 to 30 hours. Approximately 20% of the ill persons visited physicians. No fecal specimens were obtained for bacteriologic examination. The investigators suspected that this was a vehicle-borne outbreak, with food as the vehicle. Dr. Rubin put his data into a line listing. See the raw data set at ~/git/phds/data/oswego/oswego.txt.
The supper was held in the basement of the village church. Foods were contributed by numerous members of the congregation. The supper began at 6:00 p.m. and continued until 11:00 p.m. Food was spread out on a table and consumed over a period of several hours. Data regarding onset of illness and food eaten or water drunk by each of the 75 persons interviewed are provided in the line listing. The approximate time of eating supper was collected for only about half the persons who had gastrointestinal illness.
- Using RStudio plot the cases by time of onset of illness (include appropriate labels and title). What does this graph tell you? (Hint: Process the text data and then use the
hist
function.) - Are there any cases for which the times of onset are inconsistent with the general experience? How might they be explained?
- How could the data be sorted by illness status and illness onset times?
- Where possible, calculate incubation periods and illustrate their distribution with an appropriate graph. Use the
truehist
function in the MASS package. Determine the mean, median, and range of the incubation period.
4.11.9.1 Solution
See Appendix 16 for Oswego data dictionary.
- Plotting an epidemic curve with this data has special challenges because we have dates and times to process. To do this in R, we will create date objects that contain both the date and time for each primary event of interest: meal time, and onset time of illness. From this we can plot the distribution of onset times (epidemic curve). An epidemic curve is the distribution of illness onset times and can be displayed with a histogram. First, carefully study the Oswego data set at ~/git/phds/data/oswego/oswego.txt. We need to do some data preparation in order to work with dates and times. Our initial goal is to get the date/time data to a form that can be passed to R’s
strptime
function for conversion in a date-time R object. To construct the following curve, study, and implement the R code that follows:
odat <- read.table("~/git/phds/data/oswego/oswego.txt", sep = "",
header = TRUE, na.strings = ".")
str(odat)
head(odat)
mdt <- paste("4/18/1940", odat$meal.time) # create vector w/ meal date and time
meal.dt <- strptime(mdt, "%m/%d/%Y %I:%M %p") # convert to standard date and time
#### create vector with onset date and time
odt <- paste(paste(odat$onset.date,"/1940",sep = ""), odat$onset.time)
onset.dt <- strptime(odt, "%m/%d/%Y %I:%M %p") # convert into std date & time
hist(onset.dt, breaks = 30, freq = TRUE)
- Now that we have our data frame in R, we can identify those subjects that correspond to minimum and maximum onset times. We will implement R code that can be interpreted as “which positions in vector Y correspond to the minimum values in Y?” We then use these position numbers to indexing the corresponding rows in the data frame.
#### Generate logical vectors and identify 'which' position
min.obs.pos <- which(onset.dt == min(onset.dt, na.rm = TRUE))
min.obs.pos
max.obs.pos <- which(onset.dt == max(onset.dt, na.rm = TRUE))
max.obs.pos
#### index data frame to display outliers
odat[min.obs.pos, ]
odat[max.obs.pos, ]
- We can sort the data frame based values of one or more fields. Suppose we want to sort on illness status and illness onset times. We will use our onset.times vector we created earlier; however, we will need to convert it to “continuous time” in seconds to sort this vector. Study and implement the R code below.
onset.ct <- as.POSIXct(onset.dt)
odat2 <- odat[order(odat$ill, onset.ct), ]; odat2
- Calculate incubation periods and plot histogram using
truehist
.
#### Calculate incubation periods
incub.dt <- onset.dt - meal.dt
library(MASS) #load MASS package
truehist(as.numeric(incub.dt), nbins = 7, prob = FALSE, ylab = 'Frequency',
col = "skyblue", xlab = "Incubation Period (hours)")
#### Calculate mean, median, range; remember to remove NAs
mean(incub.dt, na.rm = TRUE)
median(incub.dt, na.rm = TRUE)
range(incub.dt, na.rm = TRUE)
Available at ~/git/phds/data/ugdp.txt↩
If row names are supplied, they must be unique.↩
Raw data set available at [~/git/phds/data/wnv/wnv2004raw.txt], and clean data set available at [~/git/phds/data/wnv/wnv2004fin.txt]. This file is a comma-delimited data file with a header.↩
ISO 8601 is an international standard for date and time representations issued by the International Organization for Standardization (ISO). See http://www.iso.org↩
The
as.integer
function also works but does not preserve the levels attribute.↩For example, [~/git/phds/data/oswego/oswego.txt]↩
This approach is the basis for designing and implementing relational databases. A relational database consists of multiple tables linked by an indexing field.↩
If
na.omit
removes cases, the row numbers of the cases form the “na.action” attribute of the result, of class “omit”. Thena.exclude
function differs fromna.omit
only in the class of the “na.action” attribute of the result, which is “exclude”. See help for more details.↩For more information visit the Portable Application Standards Committee site at [http://www.pasc.org/]↩
The locale describes aspects of the internationalization of a program. Initially most aspects of the locale of R are set to “C” (which is the default for the C language and reflects North-American usage).↩