Chapter 40 Dates and Times in R
The horrors of dates and times (especially times and time zones) is a bane of computing. Dates have changed by government fiat (Gregorian to Julian, etc.) several times in history, and time zones and daylight savings time practices change more frequently. To add to the complications, there are many, many date formats around the world, with different countries and cultures mixing and matching days, months, and years willy-nilly to indicate a particular date. For this reason, there is an international standard, ISO 8601, for dates and times. The standard calls for listing all dates in YYYY-MM-DD format, with numbers for the 4 digit year, 2 digit month, and 2 digit day, as in 2021-04-17 for April 17, 2021. All numbers are zero-padded (necessary zeroes added to the left) to maintain 4 or two digits as needed. Use this format whenever you can, for consistency and to avoid errors that will be painful later. Insist that all of your collaborators use ISO8601 format, and your life will be vastly improved.
Time in ISO8601 uses a 24 hour clock, and is formatted in two ways: the basic format is T[hh][mm][ss] and the extended format is T[hh]:[mm]:[ss]. Each number in a time is two digits, and is zero-padded when needed, so that one second after midnight is T00:00:01.
When possible, use only dates in your data analysis. Times are complex and surprisingly painful, due to unusual cases that frequently cause trouble. These include:
GMT (the time zone for Greenwich mean time) is close to, but not the same as UTC. Coordinated Universal Time (UTC) has no daylight savings, while GMT does.
There are leap years, and leap seconds, to keep track of.
January 30 + 1 month = NA, because there is no Feb 30.
There are more than 24 time zones, because of episodic local changes. For example, tz = “America/Detroit” is not quite the same as Eastern Standard Time, as Michigan had daylight savings from 1916-1968, not from 1969-72, then returned to DST in 1973 and thereafter.
With all of the odd geopolitical time zone and DST variations, there are now more than 600 distinct valid time zones that R keeps track of, and you can print these with the base R function OlsonNames(). Arthur David Olson was the original maintainer of the time zone database, which is now managed by the Internet Assigned Numbers Authority (IANA), currently led by Paul Eggert at UCLA.
Unless you absolutely need time stamps to track intervals of less than a day, I strongly urge you to convert all date-times to simple ISO8601 dates.
40.1 Data Types for Dates and Times
Data for dates and times comes in several formats. - POSIXct - calendar
time (also called continuous time or count time, as it counts seconds) -
POSIXlt - local time (also called list time) - Date - date-time
POSIX is probably unfamiliar to you - it stands for the Portable Operating System Interface for uniX used to make standards for UNIX across Operating systems.
POSIX Dates use the reference point of 1970-01-01 T 0:00:00 UTC (midnight at the start of the year 1970 in coordinated universal time) as zero seconds. All dates after that are counted in seconds in POSIXct. Dates prior to 1970 are counted in negative seconds. POSIXct is very memory-efficient for storing dates, but not very human-readable.
In the code chunk below, we will start with a ISO 8601 date, with a time zone. We will then print it out, and then show it as it is stored, are using the unclass function. Copy and run the code chunk to see the results.
## [1] "2018-01-16 04:34:45 EST"
## [1] 1516095285
## attr(,"tzone")
## [1] "America/Detroit"
You get a nicely formatted ISO8601 date when you print the date, but when it is unclassed, you get a large integer, which corresponds to the number of seconds since midnight, UTC on January 1, 1970. POSIXct dates are stored as integer seconds.
40.2 Using POSIXlt
POSIXlt is another version of the POSIX standard, but in a more human-readable format. The ‘lt’ stands for local time, but it is often thought of as list time, as the data of POSIXlt is stored in a list format of human-readable items.
POSIXlt stores separate items in a list. The list includes 10 distinct items:
- seconds (sec)
- minutes (min)
- hours (hour)
- day of month (mday)
- month (mon) - numeric, with January as zero, and December as 11
- year (year) - since 1900 as zero, so the year 2018 = 118
- weekday numeric (wday), with Sunday = 0, Saturday = 6
- daylight savings (isdst), 0 for no, 1 for yes
- time zone (zone)
- Offset from UTC in seconds (gmtoff)
In the code chunk below, we will start with the same ISO 8601 date, with a time zone. We will then print it out, and then show it as it is stored, are using the unclass function. We will then extract the month and year. Copy and run the code chunk to see the results.
## [1] "2018-01-16 04:34:45 EST"
## $sec
## [1] 45
##
## $min
## [1] 34
##
## $hour
## [1] 4
##
## $mday
## [1] 16
##
## $mon
## [1] 0
##
## $year
## [1] 118
##
## $wday
## [1] 2
##
## $yday
## [1] 15
##
## $isdst
## [1] 0
##
## $zone
## [1] "EST"
##
## $gmtoff
## [1] -18000
##
## attr(,"tzone")
## [1] "America/Detroit" "EST" "EDT"
## attr(,"balanced")
## [1] TRUE
## [1] 0
## [1] 118
Again, we get a nicely formatted ISO8601 date when you print the date, but when it is unclassed, you get a list of date items. When we extract the month and year, these can be surprising (January = 0, Year 2018 = 118). POSIXlt dates are stored as a list of date items.
40.5 Including Tables
covid %>%
count(demo_group, gender) %>%
gt() %>%
tab_header(title = "Demographics of COVID Testing",
subtitle = "By Group and Gender") %>%
tab_source_note(source_note = "From CHOP, 2020") %>%
cols_label(demo_group = "Group",
gender = "Gender",
n = "Count")
Demographics of COVID Testing | ||
By Group and Gender | ||
Group | Gender | Count |
---|---|---|
client | female | 314 |
client | male | 290 |
misc adult | female | 1214 |
misc adult | male | 1227 |
other adult | female | 126 |
other adult | male | 97 |
patient | female | 6178 |
patient | male | 6077 |
unidentified | male | 1 |
From CHOP, 2020 |
40.7 Code Chunk Options
Option | Values | Output |
---|---|---|
eval | TRUE/FALSE | Whether or not the code is run. |
echo | TRUE/FALSE | Show or hide the code |
include | TRUE/FALSE | Whether or not the
resulting output of a
code chunk is
displayed in the
document. FALSE
means that the code
will run, but will
not display results.
include = FALSE is
often used for the
setup chunk. |
warning | TRUE/FALSE | Whether warnings generated from your code will be displayed in the document. |
message | TRUE/FALSE | Whether messages generated from your code will be displayed in the document. |
fig.align | default, left, right, center | Where on the page the
output figure should
align. Text options
should be in quotes,
likef ig.align = "right" |
fig.width | default = 7 | figure width in inches |
fig .height | default = 7 | figure height in inches |
error | TRUE/FALSE | If TRUE , will not
stop building the
document if there is
an error in a code
chunk. |
cache | TRUE/FALSE | If TRUE , will store
the results and not
re-run the chunk.
Helpful for long, slow
calculations. But
watch out for this if
your data change and
your results do
not(!!). |
40.8 Try Out Other Chunk Options
Try adding different chunk options, including
include = FALSE
eval = FALSE, echo = TRUE
eval=TRUE, echo = FALSE
40.11 A Quick Quiz
- Which code chunk option hides the code?
- Which code chunk always comes first, and includes libraries and data import steps?
- What is the name of the code block (and the markup language it is
written in), set off with 3 hyphens(—) at the very top of your
Rmarkdown document, that tells
pandoc
how to format the final document? - What symbols do you use to make text bold in Rmarkdown?
- Which {knitr} function do you use to add images to your document with a code chunk?