Chapter 12 Dates and Times

library(tidyverse)
library(lubridate)

Dates within a computer require some special organization because there are several competing conventions for how to write a date (some of them more confusing than others) and because the sort order should be in the order that the dates occur in time.

One useful tidbit of knowledge is that computer systems store a time point as the number of seconds from set point in time, called the epoch. So long as you always use the same epoch, you doesn’t have to worry about when the epoch is, but if you are switching between software systems, you might run into problems if they use different epochs. In R, we use midnight on Jan 1, 1970. In Microsoft Excel, they use Jan 0, 1900.

Base R has a set powerful and flexible functions converting character strings into Date/Time objects. A great reference for the notation for specifying Date/Time is the help file for strptime. These functions require to specifying the format using a relatively complex set of rules. For example %y represents the two digit year, %Y represents the four digit year, %m represents the month, but %b represents the month written as Jan or Mar. These formats are very specific and easily get tripped up by variations such as using / instead of -.

# Why does base R force me to get the separating character correct?
as.Date(c('09-4-2004', '09/04/2004'), format='%m-%d-%Y')
## [1] "2004-09-04" NA

Furthermore doing any math with dates is challenging because day-light savings time, leap days and leap seconds make it so that 1 year is not always \(365\) days and 1 day is not always \(60*60*24=86400\) seconds. So there is conceivably a difference between adding 1 year to a date and adding 365 days (or \(31,536,000\) seconds).

Dr Wickham and his then PhD student Dr Grolemund introduced the lubridate package to address the need for a robust set of input functions that don’t require exact specification date separation characters and to allow for arithmetic in either a person-centric or second-centric fashion.

12.1 Creating Date and Time objects

R gives several mechanism for getting the current date and time.

# base::Sys.Date()     # Today's date
# base::Sys.time()     # Current Time and Date 
lubridate::today()   # Today's date
## [1] "2021-04-13"
lubridate::now()     # Today's Date and Time
## [1] "2021-04-13 13:24:58 MST"

If we have all of our date information as several numerical columns and we can specify year, month, and day, the easiest way to create a date is to use the function make_date() or make_datetime()

make_date(year=2020, month=10, day=31)  # Halloween!
## [1] "2020-10-31"
data.frame(Year_Col=rep(2020, 5), Month_Col=10, Day_col=1:5) %>%
  mutate(date = make_date(year = Year_Col, month=Month_Col, day=Day_col))
##   Year_Col Month_Col Day_col       date
## 1     2020        10       1 2020-10-01
## 2     2020        10       2 2020-10-02
## 3     2020        10       3 2020-10-03
## 4     2020        10       4 2020-10-04
## 5     2020        10       5 2020-10-05

However, we often need to create a Date or DateTime object from a character string that has all the information, but possibly mixed up in the ordering. We need to take a string or number that represents a date and tell the lubridate how to figure out which bits are the year, which are the month, and which are the day. The lubridate package uses the following functions:

Common Orders Uncommon Orders
ymd() Year Month Day dym() Day Year Month
mdy() Month Day Year myd() Month Year Day
dmy() Day Month Year ydm() Year Day Month

The uncommon orders aren’t likely to be used, but the lubridate package includes them for completeness. Once the order has been specified, the lubridate package will try as many different ways to parse the date that make sense. As a result, so long as the order is consistent, all of the following will work:

mdy( 'June 26, 1997', 'Jun 26 97', '6-26-97', '6-26-1997', '6/26/97', '6-26/97' )
## [1] "1997-06-26" "1997-06-26" "1997-06-26" "1997-06-26" "1997-06-26"
## [6] "1997-06-26"
mdy('June 26, 0097', 'June 26, 97',  'June 26, 68', 'June 26, 69')
## [1] "0097-06-26" "1997-06-26" "2068-06-26" "1969-06-26"

This shows by default if you only specify the year using two digits, lubridate will try to do something clever. It will default to either a 19XX or 20XX and it picks whichever is the closer date. This illustrates that you should ALWAYS fully specify the year using four digits.

The lubridate functions will also accommodate if an integer representation of the date, but it has to have enough digits to uniquely identify the month and day.

ymd(20090110)
## [1] "2009-01-10"
ymd(2009722) # only one digit for month --- error!
## Warning: All formats failed to parse. No formats found.
## [1] NA
ymd(2009116) # this is ambiguous! 1-16 or 11-6?
## Warning: All formats failed to parse. No formats found.
## [1] NA

If we want to add a time to a date, we will use a function with the suffix _hm or _hms. Suppose that we want to encode a date and time, for example, the date and time of my wedding ceremony

mdy_hm('Sep 18, 2010 5:30 PM', '9-18-2010 17:30')
## [1] "2010-09-18 17:30:00 UTC" "2010-09-18 17:30:00 UTC"

In the above case, lubridate is correctly parsing the AM/PM designation, but it is better to always specify times using 24 hour notation and skip the AM/PM designations.

12.2 Time Zones

Time zones are incredibly important because as humans, we like to have a reasonable scale designating the morning, evening, and night that is universally understood. However this introduces a huge number of complication in scheduling across time zones. To further complicate matters, daylight savings times have us skipping forward an hour during the spring and falling back an hour in the fall.

By default, R codes the time of day using UTC (Coordinated Universal Time) standard, which is nearly inter-changeable with Greenwich Mean Time (GMT). To specify a different time zone, use the tz= option. For example:

mdy_hm('9-18-2010 5:30 PM', tz='MST') # Mountain Standard Time
## [1] "2010-09-18 17:30:00 MST"

This isn’t bad, but Loveland, Colorado is on MST in the winter and MDT in the summer because of daylight savings time. So to specify the time zone that could switch between standard time and daylight savings time, I should specify tz='US/Mountain'

mdy_hm('9-18-2010 5:30 PM', tz='US/Mountain') # US mountain time
## [1] "2010-09-18 17:30:00 MDT"

Arizona is weird and doesn’t use daylight savings time. Fortunately R has a built-in time zone just for us.

mdy_hm('9-18-2010 17:30', tz='US/Arizona') # US Arizona time
## [1] "2010-09-18 17:30:00 MST"

R recognizes 582 different time zone locals and you can find these using the function OlsonNames(). To find out more about what these mean you can check out the Wikipedia page on timezones http://en.wikipedia.org/wiki/List_of_tz_database_time_zones.

Unfortunately the mdy_hm() family of functions only allows you to input a single timezone regardless of how many data time strings you pass into it. To handle this, the work flow will be to create a vector of datetime object with the default time zone being set to UTC, and then correct it with either force_tz() or force_tzs() where the later function allows the user to give a vector of time zones.

Function Description
force_tz() Take the designated time(s), which is often created without a time zone as is assumed to be UTC and converts it to the given timezone by keeping the clock time and moving the instant in time. That is to say, the initial clock time was WRONG and we are fixing it by giving the correct time zone.
force_tzs() Take the designated times, which is often created without a time zone as is assumed to be UTC and converts it to the given timezones by keeping the clock time and moving the instant in time. That is to say, the initial clock times were WRONG and we are fixing it by giving the correct time zone. Because R hates having vectors of dates with different time zones, this function forces all the date-times into a single output timezone, which by default is UTC.
with_tz() Takes a designated instances in time converts the representation into another timezone. Here the instant in time doesn’t change, but clock time displayed does change.
dates <- c('2013-11-2 10:20', 
           '2013-11-2 9:30', 
           '2013-11-2 16:20') # in chronological order!   
zones <- c('America/New_York', 
           'America/Denver',
           'America/New_York')

ymd_hm(dates) # Without any timezone specification, R assumes UTC, which is wrong here!
## [1] "2013-11-02 10:20:00 UTC" "2013-11-02 09:30:00 UTC"
## [3] "2013-11-02 16:20:00 UTC"
ymd_hm( dates, tz=zones )  # Does not work because of multiple time zones.
## Error in parse_date_time(dates, orders, tz = tz, quiet = quiet, locale = locale): `tz` argument must be a character of length one
ymd_hm(dates) %>% 
  force_tzs(time=., tzones = zones)  # Notice the output is UTC time.
## [1] "2013-11-02 14:20:00 UTC" "2013-11-02 15:30:00 UTC"
## [3] "2013-11-02 20:20:00 UTC"
ymd_hm(dates) %>%                       #
  force_tzs(tzones = zones) %>%         # Now the correct instant in time is set
  with_tz(tzone = 'America/New_York')   # Make these readable to someone in NYC.
## [1] "2013-11-02 10:20:00 EDT" "2013-11-02 11:30:00 EDT"
## [3] "2013-11-02 16:20:00 EDT"

This all seems quite reasonable, but when looking at data frames, the default for R is to not show the timezone for a column of dates.

data.frame( Date = dates, tzone = zones) %>%
  mutate(Date = ymd_hm(Date),
         Date = force_tzs(Date, tzones=tzone),
         Date = with_tz(Date,  'America/New_York')) %>%
  select(Date)
##                  Date
## 1 2013-11-02 10:20:00
## 2 2013-11-02 11:30:00
## 3 2013-11-02 16:20:00

Because time zones are finicky a R, if you have an application that has to deal with more than one timezone, I recommend always storing the information as UTC referenced values. Then you can do all your time point calculations knowing that you are on the same time scale. Then if you want to show a date to a user, you know that you always have to convert the time to the desired time zone.

12.3 Extracting information

The lubridate package provides many functions for extracting information from the date. Suppose we have defined

# Derek's wedding!
x <- mdy_hm('9-18-2010 17:30', tz='US/Mountain') # US Mountain time
Command Output Description
year(x) 2010 Year
month(x) 9 Month
day(x) 18 Day
hour(x) 17 Hour of the day
minute(x) 30 Minute of the hour
second(x) 0 Seconds
wday(x) 7 Day of the week (Sunday = 1)
mday(x) 18 Day of the month
yday(x) 261 Day of the year
tz(x) ‘US/Mountain’ Time Zone

Here we get the output as digits, where September is represented as a 9 and the day of the week is a number between 1-7. To get nicer labels, we can use label=TRUE for some commands. In conjunction with label=TRUE there is the option abbr=TRUE that specifies to return the abbreviation or not.

Command Output
wday(x, label=TRUE) Sat
wday(x, label=TRUE, abbr=FALSE) Saturday
month(x, label=TRUE) Sep
month(x, label=TRUE, abbr=FALSE) September

All of these functions can also be used to update the value. For example, we could move the day of the wedding from September \(18^{th}\) to October \(18^{th}\) by changing the month.

month(x) <- 10             # I really don't like this method of changing the month
x <- update(x, month=10)   # update feels a little better to me
x
## [1] "2010-10-18 17:30:00 MDT"

12.4 Printing Dates

We often need to print out character strings representing a particular date and time in a format that is convenient for humans to read. The output we’ve seen is acceptable for many instances, but if we want more control over the format we have to use one of the following methods.

The base R function format() allows for a wide variety of possibilities but we have to remember the syntax which is found in help file for strptime.

# This is the base R solution
# 
# %A = Day of the week (not abbreviated)
# %B = Month name written out (not abbreviated). 
# %I = Hour on 1-12 scale
# %P = am/pm designation using lowercase am/pm. %p gives the uppercase version
# %Z = Time Zone designation
format(x, '%A, %B %d, %Y at the time of %I:%M %P %Z')  
## [1] "Monday, October 18, 2010 at the time of 05:30 pm MDT"

What lubridate does is allows the user to specify the format using an example date. It then parses the example to figure out the format and then creates a function that will apply that format to any dates you want.

# The weekday needs to match up with the date in the example...
# Notice this still isn't completely un ambiguous and multiple formats are possible
my_fancy_formater <- stamp('Sunday, January 31, 1999 at 12:59 pm')
## Multiple formats matched: "%A, %B %d, %Y at %I:%M %p"(1), "Sunday, %Om %d, %Y at %H:%M %Op"(1), "Sunday, %B %d, %Y at %I:%M %p"(1), "%A, %Om %d, %Y at %H:%M %Op"(0), "%A, %B %d, %Y at %H:%M %Op"(0), "%A, %Om %d, %Y at %I:%M %p"(0), "Sunday, %B %d, %Y at %H:%M %Op"(0), "Sunday, %Om %d, %Y at %I:%M %p"(0)
## Using: "%A, %B %d, %Y at %I:%M %p"
my_fancy_formater( x )  
## [1] "Monday, October 18, 2010 at 05:30 PM"

The lubridate::stamp() function is pretty cool for getting started on an output format string, but I generally prefer to just dig into the strptime help and figure it out the format string I want exactly.

When just printing out dates, R is very reluctant to print out the time zone. When dealing with data frames of dates, I often find myself creating a character string with the time zone, just so I can double check the time zone information is correct.

12.5 Arithmetic on Dates

lubridate provides two different ways of dealing with arithmetic on dates, and Hadley’s chapter on Date/Times in R for Data Science is a good reference.

Recall that dates are stored as the number of seconds since 0:00:00 January 1, 1970 UTC. This fundamental idea that a date is just some number of seconds introduces the idea that a minute is just 60 seconds, an hour is 3600 seconds, a day is \(24*3600=86,400\) seconds, and finally a year is \(365*86,400=31,536,000\) seconds. But what about leap years! Years are not always 365 days and days are not always 24 hours (think about daylight savings times).

With this in mind, we need to be able to do arithmetic using conventional ideas of year/month/day that ignores any clock discontinuities as well as using precise ideas of exactly how many seconds elapsed between two time points.

Object Class Description
Periods Lubridate periods correspond to a person’s natural inclination of adding a year or month and ignores any clock discontinuities.
Durations Lubridate duration correspond to the exact number of seconds between two points in time and adding some number of seconds. I remember that durations are the dorky number of seconds definition.
Intervals Lubridate allows us to create an object that stores a beginning and ending time point.
current <- ymd_hms('2019-10-12 10:25:00', tz='MST')
current + years(1)  # period. There are also minutes, hours, days, months functions.
## [1] "2020-10-12 10:25:00 MST"
current + dyears(1) # duration. There are also dminutes, dhours, ddays, dmonths
## [1] "2020-10-11 16:25:00 MST"

Notice that dyears(1) didn’t just increment the years from 2019 to 2020, but rather added \(60*60*24*365\) seconds, and because 2020 is a leap year and therefore February 29 will exist. Thus adding \(31,536,000\) seconds ended up with a result of October 11\(^{th}\) instead of October 12\(^{th}\).

Once we have two or more Date objects defined, we can calculate the amount of time between the two dates. We’ll first create an interval that defines the exact start and stop of the time interval we care about and then convert that to either a period (person convention) or a duration (# of seconds).

Wedding <- ymd('2010-Sep-18')
Elise <- ymd('2013-Jan-11')

Childless =  interval(Wedding, Elise)  # Two different ways to 
Childless =  Wedding %--% Elise        # create a time interval

TimePeriod <- as.period( Childless )          # Turn it into person readable
TimePeriod
## [1] "2y 3m 24d 0H 0M 0S"
year ( TimePeriod )        # extract the number of years from above
## [1] 2
month( TimePeriod )        # extract the number of additional months from above
## [1] 3
as.period(Childless, unit = 'days')    # This accounts for leap years but not leap seconds! 
## [1] "846d 0H 0M 0S"
as.period(Childless, unit = 'seconds') # This accounts for leap years and leap seconds! 
## [1] "73094400S"
as.duration( Childless )                     # always accounts for clock shifts
## [1] "73094400s (~2.32 years)"
as.numeric( as.duration(Childless), 'years') # The decimal number of years
## [1] 2.316222

While working with dates, I like creating intervals whenever possible and try to NEVER just subtract two data/time objects because that will always just return the number of seconds (aka the duration answer).

As a demonstration, lets consider a data set where we have the individuals birthdays and we are interested in calculated the individuals age in years.

data <- tibble(
  Name = c('Steve', 'Sergey', 'Melinda', 'Bill', 'Alexa', 'Siri'),
  dob = c('Feb 24, 1955', 'August 21, 1973', 'Aug 15, 1964', 
          'October 28, 1955', 'November 6, 2014', 'October 12, 2011') )

data %>%
  mutate( dob = mdy(dob) ) %>%
  mutate( Life = dob %--% today() ) %>%
  mutate( Age = as.period(Life, units='years') ) %>%
  mutate( Years = year(Age))
## # A tibble: 6 x 5
##   Name    dob        Life                           Age                 Years
##   <chr>   <date>     <Interval>                     <Period>            <dbl>
## 1 Steve   1955-02-24 1955-02-24 UTC--2021-04-13 UTC 66y 1m 20d 0H 0M 0S    66
## 2 Sergey  1973-08-21 1973-08-21 UTC--2021-04-13 UTC 47y 7m 23d 0H 0M 0S    47
## 3 Melinda 1964-08-15 1964-08-15 UTC--2021-04-13 UTC 56y 7m 29d 0H 0M 0S    56
## 4 Bill    1955-10-28 1955-10-28 UTC--2021-04-13 UTC 65y 5m 16d 0H 0M 0S    65
## 5 Alexa   2014-11-06 2014-11-06 UTC--2021-04-13 UTC 6y 5m 7d 0H 0M 0S       6
## 6 Siri    2011-10-12 2011-10-12 UTC--2021-04-13 UTC 9y 6m 1d 0H 0M 0S       9

As a final example, suppose that an hourly employee clocked in at 11:30 PM March 7, 2020 and then clocked out at 7:30 AM March 8 2020. How long did he or she work?

In  <- ymd_hm('2020-3-7 11:30 PM', tz='US/Mountain')
Out <- ymd_hm('2020-3-8 7:45 AM', tz='US/Mountain')

In %--% Out
## [1] 2020-03-07 23:30:00 MST--2020-03-08 07:45:00 MDT
as.period(  In %--%Out)       # Does NOT account for daylight savings time!
## [1] "8H 15M 0S"
as.duration(In %--%Out)       # Does account for daylight savings time!
## [1] "26100s (~7.25 hours)"

To use a duration in any subsequent calculation, we need to convert it to a numeric value using the as.numeric() function, which can convert to whatever unit you want.

# 
time.worked <- as.duration(In %--%Out)
as.numeric(time.worked, "hours")
## [1] 7.25
as.numeric(time.worked, "minutes")
## [1] 435

12.6 Exercises

  1. Convert the following to date or date/time objects.
    1. September 13, 2010.
    2. Sept 13, 2010.
    3. Sep 13, 2010.
    4. S 13, 2010. Comment on the month abbreviation needs.
    5. 07-Dec-1941.
    6. 1-5-1998. Comment on why you might be wrong.
    7. 21-5-1998. Comment on why you know you are correct.
    8. 2020-May-5 10:30 am
    9. 2020-May-5 10:30 am PDT (ex Seattle)
    10. 2020-May-5 10:30 am AST (ex Puerto Rico)
  2. Using just your date of birth (ex Sep 7, 1998) and today’s date calculate the following Write your code in a manner that the code will work on any date after you were born.:
    1. Calculate the date of your 64th birthday.
    2. Calculate your current age (in years). Hint: Check your age is calculated correctly if your birthday was yesterday and if it were tomorrow!
    3. Using your result in part (b), calculate the date of your next birthday.
    4. The number of days until your next birthday.
    5. The number of months and days until your next birthday.
  1. Suppose you have arranged for a phone call to be at 3 pm on May 8, 2015 at Arizona time. However, the recipient will be in Auckland, NZ. What time will it be there?

  2. From this book’s GitHub directory, navigate to the data-raw directory and then download the Pulliam_Airport_Weather_Station.csv data file. (There are several weather station files. Make sure you get the correct one!) There is a DATE column (is it of type date when you import the data?) as well as the Maximum and Minimum temperature. For the last 5 years of data we have (exactly, not just starting at Jan 1, 2014!), plot the time series of daily maximum temperature with date on the x-axis. Write your code so that it will work if I update the dateset. Hint: Find the maximum date in the data set and then subtract 5 years. Will there be a difference if you use dyears(5) vs years(5)? Which seems more appropriate here?

  3. It turns out there is some interesting periodicity regarding the number of births on particular days of the year.

    1. Using the mosaicData package, load the data set Births78 which records the number of children born on each day in the United States in 1978. Because this problem is intended to show how to calculate the information using the date, remove all the columns except date and births.
    2. Graph the number of births vs the date with date on the x-axis. What stands out to you? Why do you think we have this trend?
    3. To test your assumption, we need to figure out the what day of the week each observation is. Use dplyr::mutate to add a new column named dow that is the day of the week (Monday, Tuesday, etc). This calculation will involve some function in the lubridate package and the date column.
    4. Plot the data with the point color being determined by the day of the week variable.