Chapter 12 Dates and Times

library(tidyverse)
library(lubridate) # handling of date and time objects

Dates within a computer require some special organization because there are several competing conventions for how to write a date (some of them more confusing than others). This includes conventions such as month-day-year form versus day-month-year, which can depend on your region, country, or just your style in some cases. In addition, time and date must have a logical sorting, such that when ordered dates should be chronological through time.

One useful tidbit of knowledge is that computer systems store a time point as the number of seconds from set point in time, called the epoch. So long as you always use the same epoch, there is no worry about when the epoch is. The only time you might run into problems with different epochs if when you switch between software systems. In R, we use midnight on Jan 1, 1970. In Microsoft Excel, they use Jan 1, 1900. A quick search of computer epoch time will demonstrate that not many systems have the same default, but conversions between epoch times do exist.

Base R has a set of powerful and flexible functions for converting character strings into Date/Time objects, typically of the type POSIXlt or POSIXct. A great reference for the notation for specifying Date/Time is the help file for strptime. These functions require to specifying the format using a relatively complex set of rules. For example %y represents the two digit year, %Y represents the four digit year, %m represents the month in numerical order (01 for Jan, 02 for Feb), but %b represents the month written as Jan or Mar. These formats are very specific and easily get tripped up. Below is a simple example of how variations such as using / instead of - can cause troubles for computer interpretation.

# Why does base R force me to get the separating character correct?
as.Date(c('09-4-2004', '09/04/2004'), format='%m-%d-%Y')
## [1] "2004-09-04" NA

Furthermore doing any math with dates is challenging. Day-light savings time, and concepts like leap years which include a leap day and leap seconds make it so that 1 year is not always \(365\) days and 1 day is not always \(60*60*24=86400\) seconds. This makes it so that there is a conceivable difference between adding 1 year to a date and adding 365 days (or \(31,536,000\) seconds).

Dr. Wickham and his then PhD student Dr. Grolemund introduced the lubridate package to address the need for a robust set of input functions that do not require exact date separation characters and to allow for simplification and easy differentiation of date/time arithmetic. All functions presented below require the package lubridate and are beyond the capabilities of base R.

12.1 Creating Date and Time objects

R gives several mechanism for getting the current date and time. Below are the base R versions of time and date, compared to the lubridate versions.

base::Sys.Date()     # Today's date
## [1] "2024-11-24"
base::Sys.time()     # Current Time and Date 
## [1] "2024-11-24 21:13:40 MST"
lubridate::today()   # Today's date
## [1] "2024-11-24"
lubridate::now()     # Today's Date and Time
## [1] "2024-11-24 21:13:40 MST"

One of the easiest functions to develop a date is the make_date() lubridate command. It is typically possible to make or provide numerical columns that specify a year, month, and day. These can then quickly be converted to a proper date object by supplying the colum information to the make_date() function.

### create a single date
make_date(year=2024, month=10, day=31)  # Halloween!
## [1] "2024-10-31"
### make many dates from each row, with date information in seperate columns
data.frame(Year_Col=rep(2024, 5), Month_Col=10, Day_col=1:5) %>%
  mutate(date = make_date(year = Year_Col, month=Month_Col, day=Day_col))
##   Year_Col Month_Col Day_col       date
## 1     2024        10       1 2024-10-01
## 2     2024        10       2 2024-10-02
## 3     2024        10       3 2024-10-03
## 4     2024        10       4 2024-10-04
## 5     2024        10       5 2024-10-05

Notice in the output above that the year, month, and day information started as numerical objects, but when converted, ended up as a date data type. It can be an important convention to properly convert date/time information into a date data type, so that when working with date/time information, the computer can properly process how to handle such information. Additionally to have numerical information as in the example above, we often need to create a date or date-tme object from a character string that has all the information. As discussed, it is possible that these character strings mix up in the ordering of years, months, and days. A common data cleaning task is to take a string or number that represents a date and tell the lubridate how to figure out which bits are the year, which are the month, and which are the day. The lubridate package uses the following functions:

Common Orders Uncommon Orders
ymd() Year Month Day dym() Day Year Month
mdy() Month Day Year myd() Month Year Day
dmy() Day Month Year ydm() Year Day Month

The uncommon orders aren’t likely to be used, but the lubridate package includes them for completeness. Once the order has been specified, the lubridate package will try as many different ways to parse the date that make sense. As a result, so long as the order is consistent, all of the following will work, regarless of the seperating form and representation style:

mdy( 'June 26, 1997', 'Jun 26 97', '6-26-97', '6-26-1997', '6/26/97', '6-26/97' )
## [1] "1997-06-26" "1997-06-26" "1997-06-26" "1997-06-26" "1997-06-26"
## [6] "1997-06-26"

Notice though that lubridate can get confused with less specific date information, such as its confusion on how to handle two digit years. This should not be surprising, because even the reader should not know if I meant the year to be in the 20th century (1900 - 1999) or 21st century (2000 - 2099), or possibly even some other century like the 1st century.

mdy('June 26, 0097', 'June 26, 97',  'June 26, 68', 'June 26, 69')
## [1] "0097-06-26" "1997-06-26" "2068-06-26" "1969-06-26"

The example above does show by default if you only specify the year using two digits, lubridate will try to do something clever. It will default to either a 19XX or 20XX and it picks whichever is the closer date. In general, this illustrates that you should always fully specify the year using four digits representation.

The lubridate functions will also accommodate an integer representation of the date, but it has to have enough digits to uniquely identify the month and day.

ymd(20090110)
## [1] "2009-01-10"
ymd(2009722) # only one digit for month --- error!
## Warning: All formats failed to parse. No formats found.
## [1] NA
ymd(2009116) # this is ambiguous! 1-16 or 11-6?
## Warning: All formats failed to parse. No formats found.
## [1] NA

If we want to add a time to a date, we will use a function with the suffix _hm or _hms. Suppose that we want to encode a date and time, for example:

mdy_hm('Sep 18, 2010 5:30 PM', '9-18-2010 17:30')
## [1] "2010-09-18 17:30:00 UTC" "2010-09-18 17:30:00 UTC"

In the above case, lubridate is correctly parsing the AM/PM designation, but it can be a better convention to specify times using 24 hour notation and skip the AM/PM designations.

12.2 Time Zones

Time zones are incredibly important because as humans, we like to have a reasonable scale designating the morning, evening, and night that is universally understood. This introduces a huge number of complications when scheduling across time zones. To further complicate matters, daylight savings time has us skipping forward an hour during the spring and falling back an hour in the fall - but of course here in Arizona (and a few other locations around the globe) we do not practice daylight savings time. We want ways to be able to handle changing time zones, with these intricate differences.

By default, R codes the time of day using UTC (Coordinated Universal Time), which is nearly inter-changeable with Greenwich Mean Time (GMT). To specify a different time zone, use the tz= option. For example:

mdy_hm('9-18-2010 5:30 PM', tz='MST') # Mountain Standard Time
## [1] "2010-09-18 17:30:00 MST"

This is not bad, but Loveland, Colorado is on MST in the winter and MDT in the summer because of daylight savings time. So to specify the time zone that could switch between standard time and daylight savings time, we should specify tz='US/Mountain'

mdy_hm('9-18-2010 5:30 PM', tz='US/Mountain') # US mountain time
## [1] "2010-09-18 17:30:00 MDT"

Arizona is weird and doesn’t use daylight savings time. Fortunately R has a built-in time zone just for us.

mdy_hm('9-18-2010 17:30', tz='US/Arizona') # US Arizona time
## [1] "2010-09-18 17:30:00 MST"

R recognizes \(582\) different time zones and you can find these using the function OlsonNames(). To find out more about what these mean you can check out the Wikipedia page on time zones http://en.wikipedia.org/wiki/List_of_tz_database_time_zones.

An unexpected challenge in dealing with vectors or data frames of date is that lubridate expects only a single value for tz. If you pass in a vector, it won’t work. The solution is to use group_by() or rowwise() prior to the calculation. Here is some simple data with different times, each in different time zones.

dates <- c('2013-11-2 10:20', '2013-11-2 9:30', '2013-11-2 16:20') # in chronological order!   
zones <- c('America/New_York', 'America/Denver','America/New_York')

None of the syntax below will work because lubridate cannot properly parse the time zone information. Even if we setup a proper data frame structure with each row being a different date and time zone, R is still casting an error.

# None of these will work due to not being able to parse multiple timezones
ymd_hm( dates, tz=zones ) 
## Error in parse_date_time(dates, orders, tz = tz, quiet = quiet, locale = locale, : `tz` argument must be a character of length one
ymd_hm(dates) %>% with_tz(tzone = zones)
## Error in as.POSIXlt.POSIXct(x, tz): invalid 'tz' value
data.frame(date = dates, zone=zones) %>%
  mutate(date = ymd_hm(date, tz=zone))
## Error in `mutate()`:
## ℹ In argument: `date = ymd_hm(date, tz = zone)`.
## Caused by error in `parse_date_time()`:
## ! `tz` argument must be a character of length one

To apply different time zones in a vectorized operation, we need to send a single time zone to each element in our dates vector. There are a few ways to account for this issue. The first is we can tell R to execute the operations on each row, a command previously discussed that can do this is the rowwise() function. This function will make each row its own group, and then the lubridate commands will execute.

data.frame(date=dates, zone=zones) %>%
  rowwise() %>%
  mutate(date = ymd_hm(date, tz=zone))
## # A tibble: 3 × 2
## # Rowwise: 
##   date                zone            
##   <dttm>              <chr>           
## 1 2013-11-02 10:20:00 America/New_York
## 2 2013-11-02 11:30:00 America/Denver  
## 3 2013-11-02 16:20:00 America/New_York

Unfortunately using rowwise() can be slow when working with large amounts of data. A second, and maybe better option, is to tell R to group the data by time zones and then pass in the first element of the zone vector.

data.frame(date = dates, zone=zones) %>%
  group_by(zone) %>%
  mutate(date = ymd_hm(date, tz=zone[1]))
## # A tibble: 3 × 2
## # Groups:   zone [2]
##   date                zone            
##   <dttm>              <chr>           
## 1 2013-11-02 08:20:00 America/New_York
## 2 2013-11-02 09:30:00 America/Denver  
## 3 2013-11-02 14:20:00 America/New_York

But why do the previous two results not agree? The previous two calculations are different because the calculations depend on the reference time zone stored with the date-time, but R does not output this information when we print the date object. The first example everything is in reference to the New York time zone, while in the second everything was reference to the Denver time zone. This is exceedingly confusing. To prove that these are all giving the right times, even though they look different, we can extract the time zone information and update the zone column after using it

### rowwise calculations, each row used its proper timezone
### but the final times all were stored in reference to Denver
data.frame(date=dates, zone=zones) %>%
  rowwise() %>% # these are the different input zones!
  mutate(date = ymd_hm(date, tz=zone)) %>%
  mutate(zone = tz(date)) # this is the output zone that is identical for all rows!
## # A tibble: 3 × 2
## # Rowwise: 
##   date                zone            
##   <dttm>              <chr>           
## 1 2013-11-02 10:20:00 America/New_York
## 2 2013-11-02 11:30:00 America/New_York
## 3 2013-11-02 16:20:00 America/New_York
### similarly, we can group_by the time zones to perform the calculations,
### but we chose zone[1] as the reference...
data.frame(date = dates, zone=zones) %>%
  group_by(zone) %>%  # these are the different input zones!
  mutate(date = ymd_hm(date, tz=zone[1])) %>%
  mutate(zone = tz(date)) # this is the output zone that is identical for all rows!
## # A tibble: 3 × 2
## # Groups:   zone [1]
##   date                zone          
##   <dttm>              <chr>         
## 1 2013-11-02 08:20:00 America/Denver
## 2 2013-11-02 09:30:00 America/Denver
## 3 2013-11-02 14:20:00 America/Denver

The take home message here is that working with time zones in R is finicky. If you have an application that has to deal with more than one time zone, it is recommend always storing the information as UTC referenced values. By doing this there is no conflicts on what the reference time zone was and how the time zone was stored. You can do all your time point calculations knowing that you are on the same time scale. Then if you want to show a date to a user, you need only convert the time to the desired time zone from the UTC standard. Who knew storing a time and date could be so difficult!

12.3 Extracting information

The lubridate package provides many functions for extracting information from the date. Suppose we have defined a particular date of interest.

x <- mdy_hm('9-18-2010 17:30', tz='US/Mountain') # US Mountain time

There might be many items we are interested in extracting, below lists many of these items and some of the differences in what is output.

Command Output Description
year(x) 2010 Year
month(x) 9 Month
day(x) 18 Day
hour(x) 17 Hour of the day
minute(x) 30 Minute of the hour
second(x) 0 Seconds
wday(x) 7 Day of the week (Sunday = 1)
mday(x) 18 Day of the month
yday(x) 261 Day of the year
tz(x) ‘US/Mountain’ Time Zone

Each of the above gives output as digits that is fine most of the time information. However, we also get an output value for day of week and month, where September is represented as a \(9\) and the day of the week is a number between \(1\) for Sunday and \(7\) for Saturday. If we prefer to get our output for this type of information using a proper string label, we can use label=TRUE argument. In conjunction with label=TRUE there is the option abbr=TRUE that specifies to return the abbreviation or not. Here is the syntax of such commands.

Command Output
wday(x, label=TRUE) Sat
wday(x, label=TRUE, abbr=FALSE) Saturday
month(x, label=TRUE) Sep
month(x, label=TRUE, abbr=FALSE) September

All of these functions can also be used to update the value. For example, we could move the date from September \(18^{th}\) to October \(18^{th}\) by changing the month. There are two ways to do this shown below, although the update command seems more intuitive to understanding that you are pushing a change into the time object.

month(x) <- 10             # less intuitive, but this works!
x <- update(x, month=10)   # update feels more intuitive, update the month to 10
x
## [1] "2010-10-18 17:30:00 MDT"

Often we want to consider some point in time, but need to convert the time zone into another time zone. There are many cases where we deal with reconciling times across time zones. The function with_tz() will take a given moment in time, with a corresponding reference time zone, and figure out when that same moment is in another timezone. For example, the HBO streaming service tends to make their most population shows available at 9pm on Sunday evenings, all based on Eastern time. We really want to know when we can start watching here in Arizona.

HBO <- ymd_hm('2024-10-27 21:00', tz='US/Eastern')
with_tz(HBO, tz='US/Arizona')
## [1] "2024-10-27 18:00:00 MST"

This means that HBO streaming shows are available to watch at 6 PM Arizona time (which changes depending on daylight savings time). A silly example to make sure you do not wait any longer than necessary to watch your favorite shows!

12.4 Printing Dates

We often need to print out character strings representing a particular date-time in a format that is convenient for humans to read. The output we have seen is acceptable for many instances, but if we want more control over the format we have to use one of the following methods. The base R function format() allows for a wide variety of possibilities but we have to remember the cumbersome syntax found in help file for strptime.

# This is the base R solution, works well but requires we look into the syntax
# %A = Day of the week (not abbreviated)
# %B = Month name written out (not abbreviated). 
# %I = Hour on 1-12 scale
# %P = am/pm designation using lowercase am/pm. %p gives the uppercase version
# %Z = Time Zone designation
format(HBO, '%A, %B %d, %Y at the time of %I:%M %P %Z')  
## [1] "Sunday, October 27, 2024 at the time of 09:00 pm EDT"

What lubridate does is allows the user to specify the format using an example date by applying the stamp() command. This function essentially creates a new function that makes it possible to parse an input date-time object into the format you supplied in the example date.

# The weekday needs to match up with the date in the example...
# Notice this still isn't completely unambiguous
# and R warns us that multiple formats are possible
my_fancy_formater <- stamp('Sunday, January 31, 1999 at the time of 12:59 pm')
## Multiple formats matched: "%A, %B %d, %Y at the time of %I:%M %p"(1), "Sunday, %B %d, %Y at the time of %I:%M %p"(1), "%A, %Om %d, %Y at the time of %I:%M %p"(0), "Sunday, %Om %d, %Y at the time of %I:%M %p"(0)
## Using: "%A, %B %d, %Y at the time of %I:%M %p"
my_fancy_formater( HBO )  
## [1] "Sunday, October 27, 2024 at the time of 09:00 PM"

When printing out date objects R is very reluctant to print out the time zone. When dealing with data frames of dates, it can be a useful practice to create a column that stores the time zone as a character string, which allows one to quickly double check if the time zone information is correct.

12.5 Arithmetic on Dates

The lubridate package provides two different ways of dealing with arithmetic on dates, and Hadley’s chapter on Date/Times in R for Data Science is a great reference if you do a lot of work in this area. Recall that dates are stored in R as the number of seconds since 0:00:00 January 1, 1970 UTC. This fundamental idea that a date is just some number of seconds introduces the idea that a minute is just 60 seconds, an hour is 3600 seconds, a day is \(24*3600=86,400\) seconds, and finally a year is \(365*86,400=31,536,000\) seconds. But what about leap years? Years are not always \(365\) days and days are not always \(24\) hours (specifically the day on which daylight savings time switches).

With this in mind, we need to be able to do arithmetic using conventional ideas of year/month/day that ignores clock discontinuities as well as using precise ideas of exactly how many seconds elapsed between two time points. There are three main ways lubridate thinks about how to calculate elapsed time.

Object Class Description
Periods Lubridate periods correspond to a person’s natural inclination of adding a year or month and ignores any clock discontinuities.
Durations Lubridate duration correspond to the exact number of seconds between two points in time and adding some number of seconds.
Intervals Lubridate allows us to create an object that stores a beginning and ending time point.
current <- ymd_hms('2024-10-21 17:00:00', tz='MST')
current + years(1)  # period. There are also minutes, hours, days, months functions.
## [1] "2025-10-21 17:00:00 MST"
current + dyears(1) # duration. There are also dminutes, dhours, ddays, dmonths\
## [1] "2025-10-21 23:00:00 MST"

Notice that dyears(1) did not just increment the years from 2024 to 2025, but rather added \(31557600\) seconds, which is slightly different than the elapsed time between years because it is accounting for any time discontinuities (daylight savings, leap days/seconds). Notice that years() and dyears() sees these calculations differently. Who knew arthmetic with times and dates was so complicated!

years(1)
## [1] "1y 0m 0d 0H 0M 0S"
dyears(1)
## [1] "31557600s (~1 years)"

Once we have two or more date-time objects defined, we can calculate the amount of time between the two dates. We’ll first create an interval that defines the exact start and stop of the time interval we care about and then convert that to either a period (person convention) or a duration (number of seconds).

PhD1 <- ymd('2012-Dec-12')
PhD2 <- ymd('2018-May-04')

MathPhD =  interval(PhD1, PhD2)    # Two different ways to 
MathPhD =  PhD1 %--% PhD2          # create a time interval

as.period(MathPhD)                 # Turn it into person readable (default years)
## [1] "5y 4m 22d 0H 0M 0S"
as.period(MathPhD, unit = 'days')  # Person readable version in days
## [1] "1969d 0H 0M 0S"
as.duration( MathPhD )             # number of seconds answer
## [1] "170121600s (~5.39 years)"

While working with dates, create intervals whenever possible and try to NEVER just subtract two data/time objects because that will always just return the number of seconds (aka the duration answer). As a demonstration, lets consider a data set where we have the individuals birthdays and we are interested in calculated the individuals age in years. Creating an interval then extracting the years from the period gives the ages as we think naturally think about them. Doing these calculations with durations might return some surprising results!

data <- tibble(
  Name = c('Steve', 'Sergey', 'Melinda', 'Bill', 'Alexa', 'Siri'),
  dob = c('Feb 24, 1955', 'August 21, 1973', 'Aug 15, 1964', 
          'October 28, 1955', 'November 6, 2014', 'October 12, 2011') )

data %>%
  mutate( dob = mdy(dob) ) %>%
  mutate( Life = dob %--% today() ) %>%
  mutate( Age = as.period(Life, units='years') ) %>%
  mutate( Age2 = year(Age) )
## # A tibble: 6 × 5
##   Name    dob        Life                           Age                  Age2
##   <chr>   <date>     <Interval>                     <Period>            <dbl>
## 1 Steve   1955-02-24 1955-02-24 UTC--2024-11-24 UTC 69y 9m 0d 0H 0M 0S     69
## 2 Sergey  1973-08-21 1973-08-21 UTC--2024-11-24 UTC 51y 3m 3d 0H 0M 0S     51
## 3 Melinda 1964-08-15 1964-08-15 UTC--2024-11-24 UTC 60y 3m 9d 0H 0M 0S     60
## 4 Bill    1955-10-28 1955-10-28 UTC--2024-11-24 UTC 69y 0m 27d 0H 0M 0S    69
## 5 Alexa   2014-11-06 2014-11-06 UTC--2024-11-24 UTC 10y 0m 18d 0H 0M 0S    10
## 6 Siri    2011-10-12 2011-10-12 UTC--2024-11-24 UTC 13y 1m 12d 0H 0M 0S    13

As a final example, suppose that an hourly employee is set to work from 11:30 PM November 2nd, 2024 until 7:30 AM November 3rd, 2024. This just happens to be the night day light savings time switched in 2024. How long did they work?

In  <- ymd_hm('2024-11-2 11:30 PM', tz='US/Mountain')
Out <- ymd_hm('2024-11-3 7:45 AM', tz='US/Mountain')

In %--% Out
## [1] 2024-11-02 23:30:00 MDT--2024-11-03 07:45:00 MST
as.period(  In %--%Out)       # Does NOT account for daylight savings time!
## [1] "8H 15M 0S"
as.duration(In %--%Out)       # Does account for daylight savings time!
## [1] "33300s (~9.25 hours)"

To use a duration in any subsequent calculation, we need to convert it to a numeric value using the as.numeric() function, which can convert to whatever unit you want.

# 
time.worked <- as.duration(In %--%Out)
as.numeric(time.worked, "hours")
## [1] 9.25
as.numeric(time.worked, "minutes")
## [1] 555

12.6 Exercises

Exercise 1

Convert the following to date or date/time objects.

a) September 13, 2010.

b) Sept 13, 2010.

c) Sep 13, 2010.

d) S 13, 2010. Comment on the month abbreviation needs.

e) 07-Dec-1941.

f) 1-5-1998. Comment on why you might be wrong.

g) 21-5-1998. Comment on why you know you are correct.

h) 2020-May-5 10:30 am

i) 2020-May-5 10:30 am PDT (ex Seattle)

j) 2020-May-5 10:30 am AST (ex Puerto Rico)

Exercise 2

Using your date of birth (ex Sep 7, 1998) and today’s date calculate the following Write your code in a manner that the code will work on any date after you were born.:

a) Calculate the date of your 64th birthday.

b) Calculate your current age (in years). Hint: Check your age is calculated correctly if your birthday was yesterday and if it were tomorrow!

c) Using your result in part (b), calculate the date of your next birthday.

d) The number of days until your next birthday.

f) The number of months and days until your next birthday.

Exercise 3

Suppose you have arranged for a phone call to be at 3 pm on May 8, 2025 at Arizona time. However, the recipient will be in Auckland, NZ. What time will it be there?

Exercise 4

From this book’s GitHub directory, navigate to the data-raw directory and then download the Pulliam_Airport_Weather_Station.csv data file. (There are several weather station files. Make sure you get the correct one!) There is a DATE column (is it of type date when you import the data?) as well as the Maximum and Minimum temperature. For the last 5 years of data included in the file, plot the time series of daily maximum temperature with date on the x-axis. Write your code so that it will work if I update the date set. Hint: Find the maximum date in the data set and then subtract 5 years. Will there be a difference if you use dyears(5) vs years(5)? Which seems more appropriate here?

Exercise 5

It turns out there is some interesting periodicity regarding the number of births on particular days of the year.

a) Using the mosaicData package, load the data set Births78 which records the number of children born on each day in the United States in 1978. Because this problem is intended to show how to calculate the information using the date, remove all the columns except date and births.

b) Graph the number of births vs the date with date on the x-axis. What stands out to you? Why do you think we have this trend?

c) To test your assumption, we need to figure out the what day of the week each observation is. Use dplyr::mutate to add a new column named dow that is the day of the week (Monday, Tuesday, etc). This calculation will involve some function in the lubridate package and the date column.

d) Plot the data with the point color being determined by the day of the week variable.