Chapter 12 Dates and Times
Dates within a computer require some special organization because there are several competing conventions for how to write a date (some of them more confusing than others). This includes conventions such as month-day-year form versus day-month-year, which can depend on your region, country, or just your style in some cases. In addition, dates and times must sort logically, so that ordering them places them in chronological order.
One useful tidbit of knowledge is that computer systems store a time point as the number of seconds from a set point in time, called the epoch. So long as you always use the same epoch, there is no worry about when the epoch is. The only time you might run into problems with different epochs is when you switch between software systems. In R, we use midnight on Jan 1, 1970. In Microsoft Excel, they use Jan 1, 1900. A quick search of computer epoch time will demonstrate that not many systems have the same default, but conversions between epoch times do exist.
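As a quick check of the epoch idea, the following base R sketch converts two time points to their underlying numeric values. The specific dates are chosen purely for illustration.

```r
# One day after the epoch, expressed in UTC so no time-zone offset is involved
one_day_later <- as.POSIXct('1970-01-02 00:00:00', tz = 'UTC')

# The underlying number is seconds since midnight on Jan 1, 1970 (UTC)
as.numeric(one_day_later)            # 86400 = 60 * 60 * 24

# Date objects (no time of day) count days, not seconds, from the same epoch
as.numeric(as.Date('1970-01-02'))    # 1
```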
Base R has a set of powerful and flexible functions for converting character strings into Date/Time objects, typically of the type POSIXlt or POSIXct. A great reference for the notation for specifying Date/Time is the help file for strptime. These functions require specifying the format using a relatively complex set of rules. For example, %y represents the two digit year, %Y represents the four digit year, %m represents the month in numerical order (01 for Jan, 02 for Feb), but %b represents the month written as Jan or Mar. These formats are very specific and are easy to get wrong. Below is a simple example of how variations such as using / instead of - can cause trouble for computer interpretation.
# Why does base R force me to get the separating character correct?
as.Date(c('09-4-2004', '09/04/2004'), format='%m-%d-%Y')
## [1] "2004-09-04" NA
Furthermore, doing any math with dates is challenging. Daylight savings time, leap years (which include a leap day), and leap seconds mean that 1 year is not always \(365\) days and 1 day is not always \(60*60*24=86400\) seconds. As a result, there can be a difference between adding 1 year to a date and adding 365 days (or \(31,536,000\) seconds).
Dr. Wickham and his then PhD student Dr. Grolemund introduced the lubridate package to address the need for a robust set of input functions that do not require exact date separation characters, and to simplify date/time arithmetic while keeping its different forms easy to tell apart. All functions presented below require the package lubridate and are beyond the capabilities of base R.
12.1 Creating Date and Time objects
R gives several mechanisms for getting the current date and time. Below are the base R versions of time and date, compared to the lubridate versions.
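The calls that produced the output below are not shown. A plausible sketch, assuming the lubridate package is installed, pairs the base R functions Sys.Date() and Sys.time() with the lubridate functions today() and now():

```r
library(lubridate)

Sys.Date()   # base R: current date
Sys.time()   # base R: current date and time
today()      # lubridate: current date
now()        # lubridate: current date and time
```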
## [1] "2024-11-17"
## [1] "2024-11-17 16:09:33 MST"
## [1] "2024-11-17"
## [1] "2024-11-17 16:09:33 MST"
One of the easiest functions for building a date is the lubridate command make_date(). It is typically possible to make or provide numerical columns that specify a year, month, and day. These can then quickly be converted to a proper date object by supplying the column information to the make_date() function.
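The call behind the single date printed below is not shown. A minimal sketch that would produce such a date, assuming the values are supplied directly as numbers, is:

```r
library(lubridate)

# Build one date from numeric year, month, and day values
halloween <- make_date(year = 2024, month = 10, day = 31)
halloween
```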
## [1] "2024-10-31"
### make many dates from each row, with date information in separate columns
data.frame(Year_Col=rep(2024, 5), Month_Col=10, Day_col=1:5) %>%
  mutate(date = make_date(year = Year_Col, month=Month_Col, day=Day_col))
## Year_Col Month_Col Day_col date
## 1 2024 10 1 2024-10-01
## 2 2024 10 2 2024-10-02
## 3 2024 10 3 2024-10-03
## 4 2024 10 4 2024-10-04
## 5 2024 10 5 2024-10-05
Notice in the output above that the year, month, and day information started as numerical objects, but when converted, ended up as a date data type. It is an important convention to properly convert date/time information into a date data type, so that the computer knows how to handle it correctly. In addition to having numerical information as in the example above, we often need to create a date or date-time object from a character string that has all the information. As discussed, it is possible that these character strings vary in the ordering of years, months, and days. A common data cleaning task is to take a string or number that represents a date and tell lubridate how to figure out which bits are the year, which are the month, and which are the day. The lubridate package uses the following functions:
Common Orders | Uncommon Orders |
---|---|
ymd() Year Month Day | dym() Day Year Month |
mdy() Month Day Year | myd() Month Year Day |
dmy() Day Month Year | ydm() Year Day Month |
The uncommon orders aren’t likely to be used, but the lubridate package includes them for completeness. Once the order has been specified, the lubridate package will try as many different ways of parsing the date as make sense. As a result, so long as the order is consistent, all of the following will work, regardless of the separator and representation style:
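The six inputs behind the output below are not shown. As an illustration of the same idea, the sketch below parses four strings of my own choosing that share only the month-day-year order:

```r
library(lubridate)

# Same date written several ways; only the month-day-year order is shared
parsed <- mdy(c('June 26, 1997', 'Jun 26 1997', '6-26-1997', '6/26/1997'))
parsed
```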
## [1] "1997-06-26" "1997-06-26" "1997-06-26" "1997-06-26" "1997-06-26"
## [6] "1997-06-26"
Notice though that lubridate can get confused with less specific date information, such as two digit years. This should not be surprising, because even a human reader cannot know whether I meant the year to be in the 20th century (1900 - 1999), the 21st century (2000 - 2099), or possibly even some other century like the 1st century.
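The inputs behind the output below are not shown. A sketch with inputs that are consistent with it (an assumption on my part) is:

```r
library(lubridate)

# A four-digit year 0097, then two-digit years 97, 68, and 69
two_digit <- mdy(c('June 26, 0097', 'June 26, 97', 'June 26, 68', 'June 26, 69'))
two_digit
year(two_digit)   # which century did each two-digit year land in?
```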
## [1] "0097-06-26" "1997-06-26" "2068-06-26" "1969-06-26"
The example above does show that by default, if you only specify the year using two digits, lubridate will try to do something clever. It will default to either 19XX or 20XX; in the output above, 68 became 2068 while 69 became 1969. In general, this illustrates that you should always fully specify the year using four digits.
The lubridate functions will also accommodate an integer representation of the date, but it has to have enough digits to uniquely identify the month and day.
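The calls behind the output below are not shown. The following sketch uses integers of my own choosing that match the pattern in the output: one value with enough digits, and two that cannot be split into month and day unambiguously.

```r
library(lubridate)

ymd(20090110)   # eight digits: year, month, and day are unambiguous
ymd(2009722)    # month written with a single digit: fails to parse
ymd(2009116)    # ambiguous: January 16 or November 6?
```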
## [1] "2009-01-10"
## Warning: All formats failed to parse. No formats found.
## [1] NA
## Warning: All formats failed to parse. No formats found.
## [1] NA
If we want to add a time to a date, we will use a function with the suffix _hm or _hms. Suppose that we want to encode a date and time, for example:
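The exact calls behind the output below are not shown. A sketch with two inputs of my own choosing, one using AM/PM and one using 24 hour notation, is:

```r
library(lubridate)

t1 <- mdy_hm('Sep 18, 2010 5:30 PM')   # 12 hour clock with AM/PM
t2 <- mdy_hm('9-18-2010 17:30')        # 24 hour clock
c(t1, t2)
```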
## [1] "2010-09-18 17:30:00 UTC" "2010-09-18 17:30:00 UTC"
In the above case, lubridate is correctly parsing the AM/PM designation, but it can be a better convention to specify times using 24 hour notation and skip the AM/PM designations.
12.2 Time Zones
Time zones are incredibly important because as humans, we like to have a reasonable scale designating the morning, evening, and night that is universally understood. This introduces a huge number of complications when scheduling across time zones. To further complicate matters, daylight savings time has us skipping forward an hour during the spring and falling back an hour in the fall - but of course here in Arizona (and a few other locations around the globe) we do not practice daylight savings time. We want ways to be able to handle changing time zones, with these intricate differences.
By default, the lubridate parsing functions code the time of day using UTC (Coordinated Universal Time), which is nearly interchangeable with Greenwich Mean Time (GMT). To specify a different time zone, use the tz= option. For example:
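The call behind the output below is not shown. A sketch consistent with it, assuming the same date-time string used earlier, is:

```r
library(lubridate)

# Fixed Mountain Standard Time, with no daylight savings adjustment
t_mst <- mdy_hm('9-18-2010 17:30', tz = 'MST')
t_mst
```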
## [1] "2010-09-18 17:30:00 MST"
This is not bad, but Loveland, Colorado is on MST in the winter and MDT in the summer because of daylight savings time. So to specify a time zone that switches between standard time and daylight savings time, we should specify tz='US/Mountain'.
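A sketch of the corresponding call, again assuming the same date-time string, shows the abbreviation switching to MDT because September falls within daylight savings time:

```r
library(lubridate)

t_mountain <- mdy_hm('9-18-2010 17:30', tz = 'US/Mountain')
t_mountain
format(t_mountain, '%Z')   # abbreviation in effect on that date
```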
## [1] "2010-09-18 17:30:00 MDT"
Arizona is weird and doesn’t use daylight savings time. Fortunately R has a built-in time zone just for us.
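The call is not shown, but the zone name 'US/Arizona' appears among the Olson names, so a sketch under that assumption is:

```r
library(lubridate)

t_arizona <- mdy_hm('9-18-2010 17:30', tz = 'US/Arizona')
t_arizona
format(t_arizona, '%Z')   # stays on standard time even in September
```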
## [1] "2010-09-18 17:30:00 MST"
R recognizes \(582\) different time zones and you can find these using the function OlsonNames(). To find out more about what these mean you can check out the Wikipedia page on time zones http://en.wikipedia.org/wiki/List_of_tz_database_time_zones.
An unexpected challenge in dealing with vectors or data frames of dates is that lubridate expects only a single value for tz. If you pass in a vector, it won’t work. The solution is to use group_by() or rowwise() prior to the calculation. Here is some simple data with different times, each in different time zones.
dates <- c('2013-11-2 10:20', '2013-11-2 9:30', '2013-11-2 16:20') # in chronological order!
zones <- c('America/New_York', 'America/Denver','America/New_York')
None of the syntax below will work because lubridate cannot accept a vector of time zones. Even if we set up a proper data frame structure with each row being a different date and time zone, R still throws an error.
# None of these will work due to not being able to parse multiple timezones
ymd_hm( dates, tz=zones )
## Error in parse_date_time(dates, orders, tz = tz, quiet = quiet, locale = locale, : `tz` argument must be a character of length one
## Error in as.POSIXlt.POSIXct(x, tz): invalid 'tz' value
## Error in `mutate()`:
## ℹ In argument: `date = ymd_hm(date, tz = zone)`.
## Caused by error in `parse_date_time()`:
## ! `tz` argument must be a character of length one
To apply different time zones in a vectorized operation, we need to send a single time zone to each element in our dates vector. There are a few ways to account for this issue. The first is to tell R to execute the operations on each row; a previously discussed command that does this is the rowwise() function. This function makes each row its own group, and then the lubridate commands execute one row at a time.
## # A tibble: 3 × 2
## # Rowwise:
## date zone
## <dttm> <chr>
## 1 2013-11-02 10:20:00 America/New_York
## 2 2013-11-02 11:30:00 America/Denver
## 3 2013-11-02 16:20:00 America/New_York
Unfortunately, using rowwise() can be slow when working with large amounts of data. A second, and maybe better, option is to tell R to group the data by time zone and then pass in the first element of the zone vector within each group.
## # A tibble: 3 × 2
## # Groups: zone [2]
## date zone
## <dttm> <chr>
## 1 2013-11-02 08:20:00 America/New_York
## 2 2013-11-02 09:30:00 America/Denver
## 3 2013-11-02 14:20:00 America/New_York
But why do the previous two results not agree? They differ because the printed clock times depend on the reference time zone stored with the date-time column, but R does not output this information when we print the date object. In the first example everything is in reference to the New York time zone, while in the second everything is in reference to the Denver time zone. This is exceedingly confusing. To show that these are all giving the right times, even though they look different, we can extract the time zone information and overwrite the zone column after using it.
### rowwise calculations, each row used its proper timezone
### but the final times all were stored in reference to New York
data.frame(date=dates, zone=zones) %>%
  rowwise() %>%                             # these are the different input zones!
  mutate(date = ymd_hm(date, tz=zone)) %>%
  mutate(zone = tz(date))                   # this is the output zone that is identical for all rows!
## # A tibble: 3 × 2
## # Rowwise:
## date zone
## <dttm> <chr>
## 1 2013-11-02 10:20:00 America/New_York
## 2 2013-11-02 11:30:00 America/New_York
## 3 2013-11-02 16:20:00 America/New_York
### similarly, we can group_by the time zones to perform the calculations,
### but we chose zone[1] as the reference...
data.frame(date = dates, zone=zones) %>%
  group_by(zone) %>%                           # these are the different input zones!
  mutate(date = ymd_hm(date, tz=zone[1])) %>%
  mutate(zone = tz(date))                      # this is the output zone that is identical for all rows!
## # A tibble: 3 × 2
## # Groups: zone [1]
## date zone
## <dttm> <chr>
## 1 2013-11-02 08:20:00 America/Denver
## 2 2013-11-02 09:30:00 America/Denver
## 3 2013-11-02 14:20:00 America/Denver
The take home message here is that working with time zones in R is finicky. If you have an application that has to deal with more than one time zone, it is recommended to always store the information as UTC referenced values. By doing this there are no conflicts about what the reference time zone was and how the time zone was stored. You can do all your time point calculations knowing that you are on the same time scale. Then if you want to show a date to a user, you need only convert the time from the UTC standard to the desired time zone. Who knew storing a time and date could be so difficult!
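A minimal sketch of that recommendation, using the lubridate function with_tz() and the first time point from the example data, might look like the following. The variable names are my own.

```r
library(lubridate)

# Parse a New York wall-clock time, then re-express the same instant in UTC
ny  <- ymd_hm('2013-11-2 10:20', tz = 'America/New_York')
utc <- with_tz(ny, tzone = 'UTC')
utc

# Convert back to a local zone only when displaying the value
with_tz(utc, tzone = 'America/Denver')
```

Because with_tz() changes only the reference zone and not the underlying instant, all three objects compare as equal.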
12.3 Extracting information
The lubridate package provides many functions for extracting information from the date. Suppose we have defined a particular date of interest.
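The definition of x is not shown. A definition consistent with the values in the table below (an assumption on my part) is the Mountain time point used earlier:

```r
library(lubridate)

x <- mdy_hm('9-18-2010 17:30', tz = 'US/Mountain')

year(x)    # calendar year
wday(x)    # day of the week, counting from Sunday = 1 by default
yday(x)    # day of the year
```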
There might be many items we are interested in extracting. The table below lists many of these items and some of the differences in what is output.
Command | Output | Description |
---|---|---|
year(x) | 2010 | Year |
month(x) | 9 | Month |
day(x) | 18 | Day |
hour(x) | 17 | Hour of the day |
minute(x) | 30 | Minute of the hour |
second(x) | 0 | Seconds |
wday(x) | 7 | Day of the week (Sunday = 1) |
mday(x) | 18 | Day of the month |
yday(x) | 261 | Day of the year |
tz(x) | 'US/Mountain' | Time Zone |
Each of the above returns numbers, which is fine for most of the time information. However, month and day of the week are also returned as numbers, where September is represented as a \(9\) and the day of the week is a number between \(1\) for Sunday and \(7\) for Saturday. If we prefer to get this type of information as a proper string label, we can use the label=TRUE argument. In conjunction with label=TRUE there is the option abbr, which specifies whether to return the abbreviation (abbr=TRUE, the default) or the full name (abbr=FALSE). Here is the syntax of such commands.
Command | Output |
---|---|
wday(x, label=TRUE) | Sat |
wday(x, label=TRUE, abbr=FALSE) | Saturday |
month(x, label=TRUE) | Sep |
month(x, label=TRUE, abbr=FALSE) | September |
All of these functions can also be used to update the value. For example, we could move the date from September \(18^{th}\) to October \(18^{th}\) by changing the month. There are two ways to do this shown below, although the update command makes it more obvious that you are pushing a change into the time object.
month(x) <- 10 # less intuitive, but this works!
x <- update(x, month=10) # update feels more intuitive, update the month to 10
x
## [1] "2010-10-18 17:30:00 MDT"
Often we want to consider some point in time, but need to convert the time zone into another time zone. There are many cases where we deal with reconciling times across time zones. The function with_tz() will take a given moment in time, with a corresponding reference time zone, and figure out when that same moment is in another time zone. For example, the HBO streaming service tends to make their most popular shows available at 9pm on Sunday evenings, all based on Eastern time. We really want to know when we can start watching here in Arizona.
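The code for this example is not shown. A sketch consistent with the output below, assuming the release time is stored in an object named HBO (the name used later when printing dates) and using the zone names 'US/Eastern' and 'US/Arizona', is:

```r
library(lubridate)

# 9 PM Eastern on a Sunday evening
HBO <- ymd_hm('2024-10-27 21:00', tz = 'US/Eastern')

# The same instant, expressed on an Arizona clock
with_tz(HBO, tzone = 'US/Arizona')
```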
## [1] "2024-10-27 18:00:00 MST"
This means that HBO streaming shows are available to watch at 6 PM Arizona time (which changes depending on daylight savings time). A silly example to make sure you do not wait any longer than necessary to watch your favorite shows!
12.4 Printing Dates
We often need to print out character strings representing a particular date-time in a format that is convenient for humans to read. The output we have seen is acceptable for many instances, but if we want more control over the format we have to use one of the following methods. The base R function format() allows for a wide variety of possibilities, but we have to remember the cumbersome syntax found in the help file for strptime.
# This is the base R solution, works well but requires we look into the syntax
# %A = Day of the week (not abbreviated)
# %B = Month name written out (not abbreviated).
# %I = Hour on 1-12 scale
# %P = am/pm designation using lowercase am/pm. %p gives the uppercase version
# %Z = Time Zone designation
format(HBO, '%A, %B %d, %Y at the time of %I:%M %P %Z')
## [1] "Sunday, October 27, 2024 at the time of 09:00 pm EDT"
Instead, lubridate allows the user to specify the format using an example date by applying the stamp() command. This function essentially creates a new function that formats an input date-time object in the style of the example date you supplied.
# The weekday needs to match up with the date in the example...
# Notice this still isn't completely unambiguous
# and R warns us that multiple formats are possible
my_fancy_formater <- stamp('Sunday, January 31, 1999 at the time of 12:59 pm')
## Multiple formats matched: "%A, %B %d, %Y at the time of %I:%M %p"(1), "Sunday, %B %d, %Y at the time of %I:%M %p"(1), "%A, %Om %d, %Y at the time of %I:%M %p"(0), "Sunday, %Om %d, %Y at the time of %I:%M %p"(0)
## Using: "%A, %B %d, %Y at the time of %I:%M %p"
my_fancy_formater(HBO)   # apply the new formatting function to a date-time
## [1] "Sunday, October 27, 2024 at the time of 09:00 PM"
When printing out date objects R is very reluctant to print out the time zone. When dealing with data frames of dates, it can be a useful practice to create a column that stores the time zone as a character string, which allows one to quickly double check if the time zone information is correct.
12.5 Arithmetic on Dates
The lubridate package provides two different ways of dealing with arithmetic on dates, and Hadley’s chapter on Date/Times in R for Data Science is a great reference if you do a lot of work in this area. Recall that dates are stored in R as the number of seconds since 0:00:00 January 1, 1970 UTC. This fundamental idea that a date is just some number of seconds introduces the idea that a minute is just 60 seconds, an hour is 3600 seconds, a day is \(24*3600=86,400\) seconds, and finally a year is \(365*86,400=31,536,000\) seconds. But what about leap years? Years are not always \(365\) days and days are not always \(24\) hours (specifically the day on which daylight savings time switches).
With this in mind, we need to be able to do arithmetic using conventional ideas of year/month/day that ignores clock discontinuities as well as using precise ideas of exactly how many seconds elapsed between two time points. There are three main ways lubridate thinks about how to calculate elapsed time.
Object Class | Description |
---|---|
Periods | Lubridate periods correspond to a person’s natural inclination of adding a year or month, and ignore any clock discontinuities. |
Durations | Lubridate durations correspond to the exact number of seconds between two points in time, so adding a duration adds a fixed number of seconds. |
Intervals | Lubridate intervals store a beginning and an ending time point. |
current <- ymd_hms('2024-10-21 17:00:00', tz='MST')
current + years(1) # period. There are also minutes, hours, days, months functions.
## [1] "2025-10-21 17:00:00 MST"
current + dyears(1)  # duration. There are also dminutes, dhours, ddays functions.
## [1] "2025-10-21 23:00:00 MST"
Notice that dyears(1) did not just increment the year from 2024 to 2025, but rather added \(31557600\) seconds, which is \(365.25\) days: the average length of a year once leap days are accounted for. That is why the second result lands six hours later on the clock than the years(1) result. Notice that years() and dyears() see these calculations differently. Who knew arithmetic with times and dates was so complicated!
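The two printed values below presumably come from printing the period and duration objects directly. A sketch of that, with one extra line converting the duration to days, is:

```r
library(lubridate)

years(1)    # a period: one calendar year
dyears(1)   # a duration: a fixed number of seconds

# The duration expressed in days
as.numeric(dyears(1)) / 86400
```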
## [1] "1y 0m 0d 0H 0M 0S"
## [1] "31557600s (~1 years)"
Once we have two or more date-time objects defined, we can calculate the amount of time between the two dates. We’ll first create an interval that defines the exact start and stop of the time interval we care about and then convert that to either a period (person convention) or a duration (number of seconds).
PhD1 <- ymd('2012-Dec-12')
PhD2 <- ymd('2018-May-04')
MathPhD = interval(PhD1, PhD2) # Two different ways to
MathPhD = PhD1 %--% PhD2 # create a time interval
as.period(MathPhD) # Turn it into person readable (default years)
## [1] "5y 4m 22d 0H 0M 0S"
## [1] "1969d 0H 0M 0S"
## [1] "170121600s (~5.39 years)"
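The calls behind the second and third outputs above are not shown. A sketch that would give results of that form, assuming the unit argument of as.period() was used for the days version, is:

```r
library(lubridate)

MathPhD <- ymd('2012-12-12') %--% ymd('2018-05-04')

in_days <- as.period(MathPhD, unit = 'day')   # the period expressed in days
in_days
as.duration(MathPhD)                          # exact number of seconds
```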
While working with dates, create intervals whenever possible and try to NEVER just subtract two date/time objects, because that returns only the raw elapsed time (aka the duration answer). As a demonstration, let’s consider a data set where we have individuals’ birthdays and we are interested in calculating each individual’s age in years. Creating an interval and then extracting the years from the period gives the ages as we naturally think about them. Doing these calculations with durations might return some surprising results!
data <- tibble(
Name = c('Steve', 'Sergey', 'Melinda', 'Bill', 'Alexa', 'Siri'),
dob = c('Feb 24, 1955', 'August 21, 1973', 'Aug 15, 1964',
'October 28, 1955', 'November 6, 2014', 'October 12, 2011') )
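data %>%
  mutate( dob = mdy(dob) ) %>%                       # parse the birthdays as dates
  mutate( Life = dob %--% today() ) %>%              # interval from birth to today
  mutate( Age = as.period(Life, unit='years') ) %>%  # period in years (argument is named unit)
  mutate( Age2 = year(Age) )                         # whole years only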
## # A tibble: 6 × 5
## Name dob Life Age Age2
## <chr> <date> <Interval> <Period> <dbl>
## 1 Steve 1955-02-24 1955-02-24 UTC--2024-11-17 UTC 69y 8m 24d 0H 0M 0S 69
## 2 Sergey 1973-08-21 1973-08-21 UTC--2024-11-17 UTC 51y 2m 27d 0H 0M 0S 51
## 3 Melinda 1964-08-15 1964-08-15 UTC--2024-11-17 UTC 60y 3m 2d 0H 0M 0S 60
## 4 Bill 1955-10-28 1955-10-28 UTC--2024-11-17 UTC 69y 0m 20d 0H 0M 0S 69
## 5 Alexa 2014-11-06 2014-11-06 UTC--2024-11-17 UTC 10y 0m 11d 0H 0M 0S 10
## 6 Siri 2011-10-12 2011-10-12 UTC--2024-11-17 UTC 13y 1m 5d 0H 0M 0S 13
As a final example, suppose that an hourly employee is set to work from 11:30 PM November 2nd, 2024 until 7:45 AM November 3rd, 2024. This just happens to be the night daylight savings time switched in 2024. How long did they work?
In <- ymd_hm('2024-11-2 11:30 PM', tz='US/Mountain')
Out <- ymd_hm('2024-11-3 7:45 AM', tz='US/Mountain')
In %--% Out
## [1] 2024-11-02 23:30:00 MDT--2024-11-03 07:45:00 MST
as.period(In %--% Out)    # clock time difference
## [1] "8H 15M 0S"
as.duration(In %--% Out)  # actual elapsed time
## [1] "33300s (~9.25 hours)"
To use a duration in any subsequent calculation, we need to convert it to a numeric value using the as.numeric() function, which can convert to whatever unit you want.
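The calls behind the two numbers below are not shown. A sketch consistent with them, repeating the shift definition so that it runs on its own, is:

```r
library(lubridate)

In  <- ymd_hm('2024-11-2 11:30 PM', tz = 'US/Mountain')
Out <- ymd_hm('2024-11-3 7:45 AM',  tz = 'US/Mountain')

worked <- as.duration(In %--% Out)   # actual elapsed time across the clock change

as.numeric(worked, 'hours')          # elapsed time in hours
as.numeric(worked, 'minutes')        # elapsed time in minutes
```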
## [1] 9.25
## [1] 555
12.6 Exercises
Exercise 1
Convert the following to date or date/time objects.
a) September 13, 2010.
b) Sept 13, 2010.
c) Sep 13, 2010.
d) S 13, 2010. Comment on the month abbreviation needs.
e) 07-Dec-1941.
f) 1-5-1998. Comment on why you might be wrong.
g) 21-5-1998. Comment on why you know you are correct.
h) 2020-May-5 10:30 am
i) 2020-May-5 10:30 am PDT (ex Seattle)
j) 2020-May-5 10:30 am AST (ex Puerto Rico)
Exercise 2
Using your date of birth (ex Sep 7, 1998) and today’s date, calculate the following. Write your code in a manner that will work on any date after you were born.
a) Calculate the date of your 64th birthday.
b) Calculate your current age (in years). Hint: Check your age is calculated correctly if your birthday was yesterday and if it were tomorrow!
c) Using your result in part (b), calculate the date of your next birthday.
d) The number of days until your next birthday.
e) The number of months and days until your next birthday.
Exercise 3
Suppose you have arranged for a phone call to be at 3 pm on May 8, 2025 at Arizona time. However, the recipient will be in Auckland, NZ. What time will it be there?
Exercise 4
From this book’s GitHub directory, navigate to the data-raw directory and then download the Pulliam_Airport_Weather_Station.csv data file. (There are several weather station files. Make sure you get the correct one!) There is a DATE column (is it of type date when you import the data?) as well as the Maximum and Minimum temperature. For the last 5 years of data included in the file, plot the time series of daily maximum temperature with date on the x-axis. Write your code so that it will work if I update the data set. Hint: Find the maximum date in the data set and then subtract 5 years. Will there be a difference if you use dyears(5) vs years(5)? Which seems more appropriate here?
Exercise 5
It turns out there is some interesting periodicity regarding the number of births on particular days of the year.
a) Using the mosaicData package, load the data set Births78 which records the number of children born on each day in the United States in 1978. Because this problem is intended to show how to calculate the information using the date, remove all the columns except date and births.
b) Graph the number of births vs the date with date on the x-axis. What stands out to you? Why do you think we have this trend?
c) To test your assumption, we need to figure out what day of the week each observation is. Use dplyr::mutate to add a new column named dow that is the day of the week (Monday, Tuesday, etc). This calculation will involve some function in the lubridate package and the date column.
d) Plot the data with the point color being determined by the day of the week variable.