6 Date/ time variables
Data Wrangling Recipes in R: Hilary Watt
6.1 Converting to date format: making R recognise data as R date format, when initially it is recognised as character format
Dates in R are stored as number of days since 1 January 1970.
as.Date() (base R) converts character dates written as YYYY-MM-DD, and R POSIX date/times, into R date format. For other date formats, lubridate (part of the tidyverse) comes to the rescue like a knight in shining armour. You need only note the order of the day, month and year, and choose the matching function. These functions are robust to different ways of writing dates: with or without separators (such as "-" or "/") between the day, month and year components, and with months given as numbers, full names or 3-letter abbreviations. Provided the order of day, month and year is consistent, different observations can use different separators or different ways of writing the month.
library(lubridate) # gives options more robust to different formats - or else use library(tidyverse)
mdy() # convert from strings with month, then day, then year into R date format
dmy() # convert from strings with day then month then year into R date format
ymd() # convert from strings with year then month then day into R date format
Other functions are available for different orders of day, month, year: simply change the order of d, m and y in the function name accordingly.
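As a small illustration of this robustness, the sketch below parses one date written in several ways. The example strings are made up for illustration, and month names are assumed to be in English (parsing of names depends on the locale).

```r
library(lubridate)

# The same date written several ways, all in day-month-year order
x <- c("25/12/2022", "25-12-2022", "25 Dec 2022", "25 December 2022")
parsed <- dmy(x)
parsed

# No separators at all
no_sep <- dmy("25122022")

# Base R needs an explicit format string, and handles one layout at a time
base_parsed <- as.Date("25/12/2022", format = "%d/%m/%Y")
```

All of these should give the same R date, 2022-12-25.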
Example of use:
# look at date formats - note order of day month & year – any inconsistencies?
head(anaemia[, c("date_operat", "datetime_end_fu") ] )
# change to format recognised by R as dates - order day, month then year so function dmy()
anaemia$date_operat2 <- dmy(anaemia$date_operat)
## Warning: 1 failed to parse.
# Count the NAs - "failed to parse" implies introduction of new NAs
table(is.na(anaemia$date_operat2))
##
## FALSE TRUE
## 1030 10
# Inspect those that end up as NA
# Order specified is day, month, year, so a number larger than 12 in the month position is invalid and cannot be converted
anaemia[is.na(anaemia$date_operat2), c("date_operat", "date_operat2") ]
# Visually compare dates by viewing the two variables of interest
head(anaemia[ ,c("date_operat", "date_operat2")])
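To separate values that failed to parse from values that were already missing, one option is to flag rows where the original string is present but the converted date is NA. This is a minimal sketch on a made-up data frame (df, date_chr and date_new are hypothetical names, not part of the anaemia dataset).

```r
library(lubridate)

# Hypothetical data: the third value has month and day swapped, the fourth is missing
df <- data.frame(date_chr = c("05/11/2022", "17/03/2021", "11/25/2022", NA),
                 stringsAsFactors = FALSE)
df$date_new <- dmy(df$date_chr, quiet = TRUE)  # quiet = TRUE suppresses the parse warning

# Rows where the original was present but the conversion produced NA
failed <- !is.na(df$date_chr) & is.na(df$date_new)
df[failed, ]
sum(failed)
```

Only the row with 25 in the month position is flagged; the row that was missing to begin with is not counted as a failure.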
Creating an R date variable from individual variables that give year, month and day.
# converts to date format
aaa$date_entry <- make_date(aaa$year_entry, aaa$month_entry, aaa$day_entry)
class (aaa$date_entry)
View(aaa[, c("date_entry", "year_entry", "month_entry", "day_entry")])
It is possible to use today’s date within R commands:
## [1] "2023-06-19"
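A minimal sketch of two ways to obtain today's date: Sys.Date() from base R and today() from lubridate. The fixed date used in the subtraction is arbitrary.

```r
library(lubridate)

Sys.Date()   # base R: today's date, class "Date"
today()      # lubridate equivalent

# Example use: days elapsed since a fixed date
days_since <- as.numeric(today() - as.Date("2023-01-01"))
days_since
```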
Dates with inconsistent orderings of day, month and year are relatively unusual. With dates such as 09/09/2012, it is not obvious whether day or month comes first. Perhaps you can deduce that the date format depends on centre (or similar), with a consistent format within each centre. Then use a different command to convert the dates for each centre.
Note: if_else() with dates retains the date format, whereas ifelse() loses the date format. if_else() requires library(dplyr), part of the tidyverse set of packages.
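The difference can be seen on a small made-up vector of dates. In this sketch, missing dates are replaced by an arbitrary fallback date.

```r
library(dplyr)

d <- as.Date(c("2022-01-01", NA, "2022-03-15"))
fallback <- as.Date("2000-01-01")

base_result  <- ifelse(is.na(d), fallback, d)    # plain numbers (days since 1970-01-01)
dplyr_result <- if_else(is.na(d), fallback, d)   # still class "Date"

class(base_result)
class(dplyr_result)
```

If ifelse() has already been used, wrapping the result in as.Date(..., origin = "1970-01-01") converts the numbers back to dates.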
6.2 Converting to dates/times (R POSIX) format: making R recognise data as date/ times, when initially it is recognised as character format.
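R POSIX format date/times (class POSIXct) are stored as the number of seconds since the beginning of 1 January 1970, UTC. They carry a time zone. UTC, the default in lubridate's parsing functions, is also known as GMT. London is on GMT in winter; in summer the UK runs on GMT+1, known as BST (British Summer Time). The time-zone name "Europe/London" covers both, switching between GMT and BST on the appropriate dates. Note that "CET" (Central European Time) is not British summer time: it is one hour ahead of GMT in winter, and itself moves to summer time (CEST, two hours ahead of GMT) in summer. You MIGHT be able to ignore summer time, and stick to UTC throughout the year, if you don't need to calculate time-intervals that take account of hour changes between summer and winter time.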
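A quick way to check how a time-zone name behaves is to print its offset from UTC in winter and in summer. The sketch below does this for "Europe/London" and "CET", and also shows the underlying stored number; the dates are arbitrary, and it assumes both zone names are available on your system (see OlsonNames()).

```r
library(lubridate)

# Stored value: seconds since 1970-01-01 00:00:00 UTC
one_day <- as.numeric(as.POSIXct("1970-01-02 00:00:00", tz = "UTC"))
one_day  # 24 * 60 * 60

# Offsets from UTC at midday on 1 January and 1 July
winter <- "2023-01-01 12:00:00"
summer <- "2023-07-01 12:00:00"
offsets <- c(
  london_winter = format(ymd_hms(winter, tz = "Europe/London"), "%z"),
  london_summer = format(ymd_hms(summer, tz = "Europe/London"), "%z"),
  cet_winter    = format(ymd_hms(winter, tz = "CET"), "%z"),
  cet_summer    = format(ymd_hms(summer, tz = "CET"), "%z")
)
offsets
```

London moves from +0000 to +0100 in summer, whereas CET moves from +0100 to +0200.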
We can convert from character to R POSIX date/ time format, either using Base R’s as.POSIXct() or lubridate (from tidyverse)’s ymd_hms() function, which is more robust to different separators between the elements and to different ways to specify months.
library(lubridate)
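# Example: converting a single date/time into the new format
# Note: the minutes value 63 is invalid, so this example fails to parse and returns NA
date1 <- ymd_hms("2022-11-25 11:63:00") # convert character/ string to POSIX date/ time format
## Warning: All formats failed to parse. No formats found.
date1
## [1] NA
class(date1)
## [1] "POSIXct" "POSIXt"
# look at the format of the date/time variable in the dataset
head(anaemia$datetime_end_fu)
## [1] "27/10/2022 06:44" "12/04/2023 14:53" "17/09/2022 13:35"
## [4] "22/01/2022 14:46" "09/05/2022 20:40" "05/02/2023 05:49"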
# These strings are ordered day/month/year hour:minute (no seconds), so use dmy_hm()
# ymd_hms() does not fail here but silently mis-reads them: "27/10/2022 06:44" becomes 2027-10-20 22:06:44
anaemia$datetime_end_fu2 <- dmy_hm(anaemia$datetime_end_fu)
class(anaemia$datetime_end_fu2) # see data-format as recognised by R
## [1] "POSIXct" "POSIXt"
head(anaemia$datetime_end_fu2) # compare with the original strings: the first should now be 27 October 2022, 06:44 UTC
summary(anaemia$datetime_end_fu2) # check that the range of values is plausible, and count the NAs
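lubridate names its date/time parsers by component order too (for example dmy_hm(), dmy_hms(), mdy_hm()), so strings such as "27/10/2022 06:44" need a parser that matches day/month/year hour:minute. A parser with the wrong order may not fail but can return implausible dates, so checking the range of the converted values is worthwhile. A minimal sketch on a single value, with a base R equivalent for comparison:

```r
library(lubridate)

x <- "27/10/2022 06:44"   # day/month/year hour:minute

right <- dmy_hm(x)        # parser whose name matches the component order
right

# Base R equivalent with an explicit format string
base <- as.POSIXct(x, format = "%d/%m/%Y %H:%M", tz = "UTC")
same <- as.numeric(right) == as.numeric(base)
same
```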
The following explores time-zones and directly compares when each of the following work: Base R’s as.POSIXct() and ymd_hms() from lubridate / tidyverse.
Sys.timezone() # time zone of the computer running R
## [1] "Europe/London"
# as.POSIXct converts from string format YYYY-MM-DD hh:mm:ss to date/ time
# For date-times we need to specify the time zone (e.g. tz="UTC").
as.POSIXct("2016-01-01 00:00:00", tz="UTC") # convert from character to date/time format, using UTC = GMT time zone
## [1] "2016-01-01 UTC"
# as.POSIXct is far less flexible than the lubridate equivalent, as shown here
ymd_hms("20160101 00:00:00", tz="Europe/London") # convert from character to date/time format
## [1] "2016-01-01 GMT"
ymd_hms("2016/january/01 00:00:00", tz="Europe/London") # convert from character to date/time format
## [1] "2016-01-01 GMT"
## [1] "2016-01-01 GMT"
# the above print as GMT, another name for UTC, because London is on GMT in January
ymd_hms("2016--01--01 00:00:00", tz="CET") # convert from character to date/time format, using CET (Central European Time), one hour ahead of GMT in winter
## [1] "2016-01-01 CET"
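Once a date/time has a time zone, lubridate offers two different ways to change it: with_tz() keeps the same moment and changes the clock time shown, whereas force_tz() keeps the clock time and changes the moment. A short sketch with an arbitrary summer date:

```r
library(lubridate)

x <- ymd_hms("2016-07-01 12:00:00", tz = "UTC")

same_instant <- with_tz(x, "Europe/London")   # same moment, shown as London clock time (13:00)
same_clock   <- force_tz(x, "Europe/London")  # same clock time (12:00), a different moment

gap_with  <- as.numeric(same_instant) - as.numeric(x)  # seconds between the two
gap_force <- as.numeric(x) - as.numeric(same_clock)
c(gap_with, gap_force)
```

force_tz() is the one to use when data were read in as UTC but the clock times were actually recorded in local time.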
We can also convert from individual year, month, day, time elements into POSIX time/date format. Each element can be a variable, to compile a POSIX variable from variables for each individual element. Alternatively, like here, each element can be a number, to create one fixed value for POSIX date/ time:
# Converts to date/time when specifying different elements
# make_datetime(year, month, day, hour, min, sec)
make_datetime(2023, 7, 13, 12, 45, 54)
## [1] "2023-07-13 12:45:54 UTC"
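The same function works on whole variables. The sketch below uses a made-up data frame (visits and its column names are hypothetical) and sets the time zone explicitly with the tz argument.

```r
library(lubridate)

# Hypothetical data frame with one column per component
visits <- data.frame(yr = c(2022, 2023), mon = c(11, 7), dy = c(25, 13),
                     hr = c(9, 12), mi = c(30, 45))

# Seconds are left at their default of 0
visits$visit_dt <- make_datetime(visits$yr, visits$mon, visits$dy,
                                 visits$hr, visits$mi, tz = "Europe/London")
visits$visit_dt
class(visits$visit_dt)
```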
We can use the current date/time in R commands:
## [1] "2023-06-19 13:35:31 BST"
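A minimal sketch of obtaining the current date/time, using now() from lubridate or Sys.time() from base R. The fixed date/time in the last line is arbitrary.

```r
library(lubridate)

now()                 # current date/time in the computer's time zone
now(tzone = "UTC")    # the same moment expressed in UTC
Sys.time()            # base R equivalent

# Example use: hours elapsed since a fixed date/time
hours_since <- as.numeric(difftime(now(), ymd_hms("2023-01-01 00:00:00", tz = "UTC"),
                                   units = "hours"))
hours_since
```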
When we only have times, we can use the lubridate function hms(), to get times in a format that can be used in calculations (period format from lubridate). Differences between these times can then be calculated to give time intervals. Rather than individual values, as shown here, we can calculate the difference between 2 time/ period variables.
## [1] "17H 23M 54S"
## [1] "Period"
## attr(,"package")
## [1] "lubridate"
## [1] 4
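A sketch of this kind of calculation is below. The second time is a made-up value chosen for illustration. Periods are subtracted component by component, so the hours component alone can be read with hour(), and the whole gap can be converted to a single number with period_to_seconds().

```r
library(lubridate)

t_end   <- hms("17:23:54")
t_start <- hms("13:10:00")   # hypothetical start time

gap <- t_end - t_start       # Period, subtracted component by component
gap
class(gap)

gap_hours_part <- hour(gap)                     # hours component only
gap_decimal    <- period_to_seconds(gap) / 3600 # whole gap in decimal hours
gap_decimal
```

Converting to seconds first is the safer route when a component of the later time is smaller than that of the earlier time, since the component-wise difference then contains negative parts.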
6.3 POSIX to date format: converting to R date format can ensure consistency between data-type within a dataset, allowing us to calculate time intervals.
If some variables are in date format, then it is good to convert all to date format for consistency. This allows us to create time-intervals between them in days. However, this does mean we lose information on time of day. Converting to date might also possibly make data management a bit simpler, since we do not then need to worry about time zones.
# Converts from string format YYYY-MM-DD hh:mm:ss to date/ time, then to date
# For date-times we need to specify the time zone (e.g. tz="UTC").
date1 <- ymd_hms("2016-01-01 00:00:00", tz="Europe/London") # convert from character to date/time format
date2 <- as.Date (date1)
class(date2)
## [1] "Date"
class(date1)
## [1] "POSIXct" "POSIXt"
anaemia$date_end_fu <- as.Date (anaemia$datetime_end_fu2) # convert to R date format
class (anaemia$date_end_fu)
## [1] "Date"
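One thing to watch when dropping the time of day: the calendar date depends on which time zone it is read in. Base R's as.Date() has a tz argument for date/times (documented with "UTC" as its default), whereas lubridate's as_date() uses the time zone stored in the object. A hedged sketch with a value just after midnight in British summer time:

```r
library(lubridate)

x <- ymd_hms("2016-07-01 00:30:00", tz = "Europe/London")  # 23:30 on 30 June in UTC

d_utc    <- as.Date(x, tz = "UTC")            # calendar date in UTC
d_london <- as.Date(x, tz = "Europe/London")  # calendar date on the London clock
d_lub    <- as_date(x)                        # lubridate: uses the object's own time zone

c(d_utc, d_london, d_lub)
```

Stating tz explicitly avoids dates shifting by one day for times close to midnight.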
6.4 Time interval from dates/ POSIX in days and in years: once we have variables in date or POSIX format, calculating time-intervals between 2 variables/ objects in the same format is straight-forward.
Subtracting one date from another gives a "difftime" object. We wrap the subtraction in as.numeric() or as.integer(), so that the resulting variable can be treated just like any other number.
anaemia$time_int_days <- as.numeric(anaemia$date_end_fu - anaemia$date_operat2) # dates are naturally stored as number of days from specific date
anaemia$time_int_years <- as.numeric(anaemia$date_end_fu - anaemia$date_operat2) / 365.25 # dividing by average days in year, accounting for leap years
# can display these time intervals in years
summary(anaemia$time_int_years)
hist(anaemia$time_int_years)
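Dividing by 365.25 is an approximation. If calendar-exact years are wanted, lubridate's interval() and time_length() are one alternative. The sketch below compares the two approaches on a pair of made-up dates exactly two calendar years apart.

```r
library(lubridate)

start_date <- ymd("2020-01-01")
end_date   <- ymd("2022-01-01")

n_days       <- as.numeric(end_date - start_date)   # 366 + 365 days
approx_years <- n_days / 365.25                     # slightly more than 2
exact_years  <- time_length(interval(start_date, end_date), "years")

c(n_days, approx_years, exact_years)
```

For most follow-up time summaries the difference is negligible; it matters mainly when exact ages or anniversaries are needed.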
For comparing with a specific date, these may be useful:
## [1] "2013-02-27"
## [1] "2023-06-19"
# to find follow-up time, with fixed date as end of follow-up
anaemia$followup_time_from5nov2022.days <- as.integer(make_date(year=2022, month=11, day=5) - anaemia$date_operat2)
summary(anaemia$followup_time_from5nov2022.days)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -152.0 119.5 252.5 264.3 381.8 839.0 10
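The negative minimum in this summary indicates operation dates later than the fixed end date. A small sketch of how such rows might be listed, using a made-up data frame (ops is a hypothetical name; the dates are invented):

```r
library(lubridate)

# Hypothetical operation dates; the last falls after the cut-off
ops <- data.frame(date_operat2 = ymd(c("2022-01-10", "2021-06-30", "2023-02-01")))
cutoff <- make_date(year = 2022, month = 11, day = 5)

ops$followup_days <- as.integer(cutoff - ops$date_operat2)
after_cutoff <- ops[ops$followup_days < 0, ]   # operations dated after the cut-off
after_cutoff
```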
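When we have POSIX R date/time format, similar methods can be used. The following illustrates calculations from individual values. Variable names can readily be used instead, to give intervals between the variables (for each observation in the dataset). Subtracting date/times gives a result in units that R chooses automatically (seconds, minutes, hours or days, depending on the size of the differences). Here the interval is about 60 days, so the result is in days. as.integer() drops the fractional part, whereas as.numeric() keeps it.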
as.integer( ymd_hms("2017-01-01 00:00:00", tz="Europe/London") - ymd_hms("2016-11-01 12:59:00", tz="Europe/London") )
as.numeric( ymd_hms("2017-01-01 00:00:00", tz="Europe/London") - ymd_hms("2016-11-01 12:59:00", tz="Europe/London") )
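Because the units of a date/time difference depend on its size, it can be safer to state the units explicitly with difftime(). The sketch below repeats the calculation above with explicit units, then shows a made-up two-hour gap where plain subtraction reports hours rather than days.

```r
library(lubridate)

t_end   <- ymd_hms("2017-01-01 00:00:00", tz = "Europe/London")
t_start <- ymd_hms("2016-11-01 12:59:00", tz = "Europe/London")

days_exact <- as.numeric(difftime(t_end, t_start, units = "days"))  # 60 days plus 11 h 1 min
days_whole <- as.integer(difftime(t_end, t_start, units = "days"))  # fraction dropped

# A short gap: plain subtraction chooses hours on its own
short <- ymd_hms("2017-01-01 02:00:00", tz = "UTC") -
         ymd_hms("2017-01-01 00:00:00", tz = "UTC")
units(short)
as.numeric(short)                   # a number of hours, not days
as.numeric(short, units = "days")   # explicit units avoid the ambiguity
```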
Time intervals, when we only have times and not dates:
## [1] "17H 23M 54S"
## [1] "Period"
## attr(,"package")
## [1] "lubridate"
## [1] 4
The main dataset is called anaemia, available here: https://github.com/hcwatt/data_wrangling_open.
Data Wrangling Recipes in R: Hilary Watt. PCPH, Imperial College London.