10.4 Exercises

ds4psy: Exercises (10: Time data)

Here are some exercises on parsing and manipulating dates and times:

10.4.1 Exercise 1

  1. Use the appropriate lubridate function to parse each of the following dates:
d1 <- "January 20, 2020"
d2 <- "2020-Apr-01"
d3 <- "11-Nov-2020"
d4 <- c("July 13 (1969)", "August 23 (1972)", "July 1 (1975)")

                 # Date: 
d5 <- "08/12/10" # Oct 12, 2008
d6 <- d5         # Aug 12, 2010
d7 <- d5         # Oct 08, 2012

Solution

mdy(d1)
#> [1] "2020-01-20"
ymd(d2)
#> [1] "2020-04-01"
dmy(d3)
#> [1] "2020-11-11"
mdy(d4)
#> [1] "1969-07-13" "1972-08-23" "1975-07-01"

ydm(d5)
#> [1] "2008-10-12"
mdy(d6)
#> [1] "2010-08-12"
dym(d7)
#> [1] "2012-10-08"
  1. Use the appropriate lubridate function to parse each of the following date-times:
t1 <- "2020-11-11 11:11:01"
t2 <- "2020/12/24 07:30"
t3 <- "31:12:20 12:45:59"

t4 <- c("8:05 01/01/2020", "9:20 29/02/2020", "12:30 24/12/2020", "23:58 30/12/2020")

Hint: Note that t4 contains the time component before the date component. To handle this vector, consider creating a tibble and then using dplyr commands for separating its time and date components, and pasting them in reversed order (date before time).

Solution

ymd_hms(t1)
#> [1] "2020-11-11 11:11:01 UTC"
ymd_hm(t2)
#> [1] "2020-12-24 07:30:00 UTC"
dmy_hms(t3)
#> [1] "2020-12-31 12:45:59 UTC"

# t4: 
tb <- tibble(t4 = t4)  # as tibble

tb <- tb %>% 
  separate(t4, into = c("t", "d"), sep = " ", remove = FALSE) %>%
  mutate(ds = paste(d, t),           # as text string
         dt = lubridate::dmy_hm(ds)  # parse text 
         )

# Print tibble:
knitr::kable(tb, caption = "A tibble with `t4` separated and mutated into `dt`.")
Table 10.1: A tibble with t4 separated and mutated into dt.
t4 t d ds dt
8:05 01/01/2020 8:05 01/01/2020 01/01/2020 8:05 2020-01-01 08:05:00
9:20 29/02/2020 9:20 29/02/2020 29/02/2020 9:20 2020-02-29 09:20:00
12:30 24/12/2020 12:30 24/12/2020 24/12/2020 12:30 2020-12-24 12:30:00
23:58 30/12/2020 23:58 30/12/2020 30/12/2020 23:58 2020-12-30 23:58:00

# Print vector: 
tb$dt
#> [1] "2020-01-01 08:05:00 UTC" "2020-02-29 09:20:00 UTC"
#> [3] "2020-12-24 12:30:00 UTC" "2020-12-30 23:58:00 UTC"
  1. Determine the weekdays of the 7 dates in d4 and t4.

Hint: Combine the 7 dates in a vector, before applying either the base R function weekdays() or the lubridate function wday() to it.

Solution

# (a) with base R:
dates_1 <- c(as.Date(d4, format = "%B %d (%Y)"), base::as.Date(tb$d))
dates_1
#> [1] "1969-07-13" "1972-08-23" "1975-07-01" "0001-01-20" "0029-02-20"
#> [6] "0024-12-20" "0030-12-20"
base::weekdays(as.Date(dates_1, format = "%d/%m/%y"))
#> [1] "Sunday"    "Wednesday" "Tuesday"   "Saturday"  "Tuesday"   "Friday"   
#> [7] "Friday"

# (b) with lubridate: 
dates_2 <- c(lubridate::mdy(d4), lubridate::as_date(tb$dt))
dates_2
#> [1] "1969-07-13" "1972-08-23" "1975-07-01" "2020-01-01" "2020-02-29"
#> [6] "2020-12-24" "2020-12-30"
lubridate::wday(dates_2, label = TRUE, week_start = 1, abbr = FALSE)
#> [1] Sunday    Wednesday Tuesday   Wednesday Saturday  Thursday  Wednesday
#> 7 Levels: Monday < Tuesday < Wednesday < Thursday < Friday < ... < Sunday

10.4.2 Exercise 2

The data file dt_10.csv (available at rpository.com) contains the birth dates and times of 10 ficticious people. Read the data into a tibble dt_10:

# dt_10 <- readr::read_csv("./data/dt_10.csv")  # from local file
dt_10 <- readr::read_csv("http://rpository.com/ds4psy/data/dt_10.csv")  # online file
knitr::kable(dt_10, caption = "Data in file `dt_10.csv`.")
Table 10.2: Data in file dt_10.csv.
name day month year hour min sec
Anna 8 8 1994 11 47 57
Beowulf 1 6 1994 5 35 43
Cassandra 14 11 2000 5 58 6
David 17 1 1991 13 3 12
Eva 21 1 2001 21 33 55
Frederic 19 7 2000 13 47 12
Gwendoline 20 9 1996 8 28 37
Hamlet 5 5 1996 17 7 8
Ian 18 8 1996 8 27 17
Joy 18 12 1990 14 44 35
  1. Use the appropriate lubridate functions to parse the data of birth dob and time of birth tob as new columns of dt_10.

Solution

dt_10 <- dt_10 %>%
  mutate(dob = make_date(year, month, day),
         tob = make_datetime(year, month, day, hour, min, sec)
         ) %>%
  select(name, dob, tob)
dt_10
#> # A tibble: 10 x 3
#>    name       dob        tob                
#>    <chr>      <date>     <dttm>             
#>  1 Anna       1994-08-08 1994-08-08 11:47:57
#>  2 Beowulf    1994-06-01 1994-06-01 05:35:43
#>  3 Cassandra  2000-11-14 2000-11-14 05:58:06
#>  4 David      1991-01-17 1991-01-17 13:03:12
#>  5 Eva        2001-01-21 2001-01-21 21:33:55
#>  6 Frederic   2000-07-19 2000-07-19 13:47:12
#>  7 Gwendoline 1996-09-20 1996-09-20 08:28:37
#>  8 Hamlet     1996-05-05 1996-05-05 17:07:08
#>  9 Ian        1996-08-18 1996-08-18 08:27:17
#> 10 Joy        1990-12-18 1990-12-18 14:44:35
  1. Use the appropriate lubridate functions to add 2 columns that specify – given each person’s DOB – the weekday dob_wd (from Monday to Sunday) of their birthday and their current age age_fy in full years (i.e., the numeric value of their age, as an integer).

Hint: Their current age can be computed by subtracting their DOB from today’s date today(). One way of computing their age in full years is by dividing the interval() of their current age by a duration() in the unit of “years”.

Solution

today <- lubridate::today()
today
#> [1] "2020-02-15"

# Compute age:
today - dt_10$dob  # age (in days)
#> Time differences in days
#>  [1]  9322  9390  7032 10621  6964  7150  8548  8686  8581 10651
lubridate::as.duration(today - dt_10$dob)  # as duration
#>  [1] "805420800s (~25.52 years)" "811296000s (~25.71 years)"
#>  [3] "607564800s (~19.25 years)" "917654400s (~29.08 years)"
#>  [5] "601689600s (~19.07 years)" "617760000s (~19.58 years)"
#>  [7] "738547200s (~23.4 years)"  "750470400s (~23.78 years)"
#>  [9] "741398400s (~23.49 years)" "920246400s (~29.16 years)"
lubridate::interval(dt_10$dob, today) / lubridate::duration(num = 1, units = "years")  # interval in years
#>  [1] 25.53973 25.72603 19.26575 29.09863 19.07945 19.58904 23.41918 23.79726
#>  [9] 23.50959 29.18082

dt_10 <- dt_10 %>%
  select(-tob) %>% 
  mutate(dob_wd = lubridate::wday(dob, label = TRUE, week_start = 1, abbr = FALSE),
         age_yr = lubridate::interval(dob, today) / lubridate::duration(num = 1, units = "years"),
         age_fy = floor(age_yr))
dt_10
#> # A tibble: 10 x 5
#>    name       dob        dob_wd    age_yr age_fy
#>    <chr>      <date>     <ord>      <dbl>  <dbl>
#>  1 Anna       1994-08-08 Monday      25.5     25
#>  2 Beowulf    1994-06-01 Wednesday   25.7     25
#>  3 Cassandra  2000-11-14 Tuesday     19.3     19
#>  4 David      1991-01-17 Thursday    29.1     29
#>  5 Eva        2001-01-21 Sunday      19.1     19
#>  6 Frederic   2000-07-19 Wednesday   19.6     19
#>  7 Gwendoline 1996-09-20 Friday      23.4     23
#>  8 Hamlet     1996-05-05 Sunday      23.8     23
#>  9 Ian        1996-08-18 Sunday      23.5     23
#> 10 Joy        1990-12-18 Tuesday     29.2     29

10.4.3 Exercise 3

  1. Pick at least 4 famous people — some of which are still alive, some of which have already died — and enter their name, area of occupation, date of birth (DOB), and date of death (DOD, if deceased) in a tibble fame, in analogy to the following:
fame <- tibble(name = c("Napoleon Bonaparte", "Jimi Hendrix", "Michael Jackson", "Frida Kahlo", 
                        "Angela Merkel", "Kobe Bryant", "Lionel Messi", "Zinedine Zidane"),
               area = c("politics", "music", "music", "arts", 
                        "politics", "sports", "sports", "sports"), 
               DOB = c("August 15, 1769", "November 27, 1942", "August 29, 1958", "July 6, 1907", 
                       "July 17, 1954", "August 23, 1978", "June 24, 1987", "June 23, 1972"),
               DOD = c("May 5, 1821", "September 18, 1970", "June 25, 2009", "July 13, 1954", 
                       NA, "January 26, 2020", NA, NA))

knitr::kable(fame, caption = "Basic info on some famous people.")
Table 10.3: Basic info on some famous people.
name area DOB DOD
Napoleon Bonaparte politics August 15, 1769 May 5, 1821
Jimi Hendrix music November 27, 1942 September 18, 1970
Michael Jackson music August 29, 1958 June 25, 2009
Frida Kahlo arts July 6, 1907 July 13, 1954
Angela Merkel politics July 17, 1954 NA
Kobe Bryant sports August 23, 1978 January 26, 2020
Lionel Messi sports June 24, 1987 NA
Zinedine Zidane sports June 23, 1972 NA
  1. Use the appropriate lubridate functions to replace the DOB and DOD variables by corresponding dob and dod variables of type date.

Solution

fame <- fame %>% 
  mutate(dob = lubridate::mdy(DOB),
         dod = lubridate::mdy(DOD)) %>%
  select(name, area, dob, dod)
# knitr::kable(fame, caption = "Info on some famous people.")
  1. Add two variables to fame that specify the weekday (from Monday to Sunday) of their birth (dob_wd) and – if applicable – of their death (dob_wd).

Solution

fame %>%
  mutate(dob_wd = lubridate::wday(dob, label = TRUE, week_start = 1, abbr = FALSE),
         dod_wd = lubridate::wday(dod, label = TRUE, week_start = 1, abbr = FALSE)
  )
#> # A tibble: 8 x 6
#>   name               area     dob        dod        dob_wd    dod_wd  
#>   <chr>              <chr>    <date>     <date>     <ord>     <ord>   
#> 1 Napoleon Bonaparte politics 1769-08-15 1821-05-05 Tuesday   Saturday
#> 2 Jimi Hendrix       music    1942-11-27 1970-09-18 Friday    Friday  
#> 3 Michael Jackson    music    1958-08-29 2009-06-25 Friday    Thursday
#> 4 Frida Kahlo        arts     1907-07-06 1954-07-13 Saturday  Tuesday 
#> 5 Angela Merkel      politics 1954-07-17 NA         Saturday  <NA>    
#> 6 Kobe Bryant        sports   1978-08-23 2020-01-26 Wednesday Sunday  
#> 7 Lionel Messi       sports   1987-06-24 NA         Wednesday <NA>    
#> 8 Zinedine Zidane    sports   1972-06-23 NA         Friday    <NA>
  1. Add a variable age_days that computes their age in days (relative to today <- Sys.Date()). Then compute two more variables age_yr1 and age_yr2 that determines their age in years (as a decimal number) in different ways. Finally, add a variable age_fy that specifies their current age (in full years) as an integer (i.e., what they would say if they truthfully responded to the question “How old are you today?”).

Solution

# Determine today's date:
today <- Sys.Date()
# today <- lubridate::today()

fame %>%
  mutate(age_days = (today - dob),
         # 1. computation from time difference:  
         age_yr1 = as.numeric(age_days)/365,  
         # 2. computation as interval (in duration of years):
         age_yr2 = lubridate::interval(dob, today) / lubridate::duration(num = 1, units = "years"),
         # 3. computation as interval (in period of years): 
         age_yr3 = lubridate::interval(dob, today) / lubridate::period(num = 1, units = "years"),
         # Round down:
         age_fy1 = floor(age_yr1),          
         age_fy2 = floor(age_yr2),
         age_fy3 = floor(age_yr3)) %>%
  select(-area, -dod)
#> # A tibble: 8 x 9
#>   name       dob        age_days age_yr1 age_yr2 age_yr3 age_fy1 age_fy2 age_fy3
#>   <chr>      <date>     <drtn>     <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
#> 1 Napoleon … 1769-08-15 91494 d…   251.    251.    251.      250     250     250
#> 2 Jimi Hend… 1942-11-27 28204 d…    77.3    77.3    77.2      77      77      77
#> 3 Michael J… 1958-08-29 22450 d…    61.5    61.5    61.5      61      61      61
#> 4 Frida Kah… 1907-07-06 41132 d…   113.    113.    113.      112     112     112
#> 5 Angela Me… 1954-07-17 23954 d…    65.6    65.6    65.6      65      65      65
#> 6 Kobe Brya… 1978-08-23 15151 d…    41.5    41.5    41.5      41      41      41
#> 7 Lionel Me… 1987-06-24 11924 d…    32.7    32.7    32.6      32      32      32
#> 8 Zinedine … 1972-06-23 17403 d…    47.7    47.7    47.6      47      47      47
  1. Correct your previous age_fyr variable so that — for those people who have already died — it should remain at the age at which they died (i.e., dead people do not age further).

Solution

fame %>%
  filter(!is.na(dod)) %>%
  mutate(age_days = (dod - dob),
         age_yr1 = as.numeric(age_days)/365, 
         age_yr2 = lubridate::interval(dob, dod) / lubridate::duration(num = 1, units = "years"),
         age_fyr = floor(age_yr2)
         ) %>%
  select(-area)
#> # A tibble: 5 x 7
#>   name               dob        dod        age_days   age_yr1 age_yr2 age_fyr
#>   <chr>              <date>     <date>     <drtn>       <dbl>   <dbl>   <dbl>
#> 1 Napoleon Bonaparte 1769-08-15 1821-05-05 18890 days    51.8    51.8      51
#> 2 Jimi Hendrix       1942-11-27 1970-09-18 10157 days    27.8    27.8      27
#> 3 Michael Jackson    1958-08-29 2009-06-25 18563 days    50.9    50.9      50
#> 4 Frida Kahlo        1907-07-06 1954-07-13 17174 days    47.1    47.1      47
#> 5 Kobe Bryant        1978-08-23 2020-01-26 15131 days    41.5    41.5      41

10.4.4 Exercise 4

Examine the data file dt.csv (available at rpository.com). This file contains the birth dates and study participation times of 1000 ficticious people. Read the data into a tibble dt:

# dt <- readr::read_csv("./data/dt.csv")  # from local file 
dt <- readr::read_csv("http://rpository.com/ds4psy/data/dt.csv")  # online file

# View tibble: 
# dt
knitr::kable(head(dt), caption = "Head of file `dt.csv`.")
Table 10.4: Head of file dt.csv.
name gender bday bmonth byear height score t_1 t_2
I.G. male 14 12 1968 169 113 2020-01-16 11:00:58 2020-01-16 11:32:21
O.B. male 10 4 1974 181 114 2020-01-17 14:11:07 2020-01-17 15:05:14
M.M. male 28 9 1987 183 108 2020-01-16 10:06:06 2020-01-16 10:51:47
V.J. female 15 2 1978 161 93 2020-01-10 10:06:04 2020-01-10 10:39:48
O.E. male 18 5 1985 164 114 2020-01-20 09:23:51 2020-01-20 10:11:36
Q.W. male 1 3 1968 172 103 2020-01-13 11:10:09 2020-01-13 11:54:07
  1. The time variables t_1 and t_2 indicate the start and end times of each person’s participation in a study. Compute the duration of each person’s participation (in minutes and seconds) and plot the distribution of the resulting durations (e.g., as a histogram).

Solution

dt <- dt %>%
  mutate(t_diff = (t_2 - t_1), 
         dur = as.duration(t_2 - t_1))
dt
#> # A tibble: 1,000 x 11
#>    name  gender  bday bmonth byear height score t_1                
#>    <chr> <chr>  <dbl>  <dbl> <dbl>  <dbl> <dbl> <dttm>             
#>  1 I.G.  male      14     12  1968    169   113 2020-01-16 11:00:58
#>  2 O.B.  male      10      4  1974    181   114 2020-01-17 14:11:07
#>  3 M.M.  male      28      9  1987    183   108 2020-01-16 10:06:06
#>  4 V.J.  female    15      2  1978    161    93 2020-01-10 10:06:04
#>  5 O.E.  male      18      5  1985    164   114 2020-01-20 09:23:51
#>  6 Q.W.  male       1      3  1968    172   103 2020-01-13 11:10:09
#>  7 H.K.  male      27      4  1994    157   110 2020-01-19 13:54:15
#>  8 T.R.  female     5      6  1961    167   103 2020-01-19 09:38:54
#>  9 F.J.  male       1     10  1983    158   107 2020-01-15 08:24:11
#> 10 J.R.  female    29     12  1941    157   107 2020-01-18 08:54:27
#> # … with 990 more rows, and 3 more variables: t_2 <dttm>, t_diff <drtn>,
#> #   dur <Duration>

dur_mn <- mean(dt$dur)  # mean
dur_md <- median(dt$dur)  # median

# Plot histograms:

## base R:
# hist(as.numeric(dt$dur), breaks = 20, col = unikn::Seeblau)  

# ggplot: 
ggplot(dt, aes(x = as.numeric(dur))) +
  geom_histogram(col = "black", binwidth = 200, fill = unikn::Seeblau) +
  geom_vline(xintercept = dur_mn, col = "gold", linetype = 1, size = 1) +
  geom_vline(xintercept = dur_md, col = unikn::Pinky, linetype = 2, size = 1) +
  labs(title = "Distribution of durations", x = "Duration (in seconds)") + 
  ds4psy::theme_ds4psy()

  1. The study officially only ran for 5 days — from “2020-01-13” to “2020-01-18” — and should only include participants that responded in less than 1 hour (60 minutes). Add a filter variable that considers these criteria (i.e., allows to filter out other dates and durations beyond 60 minutes).

Solution

dt_valid <- dt %>%
  mutate(date = as_date(t_1),
         valid_date = (date >= "2020-01-13") & (date <= "2020-01-18"),
         valid_dur  = (dur <= as.duration(60 * 60)),
         valid      = valid_date & valid_dur) %>%
  filter(valid)

# Filtered data: 
dt_valid
#> # A tibble: 519 x 15
#>    name  gender  bday bmonth byear height score t_1                
#>    <chr> <chr>  <dbl>  <dbl> <dbl>  <dbl> <dbl> <dttm>             
#>  1 I.G.  male      14     12  1968    169   113 2020-01-16 11:00:58
#>  2 O.B.  male      10      4  1974    181   114 2020-01-17 14:11:07
#>  3 M.M.  male      28      9  1987    183   108 2020-01-16 10:06:06
#>  4 Q.W.  male       1      3  1968    172   103 2020-01-13 11:10:09
#>  5 F.J.  male       1     10  1983    158   107 2020-01-15 08:24:11
#>  6 J.R.  female    29     12  1941    157   107 2020-01-18 08:54:27
#>  7 K.E.  male      10     12  1951    161   104 2020-01-17 15:22:08
#>  8 U.W.  female    12      1  1996    161   104 2020-01-13 10:33:52
#>  9 J.Y.  female    20      5  1987    155   106 2020-01-16 15:11:40
#> 10 S.X.  female     5      3  1986    169    95 2020-01-14 15:08:48
#> # … with 509 more rows, and 7 more variables: t_2 <dttm>, t_diff <drtn>,
#> #   dur <Duration>, date <date>, valid_date <lgl>, valid_dur <lgl>, valid <lgl>

# Check: Does the filter work as intended?
min(dt_valid$t_1)
#> [1] "2020-01-13 08:09:30 UTC"
max(dt_valid$t_1)
#> [1] "2020-01-18 17:50:53 UTC"
max(dt_valid$dur)
#> [1] 3600
  1. How many participants remain in the sample of valid data? What is their mean height and score?

Solution

# Descriptives (by hand):
nrow(dt_valid)  # N of valid participants
#> [1] 519
mean(dt_valid$height)
#> [1] 166.2852
mean(dt_valid$score, na.rm = TRUE)  # There are NA values!
#> [1] 101.5207
sum(!is.na(dt_valid$score))         # N of non-NA values?
#> [1] 507

# All in one dplyr pipe:
dt_valid %>% summarise(N = n(),
                       mn_height = mean(height),
                       mn_score = mean(score, na.rm = TRUE),
                       N_score_nonNA = sum(!is.na(score))
                       )
#> # A tibble: 1 x 4
#>       N mn_height mn_score N_score_nonNA
#>   <int>     <dbl>    <dbl>         <int>
#> 1   519      166.     102.           507

This concludes our exercises on parsing and manipulating dates and times.