Chapter 14 Time Series: Dealing with Stickiness over Time

In this chapter we will learn to work with time–series data in R.

14.1 Time Series Objects in R

Working with time-series data in R is simplified if the data are structured as a time-series object. There are a host of functions and packages dedicated to time-series data. A time-series object is an R structure that contains the observations, the start and end date of the series, and information about the frequency or periodicity.

We will use the Bike Sharing Dataset39 from the UCI Machine Learning Repository.

  • The data set has 731 observations on 17 variables
    • instant: record index
    • dteday : date
    • season : season (1:spring, 2:summer, 3:fall, 4:winter)
    • yr : year (0: 2011, 1:2012)
    • mnth : month ( 1 to 12)
    • hr : hour (0 to 23)
    • holiday : weather day is holiday or not (extracted from http://dchr.dc.gov/page/holiday-schedule)
    • weekday : day of the week
    • workingday : if day is neither weekend nor holiday is 1, otherwise is 0.
    • weathersit :
      • 1: Clear, Few clouds, Partly cloudy, Partly cloudy
      • 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
      • 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
      • 4: Heavy Rain + Ice Pallets + Thunderstorm + Mist, Snow + Fog
    • temp : Normalized temperature in Celsius. The values are divided to 41 (max)
    • atemp: Normalized feeling temperature in Celsius. The values are divided to 50 (max)
    • hum: Normalized humidity. The values are divided to 100 (max)
    • windspeed: Normalized wind speed. The values are divided to 67 (max)
    • casual: count of casual users
    • registered: count of registered users
    • cnt: count of total rental bikes including both casual and registered
library(tidyverse)
library(broom)
library(xts)
library(magrittr)
bike_share <- read_csv("Data/day.csv")
head(bike_share)
# A tibble: 6 x 16
  instant dteday     season    yr  mnth holiday weekday workingday weathersit
    <dbl> <date>      <dbl> <dbl> <dbl>   <dbl>   <dbl>      <dbl>      <dbl>
1       1 2011-01-01      1     0     1       0       6          0          2
2       2 2011-01-02      1     0     1       0       0          0          2
3       3 2011-01-03      1     0     1       0       1          1          1
4       4 2011-01-04      1     0     1       0       2          1          1
5       5 2011-01-05      1     0     1       0       3          1          1
6       6 2011-01-06      1     0     1       0       4          1          1
# ... with 7 more variables: temp <dbl>, atemp <dbl>, hum <dbl>,
#   windspeed <dbl>, casual <dbl>, registered <dbl>, cnt <dbl>

We can convert bike_share to a time-series object using the xts package. A time-series object requires an index to identify each observation by its date. We create the index with seq(as.date("YYYY-MM-DD"), by = "period", length.out = n), our start date is 2011-01-01, by “days”, with the number of observations as the length. After we create the date index, we will use dplyr:select to choose the variables (we don’t need dteday, yr, or mnth). As always, we will make use of the pipe operator to complete the code.

dates <- seq(as.Date("2011-01-01"), by = "days", length.out = 731)
bike_ts <- 
bike_share %>% 
  select(instant, season, holiday, weekday, workingday, weathersit, temp, atemp, hum, windspeed, casual, registered, cnt) %>% 
  xts(dates) 
bike_ts %>% 
  head()
           instant season holiday weekday workingday weathersit  temp atemp
2011-01-01       1      1       0       6          0          2 0.344 0.364
2011-01-02       2      1       0       0          0          2 0.363 0.354
2011-01-03       3      1       0       1          1          1 0.196 0.189
2011-01-04       4      1       0       2          1          1 0.200 0.212
2011-01-05       5      1       0       3          1          1 0.227 0.229
2011-01-06       6      1       0       4          1          1 0.204 0.233
             hum windspeed casual registered  cnt
2011-01-01 0.806    0.1604    331        654  985
2011-01-02 0.696    0.2485    131        670  801
2011-01-03 0.437    0.2483    120       1229 1349
2011-01-04 0.590    0.1603    108       1454 1562
2011-01-05 0.437    0.1869     82       1518 1600
2011-01-06 0.518    0.0896     88       1518 1606

We see that the data now have a date index indicating to which date each observations belongs.

14.2 Detecting Autocorrelation

Let’s estimate the total number of riders as a function of time, \(cnt=\beta_0+\beta_1instant+\epsilon\), and test the residuals for first order auto correlation using the auxiliary regression approach.

bike_ts %$%
  lm(cnt ~ instant) 

Call:
lm(formula = cnt ~ instant)

Coefficients:
(Intercept)      instant  
    2392.96         5.77  
# retrieve the residuals as e
e <- 
  bike_ts %$%
  lm(cnt ~ instant)$residuals
# auxiliary regression
lm(e ~ lag(e,1)) %>% 
  tidy()
# A tibble: 2 x 5
  term        estimate std.error statistic   p.value
  <chr>          <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)   -2.06    37.0      -0.0558 9.56e-  1
2 lag(e, 1)      0.752    0.0247   30.5    2.92e-132

The t-statistic on \(\hat\rho_{t-1}\) is 30.50 so we can reject the null hypothesis of no autocorrelation in the error term.

14.3 Correcting Autocorrelation

14.3.1 Newey-West

The sandwich package contains a function to estimate Newey-West standard errors.40 The output of the function call is the corrected variance-covariance matrix. We still need to calculate t-statistics based on the corrected variance-covariances. We will use the lmtest package to perform this test. coeftest(lm_object, vcov = variance-covariance_matrix).

# estimate the model (the lm_object)
bike_lm <- lm(cnt ~ instant, bike_ts)
bike_lm %>% 
  tidy()
# A tibble: 2 x 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)  2393.     112.         21.4 1.91e-79
2 instant         5.77     0.264      21.8 1.02e-81
# determine the number of lags for the Newey-West correction
lags <- length(bike_ts$instant)^(.25)
nw_vcov <- sandwich::NeweyWest(bike_lm, lag = lags, prewhite = F, adjust = T)
# model with corrected errors
lmtest::coeftest(bike_lm, vcov. = nw_vcov)

t test of coefficients:

            Estimate Std. Error t value            Pr(>|t|)    
(Intercept) 2392.961    221.782   10.79 <0.0000000000000002 ***
instant        5.769      0.646    8.93 <0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

14.3.2 Cochrane-Orcutt

The orcutt package allows us to use the Cochrane-Orcutt method to \(\rho\) difference the data to produce corrected standard errors using cochrane.orcutt(lm_oject)

bike_lm %>%
  orcutt::cochrane.orcutt() %>% 
  tidy()
# A tibble: 2 x 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)  2457.     301.         8.17 1.39e-15
2 instant         5.57     0.707      7.88 1.22e-14

14.4 Dynamic Models

Using a time-series object makes running dynamic models as easy as calling the argument lag(variable_name, number_of_lags). Suppose we’d like to estimate the a lagged version of the model we have been using with the form \(cnt_t=\beta_0+\beta_1instant_t+\beta_2temp_{t-1}+\epsilon\). We want to see if yesterday’s weather affects today’s rentals.

bike_lm_dyn <- lm(cnt ~ instant + lag(temp, 1), bike_ts)
bike_lm_dyn %>% 
 tidy
# A tibble: 3 x 5
  term         estimate std.error statistic   p.value
  <chr>           <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)   -189.     128.        -1.48 1.39e-  1
2 instant          5.02     0.193     26.0  4.77e-106
3 lag(temp, 1)  5769.     222.        25.9  1.31e-105

14.5 Dickey-Fuller Test

The tseries package includes an augmented Dickey-Fuller test, adf.test(time_series).

bike_ts$cnt %>% 
  tseries::adf.test()

	Augmented Dickey-Fuller Test

data:  .
Dickey-Fuller = -2, Lag order = 9, p-value = 0.7
alternative hypothesis: stationary

We conclude that cnt is non-stationary.

14.6 First Differencing

A simple solution to non-stationarity is to use first differences of values, i.e., \(\Delta y_t=y_t-y_{t-1}\). diff(x, ...) makes this easy with a time-series object. Let’s test \(\Delta y_t\) for stationarity.

bike_ts$cnt %>% 
  diff() %>% 
  tseries::na.remove() %>% # first differencing introduces NA's into the data
  tseries::adf.test()

	Augmented Dickey-Fuller Test

data:  .
Dickey-Fuller = -14, Lag order = 8, p-value = 0.01
alternative hypothesis: stationary

We can reject the null-hypothesis of non-stationarity. So let’s estimate the model \(\Delta cnt_t=\beta_0+\beta_1\Delta temp_t+\eta_t\)

lm(diff(cnt) ~ diff(temp), bike_ts) %>% 
  tidy()
# A tibble: 2 x 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)     3.24      38.0    0.0852 9.32e- 1
2 diff(temp)   4838.       651.     7.43   3.09e-13
# Compare to same equation in the levels.
lm(cnt ~ temp, bike_ts) %>% 
  tidy()
# A tibble: 2 x 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)    1215.      161.      7.54 1.43e-13
2 temp           6641.      305.     21.8  2.81e-81

  1. Fanaee-T, Hadi, and Gama, Joao, ‘Event labeling combining ensemble detectors and background knowledge’, Progress in Artificial Intelligence (2013): pp. 1-15, Springer Berlin Heidelberg↩︎

  2. Unfortunately the pipe operator does not play nice with NeweyWest.↩︎