Chapter 14 Time Series: Dealing with Stickiness over Time
In this chapter we will learn to work with time-series data in R.
14.1 Time Series Objects in R
Working with time-series data in R is simplified if the data are structured as a time-series object. There are a host of functions and packages dedicated to time-series data. A time-series object is an R structure that contains the observations, the start and end date of the series, and information about the frequency or periodicity.
We will use the Bike Sharing Dataset from the UCI Machine Learning Repository.
- The day-level data set (day.csv) has 731 observations on 16 variables:
- instant: record index
- dteday : date
- season : season (1:spring, 2:summer, 3:fall, 4:winter)
- yr : year (0: 2011, 1:2012)
- mnth : month ( 1 to 12)
- hr : hour (0 to 23; present only in the hourly data, not in day.csv)
- holiday : whether the day is a holiday or not (extracted from http://dchr.dc.gov/page/holiday-schedule)
- weekday : day of the week
- workingday : 1 if the day is neither a weekend nor a holiday, otherwise 0
- weathersit :
- 1: Clear, Few clouds, Partly cloudy
- 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
- 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
- 4: Heavy Rain + Ice Pallets + Thunderstorm + Mist, Snow + Fog
- temp : Normalized temperature in Celsius. The values are divided by 41 (max)
- atemp : Normalized feeling temperature in Celsius. The values are divided by 50 (max)
- hum : Normalized humidity. The values are divided by 100 (max)
- windspeed : Normalized wind speed. The values are divided by 67 (max)
- casual: count of casual users
- registered: count of registered users
- cnt: count of total rental bikes including both casual and registered
library(tidyverse)
library(broom)
library(xts)
library(magrittr)
# read the day-level bike sharing data
bike_share <- read_csv("Data/day.csv")
head(bike_share)
# A tibble: 6 x 16
instant dteday season yr mnth holiday weekday workingday weathersit
<dbl> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 2011-01-01 1 0 1 0 6 0 2
2 2 2011-01-02 1 0 1 0 0 0 2
3 3 2011-01-03 1 0 1 0 1 1 1
4 4 2011-01-04 1 0 1 0 2 1 1
5 5 2011-01-05 1 0 1 0 3 1 1
6 6 2011-01-06 1 0 1 0 4 1 1
# ... with 7 more variables: temp <dbl>, atemp <dbl>, hum <dbl>,
# windspeed <dbl>, casual <dbl>, registered <dbl>, cnt <dbl>
We can convert bike_share to a time-series object using the xts package. A time-series object requires an index to identify each observation by its date. We create the index with seq(as.Date("YYYY-MM-DD"), by = "period", length.out = n); our start date is 2011-01-01, by “days”, with the number of observations as the length. After we create the date index, we use dplyr::select to choose the variables (we don’t need dteday, yr, or mnth). As always, we make use of the pipe operator to complete the code.
# create a daily date index covering all 731 observations
dates <- seq(as.Date("2011-01-01"), by = "days", length.out = 731)

# drop dteday, yr, and mnth, then convert the remaining columns to an xts object indexed by dates
bike_ts <-
bike_share %>%
select(instant, season, holiday, weekday, workingday, weathersit, temp, atemp, hum, windspeed, casual, registered, cnt) %>%
xts(dates)
bike_ts %>%
head()
instant season holiday weekday workingday weathersit temp atemp
2011-01-01 1 1 0 6 0 2 0.344 0.364
2011-01-02 2 1 0 0 0 2 0.363 0.354
2011-01-03 3 1 0 1 1 1 0.196 0.189
2011-01-04 4 1 0 2 1 1 0.200 0.212
2011-01-05 5 1 0 3 1 1 0.227 0.229
2011-01-06 6 1 0 4 1 1 0.204 0.233
hum windspeed casual registered cnt
2011-01-01 0.806 0.1604 331 654 985
2011-01-02 0.696 0.2485 131 670 801
2011-01-03 0.437 0.2483 120 1229 1349
2011-01-04 0.590 0.1603 108 1454 1562
2011-01-05 0.437 0.1869 82 1518 1600
2011-01-06 0.518 0.0896 88 1518 1606
We see that the data now have a date index indicating to which date each observation belongs.
14.2 Detecting Autocorrelation
Let’s estimate the total number of riders as a function of time, \(cnt_t=\beta_0+\beta_1 instant_t+\epsilon_t\), and test the residuals for first-order autocorrelation using the auxiliary regression approach.
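The code that fits this model is not shown here; based on how the fitted object is used later in the chapter (where it is called bike_lm), a minimal sketch is:
# fit the linear trend model on the time-series object (assumed code;
# the object name bike_lm matches its use in the Newey-West section below)
bike_lm <-
bike_ts %$%
lm(cnt ~ instant)
bike_lm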
Call:
lm(formula = cnt ~ instant)
Coefficients:
(Intercept) instant
2392.96 5.77
# retrieve the residuals as e
e <-
bike_ts %$%
lm(cnt ~ instant)$residuals
# auxiliary regression
lm(e ~ lag(e,1)) %>%
tidy()
# A tibble: 2 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -2.06 37.0 -0.0558 9.56e- 1
2 lag(e, 1) 0.752 0.0247 30.5 2.92e-132
The t-statistic on \(\hat\rho\), the coefficient on the lagged residual \(e_{t-1}\), is 30.5, so we can reject the null hypothesis of no autocorrelation in the error term.
14.3 Correcting Autocorrelation
14.3.1 Newey-West
The sandwich package contains a function to estimate Newey-West standard errors. The output of the function call is the corrected variance-covariance matrix; we still need to calculate t-statistics based on the corrected variances and covariances. We will use coeftest(lm_object, vcov. = variance_covariance_matrix) from the lmtest package to do this.
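For comparison, the table below reports the uncorrected OLS standard errors; it was presumably produced by tidying the fitted model, as in this sketch:
# uncorrected OLS standard errors, for comparison (assumed code)
bike_lm %>%
tidy()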
# A tibble: 2 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 2393. 112. 21.4 1.91e-79
2 instant 5.77 0.264 21.8 1.02e-81
# determine the number of lags for the Newey-West correction using the T^(1/4) rule of thumb
lags <- length(bike_ts$instant)^(.25)
nw_vcov <- sandwich::NeweyWest(bike_lm, lag = lags, prewhite = F, adjust = T)
# model with corrected errors
lmtest::coeftest(bike_lm, vcov. = nw_vcov)
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2392.961 221.782 10.79 <0.0000000000000002 ***
instant 5.769 0.646 8.93 <0.0000000000000002 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
14.3.2 Cochrane-Orcutt
The orcutt package allows us to use the Cochrane-Orcutt method to \(\rho\)-difference the data and produce corrected standard errors with cochrane.orcutt(lm_object).
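The code producing the output below is not shown here; a sketch, assuming broom can tidy the returned orcutt object (otherwise summary() reports the same quantities):
# rho-difference the data with Cochrane-Orcutt (assumed code)
orcutt::cochrane.orcutt(bike_lm) %>%
tidy()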
# A tibble: 2 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 2457. 301. 8.17 1.39e-15
2 instant 5.57 0.707 7.88 1.22e-14
14.4 Dynamic Models
Using a time-series object makes running dynamic models as easy as including lag(variable_name, number_of_lags) in the model formula. Suppose we’d like to estimate a lagged version of the model we have been using, with the form \(cnt_t=\beta_0+\beta_1 instant_t+\beta_2 temp_{t-1}+\epsilon_t\). We want to see whether yesterday’s weather affects today’s rentals.
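The estimation code is not shown here; a minimal sketch following the same pattern as the earlier models:
# regress today's count on the time trend and yesterday's temperature (assumed code)
bike_ts %$%
lm(cnt ~ instant + lag(temp, 1)) %>%
tidy()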
# A tibble: 3 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -189. 128. -1.48 1.39e- 1
2 instant 5.02 0.193 26.0 4.77e-106
3 lag(temp, 1) 5769. 222. 25.9 1.31e-105
14.5 Dickey-Fuller Test
The tseries package includes an augmented Dickey-Fuller test, adf.test(time_series).
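The test below was presumably run on the cnt series piped into adf.test(); a sketch:
# augmented Dickey-Fuller test on the level of cnt (assumed code)
bike_ts$cnt %>%
tseries::adf.test()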
Augmented Dickey-Fuller Test
data: .
Dickey-Fuller = -2, Lag order = 9, p-value = 0.7
alternative hypothesis: stationary
With a p-value of 0.7 we cannot reject the null hypothesis of a unit root, so we conclude that cnt is non-stationary.
14.6 First Differencing
A simple solution to non-stationarity is to use first differences of the values, i.e., \(\Delta y_t=y_t-y_{t-1}\). diff(x, ...) makes this easy with a time-series object. Let’s test \(\Delta y_t\) for stationarity.
bike_ts$cnt %>%
diff() %>%
tseries::na.remove() %>% # first differencing introduces NA's into the data
tseries::adf.test()
Augmented Dickey-Fuller Test
data: .
Dickey-Fuller = -14, Lag order = 8, p-value = 0.01
alternative hypothesis: stationary
We can reject the null hypothesis of non-stationarity, so let’s estimate the model \(\Delta cnt_t=\beta_0+\beta_1\Delta temp_t+\eta_t\).
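The code for this regression is not shown here; a sketch consistent with the output below (the NA introduced by differencing is dropped by lm()):
# regress the change in rentals on the change in temperature (assumed code)
bike_ts %$%
lm(diff(cnt) ~ diff(temp)) %>%
tidy()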
# A tibble: 2 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 3.24 38.0 0.0852 9.32e- 1
2 diff(temp) 4838. 651. 7.43 3.09e-13
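The second table appears to estimate the same relationship in levels, \(cnt_t=\beta_0+\beta_1 temp_t+\epsilon_t\), for comparison; a sketch of code that would produce it:
# the levels regression of rentals on temperature, for comparison (assumed code)
bike_ts %$%
lm(cnt ~ temp) %>%
tidy()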
# A tibble: 2 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 1215. 161. 7.54 1.43e-13
2 temp 6641. 305. 21.8 2.81e-81