# Chapter 15 Time Series: Dealing with Stickiness over Time

In this chapter we will learn to work with time–series data in R.

## 15.1 Time Series Objects in R

Working with time-series data in R is simplified if the data are structured as a time-series object. There are a host of functions and packages dedicated to time-series data. A time-series object is an R structure that contains the observations, the start and end date of the series, and information about the frequency or periodicity.

We will use the Bike Sharing Dataset22 from the UCI Machine Learning Repository.

• The data set has 731 observations on 17 variables
• instant: record index
• dteday : date
• season : season (1:spring, 2:summer, 3:fall, 4:winter)
• yr : year (0: 2011, 1:2012)
• mnth : month ( 1 to 12)
• hr : hour (0 to 23)
• holiday : weather day is holiday or not (extracted from http://dchr.dc.gov/page/holiday-schedule)
• weekday : day of the week
• workingday : if day is neither weekend nor holiday is 1, otherwise is 0.
• weathersit :
• 1: Clear, Few clouds, Partly cloudy, Partly cloudy
• 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
• 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
• 4: Heavy Rain + Ice Pallets + Thunderstorm + Mist, Snow + Fog
• temp : Normalized temperature in Celsius. The values are divided to 41 (max)
• atemp: Normalized feeling temperature in Celsius. The values are divided to 50 (max)
• hum: Normalized humidity. The values are divided to 100 (max)
• windspeed: Normalized wind speed. The values are divided to 67 (max)
• casual: count of casual users
• registered: count of registered users
• cnt: count of total rental bikes including both casual and registered
library(tidyverse)
library(broom)
library(xts)
library(magrittr)
head(bike_share)
# A tibble: 6 x 16
instant dteday     season    yr  mnth holiday weekday workingday
<dbl> <date>      <dbl> <dbl> <dbl>   <dbl>   <dbl>      <dbl>
1       1 2011-01-01      1     0     1       0       6          0
2       2 2011-01-02      1     0     1       0       0          0
3       3 2011-01-03      1     0     1       0       1          1
4       4 2011-01-04      1     0     1       0       2          1
5       5 2011-01-05      1     0     1       0       3          1
6       6 2011-01-06      1     0     1       0       4          1
# ... with 8 more variables: weathersit <dbl>, temp <dbl>, atemp <dbl>,
#   hum <dbl>, windspeed <dbl>, casual <dbl>, registered <dbl>, cnt <dbl>

We can convert bike_share to a time-series object using the xts package. A time-series object requires an index to identify each observation by its date. We create the index with seq(as.date("YYYY-MM-DD"), by = "period", length.out = n), our start date is 2011-01-01, by “days”, with the number of observations as the length. After we create the date index, we will use dplyr:select to choose the variables (we don’t need dteday, yr, or mnth). As always, we will make use of the pipe operator to complete the code.

dates <- seq(as.Date("2011-01-01"), by = "days", length.out = 731)
bike_ts <-
bike_share %>%
select(instant, season, holiday, weekday, workingday, weathersit, temp, atemp, hum, windspeed, casual, registered, cnt) %>%
xts(dates)
bike_ts %>%
head()
           instant season holiday weekday workingday weathersit  temp
2011-01-01       1      1       0       6          0          2 0.344
2011-01-02       2      1       0       0          0          2 0.363
2011-01-03       3      1       0       1          1          1 0.196
2011-01-04       4      1       0       2          1          1 0.200
2011-01-05       5      1       0       3          1          1 0.227
2011-01-06       6      1       0       4          1          1 0.204
atemp   hum windspeed casual registered  cnt
2011-01-01 0.364 0.806    0.1604    331        654  985
2011-01-02 0.354 0.696    0.2485    131        670  801
2011-01-03 0.189 0.437    0.2483    120       1229 1349
2011-01-04 0.212 0.590    0.1603    108       1454 1562
2011-01-05 0.229 0.437    0.1869     82       1518 1600
2011-01-06 0.233 0.518    0.0896     88       1518 1606

We see that the data now have a date index indicating to which date each observations belongs.

## 15.2 Detecting Autocorrelation

Let’s estimate the total number of riders as a function of time, $$cnt=\beta_0+\beta_1instant+\epsilon$$, and test the residuals for first order auto correlation using the auxiliary regression approach.

bike_ts %$% lm(cnt ~ instant)   Call: lm(formula = cnt ~ instant) Coefficients: (Intercept) instant 2392.96 5.77  # retrieve the residuals as e e <- bike_ts %$%
lm(cnt ~ instant)$residuals # auxiliary regression lm(e ~ lag(e,1)) %>% tidy() # A tibble: 2 x 5 term estimate std.error statistic p.value <chr> <dbl> <dbl> <dbl> <dbl> 1 (Intercept) -2.06 37.0 -0.0558 9.56e- 1 2 lag(e, 1) 0.752 0.0247 30.5 2.92e-132 The t-statistic on $$\hat\rho_{t-1}$$ is 30.50 so we can reject the null hypothesis of no autocorrelation in the error term. ## 15.3 Correcting Autocorrelation ### 15.3.1 Newey-West The sandwich package contains a function to estimate Newey-West standard errors.23 The output of the function call is the corrected variance-covariance matrix. We still need to calculate t-statistics based on the corrected variance-covariances. We will use the lmtest package to perform this test. coeftest(lm_object, vcov = variance-covariance_matrix). # estimate the model (the lm_object) bike_lm <- lm(cnt ~ instant, bike_ts) bike_lm %>% tidy() # A tibble: 2 x 5 term estimate std.error statistic p.value <chr> <dbl> <dbl> <dbl> <dbl> 1 (Intercept) 2393. 112. 21.4 1.91e-79 2 instant 5.77 0.264 21.8 1.02e-81 # determine the number of lags for the Newey-West correction lags <- length(bike_ts$instant)^(.25)
nw_vcov <- sandwich::NeweyWest(bike_lm, lag = lags, prewhite = F, adjust = T)
# model with corrected errors
lmtest::coeftest(bike_lm, vcov. = nw_vcov)

t test of coefficients:

Estimate Std. Error t value            Pr(>|t|)
(Intercept) 2392.961    221.782   10.79 <0.0000000000000002 ***
instant        5.769      0.646    8.93 <0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

### 15.3.2 Cochrane-Orcutt

The orcutt package allows us to use the Cochrane-Orcutt method to $$\rho$$ difference the data to produce corrected standard errors using cochrane.orcutt(lm_oject)

bike_lm %>%
orcutt::cochrane.orcutt() %>%
tidy()
# A tibble: 2 x 5
term        estimate std.error statistic  p.value
<chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)  2457.     301.         8.17 1.39e-15
2 instant         5.57     0.707      7.88 1.22e-14

## 15.4 Dynamic Models

Using a time-series object makes running dynamic models as easy as calling the argument lag(variable_name, number_of_lags). Suppose we’d like to estimate the a lagged version of the model we have been using with the form $$cnt_t=\beta_0+\beta_1instant_t+\beta_2temp_{t-1}+\epsilon$$. We want to see if yesterday’s weather affects today’s rentals.

bike_lm_dyn <- lm(cnt ~ instant + lag(temp, 1), bike_ts)
bike_lm_dyn %>%
tidy
# A tibble: 3 x 5
term         estimate std.error statistic   p.value
<chr>           <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)   -189.     128.        -1.48 1.39e-  1
2 instant          5.02     0.193     26.0  4.77e-106
3 lag(temp, 1)  5769.     222.        25.9  1.31e-105

## 15.5 Dickey-Fuller Test

The tseries package includes an augmented Dickey-Fuller test, adf.test(time_series).

bike_ts$cnt %>% tseries::adf.test()  Augmented Dickey-Fuller Test data: . Dickey-Fuller = -2, Lag order = 9, p-value = 0.7 alternative hypothesis: stationary We conclude that cnt is non-stationary. ## 15.6 First Differencing A simple solution to non-stationarity is to use first differences of values, i.e., $$\Delta y_t=y_t-y_{t-1}$$. diff(x, ...) makes this easy with a time-series object. Let’s test $$\Delta y_t$$ for stationarity. bike_ts$cnt %>%
diff() %>%
tseries::na.remove() %>% # first differencing introduces NA's into the data
tseries::adf.test()

Augmented Dickey-Fuller Test

data:  .
Dickey-Fuller = -14, Lag order = 8, p-value = 0.01
alternative hypothesis: stationary

We can reject the null-hypothesis of non-stationarity. So let’s estimate the model $$\Delta cnt_t=\beta_0+\beta_1\Delta temp_t+\eta_t$$

lm(diff(cnt) ~ diff(temp), bike_ts) %>%
tidy()
# A tibble: 2 x 5
term        estimate std.error statistic  p.value
<chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)     3.24      38.0    0.0852 9.32e- 1
2 diff(temp)   4838.       651.     7.43   3.09e-13
# Compare to same equation in the levels.
lm(cnt ~ temp, bike_ts) %>%
tidy()
# A tibble: 2 x 5
term        estimate std.error statistic  p.value
<chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)    1215.      161.      7.54 1.43e-13
2 temp           6641.      305.     21.8  2.81e-81

1. Fanaee-T, Hadi, and Gama, Joao, ‘Event labeling combining ensemble detectors and background knowledge’, Progress in Artificial Intelligence (2013): pp. 1-15, Springer Berlin Heidelberg

2. Unfortunately the pipe operator does not play nice with NeweyWest.