Chapter 14 Time Series: Dealing with Stickiness over Time
In this chapter we will learn to work with time-series data in R.
14.1 Time Series Objects in R
Working with time-series data in R is simplified if the data are structured as a time-series object. There are a host of functions and packages dedicated to time-series data. A time-series object is an R structure that contains the observations, the start and end date of the series, and information about the frequency or periodicity.
We will use the Bike Sharing Dataset from the UCI Machine Learning Repository.
- The day-level data set (day.csv) has 731 observations on 16 variables:
- instant: record index
- dteday : date
- season : season (1:spring, 2:summer, 3:fall, 4:winter)
- yr : year (0: 2011, 1:2012)
- mnth : month ( 1 to 12)
- hr : hour (0 to 23; present only in the hourly data, not in day.csv)
- holiday : whether the day is a holiday or not (extracted from http://dchr.dc.gov/page/holiday-schedule)
- weekday : day of the week
- workingday : 1 if the day is neither a weekend nor a holiday, otherwise 0
- weathersit :
- 1: Clear, Few clouds, Partly cloudy
- 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
- 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
- 4: Heavy Rain + Ice Pallets + Thunderstorm + Mist, Snow + Fog
- temp : Normalized temperature in Celsius. The values are divided by 41 (max)
- atemp : Normalized feeling temperature in Celsius. The values are divided by 50 (max)
- hum : Normalized humidity. The values are divided by 100 (max)
- windspeed : Normalized wind speed. The values are divided by 67 (max)
- casual: count of casual users
- registered: count of registered users
- cnt: count of total rental bikes including both casual and registered
library(tidyverse)
library(broom)
library(xts)
library(magrittr)
# read the day-level bike sharing data
bike_share <- read_csv("Data/day.csv")
head(bike_share)
# A tibble: 6 x 16
instant dteday season yr mnth holiday weekday workingday weathersit
<dbl> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 2011-01-01 1 0 1 0 6 0 2
2 2 2011-01-02 1 0 1 0 0 0 2
3 3 2011-01-03 1 0 1 0 1 1 1
4 4 2011-01-04 1 0 1 0 2 1 1
5 5 2011-01-05 1 0 1 0 3 1 1
6 6 2011-01-06 1 0 1 0 4 1 1
# ... with 7 more variables: temp <dbl>, atemp <dbl>, hum <dbl>,
# windspeed <dbl>, casual <dbl>, registered <dbl>, cnt <dbl>
We can convert bike_share to a time-series object using the xts package. A time-series object requires an index to identify each observation by its date. We create the index with seq(as.Date("YYYY-MM-DD"), by = "period", length.out = n); our start date is 2011-01-01, by “days”, with the number of observations as the length. After we create the date index, we use dplyr::select to choose the variables (we don’t need dteday, yr, or mnth). As always, we make use of the pipe operator to complete the code.
# create a daily date index covering all 731 observations
dates <- seq(as.Date("2011-01-01"), by = "days", length.out = 731)

# drop dteday, yr, and mnth, then convert the remaining columns to an xts object indexed by dates
bike_ts <-
bike_share %>%
select(instant, season, holiday, weekday, workingday, weathersit, temp, atemp, hum, windspeed, casual, registered, cnt) %>%
xts(dates)
bike_ts %>%
head()
instant season holiday weekday workingday weathersit temp atemp
2011-01-01 1 1 0 6 0 2 0.344 0.364
2011-01-02 2 1 0 0 0 2 0.363 0.354
2011-01-03 3 1 0 1 1 1 0.196 0.189
2011-01-04 4 1 0 2 1 1 0.200 0.212
2011-01-05 5 1 0 3 1 1 0.227 0.229
2011-01-06 6 1 0 4 1 1 0.204 0.233
hum windspeed casual registered cnt
2011-01-01 0.806 0.1604 331 654 985
2011-01-02 0.696 0.2485 131 670 801
2011-01-03 0.437 0.2483 120 1229 1349
2011-01-04 0.590 0.1603 108 1454 1562
2011-01-05 0.437 0.1869 82 1518 1600
2011-01-06 0.518 0.0896 88 1518 1606
We see that the data now have a date index indicating to which date each observation belongs.
14.2 Detecting Autocorrelation
Let’s estimate the total number of riders as a function of time, \(cnt_t=\beta_0+\beta_1 instant_t+\epsilon_t\), and test the residuals for first-order autocorrelation using the auxiliary regression approach.
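The code that fits this model is not shown here; based on how the fitted object is used later in the chapter (where it is called bike_lm), a minimal sketch is:
# fit the linear trend model on the time-series object (assumed code;
# the object name bike_lm matches its use in the Newey-West section below)
bike_lm <-
bike_ts %$%
lm(cnt ~ instant)
bike_lm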
Call:
lm(formula = cnt ~ instant)
Coefficients:
(Intercept) instant
2392.96 5.77
# retrieve the residuals as e
e <-
bike_ts %$%
lm(cnt ~ instant)$residuals
# auxiliary regression
lm(e ~ lag(e,1)) %>%
tidy()
# A tibble: 2 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -2.06 37.0 -0.0558 9.56e- 1
2 lag(e, 1) 0.752 0.0247 30.5 2.92e-132
The t-statistic on \(\hat\rho\), the coefficient on the lagged residual \(e_{t-1}\), is 30.5, so we can reject the null hypothesis of no autocorrelation in the error term.
14.3 Correcting Autocorrelation
14.3.1 Newey-West
The sandwich package contains a function to estimate Newey-West standard errors. The output of the function call is the corrected variance-covariance matrix; we still need to calculate t-statistics based on the corrected variances and covariances. We will use coeftest(lm_object, vcov. = variance_covariance_matrix) from the lmtest package to do this.
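For comparison, the table below reports the uncorrected OLS standard errors; it was presumably produced by tidying the fitted model, as in this sketch:
# uncorrected OLS standard errors, for comparison (assumed code)
bike_lm %>%
tidy()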
# A tibble: 2 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 2393. 112. 21.4 1.91e-79
2 instant 5.77 0.264 21.8 1.02e-81
# determine the number of lags for the Newey-West correction using the T^(1/4) rule of thumb
lags <- length(bike_ts$instant)^(.25)
nw_vcov <- sandwich::NeweyWest(bike_lm, lag = lags, prewhite = F, adjust = T)
# model with corrected errors
lmtest::coeftest(bike_lm, vcov. = nw_vcov)
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2392.961 221.782 10.79 <0.0000000000000002 ***
instant 5.769 0.646 8.93 <0.0000000000000002 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
14.3.2 Cochrane-Orcutt
The orcutt package allows us to use the Cochrane-Orcutt method to \(\rho\)-difference the data and produce corrected standard errors with cochrane.orcutt(lm_object).
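The code producing the output below is not shown here; a sketch, assuming broom can tidy the returned orcutt object (otherwise summary() reports the same quantities):
# rho-difference the data with Cochrane-Orcutt (assumed code)
orcutt::cochrane.orcutt(bike_lm) %>%
tidy()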
# A tibble: 2 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 2457. 301. 8.17 1.39e-15
2 instant 5.57 0.707 7.88 1.22e-14
14.4 Dynamic Models
Using a time-series object makes running dynamic models as easy as including lag(variable_name, number_of_lags) in the model formula. Suppose we’d like to estimate a lagged version of the model we have been using, with the form \(cnt_t=\beta_0+\beta_1 instant_t+\beta_2 temp_{t-1}+\epsilon_t\). We want to see whether yesterday’s weather affects today’s rentals.
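The estimation code is not shown here; a minimal sketch following the same pattern as the earlier models:
# regress today's count on the time trend and yesterday's temperature (assumed code)
bike_ts %$%
lm(cnt ~ instant + lag(temp, 1)) %>%
tidy()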
# A tibble: 3 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -189. 128. -1.48 1.39e- 1
2 instant 5.02 0.193 26.0 4.77e-106
3 lag(temp, 1) 5769. 222. 25.9 1.31e-105
14.5 Dickey-Fuller Test
The tseries package includes an augmented Dickey-Fuller test, adf.test(time_series).
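The test below was presumably run on the cnt series piped into adf.test(); a sketch:
# augmented Dickey-Fuller test on the level of cnt (assumed code)
bike_ts$cnt %>%
tseries::adf.test()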
Augmented Dickey-Fuller Test
data: .
Dickey-Fuller = -2, Lag order = 9, p-value = 0.7
alternative hypothesis: stationary
With a p-value of 0.7 we cannot reject the null hypothesis of a unit root, so we conclude that cnt is non-stationary.
14.6 First Differencing
A simple solution to non-stationarity is to use first differences of the values, i.e., \(\Delta y_t=y_t-y_{t-1}\). diff(x, ...) makes this easy with a time-series object. Let’s test \(\Delta y_t\) for stationarity.
bike_ts$cnt %>%
diff() %>%
tseries::na.remove() %>% # first differencing introduces NA's into the data
tseries::adf.test()
Augmented Dickey-Fuller Test
data: .
Dickey-Fuller = -14, Lag order = 8, p-value = 0.01
alternative hypothesis: stationary
We can reject the null hypothesis of non-stationarity, so let’s estimate the model \(\Delta cnt_t=\beta_0+\beta_1\Delta temp_t+\eta_t\).
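The code for this regression is not shown here; a sketch consistent with the output below (the NA introduced by differencing is dropped by lm()):
# regress the change in rentals on the change in temperature (assumed code)
bike_ts %$%
lm(diff(cnt) ~ diff(temp)) %>%
tidy()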
# A tibble: 2 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 3.24 38.0 0.0852 9.32e- 1
2 diff(temp) 4838. 651. 7.43 3.09e-13
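The second table appears to estimate the same relationship in levels, \(cnt_t=\beta_0+\beta_1 temp_t+\epsilon_t\), for comparison; a sketch of code that would produce it:
# the levels regression of rentals on temperature, for comparison (assumed code)
bike_ts %$%
lm(cnt ~ temp) %>%
tidy()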
# A tibble: 2 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 1215. 161. 7.54 1.43e-13
2 temp 6641. 305. 21.8 2.81e-81