Chapter 1 Introduction

1.1 Introduction to Timeseries

For an introduction to R see the Appendix @ref(ss_991IntrotoR)

Many of the datasets we will be working with have a (more or less regular) time dimension and are therefore often called timeseries. In R there are a variety of classes available to handle data, such as vector, matrix, data.frame or their more modern implementation, tibble (see the xts vignette). Adding a time dimension creates a timeseries from these objects. The most common and flexible package in R that handles timeseries based on the first three formats is xts, which we will discuss in the following. Afterwards we will introduce the timetk package, which allows xts to interplay with tibbles to create the most powerful framework to handle (even very large) time-based datasets (as we often encounter in finance).

The community is currently working heavily to develop time-aware tibbles to bring together the powerful grouping feature from the dplyr package (for tibbles) with the abilities of xts, which is the most powerful and most widely used timeseries class in finance to date, thanks to its interplay with quantmod and other financial packages. See also this link for more information.

All information regarding tibbles and the financial universe is summarized and updated on the business-science.io website.

In the following, we will define a variety of date and time classes before we introduce xts, tibble and tibbletime. Most of these packages come with excellent vignettes, which I will reference for further reading, while I will only pick up the features necessary for portfolio management, which is the focus of this book.

1.1.1 Date and Time

There are some basic functionalities in base R, but most of the time we will need additional functions to perform all necessary tasks. Available date (time) classes are Date, POSIXct, (chron), yearmon, yearqtr and timeDate (from the Rmetrics bundle).

1.1.1.1 Basic Date and Time Classes

There are several date and time classes in R that can all be used as the time index for xts. We start with the most basic one, Date, which is created with as.Date():

d1 <- "2018-01-18"
str(d1) # str() checks the structure of the R-object

##  chr "2018-01-18"

d2 <- as.Date(d1)
str(d2)

##  Date[1:1], format: "2018-01-18"

In the second case, R automatically detects the format of the date string, but if something more complex is involved you can specify the format (for all available format definitions, see ?strptime):

d3 <- "4/30-2018"
as.Date(d3, "%m/%d-%Y") # as.Date(d3) will not work

## [1] "2018-04-30"
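As a small additional sketch (base R only; the input strings are made up for illustration), the same date can be parsed from several layouts by supplying the matching format string. Note that a format that does not match yields NA rather than an error:

```r
# Hypothetical input strings, all meant to describe 30 April 2018
raw <- c("4/30-2018", "30.04.2018", "2018-04-30")
fmt <- c("%m/%d-%Y", "%d.%m.%Y", "%Y-%m-%d")

# Parse each string with its own format; mapply pairs them element-wise
iso <- mapply(function(x, f) format(as.Date(x, format = f)),
              raw, fmt, USE.NAMES = FALSE)
parsed <- as.Date(iso)
parsed

# A format that does not match the string gives NA instead of an error
not_matching <- as.Date("4/30-2018", format = "%Y-%m-%d")
not_matching
```

Checking for NA after parsing is therefore a cheap way to detect rows whose date layout differs from what you expected.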

If you are working with monthly or quarterly data, yearmon and yearqtr will be your friends (both come from the zoo package, which serves as the foundation for xts):

as.yearmon(d1); as.yearmon(as.Date(d3, "%m/%d-%Y"))

## [1] "Jan 2018"

## [1] "Apr 2018"

as.yearqtr(d1); as.yearqtr(as.Date(d3, "%m/%d-%Y"))

## [1] "2018 Q1"

## [1] "2018 Q2"

Note that as.yearmon shows dates according to the current locale of your computer (e.g. Austrian German). You can find out about your locale with Sys.getlocale() and set a different locale with Sys.setlocale(). The locale names used below are the ones used on Windows; other operating systems use different names.

Sys.setlocale("LC_TIME","German_Austria")

## [1] "German_Austria.1252"

as.yearmon(d1); as.yearmon(as.Date(d3, "%m/%d-%Y"))

## [1] "Jän 2018"

## [1] "Apr 2018"

Sys.setlocale("LC_TIME","English")

## [1] "English_United States.1252"

as.yearmon(d1); as.yearmon(as.Date(d3, "%m/%d-%Y"))

## [1] "Jan 2018"

## [1] "Apr 2018"

When your data also carries time-of-day information, you will need either the POSIXct class (the basic class behind date-times in base R) or the timeDate package. The latter includes excellent abilities to work with financial data (see the next section). Note that talking about time also requires you to talk about timezones! We start with several examples of the POSIX classes:

strptime("2018-01-15 13:55:23.975", "%Y-%m-%d %H:%M:%OS") # converts from character to POSIXlt (wrap in as.POSIXct() to obtain POSIXct)

## [1] "2018-01-15 13:55:23 CET"

as.POSIXct("2009-01-05 14:19:12", format="%Y-%m-%d %H:%M:%S", tz="UTC")

## [1] "2009-01-05 14:19:12 UTC"
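To make the role of the timezone more concrete, here is a minimal base R sketch. It assumes that the Olson timezone names used below are available on your system (see OlsonNames()). One POSIXct value is a single instant in time that can be displayed in any timezone:

```r
# One instant in time, defined in UTC
t_utc <- as.POSIXct("2009-01-05 14:19:12", format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

# The same instant displayed in other timezones
zurich <- format(t_utc, tz = "Europe/Zurich")
newyork <- format(t_utc, tz = "America/New_York")
zurich; newyork

# Internally, POSIXct is just the number of seconds since 1970-01-01 UTC
as.numeric(t_utc)

# The same instant entered as New York local time: the difference is zero
t_ny <- as.POSIXct("2009-01-05 09:19:12", tz = "America/New_York")
gap <- difftime(t_ny, t_utc, units = "secs")
gap
```

Comparisons and differences between POSIXct values work on the underlying instants, so they do not depend on the timezone used for printing.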

We will mainly use the timeDate-package that provides many useful functions for financial timeseries.

An introduction to timeDate by the Rmetrics group can be found at https://www.rmetrics.org/sites/default/files/2010-02-timeDateObjects.pdf.

Dates <- c("1989-09-28","2001-01-15","2004-08-30","1990-02-09")
Times <- c( "23:12:55", "10:34:02", "08:30:00", "11:18:23")
DatesTimes <- paste(Dates, Times)
as.Date(DatesTimes)

## [1] "1989-09-28" "2001-01-15" "2004-08-30" "1990-02-09"

as.timeDate(DatesTimes)

## GMT
## [1] [1989-09-28 23:12:55] [2001-01-15 10:34:02] [2004-08-30 08:30:00]
## [4] [1990-02-09 11:18:23]

You can see that a timeDate object comes along with timezone information (GMT, the default financial center unless you change it with setRmetricsOptions(myFinCenter = ...)). timeDate allows you to specify the timezone of origin (zone) as well as the financial center to convert the data to (FinCenter):

timeDate(DatesTimes, zone = "Tokyo", FinCenter = "Zurich")

## Zurich
## [1] [1989-09-28 15:12:55] [2001-01-15 02:34:02] [2004-08-30 01:30:00]
## [4] [1990-02-09 03:18:23]

timeDate(DatesTimes, zone = "Tokyo", FinCenter = "NewYork")

## NewYork
## [1] [1989-09-28 10:12:55] [2001-01-14 20:34:02] [2004-08-29 19:30:00]
## [4] [1990-02-08 21:18:23]

timeDate(DatesTimes, zone = "NewYork", FinCenter = "Tokyo")

## Tokyo
## [1] [1989-09-29 12:12:55] [2001-01-16 00:34:02] [2004-08-30 21:30:00]
## [4] [1990-02-10 01:18:23]

listFinCenter("Europe/Vi*") # get a list of all financial centers available in ... (the pattern is a regular expression, so "Vi*" means "V" followed by zero or more "i")

## [1] "Europe/Vaduz"     "Europe/Vatican"   "Europe/Vienna"   
## [4] "Europe/Vilnius"   "Europe/Volgograd"

Both the Date class and the timeDate package allow you to create time sequences (necessary if you want to create timeseries manually):

dates1 <- seq(as.Date("2017-01-01"), length=12, by="month"); dates1 # or to=

##  [1] "2017-01-01" "2017-02-01" "2017-03-01" "2017-04-01" "2017-05-01"
##  [6] "2017-06-01" "2017-07-01" "2017-08-01" "2017-09-01" "2017-10-01"
## [11] "2017-11-01" "2017-12-01"

dates2 <- timeSequence(from = "2017-01-01", to = "2017-12-31", by = "month"); dates2 # or length.out=

## GMT
##  [1] [2017-01-01] [2017-02-01] [2017-03-01] [2017-04-01] [2017-05-01]
##  [6] [2017-06-01] [2017-07-01] [2017-08-01] [2017-09-01] [2017-10-01]
## [11] [2017-11-01] [2017-12-01]

Now there are several very useful functions in the timeDate package to determine first/last days of months/quarters/… (I will let them speak for themselves):

timeFirstDayInMonth(dates1 -7) # btw check the difference between "dates1-7" and "dates2-7"

## GMT
##  [1] [2016-12-01] [2017-01-01] [2017-02-01] [2017-03-01] [2017-04-01]
##  [6] [2017-05-01] [2017-06-01] [2017-07-01] [2017-08-01] [2017-09-01]
## [11] [2017-10-01] [2017-11-01]

timeFirstDayInQuarter(dates1)

## GMT
##  [1] [2017-01-01] [2017-01-01] [2017-01-01] [2017-04-01] [2017-04-01]
##  [6] [2017-04-01] [2017-07-01] [2017-07-01] [2017-07-01] [2017-10-01]
## [11] [2017-10-01] [2017-10-01]

timeLastDayInMonth(dates1)

## GMT
##  [1] [2017-01-31] [2017-02-28] [2017-03-31] [2017-04-30] [2017-05-31]
##  [6] [2017-06-30] [2017-07-31] [2017-08-31] [2017-09-30] [2017-10-31]
## [11] [2017-11-30] [2017-12-31]

timeLastDayInQuarter(dates1)

## GMT
##  [1] [2017-03-31] [2017-03-31] [2017-03-31] [2017-06-30] [2017-06-30]
##  [6] [2017-06-30] [2017-09-30] [2017-09-30] [2017-09-30] [2017-12-31]
## [11] [2017-12-31] [2017-12-31]

timeNthNdayInMonth("2018-01-01",nday = 5, nth = 3) # useful for option expiry that (mostly) happens on the 3rd Friday of each month

## GMT
## [1] [2018-01-19]

timeNthNdayInMonth(dates1,nday = 5, nth = 3)

## GMT
##  [1] [2017-01-20] [2017-02-17] [2017-03-17] [2017-04-21] [2017-05-19]
##  [6] [2017-06-16] [2017-07-21] [2017-08-18] [2017-09-15] [2017-10-20]
## [11] [2017-11-17] [2017-12-15]
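To see what timeNthNdayInMonth() computes for the third Friday, the following base R sketch reproduces the dates above by hand. The helper third_friday() is a hypothetical name introduced only for this illustration and is not part of timeDate:

```r
# Third Friday of a month, using base R only
third_friday <- function(year, month) {
  first <- as.Date(sprintf("%d-%02d-01", year, month))
  # Weekday of the 1st as a number: 0 = Sunday, ..., 5 = Friday
  # (numeric, hence independent of the locale)
  wd <- as.integer(format(first, "%w"))
  # Days from the 1st to the first Friday, then two more weeks
  first + (5 - wd) %% 7 + 14
}

third_friday(2018, 1)      # 2018-01-19, as above
third_friday(2017, 1:12)   # the twelve third Fridays of 2017
```

The function is vectorized over the month, so a whole year of (approximate) option expiry dates can be generated in one call.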

If one wants to create a more specific sequence of times, this can be done with timeCalendar using time ‘atoms’:

timeCalendar(m = 1:4, d = c(28, 15, 30, 9), y = c(1989, 2001, 2004, 1990), FinCenter = "Europe/Zurich")

## Europe/Zurich
## [1] [1989-01-28 01:00:00] [2001-02-15 01:00:00] [2004-03-30 02:00:00]
## [4] [1990-04-09 02:00:00]

timeCalendar(d=1, m=3:4, y=2018, h = c(9, 14), min = c(15, 23), s=c(39,41), FinCenter = "Europe/Zurich", zone = "UTC") # what happens here?

## Europe/Zurich
## [1] [2018-03-01 10:15:39] [2018-04-01 16:23:41]

1.1.1.2 Week-days and Business-days

One of the most important features, available only in the timeDate package, is the ability to check for business days at almost any financial center. The most important holiday calendars can be called via holidayXXX():

holidayNYSE()

## NewYork
## [1] [2018-01-01] [2018-01-15] [2018-02-19] [2018-03-30] [2018-05-28]
## [6] [2018-07-04] [2018-09-03] [2018-11-22] [2018-12-25]

holiday(year = 2018, Holiday = c("GoodFriday","Easter","FRAllSaints"))

## GMT
## [1] [2018-03-30] [2018-04-01] [2018-11-01]

dateSeq <- timeSequence(Easter(year(Sys.time()), -14), Easter(year(Sys.time()), +14)); dateSeq # select +- 14 days around Easter of this year

## GMT
##  [1] [2018-03-18] [2018-03-19] [2018-03-20] [2018-03-21] [2018-03-22]
##  [6] [2018-03-23] [2018-03-24] [2018-03-25] [2018-03-26] [2018-03-27]
## [11] [2018-03-28] [2018-03-29] [2018-03-30] [2018-03-31] [2018-04-01]
## [16] [2018-04-02] [2018-04-03] [2018-04-04] [2018-04-05] [2018-04-06]
## [21] [2018-04-07] [2018-04-08] [2018-04-09] [2018-04-10] [2018-04-11]
## [26] [2018-04-12] [2018-04-13] [2018-04-14] [2018-04-15]

dateSeq2 <- dateSeq[isWeekday(dateSeq)]; dateSeq2 # select only weekdays

## GMT
##  [1] [2018-03-19] [2018-03-20] [2018-03-21] [2018-03-22] [2018-03-23]
##  [6] [2018-03-26] [2018-03-27] [2018-03-28] [2018-03-29] [2018-03-30]
## [11] [2018-04-02] [2018-04-03] [2018-04-04] [2018-04-05] [2018-04-06]
## [16] [2018-04-09] [2018-04-10] [2018-04-11] [2018-04-12] [2018-04-13]

dayOfWeek(dateSeq2)

## 2018-03-19 2018-03-20 2018-03-21 2018-03-22 2018-03-23 2018-03-26 
##      "Mon"      "Tue"      "Wed"      "Thu"      "Fri"      "Mon" 
## 2018-03-27 2018-03-28 2018-03-29 2018-03-30 2018-04-02 2018-04-03 
##      "Tue"      "Wed"      "Thu"      "Fri"      "Mon"      "Tue" 
## 2018-04-04 2018-04-05 2018-04-06 2018-04-09 2018-04-10 2018-04-11 
##      "Wed"      "Thu"      "Fri"      "Mon"      "Tue"      "Wed" 
## 2018-04-12 2018-04-13 
##      "Thu"      "Fri"

dateSeq3 <- dateSeq[isBizday(dateSeq, holidayZURICH(year(Sys.time())))]; dateSeq3 # select only BusinessDays of Zurich

## GMT
##  [1] [2018-03-19] [2018-03-20] [2018-03-21] [2018-03-22] [2018-03-23]
##  [6] [2018-03-26] [2018-03-27] [2018-03-28] [2018-03-29] [2018-04-03]
## [11] [2018-04-04] [2018-04-05] [2018-04-06] [2018-04-09] [2018-04-10]
## [16] [2018-04-11] [2018-04-12] [2018-04-13]

dayOfWeek(dateSeq3)

## 2018-03-19 2018-03-20 2018-03-21 2018-03-22 2018-03-23 2018-03-26 
##      "Mon"      "Tue"      "Wed"      "Thu"      "Fri"      "Mon" 
## 2018-03-27 2018-03-28 2018-03-29 2018-04-03 2018-04-04 2018-04-05 
##      "Tue"      "Wed"      "Thu"      "Tue"      "Wed"      "Thu" 
## 2018-04-06 2018-04-09 2018-04-10 2018-04-11 2018-04-12 2018-04-13 
##      "Fri"      "Mon"      "Tue"      "Wed"      "Thu"      "Fri"
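The logic behind isWeekday() and isBizday() can be sketched in base R as follows. The holiday vector is an assumption for illustration: it contains exactly the two dates (Good Friday and Easter Monday 2018) that distinguish the weekday sequence from the Zurich business-day sequence in the output above, not a complete holiday calendar:

```r
# All days from 14 days before to 14 days after Easter 2018 (2018-04-01)
days <- seq(as.Date("2018-03-18"), as.Date("2018-04-15"), by = "day")

# %u gives the ISO weekday number (1 = Monday, ..., 7 = Sunday)
is_weekday <- as.integer(format(days, "%u")) <= 5

# Assumed holidays, taken from the difference between the two outputs above
holidays <- as.Date(c("2018-03-30", "2018-04-02"))

week_days <- days[is_weekday]
biz_days <- days[is_weekday & !(days %in% holidays)]
length(week_days); length(biz_days)
```

In practice the holiday calendars shipped with timeDate are preferable, because they encode the moving holidays of each financial center for every year.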

One of the strongest features of the timeDate package shows when one combines times and dates from different timezones. This can be a challenging task (imagine hourly stock prices from London, Tokyo and New York). Luckily, the timeDate package handles this easily:

ZH <- timeDate("2015-01-01 16:00:00", zone = "GMT", FinCenter = "Zurich")
NY <- timeDate("2015-01-01 18:00:00", zone = "GMT", FinCenter = "NewYork")
c(ZH, NY)

## Zurich
## [1] [2015-01-01 17:00:00] [2015-01-01 19:00:00]

c(NY, ZH) # it always takes the Financial Center of the first entry

## NewYork
## [1] [2015-01-01 13:00:00] [2015-01-01 11:00:00]

1.1.1.3 Assignments

Create a daily time series for 2018:

Find the subset of first and last days per month/quarter (uniquely).
Take December 2017 and remove all weekends and holidays in Zurich (Tokyo).
Create a series of five dates and times in New York. Show them for New York, London and Belgrade.

1.1.2 eXtensible Timeseries

The xts format is based on the timeseries format zoo, but extends its power to be more compatible with other data classes. For example, if one converts dates from timeDate, xts is flexible enough to remember the financial center the dates came from, so that upon conversion back to this class the information that would have been lost in a pure zoo object is restored. As we will quite often want to transform our data to and from xts, this is a great feature and makes our lives a lot easier. xts also comes with a bundle of other features.

For the reader who wants to dig deeper, we recommend the excellent zoo vignettes (vignette("zoo-quickref"), vignette("zoo"), vignette("zoo-faq"), vignette("zoo-design") and vignette("zoo-read")). Read up on xts in vignette("xts") and vignette("xts-faq").

To start, we create an xts object consisting of a series of randomly created data points:

data <- rnorm(5) # 5 std. normally distributed random numbers
dates <- seq(as.Date("2017-05-01"), length=5, by="days")
xts1 <- xts(x=data, order.by=dates); xts1

##                  [,1]
## 2017-05-01 -0.4089077
## 2017-05-02  1.8744037
## 2017-05-03 -0.1501499
## 2017-05-04  0.2816538
## 2017-05-05  1.2689409

coredata(xts1) # access data

##            [,1]
## [1,] -0.4089077
## [2,]  1.8744037
## [3,] -0.1501499
## [4,]  0.2816538
## [5,]  1.2689409

index(xts1)   # access time (index)

## [1] "2017-05-01" "2017-05-02" "2017-05-03" "2017-05-04" "2017-05-05"

Here, the xts object was built from a vector and a series of Dates. We could also have used timeDate, yearmon or yearqtr and a data.frame:

s1 <- rnorm(5); s2 <- 1:5
data <- data.frame(s1,s2)
dates <- timeSequence("2017-01-01",by="months",length.out=5,zone = "GMT")
xts2 <- xts(x=data, order.by=dates); xts2

## Warning: timezone of object (GMT) is different than current timezone ().

##                    s1 s2
## 2017-01-01  1.1119636  1
## 2017-02-01 -1.9167934  2
## 2017-03-01 -0.3592996  3
## 2017-04-01 -0.5022776  4
## 2017-05-01  0.8596929  5

dates2 <- as.yearmon(dates)
xts3 <- xts(x=data, order.by = dates2)

In the next step we evaluate the merging of two timeseries:

set.seed(1)
xts3 <- xts(rnorm(6), timeSequence(from = "2017-01-01", to = "2017-06-01", by="month"))
xts4 <- xts(rnorm(5), timeSequence(from = "2017-04-01", to = "2017-08-01", by="month"))
colnames(xts3) <- "tsA"; colnames(xts4) <- "tsB"
merge(xts3,xts4)

Please be aware that when joining timeseries in R you sometimes have to choose between a left, right, inner or outer join of the two objects:

merge(xts3,xts4,join = "left")
merge(xts3,xts4,join = "right")
merge(xts3,xts4,join = "inner")
merge(xts3,xts4,join="outer",fill=0)
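Because the series above contain random numbers, the effect of the join type is easier to see on a tiny deterministic example. The following sketch assumes that the xts package is installed; the objects a and b are made up for illustration:

```r
library(xts)

a <- xts(c(1, 2, 3), order.by = as.Date(c("2017-01-01", "2017-02-01", "2017-03-01")))
b <- xts(c(4, 5, 6), order.by = as.Date(c("2017-02-01", "2017-03-01", "2017-04-01")))
colnames(a) <- "a"; colnames(b) <- "b"

outer_join <- merge(a, b)                  # default: union of all dates, gaps are NA
inner_join <- merge(a, b, join = "inner")  # only dates present in both series
left_join  <- merge(a, b, join = "left")   # all dates of a
right_join <- merge(a, b, join = "right")  # all dates of b
filled     <- merge(a, b, fill = 0)        # outer join, gaps replaced by 0

nrow(outer_join); nrow(inner_join); nrow(left_join); nrow(right_join)
```

Counting rows (and NAs) after a merge is a quick sanity check that the join did what you intended.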

In the next step, we subset and replace parts of xts objects:

xts5 <- xts(rnorm(24), timeSequence(from = "2016-01-01", to = "2017-12-01", by="month"))
xts5["2017-01-01"]
xts5["2017-05-01/2017-08-12"]
xts5[c("2017-01-01","2017-05-01")] <- NA
xts5["2016"] <- 99
xts5["2016-05-01/"]
first(xts5)
last(xts5)
first(xts5,"3 months")
xts6 <- last(xts5,"1 year")
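The date strings used for subsetting follow the ISO 8601 style "from/to", where either side may be omitted or shortened to a year or a month. A small sketch with known values (assuming xts is installed; the object x is made up for illustration) shows how many rows each pattern selects:

```r
library(xts)

# 24 monthly observations, January 2016 to December 2017
x <- xts(as.numeric(1:24),
         order.by = seq(as.Date("2016-01-01"), by = "month", length.out = 24))

n_year  <- nrow(x["2016"])              # all observations of 2016
n_range <- nrow(x["2017-03/2017-05"])   # March to May 2017
n_from  <- nrow(x["2017-10/"])          # October 2017 until the end
n_to    <- nrow(x["/2016-02"])          # from the start until February 2016
c(n_year, n_range, n_from, n_to)
```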

Now let us handle the missing values we introduced. One possibility is simply to omit them using na.omit(). Other possibilities are to carry the last observation forward with na.locf() or to interpolate linearly with na.approx():

na.omit(xts6)
na.locf(xts6)
na.locf(xts6,fromLast = TRUE,na.rm = TRUE)
na.approx(xts6,na.rm = FALSE)
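On a short series with known values, the difference between the three approaches is easy to verify by eye. This is a minimal sketch, assuming xts (and with it zoo) is installed; the series x is made up for illustration:

```r
library(xts)  # also attaches zoo, which provides na.locf() and na.approx()

x <- xts(c(1, NA, 3, NA, 5), order.by = as.Date("2017-01-01") + 0:4)

omitted <- na.omit(x)                   # drops the two rows containing NA
locf    <- na.locf(x)                   # last observation carried forward
nocb    <- na.locf(x, fromLast = TRUE)  # next observation carried backward
interp  <- na.approx(x)                 # linear interpolation along the time index

cbind(x, locf, nocb, interp)
```

Which method is appropriate depends on the data: carrying prices forward is common for non-trading days, while interpolation uses information from the future and should be avoided in backtests.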

Standard calculations can be done on xts objects as well, and there are some handy helper functions to make life easier:

periodicity(xts5)
nmonths(xts5); nquarters(xts5); nyears(xts5)
to.yearly(xts5)
to.quarterly(xts6)
round(xts6^2,2)
xts6[which(is.na(xts6))] <- rnorm(2)
# Aggregation of timeseries
ep1 <- endpoints(xts6,on="months",k = 2) # end points of 2-month periods
ep2 <- endpoints(xts6,on="months",k = 3) # end points of 3-month periods
period.sum(xts6, INDEX = ep2)
period.apply(xts6, INDEX = ep1, FUN=mean) # 2month means
period.apply(xts6, INDEX = ep2, FUN=mean) # 3month means
# Lead, lag and diff operations
cbind(xts6,lag(xts6,k=-1),lag(xts6,k=1),diff(xts6))
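The following sketch repeats the aggregation and lag operations on a series with known values, so that the results can be checked by hand. It assumes xts is installed; the series m is made up for illustration:

```r
library(xts)

# Twelve monthly observations with the values 1, ..., 12
m <- xts(as.numeric(1:12),
         order.by = seq(as.Date("2017-01-01"), by = "month", length.out = 12))

# Row numbers at which each 3-month period ends (always starts with 0)
ep <- endpoints(m, on = "months", k = 3)
ep

# Apply a function to each block between two consecutive end points
q_sum <- period.apply(m, INDEX = ep, FUN = sum)
q_sum

# lag(k = 1) moves yesterday's value to today, so the first entry is NA;
# diff() is the difference between the series and its first lag
lagged <- lag(m, k = 1)
delta  <- diff(m)
head(cbind(m, lagged, delta), 3)
```

Note that the sign convention of lag() for xts objects (positive k looks backward) is the opposite of the one used for zoo and ts objects.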

Finally, I will show some applications that go beyond xts, for example using lapply to operate on a list:

# splitting a timeseries by year (the result is a list)
xts5_yearly <- split(xts5,f="years")
lapply(xts5_yearly,FUN=mean,na.rm=TRUE)
# using elaborate functions from the zoo-package
rollapply(as.zoo(xts6), width=3, FUN=sd) # rolling standard deviation

Last but not least, we save xts data to a (csv) file, read it back in and plot it:

tmp <- tempfile()
write.zoo(xts2,sep=",",file = tmp)
xts8  <- as.xts(read.zoo(tmp, sep=",", FUN=as.yearmon))
plot(xts8)

1.1.3 Downloading timeseries and basic visualization with quantmod

Many downloading and plotting functions are (still) available in quantmod. We first load the package, then download data for Google, Apple and the S&P500 from Yahoo Finance. Each of these “Symbols” will be loaded as its own object into the current environment. For plotting there is a large variety of technical indicators available, for an overview see here.

Quantmod is developed by Jeffrey Ryan and Joshua Ulrich and has a homepage. The homepage includes an Introduction, describes how Data can be handled between xts and quantmod and has examples about Financial Charting with quantmod and TTR. More documents will be developed within 2018.

require(quantmod)
# the easiest way of getting data is from Yahoo Finance, provided you know the appropriate symbols (Apple is "AAPL")
getSymbols(Symbols = "AAPL", from="2010-01-01", to="2018-03-01", periodicity="monthly")
head(AAPL)
is.xts(AAPL)
plot(AAPL[, "AAPL.Adjusted"], main = "AAPL")
chartSeries(AAPL, TA=c(addVo(),addBBands(), addADX())) # Plot and add technical indicators
getSymbols(Symbols = c("GOOG","^GSPC"), from="2000-01-01", to="2018-03-01", periodicity="monthly") # now get Google and the S&P500
getSymbols('DTB3', src='FRED') # fred does not recognize from and to

Now we create an xts from all relevant parts of the data

stocks <- cbind("Apple"=AAPL[,"AAPL.Adjusted"],"Google"=GOOG[,"GOOG.Adjusted"],"SP500"=GSPC[,"GSPC.Adjusted"])
rf.daily <- DTB3["2010-01-01/2018-03-01"]
rf.monthly <- to.monthly(rf.daily)[,"rf.daily.Open"]
rf <- xts(coredata(rf.monthly),order.by =  as.Date(index(rf.monthly)))

One possibility (that I adopted from https://www.quantinsti.com/blog/an-example-of-a-trading-strategy-coded-in-r/) is to use the technical indicators provided by quantmod to devise a technical trading strategy. We make use of a fast and a slow moving average (function MACD in the TTR package, which is loaded together with quantmod). Whenever the fast moving average crosses the slow moving one from below, we invest (there is a short term trend to exploit) and we drop out of the investment once the red (fast) line falls below the grey (slow) line. In the code below the signal compares the MACD line with its signal line and takes the values 1 (long) and -1 (short), so the strategy reverses the position rather than merely leaving the market. To evaluate the trading strategy we also need to calculate returns for the S&P500 index using ROC.

chartSeries(GSPC, TA=c(addMACD(fast=3, slow=12,signal=6,type=SMA)))
macd <- MACD(GSPC[,"GSPC.Adjusted"], nFast=3, nSlow=12,nSig=6,maType=SMA, percent = FALSE)
buy_sell_signal <- Lag(ifelse(macd$macd < macd$signal, -1, 1))
buy_sell_returns <- (ROC(GSPC[,"GSPC.Adjusted"])*buy_sell_signal)["2001-06-01/"]
portfolio <- exp(cumsum(buy_sell_returns)) # for nice plotting we assume that we invest one dollar and see how much we have at the end of the observation period
plot(portfolio)
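The mechanics of such a signal-based strategy can be sketched without any download, using base R and simulated prices. Note the assumptions: the price path is a random walk and not real market data, the helper sma() is a hypothetical name, and the rule is a plain fast/slow moving-average crossover, which is a simplification of the MACD-versus-signal rule used above:

```r
set.seed(42)
# Hypothetical price path: a random walk in logs, NOT real market data
price <- 100 * exp(cumsum(rnorm(120, mean = 0.005, sd = 0.04)))

# Simple moving average over the last n observations (NA until n values exist)
sma <- function(x, n) as.numeric(stats::filter(x, rep(1 / n, n), sides = 1))
fast <- sma(price, 3)
slow <- sma(price, 12)

# Position decided with the information available at time t ...
signal <- ifelse(fast > slow, 1, -1)
# ... and applied to the return from t to t+1 (the lag avoids look-ahead bias)
position <- c(NA, head(signal, -1))

log_ret <- c(NA, diff(log(price)))
strat_ret <- position * log_ret

# Value of one unit invested, starting once the first signal is available
ok <- !is.na(strat_ret)
wealth <- exp(cumsum(strat_ret[ok]))
tail(wealth, 1)
```

The lag of the signal is the crucial step: without it, the strategy would trade on information that is only known at the end of the period whose return it earns.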

For the evaluation of trading strategies, portfolios and other financial timeseries, almost every tool is available through the package PerformanceAnalytics. In this case charts.PerformanceSummary() calculates cumulative returns (similar to above), monthly returns and the maximum drawdown (the maximum loss in relation to the best value reached so far, see here).

PerformanceAnalytics is a large package with a huge variety of tools. There are vignettes on the estimation of higher order (co)moments (vignette("EstimationComoments")), performance attribution measures according to (???) (vignette("PA-Bacon")), charting (vignette("PA-charts")) and more, all of which can be found on the PerformanceAnalytics CRAN page.

require(PerformanceAnalytics)
rets <- cbind(buy_sell_returns,ROC(GSPC[,"GSPC.Adjusted"]))
colnames(rets) <- c("investment","benchmark")
charts.PerformanceSummary(rets,colorset=rich6equal)
chart.Histogram(rets, main = "Risk Measures", methods = c("add.density", "add.normal","add.risk"),colorset=rich6equal)
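The drawdown shown in the performance summary can be illustrated by hand with a few lines of base R. This is a sketch under stated assumptions: the returns are made-up log returns, compounded with exp(cumsum()) as in the portfolio calculation above, and it is not the implementation used inside PerformanceAnalytics:

```r
# Hypothetical log returns, for illustration only
r <- c(0.05, -0.02, -0.10, 0.04, 0.08, -0.03)

wealth <- exp(cumsum(r))        # value of one unit invested
peak <- cummax(wealth)          # best value reached so far
drawdown <- wealth / peak - 1   # loss relative to that peak (always <= 0)

max_drawdown <- min(drawdown)   # the worst peak-to-trough loss
round(drawdown, 4)
max_drawdown
```

Here the worst loss occurs after the third period, when the portfolio has lost 0.02 + 0.10 = 0.12 in log terms from its peak, that is exp(-0.12) - 1, or roughly -11.3%.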

1.2 Introduction to the `tidyverse`

1.2.1 Tibbles

Since the middle of 2017 a lot of programmers have put in a huge effort to rewrite many R functions and data objects in a tidy way, thereby creating the tidyverse.

For updates check the tidyverse homepage. A very well written book introducing the tidyverse can be found online: R for Data Science. The core of the tidyverse currently contains several packages:

ggplot2 for creating powerful graphs (see the vignette("ggplot2-specs"))
dplyr for data manipulation (see the vignette("dplyr"))
tidyr for tidying data
readr for importing datasets (see the vignette("readr"))
purrr for programming
tibble for modern data.frames (see the vignette("tibble"))

and many more.

require(tidyverse) # install first if not yet there, update regularly: install.packages("tidyverse")
require(tidyquant) # this package wraps all the quantmod etc packages into the language and structure of the tidyverse

Most of the following is adapted from “An Introduction to Statistical Learning with Applications in R” by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani at http://www.science.smith.edu/~jcrouser/SDS293/labs/. We begin by loading the Auto data set. This data is part of the ISLR package.

require(ISLR)
data(Auto)

Nothing happens when you run this, but now the data is available in your environment. (In RStudio, you would see the name of the data in your Environment tab.) To view the data, we can either print the entire dataset by typing its name, or we can “slice” some of the data off to look at just a subset by piping the data into the slice function using the %>% operator. The piping operator is one of the most useful tools of the tidyverse: it lets you pipe command into command into command without saving and naming each intermediate step. Below we also build a first tibble by hand (a tibble is a concept similar to the data.frame, but better). A tibble has observations in rows and variables in columns. Those variables can have many different formats:

Auto %>% slice(1:10)
tbs1 <- tibble(
  Date = seq(as.Date("2017-01-01"), length=12, by="months"),
  returns = rnorm(12),
  letters = sample(letters, 12, replace = TRUE)
)

As you can see, all three columns of tbs1 have different formats. One can access the different variables by name and by position. If you want to use the pipe operator, you need to use the special placeholder ".":

tbs1$returns
tbs1[[2]]
tbs1 %>% .[[2]]

Before we go on to analyse a large tibble such as Auto, we quickly talk about reading and saving files with tools from the tidyverse. We save the file as csv using write_csv and read it back using read_csv. Because the columns of the file read back in do not have exactly the same format as before, we use mutate to transform the columns.

Auto <- as_tibble(Auto) # make a tibble from Auto (as.tibble() is deprecated)
tmp <- tempfile()
write_csv(Auto, tmp) # write to a temporary csv file
Auto2 <- read_csv(tmp) # read it back in
Auto2 <- Auto2 %>%
  mutate(cylinders = as.double(cylinders),
         horsepower = as.double(horsepower),
         year = as.double(year),
         origin = as.double(origin),
         name = as.factor(name)) # restore the original column types
all.equal(Auto,Auto2) # only the factor levels differ

Notice that the data looks just the same as when we loaded it from the package. Now that we have the data, we can begin to learn things about it.

dim(Auto)
str(Auto)
names(Auto)

The dim() function tells us that the data has 392 observations and nine variables. The original data had some rows with missing values, but these are not part of the version we loaded. The str() function tells us that most of the variables are numeric, although the name variable is a factor. names() lets us check the variable names.

1.2.2 Summary statistics

Often, we want to know some basic things about the variables in our data. The summary() function produces a numerical summary of each variable in a data set and thereby gives you an idea of the distributions of your variables.

summary(Auto)

The summary suggests that origin might be better thought of as a factor. It only seems to have three possible values, 1, 2 and 3. If we read the documentation about the data (using ?Auto) we will learn that these numbers correspond to where the car is from: 1. American, 2. European, 3. Japanese. So, let's mutate() that variable into a factor (categorical) variable.

Auto <- Auto %>%
    mutate(origin = factor(origin))
summary(Auto)
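The effect of the conversion on summary() can be seen in isolation with a small base R sketch. The vector origin below contains made-up codes rather than the real Auto column, and the labels are the ones given in the text above:

```r
# Hypothetical origin codes, NOT the real Auto column
origin <- c(1, 3, 2, 1, 1, 3)

# Attach readable labels while converting to a factor
origin_f <- factor(origin, levels = 1:3,
                   labels = c("American", "European", "Japanese"))

summary(origin)     # numeric summary: min, quartiles, mean, max
summary(origin_f)   # factor summary: number of observations per level
```

For a numeric variable summary() reports quantiles and the mean, which are meaningless for country codes, whereas for a factor it reports counts per category.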

1.2.3 Plotting

We can use the ggplot2 package to produce simple graphics. ggplot2 has a particular syntax, which looks like this:

ggplot(Auto) + geom_point(aes(x=cylinders, y=mpg))

The basic idea is that you need to initialize a plot with ggplot() and then add “geoms” (short for geometric objects) to the plot. The ggplot2 package is based on the Grammar of Graphics, a famous book on data visualization theory. It is a way to map attributes in your data (like variables) to “aesthetics” on the plot. The function aes() is short for aesthetics.

For more about the ggplot2 syntax, view the help by typing ?ggplot or ?geom_point. There are also great online resources for ggplot2, like the R graphics cookbook.

The cylinders variable is stored as a numeric vector, so R has treated it as quantitative. However, since there are only a small number of possible values for cylinders, one may prefer to treat it as a qualitative variable. We can turn it into a factor, again using a mutate() call.

Auto <- Auto %>%
    mutate(cylinders = factor(cylinders))

To view the relationship between a categorical and a numeric variable, we might want to produce boxplots. As usual, a number of options can be specified in order to customize the plots.

ggplot(Auto) + geom_boxplot(aes(x=cylinders, y=mpg)) + xlab("Cylinders") + ylab("MPG")

The geom geom_histogram() can be used to plot a histogram.

ggplot(Auto) + geom_histogram(aes(x=mpg), binwidth=5)

For small datasets, we might want to see all the bivariate relationships between the variables. The GGally package has an extension of the scatterplot matrix that can do just that. We use select to pick only the two variables mpg and cylinders and pipe them into the ggpairs() function:

Auto %>% select(mpg, cylinders) %>% GGally::ggpairs()

Because there are not many cars with 3 and 5 cylinders, we use filter to select only those cars with 4, 6 and 8 cylinders.

Auto %>% select(mpg, cylinders) %>% filter(cylinders %in% c(4,6,8)) %>% GGally::ggpairs()

Sometimes, we might want to save a plot for use outside of R. To do this, we can use the ggsave() function.

ggsave("histogram.png",ggplot(Auto) + geom_histogram(aes(x=mpg), binwidth=5))

TO DO:

Tidyquant: document more technical features.
For extensive timeseries manipulations there is an extension of the tibble objects: time-aware tibbles (tibbletime), which offer much of the xts functionality without the need for conversion.