15 Data Manipulation (IV)

This chapter discusses another technique in data frame manipulations, reshaping data frames from wide to long format or from long to wide format.

15.1 Wide and long formats

The term reshape points to something very specific in data frame manipulation, which is to restructure a data frame from wide to long format or from long to wide format.

What exactly does wide or long mean for a data frame? How do data frames in wide or long format looks like? Actually, we have some good examples from our Yahoo Finance datasets.

wide

When we download historical stocks data from Yahoo Finance using quantmod, we will see that information for each stock is stored in individual objects.

library(quantmod)
tickers <- c("AFL","AAPL", "MMM")
getSymbols(tickers, src = "yahoo", 
           from = "2021-01-01", 
           to = "2021-01-31")
##            MMM.Open MMM.High MMM.Low MMM.Close MMM.Volume MMM.Adjusted
## 2021-01-04   175.00   176.20  170.55    171.87    2996200     169.2200
## 2021-01-05   172.01   173.25  170.65    171.58    2295300     168.9345
## 2021-01-06   172.72   175.57  172.04    174.19    3346400     171.5043
## 2021-01-07   171.56   173.46  166.16    169.72    5863400     167.1032
## 2021-01-08   169.17   169.54  164.61    166.62    4808100     164.0510
## 2021-01-11   166.10   167.14  165.00    165.20    2736600     162.6529

These are xts objects.

class(MMM)
## [1] "xts" "zoo"

An xts object can be thought of as a matrix of observations combined with an index of corresponding dates and times: xts = matrix + times. The index holds the information that we need for xts to treat our data as a time series.

These time series objects are designed to be bound together by dates and times. If we combine these xts objects, we bind these data frames horizontally.

qm <- cbind(AAPL, AFL, MMM)

head(qm)
##            AAPL.Open AAPL.High AAPL.Low AAPL.Close AAPL.Volume AAPL.Adjusted AFL.Open AFL.High AFL.Low AFL.Close
## 2021-01-04    133.52    133.61   126.76     129.41   143301900      128.9978    44.60    44.70   43.03     43.19
## 2021-01-05    128.89    131.74   128.43     131.01    97664900      130.5927    43.33    43.70   42.96     43.26
## 2021-01-06    127.72    131.05   126.38     126.60   155088000      126.1967    43.91    45.20   43.54     44.93
## 2021-01-07    128.36    131.63   127.86     130.92   109578200      130.5030    45.18    45.26   44.45     44.68
## 2021-01-08    132.43    132.63   130.23     132.05   105158200      131.6294    44.92    44.99   43.72     44.49
## 2021-01-11    129.19    130.17   128.50     128.98   100384500      128.5692    44.22    44.82   44.00     44.51
##            AFL.Volume AFL.Adjusted MMM.Open MMM.High MMM.Low MMM.Close MMM.Volume MMM.Adjusted
## 2021-01-04    3111200     42.63539   175.00   176.20  170.55    171.87    2996200     169.2200
## 2021-01-05    2561700     42.70449   172.01   173.25  170.65    171.58    2295300     168.9345
## 2021-01-06    3842900     44.35305   172.72   175.57  172.04    174.19    3346400     171.5043
## 2021-01-07    4217400     44.10626   171.56   173.46  166.16    169.72    5863400     167.1032
## 2021-01-08    2915700     43.91870   169.17   169.54  164.61    166.62    4808100     164.0510
## 2021-01-11    2479000     43.93844   166.10   167.14  165.00    165.20    2736600     162.6529

The resulting data frame is wide. In this wide format, we have one record for each date. Prices for different stocks on the same date will be in one row. The observations for different stocks are coded as different columns.

long

Compare what we get from quantmod with what we get from tidyquant.

library(tidyquant)
tq <- tq_get(c("AFL","AAPL", "MMM"), 
             get = "stock.prices", 
             from="2021-01-01",to="2021-01-31") 

We get a data frame that stores information of all the stocks vertically. The resulting data frame is long.

## # A tibble: 11 × 8
##    symbol date        open  high   low close    volume adjusted
##    <chr>  <date>     <dbl> <dbl> <dbl> <dbl>     <dbl>    <dbl>
##  1 AFL    2021-01-25  46.0  46.6  45.7  46.6   5031400     46.0
##  2 AFL    2021-01-26  46.7  47.0  46.4  46.4   3956600     45.8
##  3 AFL    2021-01-27  45.8  46.1  44.8  45.2   4463200     44.6
##  4 AFL    2021-01-28  45.6  46.5  45.3  46.1   4272600     45.5
##  5 AFL    2021-01-29  45.9  46.0  44.8  45.2   5236800     44.6
##  6 AAPL   2021-01-04 134.  134.  127.  129.  143301900    129. 
##  7 AAPL   2021-01-05 129.  132.  128.  131.   97664900    131. 
##  8 AAPL   2021-01-06 128.  131.  126.  127.  155088000    126. 
##  9 AAPL   2021-01-07 128.  132.  128.  131.  109578200    131. 
## 10 AAPL   2021-01-08 132.  133.  130.  132.  105158200    132. 
## 11 AAPL   2021-01-11 129.  130.  128.  129.  100384500    129.

In this long format, in each record, the variable date records the time period, and the variable symbol groups the records from the same stocks. There will be multiple records for each date. Each row is one time point per stock.

wide or long

Both formats have their advantages. If the data is collected on the same time points, the wide format has no redundancy or repetition. Stock prices collected on the same day take one row. Elementary statistical computations are easy to do in the wide format.

The long format, on the other hand, has an explicit time variable available that can be used for analysis. Graphs and statistical analyses are easier in the long format.

15.2 Reshaping data frames from long to wide

The next question is how to convert between wide and the long formats.

First, long to wide. For instance, how can we convert the long dataset collected by tidyquant to the wide format? The package tidyr offers the function pivot_wider() that can help us convert a data frame from long to wide in most situations that are not hugely complicated.

library(tidyr)
tq_to_wide <- 
  tq %>% 
  pivot_wider(names_from = symbol, 
              values_from = c(open,close,high,low,volume,adjusted),
              names_sep = ".") 

Arguments names_from and values_from are a pair of arguments describing which column(s) to get the name of the output column (names_from), and which column(s) to get the cell values from (values_from).

If values_from contains multiple values, the value will be added to the beginning of the output column. Therefore, the variable symbol in the original long format would be added to the price variables open, close, high, low, volume, and adjusted.

head(tq_to_wide)
## # A tibble: 6 × 19
##   date       open.AFL open.AAPL open.MMM close.AFL close.AAPL close.MMM high.AFL high.AAPL high.MMM low.AFL
##   <date>        <dbl>     <dbl>    <dbl>     <dbl>      <dbl>     <dbl>    <dbl>     <dbl>    <dbl>   <dbl>
## 1 2021-01-04     44.6      134.     175       43.2       129.      172.     44.7      134.     176.    43.0
## 2 2021-01-05     43.3      129.     172.      43.3       131.      172.     43.7      132.     173.    43.0
## 3 2021-01-06     43.9      128.     173.      44.9       127.      174.     45.2      131.     176.    43.5
## 4 2021-01-07     45.2      128.     172.      44.7       131.      170.     45.3      132.     173.    44.5
## 5 2021-01-08     44.9      132.     169.      44.5       132.      167.     45.0      133.     170.    43.7
## 6 2021-01-11     44.2      129.     166.      44.5       129.      165.     44.8      130.     167.    44  
## # ℹ 8 more variables: low.AAPL <dbl>, low.MMM <dbl>, volume.AFL <dbl>, volume.AAPL <dbl>, volume.MMM <dbl>,
## #   adjusted.AFL <dbl>, adjusted.AAPL <dbl>, adjusted.MMM <dbl>

Because we have multiple variables here, these column names need to be joined together by a pattern, which is a dot. This is specified in names_sep.

15.3 Reshaping data frames from wide to long

tidyr has also provides a function pivot_longer() to convert a data frame from wide to long.

Let’s discuss a more meaningful application of reshaping a data frame from wide to long for visualization. The task is to make a line graph with multiple lines using our stocks dataset.

We created a subset where it only contains data of the company MMM . We also created a new variable SMA_2 that is the simple moving average for the past two days. The plot with a single line looks like this.

library(dplyr)
library(ggplot2)

sp500 %>% 
  mutate(SMA_2 = SMA(close, n = 2)) %>%
  filter(symbol == "MMM") %>%
  ggplot(aes(x = date, y = SMA_2)) + 
    geom_line() +
    geom_point()

We’ll explain how to visualize data with R later, and focus on the motivation of reshaping the dataset for our visualization task.

If we want to add a new line of the simple moving average for the past five days, we need to create a new subset, where we have an additional column of the SMA for the past five days.

sp500 %>% 
  mutate(SMA_2 = SMA(close, n = 2)) %>%
  mutate(SMA_5 = SMA(close, n = 5)) %>%
  filter(symbol == "MMM") %>%
  select(symbol, date, SMA_2, SMA_5)
## # A tibble: 20 × 4
##    symbol date       SMA_2 SMA_5
##    <chr>  <date>     <dbl> <dbl>
##  1 MMM    2021-04-12   NA    NA 
##  2 MMM    2021-04-13  197.   NA 
##  3 MMM    2021-04-14  197.   NA 
##  4 MMM    2021-04-15  197.   NA 
##  5 MMM    2021-04-16  198.  197.
##  6 MMM    2021-04-19  199.  198.
##  7 MMM    2021-04-20  198.  198.
##  8 MMM    2021-04-21  200.  199.
##  9 MMM    2021-04-22  201.  199.
## 10 MMM    2021-04-23  202.  200.
## 11 MMM    2021-04-26  201.  200.
## 12 MMM    2021-04-27  197.  200.
## 13 MMM    2021-04-28  195.  199.
## 14 MMM    2021-04-29  197.  198.
## 15 MMM    2021-04-30  198.  197.
## 16 MMM    2021-05-03  198.  197.
## 17 MMM    2021-05-04  199.  198.
## 18 MMM    2021-05-05  200.  199.
## 19 MMM    2021-05-06  202.  200.
## 20 MMM    2021-05-07  203.  201.

However, for visualization, we need a column that contains both SMAs for the past two days and for past five days.

ggplot(aes(x = date, y = SMA_2_5, linetype = n)) + 
    geom_line() +
    geom_point()

In other words, we need the dataset to be of the long format, where the two groups are in one column. We can convert the wide format to the long format using the function pivot_longer() from tidyr.

MMM_SMA <- 
  sp500 %>% 
  mutate(SMA_2 = SMA(close, n = 2)) %>%
  mutate(SMA_5 = SMA(close, n = 5)) %>%
  filter(symbol == "MMM") %>%
  select(symbol, date, SMA_2, SMA_5) %>%
  pivot_longer(cols = SMA_2:SMA_5,
               names_to = c("SMA", "n"),
               names_pattern = "(.)_(.)")

In pivot_longer(), cols specifies the columns to pivot into longer format. names_to is a character vector that specifies the names of the columns to create. names_pattern controls how the column name would be broken up if we have multiple variable names. It takes a regular expression containing matching groups.

MMM_SMA
## # A tibble: 40 × 5
##    symbol date       SMA   n     value
##    <chr>  <date>     <chr> <chr> <dbl>
##  1 MMM    2021-04-12 A     2       NA 
##  2 MMM    2021-04-12 A     5       NA 
##  3 MMM    2021-04-13 A     2      197.
##  4 MMM    2021-04-13 A     5       NA 
##  5 MMM    2021-04-14 A     2      197.
##  6 MMM    2021-04-14 A     5       NA 
##  7 MMM    2021-04-15 A     2      197.
##  8 MMM    2021-04-15 A     5       NA 
##  9 MMM    2021-04-16 A     2      198.
## 10 MMM    2021-04-16 A     5      197.
## # ℹ 30 more rows

As shown above, the two SMAs now both go to the column value. There are some missing values here because we are running an SMA for the past n days. For n-1 days, SMAs will be missing.

Now we can create a line graph by group.

MMM_SMA %>% 
  ggplot(aes(x = date, y = value, linetype = n)) + 
    geom_line() +
    geom_point()