13 Data Manipulation (II)
This chapter focuses on variable transformations, data aggregation, and other tasks operating on rows and columns, including handling duplicates, sorting rows, and ordering columns.
13.1 Transformating variables
Variable transformation is the replacement of a variable by a new value or a function of that variable. The purposes can be multiple: convenience, reducing skewness, producing linear or additive relationships, and more. Standardization and logarithmic transformation are common types of variable transformation.
creating new variables
The simplest way to create a new variable is using the operator $
.
AFL <- sp500stocks[sp500stocks$symbol == "AFL",]
AFL$close_lagged1 <- lag(AFL$close, k = 1)
AFL$close_lagged2 <- lag(AFL$close, k = 2)
lag
computes a lagged version of a time series, shifting the time base back by a given number of observations. The argument k
is the number of lags in units of observations.
Below is the dplyr
version. The lag()
is not the base R lag
, and n
is equivalent to k
.
## # A tibble: 20 × 9
## symbol date open high low close volume adjusted close_lagged1
## <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 AFL 2021-04-12 52.0 52.6 51.8 52.4 2731200 52.4 NA
## 2 AFL 2021-04-13 52.1 52.5 51.9 52.3 3074600 52.3 52.4
## 3 AFL 2021-04-14 52.2 52.9 52.1 52.9 2255700 52.9 52.3
## 4 AFL 2021-04-15 52.9 52.9 52.5 52.8 2164500 52.8 52.9
## 5 AFL 2021-04-16 53.3 53.6 53.1 53.5 2763400 53.5 52.8
## 6 AFL 2021-04-19 53.5 53.7 53.3 53.6 3330100 53.6 53.5
## 7 AFL 2021-04-20 53.3 53.4 52.2 52.6 2512600 52.6 53.6
## 8 AFL 2021-04-21 52.7 53.6 52.6 53.5 3037700 53.5 52.6
## 9 AFL 2021-04-22 53.1 53.4 52.8 52.9 3874600 52.9 53.5
## 10 AFL 2021-04-23 53.0 53.6 52.8 53.4 2803400 53.4 52.9
## 11 AFL 2021-04-26 53.5 54.1 53.3 53.4 3473500 53.4 53.4
## 12 AFL 2021-04-27 53.4 53.6 53.2 53.5 2466700 53.5 53.4
## 13 AFL 2021-04-28 53.9 54 53.5 53.5 2247400 53.5 53.5
## 14 AFL 2021-04-29 54.1 54.9 53.5 54.2 3253400 54.2 53.5
## 15 AFL 2021-04-30 54.2 54.2 53.4 53.7 4143000 53.7 54.2
## 16 AFL 2021-05-03 54.3 54.9 53.8 54.5 3071000 54.5 53.7
## 17 AFL 2021-05-04 54.4 55.0 54.0 54.9 2948400 54.9 54.5
## 18 AFL 2021-05-05 54.9 55.4 54.9 55.4 2575000 55.4 54.9
## 19 AFL 2021-05-06 55.7 56.2 55.2 56.2 3791500 56.2 55.4
## 20 AFL 2021-05-07 55.4 56.2 55.3 56.1 2597200 56.1 56.2
mutate()
can operate on multiple variables.
## # A tibble: 20 × 10
## symbol date open high low close volume adjusted close_lagged1 close_lagged2
## <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 AFL 2021-04-12 52.0 52.6 51.8 52.4 2731200 52.4 NA NA
## 2 AFL 2021-04-13 52.1 52.5 51.9 52.3 3074600 52.3 52.4 NA
## 3 AFL 2021-04-14 52.2 52.9 52.1 52.9 2255700 52.9 52.3 52.4
## 4 AFL 2021-04-15 52.9 52.9 52.5 52.8 2164500 52.8 52.9 52.3
## 5 AFL 2021-04-16 53.3 53.6 53.1 53.5 2763400 53.5 52.8 52.9
## 6 AFL 2021-04-19 53.5 53.7 53.3 53.6 3330100 53.6 53.5 52.8
## 7 AFL 2021-04-20 53.3 53.4 52.2 52.6 2512600 52.6 53.6 53.5
## 8 AFL 2021-04-21 52.7 53.6 52.6 53.5 3037700 53.5 52.6 53.6
## 9 AFL 2021-04-22 53.1 53.4 52.8 52.9 3874600 52.9 53.5 52.6
## 10 AFL 2021-04-23 53.0 53.6 52.8 53.4 2803400 53.4 52.9 53.5
## 11 AFL 2021-04-26 53.5 54.1 53.3 53.4 3473500 53.4 53.4 52.9
## 12 AFL 2021-04-27 53.4 53.6 53.2 53.5 2466700 53.5 53.4 53.4
## 13 AFL 2021-04-28 53.9 54 53.5 53.5 2247400 53.5 53.5 53.4
## 14 AFL 2021-04-29 54.1 54.9 53.5 54.2 3253400 54.2 53.5 53.5
## 15 AFL 2021-04-30 54.2 54.2 53.4 53.7 4143000 53.7 54.2 53.5
## 16 AFL 2021-05-03 54.3 54.9 53.8 54.5 3071000 54.5 53.7 54.2
## 17 AFL 2021-05-04 54.4 55.0 54.0 54.9 2948400 54.9 54.5 53.7
## 18 AFL 2021-05-05 54.9 55.4 54.9 55.4 2575000 55.4 54.9 54.5
## 19 AFL 2021-05-06 55.7 56.2 55.2 56.2 3791500 56.2 55.4 54.9
## 20 AFL 2021-05-07 55.4 56.2 55.3 56.1 2597200 56.1 56.2 55.4
Let’s create some technical indicators to evaluate the market performance of the stocks of the S&P 500 companies.
Simple Moving Average (SMA)
An n-day simple moving average (n-day SMA) is arithmetic average of prices of past n days.
We’ll create a variable that is a 2-day SMA of AFL’s close prices.
mySMA <- function (price, n){
sma <- c()
sma[1 : (n - 1)] <- NA
for (i in n : length(price)){
sma[i] <- mean(price[(i - n + 1) : i])
}
return(sma)
}
mySMA(AFL$close, 2)
## [1] NA 52.365 52.580 52.850 53.140 53.505 53.075 53.045 53.220 53.155 53.370 53.450 53.525 53.870 53.975
## [16] 54.125 54.705 55.135 55.780 56.145
We can add that variable to our data frame.
The package quantmod
provides an SMA()
function, which we can directly call to calculate SMA.
Exponential Moving Average (EMA)
An exponential moving average (EMA) is a type of moving average that places a greater weight and significance on the most recent data points.
EMA with n lagged period at time t is defined below by myEMA()
.
myEMA <- function (price, n){
ema <- c()
ema[1 : (n - 1)] <- NA
ema[n]<- mean(price[1 : n])
beta <- 2 / (n + 1)
for (i in (n + 1):length(price)){
ema[i] <- beta * price[i] + (1 - beta) * ema[i - 1]
}
return(ema)
}
myEMA(AFL$close, 3)
## [1] NA NA 52.53333 52.68167 53.06583 53.31292 52.95146 53.22573 53.08286 53.22643 53.29822 53.41411
## [13] 53.46705 53.84353 53.78676 54.15338 54.52169 54.95085 55.56542 55.83771
We can add a new variable that calculates three-day EMA of AFL’s close prices.
The quantmod
function is EMA()
.
dropping variables
Assigning NULL
to a variable removes that variable from the data frame.
Adding -
to a variable deselects that variable with dplyr::select()
.
AFL %>%
mutate(close_lagged1 = lag(close, n = 1),
close_lagged2 = lag(close, n = 2)) %>%
select(-close_lagged2)
## # A tibble: 20 × 11
## symbol date open high low close volume adjusted sma ema close_lagged1
## <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 AFL 2021-04-12 52.0 52.6 51.8 52.4 2731200 52.4 NA NA NA
## 2 AFL 2021-04-13 52.1 52.5 51.9 52.3 3074600 52.3 52.4 NA 52.4
## 3 AFL 2021-04-14 52.2 52.9 52.1 52.9 2255700 52.9 52.6 52.5 52.3
## 4 AFL 2021-04-15 52.9 52.9 52.5 52.8 2164500 52.8 52.9 52.7 52.9
## 5 AFL 2021-04-16 53.3 53.6 53.1 53.5 2763400 53.5 53.1 53.1 52.8
## 6 AFL 2021-04-19 53.5 53.7 53.3 53.6 3330100 53.6 53.5 53.3 53.5
## 7 AFL 2021-04-20 53.3 53.4 52.2 52.6 2512600 52.6 53.1 53.0 53.6
## 8 AFL 2021-04-21 52.7 53.6 52.6 53.5 3037700 53.5 53.0 53.2 52.6
## 9 AFL 2021-04-22 53.1 53.4 52.8 52.9 3874600 52.9 53.2 53.1 53.5
## 10 AFL 2021-04-23 53.0 53.6 52.8 53.4 2803400 53.4 53.2 53.2 52.9
## 11 AFL 2021-04-26 53.5 54.1 53.3 53.4 3473500 53.4 53.4 53.3 53.4
## 12 AFL 2021-04-27 53.4 53.6 53.2 53.5 2466700 53.5 53.4 53.4 53.4
## 13 AFL 2021-04-28 53.9 54 53.5 53.5 2247400 53.5 53.5 53.5 53.5
## 14 AFL 2021-04-29 54.1 54.9 53.5 54.2 3253400 54.2 53.9 53.8 53.5
## 15 AFL 2021-04-30 54.2 54.2 53.4 53.7 4143000 53.7 54.0 53.8 54.2
## 16 AFL 2021-05-03 54.3 54.9 53.8 54.5 3071000 54.5 54.1 54.2 53.7
## 17 AFL 2021-05-04 54.4 55.0 54.0 54.9 2948400 54.9 54.7 54.5 54.5
## 18 AFL 2021-05-05 54.9 55.4 54.9 55.4 2575000 55.4 55.1 55.0 54.9
## 19 AFL 2021-05-06 55.7 56.2 55.2 56.2 3791500 56.2 55.8 55.6 55.4
## 20 AFL 2021-05-07 55.4 56.2 55.3 56.1 2597200 56.1 56.1 55.8 56.2
13.2 Recoding variables
Recode means replacing numeric values based on their position or name.
<-
One way we to recode variables is to use the assignment sign <-
.
For instance, below we combine the category “Communication Services” with “Information Technology”, and “Consumer Discretionary” with “Consumer Staples” in the sp500tickers
dataset.
##
## Communication Services Consumer Discretionary Consumer Staples Energy
## 26 63 32 23
## Financials Health Care Industrials Information Technology
## 65 62 74 75
## Materials Real Estate Utilities
## 28 29 28
sp500tickers$sector <- sp500tickers$`GICS Sector`
sp500tickers$sector[sp500tickers$sector == "Communication Services"] <- "CS&IT"
sp500tickers$sector[sp500tickers$sector == "Information Technology"] <- "CS&IT"
sp500tickers$sector[sp500tickers$sector == "Consumer Discretionary"] <- "Consumer"
sp500tickers$sector[sp500tickers$sector == "Consumer Staples"] <- "Consumer"
The resulting new category is shown below.
##
## Consumer CS&IT Energy Financials Health Care Industrials Materials Real Estate Utilities
## 95 101 23 65 62 74 28 29 28
dplyr::recode()
The same can be achieved by dplyr
function recode()
.
sp500tickers$sector2 <- recode(sp500tickers$`GICS Sector`,
`Communication Services` = "CS&IT",
`Information Technology` = "CS&IT",
`Consumer Discretionary` = "Consumer",
`Consumer Staples` = "Consumer")
head(sp500tickers$sector2)
## [1] "Industrials" "Health Care" "Health Care" "Health Care" "CS&IT" "CS&IT"
dplyr::case_when()
case_when()
allows us to vectorize multiple if
and else if
statements.
sp500tickers$sector3 <- case_when(
sp500tickers$`GICS Sector` %in% c("Communication Services","Information Technology") ~ "CS&IT",
sp500tickers$`GICS Sector` %in% c("Consumer Discretionary","Consumer Staples") ~ "Consumer")
What is not specified becomes NA
.
## [1] NA NA NA NA "CS&IT" "CS&IT"
To correct that:
sp500tickers$sector3 <- case_when(
sp500tickers$`GICS Sector` %in% c("Communication Services","Information Technology") ~ "CS&IT",
sp500tickers$`GICS Sector` %in% c("Consumer Discretionary","Consumer Staples") ~ "Consumer",
TRUE ~ sp500tickers$`GICS Sector`)
head(sp500tickers$sector3)
## [1] "Industrials" "Health Care" "Health Care" "Health Care" "CS&IT" "CS&IT"
dplyr::mutate(case_when())
mutate(case_when())
creates new variables based on specified conditions.
sp500tickers <- sp500tickers %>%
mutate(sector4 = case_when(
sp500tickers$`GICS Sector` %in% c("Communication Services","Information Technology") ~ "CS&IT",
sp500tickers$`GICS Sector` %in% c("Consumer Discretionary","Consumer Staples") ~ "Consumer",
TRUE ~ sp500tickers$`GICS Sector`)
)
head(sp500tickers$sector4)
## [1] "Industrials" "Health Care" "Health Care" "Health Care" "CS&IT" "CS&IT"
dplyr::mutate(ifelse())
mutate()
can also be used with ifelse()
.
sp500tickers %>%
filter(is.na(`Date first added`)) %>%
mutate(newdate = ifelse(is.na(`Date first added`), Founded, `Date first added`)) %>%
select(`Date first added`, Founded, newdate)
## # A tibble: 51 × 3
## `Date first added` Founded newdate
## <chr> <chr> <chr>
## 1 <NA> 1823 1823
## 2 <NA> 1851 1851
## 3 <NA> 1969 1969
## 4 <NA> 1938 1938
## 5 <NA> 1983 1983
## 6 <NA> 1911 1911
## 7 <NA> 1966 1966
## 8 <NA> 1858 1858
## 9 <NA> 1997 1997
## 10 <NA> 1947 1947
## # ℹ 41 more rows
13.3 Aggregating data
Data aggregation is the process of combining individual-level data to high-level data.
apply family functions
We discussed apply family functions thoroughly in the chapter Apply Family Functions. One application of these functions is aggregating data.
For instance, we can get the means of open, close, high, and low prices and the volume in the time periods from 2021-04-12 to 2021-05-07 using apply()
.
## open high low close volume adjusted
## 53.6065 54.0265 53.2685 53.7645 2955745.0000 53.7645
As another example, we can use tapply()
to calculate the momentum of the close prices broken out by symbol.
We first create the momentum function and set the default n
to be 2.
myMom <- function (price, n = 2){
mom <- rep(0, n)
N <- length(price)
Lprice <- lag(price, n)
for (i in (n + 1) : N){
mom[i] <- price[i] - Lprice[i]
}
return(mom)
}
We then apply myMom()
to the close prices of AFL. Here, symbol
is the group variable.
## $A
## [1] 0.000000 0.000000 0.699997 0.309998 2.610001 0.220001 -2.119995 0.330002 2.069992 2.849991 2.809997
## [12] -0.199997 -2.539993 -2.330002 -1.160004 -0.959992 -2.740005 0.179993 1.500000 0.529999
##
## $AAL
## [1] 0.000000 0.000000 -0.539999 -0.430000 -0.340000 -0.580000 -1.660000 -0.539999 -0.300001 0.100001 1.970001
## [12] 0.649999 -0.340000 -0.460001 0.019998 0.650002 -0.299999 -0.380001 0.070000 0.430000
##
## $AAP
## [1] 0.000000 0.000000 -0.289994 2.050003 5.149994 3.079986 0.990006 7.330002 6.389999 -0.119995 -3.160004
## [12] -2.470001 0.490006 3.399994 2.529999 1.180008 3.989990 0.949997 0.010010 1.660003
We can also use by()
to apply the function myMom()
to each symbol
. The result is a list.
## $A
## [1] 0.000000 0.000000 0.699997 0.309998 2.610001 0.220001 -2.119995 0.330002 2.069992 2.849991 2.809997
## [12] -0.199997 -2.539993 -2.330002 -1.160004 -0.959992 -2.740005 0.179993 1.500000 0.529999
##
## $AAL
## [1] 0.000000 0.000000 -0.539999 -0.430000 -0.340000 -0.580000 -1.660000 -0.539999 -0.300001 0.100001 1.970001
## [12] 0.649999 -0.340000 -0.460001 0.019998 0.650002 -0.299999 -0.380001 0.070000 0.430000
##
## $AAP
## [1] 0.000000 0.000000 -0.289994 2.050003 5.149994 3.079986 0.990006 7.330002 6.389999 -0.119995 -3.160004
## [12] -2.470001 0.490006 3.399994 2.529999 1.180008 3.989990 0.949997 0.010010 1.660003
dplyr
functions
The outputs from the apply functions are not always so controllable as we would like them to be. Sometimes getting our results into a data frame from apply function can take a bit of effort.
This is where the dplyr
functions come in handy. We show two quick examples below.
group_by()
group_by()
groups a data frame by the specified variable. ungroup()
removes the grouping.
group_by()
is most useful when used in combination with summarise()
, filter()
, mutate()
, and arrange()
.
sp500stocks %>%
filter(symbol %in% c("AFL", "MMM", "AAPL")) %>%
group_by(symbol) %>%
summarise(sum(volume))
## # A tibble: 3 × 2
## symbol `sum(volume)`
## <chr> <dbl>
## 1 AAPL 1820543300
## 2 AFL 59114900
## 3 MMM 43785600
tally()
tally()
counts the number of cases or calls sum()
to calculate the total amount.
## # A tibble: 124 × 3
## # Groups: GICS Sector [11]
## `GICS Sector` `GICS Sub-Industry` n
## <chr> <chr> <int>
## 1 Communication Services Advertising 2
## 2 Communication Services Alternative Carriers 1
## 3 Communication Services Broadcasting 2
## 4 Communication Services Cable & Satellite 3
## 5 Communication Services Integrated Telecommunication Services 2
## 6 Communication Services Interactive Home Entertainment 3
## 7 Communication Services Interactive Media & Services 4
## 8 Communication Services Movies & Entertainment 6
## 9 Communication Services Publishing 2
## 10 Communication Services Wireless Telecommunication Services 1
## # ℹ 114 more rows
tally(sum())
calculates the sum by the specified group.
sp500stocks %>%
filter(symbol %in% c("AFL", "MMM", "AAPL")) %>%
group_by(symbol) %>%
tally(sum(volume))
## # A tibble: 3 × 2
## symbol n
## <chr> <dbl>
## 1 AAPL 1820543300
## 2 AFL 59114900
## 3 MMM 43785600
13.4 Duplicates
keeping unique rows
dplyr::distinct()
is equivalent to unique()
, which selects unique rows.
We can apply distinct()
to the entire data frame to keep unique records using information from all variables.
We can also filter records based on a single variable.
In this example, duplicates are removed and all other variables are kept where symbol
has no duplicates. Only the first cases of duplicated symbol
have been kept. Without argument .keep_all
, the other variables will be dropped.
distinct()
allows for multiple variables to be included.
finding duplicates
Using duplicated()
from base R, we will get a logical vector indicating which values are duplicates. By default, the first value will be marked as the unique value (FALSE
) and the rest as duplicates (TRUE
).
We may also create a new column to mark the duplicates.
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE
## [19] FALSE FALSE
In cases where we want to mark all identical elements as duplicates, we can specify the fromLast
argument to be TRUE
, where duplication would be considered from the reverse side.
13.5 Sorting rows
dplyr::arrange()
sorts variables in ascending order. It is the dplyr
version of sort()
.
desc()
sorts variables in descending order.
Alternatively, we can add -
to a variable within arrange()
.
13.6 Ordering columns
In addition to moving the rows, we can move columns in select()
.
## # A tibble: 10,100 × 7
## adjusted symbol date open high low close
## <dbl> <chr> <date> <dbl> <dbl> <dbl> <dbl>
## 1 198. MMM 2021-04-12 198. 199. 197. 198.
## 2 196. MMM 2021-04-13 196. 197. 195. 196.
## 3 197. MMM 2021-04-14 196. 197. 195. 197.
## 4 197. MMM 2021-04-15 197. 198. 196. 197.
## 5 199. MMM 2021-04-16 199. 200. 198. 199.
## 6 199. MMM 2021-04-19 198. 199. 197. 199.
## 7 198. MMM 2021-04-20 199. 199. 197. 198.
## 8 201. MMM 2021-04-21 198. 202. 198. 201.
## 9 201. MMM 2021-04-22 201. 202. 200. 201.
## 10 202. MMM 2021-04-23 201 203. 201. 202.
## # ℹ 10,090 more rows
We can move a variable to the front with select(var, everything())
.
## # A tibble: 10,100 × 8
## adjusted symbol date open high low close volume
## <dbl> <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 198. MMM 2021-04-12 198. 199. 197. 198. 1682100
## 2 196. MMM 2021-04-13 196. 197. 195. 196. 2259700
## 3 197. MMM 2021-04-14 196. 197. 195. 197. 1575000
## 4 197. MMM 2021-04-15 197. 198. 196. 197. 1461600
## 5 199. MMM 2021-04-16 199. 200. 198. 199. 2769500
## 6 199. MMM 2021-04-19 198. 199. 197. 199. 2200500
## 7 198. MMM 2021-04-20 199. 199. 197. 198. 1653100
## 8 201. MMM 2021-04-21 198. 202. 198. 201. 2649000
## 9 201. MMM 2021-04-22 201. 202. 200. 201. 2215800
## 10 202. MMM 2021-04-23 201 203. 201. 202. 1864100
## # ℹ 10,090 more rows
select(-adjusted, everything())
moves the column adjusted
to the last.
## # A tibble: 10,100 × 8
## symbol date open high low close volume adjusted
## <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 MMM 2021-04-12 198. 199. 197. 198. 1682100 198.
## 2 MMM 2021-04-13 196. 197. 195. 196. 2259700 196.
## 3 MMM 2021-04-14 196. 197. 195. 197. 1575000 197.
## 4 MMM 2021-04-15 197. 198. 196. 197. 1461600 197.
## 5 MMM 2021-04-16 199. 200. 198. 199. 2769500 199.
## 6 MMM 2021-04-19 198. 199. 197. 199. 2200500 199.
## 7 MMM 2021-04-20 199. 199. 197. 198. 1653100 198.
## 8 MMM 2021-04-21 198. 202. 198. 201. 2649000 201.
## 9 MMM 2021-04-22 201. 202. 200. 201. 2215800 201.
## 10 MMM 2021-04-23 201 203. 201. 202. 1864100 202.
## # ℹ 10,090 more rows