12 Data Manipulation (I)

Data manipulation is the “dirty work” that must be done before we can start data analysis properly.

In this and the next several chapters, we’ll address the essential tasks in data manipulation one by one, and apply those tricks to our datasets. This chapter focuses on the data manipulation tasks for exploratory data analysis.

12.1 Demo datasets

In the next several chapters, we’ll use two demo datasets for demonstrations. The first dataset is a table S&P 500 component stocks scraped from the Wikipedia page List of S&P 500 companies following this tutorial.

Later, we’ll use the Symbol of the S&P 500 to match and collect historical Yahoo Finance data so that we have both the historical performance of the stocks and the descriptions of the firms.

library(readr)
sp500tickers <- read_csv("sp500tickers.csv")
head(sp500tickers, 3)
## # A tibble: 3 × 9
##   Symbol Security         SEC f…¹ GICS …² GICS …³ Headq…⁴ Date …⁵    CIK Founded
##   <chr>  <chr>            <chr>   <chr>   <chr>   <chr>   <chr>    <dbl> <chr>  
## 1 MMM    3M Company       reports Indust… Indust… St. Pa… 1976-0… 6.67e4 1902   
## 2 ABT    Abbott Laborato… reports Health… Health… North … 1964-0… 1.8 e3 1888   
## 3 ABBV   AbbVie Inc.      reports Health… Pharma… North … 2012-1… 1.55e6 2013 (…
## # … with abbreviated variable names ¹​`SEC filings`, ²​`GICS Sector`,
## #   ³​`GICS Sub-Industry`, ⁴​`Headquarters Location`, ⁵​`Date first added`

The second dataset is the stock performance of these S&P 500 companies, collected from the Yahoo Finance historical data page.

sp500stocks <- read.csv("sp500stocks.csv")
head(sp500stocks, 3)
##   symbol       date   open   high    low  close  volume adjusted
## 1    MMM 2021-04-12 198.20 199.29 197.06 197.83 1682100   197.83
## 2    MMM 2021-04-13 196.01 197.33 195.37 196.47 2259700   196.47
## 3    MMM 2021-04-14 195.55 197.39 195.02 196.75 1575000   196.75

symbol

symbol is a stock’s symbol, also known as ticker. It is a unique series of letters assigned to a security for trading purposes on a particular stock market. For instance, stocks listed on the New York Stock Exchange (NYSE) can have four or fewer letters. NASDAQ-listed securities can have up to five characters. In short, symbols are a way of describing a company’s stock.

prices

In stock trading, open is the price of the stock when the market opened on this day. close is the price at which a stock ended trading in the same period.

high means the highest price in a given period of time. low means the lowest price in a given period of time.

volume is the total amount of trading activity. It is the number of shares that have been bought and sold for the day.

adjusted is adjusted close price. It is the closing price after adjustments for all applicable splits and dividend distributions.

12.2 First look at our datasets

head(), tail()

head() and tail() let us review the first or last several records in our datasets.

tail(sp500stocks, n = 5)
##       symbol       date   open   high    low  close  volume adjusted
## 10096    ZTS 2021-05-03 174.70 176.23 173.35 175.24 1886800   175.24
## 10097    ZTS 2021-05-04 174.43 174.82 172.42 173.77 1849700   173.77
## 10098    ZTS 2021-05-05 174.20 174.30 172.77 173.86 1433500   173.86
## 10099    ZTS 2021-05-06 167.25 168.63 165.22 167.52 3200100   167.52
## 10100    ZTS 2021-05-07 169.19 173.26 169.00 171.55 1492500   171.55

summary(), str()

summary() provides summaries of our dataset. It is also useful to produce result summaries from model fitting functions.

For a data frame, summary() provides the summary statistics for numeric variables.

summary(sp500stocks)
##     symbol              date                open              high        
##  Length:10100       Length:10100       Min.   :   9.77   Min.   :  10.52  
##  Class :character   Class :character   1st Qu.:  60.95   1st Qu.:  61.61  
##  Mode  :character   Mode  :character   Median : 113.34   Median : 114.58  
##                                        Mean   : 189.65   Mean   : 191.56  
##                                        3rd Qu.: 206.42   3rd Qu.: 208.43  
##                                        Max.   :5149.68   Max.   :5234.27  
##       low              close             volume             adjusted      
##  Min.   :   9.70   Min.   :  10.11   Min.   :     9800   Min.   :  10.11  
##  1st Qu.:  60.45   1st Qu.:  61.11   1st Qu.:   859275   1st Qu.:  61.11  
##  Median : 112.11   Median : 113.60   Median :  1749450   Median : 113.60  
##  Mean   : 187.62   Mean   : 189.72   Mean   :  4279249   Mean   : 189.68  
##  3rd Qu.: 204.78   3rd Qu.: 206.72   3rd Qu.:  4053675   3rd Qu.: 206.72  
##  Max.   :5109.32   Max.   :5211.38   Max.   :230430600   Max.   :5211.38

We can also use summary() to summarize variables, in addition to data frames.

summary(sp500stocks$high)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.52   61.61  114.58  191.56  208.43 5234.27

str() displays a structure of an object.

base_stocks <- read.csv("sp500stocks.csv")
str(base_stocks)
## 'data.frame':    10100 obs. of  8 variables:
##  $ symbol  : chr  "MMM" "MMM" "MMM" "MMM" ...
##  $ date    : chr  "2021-04-12" "2021-04-13" "2021-04-14" "2021-04-15" ...
##  $ open    : num  198 196 196 197 199 ...
##  $ high    : num  199 197 197 198 200 ...
##  $ low     : num  197 195 195 196 198 ...
##  $ close   : num  198 196 197 197 199 ...
##  $ volume  : int  1682100 2259700 1575000 1461600 2769500 2200500 1653100 2649000 2215800 1864100 ...
##  $ adjusted: num  198 196 197 197 199 ...

tibble is essentially tidyverse’s data frame, and that won’t change the way we work with data frames fundamentally.

tidy_stocks <- read_csv("sp500stocks.csv")
str(tidy_stocks)
## spc_tbl_ [10,100 × 8] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ symbol  : chr [1:10100] "MMM" "MMM" "MMM" "MMM" ...
##  $ date    : Date[1:10100], format: "2021-04-12" "2021-04-13" ...
##  $ open    : num [1:10100] 198 196 196 197 199 ...
##  $ high    : num [1:10100] 199 197 197 198 200 ...
##  $ low     : num [1:10100] 197 195 195 196 198 ...
##  $ close   : num [1:10100] 198 196 197 197 199 ...
##  $ volume  : num [1:10100] 1682100 2259700 1575000 1461600 2769500 ...
##  $ adjusted: num [1:10100] 198 196 197 197 199 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   symbol = col_character(),
##   ..   date = col_date(format = ""),
##   ..   open = col_double(),
##   ..   high = col_double(),
##   ..   low = col_double(),
##   ..   close = col_double(),
##   ..   volume = col_double(),
##   ..   adjusted = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

table()

table() returns a frequency table.

table(sp500tickers$`GICS Sector`)
## 
## Communication Services Consumer Discretionary       Consumer Staples 
##                     26                     63                     32 
##                 Energy             Financials            Health Care 
##                     23                     65                     62 
##            Industrials Information Technology              Materials 
##                     74                     75                     28 
##            Real Estate              Utilities 
##                     29                     28

12.3 Handling missing values

This section introduces ways to identify and replace missing values.

identifying missing values

summary(), is.na(), and complete.cases() can be used to identify missing (and non-missing) values, together with functions like which(), which returns locations of missing values, and sum(), which gives us counts of missing values.

summary()

To start with, we can find out if a numeric variable contains missing values or not by summary().

As shown below, we don’t have such variables in our datasets.

summary(sp500stocks)
##     symbol              date                open              high        
##  Length:10100       Length:10100       Min.   :   9.77   Min.   :  10.52  
##  Class :character   Class :character   1st Qu.:  60.95   1st Qu.:  61.61  
##  Mode  :character   Mode  :character   Median : 113.34   Median : 114.58  
##                                        Mean   : 189.65   Mean   : 191.56  
##                                        3rd Qu.: 206.42   3rd Qu.: 208.43  
##                                        Max.   :5149.68   Max.   :5234.27  
##       low              close             volume             adjusted      
##  Min.   :   9.70   Min.   :  10.11   Min.   :     9800   Min.   :  10.11  
##  1st Qu.:  60.45   1st Qu.:  61.11   1st Qu.:   859275   1st Qu.:  61.11  
##  Median : 112.11   Median : 113.60   Median :  1749450   Median : 113.60  
##  Mean   : 187.62   Mean   : 189.72   Mean   :  4279249   Mean   : 189.68  
##  3rd Qu.: 204.78   3rd Qu.: 206.72   3rd Qu.:  4053675   3rd Qu.: 206.72  
##  Max.   :5109.32   Max.   :5211.38   Max.   :230430600   Max.   :5211.38

Let’s load another dataset from dplyr just to see what happens when we have missing values in a numeric variable.

starwars <- dplyr::starwars
summary(starwars$birth_year)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    8.00   35.00   52.00   87.57   72.00  896.00      44

is.na()

is.na() evaluates whether elements are missing. We will get TRUE if an element is missing and FALSE if an element is not missing.

It is self-explanatory that !is.na() evaluates whether elements are NOT missing.

As shown below, Date first added is a character vector that has quite some missing values.

is.na(head(sp500tickers$`Date first added`, 30))
##  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [25] FALSE FALSE FALSE FALSE FALSE FALSE

which(is.na())

Now the question is: how would I know where in the dataset these missing values occur?

We can use which(), which evaluates which indices are TRUE.

which(is.na(sp500tickers$`Date first added`))
##  [1] 123 127 135 136 151 161 178 191 192 204 205 211 239 240 270 276 285 294 302
## [20] 307 310 313 325 326 328 340 342 355 362 374 379 392 396 404 409 415 424 425
## [39] 433 435 442 454 457 467 474 477 484 485 492 493 494

sum(is.na()), mean(is.na())

Two useful tricks to identify missing values are sum(is.na()) and mean(is.na()).

sum(is.na()) tells us how many missing values there are in a variable.

sum(is.na(sp500tickers$`Date first added`))
## [1] 51

mean(is.na()) tells us if there are any missing values in a variable. If there is no missing value, we should get 0. If there indeed are missing values, the returned value should be larger than 0.

mean(is.na(sp500tickers$`Date first added`))
## [1] 0.1009901

complete.cases()

complete.cases() returns a logical vector indicating which cases are complete, i.e., have no missing values.

complete.cases(head(sp500tickers$`Date first added`, 30))
##  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [16] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

To take subsets where Date first added is complete:

sp500tickers[complete.cases(sp500tickers$`Date first added`), ] 
## # A tibble: 454 × 9
##    Symbol Security        SEC f…¹ GICS …² GICS …³ Headq…⁴ Date …⁵    CIK Founded
##    <chr>  <chr>           <chr>   <chr>   <chr>   <chr>   <chr>    <dbl> <chr>  
##  1 MMM    3M Company      reports Indust… Indust… St. Pa… 1976-0… 6.67e4 1902   
##  2 ABT    Abbott Laborat… reports Health… Health… North … 1964-0… 1.8 e3 1888   
##  3 ABBV   AbbVie Inc.     reports Health… Pharma… North … 2012-1… 1.55e6 2013 (…
##  4 ABMD   Abiomed         reports Health… Health… Danver… 2018-0… 8.15e5 1981   
##  5 ACN    Accenture       reports Inform… IT Con… Dublin… 2011-0… 1.47e6 1989   
##  6 ATVI   Activision Bli… reports Commun… Intera… Santa … 2015-0… 7.19e5 2008   
##  7 ADBE   Adobe Inc.      reports Inform… Applic… San Jo… 1997-0… 7.96e5 1982   
##  8 AMD    Advanced Micro… reports Inform… Semico… Santa … 2017-0… 2.49e3 1969   
##  9 AAP    Advance Auto P… reports Consum… Automo… Raleig… 2015-0… 1.16e6 1932   
## 10 AES    AES Corp        reports Utilit… Indepe… Arling… 1998-1… 8.75e5 1981   
## # … with 444 more rows, and abbreviated variable names ¹​`SEC filings`,
## #   ²​`GICS Sector`, ³​`GICS Sub-Industry`, ⁴​`Headquarters Location`,
## #   ⁵​`Date first added`

replacing missing values

There are cases where we would like to recode the missing values. For instance, in survey data, when a respondent refused to respond, these cases could be coded as numbers like -999. If we don’t read the codebook carefully, and just treat those records as reflecting the truth, we could make serious mistakes. Therefore, we should recode those values to system missing values. In R, we want to recode them to NA.

survey$variable[survey$variable == -999] <- NA

There are also cases where we would like to recode missing values to a meaningful number, such as 0.

df$variable[is.na(df$variable)] <- 0

12.4 Subsets

To examine subsets of our data to see whether some conditions are true. For instance, we may go back to subsets of our data when we notice something that does not make sense and would like to take a closer look at it. Other times we may want to take a subset from the entire dataset and only work on that subset for further operations.

There are several ways to take subsets.

The first option is data frame indexing. For instance, we can take a subset on the stock AFL.

AFL <- sp500stocks[sp500stocks$symbol == "AFL",]

What about the stocks that have the highest or lowest open prices and on which days?

sp500stocks[sp500stocks$open == max(sp500stocks$open, na.rm = TRUE), c("symbol","date","open")]
##      symbol       date    open
## 6999    NVR 2021-05-06 5149.68
sp500stocks[sp500stocks$open == min(sp500stocks$open, na.rm = TRUE), c("symbol","date","open")]
##      symbol       date open
## 5968    MRO 2021-04-21 9.77

subset()

In addition to using the indexing operators, we can also use the function subset() to subset a data frame that meet conditions.

subset(x, subset, select)

The subset argument is a logical expression indicating elements or rows to keep; missing values are treated as FALSE. The select argument indicates columns to select from a data frame.

subset(sp500stocks, subset = open == max(open), select = c(symbol, date, open))
##      symbol       date    open
## 6999    NVR 2021-05-06 5149.68

One of the benefits of using subset() is that the code is more readable.

If we don’t specify the columns, we are going to get all the columns back.

subset(sp500stocks, subset = open == max(open))
##      symbol       date    open    high     low   close volume adjusted
## 6999    NVR 2021-05-06 5149.68 5151.35 5075.51 5134.44  13300  5134.44

tidyverse functions

tidyverse packages operate on “tidy data”. Tidy data has a specific structure: each variable is a column; each observation is a row; and each type of observational unit is a table. That is straightforward.

In addition to the tidyverse and base R approaches, data.table is another powerful package that takes care of data manipulation tasks.

We’ll primarily rely on dplyr and base R functions in this chapter. dplyr provides fast and intuitive ways to manipulate date frames.

library(dplyr)

%>% pipe opeartor

A side note on tidyverse pipe operator %>% before we proceed. The pipe operator %>% comes from the package magrittr.

%>% chains functions together, passing the output of the former function to the input of the next function. Step by step, each new operation is performed on the output from the previous one. In this way, we don’t have to nest a lot of functions, which also makes the code more readable.

session %>% 
  filter(!is.na(performance)) %>%
  group_by(date, performance) %>%
  add_tally(interval) %>%
  rename(sum_day = n)

In the example below, we use filter() to only keep the stocks of AFL.

sp500stocks %>% filter(symbol == "AFL")

This one keeps the record with the highest open price, and uses select() to keep the three variables symbol, date and open of that record.

sp500stocks %>% filter(open == max(open)) %>% select(symbol, date, open)

Adding - deselects variables.

sp500stocks %>% filter(symbol == "AFL") %>% select(-open, -close)