12 Data Manipulation (I)
Data manipulation is the “dirty work” that must be done before we start data analysis properly.
In this and the next several chapters, we’ll address the essential tasks in data manipulation one by one, and apply those tricks to our datasets. This chapter focuses on data manipulation for exploratory data analysis.
12.1 Demo datasets
In the next several chapters, we’ll use two demo datasets. The first dataset is a table S&P 500 component stocks scraped from the Wikipedia page List of S&P 500 companies following this tutorial.
Later, we’ll use the Symbol of the S&P 500 to match and collect historical Yahoo Finance data so that we have both the historical performance of the stocks and the descriptions of the firms.
## # A tibble: 3 × 9
## Symbol Security `SEC filings` `GICS Sector` `GICS Sub-Industry` Headquarters Locatio…¹ `Date first added` CIK
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 MMM 3M Comp… reports Industrials Industrial Conglom… St. Paul, Minnesota 1976-08-09 6.67e4
## 2 ABT Abbott … reports Health Care Health Care Equipm… North Chicago, Illino… 1964-03-31 1.8 e3
## 3 ABBV AbbVie … reports Health Care Pharmaceuticals North Chicago, Illino… 2012-12-31 1.55e6
## # ℹ abbreviated name: ¹`Headquarters Location`
## # ℹ 1 more variable: Founded <chr>
The second dataset is the stock performance of these S&P 500 companies collected from the Yahoo Finance historical data page.
## symbol date open high low close volume adjusted
## 1 MMM 2021-04-12 198.20 199.29 197.06 197.83 1682100 197.83
## 2 MMM 2021-04-13 196.01 197.33 195.37 196.47 2259700 196.47
## 3 MMM 2021-04-14 195.55 197.39 195.02 196.75 1575000 196.75
symbol
symbol
is a stock’s symbol, also known as ticker. It is a unique series of letters assigned to a security for trading purposes on a particular stock market. For instance, stocks listed on the New York Stock Exchange (NYSE) can have four or fewer letters. NASDAQ-listed securities can have up to five characters. In short, symbols are a way of describing a company’s stock.
prices
In stock trading, open
is the price of the stock when the market opened on this day. close
is the price at which a stock ended trading in the same period.
high
means the highest price in a given period of time. low
means the lowest price in a given period of time.
volume
is the total amount of trading activity. It is the number of shares that have been bought and sold for the day.
adjusted
is adjusted close price. It is the closing price after adjustments for all applicable splits and dividend distributions.
12.2 First look at our datasets
head()
, tail()
head()
and tail()
let us review the first or last several records in our datasets.
## symbol date open high low close volume adjusted
## 10096 ZTS 2021-05-03 174.70 176.23 173.35 175.24 1886800 175.24
## 10097 ZTS 2021-05-04 174.43 174.82 172.42 173.77 1849700 173.77
## 10098 ZTS 2021-05-05 174.20 174.30 172.77 173.86 1433500 173.86
## 10099 ZTS 2021-05-06 167.25 168.63 165.22 167.52 3200100 167.52
## 10100 ZTS 2021-05-07 169.19 173.26 169.00 171.55 1492500 171.55
summary()
, str()
summary()
provides summaries of our dataset. It is also useful when we need to review result summaries from model fitting functions.
For our data frame sp500stocks
, summary()
provides the summary statistics for numeric variables.
## symbol date open high low close
## Length:10100 Length:10100 Min. : 9.77 Min. : 10.52 Min. : 9.70 Min. : 10.11
## Class :character Class :character 1st Qu.: 60.95 1st Qu.: 61.61 1st Qu.: 60.45 1st Qu.: 61.11
## Mode :character Mode :character Median : 113.34 Median : 114.58 Median : 112.11 Median : 113.60
## Mean : 189.65 Mean : 191.56 Mean : 187.62 Mean : 189.72
## 3rd Qu.: 206.42 3rd Qu.: 208.43 3rd Qu.: 204.78 3rd Qu.: 206.72
## Max. :5149.68 Max. :5234.27 Max. :5109.32 Max. :5211.38
## volume adjusted
## Min. : 9800 Min. : 10.11
## 1st Qu.: 859275 1st Qu.: 61.11
## Median : 1749450 Median : 113.60
## Mean : 4279249 Mean : 189.68
## 3rd Qu.: 4053675 3rd Qu.: 206.72
## Max. :230430600 Max. :5211.38
We can also use summary()
to summarize variables, in addition to data frames.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10.52 61.61 114.58 191.56 208.43 5234.27
str()
displays a structure of an object.
## 'data.frame': 10100 obs. of 8 variables:
## $ symbol : chr "MMM" "MMM" "MMM" "MMM" ...
## $ date : chr "2021-04-12" "2021-04-13" "2021-04-14" "2021-04-15" ...
## $ open : num 198 196 196 197 199 ...
## $ high : num 199 197 197 198 200 ...
## $ low : num 197 195 195 196 198 ...
## $ close : num 198 196 197 197 199 ...
## $ volume : int 1682100 2259700 1575000 1461600 2769500 2200500 1653100 2649000 2215800 1864100 ...
## $ adjusted: num 198 196 197 197 199 ...
tibble
is essentially tidyverse
’s data frame, and that won’t change the way we work with data frames fundamentally.
## spc_tbl_ [10,100 × 8] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ symbol : chr [1:10100] "MMM" "MMM" "MMM" "MMM" ...
## $ date : Date[1:10100], format: "2021-04-12" "2021-04-13" "2021-04-14" "2021-04-15" ...
## $ open : num [1:10100] 198 196 196 197 199 ...
## $ high : num [1:10100] 199 197 197 198 200 ...
## $ low : num [1:10100] 197 195 195 196 198 ...
## $ close : num [1:10100] 198 196 197 197 199 ...
## $ volume : num [1:10100] 1682100 2259700 1575000 1461600 2769500 ...
## $ adjusted: num [1:10100] 198 196 197 197 199 ...
## - attr(*, "spec")=
## .. cols(
## .. symbol = col_character(),
## .. date = col_date(format = ""),
## .. open = col_double(),
## .. high = col_double(),
## .. low = col_double(),
## .. close = col_double(),
## .. volume = col_double(),
## .. adjusted = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
12.3 Handling missing values
This section introduces several methods to identify and replace missing values.
identifying missing values
summary()
, is.na()
, and complete.cases()
can be used to identify missing values. which()
can be used to return locations of missing values. sum()
can gives us counts of missing values.
summary()
To start with, we can find out if a numeric variable contains missing values or not by summary()
.
As shown below, we don’t have such variables in our datasets.
## symbol date open high low close
## Length:10100 Length:10100 Min. : 9.77 Min. : 10.52 Min. : 9.70 Min. : 10.11
## Class :character Class :character 1st Qu.: 60.95 1st Qu.: 61.61 1st Qu.: 60.45 1st Qu.: 61.11
## Mode :character Mode :character Median : 113.34 Median : 114.58 Median : 112.11 Median : 113.60
## Mean : 189.65 Mean : 191.56 Mean : 187.62 Mean : 189.72
## 3rd Qu.: 206.42 3rd Qu.: 208.43 3rd Qu.: 204.78 3rd Qu.: 206.72
## Max. :5149.68 Max. :5234.27 Max. :5109.32 Max. :5211.38
## volume adjusted
## Min. : 9800 Min. : 10.11
## 1st Qu.: 859275 1st Qu.: 61.11
## Median : 1749450 Median : 113.60
## Mean : 4279249 Mean : 189.68
## 3rd Qu.: 4053675 3rd Qu.: 206.72
## Max. :230430600 Max. :5211.38
Let’s load another dataset from dplyr
just to see what happens when we have missing values in a numeric variable.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 8.00 35.00 52.00 87.57 72.00 896.00 44
is.na()
is.na()
evaluates whether elements are missing. We will get TRUE
if an element is missing, and FALSE
if an element is not missing.
It is self-explanatory that !is.na()
evaluates whether elements are NOT missing.
As shown below, Date first added
is a character vector that has quite some missing values.
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [19] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
which(is.na())
Now, how would I know where in the dataset these missing values occur?
We can use which()
, which evaluates which indices are TRUE.
## [1] 123 127 135 136 151 161 178 191 192 204 205 211 239 240 270 276 285 294 302 307 310 313 325 326 328 340 342
## [28] 355 362 374 379 392 396 404 409 415 424 425 433 435 442 454 457 467 474 477 484 485 492 493 494
sum(is.na())
, mean(is.na())
Two useful tricks to identify missing values are sum(is.na())
and mean(is.na())
.
sum(is.na())
tells us how many missing values there are in a variable.
## [1] 51
mean(is.na())
tells us if there are any missing values in a variable. If there is no missing value, we should get 0. If there indeed are missing values, the returned value should be larger than 0.
## [1] 0.1009901
complete.cases()
complete.cases()
returns a logical vector indicating which cases are complete, i.e., have no missing values.
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [23] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
To take subsets where Date first added
is complete:
## # A tibble: 454 × 9
## Symbol Security `SEC filings` `GICS Sector` `GICS Sub-Industry` Headquarters Locatio…¹ `Date first added`
## <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 MMM 3M Company reports Industrials Industrial Conglom… St. Paul, Minnesota 1976-08-09
## 2 ABT Abbott Labora… reports Health Care Health Care Equipm… North Chicago, Illino… 1964-03-31
## 3 ABBV AbbVie Inc. reports Health Care Pharmaceuticals North Chicago, Illino… 2012-12-31
## 4 ABMD Abiomed reports Health Care Health Care Equipm… Danvers, Massachusetts 2018-05-31
## 5 ACN Accenture reports Information … IT Consulting & Ot… Dublin, Ireland 2011-07-06
## 6 ATVI Activision Bl… reports Communicatio… Interactive Home E… Santa Monica, Califor… 2015-08-31
## 7 ADBE Adobe Inc. reports Information … Application Softwa… San Jose, California 1997-05-05
## 8 AMD Advanced Micr… reports Information … Semiconductors Santa Clara, Californ… 2017-03-20
## 9 AAP Advance Auto … reports Consumer Dis… Automotive Retail Raleigh, North Caroli… 2015-07-09
## 10 AES AES Corp reports Utilities Independent Power … Arlington, Virginia 1998-10-02
## # ℹ 444 more rows
## # ℹ abbreviated name: ¹`Headquarters Location`
## # ℹ 2 more variables: CIK <dbl>, Founded <chr>
replacing missing values
There are cases where we would like to recode the missing values. For instance, in survey data, when a respondent refused to respond, these cases could be coded as numbers like -999. If we don’t read the codebook carefully, and just treat those records as facts, we could make serious mistakes. Therefore, we should recode those values to system missing values. In R, we want to recode them to NA
.
survey$variable[survey$variable == -999] <- NA
There are also cases where we would like to recode missing values to a meaningful number, such as 0.
df$variable[is.na(df$variable)] <- 0
12.4 Subsets
Sometimes we examine subsets of our data when we notice something that does not make sense and would like to take a closer look at it. Other times we may want to take a subset from the entire dataset and only work on that subset for further operations.
There are several ways to take subsets.
The first option is data frame indexing. For instance, we can take a subset on the stock AFL.
What about the stocks that have the highest or lowest open prices and on which days?
## symbol date open
## 6999 NVR 2021-05-06 5149.68
## symbol date open
## 5968 MRO 2021-04-21 9.77
subset()
In addition to using the indexing operators, we can also use the function subset()
to subset a data frame that meet conditions.
subset(x, subset, select)
The subset
argument is a logical expression indicating elements or rows to keep; missing values are treated as FALSE
. The select
argument indicates columns to select from a data frame.
## symbol date open
## 6999 NVR 2021-05-06 5149.68
One of the benefits of using subset()
is that the code is more readable.
If we don’t specify the columns, we are going to get all the columns back.
## symbol date open high low close volume adjusted
## 6999 NVR 2021-05-06 5149.68 5151.35 5075.51 5134.44 13300 5134.44
tidyverse
functions
tidyverse
packages operate on “tidy data”. Tidy data has a specific structure: each variable is a column; each observation is a row; and each type of observational unit is a table. That is straightforward.
In addition to the tidyverse
and base R approaches, data.table
is another powerful package that takes care of data manipulation tasks.
We’ll primarily rely on dplyr
and base R functions in this chapter. dplyr
provides fast and intuitive ways to manipulate date frames.
%>%
pipe opeartor
A side note on tidyverse
pipe operator %>%
before we proceed. The pipe operator %>%
comes from the package magrittr
.
%>%
chains functions together, passing the output of the former function to the input of the next function. Step by step, each new operation is performed on the output from the previous one. In this way, we don’t have to nest a lot of functions, which also makes the code more readable.
session %>%
filter(!is.na(performance)) %>%
group_by(date, performance) %>%
add_tally(interval) %>%
rename(sum_day = n)
In the example below, we use filter()
to only keep the stocks of AFL.
This one keeps the record with the highest open price, and uses select()
to keep the three variables symbol
, date
and open
of that record.
Adding -
deselects variables.