# Chapter 3 R Data Modeling

## 3.1 Imagine This

Your project team is assigned to work with the accounts receivable team. Specifically, accounts receivable is about to review a portfolio of customers from a potential acquisition. Managers would like to query data provided through due diligence during the acquisition process. Some of the questions include:

- What is the income risk across applicant pools?
- Are there differences in applicant income?
- Does age matter?
- Is there a pattern of family dependents across applicant pools?
- How much income per dependent?

The analytics team will process the data, review its quality, and help accounts receivable answer these questions.

In this chapter we will build approaches to manage such queries, including pivot tables, lookups, and the creation of new metrics from existing data. We will expand this assortment of skills into the writing of functions, such as net present value and internal rate of return, more plotting of data, working with time series data, and fitting data to probability distributions.

## 3.2 Pivot tables and Vertical Lookups

These are, mythically at least, two of the most-used Excel features. Pivot tables are the slice and dice machine we use to partition data sets. Lookups allow us to relate one data table to another. We will explore these tools in `R`

, here made easier and less apt to crash on large data sets. We start with some definitions.

### 3.2.1 Some definitions

The pivot table is a data summarization tool that can automatically sort, count, total, or give the average of the data stored in one table or spreadsheet, displaying the results in a second table showing the summarized data. This tool transforms a flat table of fields with rows of data into a table with grouped row values and column header values. The specification of grouped row values and column headers can rotate the flat table’s data rows into the intersection of the row and column labels.

“V” or “vertical” stands for the looking up of a value in a column. This feature allows the analyst to find approximate and exact matches between the look up value and a table value in a vertical column assigned to the look up value. A `HLOOKUP`

function does the same lookup but for a specified row instead of a column.

### 3.2.2 Pivot and Parry

Let’s return to the Credit Card Applicant business questions:

- What is the income risk across applicant pools?
- Are there differences in applicant income?
- Does age matter?
- Is there a pattern of dependents across applicant pools?
- How much income per dependent?

The first step in building an analysis of the data relative to these questions is to understand the required dimensions of the data that apply to the questions. Here we would scan the table column names in the data base and look for

- Card status
- Ownership
- Employment

```
CreditCard <- read.csv("data/CreditCard.csv")
str(CreditCard)
```

```
## 'data.frame': 1319 obs. of 13 variables:
## $ card : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ reports : int 0 0 0 0 0 0 0 0 0 0 ...
## $ age : num 37.7 33.2 33.7 30.5 32.2 ...
## $ income : num 4.52 2.42 4.5 2.54 9.79 ...
## $ share : num 0.03327 0.00522 0.00416 0.06521 0.06705 ...
## $ expenditure: num 124.98 9.85 15 137.87 546.5 ...
## $ owner : Factor w/ 2 levels "no","yes": 2 1 2 1 2 1 1 2 2 1 ...
## $ selfemp : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ dependents : int 3 3 4 0 2 0 2 0 0 0 ...
## $ months : int 54 34 58 25 64 54 7 77 97 65 ...
## $ majorcards : int 1 1 1 1 1 1 1 1 1 1 ...
## $ active : int 12 13 5 7 5 1 5 3 6 18 ...
## $ state : Factor w/ 3 levels "CT","NJ","NY": 3 3 3 3 3 3 3 3 3 3 ...
```

The `str()`

function allows us to see all of the objects in `CreditCard`

. Next lext look at the data itself inside this object using `head`

(for the beginning of the data).

`head(CreditCard, 3)`

```
## card reports age income share expenditure owner selfemp
## 1 yes 0 37.66667 4.52 0.033269910 124.983300 yes no
## 2 yes 0 33.25000 2.42 0.005216942 9.854167 no no
## 3 yes 0 33.66667 4.50 0.004155556 15.000000 yes no
## dependents months majorcards active state
## 1 3 54 1 12 NY
## 2 3 34 1 13 NY
## 3 4 58 1 5 NY
```

Knowing the structure and a sample of the data, we can build a summary of the data and review the minimum, maximum, and quartiles in each of `CreditCard`

’s columns of data.

`summary(CreditCard)`

```
## card reports age income
## no : 296 Min. : 0.0000 Min. : 0.1667 Min. : 0.210
## yes:1023 1st Qu.: 0.0000 1st Qu.:25.4167 1st Qu.: 2.244
## Median : 0.0000 Median :31.2500 Median : 2.900
## Mean : 0.4564 Mean :33.2131 Mean : 3.365
## 3rd Qu.: 0.0000 3rd Qu.:39.4167 3rd Qu.: 4.000
## Max. :14.0000 Max. :83.5000 Max. :13.500
## share expenditure owner selfemp
## Min. :0.0001091 Min. : 0.000 no :738 no :1228
## 1st Qu.:0.0023159 1st Qu.: 4.583 yes:581 yes: 91
## Median :0.0388272 Median : 101.298
## Mean :0.0687322 Mean : 185.057
## 3rd Qu.:0.0936168 3rd Qu.: 249.036
## Max. :0.9063205 Max. :3099.505
## dependents months majorcards active
## Min. :0.0000 Min. : 0.00 Min. :0.0000 Min. : 0.000
## 1st Qu.:0.0000 1st Qu.: 12.00 1st Qu.:1.0000 1st Qu.: 2.000
## Median :1.0000 Median : 30.00 Median :1.0000 Median : 6.000
## Mean :0.9939 Mean : 55.27 Mean :0.8173 Mean : 6.997
## 3rd Qu.:2.0000 3rd Qu.: 72.00 3rd Qu.:1.0000 3rd Qu.:11.000
## Max. :6.0000 Max. :540.00 Max. :1.0000 Max. :46.000
## state
## CT:442
## NJ:472
## NY:405
##
##
##
```

We immediately see an age minimum of 0.2. Either this is an anomaly, or outright error, or there is an application not quite a year old!. Let’s filter the data for ages greater than 18 to be safe.

```
ccard <- CreditCard[CreditCard$age >=
18, ]
```

In the filter, the comma means keep data on applicants only at or in excess of 18 years of age. When we leave the column empty, it means apply this filter across all columns. We next review the distribution of ages of applicants to be sure our filter does the job properly. The function `hist()`

builds a simple frequency histogram to visualize this data.

`hist(ccard$age)`

### 3.2.3 Try this exercise

What is the basic design of this inquiry?

- Business questions?
- Dimensions?
- Taxonomy and metrics?

To answer 1 and 2 we have business questions along the lines of indicator variables:

- Card issued (
`card`

) - Own or rent (
`owner`

) - Self-employed or not (
`selfemp`

)

For 3 our basic taxonomy is:

- For each card issued…in New York
- …and for each owner…
- …who is employed…
- What are the range of income, average dependents, age, and income per dependent?

Here is the basic 3 step pivot table design. We should check if we have installed the `dplyr`

package into the `R`

environment.

```
# install.packages('dplyr') if not
# already
require(dplyr)
## 1: filter to keep three states.
pivot.table <- filter(ccard, state %in%
"NY")
## 2: set up data frame for by-group
## processing.
pivot.table <- group_by(pivot.table,
card, owner, selfemp)
## 3: calculate the three summary
## metrics
options(dplyr.width = Inf) ## to display all columns
pivot.table <- summarise(pivot.table,
income.cv = sd(income)/mean(income),
age.avg = mean(age), income.per.dependent = sum(income)/sum(dependents))
```

We then visualize results in a table. Here we use `knitr`

, which is a package that powers `rmarkdown`

. The function `kable()`

is short for “knitr table.”

`knitr::kable(pivot.table)`

card | owner | selfemp | income.cv | age.avg | income.per.dependent |
---|---|---|---|---|---|

no | no | no | 0.4941859 | 31.91936 | 3.645848 |

no | no | yes | 0.5652634 | 26.38542 | 2.852000 |

no | yes | no | 0.3756274 | 36.01786 | 2.157589 |

no | yes | yes | NaN | 53.33333 | Inf |

yes | no | no | 0.3298633 | 28.09311 | 5.313677 |

yes | no | yes | 0.4367858 | 37.45238 | 7.062500 |

yes | yes | no | 0.5519888 | 36.79503 | 3.154476 |

yes | yes | yes | 0.5032180 | 41.91667 | 3.194547 |

### 3.2.4 Now to VLOOKUP

Let’s start with a different data set. We load this IBRD (World Bank) data that has

- The variable
`life.expectancy`

is the average life expectancy for each country from 2009 through 2014. - The variable
`sanitation`

is the percentage of population with direct access to sanitation facilities.

```
le <- read.csv("data/life_expectancy.csv",
header = TRUE, stringsAsFactors = FALSE)
sa <- read.csv("data/sanitation_.csv",
header = TRUE, stringsAsFactors = FALSE)
```

Always we look at the first few records.

`head(le)`

```
## country years.life.expectancy.avg
## 1 Afghanistan 46.62135
## 2 Albania 71.11807
## 3 Algeria 61.81652
## 4 Angola 41.65847
## 5 Antigua and Barbuda 69.81219
## 6 Arab World 60.93432
```

`head(sa)`

```
## country sanitation.avg
## 1 Afghanistan 25.39600
## 2 Albania 85.36154
## 3 Algeria 84.21538
## 4 American Samoa 61.73077
## 5 Andorra 100.00000
## 6 Angola 36.00769
```

The job here is to join sanitation data with life expectancy data, by country. In Excel we would typically use a `VLOOKUP(country, sanitation, 2, FALSE)`

statement.

- In this statement
`country`

is the value to be looked up, for example, “Australia”. - The variable
`sanitation`

is the range of the sanitation lookup table of two columns of country and sanitation data, for example,`B2:C104`

in Excel. - The
`2`

is the second column of the sanitation lookup table, for example column`C`

. `FALSE`

means don’t find an exact match.

In `R`

we can use the `merge()`

function.

```
life.sanitation <- merge(le[, c("country",
"years.life.expectancy.avg")], sa[,
c("country", "sanitation.avg")])
```

The whole range of countries is populated by the lookup.

`head(life.sanitation, 3)`

```
## country years.life.expectancy.avg sanitation.avg
## 1 Afghanistan 46.62135 25.39600
## 2 Albania 71.11807 85.36154
## 3 Algeria 61.81652 84.21538
```

### 3.2.5 Try this exercise

We will load yet another data set on house prices. Suppose we work for a housing developer like Toll Brothers (NYSE: TOL) and want to allocate resources to marketing and financing the building of luxury homes in major US metropolitan areas. We have data for one test market.

`hprice <- read.csv("data/hprice.csv")`

Let’s look at the available data:

`summary(hprice)`

```
## ID Price SqFt Bedrooms
## Min. : 1.00 Min. : 69100 Min. :1450 Min. :2.000
## 1st Qu.: 32.75 1st Qu.:111325 1st Qu.:1880 1st Qu.:3.000
## Median : 64.50 Median :125950 Median :2000 Median :3.000
## Mean : 64.50 Mean :130427 Mean :2001 Mean :3.023
## 3rd Qu.: 96.25 3rd Qu.:148250 3rd Qu.:2140 3rd Qu.:3.000
## Max. :128.00 Max. :211200 Max. :2590 Max. :5.000
## Bathrooms Offers Brick Neighborhood
## Min. :2.000 Min. :1.000 No :86 East :45
## 1st Qu.:2.000 1st Qu.:2.000 Yes:42 North:44
## Median :2.000 Median :3.000 West :39
## Mean :2.445 Mean :2.578
## 3rd Qu.:3.000 3rd Qu.:3.000
## Max. :4.000 Max. :6.000
```

Our business questions include:

- What are the most valuable (higher price) neighborhoods?
- What housing characteristics maintain the most housing value?

First, where and what are the most valuable houses? One way to answer this is to build a pivot table. Next we pivot the data and build metrics into the query. We will use the `mean()`

and standard deviation `sd()`

functions to help answer our questions.

```
require(dplyr)
## 1: filter to those houses with
## fairly high prices
pivot.table <- filter(hprice, Price >
99999)
## 2: set up data frame for by-group
## processing
pivot.table <- group_by(pivot.table,
Brick, Neighborhood)
## 3: calculate the summary metrics
options(dplyr.width = Inf) ## to display all columns
pivot.table <- summarise(pivot.table,
Price.avg = mean(Price), Price.cv = sd(Price)/mean(Price),
SqFt.avg = mean(SqFt), Price.per.SqFt = mean(Price)/mean(SqFt))
```

Then we visualize in a table.

`knitr::kable(pivot.table)`

Brick | Neighborhood | Price.avg | Price.cv | SqFt.avg | Price.per.SqFt |
---|---|---|---|---|---|

No | East | 121095.7 | 0.1251510 | 2019.565 | 59.96125 |

No | North | 115307.1 | 0.0939797 | 1958.214 | 58.88382 |

No | West | 148230.4 | 0.0912350 | 2073.478 | 71.48878 |

Yes | East | 135468.4 | 0.0977973 | 2031.053 | 66.69863 |

Yes | North | 118457.1 | 0.1308498 | 1857.143 | 63.78462 |

Yes | West | 175200.0 | 0.0930105 | 2091.250 | 83.77764 |

Based on this data set from one metropolitan area, the most valuable properties (fetching the highest average price and price per square foot) are made of brick in the West neighborhood. Brick or not, the West neighborhood also seems have the lowest relative variation in price.

Now for something different: functions.

## 3.3 Why Functions?

We will encapsulate several operations into a reusable storage device called a function. The usual suspects and candidates for the use of functions are:

- Data structures rack together related values into one object.
- Functions group related commands into one object.

In both cases the logic and coding is easier to understand, easier to work with, easier to build into larger things, and less prone to breaches of plain-old stubby finger breaches of operational safety and security.

For example, here is an Excel look-alike NPV function. We enter this into a code-chunk in an `R markdown`

file or directly into the console to store the function into the current `R`

environment. Once that is done, we now have a new function we can use like any other function.

```
## Net Present Value function Inputs:
## vector of rates (rates) with 0 as
## the first rate for time 0, vector
## of cash flows (cashflows) Outputs:
## scalar net present value
NPV.1 <- function(rates, cashflows) {
NPV <- sum(cashflows/(1 + rates)^(seq_along(cashflows) -
1))
return(NPV)
}
```

The structure of a function has these parts:

- A header describes the function along with inputs and outputs. Here we use comment characters
`#`

to describe and document the function. - A definition names the function and identify the interface of inputs and outputs to the programming environment. The name is like a variable and is assigned to
`function()`

, where inputs are defined. - Code statements take the inputs from the definition and program the tasks, logic, and decisions in the function’s work flow into output.
- An output statement releases the function’s results for use in other code statements outside of the function’s “mini-verse.” We use the formal
`return()`

function to identify the output that the function will produce. If we did not use`return()`

, then`R`

will us the last assigned variable as the output of the function.

In this example We generate data internal to the function:

- We use
`seq_along`

to generate time index of cashflows. - We must subtract 1 from this sequence as starting cashflow is time 0.
- We generate a net present value directly in one line of code.

Our functions get used just like the built-in ones, for example, `mean()`

. Let’s define `rates`

and `cashflows`

as vector inputs to the `NPV.1()`

function and run this code.

```
rates <- c(0, 0.08, 0.06, 0.04) ## first rate is always 0.00
cashflows <- c(-100, 200, 300, 10)
NPV.1(rates, cashflows)
```

`## [1] 361.0741`

We go back to the declaration and look at the parts:

```
## Net Present Value function Inputs:
## vector of rates (rates) with 0 as
## the first rate for time 0, vector
## of cash flows (cashflows) Outputs:
## scalar net present value
NPV.1 <- function(rates, cashflows) {
NPV <- sum(cashflows/(1 + rates)^(seq_along(cashflows) -
1))
return(NPV)
}
```

**Interfaces** refer to these components:

**inputs**or**arguments****outputs**or**return value**- Calls other functions
`sum`

,`seq_along()`

, operators`/`

,`+`

,`^`

and`-`

.

We can also call other functions we’ve written. We use `return()`

to explicitly say what the output is. This is simply good documentation. Alternately, a function will return the last evaluation.

Comments, that is, lines that begin with `#`

, are not required by `R`

, but are always a good and welcome idea that provide a terse description of purpose and direction. Initial comments should also include a listing of inputs, also called “arguments,” and outputs.

### 3.3.1 What should be a function?

Functions should be written for code we are going to re-run, especially if it will be re-run with changes in inputs. They can also be code chunks we keep highlighting and hitting return on. We often write functions for code chunkswhich are small parts of bigger analyses.

In the next redition of `irr.1`

we improve the code with named and default arguments.

```
## Internal Rate of Return (IRR)
## function Inputs: vector of cash
## flows (cashflows), scalar
## interations (maxiter) Outputs:
## scalar net present value
IRR.1 <- function(cashflows, maxiter = 1000) {
t <- seq_along(cashflows) - 1
## rate will eventually converge to
## IRR
f <- function(rate) (sum(cashflows/(1 +
rate)^t))
## use uniroot function to solve for
## root (IRR = rate) of f = 0 c(-1,1)
## bounds solution for only positive
## or negative rates select the root
## estimate
return(uniroot(f, c(-1, 1), maxiter = maxiter)$root)
}
```

Here the *default argument* is `maxiter`

which controls the number of iterations. At our peril we can eliminate this argument if we want. This illustrates yet another need for functions: we can put error and exception logic to handle somtimes fatal issues our calculations might present.

Here are the cashflows for a 3% coupon bond bought at a hefty premium.

```
cashflows <- c(-150, 3, 3, 3, 3, 3, 3,
3, 103)
IRR.1(cashflows)
```

`## [1] -0.02554088`

`IRR.1(cashflows, maxiter = 100)`

`## [1] -0.02554088`

We get a negative IRR or yield to maturity on this net present value = 0 calculation.

### 3.3.2 Shooting trouble

*Problem*: We see “odd”" behavior when arguments aren’t as we expect.

`NPV.1(c(0.1, 0.05), c(-10, 5, 6, 100))`

`## [1] 86.10434`

We do get a result, but…

- What does it mean?
- What rates correspond with what cashflows?

Here the function calculates a net present value. But the analyst entered two rates for four cash flows.

*Solution*: We put sanity checks into the code.

- Let’s use the
`stopifnot(some logical statment)`

is TRUE.

```
## Net Present Value function Inputs:
## vector of rates (rates) with 0 as
## the first rate for time 0, vector
## of cash flows (cashflows), length
## of rates must equal length of
## cashflows Outputs: scalar net
## present value
NPV.2 <- function(rates, cashflows) {
stopifnot(length(rates) == length(cashflows))
NPV <- sum(cashflows/(1 + rates)^(seq_along(cashflows) -
1))
return(NPV)
}
```

Here are some thoughts about `stopifnot`

`TRUE`

error handling

- Arguments to
`stopifnot()`

are a series of logical expressions which should all be TRUE. - Execution halts, with error message, at
*first*FALSE.

`NPV.2(c(0.1, 0.05), c(-10, 5, 6, 100))`

Hit (not too hard!) the `Escape`

key on your keyboard, This will take you out of `Browse[1]>`

mode and back to the console prompt `>`

.

### 3.3.3 What the function can see and do

Each function has its own environment. Names here will override names in the global environment. The function’s internal environment starts with the named arguments. Assignments inside the function only change the internal environment. If a name is not defined in the function, the function will look for this name in the environment the function gets called from.

### 3.3.4 Try this …

Your company is running a 100 million pound sterling project in the EU. You must post 25% collateral in a Landesbank using only high-quality government securities. You find a high-quality gilt fund that will pay 1.5% (coupon rate) annually for three years.

Some questions for analysis

- How much would you pay for this collateral if the rate curve (yield to maturity of cash flows) is (from next year on…)

`rates <- c(-0.001, 0.002, 0.01)`

- Suppose a bond dealer asks for 130% of notional collateral value for this bond. What is the yield on this transaction (IRR)? Would you buy it?
- What is the return on this collateral if you terminate the project in one year and liquidate the collateral (i.e., sell it for cash) if the yield shifts down by 0.005? This is a “parallel” shift, which is finance for: “take each rate and deduct 0.005.”

To get at these requirements we will build rates and cash flows across the 3-year time frame, remembering our previous work.

`(rates <- c(0, rates))`

`## [1] 0.000 -0.001 0.002 0.010`

```
collateral.periods <- 3
collateral.rate <- 0.25
collateral.notional <- collateral.rate *
100
coupon.rate <- 0.015
cashflows <- rep(coupon.rate * collateral.notional,
collateral.periods)
cashflows[collateral.periods] <- collateral.notional +
cashflows[collateral.periods]
(cashflows <- c(0, cashflows))
```

`## [1] 0.000 0.375 0.375 25.375`

What just happened?

- We appended a
`0`

to the rate schedule so we can use the`NPV.2`

function. - We then parameterized the term sheet (terms of the collateral transaction),
- We used
`rep()`

to form coupon cash flows. - Then we added notional value repayment to the last cash flow.

Now we can calculate the present value of the bond using `NPV.2`

.

`(Value.0 <- NPV.2(rates, cashflows))`

`## [1] 25.3776`

The answer is 25.378 million pounds sterling or `Value.0 / collateral.notional`

times the notional value.

The yield to maturity averages the forward rates across the bond cash flows. A “forward rate” is the rate per period we would expect to earn each period. This is one interpretation of the Internal Rate of Return (“IRR”).

```
cashflows.IRR <- cashflows
collateral.ask <- 130
cashflows.IRR[1] <- -(collateral.ask/100) *
collateral.notional
## mind the negative sign!
(collateral.IRR.1 <- IRR.1(cashflows.IRR))
```

`## [1] -0.07112366`

You end up paying over 7% per annum for the privilege of owning this bond! You call up the European Central Bank, report this overly hefty haircut on your project. You send out a request for proposal to other bond dealers. They come back with an average asking price of 109 (109% of notional).

```
cashflows.IRR <- cashflows
collateral.ask <- 109
cashflows.IRR[1] <- -(collateral.ask/100) *
collateral.notional
(collateral.IRR.1 <- IRR.1(cashflows.IRR))
```

`## [1] -0.01415712`

That’s more like it: about 140 basis points (1.41% x 100 basis points per percentage) cost (negative sign).

Let’s unwind the project, and the collateral transaction, in 1 year. Let’s suppose the yield curve in 1 year has parallel shifted down by 0.005.

```
rate.shift <- -0.005
rates.1 <- c(0, rates[-2]) + rate.shift
cashflows.1 <- c(0, cashflows[-2])
(Value.1 <- NPV.2(rates.1, cashflows.1))
```

`## [1] 25.37541`

```
(collateral.return.1 <- Value.1/(-cashflows.IRR[1]) -
1)
```

`## [1] -0.0687923`

This results ooks much more than a break-even return on the collateral transation:

```
(collateral.gainloss <- collateral.notional *
collateral.return.1) * 1e+06
```

`## [1] -1719807`

`## adjust for millions of euros`

That’s probably someone’s salary…(in pounds sterling).

### 3.3.5 Mind the Interface!

Interfaces mark out a controlled inner environment for our code;

- They allow us to interact with the rest of the system only at the interface.
- Arguments explicitly give the function all the information the function needs to operate and reduces the risk of confusion and error.

- There are exceptions such as true universals like \(\pi\).
- Likewise, output should only be through the return value.

Let’s build (parametric) distributions next.

## 3.4 Making distributions

As always, let’s load some data, this time from the Bureau of Labor Statistics (BLS) and load the export-import price index whose description and graph are at http://data.bls.gov/timeseries/EIUIR?output_view=pct_1mth. We look up the symbols “EIUIR” and “EIUIR100” and download a text file that we then convert to a comma separated variable or `csv`

file in Excel. We deposit the `csv`

file in a directory and read it into a variable called `EIUIR`

.

```
require(xts)
require(zoo)
EIUIR <- read.csv("data/EIUIR.csv")
head(EIUIR)
```

```
## Date Value
## 1 2006-01-01 113.7
## 2 2006-02-01 112.8
## 3 2006-03-01 112.7
## 4 2006-04-01 115.1
## 5 2006-05-01 117.2
## 6 2006-06-01 117.3
```

```
xmprice <- na.omit(EIUIR) ## to clean up any missing data
str(xmprice)
```

```
## 'data.frame': 131 obs. of 2 variables:
## $ Date : Factor w/ 131 levels "2006-01-01","2006-02-01",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ Value: num 114 113 113 115 117 ...
```

We might have to have installed separatetly the `xts`

and `zoo`

packages that handle time series data explicitly. The `str()`

function indicates that the `Value`

column in the data frame contains the export price series. We then compute the natural logarithm of prices and calculate the differences to get rates of growth from month to month. A simple plot reveals aspects of the data series to explore.

```
xmprice.r <- as.zoo(na.omit((diff(log(xmprice$Value))))) ## compute rates
head(xmprice.r)
```

```
## 1 2 3 4 5
## -0.0079470617 -0.0008869180 0.0210718947 0.0180805614 0.0008528785
## 6
## 0.0076433493
```

```
plot(xmprice.r, type = "l", col = "blue",
xlab = "Date", main = "Monthly 2/2000-9/2016")
```

We further transform the data exploring the absolute value of price rates. This is a first stab at understanding the clustering of volatility in financial-economic time series, a topic to which we will return.

```
xmprice.r.df <- data.frame(xmprice.r,
Date = index(xmprice.r), Rate = xmprice.r[,
1], Rate.abs <- abs(xmprice.r[,
1]))
head(xmprice.r.df)
```

```
## xmprice.r Date Rate Rate.abs....abs.xmprice.r...1..
## 1 -0.0079470617 1 -0.0079470617 0.0079470617
## 2 -0.0008869180 2 -0.0008869180 0.0008869180
## 3 0.0210718947 3 0.0210718947 0.0210718947
## 4 0.0180805614 4 0.0180805614 0.0180805614
## 5 0.0008528785 5 0.0008528785 0.0008528785
## 6 0.0076433493 6 0.0076433493 0.0076433493
```

`str(xmprice.r.df)`

```
## 'data.frame': 130 obs. of 4 variables:
## $ xmprice.r : num -0.007947 -0.000887 0.021072 0.018081 0.000853 ...
## $ Date : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Rate : num -0.007947 -0.000887 0.021072 0.018081 0.000853 ...
## $ Rate.abs....abs.xmprice.r...1..: num 0.007947 0.000887 0.021072 0.018081 0.000853 ...
```

We can achieve a “prettier” plot with the `ggplot2`

package. In the `ggplot`

statements we use `aes`

, “aesthetics”, to pick `x`

(horizontal) and `y`

(vertical) axes. The added (`+`

) `geom_line`

is the geometrical method that builds the line plot.

```
require(ggplot2)
ggplot(xmprice.r.df, aes(x = Date, y = Rate)) +
geom_line(colour = "blue")
```

Let’s try a bar graph of the absolute value of price rates. We use `geom_bar`

to build this picture.

```
require(ggplot2)
ggplot(xmprice.r.df, aes(x = Date, y = Rate.abs)) +
geom_bar(stat = "identity", colour = "green")
```

### 3.4.1 Try this exercise

Let’s overlay returns (`geom_line`

) and their absolute value `geom_bar`

.

`ggplot`

declares the canvas using the price data frame.`aes`

establishes the data series to be used to generate pictures.`geom_bar`

builds a bar chart.`geom_line`

overlays the bar chart with a line chart.

By examining this chart, what business questions about your Univeral Export-Import Ltd supply chain could this help answer? Why is this helpful?

```
require(ggplot2)
ggplot(xmprice.r.df, aes(Date, Rate.abs)) +
geom_bar(stat = "identity", colour = "darkorange") +
geom_line(data = xmprice.r.df, aes(Date,
Rate), colour = "blue")
```

The plot goes a long way to answering the question: When supply and demand tightens, does price volatility cluster?

- If we are selling, we would experience strong swings in demand and thus in revenue at the customer fulfillment end of the chain.
- If we are buying, we would experience strong swings in cost and input product utilization at the procurement end of the chain.
- For the financial implications: we would have a tough time making the earnings we forecast to the market.

### 3.4.2 Picture this

We import goods as input to our manufacturing process. We might want to know the odds that a very high export-import rate might occur. We can plot a cumulative distribution function (*cdf* or *CDF*) call. we can build this plot using the `stat_ecdf()`

function in `ggplot2`

.

```
require(ggplot2)
ggplot(xmprice.r.df, aes(Rate)) + stat_ecdf(colour = "blue")
```

### 3.4.3 Try another exercise

- Suppose the procurement team’s delegation of authority remit states: “Procurement may approve input invoices when there is only a 5% chance that prices will rise any higher than the price rate associated with that tolerance. If input prices do rise higher than the tolerable rate, you must get divisional approval.”
- Plot a vertical line to indicate the maximum tolerable rate for procurement using the BLS EIUR data from 2000 to the present.

- Use
`r.tol <- quantile(xmprice.r.df$Rate, 0.95)`

to find the tolerable rate. - Use
`+ geom_vline(xintercept = r.tol)`

in the CDF plot.

We can implement these requirements with the following code.

```
require(ggplot2)
r.tol.pct <- 0.95
r.tol <- quantile(xmprice.r.df$Rate,
r.tol.pct)
r.tol.label <- paste("Tolerable Rate = ",
round(r.tol, 2))
ggplot(xmprice.r.df, aes(Rate)) + stat_ecdf(colour = "blue",
size = 1.5) + geom_vline(xintercept = r.tol,
colour = "red", size = 1.5) + annotate("text",
x = r.tol - 0.05, y = 0.75, label = r.tol.label,
colour = "darkred")
```

This may be a little more than we bargained for originally. We used the `paste`

and `round`

(to two, `2`

, decimal places) functions to make a label. We made much thicker lines (`size = 1.5`

). At 2% we drew a line with `geom_vline()`

and annotated the line with text.

Now that we have *made* some distributions out of live data, let’s estimate the parameters of specific distributions that might be fit to that data.

## 3.5 Optimization

The optimization we will conduct here helps us to find the distribution that best fits the data. We will use results from optimization to simulate that data to help us make decisions *prospectively*.

There are many distributions in `R`

: `?distributions`

will tell you all about them.

- If
`name`

is the name of a distribution (e.g.,`norm`

for “normal”), then

- dname = the probability
*d*ensity (if continuous) or probability mass function of`name`

(pdf or pmf), think “histogram” - pname = the cumulative
*p*robability function (CDF), think “s-curve” - qname = the
*q*uantile function (inverse to CDF), “think tolerance line” - rname = draw
*r*andom numbers from`name`

(first argument always the number of draws), think whatever you want…it’s kind of random

- And ways to write your own (like the
`pareto`

distribution we use in finance)

### 3.5.1 Try this exercise

Suppose the `EIUR`

price series is the *benchmark* in several import contracts you write as the procurement officer of your organization. Your concern is with volatility. Thus you think that you need to simulate the size of the price rates, whatever direction they go in. Draw the histogram of the absolute value of price rates.

```
require(ggplot2)
r.tol <- quantile(xmprice.r.df$Rate,
0.95)
r.tol.label <- paste("Tolerable Rate = ",
round(r.tol, 2))
ggplot(xmprice.r.df, aes(Rate.abs)) +
geom_histogram(fill = "cornsilk",
colour = "grey60") + geom_density() +
geom_vline(xintercept = r.tol, colour = "red",
size = 1.5) + annotate("text",
x = 0.055, y = 30, label = r.tol.label,
colour = "darkred")
```

This series is right-skewed and thickly-tailed. We will use this function to pull several of the statistics calculations together.

```
## r_moments function INPUTS: r vector
## OUTPUTS: list of scalars (mean, sd,
## median, skewness, kurtosis)
data_moments <- function(data) {
require(moments)
mean.r <- mean(data)
sd.r <- sd(data)
median.r <- median(data)
skewness.r <- skewness(data)
kurtosis.r <- kurtosis(data)
result <- data.frame(mean = mean.r,
std_dev = sd.r, median = median.r,
skewness = skewness.r, kurtosis = kurtosis.r)
return(result)
}
```

We might need to `install.packages("moments")`

to make this function operate. We then run these sentences.

```
ans <- data_moments(xmprice.r.df$Rate.abs)
ans <- round(ans, 4)
knitr::kable(ans)
```

mean | std_dev | median | skewness | kurtosis |
---|---|---|---|---|

0.0109 | 0.0117 | 0.0074 | 2.523 | 12.1545 |

As we visually surmised, the series is right-skewed and very thickly tailed. This may indicate that the `gamma`

and `pareto`

functions may help us describe these series and prepare us for simulations, estimation, and inference. We will make liberal use of the `fitdistr`

function from `MASS`

and come back to this moments function.

## 3.6 Estimate until morale improves…

We will try one method that works often enough in practice, Method of Moments (“MM” or, more affectionately, “MOM”). This estimation technique finds the distribution parameters such that the moments of the data match the moments of the distribution. Other methods include:

`fitdistr`

: Let the opaque box do the job for you; look at the package`MASS`

which uses the “maximum likelihood” approach in the`fitdistr`

estimating function (like`lm`

for regression).`fitdistrplus`

: For the more adventurous analyst, this package contains several methods, including MM, to get the job done.

uppose we believe that absolute price rates somehow follow a `gamma`

distribution. You can look up this distribution easily enough in Wikipedia’s good article on the subject. Behind managerial scenes, we can model the loss with `gamma`

severity function that allows for skewness and “heavy” tails. We can specify the `gamma`

distribution with by shape, \(\alpha\), and scale, \(\beta\), parameters. We will find in operational loss analysis that this distribution is especially useful for time-sensitive losses.

It turns out We can specify the shape and scale parameters using the mean, \(\mu\), and standard deviation, \(\sigma\) of the random severities, \(X\). The scale parameter is \[ \beta = \sigma^2 / \mu, \] and shape parameter,

\[ \alpha = \mu^2 / \sigma^2. \]

The distribution itself is defined as \[
f(x; alpha, \beta) = \frac{\beta^{\alpha}x^{\alpha-1}e^{-x\beta}}{\Gamma(\alpha)},
\] where, to have a complete statement, \[
\Gamma(x) = \int_{0}^{\infty} x^{t-1} e^{-x} dx.
\] Let’s finally implement into `R`

.

First, we will load a cost sample and calculate moments and gamma parameters:

```
cost <- read.csv("data/cost.csv")
cost <- cost$x
cost.moments <- data_moments(cost)
cost.mean <- cost.moments$mean
cost.sd <- cost.moments$std_dev
(cost.shape <- cost.mean^2/cost.sd^2)
```

`## [1] 19.06531`

`(cost.scale <- cost.sd^2/cost.mean)`

`## [1] 0.5575862`

`gamma.start <- c(cost.shape, cost.scale)`

When performing these calculations, be sure that the function `data_moments`

is loaded into the workspace.

Second, we can use `fitdistr`

from the `Mass`

package to estimate the `gamma`

parameters `alpha`

and `beta`

.

```
require(MASS)
fit.gamma.cost <- fitdistr(cost, "gamma")
fit.gamma.cost
```

```
## shape rate
## 20.2998092 1.9095724
## ( 2.3729250) ( 0.2259942)
```

Third, we construct the ratio of estimates to the standard error of estimates. This computes the number of standard deviations away from zero the estimates are. If they are “far” enough away from zero, we have reason to reject the null hypothesis that the estimates are no different from zero.

`(cost.t <- fit.gamma.cost$estimate/fit.gamma.cost$sd)`

```
## shape rate
## 8.554762 8.449652
```

`knitr::kable(cost.t)`

shape | 8.554762 |

rate | 8.449652 |

Nice…but we also note that the scale parameter is `fit.gamma.cost$estimate[2] / gamma.start[2]`

times the moment estimates above.

### 3.6.1 Try this exercise

Let’s use the export-input price series rates and the `t`

distribution instead of the `gamma`

.

First, we calculate the moments (mean, etc.).

```
rate <- xmprice.r.df$Rate
rate.moments <- data_moments(rate)
(rate.mean <- rate.moments$mean)
```

`## [1] 0.0004595748`

`(rate.sd <- rate.moments$std_dev)`

`## [1] 0.01602021`

Second, we use `fitdistr`

from the `Mass`

package to estimate the parameters of the `t`

distribution.

```
fit.t.rate <- fitdistr(rate, "t", hessian = TRUE)
fit.t.rate
```

```
## m s df
## 0.001791738 0.009833018 2.888000806
## (0.001059206) (0.001131725) (0.843729312)
```

Third, we infer if we did a good job or not. The null hypothesis is that the parameters are no different than zero (\(H_0\)). We calculate t statistics to approximate the mapping of parameter estimates to a dimensionless scale that will compute the number of standard deviations from the null hypothesis that the parameters are just zero and of no further use.

```
## m s df
## 1.691586 8.688522 3.422900
```

Nice…but that location parameter is a bit low relative to moment estimate. What else can we do? Simulate the estimated results and see if, at least, skewness and kurtosis line up with the moments.

## 3.7 Summary

We used our newly found ability to write functions and built insightful pictures of distributions. We also ran nonlinear (gamma and t-distributions are indeed very nonlinear) regressions using a package and the method of moments. All of this to answer critical business questions.

More specifically we waded into:

- Excel look alike processes: Pivot tables and VLOOKUP
- Excel look alike functions
- Graphics to get insights into distributions
- Estimating parameters of distribution
- Goodness of fit

## 3.8 Further Reading

Teetor’s various chapters have much to guide us in the writing of functions and the building of expressions. Present value and internal rate of return can be found in Brealey et al. Use of `ggplot2`

in this chapter relies heavily on Chang (2014).

## 3.9 Practice Sets

These practice sets reference materials developed in this chapter. We will explore new problems and data with models, `R`

packages, tables, and plots worked out already in the chapter.

### 3.9.1 Set A

In this set we will build a data set using filters and `if`

and `diff`

statements. We will then answer some questions using plots and a pivot table report. We will then review a function to house our approach in case we would like to run the same analysis on other data sets.

#### 3.9.1.1 Problem

Supply chain managers at our company continue to note we have a significant exposure to heating oil prices (Heating Oil No. 2, or HO2), specifically New York Harbor. The exposure hits the variable cost of producing several products. When HO2 is volatile, so is earnings. Our company has missed earnings forecasts for five straight quarters. To get a handle on Brent we download this data set and review some basic aspects of the prices.

```
# Read in data
HO2 <- read.csv("data/nyhh02.csv", header = T,
stringsAsFactors = F)
# stringsAsFactors sets dates as
# character type
head(HO2)
HO2 <- na.omit(HO2) ## to clean up any missing data
str(HO2) # review the structure of the data so far
```

#### 3.9.1.2 Questions

- What is the nature of HO2 returns? We want to reflect the ups and downs of price movements, something of prime interest to management. First, we calculate percentage changes as log returns. Our interest is in the ups and downs. To look at that we use
`if`

and`else`

statements to define a new column called`direction`

. We will build a data frame to house this analysis.

```
# Construct expanded data frame
return <- as.numeric(diff(log(HO2$DHOILNYH))) *
100
size <- as.numeric(abs(return)) # size is indicator of volatility
direction <- ifelse(return > 0, "up",
ifelse(return < 0, "down", "same")) # another indicator of volatility
date <- as.Date(HO2$DATE[-1], "%m/%d/%Y") # length of DATE is length of return +1: omit 1st observation
price <- as.numeric(HO2$DHOILNYH[-1]) # length of DHOILNYH is length of return +1: omit first observation
HO2.df <- na.omit(data.frame(date = date,
price = price, return = return, size = size,
direction = direction)) # clean up data frame by omitting NAs
str(HO2.df)
```

We can plot with the `ggplot2`

package. In the `ggplot`

statements we use `aes`

, “aesthetics”, to pick `x`

(horizontal) and `y`

(vertical) axes. Use `group =1`

to ensure that all data is plotted. The added (`+`

) `geom_line`

is the geometrical method that builds the line plot.

Let’s try a bar graph of the absolute value of price rates. We use `geom_bar`

to build this picture.

Now let’s build an overlay of `return`

on `size`

.

Let’s dig deeper and compute mean, standard deviation, etc. Load the

`data_moments()`

function. Run the function using the`HO2.df$return`

subset and write a`knitr::kable()`

report.Let’s pivot

`size`

and`return`

on`direction`

. What is the average and range of returns by direction? How often might we view positive or negative movements in HO2?

### 3.9.2 Set B

We will use the data from the previous set to investigate the distribution of returns we generated. This will entail fitting the data to some parametric distributions as well as plotting and building supporting data frames.

#### 3.9.2.1 Problem

We want to further characterize the distribution of up and down movements visually. Also we would like to repeat the analysis periodically for inclusion in management reports.

#### 3.9.2.2 Questions

How can we show the differences in the shape of ups and downs in HO2, especially given our tolerance for risk? Let’s use the

`HO2.df`

data frame with`ggplot2`

and the cumulative relative frequency function`stat_ecdf`

.How can we regularly, and reliably, analyze HO2 price movements? For this requirement, let’s write a function similar to

`data_moments`

.

Let’s test `HO2_movement()`

.

Morale: more work today (build the function) means less work tomorrow (write yet another report).

- Suppose we wanted to simulate future movements in HO2 returns. What distribution might we use to run those scenarios? Here, let’s use the
`MASS`

package’s`fitdistr()`

function to find the optimal fit of the HO2 data to a parametric distribution.

### 3.9.3 Practice Set Debrief

List the R skills needed to complete these practice sets.

What are the packages used to compute and graph results. Explain each of them.

How well did the results begin to answer the business questions posed at the beginning of each practice set?

## 3.10 Project

### 3.10.1 Background

Your company uses natural gas as a major input to recycle otherwise non-recoverable waste. The only thing that prevents the operation from being a little better than break-even is volatile natural gas prices. In its annual review, management requires information and analysis of recycling operations with a view to making decisions about outsourcing contracts. These contracts typically have three year tenors.

Since management will be making do or outsource decisions over a three year forward span, analysts will build models that characterize historical movements in natural gas prices, the volatility of those prices, and probability distributions to simulate future natural gas scenarios.

### 3.10.2 Data

In a preliminary analysis, you gather data from FRED on daily natural gas prices. You whill use this data to characterize historical natural gas price movements and construct provisional probability distributions for eventual generation of forward scenarios.

### 3.10.3 Workflow

Data collection. Collect, clean, and review data definitions, and data transformations of price into returns. Use tables and graphs to report results.

Analysis.

- Group prices into up, same (no movement), and down movements using percent change in daily price as the criterion.
- Build a table of summary statistics that pivots the data and computes metrics.
- Graph the cumulative probability of an up, same, and down group of historical returns.
- Estimate Student’s t distribution parameters for up, same, and down movements in natural gas returns.

- Observations and Recommendations.

- Summarize the data, its characteristics, and applicability to attend to the problem being solved for management.
- Discuss key take-aways from analytical results that are relevant to the decisions that managers will make.
- Produce an
`R Markdown`

document with code chunks to document and interpret results. - The format will introduce the problem to be analyzed, with sections that discuss the data used, and which follow the work flow.

### 3.10.4 Assessment

We will use the following rubric to assess our performance in producing analytic work product for the decision maker.

The text is laid out cleanly, with clear divisions and transitions between sections and sub-sections. The writing itself is well-organized, free of grammatical and other mechanical errors, divided into complete sentences, logically grouped into paragraphs and sections, and easy to follow from the presumed level of knowledge.

All numerical results or summaries are reported to suitable precision, and with appropriate measures of uncertainty attached when applicable.

All figures and tables shown are relevant to the argument for ultimate conclusions. Figures and tables are easy to read, with informative captions, titles, axis labels and legends, and are placed near the relevant pieces of text.

The code is formatted and organized so that it is easy for others to read and understand. It is indented, commented, and uses meaningful names. It only includes computations which are actually needed to answer the analytical questions, and avoids redundancy. Code borrowed from the notes, from books, or from resources found online is explicitly acknowledged and sourced in the comments. Functions or procedures not directly taken from the notes have accompanying tests which check whether the code does what it is supposed to. All code runs, and the

`R Markdown`

file`knits`

to`pdf_document`

output, or other output agreed with the instructor.Model specifications are described clearly and in appropriate detail. There are clear explanations of how estimating the model helps to answer the analytical questions, and rationales for all modeling choices. If multiple models are compared, they are all clearly described, along with the rationale for considering multiple models, and the reasons for selecting one model over another, or for using multiple models simultaneously.

The actual estimation and simulation of model parameters or estimated functions is technically correct. All calculations based on estimates are clearly explained, and also technically correct. All estimates or derived quantities are accompanied with appropriate measures of uncertainty.

The substantive, analytical questions are all answered as precisely as the data and the model allow. The chain of reasoning from estimation results about the model, or derived quantities, to substantive conclusions is both clear and convincing. Contingent answers (for example, “if X, then Y , but if A, then B, else C”) are likewise described as warranted by the model and data. If uncertainties in the data and model mean the answers to some questions must be imprecise, this too is reflected in the conclusions.

All sources used, whether in conversation, print, online, or otherwise are listed and acknowledged where they used in code, words, pictures, and any other components of the analysis.

## 3.11 References

Brealey

Chang

Teetor

knitr

xts

zoo

BLS