# Chapter 8 Lab 6 - 16/11/2021

In this lecture we will

- learn how to perform normality test, i.e. Jarque-Bera Test and Shapiro-Wilk test
- we will study the general class of goodness-of-fit test and investigate the difference between simple and composite null hypothesis
- descriptive analysis of a multiple data.frame. How to compute the mean and the variance of a linear combination of random variables (e.g., a portfolio)

## 8.1 Analysis of the Current Population Survey data

We will use the data stored in the **CPSch3.csv** file and collected through the Earnings from the Current Population Survey (CPS). The available variables are the following:

`year`

: survey year (quantitative discrete variable)`ahe`

: average hourly earnings (quantitative continuous variable)`sex`

: a qualitative variable with two categories/levels (male,female)

We start with data import and structure check:

```
= read.csv("files/CPSch3.csv", sep=";")
CPSch3 str(CPSch3)
```

```
## 'data.frame': 11130 obs. of 3 variables:
## $ year: int 1992 1992 1992 1992 1992 1992 1992 1992 1992 1992 ...
## $ ahe : num 13 11.6 17.4 10.1 16.7 ...
## $ sex : chr "male" "male" "male" "female" ...
```

For `year`

and `sex`

it is possible to obtain the univariate frequency distribution which reports how many observations are available for each year/category:

`table(CPSch3$year)`

```
##
## 1992 1994 1996 1998
## 2962 2956 2609 2603
```

`table(CPSch3$sex)`

```
##
## female male
## 5174 5956
```

We observe, for example, that for year 1992 there are 2962 observations (26.61% of the total number of observations). Moreover, also the bivariate frequency distribution can be easily obtained using the same function:

`table(CPSch3$year, CPSch3$sex)`

```
##
## female male
## 1992 1371 1591
## 1994 1358 1598
## 1996 1235 1374
## 1998 1210 1393
```

The previous table, is also called contingency table and it reports the *joint* absolute frequencies of the statistical variables (sex,year). We can also compute the table of the *joint* relative frequencies of the variable

```
<- length(CPSch3$year)
n_obs <-table(CPSch3$year, CPSch3$sex)/n_obs
joint_rel
joint_rel
```

```
##
## female male
## 1992 0.1231806 0.1429470
## 1994 0.1220126 0.1435759
## 1996 0.1109614 0.1234501
## 1998 0.1087152 0.1251572
```

It is worth mentioning that the relative joint frequencies are an estimation of the joint distribution of the random variables (\(Y_1\)=sex, \(Y_2\)=year). Observe for instance that

`sum(joint_rel)`

`## [1] 1`

At the margin of the table we can report the marginal distribution of the two variables

`addmargins(joint_rel)`

```
##
## female male Sum
## 1992 0.1231806 0.1429470 0.2661276
## 1994 0.1220126 0.1435759 0.2655885
## 1996 0.1109614 0.1234501 0.2344115
## 1998 0.1087152 0.1251572 0.2338724
## Sum 0.4648697 0.5351303 1.0000000
```

Going back at the contingency table, we note that 1371 observations refer to female in year 1992, corresponding to the 46.29% of the observations available for 1992 and to 26.5% of the female data.

In the following we will consider a subset of the original data, referring only to males and observations after 1994. This selection can be performed by setting (logical) conditions are shown in Section 3.2, or by using the `subset`

function as follows:

```
= subset(CPSch3, sex=="male" & year>1994)
CPS dim(CPS)
```

`## [1] 2767 3`

`head(CPS)`

```
## year ahe sex
## 5920 1996 11.660622 male
## 5922 1996 14.971988 male
## 5923 1996 25.958241 male
## 5924 1996 9.989918 male
## 5925 1996 13.451954 male
## 5926 1996 21.394132 male
```

`str(CPS)`

```
## 'data.frame': 2767 obs. of 3 variables:
## $ year: int 1996 1996 1996 1996 1996 1996 1996 1996 1996 1996 ...
## $ ahe : num 11.66 14.97 25.96 9.99 13.45 ...
## $ sex : chr "male" "male" "male" "male" ...
```

The new data frame, named `CPS`

, contains 2767 observations.

We analyze and represent graphically the only continuous variable available in the data frame:

`summary(CPS$ahe)`

```
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.159 12.009 16.233 17.416 21.377 52.467
```

`hist(CPS$ahe)`

The distribution appears to hava a positive skewness; this is confirmed by the fact that the mean is greater than the median and by the following normal probability plot with a concave pattern:

```
qqnorm(CPS$ahe,datax=T)
qqline(CPS$ahe,datax=T)
```

## 8.2 Normality tests

The aim of normality tests is to assess the normality of a set of observed data. Thus, the null hypothesis is \(H_0\):data are from a Normal distribution, against the alternative hypothesis \(H_1\):data are not from a Normal distribution. A decision is taken by using the p-value.

In the following we present the Jarque-Bera and Shapiro-Wilk tests.

### 8.2.1 Jarque-Bera Normality test

The Jarque-bera test of normality uses the sample skewness \(\hat{Sk}\) and kurtosis \(\hat{Kur}\) which are compared with the corresponding expected value (0 and 3) in case of Normal distribution. The test statistic is given by \[ JB = n\left(\frac{\hat{Sk}^2}{6}+\frac{(\hat{Kur}-3)^2}{24}\right) \] which, under the null hypothesis, is approximately distributed like a Chi-square distribution with 2 degrees of freedom (\(\chi^2_2\)). Under the null hypothesis (normality), we expect a value of the test statistic close to 0.

This test is implemented in R using the `jarque.bera.test`

function which is part of the `tseries`

package.

Assuming that the `tseries`

package is already installed (see Section 6.2.3.1), we load it and run the test for the `ahe`

variable:

`library(tseries)`

```
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
```

`jarque.bera.test(CPS$ahe)`

```
##
## Jarque Bera Test
##
## data: CPS$ahe
## X-squared = 570.37, df = 2, p-value < 2.2e-16
```

Given the extremely small value of the p-value (basically equal to zero), we reject, as expected given the plots, the normality hypothesis which is not supported by the data.

It is also possible to compute the value of the test statistic *manually*. Nevertheless, since to compare our results with the one obtained with the `jarque.bera.test`

, we need to compute the Skewness and the Kurtosis statistics using the population standard deviation (we have programmed an R function in Section 7.2. Then we first define the function to compute the population standard deviation

```
<- function(y){
sd_pop<- length(y)
n<- sqrt( (n-1)/n )*sd(y)
outputreturn(output)
}
```

Then we will compute the JB test statistics as follows

```
= length(CPS$ahe)
n = mean((CPS$ahe-mean(CPS$ahe))^3)/sd_pop(CPS$ahe)^3
sk = mean((CPS$ahe-mean(CPS$ahe))^4)/sd_pop(CPS$ahe)^4
kur = n*(sk^2/6+(kur-3)^2/24)
JB JB
```

`## [1] 570.3726`

The corresponding p-value can be computed *manually* by using the `pchisq`

function which provides the cumulative distribution function (i.e. \(P(X\leq x)\)) for a random variable \(X\) which is distributed like a Chi-square Distribution with `df`

degrees of freedom. In this case `df=2`

. We should also remember that the critical region of a JB test is unilateral on the right:

`pchisq(JB,df=2,lower.tail = F)`

`## [1] 1.39688e-124`

The option `lower.tail = F`

because we are interested in the probability in the right tail, while `pchisq`

returns the probability in the left side. As an alternative we could have computed the complement at one of the `pchisq`

, i.e.,

`1-pchisq(JB,df=2)`

`## [1] 0`

As described in Section 7.4, we use the Box-Cox transformation to solve the asymmetry issue. The transformed variable, named `ahe2`

, is added to the data frame:

```
library(car)
= powerTransform(CPS$ahe)
bestpar $lambda bestpar
```

```
## CPS$ahe
## 0.3531166
```

```
$ahe2 = bcPower(CPS$ahe,bestpar$lambda)
CPShead(CPS)
```

```
## year ahe sex ahe2
## 5920 1996 11.660622 male 3.909613
## 5922 1996 14.971988 male 4.531714
## 5923 1996 25.958241 male 6.111130
## 5924 1996 9.989918 male 3.551352
## 5925 1996 13.451954 male 4.258539
## 5926 1996 21.394132 male 5.520853
```

The transformed variable distribution is studied here below by means of different plots:

```
hist(CPS$ahe2,freq=F)
lines(density(CPS$ahe2),col="red")
```

```
qqnorm(CPS$ahe2,datax = T)
qqline(CPS$ahe2,datax = T,col="red")
```

The distribution seems to be symmetric and most of the points in the normal probability plot have a linear pattern, but with some points that deviate from the reference line in the left and right tails. Can we conclude that the transformed data are normally distributed? We will use a (different) normality test.

### 8.2.2 Shapiro-Wilk Normality test

The Shapiro-Wilk is based on the correlation between the empirical quantiles, given by the sample order statistics \(y_{(i)}\), and the theoretical Normal quantiles \(z_{(i)}=E(Z_{(i)})\). Under normality, the correlation should be close to 1; the \(H_0\) hypothesis of normality is rejected for small values of the test statistic.

The `shapiro.test`

function is part of the base R release and provides the values of the test statistic and of the corresponding p-value. It can be computed only if the sample size (\(n\)) is lower than 5000.

`shapiro.test(CPS$ahe2)`

```
##
## Shapiro-Wilk normality test
##
## data: CPS$ahe2
## W = 0.99775, p-value = 0.0005052
```

In this case, even if the correlation coefficient (0.9977467) is close to 1, the p-value is quite small and leads to the rejection of the normality hypothesis. Note that the Shapiro-Wilk test is known to be a sample-size biased test: for big sample even small differences from the Normal distribution lead to the rejection of the \(H_0\) hypothesis. In this case deviations from the Normal distribution are observed in the tails (see Figure 8.1).

Let’s try to use the Jarque-Bera test for the same set of data:

`jarque.bera.test(CPS$ahe2)`

```
##
## Jarque Bera Test
##
## data: CPS$ahe2
## X-squared = 6.2297, df = 2, p-value = 0.04439
```

In this case the p-value does not provide a very strong evidence against \(H_0\) (if compared with the p-value of the Shapiro-Wilk test). However, if we set the significance level \(\alpha\) (type I error probability) to 0.05, we reject the normality hypothesis also in this case.

## 8.3 Goodnes of fit test

## 8.4 Kolmogorov-Smirnov test

This test compares the empirical CDF \(F_n(t)\) (ECDF, see Section 6.1) and the theoretical CDF \(F(t)\) of a continuous random variable whose (population) parameters are assumed to be known. The test statistics is given by the absolute distance between the two function defined as follows: \[ D_n=\sup_t |F_n(t)-F(t)| \]

Under the null hypothesis, we expect a small value of \(D_n\). In R the `ks.test(x,y,...)`

function can be used to perform this test. The following arguments have to be specified:

`x`

is the vector of data`y`

is a string naming a theoretical CDF, as for instance`pnorm`

for the Normal distribution,`plnorm`

for the logNormal distribution.`...`

is used to specify the parameters of the considered distribution (for example`mean=...`

and`sd=...`

if`pnorm`

is used, or`meanlog=...`

and`sdlog=...`

if`plnorm`

is used).

However there is an issue with the Kolmogorov-Smirnov test implemented in `ks.test`

. As reported in its help page “*the parameters specified in … must be pre-specified and not estimated from the data*”.
In other words, it is not possible to specify a “** composite**” null hypothesis with the

`ks.test`

function.The point is that when the hypothesis is composite, we need to use the sample data to estimate the population parameters. When we plug-in the test the estimated parameters, the Kolmogorov-Smirnov test statistic is closer to the null distribution than it would be when the population parameters are known (and this results in increased Type II error probability, i.e. of the probability of not rejecting \(H_0\) when it is not true). See, for example, the following case where we simulate 200 values from the Uniform(0,1) distribution (see here for the definition of continuos Uniform distribution) by using the `runif`

function:

```
set.seed(4)
= runif(200)
x hist(x)
```

The data aren’t for sure normally distributed because they were simulated from the Uniform distribution. However, if we run the Kolmogorov-Smirnov test with `ks.test`

and by using the sample statistics (mean and sd), we obtain

```
ks.test(x, "pnorm",
mean=mean(x),
sd=sd(x))
```

```
##
## One-sample Kolmogorov-Smirnov test
##
## data: x
## D = 0.089686, p-value = 0.08011
## alternative hypothesis: two-sided
```

The p-value is 0.0801094: if we consider as threshold 0.05, we conclude the data suggest to not reject the normality hypothesis (and we know this is a wrong decision, i.e. second type error).

To overcome this issue it is possible to use the corrected version of the Kolmogorov-Smirnov test by means of the `LcKS`

function included in the `KScorrect`

library (that has to be installed and then loaded):

```
library(KScorrect)
= LcKS(x,"pnorm", parallel = TRUE) ks2
```

```
## Warning in LcKS(x, "pnorm", parallel = TRUE): Consider setting 'parallel =
## FALSE' to increase the speed of this function.
```

`## LcKS is running in parallel with 3 worker(s) using doParallelSNOW [1.0.16]`

`$p.value ks2`

`## [1] 6e-04`

In this case the p-value leads to correctly reject the normality assumption for the data contained in `x`

.

Let’s now consider one of the quantitative variable in `datareg`

and check if we can be considered as generated from the Normal model:

```
#nasdaq_KS = LcKS(datareg$ibm,"pnorm")
#nasdaq_KS$p.value
```

Given the small p-value we reject the normality hypothesis; this conclusion is coherent with the features of the distribution represented in the following plots:

```
#plot(density(datareg$ibm))
#qqnorm(datareg$ibm, datax=T)
#qqline(datareg$ibm, datax=T)
```

### 8.4.1 Other gof-test

**test (CVM-test) is an other goodness-of-fit test whose statistics is a**

*The Cramér–von Mises**distance*between the empirical distribution of the data \({\displaystyle F_{n}}\) and a theoretical distribution \(F\), i,e,

The test statistics is \[ {\displaystyle \omega ^{2}=\int _{-\infty }^{\infty }[F_{n}(x)-F(x)]^{2}\,\mathrm {d} F(x)} \]

The ** Anderson–Darling test** (AD-test), again is goodness-of-fit test whose statistics is a

*distance*between the empirical distribution of the data \({\displaystyle F_{n}}\) and a theoretical distribution \(F\). When applied to testing whether a normal distribution adequately describes a set of data, it is one of the most powerful statistical tools for detecting most departures from normality.

The **Anderson–Darling** test statistics
\[
{\displaystyle A^{2}=n\int _{-\infty }^{\infty }{\frac {(F_{n}(x)-F(x))^{2}}{F(x)\;(1-F(x))}}\,dF(x),}
\]

In R the test both the CVM-test and the AD-test are implemented in the library `goftest`

. The two main function are `cvm.test`

and `ad.test`

.

By default, each test assumes that the parameters of the null distribution are known (a *simple null hypothesis*). If the parameters were estimated (calculated from the data) then the user should set `estimated=TRUE`

which uses a method adjust for the effect of estimating the parameters from the data (i.e. it adjust the p-value for *composite null hypothesis*)

### 8.4.2 Gof test for three set of data

As an example we are going to analyze the other set of data, that we already have seen in class: the *magenta* (file magenta_data.txt), the *green* (file green_data.txt), and the *cyan* data (file cyan_data.txt)

Let’s load the data and produce some of the plot that we know already. Observe that, since the files contain an univariate sequence of data, we can use the function `scan`

to inport the data.

*Magenta data*

```
<- scan(file="files/magenta_data.txt")
series1
par(mfrow=c(1,3))
hist(series1,freq=F,col="magenta")
boxplot(series1,col="magenta")
qqnorm(series1,datax = T,col="magenta",pch="*",cex=1.5)
qqline(series1,datax=T)
```

** Summary statistics** of the Magenta data

`summary(series1)`

```
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.7596 1.1169 1.2362 1.2409 1.3530 1.8869
```

For this set of data we want to perform the following test: \[ H_0:\text{Data have been generated from a Gamma distribution} \]

To perform the test we need to estimate the parameter \(\alpha\), i.e. the shape, and the parameter \(\beta\), i.e., the rate, of the Gamma distribution, given the observation.
We will perform the estimation via the
`fitdistr`

of the `MASS`

library

```
library(MASS)
= fitdistr(x=series1,densfun="gamma")
fitN fitN
```

```
## shape rate
## 52.199691 42.066751
## ( 2.326943) ( 1.884256)
```

Then we can compute the p-value for the three gof-test we have studied

*Kolmogorov-Smirnov*

`= LcKS(series1,"pgamma") ks2 `

```
## Warning in LcKS(series1, "pgamma"): Consider setting 'parallel = TRUE' to
## increase the speed of this function.
```

```
#ks2$p.value
0.6448
```

`## [1] 0.6448`

*Anderson-Darling*

```
library(goftest)
ad.test(series1, "pgamma", shape=fitN$estimate["shape"],
rate=fitN$estimate["rate"], estimated=TRUE)
```

```
##
## Anderson-Darling test of goodness-of-fit
## Braun's adjustment using 32 groups
## Null hypothesis: Gamma distribution
## with parameters shape = 52.1996908290833, rate = 42.0667510631009
## Parameters assumed to have been estimated from data
##
## data: series1
## Anmax = 5.1624, p-value = 0.07556
```

*Cramer-von Mises*

```
cvm.test(series1, "pgamma", shape=fitN$estimate["shape"],
rate=fitN$estimate["rate"], estimated=TRUE)
```

```
##
## Cramer-von Mises test of goodness-of-fit
## Braun's adjustment using 32 groups
## Null hypothesis: Gamma distribution
## with parameters shape = 52.1996908290833, rate = 42.0667510631009
## Parameters assumed to have been estimated from data
##
## data: series1
## omega2max = 0.91382, p-value = 0.1089
```

In the three cases we obtained large p-value, suggesting an data evidence in favour of the null hypotesis, indeed we conclude that a Gamma model fits well the data.

*Green data*

```
<- scan("files/green_data.txt")
series2
par(mfrow=c(1,3))
hist(series2,freq=T,col="green")
boxplot(series2,col="green")
qqnorm(series2,datax = T,col="green",pch="*",cex=1.5)
qqline(series2,datax=T)
```

** Summary statistics** of the Green data

`summary(series2)`

```
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.620 2.743 4.018 4.032 5.329 10.482
```

For this set of data we want to perform the following test: \[ H_0:\text{Data have been generated from a Normal distribution} \]

To perform the test we need to estimate the parameter \(\mu\), i.e. the mean, and the parameter \(\sigma\), i.e., the standard deviation, of the Normal distribution, given the observation. In this case they corresponds with the sample mean and with the sample standard deviation

```
library(MASS)
<- mean(series2)
hat_mu<- sd(series2) hat_sig
```

Then we can compute the p-value for the three gof-test we have studied

*Kolmogorov-Smirnov*

`= LcKS(series1,"pnorm",parallel = T) ks2 `

```
## Warning in LcKS(series1, "pnorm", parallel = T): Consider setting 'parallel =
## FALSE' to increase the speed of this function.
```

`## LcKS is running in parallel with 3 worker(s) using doParallelSNOW [1.0.16]`

`$p.value ks2`

`## [1] 0.0282`

*Anderson-Darling*

```
ad.test(series2, "pnorm", mean=hat_mu,
sd=hat_sig, estimated=TRUE)
```

```
##
## Anderson-Darling test of goodness-of-fit
## Braun's adjustment using 32 groups
## Null hypothesis: Normal distribution
## with parameters mean = 4.03225574117, sd = 1.98338995259963
## Parameters assumed to have been estimated from data
##
## data: series2
## Anmax = 3.5744, p-value = 0.3694
```

*Cramer-von Mises*

```
cvm.test(series2, "pnorm", mean=hat_mu,
sd=hat_sig, estimated=TRUE)
```

```
##
## Cramer-von Mises test of goodness-of-fit
## Braun's adjustment using 32 groups
## Null hypothesis: Normal distribution
## with parameters mean = 4.03225574117, sd = 1.98338995259963
## Parameters assumed to have been estimated from data
##
## data: series2
## omega2max = 0.50485, p-value = 0.7108
```

*Cyan data*

```
<- scan(file="files/cyan_data.txt")
series3
par(mfrow=c(1,3))
hist(series3,freq=T,col="cyan")
boxplot(series3,col="cyan")
qqnorm(series3,datax = T,col="cyan",pch="*",cex=1.5)
qqline(series3,datax=T)
```

** Summary statistics** cyan data

`summary(series3)`

```
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -3.25854 -0.62827 0.04657 0.10559 0.75867 4.29380
```

For this set of data we want to perform the following test: \[ H_0:\text{Data have been generated from a t-student distribution} \]

To perform the test we need to estimate the parameter \(\mu\), i.e. the non-central (or location) parameter the parameter \(\sigma\), i.e., scale parameter, and the parameter \(\nu\), i.e. the degrees of freedom of the t-student distribution, given the observation.
To this end we will use the
`fitdistr`

function of the `MASS`

library

```
library(MASS)
= fitdistr(x=series1,densfun="t")
fitN fitN
```

```
## m s df
## 1.240266e+00 1.692066e-01 6.969612e+01
## (5.501338e-03) (5.403389e-03) (1.107708e+02)
```

Nevertheless, we should observe that the R built-in function `pt`

, is parametrized only respect to \(\mu\) and \(\nu\).
To apply the gof-test implemented in the `goftest`

we have to take into account the lack of \(\sigma\). To this end we can observe that if \(X\sim\)t-student\((\mu,\sigma,\nu)\) then
if we let \(Y=X/\sigma\) we have that \(Y\sim\)t-student\((\mu/\sigma,1,\nu)\), so we can apply the test on the transformed data.

So we tranform the data

`<- series3/fitN$estimate["s"] series3_tran `

And we apply the Anderson-Darling test to the tranformed data

```
ad.test(series3_tran, "pt", df=20.17,
ncp=fitN$estimate["m"]/fitN$estimate["s"],
estimated=TRUE)
```

```
##
## Anderson-Darling test of goodness-of-fit
## Braun's adjustment using 24 groups
## Null hypothesis: Student's t distribution
## with parameters df = 20.17, ncp = 7.32988991315944
## Parameters assumed to have been estimated from data
##
## data: series3_tran
## Anmax = 418.58, p-value = 0.0005758
```

In the same way we apply the Cramer-von Mises test to the tranformed data

```
cvm.test(series3_tran, "pt", df=20.17,
ncp=fitN$estimate["m"]/fitN$estimate["s"],
estimated=TRUE)
```

```
##
## Cramer-von Mises test of goodness-of-fit
## Braun's adjustment using 24 groups
## Null hypothesis: Student's t distribution
## with parameters df = 20.17, ncp = 7.32988991315944
## Parameters assumed to have been estimated from data
##
## data: series3_tran
## omega2max = 7.3789, p-value < 2.2e-16
```

We obtained quite large values of the p-value, and we can conclude that there is not evidence to reject the null hypothesis, so we conclude that the cyan data are distributed according to a t-student distribution.

## 8.5 Joint analysis of the assess of four different stocks

The file
`prices.txt`

contain daily prices of the following stocks (from 2020-11-05, to 2021-11-03):
- `ge`

General Electrics
- `nok`

Nokia
- `nok`

Facebook
- `sp`

Standard and Poor (SP)

Let’s clear the R-worksheet and load the data

```
###
rm(list=ls())
<- read.table("files/prices.txt",header=T)
pricesstr(prices)
```

```
## 'data.frame': 247 obs. of 5 variables:
## $ Date: chr "2020-11-05" "2020-11-06" "2020-11-09" "2020-11-10" ...
## $ ge : num 63.7 64.5 69.5 71.6 70.9 ...
## $ nok : num 3.54 3.52 3.65 3.67 3.75 3.75 3.81 3.9 3.93 3.94 ...
## $ face: num 250 248 243 227 234 ...
## $ sp : num 345 345 350 349 352 ...
```

We observe that the firs variable in the data-frame is the day at which the prices have been collected. Nevertheless, R reed this column as of type character `chr`

. A very useful library to transform string (i.e `chr`

object) into dates is `lubridate`

```
library("lubridate")
is(prices$Date)
```

```
## [1] "character" "vector" "data.frameRowLabels"
## [4] "SuperClassMethod" "index" "atomicVector"
```

```
$Date<-ymd(prices$Date)
pricesis(prices$Date)
```

`## [1] "Date" "oldClass" "time_timeSeries"`

`$Date[1] #first day of obeservation prices`

`## [1] "2020-11-05"`

`tail(prices$Date)[1] # last day of observation`

`## [1] "2021-10-27"`

Let’s give a look at the time series of the prices

```
par(mfrow=c(2,2))
for(i in 1:4){
plot(prices$Date,prices[,i+1],type="l",
main=paste("Prices of", names(prices)[i+1]),col=i+1,lwd=2)
}
```

We observe a quite strong “trend” for all the time series, moreover the prices of the different stocks are very different, form the average of 4.8728745$ for a Nokia’s share to the average of 408.0274877$ for a SP’s share

For more information we can look at the *univariate* or *marginal* statistics for the four stocks

`summary(prices[,-1])`

```
## ge nok face sp
## Min. : 63.72 Min. :3.520 Min. :204.1 Min. :345.3
## 1st Qu.: 92.51 1st Qu.:4.060 1st Qu.:229.0 1st Qu.:380.1
## Median :103.44 Median :4.900 Median :261.8 Median :414.1
## Mean : 99.45 Mean :4.873 Mean :262.3 Mean :408.0
## 3rd Qu.:105.70 3rd Qu.:5.670 3rd Qu.:293.1 3rd Qu.:435.4
## Max. :114.62 Max. :6.290 Max. :324.1 Max. :464.7
```

Now we want to study the relationship between the four shares. To this end we compute the Variance-Covariance matrix and the correlation matrix.

Covariance

`cov(prices[,-1])`

```
## ge nok face sp
## ge 96.596734 4.2184380 172.14966 229.90921
## nok 4.218438 0.6668425 24.80614 23.43198
## face 172.149662 24.8061397 1142.65856 957.33674
## sp 229.909207 23.4319836 957.33674 1007.06617
```

Correlation

`cor(prices[,-1])`

```
## ge nok face sp
## ge 1.0000000 0.5256042 0.5181634 0.7371335
## nok 0.5256042 1.0000000 0.8986474 0.9042081
## face 0.5181634 0.8986474 1.0000000 0.8924368
## sp 0.7371335 0.9042081 0.8924368 1.0000000
```

Moreover we depict a pairs-plot of the prices of the four stocks

```
par(mfrow=c(1,1),mar=c(3,3,1,1),mgp=c(1.75,.75,0))
pairs(prices[,-1], main="Pairs of the Prices")
```

From the pairs plot we can appreciate the dependence between the prices, nevetheless, the dependence is not linear. So the correlation matrix is not properly interpretable, indeed the entries of the matrix measure the linear dependence between the corresponding pair of random variables.

### 8.5.1 The log Returns

Now we want to compute the log-returns of the four assets. We First compute the log prices

`<- log(prices[,-1]) logprices`

Then we apply the function `diff`

to all the
column of `logprices`

to obtain the log-returns

`<- apply(logprices,2,diff) logRet`

We can give a look at the four time-series logreturns

```
par(mfrow=c(2,2))
for(i in 1:4){
matplot(prices$Date[-1],logRet[,i],type="l",
main=paste("log returns of",names(prices)[i+1]),col=i+1,lwd=2)
}
```

they look much more stationary than the prices.

Nevertheless the logReturns ar not really Normal distributed, as we can see from the histogram

```
par(mfrow=c(2,2))
for(i in 1:4){
hist(logRet[,i],
main=paste("hist log-ret of",names(prices)[i+1]),
col=i+1,lwd=2,freq=F)
}
```

From the Normal probability plots

```
par(mfrow=c(2,2))
for(i in 1:4){
qqnorm(logRet[,i],main=paste("qqnorm log-ret of",names(prices)[i+1]),
col=i+1,lwd=2,datax = T)
qqline(logRet[,i],datax=T,lty=2,lwd=2)
}
```

Let’s now give a look at the summary (i.e. at the mariginal statistics), at the covariance and at the correlation matrix of the log-Returns

Summary

`summary(logRet)`

```
## ge nok face
## Min. :-0.0768403 Min. :-0.090416 Min. :-0.0668334
## 1st Qu.:-0.0110392 1st Qu.:-0.010124 1st Qu.:-0.0086579
## Median : 0.0003787 Median : 0.000000 Median : 0.0007323
## Mean : 0.0020680 Mean : 0.001986 Mean : 0.0005124
## 3rd Qu.: 0.0136856 3rd Qu.: 0.011678 3rd Qu.: 0.0097723
## Max. : 0.0749904 Max. : 0.143894 Max. : 0.0611787
## sp
## Min. :-0.024744
## 1st Qu.:-0.003277
## Median : 0.001519
## Mean : 0.001207
## 3rd Qu.: 0.006059
## Max. : 0.023951
```

Covariance

`cov(logRet)`

```
## ge nok face sp
## ge 4.499589e-04 7.093867e-05 -2.836971e-05 6.031814e-05
## nok 7.093867e-05 5.694915e-04 6.712406e-05 6.684402e-05
## face -2.836971e-05 6.712406e-05 3.515832e-04 5.875652e-05
## sp 6.031814e-05 6.684402e-05 5.875652e-05 6.015415e-05
```

Correlation

`cor(logRet)`

```
## ge nok face sp
## ge 1.00000000 0.1401371 -0.07132708 0.3666306
## nok 0.14013706 1.0000000 0.15001010 0.3611488
## face -0.07132708 0.1500101 1.00000000 0.4040259
## sp 0.36663056 0.3611488 0.40402587 1.0000000
```

To better understand the correlation matrix we can produce a pairs plot, that is a scatter-plot between all the pairs of variables in the data.frame

Paris of the log-returns

```
par(mfrow=c(1,1),mar=c(3,3,1,1),mgp=c(1.75,.75,0))
pairs(logRet, main="Pairs of the log-returns")
```

The sequences of log-returns are weekly correlated. Indeed, we can appreciate a week linear trend between the log-returns in the by looking at the pairs plot.

## 8.6 Portfolio

Suppose that at the 5\(^{th}\) of November 2020 we had around 2000$ that we want to invest. In particular we want to invest on Facebook, General Electric and Nokia.

Let’s start by looking at the prices of the three stocks at the “2020-11-05”

```
<- prices[1,c("face","ge","nok")]
price_port price_port
```

```
## face ge nok
## 1 250.1 63.71612 3.54
```

Suppose now we want to differentiate our investment by putting 1000 $ on Facebook, 700$ on General Electric, and 300$ on Nokia.

So the budget is split as follow

`<- c(1000,700,300) budget`

Now we have to compute how many share of each stock we can buy with our budget

```
<- round(budget/price_port,0)
shares shares
```

```
## face ge nok
## 1 4 11 85
```

Indeed the total money I have to invest is

```
=sum(shares*price_port)
total_inve total_inve
```

`## [1] 2002.177`

We have got a portfolio. I want to compute how much each stock weight in it. That is the weigth of the value of each stock compared with the total money that I have invested:

```
=as.numeric((shares*price_port)/total_inve)
w w
```

`## [1] 0.4996561 0.3500576 0.1502864`

After observing the prices of the three stocks for one year, I cam compute the mean of the daily log return of the portfolio.

```
<- t(w)%*%colMeans(logRet[,c("face","ge","nok")])
mean_por mean_por
```

```
## [,1]
## [1,] 0.001278404
```

Variance of log-returns in the portfolio

```
<-t(w)%*%cov(logRet[,c("face","ge","nok")])%*%w
var_por var_por
```

```
## [,1]
## [1,] 0.0001633963
```

Coefficient of variation

`<- sqrt(var_por)/mean_por cv_por`

## 8.7 Exercises Lab 6

### 8.7.1 Exercise 1

The daily time series of prices for the ENI asset are available in the file named *E.csv* (from 2012-11-09 to 2017-11-09). Import the data.

Compute the gross returns (by using Adj.Close prices). Represent the gross returns distribution with an histogram. Comment the plot.

Test the normality of the gross returns distribution.

Fit the Normal and t distribution to the data. Provide the estimated parameters. Moreover, which is the best model by using the deviance, the AIC and BIC?

Provide the KDE plot. Moreover, add to the plot the Normal and t PDFs whose parameters have been estimated at the previous point.

Which is the value of the fitted Normal and t PDFs for \(x=0.9\)? Are they similar?

### 8.7.2 Exercise 2

Simulate 1000 values from the logNormal distribution with with parameters equal to 15 (meanlog) and 0.5 (varlog). Use the value 789 as seed.

Plot the data using the boxplot and the normal probability plot.

Fit to the data a distribution. You have to choose which theoretical distribution, but knowing where the data come from it’s easy. See also the available distributions in

`?fitdistr`

. Which are the estimated parameters?Plot the KDE of the simulated data together with the fitted PDF. To compute the latter, for a reasonable sequence of \(x\) values, you will have to use a function in R whose names starts with

`d...`

.Fit to the data the Exponential distribution (see here. You will have to use

`densfun=exponential`

in`fitdistr`

. Provide the parameter estimates.In terms of deviance, AIC and BIC which is the best model?

### 8.7.3 Exercise 3

The file `prices_ex.cs`

contains the daily prices of three different stocks of the USA market: Google (GOG), Amazon (AMZN) and Tesla (TSLA). In particular, the series report one-year daily quotations (from 16/11/2020 to 15/11/2021). Import the data in R and comment each of the following point:

- Plot both the original time series in two different charts.
- Transform both time series computing the 1st and the 2nd difference and plot these new series in the same graph of point 1. Which kind of effect do you notice?
- Compute the log-returns for of the three time-series and plot the obtained series. Do you see the expected effect?
- Study the marginal distribution of log-returns via graphical representation. Comment on how the normal distribution could fits the data.
- Let’s focus on the sequence of log-returns of Amazon. Identify, using a plot, the two most extreme values. Then delete the day with the two most extreme values from the time series. Do you think that the Normal distribution is a suitable model for this reduced series of data. Give an answer using some test of hypothesis.
- Let’s focus now on the Google sequence. We want to investigate if the t-student could fit well this sequence. How many degree of freedom we should consider, use some plot to justify your answer and compare it with the ML estimation of this parameter.
- Plot a graph to investigate the joint distribution of the log-returns. Moreover, compute some index to assess the linear dependence between pairs of log-returns.
- Compute a basket composed of the three stocks having a budged of 20000 $
- Compute the mean and the variance of the daily log-return of your basket.