Chapter 5 Lab 3 - 23/10/2021
In this lecture we will learn how to compute jontly returns and how to study the their marginal distribution.
For this lecture we use the same set of data of Lab 2 included in the Adjcloseprices.csv file. We first proceed with the data import as described in Lab 2; the object prices
will be created:
= read.csv("~/Dropbox/UniBg/Didattica/Economia/2021-2022/PSBF_2122/R_LABS/Lab03/prices.csv", sep=",", dec=".")
prices class(prices)
## [1] "data.frame"
head(prices)
## Date A2Aadj ISPajd JUVEadj
## 1 10/01/17 0.982659 1.597666 0.288256
## 2 11/01/17 0.999368 1.595077 0.289546
## 3 12/01/17 1.012895 1.573067 0.288164
## 4 13/01/17 1.030400 1.596372 0.289546
## 5 16/01/17 1.021647 1.566594 0.284573
## 6 17/01/17 1.022443 1.566594 0.288256
str(prices)
## 'data.frame': 1213 obs. of 4 variables:
## $ Date : chr "10/01/17" "11/01/17" "12/01/17" "13/01/17" ...
## $ A2Aadj : num 0.983 0.999 1.013 1.03 1.022 ...
## $ ISPajd : num 1.6 1.6 1.57 1.6 1.57 ...
## $ JUVEadj: num 0.288 0.29 0.288 0.29 0.285 ...
Note that in the previous code the entire path of the file is used; this will be obviously different in your case. If you are not sure how to specify the path, use the RStudio feature to import data, as described in Section 4.1.
5.1 Compute returns jointly
We start by computing log returns using the definition provided in Section 4.5. One possibility is to perform the computation separately for each asset, starting for example with A2A:
= diff(log(prices$A2Aadj)) logretA2A
The output is a vector. This code should be repeated for all the remaining assets giving rise to a number of code lines equal to the number of considered assets. This could result in a long list of code lines when the number of assets is large. An alternative is to compute the log returns jointly for all the available assets by using the apply
function introduced in Section 4.3. In particular, we first compute log prices by applying the log
function column by column (the second argument is set equal to 2); note that we exclude the first column of prices as it contains dates:
= apply(prices[,-1],2,log)
logprices head(logprices)
## A2Aadj ISPajd JUVEadj
## [1,] -0.0174931163 0.4685438 -1.243906
## [2,] -0.0006321998 0.4669220 -1.239441
## [3,] 0.0128125674 0.4530272 -1.244226
## [4,] 0.0299470764 0.4677336 -1.239441
## [5,] 0.0214160309 0.4489038 -1.256765
## [6,] 0.0221948617 0.4489038 -1.243906
We then compute the differences of the log prices by means of the diff
function:
= apply(logprices, 2, diff)
logret head(logret)
## A2Aadj ISPajd JUVEadj
## [1,] 0.0168609165 -0.001621803 0.004465205
## [2,] 0.0134447672 -0.013894794 -0.004784417
## [3,] 0.0171345090 0.014706338 0.004784417
## [4,] -0.0085310454 -0.018829719 -0.017324368
## [5,] 0.0007788307 0.000000000 0.012859163
## [6,] 0.0092950844 -0.002482982 0.000000000
We check which kind of object logret
is
class(logret)
## [1] "matrix" "array"
and we realize that it is not a data frame but rather a matrix. As we prefer working with data frame we transform the matrix into a data frame as follows:
= data.frame(logret)
logret class(logret)
## [1] "data.frame"
5.2 Boxplot
We aim to use boxplots to study and compare the available log returns distributions. It is possible to obtain a plot containing the boxplots of all the assets by using the following code:
boxplot(logret,
horizontal = TRUE,
col = rainbow(ncol(logret)))
We note that JUVE is characterized by higher variability and a larger number of extreme events. Moreover, for all the assets the median log return value is close to zero, the whiskers are symmetric and the median is placed in the middle of the box.
If you want to define the whiskers by using the minimum and maximum value use the range
option of the boxplot
function (see ?boxplot
); in this case outliers are not shown:
boxplot(logret, range = 0,
horizontal = TRUE,
col = rainbow(ncol(logret)))
5.3 Histogram and Kernel Density Estimator
The A2A log returns can be represented graphically by using the histogram:
hist(logret$A2Aadj)
The plot represents the values of the log returns on the x-axis (with classes of values set automatically by R) and the corresponding frequencies (i.e. number of days) on the y-axis. From the histogram a very long tail on the left is clearly visible.
By using the breaks
arguments it is possible to specify a different number of classes of values. If the breaks
parameter is a single integer value the hist
function will use a number of bins as close as possible of the specified value of breaks
.
hist(logret$A2Aadj, breaks=3)
hist(logret$A2Aadj, breaks=20)
Note that the classes are equally sized. It is also possible to set the limits of the classes by providing to breaks
a vector of values:
hist(logret$A2Aadj,
breaks = c(-0.25,0, 0.01, 0.15))
In this case the used class are \([-0.25,0]\),\((0,0.01]\) and \((0.01,0.15]\).
The plot of the Kernel Density Estimator (KDE) can be obtained by using the plot
and density
functions:
plot(density(logret$A2Aadj))
Note that by default R sets the best value for the bandwidth parameter (see the bottom of the plot). In order to study the effect of the bandwidth on the final KDE, we double its value by using the adjust
argument:
plot(density(logret$A2Aadj,adjust=2))
The multiplicative constant specified with adjust
is applied to the original value set automatically by R. When the bandwidth is doubled the KDE line is smoother. If instead the bandwidth is halved:
plot(density(logret$A2Aadj,adjust=0.5))
the final KDE is characterized by more bumps, is less smooth and shows more fine details of the distribution.
It is also possible to plot together the histogram and the KDE line:
hist(logret$A2Aadj, freq = F, ylim = c(0,35))
lines(density(logret$A2Aadj),
col = "red",
lwd = 1.5)
Note that the argument freq=F
is used to represent densities instead of frequencies in the y-axis. In this way th histogram and the KDE are comparable and can be plotted together. Finally, note that in the density histogram the area of each single bar is represented by the absolute frequency of the class of values.
5.4 Sample skewness and kurtosis index
In order to evaluate the asymmetry of the distribution it is possible to compute the skewness index given by this formula: \[ sk = \frac{\frac{\sum_{i=1}^n (x_i-\bar x)^3}{n}}{\sigma^3} \] which is implemented in R as follows
mean((logret$A2Aadj - mean(logret$A2Aadj))^3) /
sd(logret$A2Aadj)^3
## [1] -2.206542
where the function sd
computes the standard deviation. The skewness index is negative and suggests a negative skewness (long left tail).
The sample kurtosis instead is given by \[ sk = \frac{\frac{\sum_{i=1}^n (x_i-\bar x)^4}{n}}{\sigma^4} \] and can be computed as follows:
mean((logret$A2Aadj - mean(logret$A2Aadj))^4) /
sd(logret$A2Aadj)^4
## [1] 31.29707
The kurtosis value is extremely bigger than three, meaning that the distribution is characterized by excess of kurtosis (with respect to the normal distribution).
5.5 Remove rows with missing data
We now use a new set of data available in the posteit.csv file. We first proceed with the data import; the object poste
will be created:
= read.csv("~/Dropbox/UniBg/Didattica/Economia/2021-2022/PSBF_2122/R_LABS/Lab03/posteit.csv", sep=";", dec=".")
poste head(poste)
## Date PSTadj
## 1 2017-01-10 4.706218
## 2 2017-01-11 4.717724
## 3 2017-01-12 4.679369
## 4 2017-01-13 4.729231
## 5 2017-01-16 4.652520
## 6 2017-01-17 4.641014
The function summary
can be applied to the data frame in order to get the main summary statistics for all the variables included in the data frame:
summary(poste)
## Date PSTadj
## Length:1213 Min. : 4.461
## Class :character 1st Qu.: 5.518
## Mode :character Median : 7.213
## Mean : 7.414
## 3rd Qu.: 8.966
## Max. :12.555
## NA's :3
It is worth to note that the PSTadj
time series have 3 missing values (NA
), i.e. three days for which prices are not available.
In order to find the missing values, we will use the is.na
function. It returns, for each element in a vector, TRUE
or FALSE
according if the value is missing or not. See for example what happens with the first 20 values:
head(is.na(poste$PSTadj), n=20)
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [13] FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
The following command
sum(is.na(poste$PSTadj))
## [1] 3
returns the number of missing values in the vector.
We want to create a new data frame containing only the non missing values. Remembering what was done in Lab. 1, we need to set a proper condition equal to TRUE
if the value is not missing. We thus need to use the complement of the is.na
function obtained using the !
operator and the squared parenthesis with the 2 indexes for the selection (all the columns and only the rows satisfying the condition):
= poste[!is.na(poste$PSTadj),] poste2
We want now instead to substitie in poste
the missing value with the median computed using the non missing value. For computing the median we have to specify the option na.rm = TRUE
, otherwise the function will return NA
:
median(poste$PSTadj, na.rm = TRUE)
## [1] 7.212518
Finally, the substitution:
$PSTadj[is.na(poste$PSTadj)] = median(poste$PSTadj, na.rm = TRUE) poste
Note that now there are no more missing values:
summary(poste)
## Date PSTadj
## Length:1213 Min. : 4.461
## Class :character 1st Qu.: 5.520
## Mode :character Median : 7.213
## Mean : 7.413
## 3rd Qu.: 8.965
## Max. :12.555
5.6 Exercises Lab 3
5.6.1 Exercise 1
The file AdjclosepricesLab3.csv contains the daily adjusted close prices observed from 01-09-2016 to 27-08-2020 for the following indexes: NASDAQ Composite (IXIC), Dow Jones Industrial Average (DJI), Gold (Gold) and CBOE Volatility Index (VIX). Data source: Yahoo Finance.
- Import the data in R and check the structure of the data. Are there any missing values?
- Compute for the 4 assets the gross returns by using less code as possible. Create a new data frame containing all the net returns.
- Provide altogether the boxplots of the 4 distributions (one single plot is required). Comment the plot.
- Provide now the same plot but removing VIX. What do you observe? Hint: for not selecting one column of the data frame it is possible to use the - symbol; e.g.
mydata[,-1]
for omitting the first column ofmydata
. - Represent in the same plot the KDE and the histogram for IXIC. Comment the plot.
- Represent in the same plot the 4 KDEs for the gross return time series. Comment the plot.
- Compute for the gross return distributions the sample skeweness. Comment the values.
- Compute for the gross return distributions the sample kurtosis Comment the values.