Chapter 4 Descriptive statistics

4.1 Distribution of a single feature

In capital market modeling, descriptive statistics tools are needed to describe the development of past rates of return. In turn, based on past data, one can infer the distribution of probable rates of return in the future. Below, we show how to characterize the distribution of a single rate of return; the next section presents measures of co-variation.

4.1.1 Averages

The arithmetic average is the simplest and most basic tool for describing the distribution of rates of return.

The average simple net rate of return for periods \(1\) to \(n\) is:

\[\begin{equation} \bar{R} = \frac{\sum_{i=1}^n R_i}{n} \tag{4.1} \end{equation}\]

Similarly, the average total return:

\[\begin{equation} \bar{\boldsymbol{\mathcal{R}}} = \frac{\sum_{i=1}^n \boldsymbol{\mathcal{R}}_i}{n} = 1 + \bar{R} \tag{4.2} \end{equation}\]

and the average log rate:

\[\begin{equation} \bar{r} = \frac{\sum_{i=1}^n r_i}{n} \tag{4.3} \end{equation}\]

In spreadsheets, we use the AVERAGE function – Google, Excel. In R, we use the mean function.
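As a minimal sketch of formulas (4.1)–(4.3), the three averages can also be computed in Python using only the standard library. The return series below is hypothetical, and the log rate is assumed to be defined as \(r_i = \ln(1+R_i)\), consistent with formula (4.7).

```python
import math
from statistics import mean

# Hypothetical simple net returns for four periods (illustrative only).
simple_returns = [0.10, -0.05, 0.20, 0.03]

# Total returns are 1 + R_i; log returns are assumed to be ln(1 + R_i).
total_returns = [1 + R for R in simple_returns]
log_returns = [math.log(1 + R) for R in simple_returns]

mean_simple = mean(simple_returns)  # formula (4.1)
mean_total = mean(total_returns)    # formula (4.2), equals 1 + mean_simple
mean_log = mean(log_returns)        # formula (4.3)

print(mean_simple, mean_total, mean_log)
```

For any series that is not constant, the average log rate is lower than the average simple rate, because \(\ln(1+R) \le R\).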

4.1.2 Average geometric return, CAGR

General definition of the geometric mean for \(n\) positive numbers \(x_1, \ldots, x_n\):

\[\begin{equation} G = \sqrt[n]{\prod_{i=1}^n x_i} = \sqrt[n]{x_1\cdot x_2 \cdots x_n} = (x_1\cdot x_2 \cdots x_n)^{1/n} \tag{4.4} \end{equation}\]

Rates of return can be negative, so the general definition does not apply. In financial mathematics and practice, the term “geometric average rate of return” is used in a slightly different meaning – the geometric mean of the total returns over consecutive periods, minus 1 (DeFusco et al. 2007):

\[\begin{equation} R_G = \left(\prod_{i=1}^t \boldsymbol{\mathcal{R}}_i\right)^{\frac{1}{t}} - 1 = \left[\prod_{i=1}^t (1+R_i)\right]^{\frac{1}{t}} - 1 \tag{4.5} \end{equation}\]

In the absence of dividends:

\[\begin{equation} R_G = \left(\frac{P_t}{P_0} \right)^{\frac{1}{t}} - 1 \tag{4.6} \end{equation}\]

As you can see, when there are no dividends and the resulting average rate is annual, the net geometric mean return can be identified with the compound annual growth rate (CAGR).

For logarithmic rates of return, we do not calculate the geometric mean. However, because \(r_i = \ln(1+R_i)\), the average log rate is the logarithm of the geometric mean of the total returns, so that:

\[\begin{equation} R_G = e^{\bar{r}}-1 \tag{4.7} \end{equation}\]

There is no ready-made formula for CAGR in spreadsheets, but the GEOMEAN – Google, Excel and PRODUCT – Google, Excel functions, applied to the total returns \(1+R_i\), can be of some help.
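The three routes to the geometric average return, formulas (4.5), (4.6) and (4.7), can be checked against each other with a short Python sketch. The returns and the starting price of 100 are hypothetical, and no dividends are assumed, so that the price path follows directly from the returns.

```python
import math

# Hypothetical simple net returns for four periods (illustrative only).
simple_returns = [0.10, -0.05, 0.20, 0.03]
t = len(simple_returns)

# Formula (4.5): geometric mean of total returns, minus 1.
growth = math.prod(1 + R for R in simple_returns)
r_g = growth ** (1 / t) - 1

# Formula (4.6): the same result from the first and last price (no dividends).
p_0 = 100.0            # assumed starting price
p_t = p_0 * growth     # implied final price
r_g_prices = (p_t / p_0) ** (1 / t) - 1

# Formula (4.7): the same result from the average log rate.
mean_log = sum(math.log(1 + R) for R in simple_returns) / t
r_g_logs = math.exp(mean_log) - 1

print(r_g, r_g_prices, r_g_logs)
```

All three values agree up to floating-point rounding, and the geometric average is below the arithmetic average of the same returns.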

4.1.3 Variance and standard deviation

The formula for variance comes in two versions: one referred to as the population formula and the other as the sample formula.

The formula for the variance for a population of size \(N\), using simple net rates of return as an example:

\[\begin{equation} \sigma^2_R = \frac{\sum_{i=1}^N \left(R_i-\bar{R}\right)^2}{N} \tag{4.8} \end{equation}\]

For the variance for a sample of size \(n\):

\[\begin{equation} s^2_R = \frac{\sum_{i=1}^n \left(R_i-\bar{R}\right)^2}{n-1} \tag{4.9} \end{equation}\]

Since we will often treat the determined historical data as a sample allowing us to draw conclusions about the underlying processes and, possibly, the future, we will use formula (4.9).

In spreadsheets we use the VAR function – Google, Excel, equivalent to VAR.S – Google, Excel. In R, sample variance can be calculated using the var function.

To calculate the variance in the population, you can use the VARP/VAR.P function – Google, Excel. Population variance is not easily available in R’s popular packages. One can use the varpop function in the radiant.data package.

The standard deviation is the square root of the variance. Similarly, we have the formula for the population and for the sample:

\[\begin{equation} \sigma_R = \sqrt{\sigma^2_R} = \sqrt{\frac{\sum_{i=1}^N \left(R_i-\bar{R}\right)^2}{N}} \tag{4.10} \end{equation}\]

\[\begin{equation} s_R = \sqrt{s^2_R} = \sqrt{\frac{\sum_{i=1}^n \left(R_i-\bar{R}\right)^2}{n-1}} \tag{4.11} \end{equation}\]

In spreadsheets, the standard deviation for a sample is determined using the STDEV function – Google, Excel or (equivalently) STDEV.S – Google, Excel. In R, sample standard deviation is calculated with the sd function.

The standard deviation for a population can be calculated using the STDEVP – Google, Excel or STDEV.P function – Google, Excel. In R, one can use the sdpop function in the radiant.data package.
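For comparison, Python's standard statistics module provides both versions directly: pvariance and pstdev correspond to the population formulas (4.8) and (4.10), while variance and stdev correspond to the sample formulas (4.9) and (4.11). The sketch below uses a hypothetical return series.

```python
from statistics import pstdev, pvariance, stdev, variance

# Hypothetical simple net returns for four periods (illustrative only).
simple_returns = [0.10, -0.05, 0.20, 0.03]
n = len(simple_returns)

var_pop = pvariance(simple_returns)    # formula (4.8), divisor n
var_sample = variance(simple_returns)  # formula (4.9), divisor n - 1
sd_pop = pstdev(simple_returns)        # formula (4.10)
sd_sample = stdev(simple_returns)      # formula (4.11)

print(var_pop, var_sample, sd_pop, sd_sample)
```

The two variances differ only by the factor \(n/(n-1)\), so the sample version is always the larger one, and the gap shrinks as the number of observations grows.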

4.1.4 Skewness / asymmetry

The moment-based asymmetry coefficient is calculated as follows:

\[\begin{equation} g_{1(R)} = \frac{1}{n}\sum_{i=1}^n\left(\frac{R_i-\bar{R}}{\sigma_R}\right)^3 \tag{4.12} \end{equation}\]

The above formula can be considered analogous to the formulas for the population variance and standard deviation.

The modified asymmetry coefficient has the following formula:

\[\begin{equation} G_{1(R)} = \frac{\sqrt{n(n-1)}}{n-2}g_{1(R)} \tag{4.13} \end{equation}\]

Sometimes a formula similar to (4.12) is used, but with the sample standard deviation:

\[\begin{equation} b_{1(R)} = \frac{1}{n}\sum_{i=1}^n\left(\frac{R_i-\bar{R}}{s_R}\right)^3 \tag{4.14} \end{equation}\]

The skewness coefficient indicates which tail of the distribution is more drawn out: the left one, below the mean (negative coefficient values), or the right one, above the mean (positive values). In the case of rates of return, a negative coefficient may indicate a greater probability of substantial losses, while a positive coefficient indicates a greater probability of substantial gains.

In spreadsheets, the modified asymmetry coefficient (\(G_1\)) can be calculated using the SKEW function – Google, Excel, while \(g_1\) can be obtained with SKEW.P – Google, Excel. In R, the skewness function from the e1071 package can be used. By default, it returns the \(b_1\) version. Other versions are available with the type option: skewness(data, type=1) returns \(g_1\) and skewness(data, type=2) returns \(G_1\).
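To make the relationship between the three versions concrete, the following Python sketch implements formulas (4.12)–(4.14) directly from the central moments. The function name skewness_versions and the data are hypothetical; this is an illustration of the formulas, not a reproduction of any package's code.

```python
import math

def skewness_versions(data):
    """Return (g1, G1, b1) as in formulas (4.12), (4.13) and (4.14)."""
    n = len(data)
    if n < 3:
        raise ValueError("at least three observations are required")
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n  # population variance
    m3 = sum((x - mean) ** 3 for x in data) / n  # third central moment
    s2 = m2 * n / (n - 1)                        # sample variance
    g1 = m3 / m2 ** 1.5                          # formula (4.12)
    G1 = math.sqrt(n * (n - 1)) / (n - 2) * g1   # formula (4.13)
    b1 = m3 / s2 ** 1.5                          # formula (4.14)
    return g1, G1, b1

# Hypothetical simple net returns (illustrative only).
simple_returns = [0.10, -0.05, 0.20, 0.03, -0.12, 0.08]
g1, G1, b1 = skewness_versions(simple_returns)
print(g1, G1, b1)
```

All three coefficients share the same sign and differ only by factors that depend on \(n\); in particular \(b_1 = g_1\left(\frac{n-1}{n}\right)^{3/2}\).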

4.1.5 Kurtosis

The excess kurtosis coefficient has the following formula:

\[\begin{equation} g_{2(R)} = \frac{1}{n}\sum_{i=1}^n\left(\frac{R_i-\bar{R}}{\sigma_R}\right)^4-3 \tag{4.15} \end{equation}\]

The above formula can be treated as a formula for the population. The formula for the sample is most often as follows:

\[\begin{equation} G_{2(R)} = \frac{n-1}{(n-2)(n-3)}\left[(n+1)g_{2(R)}+6\right] \tag{4.16} \end{equation}\]

Some statistical packages also use the following formula:

\[\begin{equation} b_{2(R)} = \frac{1}{n}\sum_{i=1}^n\left(\frac{R_i-\bar{R}}{s_R}\right)^4-3 \tag{4.17} \end{equation}\]

In spreadsheets, the KURT function – Google, Excel calculates the coefficient according to formula (4.16). Kurtosis versions \(g_2\) and \(b_2\) can be calculated using AVERAGE, STDEV and array formulas. In R, kurtosis can be calculated using the kurtosis function from the e1071 package. By default, it returns the \(b_2\) version. However, one can obtain other versions with the type option: kurtosis(data, type=1) returns \(g_2\) and kurtosis(data, type=2) returns \(G_2\).

Excess (kurtosis) quantifies the extremity of tails of the distribution compared to the Gaussian (“normal”) distribution.
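The three excess kurtosis versions in formulas (4.15)–(4.17) can be sketched in Python in the same way. The function name kurtosis_versions is hypothetical, and the input is a toy data set chosen so that the results are easy to verify by hand.

```python
def kurtosis_versions(data):
    """Return (g2, G2, b2) as in formulas (4.15), (4.16) and (4.17)."""
    n = len(data)
    if n < 4:
        raise ValueError("at least four observations are required")
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n  # population variance
    m4 = sum((x - mean) ** 4 for x in data) / n  # fourth central moment
    s2 = m2 * n / (n - 1)                        # sample variance
    g2 = m4 / m2 ** 2 - 3                        # formula (4.15)
    G2 = (n - 1) / ((n - 2) * (n - 3)) * ((n + 1) * g2 + 6)  # formula (4.16)
    b2 = m4 / s2 ** 2 - 3                        # formula (4.17)
    return g2, G2, b2

# Toy data: five equally spaced values (illustrative only).
g2, G2, b2 = kurtosis_versions([1.0, 2.0, 3.0, 4.0, 5.0])
print(g2, G2, b2)
```

For this evenly spread data all three values are negative, which reflects tails that are lighter than those of the normal distribution.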

4.2 Measuring association

4.2.1 Covariance

The covariance formula for a population of size \(N\), for two data series (e.g. rates of return) \(R_i\) and \(S_i\):

\[\begin{equation} \sigma_{RS} = \frac{\sum_{i=1}^N \left(R_i-\bar{R}\right)\left(S_i-\bar{S}\right)}{N} \tag{4.18} \end{equation}\]

This is the “population” formula.

In spreadsheets, the COVAR function can be used – Google, Excel, which is equivalent to COVARIANCE.P – Google, Excel.

The analogous sample formula is as follows:

\[\begin{equation} s_{RS} = \frac{\sum_{i=1}^n \left(R_i-\bar{R}\right)\left(S_i-\bar{S}\right)}{n-1} \tag{4.19} \end{equation}\]

In spreadsheets we use the COVARIANCE.S function – Google, Excel.
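A minimal Python sketch of formulas (4.18) and (4.19) is given below. The function name covariance and its sample flag are hypothetical, and the two return series are made up for illustration.

```python
def covariance(x, y, sample=True):
    """Covariance of two equally long series.

    sample=True uses divisor n - 1 (formula 4.19);
    sample=False uses divisor n (formula 4.18).
    """
    if len(x) != len(y):
        raise ValueError("both series must have the same length")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return cross / (n - 1 if sample else n)

# Hypothetical simple net returns of two assets (illustrative only).
returns_r = [0.10, -0.05, 0.20, 0.03]
returns_s = [0.04, -0.02, 0.09, 0.05]

cov_pop = covariance(returns_r, returns_s, sample=False)
cov_sample = covariance(returns_r, returns_s, sample=True)
print(cov_pop, cov_sample)
```

As with the variance, the two versions differ only by the factor \(n/(n-1)\), and the covariance of a series with itself reduces to its variance.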

4.2.2 Correlation

Correlation is the standardized covariance. It can be calculated using the population covariance/variance formulas:

\[\begin{equation} \rho_{RS} = \frac{\sigma_{RS}}{\sigma_R \sigma_S} \tag{4.20} \end{equation}\]

as well as the sample formulas:

\[\begin{equation} r_{RS} = \frac{s_{RS}}{s_R s_S} \tag{4.21} \end{equation}\]

Whether we use formula (4.20) or (4.21), the result is the same for a given data set, because the divisors (\(N\) or \(n-1\)) cancel between the numerator and the denominator. In spreadsheets we use the CORREL function – Google, Excel or the identical PEARSON – Google, Excel.
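The equality of formulas (4.20) and (4.21) can be confirmed numerically. In the Python sketch below, the function name correlation and its sample flag are hypothetical, and the return series are made up for illustration.

```python
import math

def correlation(x, y, sample=True):
    """Pearson correlation via formula (4.21) if sample=True, else (4.20)."""
    if len(x) != len(y):
        raise ValueError("both series must have the same length")
    n = len(x)
    divisor = n - 1 if sample else n
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / divisor
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / divisor)
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / divisor)
    return cov / (sd_x * sd_y)

# Hypothetical simple net returns of two assets (illustrative only).
returns_r = [0.10, -0.05, 0.20, 0.03]
returns_s = [0.04, -0.02, 0.09, 0.05]

rho_pop = correlation(returns_r, returns_s, sample=False)
rho_sample = correlation(returns_r, returns_s, sample=True)
print(rho_pop, rho_sample)
```

Both calls return the same value up to floating-point rounding, and the result always lies between -1 and 1.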

4.3 Exercises

Exercise 4.1 (Linton 2019) In 2017, Warren Buffett predicted that the Dow Jones would hit the level of one million within a hundred years. Given that the DJIA was worth 22,375 at the time he made this prediction, what was his expected annual rate of return?

Exercise 4.2 SBBI is a report prepared by the CFA Institute that shows U.S. market rates of return for large-cap (large capitalization) stocks, small-cap stocks[1], long-term corporate bonds, long-term government bonds, medium-term government bonds, short-term U.S. treasury bills, and inflation.

Annual (simple) rates of return for the years 1926-2015 are available at the following link: https://docs.google.com/spreadsheets/d/1rfE78O1POHpQl15AQU2gtp15cV7h9JLNeAkRN66TNqE/edit?usp=sharing

For each of these asset groups, determine, for the period 1926-2015:

  1. simple average net rate of return

  2. “geometric” average rate of return (CAGR)

  3. standard deviation of simple rates

  4. skewness of simple rates

  5. kurtosis of simple rates

  6. average log rate of return

  7. standard deviation of log returns

  8. skewness of log returns

  9. kurtosis of log returns

Exercise 4.3 For a selected company, download data from the Yahoo database and calculate the statistics specified in the previous exercise for daily and monthly returns. Use adjusted close prices.

Exercise 4.4 For a selected stock, check whether the daily returns that include the weekend are, on average, higher than the daily returns on other days of the week. Is the standard deviation of the through-the-weekend daily returns higher?

Exercise 4.5 Check the correlation between the returns of two selected companies based on (a) monthly (b) daily net returns.

References

DeFusco, Richard A., Dennis W. McLeavey, Jerald E. Pinto, David E. Runkle, and Mark J. P. Anson. 2007. Quantitative Investment Analysis. Hoboken, N.J.: Wiley.
Linton, Oliver. 2019. Financial Econometrics: Models and Methods. Cambridge: Cambridge University Press.

  1. They are “small-cap” but they are on the stock exchange, so they are still quite large!