Chapter 2 Forecasting tomorrow’s market move: difficult binary classification

2.1 Purpose of analysis

Our data in this chapter are the daily closing values of the FTSE 100 index, observed from 22 October 1992 to 19 October 2020. Our purpose is to predict the direction of tomorrow’s market movement (so the sign of the difference between tomorrow’s close and today’s close) from the recent history of closing prices, including today’s. This is an extremely difficult example of binary classification, and the Efficient Market Hypothesis can be interpreted to mean that it is fundamentally impossible to beat, in this task, the constant predictor \(\equiv 1\), i.e. one that always predicts a positive market move tomorrow.

The dataset, in CSV format (the file FT-SE100.csv used below), is available for download.

2.2 Preparing our data objects

We first create a few useful objects.

ftse <- read.csv("FT-SE100.csv")       # daily FTSE 100 data
ftse.cls <- ftse$X.CLOSE               # daily closing values
n <- length(ftse.cls)                  # sample size
ftse.ret <- (ftse.cls[2:n] - ftse.cls[1:(n-1)]) / ftse.cls[1:(n-1)]   # daily returns
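
As a quick optional sanity check (not used later), the same returns can be obtained via diff:

all.equal(ftse.ret, diff(ftse.cls) / ftse.cls[1:(n-1)])   # should return TRUE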

In read.csv, you may have to provide your own path to where the FT-SE100.csv file is located. The daily closing values are in ftse.cls and the daily returns in ftse.ret. We now define a few functions that will be used to construct different regression problems involving ftse.ret and ftse.cls.

ret_matrix <- function(x.ret, lag) {
  # x.ret - return time series
  # lag - how many predictive lags to include in x?
  n <- length(x.ret)
  y <- x.ret[(1+lag):n]
  x <- matrix(0, n-lag, lag)
  for (i in 1:lag) x[,i] <- x.ret[(1+lag-i):(n-i)]
  list(x=x, y=y)
}

ret_matrix_mscale <- function(x.ret, scales = c(1, 2, 5, 10)) {
  # x.ret - return time series
  # scales - spans for averages of past returns, as predictive features in x
  lag <- max(scales)
  n <- length(x.ret)
  m <- ret_matrix(x.ret, lag)
  p <- length(scales)
  x <- matrix(0, n-lag, p)
  for (i in 1:p) x[,i] <- rowMeans(m$x[, 1:scales[i], drop = FALSE])
  list(x=x, y=m$y)
}

pat_recogn_matrix <- function(x.cls, lag, std.norm = FALSE) {
  # x.cls - time series of daily closes
  # lag - what length of the past to include as predictive features in x?
  # std.norm - standardise each predictive path of length lag to have sample variance one?
  n <- length(x.cls)
  y <- (x.cls[(lag+1):n] - x.cls[lag:(n-1)]) / x.cls[lag:(n-1)]
  x <- matrix(0, n-lag, lag)
  for (i in 1:(n-lag)) {
    x[i,] <- x.cls[i:(i+lag-1)] - mean(x.cls[i:(i+lag-1)])
    if (std.norm) x[i,] <- x[i,] / sqrt(var(x[i,]))
  }
  list(x=x, y=y)
}

ret_matrix, short for “return matrix,” takes a return time series (such as ftse.ret) and produces a response vector y containing the returns on each day from lag+1 to the end (day n), together with the corresponding predictor matrix x. Take any row of x and the element of y in the same position. If that element of y contains the return on day \(t\), then the corresponding row of x contains the returns on days \(t-1, t-2, \ldots, t-\)lag. In other words, the predictors for tomorrow’s return are the returns on lag consecutive days, ending today. The returned matrix x is a tall matrix, i.e. time progresses down its columns.
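
To make this alignment concrete, here is a tiny illustration with made-up returns (the numbers are arbitrary and serve only to show which past returns end up in which row of x):

toy <- c(0.01, -0.02, 0.03, 0.01, -0.01)   # returns on days 1 to 5
m <- ret_matrix(toy, lag = 2)
m$y       # 0.03  0.01 -0.01: the returns on days 3, 4 and 5
m$x[1,]   # -0.02  0.01: the returns on days 2 and 1, i.e. the two days preceding day 3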

ret_matrix_mscale (where mscale is short for “multiscale”) is similar to ret_matrix, but the predictors are not returns on individual days. Instead, they are average returns on consecutive days ending today, where the averages are computed over the numbers of days specified in scales. For example, if scales is c(1, 2, 5, 10) (the default), then the predictors for tomorrow’s return are: today’s return (scale 1), the average of today’s and yesterday’s returns (scale 2), the average return over the past trading week ending today (scale 5), and the same for the past two trading weeks (scale 10). For more on multiscale time series models like this one, see Baranowski and Fryzlewicz (2020).
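
Continuing the made-up toy series above, the scale-2 feature is the average of the two most recent returns:

m2 <- ret_matrix_mscale(toy, scales = c(1, 2))
m2$x[1,]   # -0.020 -0.005: today's return and mean(c(-0.02, 0.01))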

pat_recogn_matrix (short for “pattern recognition”) is different in that it takes the closing values (such as ftse.cls) as input, rather than returns. The reason is that in this function, the predictors for tomorrow’s return are the closing prices on lag consecutive days, ending today. They are demeaned and, if std.norm is TRUE, also standardised to have sample variance one. The name of the function comes from the fact that this creates a sort of pattern recognition problem: which patterns in the recent closing prices tend to correspond to which returns tomorrow?

Let us now use these functions to create the three different regression problems.

ftse.mat <- ret_matrix(ftse.ret, 5)
ftse.mat.mscale <- ret_matrix_mscale(ftse.ret)
ftse.pat <- pat_recogn_matrix(ftse.cls, 10, TRUE)
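
A few quick sanity checks on these objects (purely illustrative, to see the shapes and the effect of the normalisation in pat_recogn_matrix):

dim(ftse.mat$x)                   # (n-1-5) rows, 5 columns
dim(ftse.mat.mscale$x)            # (n-1-10) rows, 4 columns
dim(ftse.pat$x)                   # (n-10) rows, 10 columns
round(mean(ftse.pat$x[1,]), 10)   # 0: each row has been demeaned
var(ftse.pat$x[1,])               # 1: std.norm = TRUE standardises each row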

2.3 Fitting the model and monitoring performance

Let us now define a function which takes a regression problem (e.g. any of the ones defined above), trains a Keras neural network model on a subset of it (the training set), and tests its performance on the next subset (the validation set). The task is not to predict the response itself, but the sign of the response around a certain measure of centrality. In what follows, 1 corresponds to a positive market move around that central measure, and 0 otherwise. The code below assumes that the keras R package has been loaded, e.g. via library(keras). We first define the function and then discuss its construction.

ret_fit_est_perf <- function(ret.mat, train.ind, val.ind, sign.around, units, epochs, batch_size, times) {

  history <- matrix(0, 4, epochs)
  d <- dim(ret.mat$x)

  for (i in 1:times) {
    model.ret <- keras_model_sequential() %>%
      layer_dense(units = units[1], activation = "relu", input_shape = d[2]) %>%
      layer_dense(units = units[2], activation = "relu") %>%
      layer_dense(units = 1, activation = "sigmoid")
    model.ret %>% compile(optimizer = "rmsprop", loss = "binary_crossentropy", metrics = c("accuracy"))
    history.ret <- model.ret %>% fit(ret.mat$x[train.ind,], as.integer(ret.mat$y[train.ind] > sign.around(ret.mat$y[train.ind])),
                                     epochs = epochs, batch_size = batch_size, validation_data = list(ret.mat$x[val.ind,],
                                                                                                      as.integer(ret.mat$y[val.ind] > sign.around(ret.mat$y[val.ind]))))
    history[1,] <- history[1,] + history.ret$metrics$loss
    history[2,] <- history[2,] + history.ret$metrics$accuracy
    history[3,] <- history[3,] + history.ret$metrics$val_loss
    history[4,] <- history[4,] + history.ret$metrics$val_accuracy
  }
  history / times
}

We now go through the function line by line, and also discuss its input parameters in the process.

history <- matrix(0, 4, epochs)

The history matrix contains four rows and epochs columns. The epochs parameter specifies how many times the fit of the model to the training data should be iteratively improved. If epochs is too small, the fit may be too crude. If epochs is too large, then the model may overfit the training data and therefore may generalise poorly to previously unseen data; this is checked in the procedure by testing the model on the validation part of the data, and recorded in an appropriate part of the history matrix.

Therefore, each column of the history matrix will record four quantities (in this order):

  • loss on the training set,
  • accuracy on the training set,
  • loss on the validation set,
  • accuracy on the validation set.

The first column of history records these quantities after one training epoch, the second column – after two epochs, and so on, until the specified number of epochs has been reached.

What loss do we use? If you look ahead to the compile function later on, you will see that the relevant parameter is

loss = "binary_crossentropy"

Binary cross-entropy is nothing other than the negative log-likelihood of the success probability of the Bernoulli distribution, when presented with a sample of 0’s and 1’s: “cross-entropy” and “negative log-likelihood” are the information-theoretic and the statistical names for the same quantity.

Accuracy, on the other hand, is simply the percentage of correctly predicted 0’s or 1’s in the respective sample (training or validation).
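
As a small illustration in plain R (made-up labels and predicted probabilities, not produced by keras), both quantities can be computed by hand:

y.true <- c(1, 0, 1, 1, 0)
p.hat  <- c(0.7, 0.4, 0.6, 0.3, 0.2)
-mean(y.true * log(p.hat) + (1 - y.true) * log(1 - p.hat))   # binary cross-entropy
mean((p.hat > 0.5) == y.true)                                # accuracy: 4 out of 5 = 0.8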

d <- dim(ret.mat$x)

Recall that the ret.mat parameter is a regression problem such as those stored in the variables ftse.mat, ftse.mat.mscale or ftse.pat. d simply records the dimensions (number of rows followed by the number of columns) of the x portion of the regression problem in ret.mat.

In the loop beginning with

for (i in 1:times)

note that the body of the loop does not use the iterator i at all, so some readers may wonder why we repeat the same procedure times times. This is because the procedure is random: the network weights are initialised at random and fit uses stochastic gradient-based optimisation (the training data are processed in randomly shuffled mini-batches), so the results will be slightly different each time. To get a better idea of the “true” (= average) performance, we repeat the fitting procedure a number of times and accumulate the performance measures (loss and accuracy) across the times runs in the history matrix.

model.ret <- keras_model_sequential() %>%
  layer_dense(units = units[1], activation = "relu", input_shape = d[2]) %>%
  layer_dense(units = units[2], activation = "relu") %>%
  layer_dense(units = 1, activation = "sigmoid")

The pipe notation %>%, described in detail e.g. on https://r4ds.had.co.nz/pipes.html (and available in the magrittr package, required by keras), provides an alternative way of passing arguments to functions. Essentially, and symbolically, A(...) %>% B(...) %>% C(...) is the same as C(B(A(...),...),...).
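
A minimal illustration of the pipe, unrelated to keras (it assumes that magrittr, loaded together with keras, is attached):

c(1, 4, 9) %>% sqrt() %>% sum()   # the same as sum(sqrt(c(1, 4, 9))), i.e. 6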

keras_model_sequential is one way of defining a neural network in keras. It is suitable for models that involve a sequence of layers, each acting on the output of the preceding layer (the first layer acts on the original data). A single layer performs an affine transformation of its input, followed by a certain nonlinear transformation; this then becomes the output of this layer (to be fed into the next layer, etc.).

Therefore, model.ret, defined above, consists of three layers, each acting on the output of the previous layer, with the first layer acting directly on the data:

layer_dense(units = units[1], activation = "relu", input_shape = d[2])
layer_dense(units = units[2], activation = "relu")
layer_dense(units = 1, activation = "sigmoid")

How to understand the layer_dense command? The “density” of the layer simply means that each element of the input contributes to each element of the output. Therefore, if we were to draw a graph connecting those elements of the input that contribute to each element of the output, such a graph would be “dense” in the sense of having all possible links.

The first layer acts on the data itself, therefore the shape of its input, specified in input_shape, is simply the length of each data vector, or the dimensionality of the regression problem considered (recall that d <- dim(ret.mat$x)). The subsequent layers do not need to specify the input shape as this is implied by the dimension of the output of the preceding layer.

The dimension of the output of the first, and each next, layer is given in its units argument. So the dimension of the output of the first layer is units[1] (this units vector, of length 2, is one of the arguments of ret_fit_est_perf), the dimension of the output of the second layer is units[2], and the dimension of the third and final layer is 1. The final layer is deliberately designed to return only one scalar: this is a problem with only one (univariate) response, so the ultimate output of the model (or: the predicted response) should of course also have one dimension.
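
One way to see these output shapes, together with the associated parameter counts, is to print the model summary. The sketch below is hypothetical and uses, as an example, the unit numbers c(5, 3) and the 10 input features of the ftse.pat problem run later in this chapter (it assumes that keras is installed and configured):

library(keras)   # if not already loaded
m.demo <- keras_model_sequential() %>%
  layer_dense(units = 5, activation = "relu", input_shape = 10) %>%
  layer_dense(units = 3, activation = "relu") %>%
  layer_dense(units = 1, activation = "sigmoid")
summary(m.demo)   # output shapes (None, 5), (None, 3), (None, 1), plus parameter counts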

We said above that each layer performs an affine transformation first, followed by a nonlinear one. For the affine part, in the case of the first layer, imagine a matrix linearly transforming an input of dimension d[2] into an output of dimension units[1], with a bias term added to each element of the output. What happens next is the nonlinear part: each of the units[1] components of the affine output is individually transformed through the nonlinear function specified in the activation argument. This nonlinear function is referred to as an activation function. The relu function (the activation function for layers 1 and 2) is given by \[ \text{RELU}(x) = x^+ = \max(0, x). \]

It is natural that layers should need nonlinear activation functions: if they did not have any, we would end up with a simple linear classifier, whereas we are hopefully after something more sophisticated here. A classifier with a large number of layers, each with a suitable nonlinear activation function such as relu, is able to approximate even complex functions well, which is the core argument behind the usefulness of deep learning (i.e. neural network models with many layers). In this particular model, if layers 1 and 2 did not have any nonlinearities (but layer 3 retained its sigmoid activation), then our model would reduce to simple logistic regression. This is because the sigmoid activation function is nothing else but the standard logistic function \[ \sigma(x) = (1 + e^{-x})^{-1}. \]

A couple of questions invite themselves at this point.

  • How do we know how many layers, and how many units in each layer, to choose? The answer is the same as to any other model selection question: it is both an art and a science, and experimentation and experience go a long way here. One helpful thought may be that the first layer (when fitted later, more on this below) will be responsible for selecting the best predictive features of the original data, the second layer – for selecting the best predictive features of these features, and so on. In our case, given the modest dimensionality of the data, two layers do not seem too few (but I have not tried a different number of layers here). We will comment on the numbers of units in this example later on.

  • How do we know what activation functions to use for each layer? The shape of the activation function affects at least two aspects of the system: what sorts of functions it is capable of approximating (well), and the behaviour of the numerical optimiser used to fit the model later. As regards the approximation aspect, if we only used relu activations in each layer, i.e. without the sigmoid function in the final layer, our function approximator would have to be piecewise linear, as this property is inherited down the layers. As regards the numerical optimisation of the fit to the data (see later on), this behaves differently for different activation functions, because it uses their gradients. The positive part of the relu function is attractive from this point of view as it has a constant gradient equal to one. The same cannot be said of the negative part, which produces a zero gradient, so gradient-based optimisers are unable to escape the corresponding regions if they find themselves there. Nonetheless, the relu activation function appears to be a good and frequently used default choice in neural networks. The reason we use the sigmoid activation function in the final layer is the same as the reason this function is used in logistic regression: we have a binary classification problem here. (Both activation functions are written out as plain R code just below.)
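
Purely for inspection (keras uses its own implementations), the two activation functions as plain R functions:

relu    <- function(x) pmax(0, x)          # max(0, x), applied elementwise
sigmoid <- function(x) 1 / (1 + exp(-x))   # the standard logistic function
relu(c(-2, 0, 3))   # 0 0 3
sigmoid(0)          # 0.5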

model.ret %>% compile(optimizer = "rmsprop", loss = "binary_crossentropy", metrics = c("accuracy"))

The compile function, according to its own help file, “configures a Keras model for training.” (Would configure therefore be a better name?) We have already covered the loss parameter (our chosen value: binary_crossentropy) and the metrics parameter (accuracy) above. Chollet and Allaire (2018) refer to the rmsprop optimiser as “generally a good enough choice, whatever [the] problem.” Note, from the notation, that the compile function modifies the model (here: saved in model.ret) in place.

history.ret <- model.ret %>% fit(ret.mat$x[train.ind,], as.integer(ret.mat$y[train.ind] > sign.around(ret.mat$y[train.ind])),
  epochs = epochs, batch_size = batch_size, validation_data = list(ret.mat$x[val.ind,],
  as.integer(ret.mat$y[val.ind] > sign.around(ret.mat$y[val.ind]))))

The function fit fits the model (which consists of the model architecture, the optimiser, the loss and the error metrics) to the training data, and monitors the performance of the fit on the validation data. We have already discussed the epochs parameter. A batch is the chunk of training data processed before each update of the model parameters: smaller batches mean more frequent but noisier updates, and larger batches the opposite.
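
For example, with 5000 training observations and batch_size = 16 (the values used later in this chapter), each epoch performs

ceiling(5000 / 16)   # 313 parameter updates per epoch (the final batch is smaller than 16)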

What does fitting the model involve here? We are not in a typical statistical setting, in which there are far fewer parameters than observations, and they are all uniquely identifiable and estimable. Here, our parameters are the matrices corresponding to the affine transformation parts in the following workflow:

data -> affine transformation -> diagonal nonlinear transformation ->
  layer 1 -> affine transformation -> diagonal nonlinear transformation ->
  ...
  layer K -> affine transformation -> diagonal nonlinear transformation ->
  output

where the layers labelled \(1, \ldots, K\) are the hidden layers. In our particular case, we have \(K = 2\) hidden layers. Therefore we have three matrices to estimate, of the following sizes

(1+input_shape) x units[1]   (data -> layer 1)
(1+units[1]) x units[2]      (layer 1 -> layer 2)
(1+units[2]) x 1             (layer 2 -> output)

where the 1 in 1+ corresponds to the intercept (bias) parameter. These are still not too many parameters to select in our particular setting, but the number of parameters can easily become much larger in more complex problems with more layers and more complex data. Therefore we cannot really speak of “estimation” here: the parameters are not uniquely identifiable, and any configuration that leads to a good fit to the data (without overfitting) will be attractive. The optimiser updates the parameter values in an attempt to obtain such a fit, but we skip the details here, as the focus of these notes is on the data-analytic aspects of neural networks rather than on their computational aspects.
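
As a quick arithmetic check for the ftse.pat run later in this chapter (input_shape = 10, units = c(5, 3)):

(10 + 1) * 5   # data -> layer 1: 55 parameters
(5 + 1) * 3    # layer 1 -> layer 2: 18 parameters
(3 + 1) * 1    # layer 2 -> output: 4 parameters, i.e. 77 in total

This agrees with the parameter counts printed by summary(m.demo) above.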

What about the sign.around function in the response vectors passed to the fit function (training set: 2nd argument; validation set: 2nd element of the 5th argument)? We wish to predict the direction (or sign) of the market movement around some measure of centrality. Consider two options for such a central measure: the median, and zero. Let us start with zero: predicting the sign of the market movement around zero simply means predicting the sign of the market movement itself. However, chances are that the best prediction of this is simply “always positive,” or in other words the implied advice to the market professional is “buy and hold.” Predicting around the median should be more interesting, as we then look for structure unrelated to the “average return” (as measured by the median), so in some sense we should expect the implied trading strategy to produce returns only loosely correlated with those of simply buying and holding, which may be attractive to portfolio managers.
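
A sketch of how the two choices of centrality measure translate into 0/1 labels (the object names below are purely illustrative):

y.tr <- ftse.mat$y[1:5000]
lab.median <- as.integer(y.tr > median(y.tr))   # sign.around = median: roughly balanced labels by construction
lab.zero   <- as.integer(y.tr > 0)              # sign.around = function(x) 0: sign of the raw return
mean(lab.median)   # close to 0.5
mean(lab.zero)     # often somewhat above 0.5 for an index that drifts upwards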

history[1,] <- history[1,] + history.ret$metrics$loss
history[2,] <- history[2,] + history.ret$metrics$accuracy
history[3,] <- history[3,] + history.ret$metrics$val_loss
history[4,] <- history[4,] + history.ret$metrics$val_accuracy

For each value of the iterator i (recall that we are inside the for (i in 1:times) loop), the history matrix accumulates (adds up) the respective quantities (loss on the training set in row 1, accuracy on the training set in row 2, loss on the validation set in row 3 and accuracy on the validation set in row 4), measured after each epoch.

history / times

The function returns the average of the history matrix across the times executions (which reduces the randomness of the result).

Let us now run our function ret_fit_est_perf with the appropriate parameters, and discuss their values, but first note that the execution of ret_fit_est_perf below is not fast, and takes of the order of minutes on my machine. This is principally because we are repeating the fit 50 times (note the value of the times parameter) in order to reduce the randomness of the fit. Randomness reduction in any other way in the keras package does not appear to be easy. Here, it seems an essential thing to do, because the signal is so weak: choosing one number of epochs may give us the accuracy of just under 50%, and choosing another number of epochs – the accuracy of just over 50%, which, for our predictive system, means being either just worse or just better than a random guess, i.e. a whole lot of difference. Therefore reducing the randomness in order to choose the ‘right’ number of epochs seems of importance here.

h.mat <- ret_fit_est_perf(ftse.mat, 1:5000, 5001:6000, median, c(3,2), 20, 16, 50)
h.mat.mscale <- ret_fit_est_perf(ftse.mat.mscale, 1:5000, 5001:6000, median, c(3,2), 20, 16, 50)
h.pat <- ret_fit_est_perf(ftse.pat, 1:5000, 5001:6000, median, c(5, 3), 20, 16, 50)

The index ranges 1:5000 and 5001:6000 correspond to the training and validation data, respectively. The executions above use the median centrality measure. We experiment with the numbers of units: c(3,2) in the first two executions and c(5,3) in the last. We use 20 epochs, and a batch size of 16. We repeat the whole process 50 times and record the average performance over those 50 instances. The accuracy on the validation set, as function of the number of epochs, can be plotted e.g. as below.

plot(h.pat[4,], xlab="Epoch", ylab="Accuracy")

The complete h.pat matrix looks like this.

> h.pat
          [,1]      [,2]      [,3]      [,4]      [,5]      [,6]      [,7]      [,8]      [,9]     [,10]
[1,] 0.7079381 0.6948333 0.6933121 0.6926375 0.6922011 0.6918543 0.6915957 0.6913716 0.6911735 0.6910029
[2,] 0.5023400 0.5078880 0.5123640 0.5163400 0.5188760 0.5208040 0.5228240 0.5241280 0.5258800 0.5272160
[3,] 0.6968660 0.6951267 0.6946879 0.6945795 0.6946131 0.6946330 0.6946529 0.6947304 0.6947918 0.6948923
[4,] 0.5016400 0.4991000 0.5026200 0.5035800 0.5031000 0.5048400 0.5058200 0.5045800 0.5046400 0.5056600
         [,11]     [,12]     [,13]     [,14]     [,15]     [,16]     [,17]     [,18]     [,19]     [,20]
[1,] 0.6908313 0.6906875 0.6905633 0.6904147 0.6903005 0.6901776 0.6900608 0.6899693 0.6898742 0.6897556
[2,] 0.5280640 0.5286400 0.5294000 0.5297920 0.5308200 0.5311320 0.5316720 0.5324600 0.5325920 0.5329520
[3,] 0.6948824 0.6949750 0.6950514 0.6951782 0.6952689 0.6953550 0.6954566 0.6954855 0.6955755 0.6956913
[4,] 0.5053400 0.5047600 0.5048000 0.5046600 0.5047600 0.5050400 0.5044200 0.5059800 0.5049000 0.5049400

As we can see from the 4th row, the accuracy on the validation set is barely above 50%, i.e. scarcely better than random guessing, which shows (not unexpectedly, of course) the difficulty of the problem. A similar look at the h.mat and h.mat.mscale matrices shows an even lower validation accuracy.

In the context of market movement prediction, however, anything that is significantly and systematically over 50% can already provide valuable signal.

2.4 Running the fitted model on test data

We next define a function that will test the model “trained” as above on the test data.

ret_fit_test <- function(ret.mat, train.ind, val.ind, test.ind, sign.around, units, epochs, batch_size) {

  d <- dim(ret.mat$x)

  model.ret <- keras_model_sequential() %>%
    layer_dense(units = units[1], activation = "relu", input_shape = d[2]) %>%
    layer_dense(units = units[2], activation = "relu") %>%
    layer_dense(units = 1, activation = "sigmoid")
  model.ret %>% compile(optimizer = "rmsprop", loss = "binary_crossentropy", metrics = c("accuracy"))
  history.ret <- model.ret %>% fit(ret.mat$x[train.ind,], as.integer(ret.mat$y[train.ind] > sign.around(ret.mat$y[train.ind])),
                                     epochs = epochs, batch_size = batch_size, validation_data = list(ret.mat$x[val.ind,],
                                                                                                      as.integer(ret.mat$y[val.ind] > sign.around(ret.mat$y[val.ind]))))
  results <- model.ret %>% evaluate(ret.mat$x[test.ind,], as.integer(ret.mat$y[test.ind] > sign.around(ret.mat$y[test.ind])))

  model.ret %>% predict(ret.mat$x[test.ind,]) -> pr.move

  simple.pnl <- cumsum(ret.mat$y[test.ind] * (2*(pr.move > 1/2)-1))

  list(model.ret = model.ret, history.ret = history.ret, results = results, pr.move = pr.move, simple.pnl = simple.pnl)
}

In light of our earlier discussion, this should now be practically self-explanatory. The evaluate function evaluates the fitted model on the test data. The predict function extracts the probabilistic predictions (of the next day’s market move direction around the specified centrality measure), and the simple.pnl vector computes cumulative profits and losses from a simple trading strategy which takes the position +1 in the market if the predicted probability of a positive move is larger than 0.5, and -1 otherwise.
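
The mapping from predicted probabilities to positions is just

2 * (c(0.3, 0.7) > 1/2) - 1   # -1  1: short when the predicted probability is below 0.5, long otherwise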

We now run the function with the smallest number of epochs (in this case: 7) for which the validation accuracy is (visually) among the highest attained. We plot the associated cumulative profits and losses as below.

ret_fit_test(ftse.pat, 1:5000, 5001:6000, 6001:7000, median, c(5, 3), 7, 16) -> ftse.pat.final
ts.plot(ftse.pat.final$simple.pnl)

We can also see how this compares to simply “buying and holding” the same instrument over this time period (of course not taking into account any trading costs).

lines(cumsum(ftse.pat$y[6001:7000]), col="red")

Finally, we can check how the returns from both strategies are correlated. With some luck, this should be fairly small, and certainly significantly smaller than one.

cor(diff(ftse.pat.final$simple.pnl), ftse.pat$y[6002:7000])

References

Baranowski, R., and P. Fryzlewicz. 2020. “Multiscale Autoregression on Adaptively Detected Timescales.” Preprint. http://stats.lse.ac.uk/fryzlewicz/amar/amar.pdf.
Chollet, F., and J. J. Allaire. 2018. Deep Learning with R. Manning.