Chapter 2 Forecasting tomorrow’s market move: difficult binary classification
2.1 Purpose of analysis
Our data in this chapter are the daily closing values of the FTSE 100 index, observed from 22 October 1992 to 19 October 2020. Our purpose is to predict the direction of tomorrow's market movement (that is, the sign of the difference between tomorrow's close and today's close) from the recent history of closing prices, including today's. This is an extremely difficult example of binary classification, and the Efficient Market Hypothesis can be interpreted to mean that it is fundamentally impossible, in this task, to beat the constant predictor \(\equiv 1\), i.e. one that always predicts a positive market move tomorrow.
The dataset, in CSV format, is available here.
2.2 Preparing our data objects
We first create a few useful objects.
<- read.csv("FT-SE100.csv")
ftse <- ftse$X.CLOSE
ftse.cls <- length(ftse.cls)
n <- (ftse.cls[2:n] - ftse.cls[1:(n-1)]) / ftse.cls[1:(n-1)] ftse.ret
In read.csv, you may have to provide your own path to where the FT-SE100.csv file is located. The daily closing values are in ftse.cls and the daily returns in ftse.ret. We now define a few functions that will be used to construct different regression problems involving ftse.ret and ftse.cls.
ret_matrix <- function(x.ret, lag) {
  # x.ret - return time series
  # lag - how many predictive lags to include in x?
  n <- length(x.ret)
  y <- x.ret[(1+lag):n]
  x <- matrix(0, n-lag, lag)
  for (i in 1:lag) x[,i] <- x.ret[(1+lag-i):(n-i)]
  list(x=x, y=y)
}
ret_matrix_mscale <- function(x.ret, scales = c(1, 2, 5, 10)) {
  # x.ret - return time series
  # scales - spans for averages of past returns, as predictive features in x
  lag <- max(scales)
  n <- length(x.ret)
  m <- ret_matrix(x.ret, lag)
  p <- length(scales)
  x <- matrix(0, n-lag, p)
  for (i in 1:p) x[,i] <- apply(m$x[,1:scales[i],drop=F], 1, mean)
  list(x=x, y=m$y)
}
pat_recogn_matrix <- function(x.cls, lag, std.norm = FALSE) {
  # x.cls - time series of daily closes
  # lag - what length of the past to include as predictive features in x?
  # std.norm - standardise each predictive path of length lag to have sample variance one?
  n <- length(x.cls)
  y <- (x.cls[(lag+1):n] - x.cls[lag:(n-1)]) / x.cls[lag:(n-1)]
  x <- matrix(0, n-lag, lag)
  for (i in 1:(n-lag)) {
    x[i,] <- x.cls[i:(i+lag-1)] - mean(x.cls[i:(i+lag-1)])
    if (std.norm) x[i,] <- x[i,] / sqrt(var(x[i,]))
  }
  list(x=x, y=y)
}
ret_matrix, short for "return matrix," takes a return time series (such as ftse.ret) and produces a response vector y containing the returns on each day from lag+1 to the end (day n), and the corresponding predictor matrix x. Take any row of x, and the element of y located in the same position as that row. If that element of y contains the return on day \(t\), then the corresponding row of x contains the returns on days \(t-1, t-2, \ldots, t-\)lag. In other words, the predictors for tomorrow's return are the returns on lag consecutive days, ending today. The returned matrix x is a tall matrix, i.e. time progresses down its columns.
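To make the indexing concrete, here is a toy run of ret_matrix on a hypothetical five-day return series (the values are made up purely for illustration).

toy <- ret_matrix(c(0.01, -0.02, 0.03, 0.005, -0.01), 2)
toy$y   # returns on days 3, 4 and 5
toy$x   # row 1 holds the returns on days 2 and 1, matching y[1] (the return on day 3), etc.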
ret_matrix_mscale (where mscale is short for "multiscale") is similar to ret_matrix, but the predictors are not returns on individual days. Instead, they are average returns on consecutive days ending today, where the averages are computed over the numbers of days specified in scales. For example, if scales is c(1, 2, 5, 10) (the default), then the predictors of tomorrow's return are: today's return (scale 1), the average of today's and yesterday's returns (scale 2), the average return over the past trading week ending today (scale 5), and the same for the past two trading weeks (scale 10). For more on multiscale time series models like this one, see Baranowski and Fryzlewicz (2020).
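As a quick sanity check (a sketch using the objects already defined above), each column of the multiscale predictor matrix should equal the row mean of the corresponding leading columns of the plain lag matrix.

mm <- ret_matrix_mscale(ftse.ret)            # default scales c(1, 2, 5, 10)
m10 <- ret_matrix(ftse.ret, 10)              # lag = max(scales) = 10
all.equal(mm$x[,3], rowMeans(m10$x[,1:5]))   # TRUE: the scale-5 column is the 5-day average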
pat_recogn_matrix (short for "pattern recognition matrix") is different in that it takes the closing values (such as ftse.cls) on input, rather than returns. The reason is that in this function, the predictors for tomorrow's return are the closing prices on lag consecutive days, ending today. They are demeaned and, if std.norm is TRUE, also standardised to have sample variance one. The name of the function comes from the fact that this creates a sort of pattern recognition problem: what pattern in the recent closing prices tends to correspond to what return tomorrow?
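A quick check of this construction (a sketch): each row of the predictor matrix should have mean zero and, with std.norm = TRUE, sample variance one.

pp <- pat_recogn_matrix(ftse.cls, 10, TRUE)
round(mean(pp$x[1,]), 12)   # essentially zero after demeaning
var(pp$x[1,])               # 1 after standardisation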
Let us now use these functions to create three different regression problems.
ftse.mat <- ret_matrix(ftse.ret, 5)
ftse.mat.mscale <- ret_matrix_mscale(ftse.ret)
ftse.pat <- pat_recogn_matrix(ftse.cls, 10, TRUE)
2.3 Fitting the model and monitoring performance
Let us now define a function which will take a regression problem (e.g. any of the ones defined above), train a Keras neural network model on one subset of it (called a training set), and test its performance on the next subset (called a validation set). The task is not to predict the response itself, but the sign of the response around a certain central measure. In what follows, we use 1 to denote a positive market move around the central measure, and 0 otherwise. We first define the function and then discuss its construction.
ret_fit_est_perf <- function(ret.mat, train.ind, val.ind, sign.around, units, epochs, batch_size, times) {
  history <- matrix(0, 4, epochs)
  d <- dim(ret.mat$x)
  for (i in 1:times) {
    model.ret <- keras_model_sequential() %>% layer_dense(units = units[1], activation = "relu", input_shape = d[2]) %>%
      layer_dense(units = units[2], activation = "relu") %>% layer_dense(units = 1, activation = "sigmoid")
    model.ret %>% compile(optimizer = "rmsprop", loss = "binary_crossentropy", metrics = c("accuracy"))
    history.ret <- model.ret %>% fit(ret.mat$x[train.ind,], as.integer(ret.mat$y[train.ind] > sign.around(ret.mat$y[train.ind])),
      epochs = epochs, batch_size = batch_size, validation_data = list(ret.mat$x[val.ind,],
      as.integer(ret.mat$y[val.ind] > sign.around(ret.mat$y[val.ind]))))
    history[1,] <- history[1,] + history.ret$metrics$loss
    history[2,] <- history[2,] + history.ret$metrics$accuracy
    history[3,] <- history[3,] + history.ret$metrics$val_loss
    history[4,] <- history[4,] + history.ret$metrics$val_accuracy
  }
  history / times
}
We now go through the function line by line, and also discuss its input parameters in the process.
history <- matrix(0, 4, epochs)
The history matrix contains four rows and epochs columns. The epochs parameter specifies how many times the fit of the model to the training data should be iteratively improved. If epochs is too small, the fit may be too crude. If epochs is too large, the model may overfit the training data and therefore generalise poorly to previously unseen data; this is checked in the procedure by testing the model on the validation part of the data, and recorded in the appropriate part of the history matrix. Therefore, each column of the history matrix will record four quantities (in this order):
- loss on the training set,
- accuracy on the training set,
- loss on the validation set,
- accuracy on the validation set.
The first column of history records these quantities after one training epoch, the second column after two epochs, and so on, until the full number of epochs has been reached.
What loss do we use? If you look ahead to the compile function later on, you will see that the relevant parameter is
loss = "binary_crossentropy"
Binary cross-entropy is nothing other than the negative log-likelihood of the parameter of the Bernoulli distribution, when presented with a sample of 0's and 1's. Here is an explanation of the binary cross-entropy which does not use the word "likelihood," and here is an explanation of the Bernoulli log-likelihood which does not use the term "entropy."
Accuracy, on the other hand, is simply the percentage of correctly predicted 0's and 1's in the respective sample (training or validation).
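For concreteness, here are both quantities computed by hand on a toy sample (the labels and predicted probabilities below are hypothetical, purely for illustration).

bce <- function(y, p) -mean(y * log(p) + (1 - y) * log(1 - p))   # binary cross-entropy
acc <- function(y, p) mean(y == as.integer(p > 0.5))             # accuracy
y <- c(1, 0, 1, 1); p <- c(0.9, 0.2, 0.6, 0.4)
bce(y, p)   # the average negative Bernoulli log-likelihood
acc(y, p)   # proportion of correctly predicted labels: 0.75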
d <- dim(ret.mat$x)
Recall that the ret.mat parameter is a regression problem such as those stored in the variables ftse.mat, ftse.mat.mscale or ftse.pat. d simply records the dimensions (the number of rows followed by the number of columns) of the x portion of the regression problem in ret.mat.
In the loop beginning with for (i in 1:times), note that the body of the loop does not use the iterator i in any way. Some readers may therefore wonder why we are repeating the same procedure times times. This is because the procedure is random: the fit function is stochastic in the sense that it uses a stochastic approximation to the ideal fit to the data. Therefore, the results will be slightly different each time, and to get a better idea of the "true" (= average) performance, we repeat the fitting procedure a number of times and store the cumulative performance measures (loss and accuracy) across the times runs in the history matrix.
model.ret <- keras_model_sequential() %>% layer_dense(units = units[1], activation = "relu", input_shape = d[2]) %>%
  layer_dense(units = units[2], activation = "relu") %>% layer_dense(units = 1, activation = "sigmoid")
The pipe notation %>%, described in detail e.g. at https://r4ds.had.co.nz/pipes.html (and available in the magrittr package, required by keras), provides an alternative way of passing arguments to functions. Essentially, and symbolically, A(...) %>% B(...) %>% C(...) is the same as C(B(A(...), ...), ...).
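A toy illustration of this equivalence, using base R functions:

library(magrittr)
sqrt(sum(1:10))             # nested call
1:10 %>% sum() %>% sqrt()   # the same computation, written as a pipe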
keras_model_sequential is one way of defining a neural network in keras. It is suitable for models that involve a sequence of layers, each acting on the output of the preceding layer (the first layer acts on the original data). A single layer performs an affine transformation of its input, followed by a certain nonlinear transformation; this then becomes the output of the layer (to be fed into the next layer, and so on).
Therefore, model.ret, defined above, consists of three layers, each acting on the output of the previous layer, with the first layer acting directly on the data:
layer_dense(units = units[1], activation = "relu", input_shape = d[2])
layer_dense(units = units[2], activation = "relu")
layer_dense(units = 1, activation = "sigmoid")
How should we understand the layer_dense command? The "density" of the layer simply means that each element of the input contributes to each element of the output. Therefore, if we were to draw a graph connecting those elements of the input that contribute to each element of the output, such a graph would be "dense" in the sense of having all possible links.
The first layer acts on the data itself, so the shape of its input, specified in input_shape, is simply the length of each data vector, or the dimensionality of the regression problem considered (recall that d <- dim(ret.mat$x)). The subsequent layers do not need to specify the input shape, as it is implied by the dimension of the output of the preceding layer.
The dimension of the output of the first, and each subsequent, layer is given in its units argument. So the dimension of the output of the first layer is units[1] (this units vector, of length 2, is one of the arguments of ret_fit_est_perf), the dimension of the output of the second layer is units[2], and the dimension of the output of the third and final layer is 1. The final layer is deliberately designed to return a single scalar: this is a problem with only one (univariate) response, so the ultimate output of the model (or: the predicted response) should of course also have one dimension.
We said above that each layer performs an affine transformation first, followed by a nonlinear one. For the affine part, in the case of the first layer, imagine a matrix linearly transforming an input of dimension d[2] into an output of dimension units[1], with a bias term added to each element of the output. What happens next is the nonlinear part: each of the units[1] components of the affine output is individually transformed through the nonlinear function specified in the activation argument. This nonlinear function is referred to as an activation function.
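Here is what a single dense layer computes, written out by hand in R (a sketch with hypothetical, randomly generated weights; in practice Keras learns these weights during fitting):

relu <- function(x) pmax(0, x)    # the activation used in layers 1 and 2
W <- matrix(rnorm(3 * 5), 3, 5)   # weight matrix: 3 units, input of dimension 5
b <- rnorm(3)                     # bias term, one per unit
input <- rnorm(5)                 # one data vector
relu(W %*% input + b)             # the layer's 3-dimensional output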
The relu function (the activation function for layers 1 and 2) is given by
\[
\text{RELU}(x) = x^+ = \max(0, x).
\]
It is natural that layers should need nonlinear activation functions: if they did not have any, we would obtain a simple linear classifier, whereas we are hopefully after something more sophisticated here. A classifier with a large number of layers, each with a suitable nonlinear activation function such as relu, is able to approximate even complex functions well, which is the core argument behind the usefulness of deep learning (i.e. neural network models with many layers). In this particular model, if layers 1 and 2 did not have any nonlinearities (but layer 3 retained its sigmoid activation), then our model would reduce to simple logistic regression. This is because the sigmoid activation function is nothing other than the standard logistic function
\[
\sigma(x) = (1 + e^{-x})^{-1}.
\]
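To see the reduction to logistic regression concretely, here is a small sketch (with made-up weights) showing that, without nonlinearities in layers 1 and 2, the composition of the three affine maps collapses into a single affine map followed by the logistic link:

sigmoid <- function(x) 1 / (1 + exp(-x))
W1 <- matrix(rnorm(15), 3, 5); b1 <- rnorm(3)   # layer 1, identity activation
W2 <- matrix(rnorm(6), 2, 3); b2 <- rnorm(2)    # layer 2, identity activation
w3 <- rnorm(2); b3 <- rnorm(1)                  # layer 3, sigmoid activation
x <- rnorm(5)
p.layers <- sigmoid(sum(w3 * (W2 %*% (W1 %*% x + b1) + b2)) + b3)
p.affine <- sigmoid(sum((w3 %*% W2 %*% W1) * x) + sum(w3 * (W2 %*% b1 + b2)) + b3)
all.equal(p.layers, p.affine)   # TRUE: one affine map, then the logistic link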
A couple of questions invite themselves at this point.
How do we know how many layers, and how many units in each layer, to choose? The answer is the same as to any other model selection question: it is both an art and a science, and experimentation and experience go a long way here. One helpful thought may be that the first layer (when fitted later, more on this below) will be responsible for selecting the best predictive features of the original data, the second layer – for selecting the best predictive features of these features, and so on. In our case, given the modest dimensionality of the data, two layers do not seem too few (but I have not tried a different number of layers here). We will comment on the numbers of units in this example later on.
How do we know what activation functions to use for each layer? The shape of the activation function impacts at least two aspects of the system: what sorts of functions it is capable of approximating (well), and the behaviour of the numerical optimiser used to fit the model later. As regards the approximation aspect, if we were only to use the relu activation function in each layer, i.e. without the sigmoid function in the final layer, our function approximator would have to be piecewise linear, as this property is inherited down the layers. As regards the numerical optimisation of the fit to the data (see later on), this behaves differently for different activation functions, because it uses their gradients. The positive part of the relu function is attractive from this point of view, as it has a constant gradient equal to one. The same cannot be said of the negative part, which produces a flat (zero) gradient, so it is impossible for optimisers to get out of the corresponding regions if they find themselves there. Nonetheless, the relu activation function appears to be a good and frequently used default choice in neural networks; see e.g. here for more details. The reason we use the sigmoid activation function in the final layer is the same as why this function is used in logistic regression (we have a binary classification problem here).
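The gradient behaviour just described is easy to see directly (a small sketch; relu_grad is a hypothetical helper written out for illustration, not part of keras):

relu_grad <- function(x) as.numeric(x > 0)   # derivative of max(0, x), taken as 0 at x = 0
relu_grad(c(-2, -0.1, 0.5, 3))               # 0 0 1 1: flat on the left, slope one on the right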
model.ret %>% compile(optimizer = "rmsprop", loss = "binary_crossentropy", metrics = c("accuracy"))
The compile function, according to its own help file, "configures a Keras model for training." (Would configure therefore be a better name?) We have already covered the loss parameter (our chosen value: binary_crossentropy) and the metrics parameter (accuracy) above. Chollet and Allaire (2018) refer to the rmsprop optimiser as "generally a good enough choice, whatever [the] problem." Note, from the notation, that the compile function modifies the model (here: saved in model.ret) in place.
history.ret <- model.ret %>% fit(ret.mat$x[train.ind,], as.integer(ret.mat$y[train.ind] > sign.around(ret.mat$y[train.ind])),
  epochs = epochs, batch_size = batch_size, validation_data = list(ret.mat$x[val.ind,],
  as.integer(ret.mat$y[val.ind] > sign.around(ret.mat$y[val.ind]))))
The fit function fits the model (which consists of: the model architecture, the optimiser, the loss and the error metrics) to the training data, and monitors the performance of the fit on the validation data. We have already discussed the epochs parameter. A batch is a chunk of data processed before the model updates its parameters. For a good discussion of the batch size, see e.g. here.
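As a back-of-envelope illustration of what this means for the runs below: with 5000 training rows and a batch size of 16, each epoch performs

ceiling(5000 / 16)   # 313 parameter updates per epoch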
What does fitting the model involve here? We are not in a typical statistical setting, in which there are far fewer parameters than observations, and they are all uniquely identifiable and estimable. Here, our parameters are the matrices corresponding to the affine transformation parts in the following workflow:
data -> affine transformation -> diagonal nonlinear transformation ->
layer 1 -> affine transformation -> diagonal nonlinear transformation ->
...
layer K -> affine transformation -> diagonal nonlinear transformation ->
output
where the layers labelled \(1, \ldots, K\) are the hidden layers. In our particular case, we have \(K = 2\) hidden layers. Therefore we have three matrices to estimate, of the following sizes:
(1+input_shape) x units[1]   (data -> layer 1)
(1+units[1]) x units[2]      (layer 1 -> layer 2)
(1+units[2]) x 1             (layer 2 -> output)
where the 1 in 1+ corresponds to the intercept (bias) parameter. These are still not too many parameters to select in our particular setting, but the number can easily become much larger in more complex problems with more layers and more complex data. Therefore we cannot really speak of "estimation" here; any configuration of parameters that leads to a good fit to the data (without overfitting) will be attractive. The optimiser updates the parameter values in an attempt to obtain a good fit, but we skip the details here, as the focus of these notes is on the data-analytic aspects of neural networks, rather than on their computational aspects.
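For instance, for the pattern recognition problem run below (input of dimension 10, units equal to c(5, 3)), the count works out as

(1 + 10) * 5 + (1 + 5) * 3 + (1 + 3) * 1   # 55 + 18 + 4 = 77 parameters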
What about the sign.around function applied to the response vectors passed to the fit function (training set: 2nd argument; validation set: 2nd element of the 5th argument)? We wish to predict the direction (or sign) of the market movement around some central measure. Consider two different options for such a central measure: the median, and zero. Let us start with zero: predicting the sign of the market movement around zero simply means predicting the sign of the market movement itself. However, chances are that the best prediction of this is simply "always positive," or in other words the implied advice to the market professional is "buy and hold." Predicting around the median should be more interesting, as we then look for structure unrelated to the "average return" (as measured by the median); in some sense, we should therefore expect the implied trading strategy to produce returns only loosely correlated with those from simply buying and holding, which may be attractive to portfolio managers.
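A quick look at what the median-centred labelling does (a sketch, using the training indices from the runs below): by construction, roughly half of the training days receive the label 1.

y.tr <- ftse.mat$y[1:5000]
labels <- as.integer(y.tr > median(y.tr))
mean(labels)   # close to 0.5 by construction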
history[1,] <- history[1,] + history.ret$metrics$loss
history[2,] <- history[2,] + history.ret$metrics$accuracy
history[3,] <- history[3,] + history.ret$metrics$val_loss
history[4,] <- history[4,] + history.ret$metrics$val_accuracy
For each value of the iterator i (recall we are inside the for (i in 1:times) loop), the history matrix accumulates the respective quantities (loss on the training set in row 1, accuracy on the training set in row 2, loss on the validation set in row 3 and accuracy on the validation set in row 4), measured after each epoch.
history / times
The function returns the average of the history matrix across the times executions (which reduces the randomness of the result).
Let us now run our function ret_fit_est_perf with the appropriate parameters, and discuss their values. First, note that the execution of ret_fit_est_perf below is not fast, and takes of the order of minutes on my machine. This is principally because we are repeating the fit 50 times (note the value of the times parameter) in order to reduce the randomness of the fit. Reducing randomness in any other way in the keras package does not appear to be easy. Here, it seems an essential thing to do, because the signal is so weak: choosing one number of epochs may give us an accuracy of just under 50%, and choosing another an accuracy of just over 50%, which, for our predictive system, means being either just worse or just better than a random guess, i.e. a whole lot of difference. Therefore, reducing the randomness in order to choose the 'right' number of epochs seems important here.
h.mat <- ret_fit_est_perf(ftse.mat, 1:5000, 5001:6000, median, c(3,2), 20, 16, 50)
h.mat.mscale <- ret_fit_est_perf(ftse.mat.mscale, 1:5000, 5001:6000, median, c(3,2), 20, 16, 50)
h.pat <- ret_fit_est_perf(ftse.pat, 1:5000, 5001:6000, median, c(5, 3), 20, 16, 50)
The index ranges 1:5000 and 5001:6000 correspond to the training and validation data, respectively. The executions above use the median centrality measure. We experiment with the numbers of units: c(3,2) in the first two executions and c(5,3) in the last. We use 20 epochs and a batch size of 16. We repeat the whole process 50 times and record the average performance over those 50 instances. The accuracy on the validation set, as a function of the number of epochs, can be plotted e.g. as below.
plot(h.pat[4,], xlab="Epoch", ylab="Accuracy")
The complete h.pat matrix looks like this.
> h.pat
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 0.7079381 0.6948333 0.6933121 0.6926375 0.6922011 0.6918543 0.6915957 0.6913716 0.6911735 0.6910029
[2,] 0.5023400 0.5078880 0.5123640 0.5163400 0.5188760 0.5208040 0.5228240 0.5241280 0.5258800 0.5272160
[3,] 0.6968660 0.6951267 0.6946879 0.6945795 0.6946131 0.6946330 0.6946529 0.6947304 0.6947918 0.6948923
[4,] 0.5016400 0.4991000 0.5026200 0.5035800 0.5031000 0.5048400 0.5058200 0.5045800 0.5046400 0.5056600
[,11] [,12] [,13] [,14] [,15] [,16] [,17] [,18] [,19] [,20]
[1,] 0.6908313 0.6906875 0.6905633 0.6904147 0.6903005 0.6901776 0.6900608 0.6899693 0.6898742 0.6897556
[2,] 0.5280640 0.5286400 0.5294000 0.5297920 0.5308200 0.5311320 0.5316720 0.5324600 0.5325920 0.5329520
[3,] 0.6948824 0.6949750 0.6950514 0.6951782 0.6952689 0.6953550 0.6954566 0.6954855 0.6955755 0.6956913
[4,] 0.5053400 0.5047600 0.5048000 0.5046600 0.5047600 0.5050400 0.5044200 0.5059800 0.5049000 0.5049400
As we can see from the 4th row, the accuracy on the validation set is only barely above 50%, which shows (not unexpectedly, of course) the difficulty of the problem. A similar look at the h.mat and h.mat.mscale matrices shows an even lower validation accuracy.
In the context of market movement prediction, however, anything that is significantly and systematically over 50% can already provide valuable signal.
2.4 Running the fitted model on test data
We next define a function that will test the model “trained” as above on the test data.
ret_fit_test <- function(ret.mat, train.ind, val.ind, test.ind, sign.around, units, epochs, batch_size) {
  d <- dim(ret.mat$x)
  model.ret <- keras_model_sequential() %>% layer_dense(units = units[1], activation = "relu", input_shape = d[2]) %>%
    layer_dense(units = units[2], activation = "relu") %>% layer_dense(units = 1, activation = "sigmoid")
  model.ret %>% compile(optimizer = "rmsprop", loss = "binary_crossentropy", metrics = c("accuracy"))
  history.ret <- model.ret %>% fit(ret.mat$x[train.ind,], as.integer(ret.mat$y[train.ind] > sign.around(ret.mat$y[train.ind])),
    epochs = epochs, batch_size = batch_size, validation_data = list(ret.mat$x[val.ind,],
    as.integer(ret.mat$y[val.ind] > sign.around(ret.mat$y[val.ind]))))
  results <- model.ret %>% evaluate(ret.mat$x[test.ind,], as.integer(ret.mat$y[test.ind] > sign.around(ret.mat$y[test.ind])))
  model.ret %>% predict(ret.mat$x[test.ind,]) -> pr.move
  simple.pnl <- cumsum(ret.mat$y[test.ind] * (2*(pr.move > 1/2)-1))
  list(model.ret = model.ret, history.ret = history.ret, results = results, pr.move = pr.move, simple.pnl = simple.pnl)
}
In light of our earlier discussion, this should now be practically self-explanatory. The evaluate function evaluates the fitted model on the test data. The predict function extracts the probabilistic predictions (of the next day's market move direction around the specified centrality measure), and the simple.pnl vector computes the cumulative profits and losses from a simple trading strategy which takes the position +1 in the market if the predicted probability of a positive move is larger than 0.5, and -1 otherwise.
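The position rule in isolation (a sketch; position is a hypothetical helper, written out only to make the +1/-1 mapping explicit):

position <- function(pr) 2 * (pr > 1/2) - 1   # +1 if predicted probability > 0.5, else -1
position(c(0.3, 0.55, 0.5))                   # -1  1 -1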
We now run the function with the smallest number of epochs (in this case: 7) for which the validation accuracy is (visually) among the highest attained. We plot the associated cumulative profits and losses as below.
ret_fit_test(ftse.pat, 1:5000, 5001:6000, 6001:7000, median, c(5, 3), 7, 16) -> ftse.pat.final
ts.plot(ftse.pat.final$simple.pnl)
We can also see how this compares to simply “buying and holding” the same instrument over this time period (of course not taking into account any trading costs).
lines(cumsum(ftse.pat$y[6001:7000]), col="red")
Finally, we can check how the returns from the two strategies are correlated. With some luck, this correlation should be fairly small, and certainly significantly smaller than one.
cor(diff(ftse.pat.final$simple.pnl), ftse.pat$y[6002:7000])