\( \newcommand{\bm}[1]{\boldsymbol{#1}} \newcommand{\textm}[1]{\textsf{#1}} \def\T{{\mkern-2mu\raise-1mu\mathsf{T}}} \newcommand{\R}{\mathbb{R}} % real numbers \newcommand{\E}{{\rm I\kern-.2em E}} \newcommand{\w}{\bm{w}} % bold w \newcommand{\bmu}{\bm{\mu}} % bold mu \newcommand{\bSigma}{\bm{\Sigma}} % bold Sigma \newcommand{\bigO}{O} %\mathcal{O} \renewcommand{\d}[1]{\operatorname{d}\!{#1}} \)

16.4 Case studies of DL portfolios

As previously stated, research and experimentation in using DL for portfolio design (and more broadly, in finance) have been flourishing since around 2005. The reality is that we are still in the early stages of this exploration, and it remains uncertain whether the DL revolution will fully take hold in financial systems.

There is a continuous and increasing flow of published papers on the application of DL to portfolio design. Generally, the results presented by the authors appear promising; however, one must proceed with caution. As extensively discussed in Chapter 8, numerous dangers and potential pitfalls exist in backtesting portfolios. These naturally extend to the backtesting of DL architectures for portfolio design; to name a few:

  • Overfitting: Even when results are obtained from test data not used in the training process, authors may have actually used the test data multiple times while adjusting the deep architectures (adding/removing layers, modifying layer parameters, etc.). Consequently, the results may be overfitted. Authors often do not provide specific details on the final deep networks selected, adding a sense of mystery to the system.

  • Look-ahead bias: When using high-level DL libraries, there is a possibility of making mistakes by leaking future data during the training process (this could potentially be detected in the testing phase). Even worse, leaking future data in the input (which affects both training and testing) or having incorrect time alignment in performance evaluation could also occur.

  • Ignoring transaction costs: DL systems typically work with high-frequency data because large amounts of training data are necessary. Ignoring transaction costs in the assessment with frequent rebalancing is entirely misleading and unacceptable. However, if the rebalancing is slow enough, such as weekly, monthly, or quarterly, transaction costs can be initially disregarded as a rough approximation.

An exhaustive overview (as of 2020) of DL models developed for financial applications can be found in (Ozbayoglu et al., 2020) and, in particular for DL applied to financial time series forecasting, in (Sezer et al., 2020). In the following, we will look into a few illustrative examples, with the understanding that this is just a snapshot that will quickly become obsolete as new publications appear.

Example #1: LSTM for financial time series forecasting

The paper (Fischer and Krauss, 2018) constitutes an example of the standard time series forecasting introduced in Section 16.3.2 and illustrated in Figure 16.13.

This is the most prevalent method for applying DL to portfolio design, specifically for time series modeling or forecasting. The authors employ an LSTM network, due to its inherent memory capabilities, and formulate the problem as a binary classification by defining two classes based on whether the return of each asset is larger or smaller than the cross-sectional median return. The network is then trained to minimize the cross-entropy.
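
As a concrete illustration of this labeling scheme, the following sketch builds the binary targets from a matrix of daily returns (a minimal reconstruction in Python/NumPy; the function name is ours, not from the paper):

```python
import numpy as np

def median_labels(returns):
    """Binary targets: 1 if an asset's daily return is above the
    cross-sectional median, 0 otherwise. returns: (T, N) array."""
    med = np.median(returns, axis=1, keepdims=True)  # (T, 1) daily medians
    return (returns > med).astype(int)               # (T, N) class labels
```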

In particular, the network for each asset has the following structure (a code sketch follows the list):

  • input layer: 1 feature (daily returns) with a lookback of \(k=240\) timesteps (corresponding approximately to one trading year);
  • hidden layer: LSTM with 25 hidden neurons (this configuration yields 2,752 parameters, leading to a sensible number of approximately 93 training examples per parameter);
  • output layer: fully connected with 2 neurons (corresponding to the two classes) and a softmax activation function (to obtain the probabilities of the two classes).
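
A minimal PyTorch sketch of this architecture, reconstructed from the description above (training details such as dropout and optimizer settings follow the paper only loosely; note that PyTorch LSTM cells carry two bias vectors, so the exact parameter count differs slightly from the 2,752 quoted above):

```python
import torch.nn as nn

class MedianClassifier(nn.Module):
    """Maps 240 past (standardized) daily returns of one asset to the
    probability of beating the next cross-sectional median return."""
    def __init__(self, hidden=25):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden, batch_first=True)
        self.out = nn.Linear(hidden, 2)    # two classes: above/below median

    def forward(self, x):                  # x: (batch, 240, 1)
        h, _ = self.lstm(x)                # h: (batch, 240, hidden)
        return self.out(h[:, -1, :])       # logits at the last timestep

model = MedianClassifier()
loss_fn = nn.CrossEntropyLoss()            # cross-entropy applies softmax internally
```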

Once the DL architecture has been trained, its forecasts can be used to design a portfolio. Specifically, this DL architecture predicts the probability of each asset either outperforming or underperforming the cross-sectional median in period \(t\), using only information available up until time \(t-1\). The assets are then ranked based on the probability of outperforming the median, and a long-short quintile portfolio is subsequently formed (refer to Section 6.4.4 in Chapter 6 for details on quintile portfolios).
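
The portfolio construction step is straightforward; a hedged sketch (equal weights within each leg, our simplification):

```python
import numpy as np

def quintile_longshort(prob):
    """Long the top quintile and short the bottom quintile of assets,
    ranked by predicted probability of outperforming the median."""
    N = len(prob)
    k = max(N // 5, 1)                 # quintile size
    order = np.argsort(prob)           # ascending by probability
    w = np.zeros(N)
    w[order[-k:]] = 1.0 / k            # long leg
    w[order[:k]] = -1.0 / k            # short leg
    return w
```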

The empirical results in (Fischer and Krauss, 2018), based on daily data of S&P 500 stocks, demonstrate that using LSTM networks for forecasting in conjunction with a quintile portfolio outperforms the benchmarks (i.e., random forest, logistic regression, and a fully connected deep network with three hidden layers). Before transaction costs, the Sharpe ratio is approximately 5.8 (followed by the random forest at 5.0 and the fully connected network at 2.4). After accounting for transaction costs (using 5 bps or 0.05%), the Sharpe ratio decreases to 3.8 (followed by the random forest at 3.4 and the fully connected network at 0.9). For reference, the market had a Sharpe ratio of 0.7. However, while the overall results are positive, they seem to have been much better during the 1993-2009 period and deteriorated during 2010-2015, with profitability fluctuating around zero.

Example #2: Financial time series forecasting integrated with portfolio optimization

The paper (Butler and Kwon, 2023) provides an example of the portfolio-based time series forecasting presented in Section 16.3.3 and illustrated in Figure 16.14.

The overall architecture consists of the following two components:

  • a simple linear network for forecasting returns; and
  • a mean–variance portfolio (MVP) optimization component (refer to Chapter 7 for details on MVP).

As discussed in Section 16.3.3, the partial derivatives of the portfolio solution are essential for the backpropagation learning algorithm. These derivatives are derived in detail in (Butler and Kwon, 2023).
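
While Butler and Kwon obtain these derivatives analytically, the same end-to-end gradients can be obtained by wrapping the MVP problem in a generic differentiable convex optimization layer. The sketch below uses the cvxpylayers library as a stand-in (this is our substitution, not the authors' implementation; the risk-aversion value and the long-only constraint are illustrative assumptions):

```python
import cvxpy as cp
import torch
from cvxpylayers.torch import CvxpyLayer

N, delta = 24, 1.0                         # number of assets, risk aversion (assumed)
w = cp.Variable(N)
mu = cp.Parameter(N)                       # return forecast from the linear network
S_sqrt = cp.Parameter((N, N))              # square root of the covariance matrix
problem = cp.Problem(
    cp.Maximize(mu @ w - delta * cp.sum_squares(S_sqrt @ w)),
    [cp.sum(w) == 1, w >= 0])
mvp_layer = CvxpyLayer(problem, parameters=[mu, S_sqrt], variables=[w])

# Forward pass: the portfolio loss can be backpropagated through the solver
# into the forecasting network that produced mu_pred.
mu_pred = torch.zeros(N, requires_grad=True)   # placeholder forecast
w_t, = mvp_layer(mu_pred, torch.eye(N))
```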

Numerical experiments were conducted on a universe of 24 global futures markets, using daily returns from 1986 to 2020. The proposed method was compared to a benchmark in which the forecasting block is trained separately to minimize the MSE (i.e., decoupled from the subsequent portfolio optimization). The results showed a significant improvement in terms of the Sharpe ratio, although transaction costs were not considered.

Example #3: End-to-end NN-based portfolio

The paper (Uysal et al., 2023) serves as an example of both the portfolio-based time series forecasting presented in Section 16.3.3, as illustrated in Figure 16.14, and the end-to-end architecture presented in Section 16.3.4, as depicted in Figure 16.15.

The authors propose two schemes: a model-based approach, where the neural network learns intermediate features that are fed into a portfolio optimization block, and a model-free approach, where the neural network directly outputs the portfolio allocation.

The model-free architecture has the following structure (a code sketch follows the list):

  • input layer: raw features (past \(k=5\) daily returns, as well as 10-, 20-, and 30-day average returns and volatilities of each asset);
  • hidden layer: fully-connected with 32 neurons; and
  • output layer: 7 neurons (same as the number of assets) with the softmax function to obtain the normalized portfolio allocation.
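
A minimal PyTorch sketch of the model-free network (the hidden activation and the exact stacking of the raw features are not specified above, so both are assumptions here):

```python
import torch.nn as nn

n_assets = 7
n_features = 11 * n_assets    # illustrative: 5 returns + 3 averages + 3 volatilities per asset

model_free = nn.Sequential(
    nn.Linear(n_features, 32),
    nn.ReLU(),                # activation assumed
    nn.Linear(32, n_assets),
    nn.Softmax(dim=-1))       # long-only weights that sum to one
```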

The model-based architecture has the following structure (a sketch of a differentiable RPP block follows the list):

  • input layer: same raw features as in the model-free case;
  • hidden layers:
    • first a fully-connected hidden layer similar to the one in the model-free case,
    • then a second hidden layer with the softmax function to obtain the risk budgets; and
  • output layer: risk-parity portfolio (RPP) optimization block (refer to Chapter 11 for details on RPP).
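
For end-to-end learning, the RPP block must be differentiable so that gradients can flow back into the hidden layers. As a hedged illustration of the idea (not the authors' implementation, which differentiates through the full risk-parity optimization), the closed-form risk-budgeting solution for a diagonal covariance matrix is already differentiable:

```python
import torch

def rpp_diag(budgets, sigma):
    """Risk-budgeting weights under a diagonal-covariance approximation:
    w_i proportional to sqrt(b_i)/sigma_i (exact only if correlations are zero).
    budgets: softmax output of the second hidden layer; sigma: volatilities."""
    w = torch.sqrt(budgets) / sigma
    return w / w.sum()        # normalize; every operation is differentiable
```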

The empirical results based on daily market data of seven ETFs during 2011-2021 appear promising (although transaction costs were not considered in this analysis). For the model-based case, the Sharpe ratio was around \(1.10\sim1.15\), while for the nominal risk-parity portfolio it was around \(0.62\sim0.79\) and for the \(1/N\) portfolio around \(0.41\sim0.83\). However, the performance for the model-free case was not impressive, with a Sharpe ratio around \(0.31\sim0.56\). A plausible explanation for this is that the model-free portfolio lacks any structure to guide the allocation, resulting in overfitting. Therefore, the model-based architecture is preferred.

Example #4: End-to-end DL-based portfolio

The paper (C. Zhang et al., 2021), which builds on (Z. Zhang et al., 2020a), is an example of the end-to-end architecture presented in Section 16.3.4 and illustrated in Figure 16.15.

This end-to-end framework bypasses the traditional forecasting step and eliminates the need for estimating the covariance matrix. It can optimize various objective functions, such as the Sharpe ratio and the mean-variance trade-off. A notable aspect of this work is how the authors design neural layer structures to ensure that the output portfolio satisfies constraints on short selling, cardinality control, maximum positions for individual assets, and leverage.

The architecture is divided into two blocks: the score block (which produces a kind of raw portfolio) and the portfolio block (which enforces the desired constraints).

  • Score block: The score block takes the current market information as input, for example, a lookback of the previous \(k\) returns \(\left(\bm{x}_{t-k}, \dots,\bm{x}_{t-1}\right)\), and outputs the fitness scores for all the assets \(\bm{s}_t\). This block could be interpreted as making a forecast of the assets’ performance, similar to the traditional return forecast \(\bmu_t\), although it is not quite the same. In fact, it is more like a raw version of the portfolio weights \(\w_t\). The following different architectures are considered:

    • linear model;
    • fully-connected network with 64 units;
    • single LSTM layer with 64 units; and
    • CNN with 4 layers: the first 3 layers are one-dimensional convolutional layers with 32, 64, and 128 filters, respectively (i.e., producing that number of feature maps), each with the same kernel size (3,1); the last layer is a single LSTM with 64 units.
  • Portfolio block: This block takes the previous assets’ fitness scores \(\bm{s}_t\) as input and enforces the desired structure as follows (see the code sketch after this list):

    • for the typical no-shorting normalized weights, this block is simply a softmax layer: \[ \w_t = \frac{e^{\bm{s}_t}}{\bm{1}^\T e^{\bm{s}_t}}; \]

    • if shorting is allowed, then the softmax is modified to include the sign as \[ \w_t = \textm{sign}({\bm{s}_t})\times\frac{e^{\bm{s}_t}}{\bm{1}^\T e^{\bm{s}_t}}; \]

    • to control the maximum position \(u\), the authors propose using the generalized sigmoid \(\sigma_a(z) = a + 1/(1 + e^{-z})\) (with \(a = (1 - u)/(Nu - 1)\)) applied elementwise to the scores \(\bm{s}_t\): \[ \w_t = \textm{sign}({\bm{s}_t}) \times \frac{\bm{\sigma}_a(|\bm{s}_t|)}{\bm{1}^\T \bm{\sigma}_a(|\bm{s}_t|)}; \]

    • for the cardinality constraint (assuming shorting is allowed), the authors propose a layer that implements a long-short quintile portfolio (refer to Section 6.4.4 for a description of the quintile portfolio and (C. Zhang et al., 2021) for details on how to implement this layer in a way that it is differentiable, which is required for the learning process); and

    • to enforce the leverage \(L\), one simply scales the weights by the factor \(L\).
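
A minimal PyTorch transcription of these constraint layers (the numerical-stability shift in the softmax is ours; the cardinality layer is omitted, as it relies on the differentiable sorting construction detailed in the paper):

```python
import torch

def softmax_weights(s):
    """No-shorting normalized weights: w = exp(s) / (1' exp(s))."""
    e = torch.exp(s - s.max())             # shift for numerical stability
    return e / e.sum()

def longshort_weights(s):
    """Signed softmax when shorting is allowed."""
    return torch.sign(s) * softmax_weights(s)

def max_position_weights(s, u):
    """Cap each position at u via the generalized sigmoid sigma_a(z) = a + 1/(1+e^{-z})."""
    N = s.numel()
    a = (1.0 - u) / (N * u - 1.0)
    sig = a + torch.sigmoid(s.abs())       # applied elementwise to |s|
    return torch.sign(s) * sig / sig.sum()

def leveraged_weights(w, L):
    """Enforce leverage by scaling the weights by the factor L."""
    return L * w
```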

The empirical results in (C. Zhang et al., 2021), based on daily data, show that the end-to-end architecture based on a single LSTM layer yields the best results, with a Sharpe ratio of 2.6, while the benchmarks achieved no better than 1.6. However, these results were obtained without considering transaction costs. Unfortunately, when even small transaction costs of 2 bps (i.e., 0.02%) are factored in, the superior performance disappears, resulting in a performance not significantly different from that of a simple benchmark (e.g., the maximum diversification portfolio). As the authors themselves suggest, further work is needed to account for transaction costs in the learning process, such as by controlling the turnover.

Example #5: End-to-end deep reinforcement learning portfolio

The paper (Z. Zhang et al., 2020b) offers an example of deep reinforcement learning presented in Section 16.3.5.

The system is designed to maximize the expected cumulative return, aligning with the reinforcement learning (RL) framework, which aims to maximize expected cumulative rewards through an agent’s interaction with an uncertain environment. Within this RL framework, the system can efficiently map various market situations to trading positions and seamlessly incorporate market frictions, such as commissions, into the reward function. This allows for the direct optimization of trading performance. To represent the state space, the authors take into account several features, including past prices, returns over varying time frames, and technical indicators like the MACD1 and the RSI2. The action space is modeled either as a simple discrete set (\(\{-1,0,1\}\), representing short, no holding, and long positions, respectively) or as a continuous set spanning the entire \([-1,1]\) interval. The reward function consists of the volatility-adjusted return after accounting for transaction costs. In all models, the authors utilize two-layer LSTM networks with 64 and 32 units.
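
As a rough sketch of such a reward (a simplification in the spirit of the paper, not its exact formula; the target volatility and cost rate are placeholders):

```python
def reward(a_prev, a, ret, vol, vol_target=0.10, cost_rate=2e-4):
    """Volatility-scaled P&L of the position a in [-1, 1], minus a transaction
    cost proportional to the change in position."""
    scale = vol_target / vol               # scale the position to target volatility
    return scale * a * ret - cost_rate * scale * abs(a - a_prev)
```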

The authors evaluate their algorithms on 50 highly liquid futures contracts spanning 2011 to 2019, examining performance across asset classes such as commodities, equity indexes, fixed income, and foreign exchange markets. They contrast their algorithms with traditional time series momentum strategies, demonstrating that their approach surpasses these baselines by generating positive profits even in the face of substantial transaction costs. The experimental results indicate that the proposed algorithms can effectively track major market trends without altering positions, as well as scale down or maintain positions during consolidation periods.

References

Butler, A., and Kwon, R. H. (2023). Integrating prediction in mean-variance portfolio optimization. Quantitative Finance, 23(3), 429–452.
Fischer, T., and Krauss, C. (2018). Deep learning with long short-term memory networks for financial market predictions. European Journal of Operational Research, 270(2), 654–669.
Ozbayoglu, A. M., Gudelek, M. U., and Sezer, O. B. (2020). Deep learning for financial applications: A survey. Available at arXiv.
Sezer, O. B., Gudelek, M. U., and Ozbayoglu, A. M. (2020). Financial time series forecasting with deep learning: A systematic literature review: 2005–2019. Applied Soft Computing, 90.
Uysal, A. S., Li, X., and Mulvey, J. M. (2023). End-to-end risk budgeting portfolio optimization with neural networks. Annals of Operations Research.
Zhang, C., Zhang, Z., Cucuringu, M., and Zohren, S. (2021). A universal end-to-end approach to portfolio optimization via deep learning. Available at arXiv.
Zhang, Z., Zohren, S., and Roberts, S. (2020a). Deep learning for portfolio optimization. The Journal of Financial Data Science, 2(4), 8–20.
Zhang, Z., Zohren, S., and Roberts, S. (2020b). Deep reinforcement learning for trading. The Journal of Financial Data Science, 4(1), 1–16.

  1. The moving average convergence divergence (MACD) is a momentum oscillator primarily used to trade trends.

  2. The relative strength index (RSI) is a momentum indicator used in technical analysis that measures the magnitude of recent price changes to evaluate overbought or oversold conditions in the price of a stock or other asset.