## 13.5 Automatic sparsity control

The sparse index tracking formulation in (13.9) includes the regularization term \(\lambda \|\w\|_0\) in the objective, where \(\lambda\) is a hyper-parameter to be chosen in order to get the desired level of sparsity. Alternatively, it can be formulated by moving the sparsity term to the constraints as \(\|\w\|_0 \le k\) with \(k\) denoting the desired level of sparsity. Either way, by properly adjusting the sparsity hyper-parameter, \(\lambda\) or \(k\), different points on the error versus sparsity trade-off curves in Figures 13.9–13.10 can be achieved.

In practice, however, the goal is to choose a proper operating point on the error versus sparsity trade-off curve, preferably without having to compute the whole trade-off curve. Is there a convenient way to tune the sparsity hyper-parameter to get a proper operating point on the trade-off curve?

### 13.5.1 False discovery rate (FDR)

To properly answer the question of how to choose the operating point on the trade-off curve, we need to introduce some concepts from statistics and hypothesis testing. In particular, a key quantity when deciding whether to use a variable or not in a regression problem is the *false discovery rate* (FDR), which refers to the expected fraction of wrongly included variables among all the selected ones.
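As a point of reference, the standard definition of the FDR from the multiple testing literature can be written as follows. The symbols \(R\), \(V\), and \(\alpha\) are notation introduced here for illustration and do not appear elsewhere in this chapter:

\[
\mathrm{FDR} = \mathbb{E}\left[\frac{V}{\max(R, 1)}\right],
\]

where \(R\) is the number of selected variables and \(V\) is the number of those that are selected wrongly. For example, if a method selects \(R = 20\) variables and \(V = 2\) of them are false discoveries, the realized false discovery proportion is \(2/20 = 0.1\). Controlling the FDR at a target level \(\alpha\) means guaranteeing \(\mathrm{FDR} \le \alpha\).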

In some applications, such as genomics, including the wrong variables can be catastrophic as it implies that some gene is incorrectly associated with some medical condition. In finance, for example, hundreds of papers have been written attempting to discover factors that explain the cross-section of expected returns, but are these results false? Apparently so. Based on a hypothesis testing analysis of empirical tests since 1967, it seems that most claimed research findings in financial economics are likely false (C. R. Harvey et al., 2016).

Controlling the FDR is a principled way to decide whether to include a variable, and this was first achieved in the seminal paper (Benjamini and Hochberg, 1995). Since then, many FDR-controlling methods have been proposed, with the most popular one being based on the concept of *knockoffs* (R. F. Barber and Candès, 2015). Knockoffs are fictitious variables that are created for the purpose of FDR control. They need to mimic the covariance structure of the original variables, which can be computationally demanding because it requires estimating the covariance matrix, itself a difficult task in high-dimensional settings.
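To make the idea of FDR control concrete, the following is a minimal sketch of the classical Benjamini-Hochberg step-up procedure applied to a list of p-values. The function name `benjamini_hochberg` and the default level `alpha=0.1` are illustrative choices, not taken from the text, and the guarantee assumes independent (or positively dependent) p-values. It is not the knockoff method.

```python
import numpy as np

def benjamini_hochberg(p_values, alpha=0.1):
    """Return a boolean mask of rejected hypotheses (the "discoveries").

    Illustrative sketch of the Benjamini-Hochberg step-up rule:
    sort the m p-values, find the largest rank i such that
    p_(i) <= alpha * i / m, and reject the i smallest p-values.
    """
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)                          # indices sorting p ascending
    thresholds = alpha * np.arange(1, m + 1) / m   # alpha * i / m for i = 1..m
    below = p[order] <= thresholds
    rejected = np.zeros(m, dtype=bool)
    if below.any():
        last = np.nonzero(below)[0].max()          # largest rank passing the test
        rejected[order[:last + 1]] = True          # reject everything up to that rank
    return rejected
```

Note the step-up behavior: a p-value that fails its own threshold is still rejected if a larger p-value passes a later threshold.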

More recently, a more practical method for FDR control called T-Rex (for Terminating-Random EXperiments) was proposed based on the concept of *dummies* (Machkour, Muma, et al., 2022). Dummies are also fictitious variables used to control the FDR, but they are easy to generate since they are not required to follow the covariance structure of the original variables. Instead, dummies can be sampled from any univariate probability distribution. The T-Rex method effectively reduces the computation time by two orders of magnitude compared to the competing methods.
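The ease of generating dummies can be illustrated with a short sketch. It only shows the dummy-generation step, not the T-Rex selection algorithm itself. The function name `append_dummies`, the choice of a standard normal distribution, and the fixed seed are assumptions made for illustration.

```python
import numpy as np

def append_dummies(X, num_dummies, seed=0):
    """Append independently sampled dummy columns to the predictor matrix X.

    Hypothetical helper: each dummy entry is an i.i.d. standard normal
    draw, so no covariance matrix of the original variables is estimated.
    """
    rng = np.random.default_rng(seed)
    num_obs = X.shape[0]
    dummies = rng.standard_normal((num_obs, num_dummies))
    # Original variables come first, dummies are stacked to the right.
    return np.hstack([X, dummies])
```

In contrast to knockoffs, nothing here depends on the correlation structure of the columns of `X`, which is what keeps the cost low.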

### 13.5.2 FDR for index tracking

In the context of sparse index tracking, instead of selecting the sparsity level through trial and error, a more robust approach would be to precisely control the FDR. However, the interpretation of FDR is slightly different in this case because, strictly speaking, all the assets in the definition of an index are valid variables to be selected. Nevertheless, once an asset has been selected, it is typically the case that many other assets become redundant because they are highly correlated with that selected asset. In this sense we can say that selecting these irrelevant assets would be a “false discovery” and it is to be avoided.

The application of the T-Rex method to sparse index tracking with FDR control was developed in (Machkour et al., 2024).^{59} Rather than having to fix or tune the hyper-parameter \(\lambda\) in (13.9), it automatically selects the assets to be included by controlling the FDR.

### 13.5.3 Numerical experiments

We now compare the tracking portfolios obtained from the sparse penalized regression formulation in (13.9) with the FDR-controlling T-Rex method (Machkour et al., 2024). The tracking portfolios are computed on a rolling window basis with a lookback period of two years and recomputed every six months.
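The rolling-window schedule described above can be sketched as follows, assuming daily data with roughly 252 trading days per year, so that two years correspond to 504 observations and six months to 126. These numbers and the function name `rolling_windows` are illustrative assumptions, not values stated in the text.

```python
def rolling_windows(num_obs, lookback=504, step=126):
    """List the (start, end) index pairs of the estimation windows.

    Each window covers observations start..end-1 (end is exclusive).
    A portfolio designed on that window would then be held for the
    next `step` observations, until the following re-optimization.
    """
    windows = []
    end = lookback
    while end <= num_obs:
        windows.append((end - lookback, end))
        end += step
    return windows
```

For instance, with 1000 observations this schedule yields four windows, the first covering observations 0 to 503 and the last covering 378 to 881.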

Figure 13.12 shows the tracking error over time in terms of the plain returns (13.5) and cumulative returns (13.8), as well as the cardinality of the portfolios over time. As can be seen, the formulation in (13.9) is very sensitive to the choice of the parameter \(\lambda\), producing very different results in terms of tracking error and cardinality. On the other hand, the FDR-controlling T-Rex method automatically chooses the appropriate sparsity level without having to tune any parameter. The computational cost of T-Rex is slightly higher than that of solving (13.9) for a fixed \(\lambda\) but lower than solving (13.9) for a whole range of values for \(\lambda\).

### References

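R. F. Barber and Candès (2015). *The Annals of Statistics*, *43*(5), 2055–2085.

Benjamini and Hochberg (1995). *Journal of the Royal Statistical Society. Series B: Statistical Methodology*, *57*(1), 289–300.

C. R. Harvey et al. (2016). *Review of Financial Studies*, *29*(1), 5–68.

Machkour, Muma, et al. (2022). *Available at arXiv*.

Machkour et al. (2024). *Available at arXiv*.

Machkour, Tien, et al. (2022). *TRexSelector: T-Rex Selector: High-dimensional variable selection & FDR control*.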

^{59} The R package `TRexSelector` (Machkour, Tien, et al., 2022) implements the T-Rex method based on (Machkour, Muma, et al., 2022; Machkour et al., 2024).