Chapter 7 Clustering Functional Data

7.1 Finite Mixture Models

A finite mixture model is a probabilistic model defined by a convex combination of finitely many probability mass or density functions. Each density is referred to as a component, and the components are usually of the same family of distributions. Denote by $f$ the pdf (or pmf) of a $p$ -dimensional $G$ -component finite mixture model. Then, we can write $f$ as, $\begin{align*} f(x \mid \theta) = \sum_{g =1}^G \pi_g f(x\mid \theta_g), \end{align*}$ where the vector of $\pi_g$ s live on the standard $(G-1)$ -dimensional simplex, and $\theta = (\pi_1, \ldots, \pi_G, \theta_1, \ldots, \theta_G)$ is the collection of all model parameters. Finite mixture models are sometimes used for density estimation, as a finite mixture model with a sufficient number of components can estimate an arbitrary pdf with an arbitrary level of accuracy (Renshaw 2018), however they are more commonly used to model heterogeneous sub-populations within a larger population.

For frequentist estimation of finite mixture model parameters, the method of choice is maximum likelihood. For a sample $S = \{ x_1, \ldots, x_n\}$ supposed to have been drawn from a mixture density, the associated data likelihood takes the form, $\begin{align*} \mathscr{L}(\theta; S) = \prod_{i=1 }^{n} f(x_i \mid \theta) = \prod_{i=1 }^{n} \sum_{g=1}^G \pi_g f(x_i\mid \theta_g), \end{align*}$ which is difficult to optimize directly. To deal with this, we use a latent variable approach. Specifically, we assume the existence of a latent variable $Z$ distributed according to a multinoulli distribution with parameter vector $\pi = (\pi_1, \ldots, \pi_G)$ . A single draw from $Z$ then produces a vector $z=(z_g)_{g=1}^G$ where only one of the $z_g$ ’s takes the values $1$ while the rest take the value $0$ . We may therefore equate a roll of $Z$ with choosing a mixture density $f(\cdot\mid \theta_g)$ with probability $\pi_g$ . Hence each latent variable $Z_i$ associated with the sampled data point $x_i$ encodes the mixture component responsible for generating $x_i$ . Using this $Z$ the likelihood can be rewritten as, $\begin{align*} \mathscr{L}(\theta; S) = \prod_{i=1 }^{n}\prod_{g=1}^G\big\{ \pi_gf(x_i\mid \theta_g)\big\}^{z_{ig}}. \end{align*}$ In this form, we can easily apply the $\log$ and take derivatives with respect to the model parameters for each group. However, the issue is that the $z_i$ ’s are unobserved, and therefore need to be estimated. Our best estimate is, $\begin{align*} \hat{z}_{ig} = \mathbb{E}[ Z_{ig}\mid x_i, \theta] = \frac{\pi_g f(x_i \mid \theta_g)}{\sum_{j=1}^{G} \pi_j f(x_i \mid \theta_j)}. \end{align*}$ We then plug these into the loglikelihood, after which we can optimize it for an estimate of $\theta$ . Of course, computation of the $\hat{z}_{ig}$ ’s requires an estimate of $\theta$ , while estimating $\theta$ requires the $z_{ig}$ s. Hence, we get an iterative estimation algorithm which is a version of the $EM$ algorithm. The algorithm must be initialized with either an initial guess of the model parameters $\theta$ or the latent $z_i$ ’s. Under mild regularity conditions, this algorithm is guaranteed to converge to a local maximum (Wu 1983). This means proper initialization of mixture models at estimation time is key.

7.2 Challenges with Mixture Models

The estimation of mixture models often suffers from the curse of dimensionality because the number of parameters to be estimated often grows rapidly with respect to data dimension. For example, in the cases of a Gaussian mixture model, $\theta_g = \{\mu_g, \Sigma_g\}$ where $\mu_g$ is the mean parameter of the $g$ th component and $\Sigma_g$ is the covariance parameter. Denoting the data dimension by $p$ , estimation of $\Sigma_g$ in this case requires the estimation of $p(p+1)/2$ scalar-valued parameters, which grows quadratically with data dimension. Hence, many estimation methods for mixture models focus on parsimonious parameter specifications. That is, assumptions are made on the structure of the parameters within the model which significantly inhibits their growth with respect to data dimension. Some classic examples are (Fraley and Raftery 2002) and (Bouveyron, Girard, and Schmid 2007).

Another issue with mixture models is an abundance of hyperparameters, which cannot be optimized using the likelihood as with the usual model parameters. For example, the number of components of the mixture, $G$ is an integer-valued hyperparameter which is present in any formulation of a finite mixture model. However, parsimonious specification often also introduces additional, component-specific hyperparameters as well, such as the intrinsic dimension of (Bouveyron, Girard, and Schmid 2007). In order to find good values of these hyperparameters, we must fit models for each combination of values of interest. We then use the Bayesian Information Criterion (BIC) (Schwarz 1978) to choose the best model among those which were fitted. For a given estimated set of model parameters $\hat{\theta}$ , the BIC is computed as,
$\begin{align*} BIC(\hat{\theta}) = 2\ell(\hat{\theta}) - k\log(n), \end{align*}$ where $\ell(\hat\theta)$ is the log-likelihood value given the estimated parameters, $n$ is the sample size, and $k$ is the number of free parameters in the model. Hence, BIC penalizes the growth of the likelihood by the growth in the number of free parameters and the number of data points used, i.e. it works to help prevent the overfitting that occurs when increasing the likelihood value is increased arbitrarily by increasing model complexity.

Finally, although the EM-algorithm is guaranteed to converge, its convergence is linear, and it is deterministic once initialized. Hence, it cannot escape local modes, and one must start the algorithm in the “domain of attraction” of the global mode in order to find it. This makes initialization strategies an important component of mixture model estimation. Often, many random initializations with mini runs of the EM algorithm are used to find “promising” starting points. The best of these are then run until convergence. Alternatively, one may use non-monotonic versions of the algorithm, such as the stochastic EM algorithm, which gives up guaranteed convergence for the chance to escape local modes.

7.3 FunHDDC

We will discuss the below works in class. An associated R script can be found on Canvas.

7.4 Exercises

7.4.1 Enrichment

Exercise 7.1 (Finding Groups in Tridat) On Canvas there is a file trifuns.RData which contains functional data composed of 4 groups, with 250 observations from each group.

Plot the data and color the trajectories according to the true grouping.
Project the data onto a basis of your choice. Give a reason for your basis choice.
Cluster the data using the funHDDC function in R, fitting a model for $K=1,\ldots, 5$ using the k-means initialization and set keepAllRes=FALSE (Hint: this can all be done with one functional call).
Choose the best model from among these using BIC. Report the number of groups, model complexity, and BIC of your best model.
Plot the data and color the trajectories according to the grouping estimated by your best model and comment on the model’s estimated groups.
Cluster the data using the funHDDC function in R again, fitting a model for $K=1,\ldots, 5$ but this time using random initialization but still setting keepAllRes=FALSE (This may take around 5mins on a single core).
Report the best model from this run. How does it compare to the best model from your previous run?
If your model from this run is better than your previous model, make a new plot of the data using your new best model’s class assignments and comment on the estimated groups.
Cluster the data using the gmfd_kmeans function with metric="L2" for each of n.cl=2,3,4,5. This will require multiple functional calls. Determine the model which gives the best group structure. To do this, you can use the silhouette function from the cluster package (or similar) and the gmfd_diss function in the gmfd package.
Report which model gives the best grouping structure under this criterion.
Repeat the above for the gmfd_kmeans function, but this time use metric="mahalanobis" and p=10^6.
Compare this model to the previous gmfd_kmeans model which was fit using metric="L2" in terms of number of groups found and the associated grouping (using table or a plot).
Plot the data according to the best functional k-means model and comment on the estimated grouping.
Reflect on the analysis. Overall, how did the models perform? Which was the best and which was the worst?
If all your models found the true clustering, go back and try to break one of the methods. You do not have to do this if at least one of your models failed to find the true clustering. This means perhaps messing with hyperparameter specifications/values, initialization strategies, the basis used, or in the case of gmfd the number of discrete samples per observation.

References

Bouveyron, C., S. Girard, and C. Schmid. 2007. “High-Dimensional Data Clustering.” Computational Statistics & Data Analysis 52 (1): 502–19. https://doi.org/https://doi.org/10.1016/j.csda.2007.02.009.

Fraley, Chris, and Adrian E Raftery. 2002. “Model-Based Clustering, Discriminant Analysis, and Density Estimation.” Journal of the American Statistical Association 97 (458): 611–31. https://doi.org/10.1198/016214502760047131.

Renshaw, Arthur E. 2018. “Statistical Analysis of Finite Mixture Distributions.” Royal Statistical Society. Journal. Series A: General 150 (3): 283–83. https://doi.org/10.2307/2981482.

Schwarz, Gideon. 1978. “Estimating the Dimension of a Model.” The Annals of Statistics 6 (2): 461–64. http://www.jstor.org/stable/2958889.

Wu, C. F. Jeff. 1983. “On the Convergence Properties of the EM Algorithm.” The Annals of Statistics 11 (1): 95–103. https://doi.org/10.1214/aos/1176346060.