6 Algorithm
6.1 Block Relaxation
Our task is to minimize \(\sigma(H,A)\) over \(H\) and \(A\), suitably constrained. Write the constraints as \(H\in\mathcal{H}\) and \(A\in\mathcal{A}\). The strategy we use is block relaxation (De Leeuw 2015a). Thus we iterate as follows.
- Step 0: Set \(k=0\) and start with some \(H^{(0)}\).
- Step 1: \(A^{(k)}\in{\mathop{\mathbf{argmin}}\limits_{A\in\mathcal{A}}}\ \sigma(H^{(k)},A)\).
- Step 2: \(H^{(k+1)}\in{\mathop{\mathbf{argmin}}\limits_{H\in\mathcal{H}}}\ \sigma(H,A^{(k)})\).
- Step 3: If converged, stop. Otherwise set \(k\leftarrow k+1\) and go to step 1.
It is assumed that step 1, updating \(A\) for given \(H\), can be carried out simply by some form of linear least squares. We assume that for each \(\ell\) there is at least one \(j\) such that \(A_{j\ell}=I\). Note that this is the case for MLR, PCA, EFA, and for all Gifi Systems.
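To fix ideas, here is a minimal R sketch of this loop. The functions updateA(), updateH(), and loss() are hypothetical stand-ins for the two constrained least squares subproblems and for \(\sigma(H,A)\); they are not part of the actual code.

```r
## Schematic block relaxation loop. updateA(), updateH(), and loss() are
## hypothetical stand-ins for the two constrained least squares subproblems
## and for the loss sigma(H, A).
blockRelax <- function (H, updateA, updateH, loss, eps = 1e-10, itmax = 1000) {
  A <- updateA (H)
  fold <- loss (H, A)
  for (k in 1:itmax) {
    H <- updateH (A)           # step 2: update H for given A
    A <- updateA (H)           # step 1: update A for given H
    fnew <- loss (H, A)
    if (fold - fnew < eps) break
    fold <- fnew
  }
  list (H = H, A = A, loss = fnew)
}
```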
Step 2 is somewhat more intricate, because of the cone restrictions. In partitioned form we can write the loss function as \[ \sigma(H,A)=\sum_{i=1}^m\mathbf{tr}\ H_i'\sum_{j=1}^mH_j\sum_{\ell=1}^LA_{j\ell}A_{i\ell}'. \] If we define \[ B_{ij}(A)=\sum_{\ell=1}^LA_{j\ell}A_{i\ell}', \] then \(\sigma(H,A)=\sum_{i=1}^m\sum_{j=1}^m\mathbf{tr}\ H_i'H_jB_{ij}(A)\).
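In R the blocks \(B_{ij}(A)\) are easy to compute. A minimal sketch, which assumes (purely for illustration) that \(A\) is stored as a nested list with \(A_{j\ell}\) in A[[j]][[l]]:

```r
## B_{ij}(A) = sum over l of A_{jl} A_{il}'. The nested-list storage of A is an
## assumption made for this sketch only.
makeBij <- function (A, i, j) {
  Reduce ("+", Map (function (Ajl, Ail) Ajl %*% t (Ail), A[[j]], A[[i]]))
}
```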
6.2 Majorization
For any \(\tilde H\), and any symmetric positive semi-definite \(G\), \[ \mathbf{tr}\ H'HG=\mathbf{tr}\ (\tilde H + (H - \tilde H))'(\tilde H + (H - \tilde H))G\geq\mathbf{tr}\ \tilde H'\tilde HG+2\mathbf{tr}\ \tilde H'(H - \tilde H)G. \]
\[ \mathbf{tr}\ H'\tilde HG(\tilde H) \]
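The inequality above holds because the difference between its two sides is \(\mathbf{tr}\ (H-\tilde H)'(H-\tilde H)G\geq 0\) for positive semi-definite \(G\). A quick numerical check in R, with a random symmetric positive semi-definite \(G\):

```r
## Numerical check of the inequality tr H'HG >= tr Ht'HtG + 2 tr Ht'(H - Ht)G
## for a random symmetric positive semi-definite G.
set.seed (12345)
n <- 10; p <- 3
H <- matrix (rnorm (n * p), n, p)
Ht <- matrix (rnorm (n * p), n, p)
G <- crossprod (matrix (rnorm (p * p), p, p))
lhs <- sum (diag (t (H) %*% H %*% G))
rhs <- sum (diag (t (Ht) %*% Ht %*% G)) + 2 * sum (diag (t (Ht) %*% (H - Ht) %*% G))
lhs >= rhs   # TRUE
```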
6.3 Alternating Least Squares
The standard way to minimize loss function \(\eqref{E:oldloss}\) is implemented in the OVERALS program (Van der Burg, De Leeuw, and Verdegaal 1988; Meulman and Heiser 2012). It is also the one used in the homals package (De Leeuw and Mair 2009a).
In this paper the algorithm is different because we use the loss function \(\eqref{E:gifiloss}\). We still use ALS, which means in this case that we cycle through three substeps in each iteration. We update \(A\) for given \(X\) and \(H\), we then update \(X\) for given \(H\) and \(A\), and finally we update \(H\) for given \(X\) and \(A\). Algorithm A goes as follows.
- Step 0: Set \(k=0\) and start with some \(X^{(0)},H^{(0)},A^{(0)}\).
- Step 1: \(X^{(k+1)}=\mathbf{ortho}(\mathbf{center}(H^{(k)}A^{(k)}))\).
- Step 2: For \(j=1,\cdots,m\) compute \(A_j^{(k+1)}=\{H_j^{(k)}\}^+X^{(k+1)}\).
- Step 3: For \(j=1,\cdots,m\) and \(s=1,\cdots,p_j\) compute \(h_{js}^{(k+1)}=\mathbf{proj}_{\mathcal{K}_{js}\cap\mathcal{S}}((X^{(k+1)}-\sum_{t<s}h_{jt}^{(k+1)}\{a_{jt}^{(k+1)}\}'-\sum_{t>s}h_{jt}^{(k)}\{a_{jt}^{(k+1)}\}')a_{js}^{(k+1)})\).
- Step 4: If converged, stop. Otherwise set \(k\leftarrow k+1\) and go to step 1.
In step 2 we use superscript \(+\) for the Moore-Penrose inverse. In step 1 the center operator does column centering, and the ortho operator finds an orthonormal basis for the column space of its argument.
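A minimal sketch of these two operators in base R, assuming the argument has full column rank (the actual implementation uses the Gram-Schmidt code of De Leeuw (2015c)):

```r
## center(): column centering. ortho(): an orthonormal basis for the column
## space of its argument, here via the QR decomposition (assumes full column rank).
center <- function (x) scale (x, center = TRUE, scale = FALSE)
ortho <- function (x) qr.Q (qr (x))
```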
The complicated part is step 3, the optimal scaling, i.e. the updating of \(H_j\) for given \(X\) and \(A_j\). We cycle through the variables in the block, each time projecting a single column on the cone of admissible transformations of the variable, and then normalizing the projection to length one. The target, i.e. the vector we are projecting, is complicated, because the other variables in the same block must be taken into account.
In order to simplify the optimal scaling computations within an iteration we can use majorization (De Leeuw 1994; Heiser 1995; Lange, Hunter, and Yang 2000; De Leeuw 2015a). This has the additional benefit that the optimal scaling step becomes embarrassingly parallel. We expand the loss for block \(j\) around a previous solution \(\tilde H_j\). \[ \mathbf{SSQ}(X-H_jA_j)= \mathbf{SSQ}(X-\tilde H_jA_j)-2\mathbf{tr}\ (H_j-\tilde H_j)'(X-\tilde H_jA_j)A_j' +\mathbf{tr}\ A_j'(H_j-\tilde H_j)'(H_j-\tilde H_j)A_j. \] Now \[ \mathbf{tr}\ (H_j-\tilde H_j)A_jA_j'(H_j-\tilde H_j)'\leq\kappa_j\ \mathbf{tr}\ (H_j-\tilde H_j)'(H_j-\tilde H_j), \] where \(\kappa_j\) is the largest eigenvalue of \(A_j'A_j\). Thus \[ \mathbf{SSQ}(X-H_jA_j)\leq\mathbf{SSQ}(X-\tilde H_jA_j)+\kappa_j\ \mathbf{SSQ}(H_j-U_j)-\frac{1}{\kappa_j}\ \mathbf{SSQ}((X-\tilde H_jA_j)A_j'), \] where \(U_j\) is the target \[ U_j=\tilde H_j+\frac{1}{\kappa_j}(X-\tilde H_jA_j)A_j'.\tag{3} \] It follows that we can update the optimal scaling of the variables by projecting the columns of \(U_j\) on their respective cones and then normalizing. See De Leeuw (1975) for results on normalized cone regression. This can be done for all variables in the block separately, without taking any of the other variables in the block (or in any of the other blocks) into account. Thus the optimal scaling is easy to parallelize. The resulting Algorithm B is as follows (a schematic R version is sketched after the list).
- Step 0: Set \(k=0\) and start with some \(X^{(0)},H^{(0)},A^{(0)}\).
- Step 1: \(X^{(k+1)}=\mathbf{ortho}(\mathbf{center}(H^{(k)}A^{(k)}))\).
- Step 2: For \(j=1,\cdots,m\) compute \(A_j^{(k+1)}=\{H_j^{(k)}\}^+X^{(k+1)}\).
- Step 3: For \(j=1,\cdots,m\) compute \(U_j^{(k+1)}=H_j^{(k)}+\frac{1}{\kappa_j}(X^{(k+1)}-H_j^{(k)}A_j^{(k+1)})\{A_j^{(k+1)}\}'\) and for \(s=1,\cdots,p_j\) compute \(h_{js}^{(k+1)}=\mathbf{proj}_{\mathcal{K}_{js}\cap\mathcal{S}}(u_{js}^{(k+1)})\).
- Step 4: If converged, stop. Otherwise set \(k\leftarrow k+1\) and go to step 1.
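To illustrate, here is a schematic R version of Algorithm B for the simple case of a single block of numeric variables with monotone (ordinal) transformations, ignoring ties, missing data, copies, and spline bases. It reuses the center() and ortho() sketches given above and uses MASS::ginv() for the Moore-Penrose inverse; the data layout, starting values, and convergence test are assumptions of the sketch, not a description of the actual gifi code.

```r
## Normalized cone regression for one ordinal variable (De Leeuw 1975): project
## the target u on the cone of vectors monotone in the order of the raw data,
## then normalize to unit sum of squares.
projMonotoneNormalized <- function (u, rawData) {
  o <- order (rawData)
  h <- numeric (length (u))
  h[o] <- isoreg (u[o])$yf
  h / sqrt (sum (h ^ 2))
}

## Schematic Algorithm B for one block of numeric variables, all ordinal.
algorithmB <- function (data, ndim, eps = 1e-8, itmax = 500) {
  n <- nrow (data); p <- ncol (data)
  H <- apply (data, 2, function (z) (z - mean (z)) / sqrt (sum ((z - mean (z)) ^ 2)))
  A <- matrix (rnorm (p * ndim), p, ndim)
  fold <- Inf
  for (k in 1:itmax) {
    X <- ortho (center (H %*% A))                            # step 1
    A <- MASS::ginv (H) %*% X                                # step 2
    kappa <- max (eigen (crossprod (A), only.values = TRUE)$values)
    U <- H + ((X - H %*% A) %*% t (A)) / kappa               # step 3: target (3)
    H <- sapply (1:p, function (s) projMonotoneNormalized (U[, s], data[, s]))
    fnew <- sum ((X - H %*% A) ^ 2)                          # block loss
    if (fold - fnew < eps) break                             # step 4
    fold <- fnew
  }
  list (objectscores = X, weights = A, transform = H, loss = fnew)
}
```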
6.4 Implementation Details
If we follow the ALS strategy strictly, the \(\mathbf{ortho}()\) operator should be implemented using Procrustes rotation (Gibson 1962). Thus if \(Z=K\Lambda L'\) is the singular value decomposition of \(Z\), then \(\mathbf{ortho}(Z)=KL'\). Note, however, that any other basis for the column space of \(Z\) merely differs from the Procrustes basis by a rotation. And this rotation matrix will carry unmodified into the update of \(A_j\) in step 2 of the algorithm, and thus after steps 1 and 2 the loss will be the same, no matter which rotation we select. In our algorithm we use the QR decomposition to find the basis, using the Gram-Schmidt code from De Leeuw (2015c).
In actual computation we column-center the basis and compute a full rank QR decomposition, using the code in De Leeuw (2015c). Thus \(G_\ell=Q_\ell R_\ell\), with \(Q_\ell\) having orthonormal columns and \(R_\ell\) of full row rank.
We implement the cone restrictions by the constraints \(h_{js}=G_{js}z_s\) in combination with \(T_{js}h_{js}\geq 0\). Thus the transformed variables must be in the intersection of the subspace spanned by the columns of the transformation basis \(G_{js}\) and the polyhedral convex cone of all vectors \(h\) such that \(T_{js}h\geq 0\). We suppose that all columns of the \(G_{js}\) add up to zero, and we require, in addition, the normalization \(\mathbf{SSQ}(h_{js})=1\).
We use the code described in De Leeuw (2015b) to generate B-spline bases. Note that for coding purposes binary indicators are B-splines of degree zero, while polynomials are B-splines without interior knots. We include the utility functions to generate lists of knots. There is knotsQ() for knots at the quantiles, knotsR() for knots equally spaced on the range, knotsD() for knots at the data points, and knotsE() for no interior knots. Also note that binary indicators can be created for qualitative non-numerical variables, for which B-splines are not defined. We have added the option of using degree -1 to bypass the B-spline code and generate an indicator matrix, using the utility makeIndicator(). Note that makeIndicator(foo) is equivalent to bsplineBasis(foo, degree = 0, innerknots = sort(unique(foo))). Throughout we first orthonormalize the basis matrices \(G_{js}\), using the Gram-Schmidt code from De Leeuw (2015c).
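For a variable without missing data, a minimal stand-in for such an indicator builder (not the actual implementation of makeIndicator()) could be:

```r
## Hypothetical stand-in for makeIndicator(): one binary column per observed
## value, ordered by the sorted unique values, i.e. a degree-zero B-spline basis
## with knots at the data values.
indicatorSketch <- function (x) {
  lev <- sort (unique (x))
  ind <- 1 * outer (x, lev, "==")
  colnames (ind) <- as.character (lev)
  ind
}
```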
The matrices \(T_{js}\) in the homogeneous linear inequality restrictions that define the cones \(\mathcal{K}_{js}\) can be used to define monotonicity or convexity of the resulting transformations. In the current implementation we merely allow for monotonicity, which means the \(T_{js}\) do not have to be stored. The transformations for each variable can be restricted to be increasing, or they can be unrestricted. By using splines without interior knots we allow in addition for polynomial transformations, which again can be restricted to be either monotonic or not. Note that it is somewhat misleading to say we are fitting monotone splines or polynomials: we are only requiring monotonicity at the data points.
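As a schematic example (the current implementation keeps these matrices implicit), monotonicity at five data points, already sorted by the raw data value, corresponds to a first-difference matrix:

```r
## Schematic example of a T matrix encoding monotonicity: for observations
## sorted by the data value, T h >= 0 says the successive differences of h are
## nonnegative, i.e. h is nondecreasing.
n <- 5
Tmat <- diff (diag (n))   # 4 x 5 matrix; row i computes h[i + 1] - h[i]
```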
If there are multiple copies of a variable in a block then requiring the transformation to be ordinal means that we want the transformation of the first copy to be monotonic. The transformations of the other copies are not constrained to be monotonic. If you want all copies to be transformed monotonically, you have to explicitly introduce them as separate variables.
For variables with copies there is yet another complication. For copies we have \(H_jA_j=G_j(Z_jA_j)=G_jY_j\). If we require monotonicity in MVAOS we constrain a column of \(H_j\) (in fact, the first one) to be monotonic. In classic Gifi, in which the \(G_j\) are binary indicators, we constrain the first column of \(Y_j\), which automatically implies that the first column of \(G_jY_j\) is monotonic as well. In previous Gifi work with B-splines, we also constrained the first column of \(Y_j\), which again implied that the first column of \(G_jY_j\) was monotonic as well. But in our current MVAOS implementation monotonicity of the first column of \(H_j\) does not imply monotonicity of the first column of \(H_jA_j\), even if the basis \(G_j\) is a binary indicator. This discrepancy between the old and the new Gifi only comes into play for ordinal variables with multiple copies.
Missing data are incorporated in the definition of the cones of transformations by using a \(G_{js}\) which is the direct sum of a spline basis for the non-missing and an identity matrix for the missing data. This is called missing data multiple in Gifi (1990). There are no linear inequality restrictions on the quantifications of the missing data.
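A sketch of how such a direct-sum basis could be assembled from a basis computed on the non-missing observations (the function name and signature are ours, not the package's):

```r
## Sketch of a "missing data multiple" basis: the direct sum of a basis for the
## non-missing rows of x and an identity matrix for the missing rows.
## basisNonMissing is assumed to have one row per non-missing value of x.
missingMultipleBasis <- function (x, basisNonMissing) {
  n <- length (x)
  mis <- which (is.na (x))
  q <- ncol (basisNonMissing)
  G <- matrix (0, n, q + length (mis))
  G[!is.na (x), 1:q] <- basisNonMissing
  if (length (mis) > 0)
    G[cbind (mis, q + seq_along (mis))] <- 1
  G
}
```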
6.5 Wrappers
The homals() implementation in De Leeuw and Mair (2009a) is a single monolithic program in R, which specializes to the various MVAOS techniques by a suitable choice of its parameters. This approach has some disadvantages. If we want principal component analysis, we already know all blocks are singletons. If we want multiple correspondence analysis, we know each variable has \(p\) copies. If we want multiple regression, we know there are two blocks, and one is a singleton. So it is somewhat tedious to specify all parameters all of the time. Also, some of the output, graphical and otherwise, is specific to a particular technique. For regression we want residuals and fitted values, in canonical analysis we want block scores and loadings. And, more generally, we may want the output in a form familiar from the classical MVA techniques. It is indeed possible to transform the homals() output to more familiar forms (De Leeuw 2009), but this requires some extra effort.
In this book we go back to the original approach of Gifi (1990) and write separate programs for nonlinear versions of principal component analysis, multiple regression, canonical analysis, discriminant analysis, and so on.
These programs, now written in R and no longer in FORTRAN, are wrappers for the main computational core, the program gifiEngine(). The wrappers, which have the familiar names morals(), corals(), princals(), homals(), criminals(), overals(), primals(), and canals(), create a gifi object from the data and parameters, and then pass this to gifiEngine(). Computations are iterated to convergence, and the results are stored in an xGifi object. Then the output is transformed to a format familiar from the corresponding technique in classical MVA. Each wrapper foo returns a structure of class foo.
This modular approach saves code, because both makeGifi() and gifiEngine() are common to all programs. It also makes it comparatively easy to add new wrappers not currently included, possibly even contributed by others.
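Schematically, each wrapper therefore has the following shape; the argument lists are hypothetical, and fooPostProcess() is a made-up name for the technique-specific reshaping of the results.

```r
## Purely schematic wrapper pattern; the actual argument lists of the wrappers,
## makeGifi(), and gifiEngine() differ, and fooPostProcess() is hypothetical.
foo <- function (data, ...) {
  g <- makeGifi (data, ...)       # build the gifi object from data and parameters
  x <- gifiEngine (g, ...)        # iterate to convergence, producing an xGifi
  out <- fooPostProcess (g, x)    # reshape into the familiar classical MVA form
  structure (out, class = "foo")
}
```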
Although we like the above quotation from Hill (1990), it is not quite accurate. Our current generation of wrappers can use B-spline bases, it can use an arbitrary number of copies of a variable, and each copy can be either categorical, ordinal, polynomial, or splinical. Thus, even more so than the original gifi programs, we have a substantial generalization of the classical techniques, not merely a sequence of synonyms.
6.6 Structures
The computations are controlled by the arguments to the wrappers. These arguments are used to construct three structures: the gifi, the gifiBlock, and the gifiVariable. A gifi is just a list of gifiBlocks, and a gifiBlock is a list of gifiVariables. This reflects the partitioning of the variables into blocks. A gifiVariable contains a great deal of information about the variable. The function makeGifiVariable() is a constructor that returns a structure of class gifiVariable. The contents of a gifiVariable remain the same throughout the computations.
```r
return (structure (
  list (
    data = data,
    basis = basis,
    qr = qr,
    copies = copies,
    degree = degree,
    ties = ties,
    missing = missing,
    ordinal = ordinal,
    active = active,
    name = name,
    type = type
  ),
  class = "gifiVariable"
))
```
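Because these structures are plain nested lists, they can be traversed with ordinary list operations. A toy example (not part of the package) that collects the names of all variables in a gifi:

```r
## Toy illustration: walk the gifi -> gifiBlock -> gifiVariable nesting and
## collect the name field of every gifiVariable.
variableNames <- function (gifi) {
  unlist (lapply (gifi, function (block) sapply (block, function (v) v$name)))
}
```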
There are three corresponding structures containing initial and intermediate results, and eventually output: the xGifi, the xGifiBlock, and the xGifiVariable. Again, an xGifi is a list of xGifiBlocks, and an xGifiBlock is a list of xGifiVariables. The constructor for an xGifiVariable returns an object of class xGifiVariable, which contains the elements that are updated in each iteration during the computations. There is an xGifiVariable for both active and passive variables.
```r
return (structure (
  list (
    transform = transform,
    weights = weights,
    scores = scores,
    quantifications = quantifications
  ),
  class = "xGifiVariable"
))
```
References
De Leeuw, J. 2015a. Block Relaxation Algorithms in Statistics. Bookdown. https://bookdown.org/jandeleeuw6/bras/.
Van der Burg, E., J. De Leeuw, and R. Verdegaal. 1988. “Homogeneity Analysis with K Sets of Variables: An Alternating Least Squares Approach with Optimal Scaling Features.” Psychometrika 53: 177–97. http://www.stat.ucla.edu/~deleeuw/janspubs/1988/articles/vanderburg_deleeuw_verdegaal_A_88.pdf.
Meulman, J.M., and W.J. Heiser. 2012. IBM SPSS Categories 21. IBM Corporation.
De Leeuw, J., and P. Mair. 2009a. “Homogeneity Analysis in R: The Package homals.” Journal of Statistical Software 31 (4): 1–21. http://www.stat.ucla.edu/~deleeuw/janspubs/2009/articles/deleeuw_mair_A_09a.pdf.
De Leeuw, J. 1994. “Block Relaxation Algorithms in Statistics.” In Information Systems and Data Analysis, edited by H.H. Bock, W. Lenski, and M.M. Richter, 308–24. Berlin: Springer Verlag. http://www.stat.ucla.edu/~deleeuw/janspubs/1994/chapters/deleeuw_C_94c.pdf.
Heiser, W.J. 1995. “Convergent Computing by Iterative Majorization: Theory and Applications in Multidimensional Data Analysis.” In Recent Advantages in Descriptive Multivariate Analysis, edited by W.J. Krzanowski, 157–89. Oxford: Clarendon Press.
Lange, K., D.R. Hunter, and I. Yang. 2000. “Optimization Transfer Using Surrogate Objective Functions.” Journal of Computational and Graphical Statistics 9: 1–20.
De Leeuw, J. 1975. “A Normalized Cone Regression Approach to Alternating Least Squares Algorithms.” Department of Data Theory FSW/RUL. http://www.stat.ucla.edu/~deleeuw/janspubs/1975/notes/deleeuw_U_75a.pdf.
Gibson, W.A. 1962. “On the Least Squares Orthogonalization of an Oblique Transformation.” Psychometrika 27: 193–95.
De Leeuw, J. 2015c. “Exceedingly Simple Gram-Schmidt Code.” http://rpubs.com/deleeuw/84866.
De Leeuw, J. 2015b. “Exceedingly Simple B-spline Code.” http://rpubs.com/deleeuw/79161.
Gifi, A. 1990. Nonlinear Multivariate Analysis. New York, N.Y.: Wiley.
De Leeuw, J. 2009. “Regression, Discriminant Analysis, and Canonical Analysis with homals.” Preprint Series 562. Los Angeles, CA: UCLA Department of Statistics. http://www.stat.ucla.edu/~deleeuw/janspubs/2009/reports/deleeuw_R_09c.pdf.
Hill, M.O. 1974. “Correspondence Analysis: A Neglected Multivariate Method.” Applied Statistics 23: 340–54.
Hill, M.O. 1990. “Review of A. Gifi, Multivariate Analysis.” Journal of Ecology 78 (4): 1148–9.