2.1 Some Dualisms
The type of multivariate analysis (MVA) we discuss in this book is sometimes called descriptive or exploratory, as opposed to inferential or confirmatory. It is located somewhere on the line between computational linear algebra and statistics, and it is probably close to data analysis, Big Data, machine learning, knowledge discovery, data mining, business analytics, or whatever other ill-defined label is used for the mode du jour.
In the days of Gifi (1990) there was a small-scale civil war between the mathematical statistical (confirmatory) approach to MVA and the data analytical (exploratory) approach. This is not a new conflict, because it has its roots in the Pearson-Yule debate (Mackenzie (1978)). The first shots in modern times were probably fired by Tukey (1962), but much additional polemic heat was generated in the 50 years since Tukey’s famous paper. In order to stand our ground we were forced to participate, for example with De Leeuw (1984a), De Leeuw (1988a), De Leeuw (1990).
Here is what Gifi (1990) says, clearly with some intent to provoke.
The statistical approach starts with a statistical model, usually based on the multinormal distribution. The model is assumed to be true, and within the model certain parametric hypotheses are constructed. The remaining free parameters are estimated and the hypotheses are tested. (Gifi (1990), p. 19)
The data analytic approach does not start with a model, but looks for transformations and combinations of the variables with the explicit purpose of representing the data in a simple and comprehensive, and usually graphical, way. (Gifi (1990), p. 19)
Gifi’s first chapter, in particular his section 1.5, outlines an approach to statistics that emphasizes techniques over models. A technique is a map of data into representations of some sort. Representations can be test statistics, confidence intervals, posterior distributions, tables, graphs. Statistics studies the construction, properties, and performance of techniques. Models are used to gauge techniques, that is to see how they perform on synthetic data, which are data described by equations or generated by sampling. Of course models can also be used to inspire techniques, but data analysis does not deal with the inspirational phase. The models themselves are a part of the client sciences, not of statistics. One of the key characteristics of a technique is its stability, which is studied by using data perturbations of various sorts. Small and unimportant perturbations of the data should lead to small and unimportant changes in the output. A large and important class of perturbations is based on sampling from a population, leading to sampling distributions, confidence intervals, hypothesis tests, and standard errors. In Gifi’s branch of statistics the emphasis shifts from equations to algorithms, and from explanation to prediction.
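To make the idea of stability under sampling perturbations concrete, here is a minimal sketch that treats a technique as a plain function from data to output and perturbs the data by resampling with replacement (the bootstrap). The function name `bootstrap_standard_error`, the toy data, and the choice of the mean and the median as "techniques" are our own illustrative assumptions; they do not come from Gifi (1990).

```python
import random
import statistics


def bootstrap_standard_error(data, technique, n_replications=2000, seed=0):
    """Standard deviation of technique(resample) over bootstrap resamples of data."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(data)
    outputs = []
    for _ in range(n_replications):
        # Perturb the data: draw n observations with replacement.
        resample = [data[rng.randrange(n)] for _ in range(n)]
        outputs.append(technique(resample))
    return statistics.stdev(outputs)


# Hypothetical toy data, used only for illustration.
data = [2.1, 3.4, 1.9, 5.6, 4.2, 3.8, 2.7, 4.9, 3.1, 4.4]

# Two simple "techniques" mapping data to a one-number representation.
se_mean = bootstrap_standard_error(data, statistics.mean)
se_median = bootstrap_standard_error(data, statistics.median)
print(se_mean, se_median)
```

The same wrapper can be applied to any technique that returns a number. For techniques that return tables or plots one would have to choose a suitable summary of the output first.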
In related philosophizing Breiman (2001) contrasted the two cultures of data modeling (98% of statisticians) and algorithmic modeling (2% of statisticians), and implored statisticians to spend less time and energy in the first culture and more in the second.
Reading a preprint of Gifi’s book (1990) many years ago uncovered a kindred spirit. (Breiman (2001), p. 205)
This was written some time after the influential paper by Breiman and Friedman (1985), which introduced the Gifi-like ACE technique for multiple regression.
The emphasis in the data modeling culture is on explanation or information, the emphasis in the algorithmic modeling culture on prediction. There are various ways to present and evaluate this distinction. A good overview, from the philosophy of science point of view, is Shmueli (2010). From the lofty heights of Academia we hear:
The two goals in analyzing data which Leo calls prediction and information I prefer to describe as “management” and “science.” Management seeks profit, practical answers (predictions) useful for decision making in the short run. Science seeks truth, fundamental knowledge about nature which provides understanding and control in the long run. (Parzen, in the discussion of Breiman (2001), p. 224)
The emphasis on techniques was also shared by Cleveland (2014), who proposed a new curriculum for statistics departments with more emphasis on computing with data and tool evaluation. Another early ally was Laurie Davies, see the interesting papers by Davies (1995) and Davies (2008).
The first chapter of Gifi (1990) contains an interesting discussion of statistical practice with special reference to multivariate data. The point of view taken there, with its emphasis on ‘techniques’, has points of contact with the present paper where we use Tukey’s nomenclature and refer to ‘procedure’. (Davies (2008), p. 192)
Of course currently the big discussion is whether Data Science is actually statistics under a new name. And, more importantly, who should teach it. And, even more importantly, which department should receive the grant money. Parzen may believe that statisticians seek the Truth, whatever that is, but the current situation in Academia is that there is no truth if you do not consider profit. Statistics departments are typically small, and they feel threatened by gigantic Schools of Engineering looming over them (American Statistical Association (2014), Yu (2014)). It is partly a question of scale: there are too many data to fit into statistics departments. Numerous new graduate Data Science programs are popping up, in many cases geared toward management and not so much toward science. Statistics departments are seriously considering changing their names, before the levees break and they are flooded by the inevitable rise of data.
We shall not pay much attention any more to these turf and culture wars, because basically they are over. Data analysis, in its multitude of disguises and appearances, is the winner. Classical statistics departments are gone, or on their way out. They may not have changed their name, but their curricula and hiring practices are very different from what they were 20 or even 10 years ago.
Neither do men put new wine into old bottles: else the bottles break, and the wine runneth out, and the bottles perish: but they put new wine into new bottles, and both are preserved. (Matthew 9:17)
Notwithstanding the monumental changes, inferential statistics remains an important form of stability analysis for data analysis techniques. Probabilistic models are becoming more and more important in many branches of science, and perturbing a probabilistic model is most naturally done by sampling. Thus huge parts of classical statistics are preserved, and not surprisingly these are exactly the parts useful in data analysis.
2.2 Quantifying Qualitative Data
One way of looking at Multivariate Analysis with Optimal Scaling, or MVAOS, is as an extension of classical linear multivariate analysis to variables that are binary, ordered, or even unordered categorical. In R terminology, classical MVA techniques can thus be applied if some or all of the variables in the dataframe are factors. Categorical variables are quantified and numerical variables are transformed to optimize the linear or bilinear least squares fit.
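The simplest instance of quantification can be sketched as follows, assuming a single unordered categorical variable and a single numerical variable that we want to fit by least squares. In that special case the optimal score for each category is the mean of the numerical variable within the category. The function names and the toy data are hypothetical, and the sketch ignores the normalization constraints that are needed as soon as all variables are transformed.

```python
from collections import defaultdict


def quantify_factor(categories, y):
    """Least squares scores for an unordered factor: the mean of y per category."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for c, v in zip(categories, y):
        totals[c] += v
        counts[c] += 1
    return {c: totals[c] / counts[c] for c in totals}


def residual_sum_of_squares(categories, y, scores):
    """Sum of squared differences between y and the score of each case's category."""
    return sum((v - scores[c]) ** 2 for c, v in zip(categories, y))


# Hypothetical toy data: one factor and one numerical variable.
color = ["red", "blue", "red", "green", "blue", "green"]
y = [1.0, 4.0, 3.0, 8.0, 6.0, 10.0]

scores = quantify_factor(color, y)

# Any other assignment of one number per category fits worse,
# for instance an arbitrary integer coding of the levels.
arbitrary = {"red": 1.0, "blue": 2.0, "green": 3.0}
print(residual_sum_of_squares(color, y, scores))
print(residual_sum_of_squares(color, y, arbitrary))
```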
Least squares and eigenvalue methods for quantifying multivariate qualitative data were first introduced by Guttman (1941), although there were some bivariate predecessors in the work of Pearson, Fisher, Maung, and Hirschfeld (see De Leeuw (1983) or Gower (1990) for a historical overview). In this earlier work the emphasis was often on optimizing quadratic forms, or ratios of quadratic forms, and not so much on least squares, distance geometry, and graphical representations such as biplots (Gower and Hand (1996), Gower, Le Roux, and Gardner-Lubbe (2015), Gower, Le Roux, and Gardner-Lubbe (2016)). These quantification methods were taken up by, among others, De Leeuw (1968a), by Benzécri and his students in France (see Cordier (1965)), and by Hayashi and his students in Japan (see Tanaka (1979)). Early applications can be found in ecology, following an influential paper by Hill (1974). With increasing emphasis on software the role of graphical representations has increased and continues to increase.
In De Leeuw (1974) a first attempt was made to unify most classical descriptive multivariate techniques using a single least squares loss function and a corresponding alternating least squares (ALS) optimization method. His work then bifurcated into the ALSOS project, with Young and Takane at the University of North Carolina at Chapel Hill, and the Gifi project, at the Department of Data Theory of Leiden University.
The ALSOS project was started in 1973-1974, when De Leeuw was visiting Bell Telephone Labs in Murray Hill. ALSOS stands for Alternating Least Squares with Optimal Scaling. The ALS part of the name was provided by De Leeuw (1968b) and the OS part by Bock (1960). At early meetings of the Psychometric Society some members were offended by our use of “Optimal Scaling”, because they took it to imply that their methods of scaling were supposedly inferior to ours. But the “optimal” merely refers to optimality in the context of a specific least squares loss function.
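The combination of ALS and OS can be illustrated with a small sketch, under the assumption of one unordered categorical predictor and one numerical predictor for a numerical outcome. The loss is the residual sum of squares, and the two blocks of unknowns (category scores `z` and regression weight `b`) are updated in turn, each time solving a least squares problem with the other block fixed. The names `als_regression`, `group`, `x`, and `y`, and the toy data, are hypothetical; this is not one of the ALSOS programs.

```python
def als_regression(categories, x, y, n_iterations=200):
    """Fit y ~ z[category] + b * x by alternating two least squares updates."""
    levels = sorted(set(categories))
    z = {level: 0.0 for level in levels}
    b = 0.0
    history = []
    for _ in range(n_iterations):
        # Block 1: optimal scores z for fixed b
        # (category means of the partial residuals y - b * x).
        for level in levels:
            partial = [yi - b * xi
                       for c, xi, yi in zip(categories, x, y) if c == level]
            z[level] = sum(partial) / len(partial)
        # Block 2: optimal weight b for fixed z
        # (regression through the origin of y - z[category] on x).
        numerator = sum(xi * (yi - z[c]) for c, xi, yi in zip(categories, x, y))
        denominator = sum(xi * xi for xi in x)
        b = numerator / denominator
        # Record the loss after each full sweep over both blocks.
        loss = sum((yi - z[c] - b * xi) ** 2
                   for c, xi, yi in zip(categories, x, y))
        history.append(loss)
    return z, b, history


# Hypothetical toy data.
group = ["a", "a", "b", "b", "c", "c"]
x = [1.0, 2.0, 1.0, 3.0, 2.0, 4.0]
y = [2.1, 3.9, 5.2, 9.1, 1.8, 6.1]

z, b, history = als_regression(group, x, y)
print(b, history[0], history[-1])
```

Because each update minimizes the loss over its own block, the recorded loss cannot increase from one sweep to the next. Here "optimal" means exactly what the text says: optimal for this particular least squares loss function.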
Young, De Leeuw, and Takane applied the basic ALS and OS methodology to conjoint analysis, regression, principal component analysis, multidimensional scaling, and factor analysis, producing computer programs (and SAS modules) for each of the techniques. An overview of the project, basically at the end of its lifetime, is in Young, De Leeuw, and Takane (1980) and Young (1981).
The ALSOS project was clearly inspired by the path-breaking work of Kruskal (1964a) and Kruskal (1964b), who designed a general way to turn metric multivariate analysis techniques into non-metric ones. In fact, Kruskal applied the basic methodology developed for multidimensional scaling to linear models in Kruskal (1965), and to principal component analysis in Kruskal and Shepard (1974) (which was actually written around 1965 as well). In parallel developments closely related nonmetric methods were developed by Roskam (1968) and by Guttman and Lingoes (see Lingoes (1973)).
The Gifi project took its inspiration from Kruskal, but perhaps even more from Guttman (1941) (and to a lesser extent from the optimal scaling work of Fisher, see Gower (1990)). Guttman’s quantification method, which later became known as multiple correspondence analysis, was merged with linear and nonlinear principal component analysis in the HOMALS/PRINCALS techniques and programs (De Leeuw and Van Rijckevorsel (1980)). The MVAOS loss function that was chosen ultimately, for example in the work of Van der Burg, De Leeuw, and Verdegaal (1988), had been used earlier by Carroll (1968) in multi-set canonical correlation analysis of numerical variables.
A project similar to ALSOS/Gifi was ACE, short for Alternating Conditional Expectations. The ACE method for regression was introduced by Breiman and Friedman (1985) and the ACE method for principal component analysis by Koyak (1987). Both techniques use the same ALS block relaxation methods, but instead of projecting on a cone or subspace of possible transformations, they apply a smoother (typically Friedman’s supersmoother) to find the optimal transformation. This implies that the method is intended primarily for continuous variables, and that the convergence properties of the ACE algorithm are more complicated than those of a proper ALS algorithm.
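To show what projecting on a cone of possible transformations can look like, here is a minimal sketch for the cone of nondecreasing sequences, using the pool-adjacent-violators algorithm for unweighted monotone regression. We choose this cone only as an illustration. The function name `monotone_regression` and the input values are hypothetical.

```python
def monotone_regression(y):
    """Least squares projection of y on the cone of nondecreasing sequences,
    computed with the pool-adjacent-violators algorithm."""
    blocks = []  # each block is [sum of values, number of values]
    for value in y:
        blocks.append([value, 1])
        # Merge backwards while adjacent block means violate the ordering.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    # Expand each block into its mean, repeated once per pooled element.
    fitted = []
    for total, count in blocks:
        fitted.extend([total / count] * count)
    return fitted


fitted = monotone_regression([1.0, 3.0, 2.0, 4.0, 3.5, 5.0])
print(fitted)  # [1.0, 2.5, 2.5, 3.75, 3.75, 5.0]
```

A projection of this kind is an exact least squares solution within its block of unknowns. A smoother is in general not a projection, which is one way to see why the convergence analysis of ACE is less straightforward.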
An even more closely related project, by Winsberg and Ramsay, uses the cone of I-splines (integrated B-splines) to define the optimal transformations. The technique for linear models is in Winsberg and Ramsay (1980) and the one for principal component analysis in Winsberg and Ramsay (1983). Again, the emphasis on monotonic splines indicates that continuous variables play a larger role than in the ALSOS or Gifi system.
So generally there have been a number of projects over the last 50 years that differ in detail, but apply basically the same methodology (alternating least squares and optimal scaling) to generalize classical MVA techniques. Some of them emphasize transformation of continuous variables, some emphasize quantification of discrete variables. Some emphasize monotonicity, some smoothness. Usually these projects include techniques for regression and principal component analysis, but in the case of Gifi the various forms of correspondence analysis and canonical analysis are also included.
2.3 Beyond Gifi
The techniques discussed in Gifi (1990), and implemented in the corresponding computer programs, use a particular least squares loss function and minimize it by alternating least squares algorithms. All techniques use what Gifi calls meet loss, which is basically the loss function proposed by Carroll (1968) for multiset canonical correlation analysis. Carroll’s work was extended in Gifi by using optimal scaling to transform or quantify variables coded with indicators, and by using constraints on the parameters to adapt the basic technique, often called homogeneity analysis, to different classical MVA techniques.
There have been various extensions of the classical Gifi repertoire by adding techniques that do not readily fit into meet loss. Examples are path analysis (Coolen and De Leeuw (1987)), linear dynamic systems (Bijleveld and De Leeuw (1991)), and factor analysis (De Leeuw (2004)). But adding these techniques does not really add up to a new framework.
Somewhat more importantly, De Leeuw and Van Rijckevorsel (1988) discuss various ways to generalize meet loss by using fuzzy coding. Transformations are no longer step functions, and coding can be done with fuzzy indicators, such as B-spline bases. This makes it easier to deal with variables that have many ordered categories. Although this is a substantial generalization the basic framework remains the same.
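The difference between crisp and fuzzy coding can be sketched for a single numerical value and a set of knots. The crisp indicator puts a single one in the interval containing the value, so transformations built from it are step functions. The fuzzy version below uses degree-one B-splines (hat functions), which spread the unit weight over the two neighboring knots and give piecewise linear transformations. The function names and knots are hypothetical, and degree one is chosen only for simplicity; we do not claim it is the basis used in De Leeuw and Van Rijckevorsel (1988).

```python
def crisp_indicator(x, knots):
    """Row of an indicator matrix: a single 1 in the interval containing x."""
    n_intervals = len(knots) - 1
    row = [0.0] * n_intervals
    for j in range(n_intervals):
        is_last = j == n_intervals - 1
        # Intervals are closed on the left; the last one is also closed on the right.
        if knots[j] <= x < knots[j + 1] or (is_last and x == knots[-1]):
            row[j] = 1.0
    return row


def fuzzy_indicator(x, knots):
    """Row of a fuzzy indicator matrix built from degree-one B-splines.
    Values outside the knot range get a row of zeros in this sketch."""
    row = [0.0] * len(knots)
    for j in range(len(knots) - 1):
        left, right = knots[j], knots[j + 1]
        if left <= x <= right:
            weight = (x - left) / (right - left)
            row[j] = 1.0 - weight   # weight on the left knot
            row[j + 1] = weight     # weight on the right knot
            break
    return row


knots = [0.0, 1.0, 2.0, 3.0]
print(crisp_indicator(1.25, knots))  # [0.0, 1.0, 0.0]
print(fuzzy_indicator(1.25, knots))  # [0.0, 0.75, 0.25, 0.0]
```

In both codings the entries of a row are nonnegative and add up to one for values inside the knot range, so the fuzzy indicator can take the place of the crisp one in the same least squares framework.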
One of the outgrowths of the Gifi project was the aspect approach, first discussed systematically by De Leeuw (1988b), and implemented in the R package aspect by Mair and De Leeuw (2010). In its original formulation it uses majorization to optimize functions defined on the space of correlation matrices, where the correlations are computed over transformed variables, coded by indicators. Thus we optimize aspects of the correlation matrix over transformations of the variables. The aspect software was recently updated to allow for B-spline transformations (De Leeuw (2015)). Many different aspects were implemented, based on eigenvalues, determinants, multiple correlations, and sums of powers of correlation coefficients. Unfortunately, aspects defined in terms of canonical correlations, or generalized canonical correlations, were not covered. Thus the range of techniques covered by the aspect approach has multiple regression and principal component analysis in common with the range of the Gifi system, but is otherwise disjoint from it.
In De Leeuw (2004) a particular correlation aspect was singled out that could bridge the gap between the aspect approach and the Gifi approach, provided orthoblocks of transformations were introduced. This is combined with the notion of copies, introduced in De Leeuw (1984b), to design a new class of techniques that encompasses all of Gifi and that brings generalized canonical correlation analysis into the aspect framework. Thus correlation aspects, and the majorization algorithms to optimize them, are now a true generalization of the Gifi system. This is the system we discuss in this book.
Gifi, A. 1990. Nonlinear Multivariate Analysis. New York, N.Y.: Wiley.
Mackenzie, D. 1978. “Statistical Theory and Social Interests: A Case-Study.” Social Studies of Science 8: 35–83.
Tukey, J.W. 1962. “The Future of Data Analysis.” Annals of Mathematical Statistics 33: 1–79.
De Leeuw, J. 1984a. “Models of Data.” Kwantitatieve Methoden 5: 17–30. http://www.stat.ucla.edu/~deleeuw/janspubs/1984/articles/deleeuw_A_84c.pdf.
De Leeuw, J. 1988a. “Models and Techniques.” Statistica Neerlandica 42: 91–98. http://www.stat.ucla.edu/~deleeuw/janspubs/1988/articles/deleeuw_A_88c.pdf.
De Leeuw, J. 1990. “Data Modeling and Theory Construction.” In Operationalization and Research Strategy, edited by J.J. Hox and J. De Jong-Gierveld. Amsterdam, The Netherlands: Swets & Zeitlinger. http://www.stat.ucla.edu/~deleeuw/janspubs/1990/chapters/deleeuw_C_90b.pdf.
Breiman, L. 2001. “Statistical Modeling: The Two Cultures.” Statistical Science 16 (3): 199–231.
Breiman, L., and J.H. Friedman. 1985. “Estimating Optimal Transformations for Multiple Regression and Correlation.” Journal of the American Statistical Association 80: 580–619.
Shmueli, G. 2010. “To Explain or to Predict?” Statistical Science 25: 289–310.
Cleveland, W.S. 2014. “Data Science: An Action Plan for Expanding the Technical Areas of the Field of Statistics.” Statistical Analysis and Data Mining 7: 414–17.
Davies, P. L. 1995. “Data Features.” Statistica Neerlandica 49 (2): 185–245.
Davies, P. 2008. “Approximating Data.” Journal of the Korean Statistical Society 37: 191–211.
American Statistical Association. 2014. “Discovery with Data: Leveraging Statistics with Computer Science to Transform Science and Society.” http://www.amstat.org/policy/pdfs/BigDataStatisticsJune2014.pdf.
Yu, B. 2014. “Let Us Own Data Science.” IMS Bulletin 43: 1, 13–16.
Guttman, L. 1941. “The Quantification of a Class of Attributes: A Theory and Method of Scale Construction.” In The Prediction of Personal Adjustment, edited by P. Horst, 321–48. New York: Social Science Research Council.
De Leeuw, J. 1983. “On the Prehistory of Correspondence Analysis.” Statistica Neerlandica 37: 161–64. http://www.stat.ucla.edu/~deleeuw/janspubs/1983/articles/deleeuw_A_83b.pdf.
Gower, J.C. 1990. “Fisher’s Optimal Scores and Multiple Correspondence Analysis.” Biometrics 46: 947–61.
Gower, J.C., and D.J. Hand. 1996. Biplots. Monographs on Statistics and Applied Probability 54. Chapman & Hall.
Gower, J.C., N.J. Le Roux, and S. Gardner-Lubbe. 2015. “Biplots: Quantitative Data.” WIREs Computational Statistics 7: 42–62. doi:10.1002/wics.1338.
Gower, J.C., N.J. Le Roux, and S. Gardner-Lubbe. 2016. “Biplots: Qualitative Data.” WIREs Computational Statistics 8: 82–111. doi:10.1002/wics.1377.
De Leeuw, J. 1968a. “Canonical Discriminant Analysis of Relational Data.” Research Note 007-68. Department of Data Theory FSW/RUL. http://www.stat.ucla.edu/~deleeuw/janspubs/1968/reports/deleeuw_R_68e.pdf.
Cordier, B. 1965. “L’Analyse Factorielle des Correspondances.” Thèse de Troisieme Cycle, Université de Rennes; Faculté des Sciences.
Tanaka, Y. 1979. “Review of the Methods of Quantification.” Environmental Health Perspectives 32: 113–23.
Hill, M.O. 1974. “Correspondence Analysis: a Neglected Multivariate Method.” Applied Statistics 23: 340–54.
De Leeuw, J. 1974. Canonical Analysis of Categorical Data. Leiden, The Netherlands: Psychological Institute, Leiden University. http://www.stat.ucla.edu/~deleeuw/janspubs/1974/books/deleeuw_B_74.pdf.
De Leeuw, J. 1968b. “Nonmetric Discriminant Analysis.” Research Note 06-68. Department of Data Theory, University of Leiden. http://www.stat.ucla.edu/~deleeuw/janspubs/1968/reports/deleeuw_R_68d.pdf.
Bock, R.D. 1960. “Methods and Applications of Optimal Scaling.” Psychometric Laboratory Report 25. Chapel Hill, N.C.: L.L. Thurstone Psychometric Laboratory, University of North Carolina.
Young, F.W., J. De Leeuw, and Y. Takane. 1980. “Quantifying Qualitative Data.” In Similarity and Choice. Papers in Honor of Clyde Coombs, edited by E.D. Lantermann and H. Feger. Bern: Hans Huber. http://www.stat.ucla.edu/~deleeuw/janspubs/1980/chapters/young_deleeuw_takane_C_80.pdf.
Young, F. W. 1981. “Quantitative Analysis of Qualitative Data.” Psychometrika 46: 357–88.
Kruskal, J.B. 1964a. “Multidimensional Scaling by Optimizing Goodness of Fit to a Nonmetric Hypothesis.” Psychometrika 29: 1–27.
Kruskal, J.B. 1964b. “Nonmetric Multidimensional Scaling: a Numerical Method.” Psychometrika 29: 115–29.
Kruskal, J.B. 1965. “Analysis of Factorial Experiments by Estimating Monotone Transformations of the Data.” Journal of the Royal Statistical Society B27: 251–63.
Kruskal, J.B., and R.N. Shepard. 1974. “A Nonmetric Variety of Linear Factor Analysis.” Psychometrika 39: 123–57.
Roskam, E.E. 1968. “Metric Analysis of Ordinal Data in Psychology.” PhD thesis, University of Leiden.
Lingoes, J.C. 1973. The Guttman-Lingoes Nonmetric Program Series. Mathesis Press.
De Leeuw, J., and J.L.A. Van Rijckevorsel. 1980. “HOMALS and Princals: Some Generalizations of Principal Components Analysis.” In Data Analysis and Informatics. Amsterdam: North Holland Publishing Company. http://www.stat.ucla.edu/~deleeuw/janspubs/1980/chapters/deleeuw_vanrijckevorsel_C_80.pdf.
Van der Burg, E., J. De Leeuw, and R. Verdegaal. 1988. “Homogeneity Analysis with K Sets of Variables: An Alternating Least Squares Approach with Optimal Scaling Features.” Psychometrika 53: 177–97. http://www.stat.ucla.edu/~deleeuw/janspubs/1988/articles/vanderburg_deleeuw_verdegaal_A_88.pdf.
Carroll, J.D. 1968. “A Generalization of Canonical Correlation Analysis to Three or More Sets of Variables.” In Proceedings of the 76th Annual Convention of the American Psychological Association, 227–28. Washington, D.C.: American Psychological Association.
Koyak, R. 1987. “On Measuring Internal Dependence in a Set of Random Variables.” Annals of Statistics 15: 1215–28.
Winsberg, S., and J. O. Ramsay. 1980. “Monotone Transformations to Additivity.” Biometrika 67: 669–74.
Winsberg, S., and J. O. Ramsay. 1983. “Monotone Spline Transformations for Dimension Reduction.” Psychometrika 48: 575–95.
Coolen, H., and J. De Leeuw. 1987. “Least Squares Path Analysis with Optimal Scaling.” Research Report RR-87-03. Leiden, The Netherlands: Department of Data Theory FSW/RUL. http://www.stat.ucla.edu/~deleeuw/janspubs/1987/reports/coolen_deleeuw_R_87.pdf.
Bijleveld, C.C.J.H., and J. De Leeuw. 1991. “Fitting Longitudinal Reduced-Rank Regression Models by Alternating Least Squares.” Psychometrika 56 (3): 433–47. http://www.stat.ucla.edu/~deleeuw/janspubs/1991/articles/bijleveld_deleeuw_A_91.pdf.
De Leeuw, J. 2004. “Least Squares Optimal Scaling of Partially Observed Linear Systems.” In Recent Developments in Structural Equation Models, edited by K. van Montfort, J. Oud, and A. Satorra. Dordrecht, Netherlands: Kluwer Academic Publishers. http://www.stat.ucla.edu/~deleeuw/janspubs/2004/chapters/deleeuw_C_04a.pdf.
De Leeuw, J., and J.L.A. Van Rijckevorsel. 1988. “Beyond Homogeneity Analysis.” In Component and Correspondence Analysis, edited by J.L.A. Van Rijckevorsel and J. De Leeuw, 55–80. Chichester, England: Wiley. http://www.stat.ucla.edu/~deleeuw/janspubs/1988/chapters/deleeuw_vanrijckevorsel_C_88.pdf.
De Leeuw, J. 1988b. “Multivariate Analysis with Optimal Scaling.” In Proceedings of the International Conference on Advances in Multivariate Statistical Analysis, edited by S. Das Gupta and J.K. Ghosh, 127–60. Calcutta, India: Indian Statistical Institute. http://www.stat.ucla.edu/~deleeuw/janspubs/1988/chapters/deleeuw_C_88b.pdf.
Mair, P., and J. De Leeuw. 2010. “A General Framework for Multivariate Analysis with Optimal Scaling: The R Package aspect.” Journal of Statistical Software 32 (9): 1–23. http://www.stat.ucla.edu/~deleeuw/janspubs/2010/articles/mair_deleeuw_A_10.pdf.
De Leeuw, J. 2015. “Aspects of Correlation Matrices.” doi:10.13140/RG.2.1.2086.5367.
De Leeuw, J. 1984b. “The Gifi System of Nonlinear Multivariate Analysis.” In Data Analysis and Informatics, edited by E. Diday et al. Vol. III. Amsterdam: North Holland Publishing Company. http://www.stat.ucla.edu/~deleeuw/janspubs/1984/chapters/deleeuw_C_84c.pdf.