3 Coding and Transformations

3.1 Variables and Multivariables

In the multivariate analysis techniques presented in this book the data are measurements or classifications of \(n\) objects by \(m\) variables. Perhaps it is useful to insert some definitions here. A variable is a function that maps a domain of objects to a range of values. Domains are finite. The elements of the domain can be individuals, animals, plants, time points, locations, and so on. It is useful to distinguish the codomain (or range) of a variable from its image. The codomain of a variable can be the real numbers, but the image is always a finite set: the actual values the variable assumes on the domain. A multivariable is a sequence of variables defined on the same domain, with possibly different codomains. Multivariables are implemented in R as data frames. Variables can have a finite codomain, which can be either ordered or unordered. This corresponds to a factor or an ordered factor in R. MVAOS techniques quantify factors, replacing the values in the image by real numbers. If the variables are real-valued to start with, we replace real numbers by other real numbers, and we transform instead of quantify. The distinction between quantification and transformation is somewhat fluid, because the image of a variable is always finite and thus, in a sense, all variables are categorical (a point also emphasized, for example, in Holland (1979)).
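
As a reminder of how this terminology maps onto R objects, here is a minimal base R illustration with invented data (not part of the MVAOS software).

# a multivariable as a data frame: three variables on the same domain
# of five objects
mv <- data.frame(
  species = factor(c("cat", "dog", "dog", "cat", "bird")),  # unordered factor
  size = ordered(c("s", "m", "l", "s", "s"),
                 levels = c("s", "m", "l")),                 # ordered factor
  weight = c(4.1, 12.0, 25.5, 3.8, 0.3)                      # real-valued
)
# the image of a variable is the finite set of values it actually assumes;
# for weight the codomain is the real numbers, but the image has five elements
unique(mv$weight)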

Although the variables in a multivariable have the same domain, there can be different numbers of missing data for different variables. We handle this in the same way as R, by adding NA to the range of all variables. In this context it is also useful to define latent or unobserved variables. These are variables for which all values are missing, i.e. for which the image only contains NA. At first thought it seems somewhat perverse to have such completely missing variables, but think of examples such as principal components, factor scores, or error terms in linear models.

3.2 Induced Correlations and Aspects

If all categorical variables are quantified and all numerical variables are transformed, we can compute the induced correlation matrix of the transformed and quantified variables. In the forms of MVAOS we consider in this book the statistics we compute, except for the transformations themselves, are usually functions of this induced correlation matrix. This means that they are functions of the second order relations between the variables, or, in other words, that they are joint bivariate. Higher order moments and product moments are ignored. Different multivariate distributions with the same bivariate marginals will give the same MVAOS results.
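
To make the idea concrete, here is a small base R illustration, with invented data and an arbitrary (certainly not optimal) quantification.

set.seed(1)
f <- factor(sample(c("a", "b", "c"), 100, replace = TRUE))  # categorical
x <- rnorm(100)                                             # numerical
q <- c(-1, 0, 1)[f]    # quantify: levels a, b, c get scores -1, 0, 1
r <- cor(cbind(q, x))  # the induced correlation matrix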

3.3 Transformed Variables

The data on the \(m\) variables for the \(n\) objects are collected in an \(n\times m\) data frame. MVAOS does not operate on these data directly, but on transformations or quantifications of the variables. Choosing a transformation to minimize a loss function is known as optimal scaling. Clearly this so-called optimality is only defined in terms of a specific loss function, with specific constraints. Different constraints will lead to different optimal transformations.

Let us define the types of transformations we are interested in. The \(n\times m\) matrix of transformed variables \(H\) has columns \(h_j\), which are constrained by \(h_j=G_jz_j\), where \(G_j\) is a given matrix defining the basis for variable \(j\). In addition we require \(h_j\in\mathcal{C}_j\) and \(h_j\in\mathcal{S}\), where \(\mathcal{C}_j\) is a cone of transformations and \(\mathcal{S}\) is the unit sphere in \(\mathbb{R}^n\). This will be discussed in more detail in later sections, but for the time being think of the example in which \(h_j\) is required to be a (centered and normalized) monotone polynomial function of the image values of variable \(j\). The whole of \(\mathbb{R}^n\) and a single point in \(\mathbb{R}^n\) are both special cases of these normalized cones. It is important, especially for algorithm construction, that the restrictions are defined for each variable separately. An exception to this rule is the orthoblock, using terminology from De Leeuw (2004), which requires that all or some of the columns of \(H\) are not only normalized but also orthogonal to each other. Clearly a normalized variable is an orthoblock of size one.
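
For the example just given, the centering and normalization can be written as a small base R helper (a hypothetical name, not part of the software), mapping any nonconstant transformed column to the unit sphere \(\mathcal{S}\).

normalize <- function(h) {
  h <- h - mean(h)      # center
  h / sqrt(sum(h ^ 2))  # scale to unit length in R^n
}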

3.4 Bases

In earlier MVAOS work, summarized for example in Gifi (1990) or Michailidis and De Leeuw (1998), the basis matrices \(G_j\) were binary zero-one matrices, indicating category membership. The same is true for the software in IBM SPSS Categories (Meulman and Heiser 2012) or in the R package homals (De Leeuw and Mair 2009a). In this book we extend the current MVAOS software using B-spline bases, which provide a form of fuzzy non-binary coding suitable for both categorical and numerical variables (Van Rijckevorsel and De Leeuw 1988). These generalizations were already discussed in De Leeuw, Van Rijckevorsel, and Van der Wouden (1981) and Gifi (1990), but corresponding easily accessible software was never released.

In this book we continue to use indicators for bases. Thus bases \(G_j\) must be non-negative, with rows that add up to one. If there is only one non-zero entry in each row, which of course is then equal to one, the indicator is crisp, otherwise it is fuzzy. B-spline bases are the prime example of fuzzy indicators, but other examples are discussed in Van Rijckevorsel and De Leeuw (1988). Only B-spline bases are implemented in our software, however.
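
Both kinds of indicator can be checked directly with the utilities that appear later in this chapter. Rows of a crisp indicator contain a single one; rows of a B-spline basis are non-negative and still sum to one.

g <- makeIndicator(c(1, 2, 2, 3))            # crisp: one nonzero entry per row
rowSums(g)                                   # all equal to one
b <- bsplineBasis(c(0.1, 0.4, 0.9), 1, 0.5)  # fuzzy: degree one, one interior knot
rowSums(b)                                   # still all equal to one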

Note that the identity matrix is a crisp indicator. This is of importance in connection with missing data and orthoblocks.

3.5 Copies and Rank

Within a block there can be more than one version of the same variable. These multiple versions are called copies. They were first introduced into the Gifi framework by De Leeuw (1984b). Since MVAOS transforms variables, having more than one copy is not redundant, because different copies can and will be transformed differently. As a simple example of copies, think of using different monomials or orthogonal polynomials of a single variable \(x\) in a polynomial regression. The difference between copies and simply including a variable more than once is that copies have the same basis \(G_j\).
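
The analogy can be made concrete in base R with poly(), which builds several orthogonal columns from a single variable (this is only an analogy, not the software's copy mechanism).

x <- seq(0, 1, length.out = 20)
y <- sin(2 * pi * x) + rnorm(20, sd = 0.1)
# each column of poly(x, 3) is a differently transformed version of x,
# playing the role of a copy
fit <- lm(y ~ poly(x, 3))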

In the algorithm, copies of a variable are treated in exactly the same way as other variables. The notion of copies replaces the notion of the rank of a quantification used in traditional Gifi, which in turn generalizes the distinction between single and multiple quantifications. A single variable has only one copy in its block; a multiple variable has the maximum number of copies.

In our software the copies of a variable by definition have the same basis. It is possible, of course, to include the same variable multiple times, but with different bases. This must be done, however, at the input level. In terms of the structures defined in the software, a gifiVariable can have multiple copies but it only has one basis. If there is more than one basis for a variable, then we need to define an additional gifiVariable. Also note that copies of a variable are all in the same block. If you want different versions of a variable in different blocks, then that requires you to create different gifiVariables.

Defining copies is thus basically a coding problem. It can be handled simply by adding a variable multiple times to a data set, and giving each copy the same basis. In our algorithm we use the fact that copies belong to the same variable to create some special shortcuts and handling routines.

Ordinality restrictions on variables with copies require some special attention. In our current implementation we merely require the first copy to be ordinal with the data; the other copies are not restricted. Once again, if you want ordinal restrictions on all copies you need to create separate gifiVariables for each copy.

3.6 Orthoblocks

3.7 Constraints

As discussed earlier, each variable has a cone of transformations associated with it, and we optimize over these transformations. In ALSOS and classical Gifi the three types of transformation cones considered are nominal, ordinal, and numerical. Our use of B-splines generalizes this distinction, because both numerical and nominal transformations can be implemented using splines. What remains is the choice of the degree of the spline and the location of the knots.

Choice of degree and knots is basically up to the user, but the programs have some defaults. In most cases the default is to use crisp indicators with knots at the data points. Of course for truly categorical variables (i.e. for factors in R) crisp indicators are simply constructed by using the levels of the factor. We include some utilities to place knots at percentiles, or equally spaced on the range, or to have no interior knots at all (in which case we fit polynomials).
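
For instance, knots at percentiles or equally spaced knots can be computed with base R before calling bsplineBasis(); the names below are illustrative, and the package's own knot utilities may differ.

x <- sort(runif(50))
knotsP <- quantile(x, c(0.25, 0.50, 0.75))          # knots at percentiles
knotsE <- seq(min(x), max(x), length.out = 5)[2:4]  # equally spaced interior knots
b <- bsplineBasis(x, 2, knotsP)                     # quadratic spline basis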

Finally, the user decides, for each variable, whether she wants the transformations (step functions, splines, and polynomials) to be monotonic with the data. The default is not to require monotonicity.

Note that we only require the spline to be monotonic at the non-missing data points. This does not mean the spline is monotonic outside the range of the data (think, for example, of a quadratic polynomial), and it does not even mean the spline is monotonic between data points. This makes our spline transformations different from the integrated B-splines, or I-splines, used by Winsberg and Ramsay (1983), which are monotone on the whole real line. Because each variable has a finite image we are not really fitting a spline; we are fitting a number of discrete points that are required to lie on a spline, and optionally to be monotonic with the data. In Winsberg and Ramsay (1983) the requirement is that the fitted points lie on an I-spline, which automatically makes them monotonic with the data. Clearly our approach is the less restrictive one.
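
A tiny numerical example of the distinction: a quadratic that is increasing at three data points but decreasing beyond them.

x <- c(0, 1, 2)
y <- c(0, 3, 4)
b <- solve(cbind(1, x, x ^ 2), y)  # exact quadratic through the three points
f <- function(t) b[1] + b[2] * t + b[3] * t ^ 2
f(x)  # 0 3 4: monotonic at the data points
f(3)  # 3: the quadratic already decreases past the data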

3.8 Missing Data

The utility makeMissing() expands the basis for the non-missing data in various ways. Option “m” (for “multiple”) is the default. It replaces the basis with the direct sum of the non-missing basis and an identity matrix for the missing elements. Option “s” (for “single”) adds a single binary column to the basis indicating which elements are missing. Option “a” (for “average”) codes missing data by setting all elements in the rows of the basis corresponding with missing data equal to one over the number of columns. With all three options the basis remains an indicator. Some of these options make most sense in the context of crisp indicators, where they are compared in Meulman (1982).

So suppose the data x are

##      [,1] 
## [1,] -0.50
## [2,]    NA
## [3,]  0.75
## [4,]  0.99
## [5,]    NA

Create a basis for the non-missing values with

mprint (basis <- bsplineBasis (x[!is.na(x)], 1, c(-1, 0, 1)))
##      [,1]  [,2]  [,3] 
## [1,]  0.50  0.50  0.00
## [2,]  0.00  0.25  0.75
## [3,]  0.00  0.01  0.99

The three different completion options for missing data give

mprint (makeMissing (x, basis, missing = "m"))
##      [,1]  [,2]  [,3]  [,4]  [,5] 
## [1,]  0.50  0.50  0.00  0.00  0.00
## [2,]  0.00  0.00  0.00  1.00  0.00
## [3,]  0.00  0.25  0.75  0.00  0.00
## [4,]  0.00  0.01  0.99  0.00  0.00
## [5,]  0.00  0.00  0.00  0.00  1.00
mprint (makeMissing (x, basis, missing = "s"))
##      [,1]  [,2]  [,3]  [,4] 
## [1,]  0.50  0.50  0.00  0.00
## [2,]  0.00  0.00  0.00  1.00
## [3,]  0.00  0.25  0.75  0.00
## [4,]  0.00  0.01  0.99  0.00
## [5,]  0.00  0.00  0.00  1.00
mprint (makeMissing (x, basis, missing = "a"))
##      [,1]  [,2]  [,3] 
## [1,]  0.50  0.50  0.00
## [2,]  0.33  0.33  0.33
## [3,]  0.00  0.25  0.75
## [4,]  0.00  0.01  0.99
## [5,]  0.33  0.33  0.33
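
The three options can be summarized in a small sketch that reproduces the behavior shown above; the packaged makeMissing() may differ in its details.

makeMissingSketch <- function(x, basis, missing = "m") {
  miss <- is.na(x)
  n <- length(x)
  k <- ncol(basis)
  switch(missing,
    m = {  # direct sum: one extra crisp column per missing observation
      g <- matrix(0, n, k + sum(miss))
      g[!miss, 1:k] <- basis
      g[cbind(which(miss), k + seq_len(sum(miss)))] <- 1
      g
    },
    s = {  # a single extra binary column flags the missing rows
      g <- matrix(0, n, k + 1)
      g[!miss, 1:k] <- basis
      g[miss, k + 1] <- 1
      g
    },
    a = {  # missing rows get the constant profile 1/k, so rows sum to one
      g <- matrix(1 / k, n, k)
      g[!miss, ] <- basis
      g
    }
  )
}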

The default option for missing data in the previous version of the Gifi system was “missing data deleted”, which involves weighting the rows in the loss functions by the number of non-missing data in that row. This leads to some complications, and consequently we have no option “d” in this version of Gifi.

3.9 Active and Passive Variables

If a variable is passive (or supplementary) it is incorporated in the analysis, but it does not contribute to the loss. Thus an analysis that leaves the passive variables out will give the same results for the active variables. Passive variables are transformed like all the others, but they do not contribute to the block scores, and thus not to the loss. They have category quantifications and scores, and can be used in the corresponding plots.

If all variables in a block are passive, then the whole block does not contribute to the loss. This happens specifically for singletons, if the single variable in the block is passive.

3.10 Interactive Coding

One of the major contributions of Analyse des Données is the emphasis on coding, which in our context can be defined as choosing how to represent the raw data of an experiment in an actual data frame (and, to a lesser extent, how to choose blocks, numbers of copies, dimensionality, degrees, and knots). In this section we discuss one important coding variation. Suppose we have \(n\) observations on two factors, one with \(p\) levels and one with \(q\) levels. Then the data can be coded as \(n\) observations on one factor with \(p\times q\) levels, and we can construct a corresponding crisp indicator. The same reasoning applies to more than two categorical variables, which we can always code interactively. It also applies to bases for numerical variables, where we can define an interactive basis by using products of columns from the bases of each of the variables.
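
For two factors, interactive coding is exactly what base R's interaction() produces.

f1 <- factor(c("a", "a", "b", "b"))
f2 <- factor(c("x", "y", "x", "y"))
f12 <- interaction(f1, f2)      # one factor with p * q = 4 levels
makeIndicator(as.integer(f12))  # its crisp indicator codes the pairs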

If \(G=\{g_{is}\}\) and \(H=\{h_{it}\}\) are two indicators of dimensions \(n\times m_g\) and \(n\times m_h\), then the \(n\times m_gm_h\) matrix with elements \(\{g_{is}h_{it}\}\) is again an indicator: the elements are non-negative, and each row adds up to one, because \(\sum_s\sum_t g_{is}h_{it}=\left(\sum_s g_{is}\right)\left(\sum_t h_{it}\right)=1\times 1=1\).

mprint (x <- bsplineBasis (1:9/10, 1, .5))
##       [,1]  [,2]  [,3] 
##  [1,]  1.00  0.00  0.00
##  [2,]  0.75  0.25  0.00
##  [3,]  0.50  0.50  0.00
##  [4,]  0.25  0.75  0.00
##  [5,]  0.00  1.00  0.00
##  [6,]  0.00  0.75  0.25
##  [7,]  0.00  0.50  0.50
##  [8,]  0.00  0.25  0.75
##  [9,]  0.00  0.00  1.00
mprint (y <- makeIndicator (c (rep (1, 5), rep (2, 4))))
##       [,1]  [,2] 
##  [1,]  1.00  0.00
##  [2,]  1.00  0.00
##  [3,]  1.00  0.00
##  [4,]  1.00  0.00
##  [5,]  1.00  0.00
##  [6,]  0.00  1.00
##  [7,]  0.00  1.00
##  [8,]  0.00  1.00
##  [9,]  0.00  1.00
mprint (makeColumnProduct (list (x, y)))
##       [,1]  [,2]  [,3]  [,4]  [,5]  [,6] 
##  [1,]  1.00  0.00  0.00  0.00  0.00  0.00
##  [2,]  0.75  0.00  0.25  0.00  0.00  0.00
##  [3,]  0.50  0.00  0.50  0.00  0.00  0.00
##  [4,]  0.25  0.00  0.75  0.00  0.00  0.00
##  [5,]  0.00  0.00  1.00  0.00  0.00  0.00
##  [6,]  0.00  0.00  0.00  0.75  0.00  0.25
##  [7,]  0.00  0.00  0.00  0.50  0.00  0.50
##  [8,]  0.00  0.00  0.00  0.25  0.00  0.75
##  [9,]  0.00  0.00  0.00  0.00  0.00  1.00
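
The column-wise product used here is easy to sketch for two matrices, matching the column ordering in the output above; the packaged makeColumnProduct() takes a list of bases and may be implemented differently.

columnProductSketch <- function(g, h) {
  out <- matrix(0, nrow(g), ncol(g) * ncol(h))
  k <- 0
  for (s in seq_len(ncol(g))) {    # column s of g ...
    for (t in seq_len(ncol(h))) {  # ... times column t of h
      k <- k + 1
      out[, k] <- g[, s] * h[, t]
    }
  }
  out
}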

References

Holland, P.W. 1979. “The Tyranny of Continuous Models in a World of Discrete Data.” IHS-Journal 3: 29–42.

De Leeuw, J. 2004. “Least Squares Optimal Scaling of Partially Observed Linear Systems.” In Recent Developments in Structural Equation Models, edited by K. van Montfort, J. Oud, and A. Satorra. Dordrecht, Netherlands: Kluwer Academic Publishers. http://www.stat.ucla.edu/~deleeuw/janspubs/2004/chapters/deleeuw_C_04a.pdf.

Gifi, A. 1990. Nonlinear Multivariate Analysis. New York, N.Y.: Wiley.

Michailidis, G., and J. De Leeuw. 1998. “The Gifi System for Descriptive Multivariate Analysis.” Statistical Science 13: 307–36. http://www.stat.ucla.edu/~deleeuw/janspubs/1998/articles/michailidis_deleeuw_A_98.pdf.

Meulman, J.J., and W.J. Heiser. 2012. IBM SPSS Categories 21. IBM Corporation.

De Leeuw, J., and P. Mair. 2009a. “Homogeneity Analysis in R: the Package homals.” Journal of Statistical Software 31 (4): 1–21. http://www.stat.ucla.edu/~deleeuw/janspubs/2009/articles/deleeuw_mair_A_09a.pdf.

Van Rijckevorsel, J.L.A., and J. De Leeuw, eds. 1988. Component and Correspondence Analysis. Wiley. http://www.stat.ucla.edu/~deleeuw/janspubs/1988/books/vanrijckevorsel_deleeuw_B_88.pdf.

De Leeuw, J., J.L.A. Van Rijckevorsel, and H. Van der Wouden. 1981. “Nonlinear Principal Component Analysis Using B-Splines.” Methods of Operations Research 43: 379–94. http://www.stat.ucla.edu/~deleeuw/janspubs/1981/articles/deleeuw_vanrijckevorsel_vanderwouden_A_81.pdf.

De Leeuw, J. 1984b. “The Gifi System of Nonlinear Multivariate Analysis.” In Data Analysis and Informatics, edited by E. Diday et al. Vol. III. Amsterdam: North Holland Publishing Company. http://www.stat.ucla.edu/~deleeuw/janspubs/1984/chapters/deleeuw_C_84c.pdf.

Winsberg, S., and J. O. Ramsay. 1980. “Monotone Transformations to Additivity.” Biometrika 67: 669–74.

Winsberg, S., and J. O. Ramsay. 1983. “Monotone Spline Transformations for Dimension Reduction.” Psychometrika 48: 575–95.

Meulman, J.J. 1982. Homogeneity Analysis of Incomplete Data. Leiden, The Netherlands: DSWO Press.