Chapter 3 Correspondence Analysis

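Correspondence analysis (CA) is a generalization of PCA for two qualitative variables whose joint distribution is represented by a contingency table. The levels of one variable form the rows and the levels of the other form the columns, and each cell of the table counts the observations that fall in that row and column.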

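Let X be the I x J contingency table and let 1 be a vector of ones of conformable length, so that t(1) X 1 is the grand total of the table, X 1 holds the row totals, and t(1) X holds the column totals.

The barycenter c (the average row profile) is found by multiplying the inverse of the grand total of the contingency table by the column totals:

                    c = (t(1) X 1)^-1 * t(1) X

In terms of dimensions this is ((1xI)(IxJ)(Jx1))^-1 * (1xI)(IxJ), a 1xJ vector whose elements sum to 1.

Before the decomposition, each row of the contingency table is divided by its total to give a row profile, and the profiles are centered on the barycenter. The rows are then weighted by their masses and the columns by their weights. The masses are obtained by multiplying the inverse of the grand total by the row totals:

                    M = diag((t(1) X 1)^-1 * X 1)

The weights are the inverses of the column components of the barycenter:

                    W = diag(c)^-1 = diag((t(1) X 1)^-1 * t(1) X)^-1

Finally, the GSVD is applied to the centered row profiles under the constraints:

                    t(P)MP = t(Q)WQ = I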
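
The quantities above can be checked by hand on a small table. The sketch below is plain Python rather than the R code used for the analysis, the 3 x 3 table of counts is hypothetical, and the variable names are illustrative. It assumes that the GSVD is applied to the row profiles centered on the barycenter, and it computes the masses, the barycenter, the weights, and the total inertia that the decomposition splits across dimensions.

```python
# Toy 3 x 3 contingency table (hypothetical counts, NOT the beer data).
X = [
    [10, 20, 30],
    [20, 20, 20],
    [30, 20, 10],
]

n_rows, n_cols = len(X), len(X[0])

grand_total = sum(sum(row) for row in X)                        # t(1) X 1
row_totals = [sum(row) for row in X]                            # X 1
col_totals = [sum(row[j] for row in X) for j in range(n_cols)]  # t(1) X

masses = [r / grand_total for r in row_totals]      # diagonal of M
barycenter = [c / grand_total for c in col_totals]  # c, the average row profile
weights = [1 / c for c in barycenter]               # diagonal of W

# Row profiles (each row divided by its total), centered on the barycenter.
centered_profiles = [
    [X[i][j] / row_totals[i] - barycenter[j] for j in range(n_cols)]
    for i in range(n_rows)
]

# Total inertia: sum of squared centered profiles, scaled by the masses
# (rows) and the weights (columns).
inertia = sum(
    masses[i] * weights[j] * centered_profiles[i][j] ** 2
    for i in range(n_rows)
    for j in range(n_cols)
)

# Chi-square statistic of independence, for comparison with the inertia.
chi_square = sum(
    (X[i][j] - row_totals[i] * col_totals[j] / grand_total) ** 2
    / (row_totals[i] * col_totals[j] / grand_total)
    for i in range(n_rows)
    for j in range(n_cols)
)

print("masses    :", masses)
print("barycenter:", barycenter)
print("weights   :", weights)
print("inertia   :", inertia)
print("chi2 / N  :", chi_square / grand_total)
```

In this toy table every row total and column total is 60, so each mass and each barycenter component is 1/3. The total inertia equals the chi-square statistic divided by the grand total, which is why CA is often described as a decomposition of the chi-square of independence.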

3.1 Data set: Beer

The data set is a contingency table of 9 beers (rows) by 30 beer characteristics (columns); 14 of the columns are displayed below.

          bitter complement goodcolor picon good pot clear disappointing refreshing golden queasy aperitif bland floral
Alken          3          5        21     3   17   3    26             9         15     17      2        8    20      3
Bavik          2          5        19     3   21   7    18             7         17     17      5        8     6     13
Bock           5          4         9     1   11   1    33            18         11     10      9        3     8      3
Emelisse      50          2        15     5    3   6     2            25          2     15     17        4     7      3
Jupiler        6          2         8     4   12   1    34            18         18     13      1        2    26      4
Moor          41          3         6     1    2   3    16            21          8      6     26        4     6      5
Piedboeuf      1          7        26     1   12   3     1            14         11     28      2        1    23      4
Ridder        20          3         5     6    6   3    16            17          7      6     20        1    14      3
Simcoe        36          3        17     1    9   4     3            15          6     13     14        5     7      3

3.3 Analysis

3.3.1 Symmetric


3.3.3 Scree Plot

Although the permutation test finds 5 dimensions reliable, only 2 have eigenvalues above the Kaiser line.
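
The comparison against the Kaiser line can be sketched in a few lines. The example below is plain Python, not the R code behind the scree plot, and it assumes that the Kaiser line is drawn at the average eigenvalue. The eigenvalues are hypothetical and are not the values from the beer analysis.

```python
# Hypothetical eigenvalues for illustration only (NOT the beer results).
eigenvalues = [0.30, 0.12, 0.04, 0.03, 0.02, 0.01]

total_inertia = sum(eigenvalues)

# Assumption: the Kaiser line is the average eigenvalue.
kaiser_line = total_inertia / len(eigenvalues)

# Percentage of the total inertia explained by each dimension.
percent_inertia = [100 * ev / total_inertia for ev in eigenvalues]

# Dimensions (numbered from 1) whose eigenvalue exceeds the Kaiser line.
above_kaiser = [k for k, ev in enumerate(eigenvalues, start=1) if ev > kaiser_line]

for k, (ev, pct) in enumerate(zip(eigenvalues, percent_inertia), start=1):
    position = "above" if ev > kaiser_line else "below"
    print(f"Dimension {k}: eigenvalue {ev:.2f}, {pct:.1f}% of inertia, {position} the Kaiser line")

print("Dimensions above the Kaiser line:", above_kaiser)
```

With these made-up values only the first two dimensions exceed the average eigenvalue, which mirrors the situation described above: a permutation test and the Kaiser criterion answer different questions and need not agree on the number of dimensions to keep.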

3.3.6 Symmetric Map

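In a symmetric map the row and column factor scores are scaled in the same way (on each dimension both sets have a variance equal to the eigenvalue), but the rows and the columns are represented in different spaces. Therefore, distances can only be compared rows to rows and columns to columns. When the rows and columns are interpreted together, each point needs to be read against the mean (the origin) rather than by its distance to a point of the other set.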

3.3.7 Factor Scores with Symmetric Map

3.3.8 Contributions and bootstrap ratios barplots

3.3.9 Bootstrap Ratios

The bootstrap ratios support the pattern shown by the contributions.
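
As a reminder of what a bootstrap ratio measures, the sketch below computes one in plain Python. It assumes the usual definition (mean of the bootstrapped factor scores divided by their standard deviation) and the common rule of thumb that a magnitude above 2 indicates a reliable score. The bootstrapped values and the function name `bootstrap_ratio` are hypothetical; they are not output from the beer analysis.

```python
from statistics import mean, stdev

# Hypothetical bootstrapped factor scores on one dimension, one list of
# resampled values per column (NOT results from the beer data).
bootstrapped_scores = {
    "bitter": [0.82, 0.75, 0.90, 0.79, 0.85],
    "floral": [0.05, -0.10, 0.12, -0.03, 0.01],
}

# Assumption: |ratio| > 2 is treated as reliable (rule of thumb).
THRESHOLD = 2.0


def bootstrap_ratio(samples):
    """Mean of the bootstrapped scores divided by their standard deviation."""
    return mean(samples) / stdev(samples)


ratios = {name: bootstrap_ratio(s) for name, s in bootstrapped_scores.items()}

for name, ratio in ratios.items():
    verdict = "reliable" if abs(ratio) > THRESHOLD else "not reliable"
    print(f"{name}: bootstrap ratio = {ratio:.2f} ({verdict})")
```

A column whose scores stay far from zero across resamples gets a large ratio, while a column whose scores hover around zero gets a small one. A real analysis would use many more resamples than the five shown here.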

3.4 Summary

When we interpret the factor scores and loadings together, the CA reveals the following.

Do you prefer the symmetric or the asymmetric plot for your data? Symmetric, because the effect size, as measured by the eigenvalue of each dimension, is small.

Component 1

Rows: Hoppy vs Sweet

Cols: Bad taste vs Good taste

Interpret: Hoppy beers are associated with bad-taste descriptors and sweet beers with good-taste descriptors

Component 2

Rows: Sweet vs Bland

Cols: Physical characteristics

Interpret: Sweet beers are associated with more pronounced physical characteristics than bland beers