4.1 What is MCA?
Definition and Purpose:
Correspondence analysis (CA) is a multivariate analysis method that is built upon PCA. CA works best when used to analyze nominal or qualitative data (as opposed to quantitative data). CA takes a contingency table as its input. Besides being able to compare rows to rows and columns to columns, CA is especially powerful in comparing rows and columns directly with each other. Where PCA uses ordinary Euclidean distance (Pythagorean distance to the best-fit line) to find inertia, CA uses chi-square distance to measure the distances between data points.
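To make the chi-square distance concrete, here is a minimal sketch in Python with NumPy. The function name and the toy table are assumptions made for illustration only. Each row is first turned into a profile (its counts divided by the row total), and the squared differences between two profiles are weighted by the inverse of the column masses.

```python
import numpy as np

def chi_square_distance(table, i, k):
    """Chi-square distance between the profiles of rows i and k."""
    N = np.asarray(table, dtype=float)
    col_mass = N.sum(axis=0) / N.sum()            # column masses
    profiles = N / N.sum(axis=1, keepdims=True)   # row profiles (each sums to 1)
    diff = profiles[i] - profiles[k]
    # Each squared difference is weighted by 1 / column mass
    return float(np.sqrt(np.sum(diff ** 2 / col_mass)))

# Hypothetical 3 x 2 table: rows 0 and 1 have proportional counts
toy_table = [[10, 20],
             [5, 10],
             [30, 10]]

d01 = chi_square_distance(toy_table, 0, 1)
d02 = chi_square_distance(toy_table, 0, 2)
```

Rows 0 and 1 share the same profile even though their totals differ, so their chi-square distance is zero, whereas row 2 lies away from both.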
A contingency table is a two-way table that summarizes the relationship between two categorical variables. It is a special type of frequency distribution table in which the two variables are shown simultaneously (StatisticHowTo).
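As a small illustration, a contingency table can be built by counting how often each pair of levels occurs together. The sketch below uses only the Python standard library; the variables (eye and hair colour) and the helper name are hypothetical and not taken from the report's dataset.

```python
from collections import Counter

# Hypothetical toy data: one (eye colour, hair colour) pair per observation
eye = ["blue", "brown", "brown", "blue", "green", "brown"]
hair = ["fair", "dark", "dark", "dark", "fair", "fair"]

def contingency_table(rows, cols):
    """Cross-tabulate two equal-length categorical sequences."""
    if len(rows) != len(cols):
        raise ValueError("both variables need one value per observation")
    row_levels = sorted(set(rows))
    col_levels = sorted(set(cols))
    counts = Counter(zip(rows, cols))   # missing pairs count as 0
    table = [[counts[(r, c)] for c in col_levels] for r in row_levels]
    return row_levels, col_levels, table

row_levels, col_levels, table = contingency_table(eye, hair)
# Rows follow row_levels (blue, brown, green); columns follow col_levels (dark, fair)
```

Every observation falls into exactly one cell, so the cell counts add up to the number of observations.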
General Processes of MCA:
• First, follow the PCA procedure to find the new components and project the data onto them.
• Second, CA requires us to calculate the chi-square distance between the data points. This accomplishes two major goals:
(1) The variance between rows and columns will be similar. This allows both to be mapped together.
(2) Create the factor scores for both rows and columns.
• Third, CA finds inertia by using the generalized SVD to decompose the chi-square X^2 into orthogonal components. Important concepts: row/column profile, profile matrix, mass, weight.
• Fourth, CA projects the row and column factor scores onto the new plane in either symmetric or asymmetric plots.
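The steps above can be sketched in a few lines of NumPy. This is a minimal sketch rather than a full implementation: instead of calling a generalized SVD routine, it applies a plain SVD to the matrix of standardized residuals, which is a common equivalent formulation. The function name and the toy table are assumptions for illustration.

```python
import numpy as np

def correspondence_analysis(table):
    """Simple CA: eigenvalues plus row and column factor scores."""
    N = np.asarray(table, dtype=float)
    P = N / N.sum()                 # correspondence matrix
    r = P.sum(axis=1)               # row masses
    c = P.sum(axis=0)               # column masses (weights are their inverses)
    # Standardized residuals: deviations from independence, scaled by the masses
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, s, Vt = np.linalg.svd(S, full_matrices=False)
    eigenvalues = s ** 2            # inertia carried by each component
    F = (U * s) / np.sqrt(r)[:, None]      # row factor scores
    G = (Vt.T * s) / np.sqrt(c)[:, None]   # column factor scores
    return eigenvalues, F, G

# Hypothetical 3 x 3 contingency table
table = [[20, 10, 5],
         [10, 15, 10],
         [5, 10, 25]]

eigenvalues, F, G = correspondence_analysis(table)
total_inertia = eigenvalues.sum()   # equals chi-square statistic / grand total
```

Plotting the first two columns of F and G on the same axes gives the symmetric map. The total inertia equals the chi-square statistic of the table divided by the grand total, which is how the decomposition connects back to X^2.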
Other important aids:
• Scree plot: describes how much inertia is explained by each component. Note that this is not a “tell-all, be-all” plot for deciding which components are important. We should use the scree plot as an exploratory tool only.
• Permutation test for eigenvalues: tests whether the eigenvalues are reliable.
• Correlation plot: describes the linear relationships between variables (use when applicable). This is another exploratory tool we should use before doing CA. Note that in CA, we use the inertia to plot correlation.
• Contribution: describes the effect strength (compared to the average contribution) of each variable and observation. It is usually shown in bar plots along with bootstrap ratios.
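The scree values and the permutation test can be sketched as follows. This is one simple permutation scheme, assumed here for illustration: the labels of one variable are shuffled to break the association, and the first eigenvalue is recomputed each time. The data, helper names, seed, and number of permutations are all hypothetical.

```python
import numpy as np

def ca_eigenvalues(table):
    """Eigenvalues (inertia per component) of a simple CA."""
    P = table / table.sum()
    r, c = P.sum(axis=1), P.sum(axis=0)
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    return np.linalg.svd(S, compute_uv=False) ** 2

def crosstab(x, y, n_rows, n_cols):
    """Contingency table from two integer-coded variables."""
    table = np.zeros((n_rows, n_cols))
    np.add.at(table, (x, y), 1)
    return table

# Hypothetical data: y agrees with x for 80% of the 90 observations
x = np.repeat([0, 1, 2], 30)
y = x.copy()
y[::5] = (y[::5] + 1) % 3

observed = ca_eigenvalues(crosstab(x, y, 3, 3))
percent_inertia = 100 * observed / observed.sum()   # heights of the scree plot

rng = np.random.default_rng(0)
n_perm = 199
exceed = 0
for _ in range(n_perm):
    # Shuffling y breaks the association but keeps both margins fixed
    permuted = ca_eigenvalues(crosstab(x, rng.permutation(y), 3, 3))
    if permuted[0] >= observed[0]:
        exceed += 1
p_value = (exceed + 1) / (n_perm + 1)
```

A small p-value suggests that the first eigenvalue is larger than what random pairing of the two variables would produce. The same loop can be repeated for later eigenvalues.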
MCA is built on the foundation of PCA and, mainly, CA: it combines the mathematical maneuvering of PCA with the ideas of binarizing data and of mass/weight from CA. Thus, MCA can be used to analyze datasets with multiple categorical variables. MCA requires level-coding of the dataset (if not already done) and relies on chi-square distance to perform its analysis. However, MCA tends to overestimate the eigenvalues compared to CA. In the report that follows, I also hope readers will see that level-coding (0 and 1) of originally qualitative datasets can be awkward, especially during the binning process when the bins do not contain the same number of data points.
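The level-coding step can be illustrated with a short NumPy sketch. The survey variables below are hypothetical, and the helper name is an assumption. Each categorical variable becomes a block of 0/1 columns, one per level, and running the CA decomposition on the resulting indicator matrix is one common way to carry out MCA.

```python
import numpy as np

def indicator_matrix(columns):
    """Stack one 0/1 block per categorical variable (level-coding)."""
    blocks, labels = [], []
    for name, values in columns.items():
        values = np.asarray(values)
        levels = np.unique(values)   # sorted distinct levels
        blocks.append((values[:, None] == levels[None, :]).astype(float))
        labels += [f"{name}.{level}" for level in levels]
    return np.hstack(blocks), labels

# Hypothetical survey with K = 2 variables and J = 3 + 2 = 5 levels
data = {
    "age": ["young", "old", "mid", "young", "old", "mid"],
    "smoker": ["yes", "no", "no", "yes", "no", "yes"],
}
Z, labels = indicator_matrix(data)

# CA of the indicator matrix: SVD of the standardized residuals
P = Z / Z.sum()
r, c = P.sum(axis=1), P.sum(axis=0)
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
eigenvalues = np.linalg.svd(S, compute_uv=False) ** 2
total_inertia = eigenvalues.sum()
```

Each row of Z contains exactly one 1 per variable, so every row sums to K. For an indicator matrix the total inertia depends only on the coding, (J - K) / K, and not on how strongly the variables are related, which is one reason raw MCA eigenvalues need careful interpretation.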
Let’s do data analysis using MCA!