8.1 What is CA?
Definition and Purpose:
Correspondence analysis (CA) is a multivariate analysis method that is built upon PCA. CA works best when used to analyze nominal or qualitative data (as opposed to quantitative data). CA takes a contingency table as its input. Besides being able to compare rows to rows and columns to columns, CA is especially powerful at comparing rows and columns directly with each other. Instead of using the ordinary Euclidean distance (Pythagorean distance to the best-fit line) to find Inertia as PCA does, CA uses the chi-square distance to measure the distances between data points.
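To make the contrast with Euclidean distance concrete, here is a minimal Python sketch of the chi-square distance between two row profiles. The table values and the function names are hypothetical and chosen only for illustration. The distance is the usual form in which each squared difference between profiles is divided by the corresponding column mass.

```python
import math

# Hypothetical 3x3 contingency table (illustrative counts only).
table = [
    [20, 10, 5],   # row A
    [10, 20, 5],   # row B
    [40, 20, 10],  # row C: twice row A, so it has the same profile as A
]

def row_profiles(t):
    """Divide each row by its total so that every row sums to 1."""
    return [[x / sum(row) for x in row] for row in t]

def column_masses(t):
    """Share of the grand total that falls in each column."""
    total = sum(sum(row) for row in t)
    return [sum(row[j] for row in t) / total for j in range(len(t[0]))]

def chi_square_distance(p, q, masses):
    """Euclidean distance between profiles, weighted by 1 / column mass."""
    return math.sqrt(sum((a - b) ** 2 / m for a, b, m in zip(p, q, masses)))

def euclidean_distance(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

profiles = row_profiles(table)
masses = column_masses(table)

d_ab = chi_square_distance(profiles[0], profiles[1], masses)
d_ac = chi_square_distance(profiles[0], profiles[2], masses)
e_ab = euclidean_distance(profiles[0], profiles[1])
```

Rows A and C have different counts but identical profiles, so their chi-square distance is zero (up to floating-point rounding): the comparison is made on profiles, not on raw frequencies. Dividing by the column masses also gives differences in rare columns more weight than the plain Euclidean distance would.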
A contingency table is a two-way table that summarizes the relationship between categorical variables. It is a special type of frequency distribution table in which two variables are shown simultaneously. (StatisticHowTo)
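As a small illustration, a contingency table can be built by counting how often each pair of categories occurs together. The observations below (hair colour and eye colour) are made up for the example, and `contingency_table` is a hypothetical helper, not a library function.

```python
from collections import Counter

# Hypothetical raw data: one (hair colour, eye colour) pair per person.
observations = [
    ("blond", "blue"), ("blond", "blue"), ("blond", "brown"),
    ("dark", "brown"), ("dark", "brown"), ("dark", "blue"),
    ("red", "green"),
]

def contingency_table(pairs):
    """Cross-tabulate two categorical variables into a table of counts."""
    counts = Counter(pairs)
    row_labels = sorted({r for r, _ in pairs})
    col_labels = sorted({c for _, c in pairs})
    table = [[counts[(r, c)] for c in col_labels] for r in row_labels]
    return row_labels, col_labels, table

row_labels, col_labels, table = contingency_table(observations)

# Print the table with its row and column labels.
print("".ljust(8) + "".join(c.ljust(8) for c in col_labels))
for label, row in zip(row_labels, table):
    print(label.ljust(8) + "".join(str(x).ljust(8) for x in row))
```

Each cell holds the number of observations that share one row category and one column category, which is the kind of table CA takes as input.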
General Processes of CA:
• First, follow PCA procedures to find the new components and project onto them.
• Second, CA requires us to calculate the chi-square distance between the data points. This accomplishes 2 major goals:
(1) The variance of the rows and of the columns will be similar. This allows both to be mapped together.
(2) It creates the factor scores for both rows and columns.
• Third, CA finds Inertia by using the Generalized SVD to decompose the chi-square (X^2) into orthogonal components. Important concepts: row/col profile, profile matrix, Mass, Weight.
• Fourth, CA projects the row and col Factor Scores onto a new plane in either Symmetric or Asymmetric plots.
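The steps above can be sketched in a few lines of NumPy. This is one common formulation (a plain SVD of the matrix of standardized residuals, which is equivalent to a generalized SVD with masses and weights as constraints), not necessarily the exact procedure used in this course. The table and the function name `correspondence_analysis` are hypothetical.

```python
import numpy as np

# Hypothetical 4x3 contingency table, for illustration only.
N = np.array([[20, 10, 5],
              [10, 20, 5],
              [5, 10, 25],
              [15, 15, 10]], dtype=float)

def correspondence_analysis(N):
    """Minimal CA sketch: masses, standardized residuals, SVD, factor scores."""
    N = np.asarray(N, dtype=float)
    P = N / N.sum()                    # correspondence matrix (relative frequencies)
    r = P.sum(axis=1)                  # row masses
    c = P.sum(axis=0)                  # column masses
    # Deviations from independence, scaled so that squared entries sum to X^2 / n.
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)
    k = min(N.shape) - 1               # number of non-trivial components
    U, sv, V = U[:, :k], sv[:k], Vt.T[:, :k]
    eigenvalues = sv ** 2              # inertia carried by each component
    row_scores = (U * sv) / np.sqrt(r)[:, None]   # row factor scores
    col_scores = (V * sv) / np.sqrt(c)[:, None]   # column factor scores
    return {
        "row_masses": r,
        "col_masses": c,
        "eigenvalues": eigenvalues,
        "total_inertia": eigenvalues.sum(),
        "row_scores": row_scores,
        "col_scores": col_scores,
    }

result = correspondence_analysis(N)
print("eigenvalues:", result["eigenvalues"])
print("total inertia:", result["total_inertia"])
```

In this formulation the total inertia equals the chi-square statistic of the table divided by the grand total, which is why CA can be read as a decomposition of the chi-square into orthogonal components.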
CA Symmetric and Asymmetric maps and their interpretations:
Generally, CA looks at the proximity of data points (after projection) to determine how similar their profiles are. This intuition mostly holds when comparing rows to rows or columns to columns, but not when comparing rows to columns. For that comparison we need to use asymmetric plots and biplots, where either the row or the column Factor Scores are normalized. Important concept: Simplex
Asymmetric plot:
• Either the row or the column factor scores are normalized.
• Is a simplex.
• Can be interpreted all ways round (col-col, row-row, col-row).
Symmetric plot:
• Both row and column factor scores are normalized.
• Is NOT a simplex.
• Cannot be interpreted col-row.
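The difference between the two maps can be sketched numerically. The sketch below assumes one common convention, which may differ from the normalization used in this course: in the asymmetric map the rows are in principal coordinates (variance equal to the eigenvalue) and the columns are in standard coordinates (unit variance), while in the symmetric map both sets are in principal coordinates. The table and all variable names are hypothetical.

```python
import numpy as np

# Hypothetical 4x3 contingency table, for illustration only.
N = np.array([[20, 10, 5],
              [10, 20, 5],
              [5, 10, 25],
              [15, 15, 10]], dtype=float)

P = N / N.sum()
r, c = P.sum(axis=1), P.sum(axis=0)          # row and column masses
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
U, sv, Vt = np.linalg.svd(S, full_matrices=False)

k = min(N.shape) - 1                          # keep the non-trivial dimensions
U, sv, V = U[:, :k], sv[:k], Vt.T[:, :k]

# Standard coordinates: weighted variance of 1 on every dimension.
row_standard = U / np.sqrt(r)[:, None]
col_standard = V / np.sqrt(c)[:, None]
# Principal coordinates: weighted variance equal to the eigenvalue.
row_principal = row_standard * sv
col_principal = col_standard * sv

# Asymmetric map (assumed convention): rows principal, columns standard.
asymmetric_map = (row_principal, col_standard)
# Symmetric map (assumed convention): rows and columns both principal.
symmetric_map = (row_principal, col_principal)

# Simplex property: in the asymmetric map every row point is the weighted
# average of the column points, with the row profile as the weights.
row_profiles = P / r[:, None]
barycenters = row_profiles @ col_standard
print(np.allclose(barycenters, row_principal))
```

Because each row sits at the profile-weighted average of the column points in the asymmetric map, a row lying close to a column can be read as that row being strongly associated with that column. The same identity does not hold when the columns are in principal coordinates, which is why row-to-column distances are not interpreted directly in the symmetric map.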
Other important aids:
• Scree plot: describes how much inertia is explained by each component. Note that this is not a be-all and end-all plot for deciding which components are important. We should use the scree plot as an exploratory tool only.
• Permutation test for eigenvalues: tests whether the eigenvalues are reliable.
• Correlation Plot: describes the linear relationships between variables (use when applicable). This is another exploratory tool we should use before doing CA. Note that in CA, we use the Inertia to plot correlation.
• Contribution: describes the effect strength (compared to the average contribution) of each variable and observation. It is usually shown in bar plots along with Bootstrap Ratios.
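Two of these aids, the scree plot values and the permutation test for eigenvalues, can be sketched as follows. This is a minimal illustration under stated assumptions: the table is hypothetical, the function names are made up for this example, and the permutation scheme (shuffling the column label of each individual observation, which keeps the row and column totals fixed) is one reasonable choice rather than the only one.

```python
import numpy as np

def ca_eigenvalues(N):
    """Eigenvalues (inertia per component) of a contingency table."""
    N = np.asarray(N, dtype=float)
    P = N / N.sum()
    r, c = P.sum(axis=1), P.sum(axis=0)
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    k = min(N.shape) - 1                       # non-trivial components only
    return np.linalg.svd(S, compute_uv=False)[:k] ** 2

def permutation_test(N, n_perm=200, seed=0):
    """Compare observed eigenvalues with those of label-shuffled tables."""
    N = np.asarray(N, dtype=int)
    rng = np.random.default_rng(seed)
    n_rows, n_cols = N.shape
    # Expand the table into one (row label, column label) pair per observation.
    rows = np.repeat(np.arange(n_rows), N.sum(axis=1))
    cols = np.concatenate([np.repeat(np.arange(n_cols), N[i]) for i in range(n_rows)])
    observed = ca_eigenvalues(N)
    exceed = np.zeros_like(observed)
    for _ in range(n_perm):
        shuffled = rng.permutation(cols)       # break the row-column association
        T = np.zeros((n_rows, n_cols))
        np.add.at(T, (rows, shuffled), 1)      # re-tabulate the shuffled data
        exceed += ca_eigenvalues(T) >= observed
    p_values = (exceed + 1) / (n_perm + 1)
    return observed, p_values

# Hypothetical table with a clear row-column association.
N = np.array([[30, 5, 5],
              [5, 25, 10],
              [5, 10, 25]])

observed, p_values = permutation_test(N)
explained = observed / observed.sum()          # heights of the scree plot bars
print("share of inertia per component:", explained)
print("permutation p-values:", p_values)
```

The `explained` values are what a scree plot displays. A small p-value means the observed eigenvalue is larger than almost all eigenvalues obtained after the association between rows and columns has been destroyed by shuffling.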
CA is an elegant iteration of PCA that is most useful for analyzing nominal datasets. The intuition and techniques behind CA have their roots in PCA. One point of note when using CA is deciding whether to use a Symmetric or an Asymmetric plot to portray the relationships (both inter and intra) between rows and columns. Overall, I see CA as a great exploratory technique that emphasizes differentiating both rows and columns at the same time. It is useful for seeing what hidden effects and relationships exist between observations and variables. Using the simplex, it is also very good at summarizing the data and explaining topical effects by reducing data that exist in multiple dimensions to 2-3 important dimensions.
Let’s do data analysis using CA!