Chapter 8 Correspondence Analysis

Data table: CA is used to analyze a single data table of qualitative (categorical) variables, typically a contingency table of counts.

Goal:

To understand the relationship between two nominal variables (the rows and the columns of a contingency table). Independence (or association) between the variables is tested with the chi-square statistic. CA then decomposes the table with the generalized singular value decomposition (GSVD) to identify the dimensions with the least dispersion and eliminate them.

CA performs a simultaneous analysis of rows and columns.

Key ideas

  • Chi-square distance: indicates whether the rows and the columns are independent or not. Note: if the variable is quantitative, R-squared tells us the same thing.

  • Total inertia = SUM(mass * (distance)^2) = chi-square / N, so chi-square = N * SUM(mass * (distance)^2). It serves as an alternative to an F test (effect * degrees of freedom).

  • Weights are assigned to the vertices (columns) and masses are assigned to the rows. (These roles are interchangeable.) The mass of each row is the proportion of this row in the total of the table. The weight of each column reflects its importance for discriminating between the rows. So the weight of a column reflects the information this column provides for the identification of a given row.

  • Rare categories are weighted more heavily than common ones.
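The quantities in the list above can be checked numerically. The following is a minimal base-R sketch on a small made-up table (the counts and the `Author1`..`Author4` labels are hypothetical, not the chapter's data). It computes the row masses, the column weights, and the squared chi-square distance of each row profile to the average profile, then compares `N * SUM(mass * distance^2)` with the statistic returned by `chisq.test()`.

```r
# Toy contingency table (made-up counts, NOT the chapter's data)
X <- matrix(c(20, 30, 10,
              15, 45,  5,
              30, 25, 20,
              10, 40, 15),
            nrow = 4, byrow = TRUE,
            dimnames = list(paste0("Author", 1:4),
                            c("Period", "Comma", "Other")))

N          <- sum(X)
row_mass   <- rowSums(X) / N   # mass of each row
col_prop   <- colSums(X) / N   # average row profile (barycenter)
col_weight <- 1 / col_prop     # rare columns receive larger weights

profiles <- X / rowSums(X)     # row profiles (each row sums to 1)

# Squared chi-square distance from each row profile to the barycenter
dev <- sweep(profiles, 2, col_prop)          # profile minus barycenter
d2  <- as.vector((dev^2) %*% col_weight)
names(d2) <- rownames(X)

inertia <- sum(row_mass * d2)  # SUM(mass * distance^2)
chi2    <- N * inertia         # back on the scale of counts

# Reference value from the usual test of independence
chi2_ref <- unname(chisq.test(X, correct = FALSE)$statistic)
print(c(from_distances = chi2, from_chisq_test = chi2_ref))
```

The two printed values should agree, which is the sense in which CA works with the chi-square statistic divided by the table total.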

Interpretations

  1. Symmetric plot: variables and observations cannot be compared directly
     with each other, because distances between a row point and a column
     point are not interpretable. Instead, look at the distance of the points
     from the origin, and compare rows with rows and columns with columns.
  
  2. Asymmetric plot: variables and observations can be compared together. The
     plot is asymmetrically scaled, because it is the joint display of profile
     and vertex points.
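The difference between the two maps comes down to how the column coordinates are scaled. The sketch below, in base R on a made-up table (hypothetical counts, not the chapter's data), obtains the CA solution from a plain SVD of the standardized residuals, which is one common way to carry out the GSVD step. The names `Fi`, `Fj`, and `Fj_vertex` are chosen here for illustration. It then checks the property that makes the asymmetric map readable: each row point is the barycenter of the column vertices, weighted by that row's profile.

```r
# Toy contingency table (made-up counts, NOT the chapter's data)
X <- matrix(c(20, 30, 10,
              15, 45,  5,
              30, 25, 20,
              10, 40, 15),
            nrow = 4, byrow = TRUE,
            dimnames = list(paste0("Author", 1:4),
                            c("Period", "Comma", "Other")))

P  <- X / sum(X)    # correspondence matrix
r  <- rowSums(P)    # row masses
cc <- colSums(P)    # column masses

# Standardized residuals from independence
S   <- diag(1 / sqrt(r)) %*% (P - r %o% cc) %*% diag(1 / sqrt(cc))
dec <- svd(S)
k   <- min(dim(X)) - 1                 # number of non-trivial dimensions
delta <- dec$d[1:k]                    # singular values
U <- dec$u[, 1:k, drop = FALSE]
V <- dec$v[, 1:k, drop = FALSE]

# Symmetric map: rows and columns both as factor scores
Fi <- diag(1 / sqrt(r))  %*% U %*% diag(delta)
Fj <- diag(1 / sqrt(cc)) %*% V %*% diag(delta)

# Asymmetric map: rows as factor scores, columns as vertices
Fj_vertex <- diag(1 / sqrt(cc)) %*% V

# Each row point is the profile-weighted average of the vertices
profiles <- X / rowSums(X)
Fi_from_vertices <- profiles %*% Fj_vertex
print(all.equal(Fi_from_vertices, Fi, check.attributes = FALSE))
```

Because the vertices sit further from the origin than the factor scores, row points in an asymmetric map tend to cluster near the center when the association in the table is weak.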

Sourcing functions

We source a helper function for the bootstrap. Because the data sample is quite large, we use an alternate method to compute the permutation tests and bootstrap ratios.

8.1 Dataset: French Author Punctuation Data

Data: Number of times each writer uses three punctuation marks: the period, the comma, and all the other marks (i.e., interrogation mark, exclamation mark, colon, and semicolon).

Rows: 84 French authors

Cols: 3 punctuation types (Period, Comma, Other)

8.2 Looking at Data Pattern

  1. Sort the columns to see which authors use each punctuation mark the most
  2. Heat Map

NOTE: The chi-square statistic is expressed in counts, but CA analyzes probabilities (i.e., the profiles). So we need to divide the chi-square statistic by the total sum of the data. Also, the chi-square statistic adds up the chi-squares of all cells and gives a single number.

In CA, however, we keep the pattern of cell-wise chi-squares instead of adding them all up.
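To make the note concrete, here is a minimal base-R sketch on a made-up table (hypothetical counts, not the chapter's data). It computes the chi-square contribution of each cell, shows that summing them gives the usual single statistic, and keeps the signed, rescaled cell pattern. The name `signed_pattern` is introduced here for illustration only.

```r
# Toy contingency table (made-up counts, NOT the chapter's data)
X <- matrix(c(20, 30, 10,
              15, 45,  5,
              30, 25, 20,
              10, 40, 15),
            nrow = 4, byrow = TRUE,
            dimnames = list(paste0("Author", 1:4),
                            c("Period", "Comma", "Other")))

N <- sum(X)
E <- outer(rowSums(X), colSums(X)) / N   # expected counts under independence

cell_chi2 <- (X - E)^2 / E               # chi-square contribution of each cell
chi2      <- sum(cell_chi2)              # the usual single number
inertia   <- chi2 / N                    # chi-square rescaled by the table total

# Keep the pattern: sign shows over- or under-use relative to independence
signed_pattern <- sign(X - E) * cell_chi2 / N
print(round(signed_pattern, 4))
```

Positive cells mark punctuation that a row uses more than expected under independence, and negative cells mark punctuation it uses less.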

8.2.1 Heat Map

Dark red indicates a higher use of the respective punctuation mark by the author. The red bar in the comma row represents the author Zola. Overall, however, the color gradient does not change much, hinting that there is not much information to be extracted from the data.

8.3 CA Analysis

8.3.1 Scree Plot

The scree plot shows two significant dimensions worth examining. With three columns, CA yields at most two dimensions: dimension 1 explains about 60% of the variance and dimension 2 about 40%.
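As an illustration of what sits behind a scree plot with significance markers, the sketch below computes the eigenvalues and their percentages in base R, then assesses them against tables generated under independence. It uses a made-up table (hypothetical counts, not the chapter's data), and the helper `ca_eigs()` is a name introduced here. Generating null tables with `r2dtable()` is an assumption of this sketch and is not necessarily the method used by the sourced functions.

```r
# Toy contingency table (made-up counts, NOT the chapter's data)
X <- matrix(c(20, 30, 10,
              15, 45,  5,
              30, 25, 20,
              10, 40, 15),
            nrow = 4, byrow = TRUE,
            dimnames = list(paste0("Author", 1:4),
                            c("Period", "Comma", "Other")))

# Hypothetical helper: CA eigenvalues of a table of counts
ca_eigs <- function(tab) {
  P  <- tab / sum(tab)
  r  <- rowSums(P)
  cc <- colSums(P)
  S  <- (P - r %o% cc) / sqrt(r %o% cc)  # standardized residuals
  k  <- min(dim(tab)) - 1                # non-trivial dimensions
  svd(S)$d[1:k]^2
}

eig <- ca_eigs(X)
tau <- 100 * eig / sum(eig)              # percentage of inertia per dimension

# Null distribution: random tables with the same margins, independence holds
set.seed(42)
n_perm    <- 500
perm_tabs <- r2dtable(n_perm, rowSums(X), colSums(X))
perm_eigs <- t(sapply(perm_tabs, ca_eigs))          # n_perm x k matrix

# Proportion of null eigenvalues at least as large as the observed ones
p_vals <- colMeans(sweep(perm_eigs, 2, eig, FUN = ">="))
print(round(tau, 1))
print(p_vals)
```

With three columns there are only two eigenvalues, so the percentages always add up to 100 across the two dimensions.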

8.3.2 Asymmetric Plot

The plot is asymmetrically scaled. Row points that lie closer to a particular vertex are considered to be driven by that vertex.

All the authors seem to be centered around the origin, which reveals little about how they relate to one another in their punctuation usage.

It might help to look at the symmetric plot instead.

8.3.3 Symmetric Plots

8.3.3.1 Design as per Location

The row factor scores, colored by the birthplace of the authors, show a null effect. The confidence intervals of the group means all seem to overlap, with no clear distinction between groups.

8.3.3.2 Design as per Writing Style

The authors are categorized by their most common writing style: Mix, Poetry, Prose, or science.

NOTE: The DESIGN argument of the epCA function is changed to color the points by writing type.

## [1] "Mix"     "Poetry"  "Prose"   "science"
# Packages assumed from the calls below:
# ExPosition (epCA), PTCA4CATA (map and label helpers), ggplot2 (layering with +)
library(ExPosition)
library(PTCA4CATA)
library(ggplot2)

# X (the count table) and data_FrAuthors are created earlier in the chapter
resCA.Writing <- epCA(X, symmetric = TRUE,
                      DESIGN = data_FrAuthors$`Writing Type`, graphs = FALSE)

# Row (authors) and column (punctuation) factor scores
Fi <- resCA.Writing$ExPosition.Data$fi
Fj <- resCA.Writing$ExPosition.Data$fj

colnames(Fi) <- paste0("Dimension ", 1:ncol(Fi))
colnames(Fj) <- paste0("Dimension ", 1:ncol(Fj))

# Symmetric map of rows and columns, rows colored by writing type
Writing.Plot <- createFactorMapIJ(Fi,
                                  Fj,
                                  text.cex.i = 3,
                                  col.points.i = resCA.Writing$Plotting.Data$fi.col,
                                  col.labels.i = "maroon",
                                  title = "French Authors - Symmetric Map with Writing")

# Axis labels built from the eigenvalues and the percentages of inertia
Writing.label <- createxyLabels.gen(1, 2,
                                    lambda = resCA.Writing$ExPosition.Data$eigs,
                                    tau = round(resCA.Writing$ExPosition.Data$t),
                                    axisName = "Component ")

# Axis labels built directly from the CA result
labels4CA_Writing <- createxyLabels(resCA = resCA.Writing)

# Get the index of the first row of each group
grp.ind <- order(data_FrAuthors$`Writing Type`)[!duplicated(sort(data_FrAuthors$`Writing Type`))]
grp.col <- resCA.Writing$Plotting.Data$fi.col[grp.ind] # get the color

grp.name <- data_FrAuthors$`Writing Type`[grp.ind] # get the corresponding groups
names(grp.col) <- grp.name

# Mean factor scores of each writing type
group.mean <- aggregate(resCA.Writing$ExPosition.Data$fi,
                        by = list(data_FrAuthors$`Writing Type`), # must be a list
                        mean)

rownames(group.mean) <- group.mean[, 1] # use the first column as row names
fi.mean <- group.mean[, -1]             # exclude the first column

# Map of the group means, drawn as large triangles
fi.mean.plot <- createFactorMap(fi.mean,
                                alpha.points = 0.8,
                                col.points = grp.col[rownames(fi.mean)],
                                col.labels = grp.col[rownames(fi.mean)],
                                pch = 17,
                                cex = 3,
                                text.cex = 3)

# Assemble the layers: base map, axis labels, authors, group means, punctuation
Writing.Plot$baseMap + Writing.label + Writing.Plot$I_points +
  labels4CA_Writing + fi.mean.plot$zeMap_dots + fi.mean.plot$zeMap_text +
  Writing.Plot$J_points + Writing.Plot$J_labels

Tolerance interval

All the groups overlap to a large extent.

Confidence interval

The bootstrap confidence intervals also seem to overlap, showing a null effect overall.

8.3.4 Contribution Barplots

Row:

Dim 1: Sue Eugene has a very large contribution

Dim 2: Zola Emile (commas) and Dumas Alexandre (other)

Columns:

Dim 1: Explained mostly by the punctuation Period

Dim 2: Segregates Comma and other
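For readers who want to see how the bars in a contribution plot are computed, here is a minimal base-R sketch on a made-up table (hypothetical counts, not the chapter's data). It uses the standard CA definition: the contribution of a row to a dimension is its mass times its squared factor score, divided by the eigenvalue of that dimension, and likewise for columns. The names `ctr_rows` and `ctr_cols` are introduced here for illustration.

```r
# Toy contingency table (made-up counts, NOT the chapter's data)
X <- matrix(c(20, 30, 10,
              15, 45,  5,
              30, 25, 20,
              10, 40, 15),
            nrow = 4, byrow = TRUE,
            dimnames = list(paste0("Author", 1:4),
                            c("Period", "Comma", "Other")))

P  <- X / sum(X)
r  <- rowSums(P)                          # row masses
cc <- colSums(P)                          # column masses
S  <- (P - r %o% cc) / sqrt(r %o% cc)     # standardized residuals
dec   <- svd(S)
k     <- min(dim(X)) - 1
delta <- dec$d[1:k]                       # singular values
eig   <- delta^2                          # eigenvalues

# Factor scores of rows and columns
Fi <- sweep(dec$u[, 1:k] / sqrt(r),  2, delta, "*")
Fj <- sweep(dec$v[, 1:k] / sqrt(cc), 2, delta, "*")

# Contributions: mass * (factor score)^2 / eigenvalue
ctr_rows <- sweep(r  * Fi^2, 2, eig, "/")
ctr_cols <- sweep(cc * Fj^2, 2, eig, "/")
dimnames(ctr_rows) <- list(rownames(X), paste0("Dim", 1:k))
dimnames(ctr_cols) <- list(colnames(X), paste0("Dim", 1:k))

print(round(ctr_rows, 3))
print(round(ctr_cols, 3))
```

Contributions sum to 1 within each dimension, so a common rule of thumb is to flag items whose contribution exceeds the average (1 divided by the number of rows, or of columns).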

8.3.5 Bootstrap Ratio Barplots

The bootstrap ratio barplots show that the contributions are stable, and they also flag a few additional rows and columns as significant.
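A bootstrap ratio is the mean of the bootstrapped factor scores divided by their standard deviation. The sketch below illustrates the idea in base R on a made-up table (hypothetical counts, not the chapter's data). As assumptions of this sketch, it resamples the whole table from a multinomial distribution and projects each resampled table onto the fixed axes with the CA transition formula. This is not necessarily the alternate method used by the sourced functions, and `project_cols()` is a name introduced here.

```r
# Toy contingency table (made-up counts, NOT the chapter's data)
X <- matrix(c(20, 30, 10,
              15, 45,  5,
              30, 25, 20,
              10, 40, 15),
            nrow = 4, byrow = TRUE,
            dimnames = list(paste0("Author", 1:4),
                            c("Period", "Comma", "Other")))

N  <- sum(X)
P  <- X / N
r  <- rowSums(P)
cc <- colSums(P)
S  <- (P - r %o% cc) / sqrt(r %o% cc)    # standardized residuals
dec <- svd(S)
k   <- min(dim(X)) - 1
row_std <- dec$u[, 1:k] / sqrt(r)        # standard row coordinates (fixed axes)

# Hypothetical helper: column factor scores via the transition formula
project_cols <- function(tab) {
  col_profiles <- t(tab) / colSums(tab)  # J x I, each row sums to 1
  col_profiles %*% row_std
}

Fj <- project_cols(X)                    # observed column factor scores

# Resample the table and project each replicate onto the same axes
set.seed(1)
n_boot  <- 200
boot_Fj <- replicate(n_boot, {
  counts <- rmultinom(1, size = N, prob = as.vector(P))
  project_cols(matrix(counts, nrow = nrow(X)))
})                                       # J x k x n_boot array

# Bootstrap ratio: mean over replicates divided by their standard deviation
BR <- apply(boot_Fj, c(1, 2), mean) / apply(boot_Fj, c(1, 2), sd)
dimnames(BR) <- list(colnames(X), paste0("Dim", 1:k))
print(round(BR, 2))
```

Bootstrap ratios behave roughly like t statistics, so an absolute value above about 2 is commonly read as a stable, significant contribution to the dimension.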

8.4 Conclusion

The heatmap did not reveal many differences in the use of punctuation among the authors.

Looking at the symmetric maps, the CA showed a null effect for all three designs (birth year, location, and writing style).

However, in the raw data without any design, Sue Eugene contributes heavily to the first dimension, which is driven by the use of periods, whereas the second dimension shows that the frequency of commas is strongly influenced by Zola Emile and that of the other punctuation marks by Dumas Alexandre.