# Chapter 8 Correspondence Analysis

Data table: CA is used to analyze one table data that has qualitative variables.

Goal:

To understand the relationships between two nominal variables(columns) using a contingency table. The independence (or correlation) between the variables is tested using Chi square statistics. Then it performs decomposition using GSVD to identify dimensions where dispersion is the least and eliminate them.

CA performs a simultaneous analysis of rows and columns.

Key ideas

• Chi Square distance : explains if the rows and the columns are independent or not. Note If the variable is quantitative R square tells us the same.

• Chi square = SUM( Mass * (Distance)^2) Alternative for a F test. (effect * degrees of freedom)

• Weights are assigned to Vertices and Masses are assigned to rows. (This is interchangable). The mass of each row is the proportion of this row in the total of the table. The weight of each column reflects its importance for discriminating between the variables. So the weight of a column reflects the information this columns provides to the identification of a given row.

• We weight the rarity higher than the common ones

Interpretations

``````  1. Symmetric Plot: Cannot compare variables and observations together
because both rows and columns are normalized differently. So you look at
the distance of the points from the origin.

2. Assymetric plot: can compare variables and observations together.The plot
is asymmetrically scaled, because it is the joint display of profile and
vertex points. ``````

Sourcing functions

Sourcing Function for Bootstrap. Since Data sample is quite large we use alternate method for permutation and bootstrap ratios.

``````dir4functions <- 'C:\\Users\\KIRTHANA\\Desktop\\UT dallas\\Sem3\\RM3\\CA\\FrenchAuthor\\'

# This is the name of my directory where my R-file is stored
workingDir    <- 'C:\\Users\\KIRTHANA\\Desktop\\UT dallas\\Sem3\\RM3\\CA\\FrenchAuthor\\'

# the functions are saved in the file: InferencesMultinom4CA.R
# If  you have used another name change this name below:
file4functions <- 'InferencesMultinom4CA.R'
#
# need ExPosition
library(ExPosition)
# load the functions by "sourcing" the file
source(paste0(dir4functions,file4functions))``````

## 8.1 Dataset: French Author Punctuation Data

Data: Number of times each writer uses three punctuation marks: the period, the comma, and all the other marks (i.e., interrogation mark, exclamation mark, colon, and semicolon).

Rows: 84 french Authors

Cols: 3 punctutation types (Period, Comma, Other)

## 8.2 Looking at Data Pattern

1. Sort the columns to see the authors that use which punctuation the most
2. Heat Map

NOTE: chi-square is in counts, but CA analyzed probabilities (i.e., the profiles). So, we need to divide the chi-square statistics by the total sum of the data. Also, the chi-square statistic adds the chi-squares in all cells and give one number.

In CA, however, we keep the pattern of chi-squares instead of adding all of them up.

### 8.2.1 Heat Map

The dark red color shows higher presence of the respective punctuation by the author. The red bar in the comma row represents author Zolo. However, overall the gradiant of colors do not change much hinting that there is not much information to be extracted in the data. ## 8.3 CA Analysis

``````# run a plain CA with ExPosition
resCA4Authors <- epCA(X, graphs = FALSE)

# Permutation of eigenvalues-----------------------------------------------------

res4PermTest <- Perm4MultCA(X = X, nIter = 1000)

# BOOTSTRAP --------------------------------------------------------------------
# to find the "significant eigenvalues"
# use the Malinvaud-Saporta test.

Res4Malinvaud <- MalinvaudQ4CA(Data = X,
permutedEigenValues =
res4PermTest\$permEigenvalues)

# Get the boostrap estimates
res4Boot <- Boot4MultCA(X = X,
Fi = resCA4Authors\$ExPosition.Data\$fi,
Fj = resCA4Authors\$ExPosition.Data\$fj,
delta = resCA4Authors\$ExPosition.Data\$pdq\$Dv,
nf2keep = 3,
nIter = 1000,
critical.value = 2,
eig = TRUE,
alphaLevel = .05)``````
``````#run CA Asymetric
resCA.asym <- epCA(X, symmetric = FALSE, graphs = FALSE)

#run CA Symmetric
resCA.sym <- epCA(X, symmetric = TRUE, graphs = FALSE)``````

### 8.3.1 SCREE PLOT

Scree plot with significant eigenvalues depicts two significant dimensions worth looking at. Dimension 1 explains about 60% of variance and dimension 2 about 40% ### 8.3.2 Asymmetric plot:

The plot is asymmetrically scaled. The row points closer to a particular vertex is considered as driven by that vertex.

All the authors seem to be centered around the origin not revesling much about their relationship with respect to their punctuation usage.

``````# factor scores
Fi.a <-resCA.asym\$ExPosition.Data\$fi
Fj.a <- resCA.asym\$ExPosition.Data\$fj
Fi   <- resCA.sym\$ExPosition.Data\$fi
Fj   <- resCA.sym\$ExPosition.Data\$fj

# constraints -----
# first get the constraints correct
constraints.sym  <- minmaxHelper(mat1 = Fi, mat2  = Fj)
constraints.asym <- minmaxHelper(mat1 = Fi, mat2  = Fj.a)
#constraints.sup  <- minmaxHelper(mat1 = rbind(Fi, HA.sup\$fii),
#                                 mat2  = rbind(Fj, punct.sup\$fjj) )

# Get some colors ----
color4Authors <-prettyGraphsColorSelection(n.colors = nrow(Fi))
# baseMaps ----
colnames(Fi) <- paste("Dimension ", 1:ncol(Fi))
colnames(Fj) <- paste("Dimension ", 1:ncol(Fj))
colnames(Fj.a) <- paste("Dimension ", 1:ncol(Fj.a))

#Create Factor Maps
ca.plot <- createFactorMapIJ(resCA.asym\$ExPosition.Data\$fi,
resCA.asym\$ExPosition.Data\$fj,
text.cex.i = 3,
col.points.i = "orange",
col.labels.i = "maroon",
title = "French Authors - asymmetric")
ca.label <- createxyLabels.gen(1,2,
lambda = resCA.asym\$ExPosition.Data\$eigs,
tau = round(resCA.asym\$ExPosition.Data\$t),
axisName = "Component "
)

# Make the simplex visible
zePoly.J <-  PTCA4CATA::ggdrawPolygon(Fj.a,
color = 'green',
size = .2,
fill =  'green',
alpha = .1)
# Labels
labels4CA <- createxyLabels(resCA = resCA.asym)

ca.plot\$baseMap + ca.label + ca.plot\$I_points + ca.plot\$J_points + ca.plot\$J_labels+ zePoly.J`````` It might help to look at the Symmetric Plot instead

### 8.3.3 Symmetric Plots

#### 8.3.3.1 Design as per Location

The row factor scores colored as per the birth place of the authors shows a null effect. All the confidence interval of the means seem to overlap without a clear distinction.

#### 8.3.3.2 Design as per writing Style

The Authors are categorzed as per their most common writing style - Novels, Poetry, Mixed, Other.

NOTE: The design parameter is changed in the epCA function to color the different writing types.

``unique(data_FrAuthors\$`Writing Type`)``
``##  "Mix"     "Poetry"  "Prose"   "science"``
``````resCA.Writing<- epCA(X, symmetric = TRUE, DESIGN = data_FrAuthors\$`Writing Type` ,graphs = FALSE)

Fi   <- resCA.Writing\$ExPosition.Data\$fi
Fj   <- resCA.Writing\$ExPosition.Data\$fj

colnames(Fi) <- paste0("Dimension ", 1:ncol(Fi))
colnames(Fj) <- paste0("Dimension ", 1:ncol(Fj))

Writing.Plot <-createFactorMapIJ(Fi,
Fj,
text.cex.i = 3,
col.points.i = resCA.Writing\$Plotting.Data\$fi.col,
col.labels.i = "maroon",
title = "French Authors - Symmetric Map with Writing")

Writing.label<-createxyLabels.gen(1,2,
lambda = resCA.Writing\$ExPosition.Data\$eigs,
tau = round(resCA.Writing\$ExPosition.Data\$t),
axisName = "Component "
)

labels4CA_Writing <- createxyLabels(resCA = resCA.Writing)

# get index for the first row of each group
grp.ind <- order(data_FrAuthors\$`Writing Type`)[!duplicated(sort(data_FrAuthors\$`Writing Type`))]
grp.col <- resCA.Writing\$Plotting.Data\$fi.col[grp.ind] # get the color

grp.name <- data_FrAuthors\$`Writing Type`[grp.ind] # get the corresponding groups
names(grp.col) <- grp.name

group.mean <- aggregate(resCA.Writing\$ExPosition.Data\$fi,
by = list(data_FrAuthors\$`Writing Type`), # must be a list
mean)

rownames(group.mean) <- group.mean[,1] # Use the first column as row names
fi.mean <- group.mean[,-1] # Exclude the first column

fi.mean.plot <- createFactorMap(fi.mean,
alpha.points = 0.8,
col.points = grp.col[rownames(fi.mean)],
col.labels = grp.col[rownames(fi.mean)],
pch = 17,
cex = 3,
text.cex = 3)

Writing.Plot\$baseMap+Writing.label+Writing.Plot\$I_points+
labels4CA_Writing+ fi.mean.plot\$zeMap_dots + fi.mean.plot\$zeMap_text +Writing.Plot\$J_points+Writing.Plot\$J_labels`````` Tolerance interval

All the gorups tend to overlap by large extend.

``````TIplot <- MakeToleranceIntervals(Fi,
design = as.factor(data_FrAuthors\$`Writing Type`),
# line below is needed
names.of.factors =  c("Dim1","Dim2"), # needed
col = grp.col[rownames(fi.mean)],
line.size = .50,
line.type = 3,
alpha.ellipse = .2,
alpha.line    = .4,
p.level       = .95)

fi.WithMeanTI <- Writing.Plot\$baseMap + Writing.label + Writing.Plot\$I_points+
Writing.Plot\$J_points+ fi.mean.plot\$zeMap_dots +
fi.mean.plot\$zeMap_text + TIplot  #Location.Plot\$J_labels

fi.WithMeanTI`````` Confidence interval

There bootrap intervals also seem to overlap showing a null effect overall.

``````fi.boot <- Boot4Mean(Fi,
design = data_FrAuthors\$`Writing Type`,
niter = 1000)

bootCI4mean <- MakeCIEllipses(fi.boot\$BootCube[,c(1:2),], # get the first two components
col = grp.col[rownames(fi.mean)])

fi.WithMeanCI <- Writing.Plot\$baseMap + bootCI4mean + Writing.label+
Writing.Plot\$I_points+Writing.Plot\$J_points+
Writing.Plot\$J_labels + fi.mean.plot\$zeMap_dots +
fi.mean.plot\$zeMap_text

fi.WithMeanCI`````` #### 8.3.3.3 Design as per Birth year

There color gradient goes darker red representing older years while lighter shades of yellow represent recent years.

The interpretation is not quite clear. However the bright yellow rows representing the recent years are closer to the punctuation of commas.

``````# conditional formating the years
#install.packages("RColorBrewer")
library(RColorBrewer)

birth_year<-data_FrAuthors[,5]
birth_year<-data_FrAuthors[,5]
mypal <- colorRampPalette( c( "yellow", "green" ) )(20)

f <- function(x,n=20){
heat.colors(n)[cut(x,n)]
}

y <-f(as.numeric(birth_year\$`Birth Year`))

resCA.sym <- epCA(X, symmetric = TRUE, graphs = FALSE)

ca.symplot <- createFactorMapIJ(resCA.sym\$ExPosition.Data\$fi,
resCA.sym\$ExPosition.Data\$fj,
text.cex.i = 3,
col.points.i = y,
col.labels.i = "maroon",
title = "French Authors - Asymmetric Map"
)

ca.symlabel <- createxyLabels.gen(1,2,
lambda = resCA.sym\$ExPosition.Data\$eigs,
tau = round(resCA.sym\$ExPosition.Data\$t),
axisName = "Component "
)

labels4CAsym <- createxyLabels(resCA = resCA.sym)

ca.symplot\$baseMap + ca.symlabel + ca.symplot\$I_points + ca.symplot\$J_points +
ca.symplot\$J_labels +ggtitle('French Authors - Symmetric Map ') + labels4CAsym`````` ``````gridExtra::grid.arrange(
ca.symplot\$baseMap + ca.symlabel + ca.symplot\$I_points,
ca.symplot\$baseMap + ca.symlabel + ca.symplot\$J_points + ca.symplot\$J_labels,
nrow = 1, ncol = 2 )`````` ``````# Looking at the labels of the authors
ca.symplot\$baseMap + ca.symlabel + ca.symplot\$I_points + ca.symplot\$I_labels`````` ### 8.3.4 Contribution Barplots

Row:

Dim 1: Sue Eugene has a very large contribution

Dim 2: Zola Emilie (commas) and Dumas Alexandre (other)

Columns:

Dim 1: Explained mostly by the punctuation Period

Dim 2: Segregates Comma and other

``````X<-data_FrAuthors[,c(2:4)]
row.names(X) <- data_FrAuthors\$Authors ``````
``## Warning: Setting row names on a tibble is deprecated.``
``````resCA.sym <- epCA(X, symmetric = TRUE, graphs = FALSE)

signed.ctrI <- resCA.sym\$ExPosition.Data\$ci * sign(resCA.sym\$ExPosition.Data\$fi)
signed.ctrJ <- resCA.sym\$ExPosition.Data\$cj * sign(resCA.sym\$ExPosition.Data\$fj)

# plot contributions of rows for component 1
ctrI.1 <- PrettyBarPlot2(signed.ctrI[,1],
threshold = 1 / NROW(signed.ctrI),
font.size = 4,
color4bar =
gplots::col2hex(resCA.sym\$Plotting.Data\$fi.col),# we need hex code
signifOnly = TRUE,
ylab = 'Contributions',
ylim = c(1.2*min(signed.ctrI), 1.2*max(signed.ctrI))
) + ggtitle("Component 1", subtitle = 'rows')

ctrI.1`````` ``````# plot contributions of columns for component 1
ctrJ.1 <- PrettyBarPlot2(signed.ctrJ[,1],
threshold = 1 / NROW(signed.ctrJ),
font.size = 4,
signifOnly = TRUE,
color4bar = gplots::col2hex(resCA.sym\$Plotting.Data\$fj.col), # we need hex code
ylab = 'Contributions',
ylim = c(1.2*min(signed.ctrJ), 1.2*max(signed.ctrJ))
) + ggtitle("", subtitle = 'columns')

ctrJ.1`````` ``````# plot contributions of rows for component 2
ctrI.2 <- PrettyBarPlot2(signed.ctrI[,2],
threshold = 1 / NROW(signed.ctrI),
font.size = 4,
signifOnly = TRUE,
color4bar = gplots::col2hex(resCA.sym\$Plotting.Data\$fi.col), # we need hex code
ylab = 'Contributions',
ylim = c(1.2*min(signed.ctrI), 1.2*max(signed.ctrI))
) + ggtitle("Component 2", subtitle = 'rows')

ctrI.2`````` ``````# plot contributions of columns for component 2
ctrJ.2 <- PrettyBarPlot2(signed.ctrJ[,2],
threshold = 1 / NROW(signed.ctrJ),
font.size = 4,
signifOnly = TRUE,
color4bar = gplots::col2hex(resCA.sym\$Plotting.Data\$fj.col), # we need hex code
ylab = 'Contributions',
ylim = c(1.2*min(signed.ctrJ), 1.2*max(signed.ctrJ))
) + ggtitle("", subtitle = 'columns')``````

### 8.3.5 Bootstrap Ratio Barplots

The Bootstrap ratio barplot show that the contributions are significantly stable and also brings up a few other rows and column as significant.

``````BR.I <- res4Boot\$bootRatios.i
BR.J <- res4Boot\$bootRatios.j

# Plot the bootstrap ratios for Dimension 1
threshold = 2,
font.size = 4,
signifOnly = TRUE,
color4bar = gplots::col2hex(resCA.sym\$Plotting.Data\$fi.col), # we need hex code
ylab = 'Bootstrap ratios'
) + ggtitle(paste0('Component ', laDim), subtitle = 'rows')

ba001.BR1.I`````` ``````ba002.BR1.J <- PrettyBarPlot2(BR.J[,laDim],
threshold = 2,
font.size = 4,
signifOnly = TRUE,
color4bar = gplots::col2hex(resCA.sym\$Plotting.Data\$fj.col), # we need hex code
ylab = 'Bootstrap ratios'
) + ggtitle("", subtitle = 'columns')

ba002.BR1.J`````` ``````# Plot the bootstrap ratios for Dimension 2
threshold = 2,
font.size = 4,
signifOnly = TRUE,
color4bar = gplots::col2hex(resCA.sym\$Plotting.Data\$fi.col), # we need hex code
ylab = 'Bootstrap ratios'
) + ggtitle(paste0('Component ', laDim), subtitle = 'rows')

ba002.BR1.J`````` ``````ba004.BR2.J <- PrettyBarPlot2(BR.J[,laDim],
threshold = 2,
font.size = 4,
signifOnly = TRUE,
color4bar = gplots::col2hex(resCA.sym\$Plotting.Data\$fj.col), # we need hex code
ylab = 'Bootstrap ratios'
) + ggtitle("", subtitle = 'columns')

ba004.BR2.J`````` ## 8.4 Conclusion

The heatmap did not reveal much differences between the use of punctuations in general amongst the authors.

The CA analysis resulted in showing a null effect with all the three different type of design (Birth year, Location, Writing Style) while looking at the Symmetric Maps.

However with respect to the raw data without any design, it could be observed that Sue Eugene contributed heavily towards the use of periods in her works wheras the second dimension reveals that the frequency od commas is highly influenced by Zola Emile. And Other punctuations in general was largely influenced by the author Duman Alexandre.