# Chapter 7 PCA analysis

## 7.1 PCA and Kinship visualization

In this section we will examine population structure using the PCAtools R package (Blighe and Lun 2020) and visualize the Kinship matrix calculated from the previous section using the pheatmap R package (Kolde 2019).

Other packages used in this section include rMVP (Yin et al. 2020), data.table (Dowle and Srinivasan 2020), tidyverse (Wickham 2019) and pheatmap (Kolde 2019)

``````library(PCAtools)
library(rMVP)
library(data.table)
library(tidyverse)
library(pheatmap)``````

Set working directory to where rMVP formatted files are in workshop materials

``setwd("~/PGRP_mapping_workshop/data/genotypic_and_phenotypic_data/")``

Load SNP matrix, phenotypes and Kinship matrix

``````## Make SNP matrix with 5000 SNPs
snp_file = attach.big.matrix("WIDIV_942.geno.desc")
snp_file = snp_file[,]
dim(snp_file)

## Read in Subpopulation information for 942 lines
phenos = left_join(phenos,subpop)

Kin = attach.big.matrix("WIDIV_942.kin.desc")
Kin = Kin[,]

## Set colnames and rownames
colnames(snp_file) = phenos\$Taxa
rownames(phenos) = phenos\$Taxa``````

Run PCA

``````## Run PCA including phenotypes as metadata
p_snp = pca(snp_file,#SNPs in rows, samples in columns
center = TRUE,
scale = TRUE,

Put results of the first 3 PCs into a dataframe

``````pca.dat = data.frame(Genotype = p_snp\$metadata\$Taxa,
PC1 = p_snp\$rotated\$PC1,
PC2 = p_snp\$rotated\$PC2,
PC3 = p_snp\$rotated\$PC3,
stringsAsFactors = FALSE)``````

Make a scatterplot of the first 3 PCs

``````## Scatterplots of top 3 PCs
pc.percent = p_snp\$variance
sum(pc.percent[1:3])
lbls = paste("PC", 1:3, "\n", format(pc.percent[1:3], digits=2), "%", sep="")
my_cols = c("#b15928", "#000000", "#6a3d9a","#cab2d6","#ff7f00","#fdbf6f","#e31a1c","#fb9a99","#33a02c","#b2df8a","#1f78b4","#a6cee3")
pairs(pca.dat[,2:4], col=my_cols[pca.dat\$Subpopulation], labels=lbls,pch = 19,  cex = 0.9,lower.panel = NULL)

setwd("~/PGRP_mapping_workshop/figures/pcs_and_kinship/")
pdf("WIDIV_942_PCA_Matrix.pdf", width = 10, height = 9)
pairs(pca.dat[,2:4], col=my_cols[pca.dat\$Subpopulation], labels=lbls,pch = 19,  cex = 1.1,lower.panel = NULL)
dev.off()``````

Correlation of simulated phenotypes with forst 3 PCs

``````peigen = eigencorplot(p_snp,
components = p_snp\$components[1:3],
metavars = c('Color_low_h2','Color_high_h2'),
col = c('#d7191c', '#fdae61', '#ffffbf', '#abdda4', '#2b83ba'),
cexCorval = 1.2,
fontCorval = 2,
posLab = 'all',
rotLabX = 45,
scale = TRUE,
plotRsquared = FALSE,
corFUN = 'pearson',
corUSE = 'complete.obs',
signifSymbols = c('***', '**', '*', ''),
signifCutpoints = c(0, 0.001, 0.01, 0.05, 1))

pdf("WIDIV_942_PCA_phenos_cor.pdf", width = 6, height = 3)
peigen
dev.off()``````

Make a heatmap of the Kinship matrix

``````pheatmap(Kin)
pdf("WIDIV_942_Kinship_Matrix.pdf", width = 10, height = 9)
pheatmap(Kin)
dev.off()``````

### References

Blighe, Kevin, and Aaron Lun. 2020. PCAtools: Everything Principal Components Analysis. https://github.com/kevinblighe/PCAtools.

Dowle, Matt, and Arun Srinivasan. 2020. Data.table: Extension of ‘Data.frame‘. https://CRAN.R-project.org/package=data.table.

Kolde, Raivo. 2019. Pheatmap: Pretty Heatmaps. https://CRAN.R-project.org/package=pheatmap.