Chapter 4 Practical session 4
In this session we will have a hands-on exploration of GW-PCA and its application to STx data. What can we learn from this novel technique?
4.1 Geographically Weighted Principal Components Analysis (GWPCA)
A standard PCA can pick out the key multivariate modes of variability in the data. Looking at outlying values of the principal components of these data gives us an idea of unusual sites (in terms of combinations of gene expression profiles - and to a certain extent of combinations of cell types in each spot). Next, geographically weighted PCA can be used to find spatial multivariate outliers. Sounds complicated, but really all this means is it identifies sites that have an unusual multi-way combination of gene expression in relation to their immediate geographical neighbours. It might be that the values observed at these sites as a combination is not uncommon in the tissue as a whole - but is very unusual in its locality.
To find such outliers the procedure is relatively simple - instead of doing a PCA on the tissue as a whole, for each sample we do a PCA on data falling into a window centred on the location of that spot. In that way we can check whether the spot is like its neighbours or not, from a multivariate viewpoint.
The procedure we will follow in this practical carries out a geographically weighted PCA. In short, it runs a ‘windowed’ PCA around each of the spots.
4.3 Load Quality Controled and Normalised data
First of all, we need to load the data we prepared in the previous practical.
<- readRDS(file = "./data/to_load/practical03_sfe.rds")
sfe <- readRDS(file = "./data/to_load/practical03_topHVGs.rds") top_hvgs
4.4 Parameter prearation for GWPCA
The gwpca
method uses princomp
internally to run the PCAs - this function does not allow the number of variables (genes) to be greater than the number of samples (spots). This imposes a hard requirement on the data pre-processing. We have, however, already identified the highly variable genes in our sample, and for this case, there are fewer genes than spots.
Some other parameterisation is neccessary and these required parameters (as we have used for this dataset) are illustrated here:
## Get the gene names that are going to be evaluated
= top_hvgs
vars ## Set a fixed bandwidth
= 6*sfe@metadata[["spotDiameter"]][["JBO019"]][["spot_diameter_fullres"]]
bw ## Set the number of components to be retained
= 20
k ## Set the kernel to be used
= "gaussian"
kernel ## Set the Minkowski distance power: p = 2 --> Euclidean
= 2
p ## Is the bandwidth adaptive?: No because spots are fixed
= FALSE
adaptive ## Cross-Validate GWPCA?
= TRUE
cv ## Calculate PCA scores?
= FALSE
scores ## Run a robust GWPCA?
= FALSE
robust ## Make a cluster for parallel computing (otherwise GWPCA is slow!)
<- parallel::makeCluster(parallelly::availableCores() - 1, type = 'FORK') my.cl
The bandwidth defines a radius around each spot - every spot that falls inside this radius is considered a neighbour. We can set bandwidth as a fixed value (as here) or we can select the bandwidth automatically. Without going into detail here, this is achieved by a form of cross validation, where each observation is omitted, and it is attempted to reconstruct the values on the basis of principal components, derived from the other observations. The bandwidth achieving the optimal results is the one selected. For a complete explanation, see Harris, Brunsdon, and Charlton (2011). The function bw.gwpca
from GWmodel
can be used to computes this.
- NOTE: Larger bandwidths imply bigger moving spatial windows, which in turn imply smoother spatially varying outputs.
4.5 Run GWPCA
Here we present the invocation to run GWPCA, however because this process is computationally intensive and time-consuming, we do not suggest running it on posit.cloud. We have pre-computed the result and provide it for you to load.
# DO NOT RUN THIS CHUNK
<- gwpcaSTE(sfe = sfe,
pcagw assay = "logcounts",
vars = vars,
p = p,
k = k,
bw = bw,
kernel = kernel,
adaptive = adaptive,
scores = scores,
robust = robust,
cv = cv,
future = FALSE,
strategy = "cluster",
workers = my.cl,
verbose = FALSE)
saveRDS(pcagw, file = "./data/to_load/practical04_pcagw.rds")
Because GWPCA can take some time to run, we ran it for you and below you can load the output:
<- readRDS(file = "./data/to_load/practical04_pcagw.rds") pcagw
4.6 Plot global PCA results
In the next steps we will take a look inside the output from the gwpca
function and we are going to extract some basic information. Since GWPCA consists of multiple local PCAs, it is good to know how many PCs makes sense to look at. We can do so by running a global PCA and plotting a scree plot:
plotGWPCA_global(gwpca = pcagw,
comps = 1:10,
type = "scree",
point_args = list(size = 3, colour = "red"),
line_args = list(linewidth = 1, colour = "dodgerblue"))
In a Principal Component Analysis (PCA), the first three principal components may explain less than 15% of the variance in the data if the data is highly dispersed or if there is a large amount of noise in the data. This means that the first three principal components are not capturing a significant portion of the variability in the data. This could be due to a lack of clear structure in the data or a lack of meaningful patterns that can be captured by the PCA. Alternatively, it could be due to the presence of many irrelevant features or variables in the data that are not contributing to the overall variance. This is one more of the reasons why GWPCA is more appropriate for STx data. Because, it may be true that the global PCs are not strong but locally this can change.
4.7 Identify the leading genes in each location
The genes with the highest loading scores (where loading score = correlation between variable and component) at each location can be thought of as the “leading genes” - i.e. those with the most explanatory power with respect to the variability of gene expression at that location. These leading genes can be a local indicator of relevant biology.
Here we look at leading genes in 2 ways - (1) by finding the single gene with the highest loading at each location; (2) by finding sets of the top 4 genes by loading score, where the order of those genes does not matter (so the ordered set A,B,C,D is considered the same as D,B,A,C).
## Extract leading genes
<- gwpca_LeadingGene(gwpca = pcagw,
pcagw sfe = sfe,
pc_nos = 1:4,
type = "single",
names = "gene_names")
## 16 leading genes found for PC1
## The leading genes in PC1 are:
## ADH1A C7 CRP CYP3A4 GLUL GSTA2 HAMP HBA2
## 2 11 4 365 7 1 13 33
## IGLL5 MALAT1 MTRNR2L12 MTRNR2L8 NNMT PTGDS SAA1 SDS
## 87 39 153 181 23 73 36 133
## 21 leading genes found for PC2
## The leading genes in PC2 are:
## C7 CAT CFHR1 CRP CYP3A4 GLUL HBA2 HBB
## 3 6 38 39 149 83 2 37
## IGFBP3 IGFBP7 IGJ IGLL5 MALAT1 MTRNR2L10 MTRNR2L12 MTRNR2L8
## 49 39 34 246 80 10 78 124
## NNMT SAA1 SDS TAGLN UGT2B7
## 42 12 69 20 1
## 24 leading genes found for PC3
## The leading genes in PC3 are:
## AEBP1 C7 CAT CFHR1 CRP CYP3A4 GLUL HBA2
## 2 2 27 20 5 20 17 27
## HBB IGFBP3 IGFBP7 IGJ IGLL5 MALAT1 MTRNR2L10 MTRNR2L12
## 150 41 77 6 399 136 6 61
## MTRNR2L8 MYL9 NNMT SAA1 SCGB3A1 SDS TAGLN UGT2B7
## 25 9 24 6 56 15 26 4
## 25 leading genes found for PC4
## The leading genes in PC4 are:
## AEBP1 CAT CFHR1 CRP FXYD2 GLUL GSTA2 HBA2
## 1 53 15 7 7 33 3 2
## HBB IGFBP3 IGFBP7 IGJ IGLL5 MALAT1 MTRNR2L10 MTRNR2L12
## 181 100 51 60 281 201 5 16
## MTRNR2L8 MYL9 NNMT ORM2 SAA1 SDS SPINK1 TAGLN
## 16 5 55 6 6 37 12 4
## UGT2B7
## 4
<- gwpca_LeadingGene(gwpca = pcagw,
pcagw sfe = sfe,
pc_nos = 1:4,
genes_n = 4,
type = "multi",
method = "membership",
names = "gene_names")
## The number of individual leading genes groups found for PC1 is: 110
## These groups are: Too many to print them!
## The number of individual leading genes groups found for PC2 is: 240
## These groups are: Too many to print them!
## The number of individual leading genes groups found for PC3 is: 310
## These groups are: Too many to print them!
## The number of individual leading genes groups found for PC4 is: 421
## These groups are: Too many to print them!
We can also plot these leading genes on the spot map - as each location by definition has (potentially) a different leading gene.
## Plot leading genes
plotGWPCA_leadingG(gwpca = pcagw,
comps = 1:2,
type = "single",
arrange = FALSE)
plotGWPCA_leadingG(gwpca = pcagw,
comps = 1,
type = "multi",
arrange = FALSE)
The “multi” plot here is problematic, because there are too many groups of genes to be able to print a legible legend. The alternative below is provided to highlight gene groups that are found in at least 12 spots.
### Plot multi type (alternative)
## The data
<- pcagw$leadingGeneMulti
leadingGsMulti ## The Legend labels
<- data.frame(table(leadingGsMulti[1])) %>%
spot_labels ::rename(LeadingGs = colnames(leadingGsMulti)[1],
dplyrcount = Freq) %>%
::arrange(desc(count)) %>%
dplyrmutate(show = ifelse(count > 12, TRUE, FALSE))
## The legend breaks:
<- spot_labels %>%
spot_breaks ::filter(show == TRUE) %>%
dplyr::arrange(LeadingGs) %>%
dplyr::select(LeadingGs) %>%
dplyr"LeadingGs"]] %>%
.[[as.vector()
## The colours:
<- sum(spot_labels$show)
col_No <- getColours(col_No)
colour_values names(colour_values) <- spot_labels$LeadingGs[spot_labels$show]
<- "PC1"
pc
## The Plot:
ggplot() +
geom_sf(data = leadingGsMulti,
aes(geometry = geometry$geometry,
fill = .data[[pc]]),
colour = "grey30",
show.legend = TRUE) +
scale_fill_manual(values = colour_values,
breaks = spot_breaks,
na.value = "gray95") +
labs(title = NULL,
fill = "Group of\nLeading\nGenes") +
theme_void() +
theme(legend.position = "bottom", legend.text = element_text(size=6)) +
guides(fill = guide_legend(ncol = 3, byrow = TRUE))
4.8 Percentage of Total Variation (PTV)
Another useful diagnostic for PCA is the percentage of variability in the data explained by each of the components. Locally, this can be achieved by looking at the local.PV
component of pcagw
; this is written as pcagw$local.PV
. This is an 1161 by 20 matrix - where 1161 is the number of observations and 20 is the number of components (k
). For each location, the 20 columns correspond to the percentage of the total variance explained by each of the principal components at that location. If, say, the first two components contributed 90% of the total variance, then it is reasonable to assume that much of the variability in the data can be seen by just looking at these two components. Because this is geographically weighted PCA, this quantity varies across the map.
## Calculate the PTV for multiple Components
<- gwpca_PropVar(gwpca = pcagw, n_comp = 2:10, sfe = sfe) pcagw
## Comps_01 Comps_02 Comps_03 Comps_04
## Min. : 6.279 Min. :11.67 Min. :16.43 Min. :20.69
## 1st Qu.: 9.483 1st Qu.:16.13 1st Qu.:21.24 1st Qu.:25.69
## Median :16.782 Median :25.54 Median :30.37 Median :34.46
## Mean :17.370 Mean :25.92 Mean :31.35 Mean :35.49
## 3rd Qu.:22.534 3rd Qu.:32.87 3rd Qu.:39.42 3rd Qu.:43.81
## Max. :38.254 Max. :46.50 Max. :54.25 Max. :57.51
## Comps_05 Comps_06 Comps_07 Comps_08
## Min. :24.64 Min. :28.28 Min. :31.49 Min. :34.26
## 1st Qu.:29.65 1st Qu.:33.13 1st Qu.:36.54 1st Qu.:39.53
## Median :37.79 Median :40.86 Median :43.53 Median :46.17
## Mean :38.98 Mean :42.07 Mean :44.84 Mean :47.38
## 3rd Qu.:47.17 3rd Qu.:49.78 3rd Qu.:52.16 3rd Qu.:54.19
## Max. :60.60 Max. :62.97 Max. :65.04 Max. :67.03
## Comps_09 Comps_10
## Min. :36.76 Min. :39.15
## 1st Qu.:42.34 1st Qu.:45.05
## Median :48.60 Median :50.96
## Mean :49.73 Mean :51.91
## 3rd Qu.:56.07 3rd Qu.:57.77
## Max. :68.83 Max. :70.39
## Plot PTV
plotGWPCA_ptv(gwpca = pcagw,
comps = 1:10,
type = "violin")
## Map PTV
plotGWPCA_ptv(gwpca = pcagw,
comps = 1:6,
type = "map")
4.9 Identify discrepancies
Global PCA can be used to identify multivariate outliers. Extending this, it is also possible to use local PCA (i.e., GWPCA) to identify local outliers. One way of doing this links back to the cross-validation idea that can be used to select a bandwidth. Recall that this is based on a score of how well each observation can be reconstructed on the basis of local PCs. The score measures the total discrepancies of true data values from the reconstructed ones - and the bandwidth chosen is the one minimising this. However, the total discrepancy score is the sum of the individual discrepancies. A very large individual discrepancy associated with an observation suggests it is very different - in a multidimensional way, to the observations near to it.
## Plot the discrepancies as boxplot
plotGWPCA_discr(pcagw, type = "box")
## Plot the discrepancies map
plotGWPCA_discr(pcagw, type = "map")
## Get location data for the discrepancies
<- getDiscrepancyLocData(sfe = sfe,
discrepancy_loc_dt gwpca = pcagw,
sample_id = "JBO019")
Another possibility to understand the nature of the outlier is a parallel coordinates heatmap. Here, each observation neighbouring the location that has been found to be an outlier is shown as a column with the genes in rows. Since here we are investigating local outliers, one particular observation is highlighted in red - the outlier - and the remaining ones in grey, but with the intensity of the grey fading according to their distance from the red observation. This enables you to see what characteristic the red observation has that means it as outlying from its neighbours. The plot can be created using STExplorerDev::plotGWPCA_discrHeatmap
:
head(discrepancy_loc_dt)
## barcodes coords.pxl_col_in_fullres
## AAGTGCCTTGACTGTA-1 AAGTGCCTTGACTGTA-1 11086
## ACCCGGATGACGCATC-1 ACCCGGATGACGCATC-1 9908
## ACCTCCGTTATTCACC-1 ACCTCCGTTATTCACC-1 9113
## AGATGATGGAGTCTGG-1 AGATGATGGAGTCTGG-1 9117
## AGGTATAATTGATAGT-1 AGGTATAATTGATAGT-1 9312
## AGTGAACAAACTTCTC-1 AGTGAACAAACTTCTC-1 11088
## coords.pxl_row_in_fullres discScore
## AAGTGCCTTGACTGTA-1 5148 22793.56
## ACCCGGATGACGCATC-1 5607 24165.26
## ACCTCCGTTATTCACC-1 4255 27035.77
## AGATGATGGAGTCTGG-1 4933 27423.00
## AGGTATAATTGATAGT-1 4593 23839.25
## AGTGAACAAACTTCTC-1 5374 25111.79
## geometry
## AAGTGCCTTGACTGTA-1 POLYGON ((11020.03 5034.788...
## ACCCGGATGACGCATC-1 POLYGON ((9841.741 5494.291...
## ACCTCCGTTATTCACC-1 POLYGON ((9046.741 4142.291...
## AGATGATGGAGTCTGG-1 POLYGON ((9050.741 4820.291...
## AGGTATAATTGATAGT-1 POLYGON ((9245.076 4480.294...
## AGTGAACAAACTTCTC-1 POLYGON ((11020.91 5261.585...
<- discrepancy_loc_dt$barcodes[1:2]
focus = 3*sfe@metadata[["spotDiameter"]][["JBO019"]][["spot_diameter_fullres"]]
bw
# Plot the heatmap to visualise the genes that make this location an outlier
plotGWPCA_discrHeatmap(sfe = sfe,
assay = "logcounts",
vars = NULL,
focus = focus,
dMetric = "euclidean",
sample_id = "JBO019",
bw = bw,
mean.diff = 1,
show.vars = "top",
scale = "row",
gene.names = TRUE,
color = rev(colorRampPalette(brewer.pal(11, "RdBu"))(1000)),
fontsize_row = 3)
<- getDiscrepancyGeneData(sfe = sfe,
discrepancy_gene_dt assay = "logcounts",
vars = NULL,
focus = focus[2],
dMetric = "euclidean",
sample_id = "JBO019",
bw = bw,
mean.diff = 1,
show.vars = "top",
exportExpression = TRUE)
head(discrepancy_gene_dt)
## AACCCTACTGTCAATA-1 ACATGGCGCCAAAGTA-1 ACCCGGATGACGCATC-1
## ENSG00000078808 1.9800764 1.7845077 0.000000
## ENSG00000157916 1.1230410 2.3434121 0.000000
## ENSG00000171603 0.6681445 0.8599719 1.667949
## ENSG00000162496 1.1230410 0.8599719 0.000000
## ENSG00000074964 1.4683664 0.0000000 1.964768
## ENSG00000158828 1.1230410 0.8599719 0.000000
## ACGATCATCTTGTAAA-1 AGGGTCGATGCGAACT-1 ATAGTTCCACCCACTC-1
## ENSG00000078808 1.434549 0.0000000 0.0000000
## ENSG00000157916 1.094451 1.1491034 0.7197148
## ENSG00000171603 0.000000 0.0000000 0.0000000
## ENSG00000162496 1.094451 2.2182483 1.5560955
## ENSG00000074964 1.709571 0.6860569 0.7197148
## ENSG00000158828 1.434549 2.0159260 1.1976848
## ATATCAACCTACAGAG-1 CATCTTACACCACCTC-1 CCATCTCACCAGTGAA-1
## ENSG00000078808 1.848204 2.383137 0.7918769
## ENSG00000157916 1.356318 1.489110 2.2192853
## ENSG00000171603 0.000000 0.000000 1.3002121
## ENSG00000162496 2.214350 0.000000 2.2192853
## ENSG00000074964 1.028715 1.140629 0.7918769
## ENSG00000158828 2.632509 1.140629 0.7918769
## CCGATCTCAACCTTAT-1 CGCACGTGCGCTATCA-1 CGCTAGAGACCGCTGC-1
## ENSG00000078808 1.1456336 2.0109592 1.070644
## ENSG00000157916 1.7760473 2.4713829 1.070644
## ENSG00000171603 0.0000000 0.0000000 0.000000
## ENSG00000162496 0.6836662 1.7105588 1.678406
## ENSG00000074964 1.1456336 0.8136275 1.070644
## ENSG00000158828 0.6836662 1.3307021 1.070644
## CTAGGTCTGAAGGAAT-1 GAAAGAACAGCGTTAT-1 GCAGACCCAGCACGTA-1
## ENSG00000078808 1.1275395 2.0533150 2.1052992
## ENSG00000157916 0.0000000 1.5312450 0.8684635
## ENSG00000171603 0.6712291 0.7049757 0.0000000
## ENSG00000162496 1.1275395 2.5947724 1.4067734
## ENSG00000074964 1.4736763 1.5312450 1.4067734
## ENSG00000158828 1.1275395 1.5312450 0.8684635
## GCGCAAGAGCGCGCTG-1 GCTCGCTCATGTCCAA-1 GCTGTTGCTACCGAAC-1
## ENSG00000078808 2.505794 1.4756527 2.004239
## ENSG00000157916 1.117631 1.8764253 1.588731
## ENSG00000171603 0.000000 0.0000000 1.325320
## ENSG00000162496 2.173035 2.1897262 1.588731
## ENSG00000074964 0.000000 0.9188039 1.002827
## ENSG00000158828 1.739765 0.9188039 1.811392
## GGTTCTACTCGTCTGA-1 GTGCGACAGGGAGTGT-1 TAGACTACCTAGCGTT-1
## ENSG00000078808 1.7977169 1.633005 1.421361
## ENSG00000157916 1.4066078 1.633005 0.000000
## ENSG00000171603 0.8683433 0.000000 0.000000
## ENSG00000162496 2.3583475 1.263669 2.123256
## ENSG00000074964 0.8683433 2.170771 0.000000
## ENSG00000158828 1.4066078 1.633005 1.695032
## TAGGGAGCTTGGGATG-1 TATAGATGGTCGCAGT-1 TATATCCCTGGGAGGA-1
## ENSG00000078808 2.2316954 0.0000000 0.7102708
## ENSG00000157916 0.7985659 1.9335702 2.6061791
## ENSG00000171603 0.0000000 0.0000000 0.0000000
## ENSG00000162496 0.7985659 1.9335702 1.1841040
## ENSG00000074964 0.7985659 0.9560551 0.7102708
## ENSG00000158828 0.0000000 1.5260667 1.1841040
## TATTACCATCCTGCTT-1 TCATTTAAGTCTCCGA-1 TCTGGGAACCTTTGAA-1
## ENSG00000078808 1.9382158 2.1150669 2.3926955
## ENSG00000157916 1.0928336 1.2220876 1.4967274
## ENSG00000171603 0.6475105 0.0000000 0.6846746
## ENSG00000162496 2.4677235 1.5846069 2.0131754
## ENSG00000074964 0.6475105 1.2220876 0.6846746
## ENSG00000158828 1.0928336 0.7367522 1.4967274
## TGACATCGAGCGGACC-1 TGACGATGCACTAGAA-1 TTGAATTCACGTGAGG-1
## ENSG00000078808 1.9076262 2.4843308 1.8133018
## ENSG00000157916 1.2485083 1.3402294 2.0507006
## ENSG00000171603 0.7552962 1.0152672 0.0000000
## ENSG00000162496 2.7012885 1.3402294 2.0507006
## ENSG00000074964 1.2485083 0.5951586 0.7036449
## ENSG00000158828 1.6153854 0.0000000 1.1745524
## TTGACGCTCCATGAGC-1 gene_name
## ENSG00000078808 2.296259 SDF4
## ENSG00000157916 1.851153 RER1
## ENSG00000171603 0.000000 CLSTN1
## ENSG00000162496 1.851153 DHRS3
## ENSG00000074964 0.000000 ARHGEF10L
## ENSG00000158828 1.851153 PINK1
4.10 Final Summary
In this practical we have shown the utility of a geospatial method, GWPCA, to explore the variability of an STx dataset at the local level. By assessing features of the output of this method, we can learn things about the spatial distribution of biologically relevant gene expression.
Hopefully this, alongside the other practicals today, have given you a basic grounding in how to work with STx data and some of the practical considerations of doing so. Although we have demonstrated all of these methods with 10X Genomics Visium data, there is no reason why they are not applicable to any STx method, such as Slide-Seq or Stereo-Seq.
You can learn more about the application of geospatial methods to this liver dataset by coming to see our poster: B-122 in Poster Session B - Tuesday, July 25, between 18:00 CEST and 19:00 CEST.