Chapter 3 Genotypic data filtering
3.1 Downloading RNA-seq marker data
The data we are using comes from (Mazaheri et al. 2019), which includes ~900K RNA-seq derived SNP markers for 942 diverse maize inbred lines.
The data is in Hapmap format. More details about Hapmap format can be found here
3.2 SNP filtering
Lets start by using TASSEL https://www.maizegenetics.net/tassel (Bradbury et al. 2007) for SNP data filtering. There are many plugins available for different types of analyses, and I encourage you to read more by following the link.
The commands below are run when using the standalone version of TASSEL. These tasks can also be accomplished using the Graphical User Interphase (GUI) of TASSEL.
Set path to where your TASSEL standalone is installed
tassel=/home/jrodriguez36/tassel-5-standalone/run_pipeline.pl
Change directory to where data was downloaded to
cd ~/PGRP_mapping_workshop/data/genotypic_and_phenotypic_data/
Filter for Minor allele frequency > 0.05 and set heterozygous calls to NA
perl $tassel -Xms2G -Xmx32G -h widiv_942g_899784SNPs_imputed_filteredGenos_noRTA_AGPv4.hmp.txt -filterAlign -filterAlignMinFreq 0.05 -homozygous -export filtered_widiv_942g_899784SNPs_imputed_filteredGenos_noRTA_AGPv4 -exportType Hapmap
The “-Xms2G” and “-Xmx32G” flags set how much RAM (min and max) TASSEL will be allocated to use. You can change depending on your machines hardware.
The “-h” flag tells TASSEL that the data we are loading is in Hapmap format.
The “-filterAlign -filterAlignMinFreq 0.05” flags are for filtering SNPs with MAF below 0.05.
The “-homozygous” flag converts any heterozygous values to unknown. (since we are working with inbred lines this makes sense to use here).
The “-export” flag tells TASSEL to export the output.
The “-exportType” flag tells TASSEL which format to export the data as.
3.3 SNP subsetting
For the remainder of the exercises, we are working with a subset of ~5000 markers by sampling ~500 SNP markers from each of the 10 chromosomes in maize.
R Packages used in this section are: data.table (Dowle and Srinivasan 2021) and tidyverse (Wickham 2021).
Load packages
library(data.table)
library(tidyverse)
Set working directory to where subset hapmap file is in workshop materials
setwd("~/PGRP_mapping_workshop/data/genotypic_and_phenotypic_data/")
Read in SNP data
= fread("filtered_widiv_942g_899784SNPs_imputed_filteredGenos_noRTA_AGPv4.hmp.txt") snps
We will use the p1 gene in maize (Zhang and Peterson 2005) as an example for many of the mapping exercises. In the genotypic data, there are no SNPs within the transcription start and stop sites of the p1 gene annotation, but we will keep SNPs within 1Mb of this gene.
Keep SNPs within 1Mb P1
= fread("snps_in_genes1Mb.txt")
snps_in_genes = snps_in_genes[snps_in_genes$gene=="Zm00001d028854",] p1_snps
Subset these SNPs from SNP file
= snps[snps$`rs#`%in%p1_snps$snp] p1_snp_matrix
Write the p1 SNP matrix file
fwrite(p1_snp_matrix,
file = "p1_snps_subset_widiv_942g_899784SNPs_imputed_filteredGenos_noRTA_AGPv4.hmp.txt",
col.names = TRUE, sep = "\t", quote = FALSE,na = "NA")
Select 500 SNPs per chromosome
set.seed(1234)
= snps %>%
snps_subset group_by(chrom) %>%
sample_n(500)
Remove any of the p1 SNPs which were selected
= snps_subset[!snps_subset$`rs#`%in%p1_snp_matrix$`rs#`,] snps_subset
Row bind the p1 snp matrix and the 5000 selected SNPs
= rbind(p1_snp_matrix,snps_subset) snps_subset
Sort by chromosome and position
= snps_subset %>%
snps_subset arrange(chrom,pos)
Write the subset SNP file
fwrite(snps_subset,
file = "subset_widiv_942g_899784SNPs_imputed_filteredGenos_noRTA_AGPv4.hmp.txt",
col.names = TRUE, sep = "\t", quote = FALSE,na = "NA")