Chapter 3 Genotypic data filtering

3.1 Downloading RNA-seq marker data

The data we are using comes from (Mazaheri et al. 2019), which includes ~900K RNA-seq derived SNP markers for 942 diverse maize inbred lines.

The data is in Hapmap format. More details about Hapmap format can be found here

3.2 SNP filtering

Lets start by using TASSEL https://www.maizegenetics.net/tassel (Bradbury et al. 2007) for SNP data filtering. There are many plugins available for different types of analyses, and I encourage you to read more by following the link.

The commands below are run when using the standalone version of TASSEL. These tasks can also be accomplished using the Graphical User Interphase (GUI) of TASSEL.

Set path to where your TASSEL standalone is installed

tassel=/home/jrodriguez36/tassel-5-standalone/run_pipeline.pl

Change directory to where data was downloaded to

cd ~/PGRP_mapping_workshop/data/genotypic_and_phenotypic_data/

Filter for Minor allele frequency > 0.05 and set heterozygous calls to NA

perl $tassel -Xms2G -Xmx32G -h widiv_942g_899784SNPs_imputed_filteredGenos_noRTA_AGPv4.hmp.txt -filterAlign -filterAlignMinFreq 0.05 -homozygous -export filtered_widiv_942g_899784SNPs_imputed_filteredGenos_noRTA_AGPv4 -exportType Hapmap

The “-Xms2G” and “-Xmx32G” flags set how much RAM (min and max) TASSEL will be allocated to use. You can change depending on your machines hardware.

The “-h” flag tells TASSEL that the data we are loading is in Hapmap format.

The “-filterAlign -filterAlignMinFreq 0.05” flags are for filtering SNPs with MAF below 0.05.

The “-homozygous” flag converts any heterozygous values to unknown. (since we are working with inbred lines this makes sense to use here).

The “-export” flag tells TASSEL to export the output.

The “-exportType” flag tells TASSEL which format to export the data as.

3.3 SNP subsetting

For the remainder of the exercises, we are working with a subset of ~5000 markers by sampling ~500 SNP markers from each of the 10 chromosomes in maize.

R Packages used in this section are: data.table (Dowle and Srinivasan 2021) and tidyverse (Wickham 2021).

Load packages

library(data.table)
library(tidyverse)

Set working directory to where subset hapmap file is in workshop materials

setwd("~/PGRP_mapping_workshop/data/genotypic_and_phenotypic_data/")

Read in SNP data

snps = fread("filtered_widiv_942g_899784SNPs_imputed_filteredGenos_noRTA_AGPv4.hmp.txt")

We will use the p1 gene in maize (Zhang and Peterson 2005) as an example for many of the mapping exercises. In the genotypic data, there are no SNPs within the transcription start and stop sites of the p1 gene annotation, but we will keep SNPs within 1Mb of this gene.

Keep SNPs within 1Mb P1

snps_in_genes = fread("snps_in_genes1Mb.txt")
p1_snps = snps_in_genes[snps_in_genes$gene=="Zm00001d028854",]

Subset these SNPs from SNP file

p1_snp_matrix = snps[snps$`rs#`%in%p1_snps$snp]

Write the p1 SNP matrix file

fwrite(p1_snp_matrix,
       file  = "p1_snps_subset_widiv_942g_899784SNPs_imputed_filteredGenos_noRTA_AGPv4.hmp.txt",
       col.names = TRUE, sep = "\t", quote = FALSE,na = "NA")

Select 500 SNPs per chromosome

set.seed(1234)
snps_subset = snps %>%
  group_by(chrom) %>%
  sample_n(500)

Remove any of the p1 SNPs which were selected

snps_subset = snps_subset[!snps_subset$`rs#`%in%p1_snp_matrix$`rs#`,]

Row bind the p1 snp matrix and the 5000 selected SNPs

snps_subset = rbind(p1_snp_matrix,snps_subset)

Sort by chromosome and position

snps_subset = snps_subset %>% 
  arrange(chrom,pos)

Write the subset SNP file

fwrite(snps_subset,
       file  = "subset_widiv_942g_899784SNPs_imputed_filteredGenos_noRTA_AGPv4.hmp.txt",
       col.names = TRUE, sep = "\t", quote = FALSE,na = "NA")

References

Bradbury, Peter J., Zhiwu Zhang, Dallas E. Kroon, Terry M. Casstevens, Yogesh Ramdoss, and Edward S. Buckler. 2007. TASSEL: software for association mapping of complex traits in diverse samples.” Bioinformatics 23 (19): 2633–35. https://doi.org/10.1093/bioinformatics/btm308.
Dowle, Matt, and Arun Srinivasan. 2021. Data.table: Extension of ‘Data.frame‘. https://CRAN.R-project.org/package=data.table.
Mazaheri, Mona, Marlies Heckwolf, Brieanne Vaillancourt, Joseph L. Gage, Brett Burdo, Sven Heckwolf, Kerrie Barry, et al. 2019. “Genome-Wide Association Analysis of Stalk Biomass and Anatomical Traits in Maize.” BMC Plant Biology 19 (1): 45. https://doi.org/10.1186/s12870-019-1653-x.
Wickham, Hadley. 2021. Tidyverse: Easily Install and Load the Tidyverse. https://CRAN.R-project.org/package=tidyverse.
Zhang, Feng, and Thomas Peterson. 2005. “Comparisons of Maize Pericarp Color1 Alleles Reveal Paralogous Gene Recombination and an Organ-Specific Enhancer Region.” The Plant Cell 17 (3): 903–14.