Chapter 3 Genotypic data filtering

3.1 Downloading RNA-seq marker data

The data we are using come from Mazaheri et al. (2019) and include ~900K RNA-seq-derived SNP markers for 942 diverse maize inbred lines.

The data is in Hapmap format. More details about the Hapmap format can be found in the TASSEL user manual.
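To make the layout concrete, here is a minimal sketch of a Hapmap table built in R: 11 fixed metadata columns followed by one genotype column per line. The SNP names, positions, and line names (`line_A`, `line_B`) are made up for illustration and are not taken from the Mazaheri et al. data. Genotypes are shown as two-letter calls; Hapmap files may also use single-letter IUPAC codes.

```r
library(data.table)

# Toy Hapmap table: 11 fixed metadata columns, then one column per line.
# All values below are invented for illustration.
hmp <- data.table(
  `rs#`       = c("snp_1", "snp_2"),
  alleles     = c("A/G", "C/T"),
  chrom       = c(1L, 1L),
  pos         = c(1200L, 5300L),
  strand      = "+",
  `assembly#` = NA,
  center      = NA,
  protLSID    = NA,
  assayLSID   = NA,
  panelLSID   = NA,
  QCcode      = NA,
  line_A      = c("AA", "CC"),
  line_B      = c("GG", "CT")
)

# Genotype columns are everything after the 11 metadata columns
geno_cols <- names(hmp)[-(1:11)]
print(geno_cols)
```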

3.2 SNP filtering

Let's start by using TASSEL (Bradbury et al. 2007) for SNP data filtering. TASSEL offers many plugins for different types of analyses, and I encourage you to read more about them in the TASSEL documentation.

The commands below are for the standalone (command-line) version of TASSEL. The same tasks can also be accomplished using the TASSEL graphical user interface (GUI).

Set path to where your TASSEL standalone is installed

Change directory to where data was downloaded to

Filter out SNPs with a minor allele frequency (MAF) below 0.05 and set heterozygous calls to NA

The “-Xms2G” and “-Xmx32G” flags set the minimum (2 GB) and maximum (32 GB) amount of RAM that TASSEL is allowed to use. You can change these values depending on your machine's hardware.

The “-h” flag tells TASSEL that the data we are loading is in Hapmap format.

The “-filterAlign -filterAlignMinFreq 0.05” flags remove SNPs with a MAF below 0.05.

The “-homozygous” flag converts any heterozygous calls to unknown. Since we are working with inbred lines, this makes sense to use here.

The “-export” flag tells TASSEL to export the output.

The “-exportType” flag tells TASSEL which format to export the data as.
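Putting the flags described above together, the call might look like the sketch below. It is assembled in R only to stay consistent with the rest of this chapter; the chapter's own steps run the command directly in a shell. The install path, data path, file names, and the `Hapmap` export type are assumptions for illustration, and `run_pipeline.pl` is the launcher script that ships with the TASSEL 5 standalone. The block only prints the command; the actual call is left commented out.

```r
# Hypothetical locations and file names: replace with your own
tassel_dir <- "/path/to/tassel-5-standalone"
data_dir   <- "/path/to/downloaded/data"
in_hmp     <- "snps.hmp.txt"
out_prefix <- "snps_maf0.05_homozygous"

tassel_args <- c(
  "-Xms2G", "-Xmx32G",            # min and max RAM for TASSEL
  "-h", in_hmp,                   # input file is in Hapmap format
  "-filterAlign",
  "-filterAlignMinFreq", "0.05",  # remove SNPs with MAF below 0.05
  "-homozygous",                  # heterozygous calls become unknown
  "-export", out_prefix,          # write the filtered data
  "-exportType", "Hapmap"         # assumed export format
)

tassel_cmd <- file.path(tassel_dir, "run_pipeline.pl")

# Print the full command line so it can be checked or pasted into a shell
cat(tassel_cmd, tassel_args, "\n")

# Uncomment to run from R once the paths above point to real files:
# setwd(data_dir)
# system2(tassel_cmd, args = tassel_args)
```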

3.3 SNP subsetting

For the remainder of the exercises, we will work with a subset of ~5,000 markers, obtained by sampling ~500 SNP markers from each of the 10 maize chromosomes.

The R packages used in this section are data.table (Dowle and Srinivasan 2020) and tidyverse (Wickham 2019).

Load packages

Set the working directory to the folder in the workshop materials that contains the subset Hapmap file

Read in SNP data
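The loading and reading steps above can be sketched as follows. Because the name of the workshop file is not given here, the example writes a tiny toy Hapmap file to a temporary folder and reads it back; `toy_subset.hmp.txt` and its contents are hypothetical. Only data.table is needed for this sketch, although the chapter also loads tidyverse. In practice you would call `setwd()` on the workshop folder instead of using `tempdir()`.

```r
library(data.table)

# Stand-in for the workshop folder and file (both hypothetical)
work_dir <- tempdir()
hmp_file <- file.path(work_dir, "toy_subset.hmp.txt")

# Tiny toy Hapmap table so the example can run without the real data
toy <- data.table(
  `rs#` = paste0("snp_", 1:4), alleles = "A/G",
  chrom = c(1L, 1L, 2L, 2L), pos = c(100L, 900L, 250L, 700L),
  strand = "+", `assembly#` = NA, center = NA, protLSID = NA,
  assayLSID = NA, panelLSID = NA, QCcode = NA,
  line_A = c("AA", "GG", "AA", "NN"), line_B = c("GG", "GG", "AA", "AA")
)
fwrite(toy, hmp_file, sep = "\t", quote = FALSE, na = "NA")

# Read in the SNP data; fread keeps column names such as "rs#" unchanged
snps <- fread(hmp_file, sep = "\t", header = TRUE)

dim(snps)           # number of SNPs (rows) and columns
table(snps$chrom)   # SNP count per chromosome
```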

We will use the p1 gene in maize (Zhang and Peterson 2005) as an example for many of the mapping exercises. In the genotypic data, there are no SNPs between the transcription start and stop sites of the p1 gene annotation, so we will instead keep SNPs within 1 Mb of this gene.

Keep SNPs within 1 Mb of p1

Subset these SNPs from SNP file

Write the p1 SNP matrix file
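A minimal sketch of the three p1 steps above, using a toy SNP table. The gene coordinates (`p1_chr`, `p1_start`, `p1_end`) are placeholders and are not the real p1 annotation; substitute the coordinates from the genome annotation you are using. The output file name is also hypothetical.

```r
library(data.table)

# Toy SNP table standing in for the Hapmap data (only the columns used here)
snps <- data.table(
  `rs#`  = paste0("snp_", 1:6),
  chrom  = c(1L, 1L, 1L, 1L, 2L, 2L),
  pos    = c(500000L, 2500000L, 3900000L, 6000000L, 2500000L, 3000000L),
  line_A = c("AA", "GG", "AA", "NN", "CC", "TT")
)

# PLACEHOLDER coordinates, not the real p1 annotation
p1_chr   <- 1L
p1_start <- 3000000L
p1_end   <- 3010000L
window   <- 1e6   # 1 Mb on either side of the gene

# Keep SNPs on the p1 chromosome that fall within 1 Mb of the gene
p1_snps <- snps[chrom == p1_chr &
                pos >= p1_start - window &
                pos <= p1_end + window]

# Write the p1 SNP matrix file (hypothetical file name)
p1_file <- file.path(tempdir(), "p1_snps.hmp.txt")
fwrite(p1_snps, p1_file, sep = "\t", quote = FALSE)

p1_snps$`rs#`
```

The chromosome test matters: without it, SNPs at similar positions on other chromosomes (such as `snp_5` in the toy table) would be kept by mistake.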

Select 500 SNPs per chromosome

Remove any of the p1 SNPs which were selected

Row bind the p1 SNP matrix and the 5,000 selected SNPs

Sort by chromosome and position

Write the subset SNP file
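The sampling, de-duplication, binding, sorting, and writing steps above might be sketched as below. This is one possible data.table implementation under stated assumptions, not necessarily the code used to build the workshop file: the toy data has 20 SNPs per chromosome, `n_per_chr` is set to 5 rather than 500, the p1 stand-in and the output file name are hypothetical, and the seed is fixed only to make the sample reproducible.

```r
library(data.table)
set.seed(1)   # make the random sample reproducible

# Toy data: 10 chromosomes x 20 SNPs (the real file has far more)
snps <- data.table(
  `rs#` = paste0("snp_", 1:200),
  chrom = rep(1:10, each = 20),
  pos   = rep(seq(1000L, by = 1000L, length.out = 20), times = 10)
)
p1_snps   <- snps[chrom == 1 & pos <= 3000]  # stand-in for the p1 SNP matrix
n_per_chr <- 5                               # 500 in the chapter

# Select n_per_chr SNPs at random from each chromosome
sampled <- snps[, .SD[sample(.N, min(.N, n_per_chr))], by = chrom]

# Remove any p1 SNPs that were sampled, so they are not duplicated
sampled <- sampled[!`rs#` %in% p1_snps$`rs#`]

# Row bind the p1 SNPs and the sampled SNPs; use.names = TRUE aligns
# columns by name because grouping moved chrom to the front of `sampled`
subset_snps <- rbind(p1_snps, sampled, use.names = TRUE)

# Sort by chromosome and position
setorder(subset_snps, chrom, pos)

# Write the subset SNP file (hypothetical file name)
out_file <- file.path(tempdir(), "subset_snps.hmp.txt")
fwrite(subset_snps, out_file, sep = "\t", quote = FALSE)
```

Because sampled SNPs that overlap the p1 set are dropped rather than replaced, the final count is approximate, which is consistent with the "~5000 markers" described above.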


Bradbury, Peter J., Zhiwu Zhang, Dallas E. Kroon, Terry M. Casstevens, Yogesh Ramdoss, and Edward S. Buckler. 2007. “TASSEL: software for association mapping of complex traits in diverse samples.” Bioinformatics 23 (19): 2633–5.

Dowle, Matt, and Arun Srinivasan. 2020. Data.table: Extension of ‘Data.frame’.

Mazaheri, Mona, Marlies Heckwolf, Brieanne Vaillancourt, Joseph L. Gage, Brett Burdo, Sven Heckwolf, Kerrie Barry, et al. 2019. “Genome-Wide Association Analysis of Stalk Biomass and Anatomical Traits in Maize.” BMC Plant Biology 19 (1): 45.

Wickham, Hadley. 2019. Tidyverse: Easily Install and Load the Tidyverse.

Zhang, Feng, and Thomas Peterson. 2005. “Comparisons of Maize Pericarp Color1 Alleles Reveal Paralogous Gene Recombination and an Organ-Specific Enhancer Region.” The Plant Cell 17 (3): 903–14.