4 Finding Genes with TSS-seq
The transcriptional start site or TSS marks the position of the first (+1) templated nucleotide of a transcript. In this chapter, you will learn how TSS sequencing can be used to locate the TSS. To follow along with the text and to answer the “Test Your Understanding” questions, start with the “TSS-seq” session link. You can also use this link as a starting point to view TSS-seq data for any gene of interest.
4.1 How TSS-seq is done
TSS-seq is a specialized form of RNA sequencing in which the only DNA sequenced originates from the 5’ end of the mRNAs present in a sample at the time of extraction. Review the RNA sequencing protocol from the previous chapter. In brief, RNA sequencing involves sequencing DNA fragments that originate from full-length mRNA, thereby producing sequence reads that correspond to all exons. How can RNA sequencing be modified so that the only fragments sequenced come from the 5’ end of the mRNA? Recall that mRNA has two unique features: 1) a 5’ cap and 2) a 3’ poly-A tail. TSS-seq exploits the 5’ cap to pinpoint the transcriptional start site (TSS) because the cap is added to the beginning of a message (mRNA). By sequencing only the portion of mRNA that is next to the 5’ cap, one can identify the TSS!
The steps necessary to prepare mRNA for TSS-seq are outlined in Figure 4.1. To begin, poly-A-containing RNA (mRNA) is extracted as described previously (Figure 3.2). Because RNA is fragile, however, RNA extractions typically contain both full-length mRNA (with a 5’ cap) and partially degraded mRNA (with a 5’ phosphate instead). How can one distinguish full-length mRNA from partially degraded fragments? Scientists have devised a clever method.
First, add an enzyme that removes 5’ phosphates from all partially degraded RNA molecules in the tube. Full-length, capped mRNAs do not possess an exposed 5’ phosphate and are thus not affected by this enzyme. Second, add an enzyme that removes the 5’ cap from full-length mRNAs. Now, full-length mRNAs that were once capped possess a 5’ phosphate while partially degraded mRNAs do not! Third, add RNA ligase[27] to attach a single-stranded RNA oligo[28] to the 5’ end of only those RNAs that have 5’ phosphates. Finally, convert all the RNA in the tube to DNA (cDNA) using reverse transcriptase and random primers (as described previously). Importantly, only those cDNAs made from RNAs that once had a 5’ cap carry a specific sequence tag (derived from the RNA oligo) at their 3’ end (NOTE: Do you understand why this sequence tag is now at the 3’ end of the DNA?). This sequence tag can now be exploited for sequencing: the sequencing reaction is primed with a DNA oligo that hybridizes to the sequence tag. The short sequence reads that are collected (about 35 bp in length) are then computationally aligned to the genome.
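The logic of the three enzymatic steps can be sketched as a small simulation. This is only an illustration of the reasoning above, not a model of the real chemistry: the `RNA` class, the function names, and the labels used for the 5’ end ("cap", "phosphate", "hydroxyl", "oligo") are hypothetical names chosen for this example, and every step is assumed to work with 100% efficiency.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class RNA:
    name: str
    five_prime_end: str  # "cap", "phosphate", "hydroxyl", or "oligo"


def remove_5p_phosphate(rna):
    # Step 1: exposed 5' phosphates are removed; capped RNAs are unaffected.
    if rna.five_prime_end == "phosphate":
        return replace(rna, five_prime_end="hydroxyl")
    return rna


def remove_cap(rna):
    # Step 2: removing the cap leaves a 5' phosphate behind.
    if rna.five_prime_end == "cap":
        return replace(rna, five_prime_end="phosphate")
    return rna


def ligate_oligo(rna):
    # Step 3: RNA ligase can only attach the oligo to a 5' phosphate.
    if rna.five_prime_end == "phosphate":
        return replace(rna, five_prime_end="oligo")
    return rna


def select_capped(pool):
    """Run the three steps in order and return the names of tagged RNAs."""
    for step in (remove_5p_phosphate, remove_cap, ligate_oligo):
        pool = [step(rna) for rna in pool]
    return [rna.name for rna in pool if rna.five_prime_end == "oligo"]


pool = [
    RNA("full_length_1", "cap"),
    RNA("degraded_1", "phosphate"),
    RNA("full_length_2", "cap"),
]
print(select_capped(pool))  # ['full_length_1', 'full_length_2']
```

Notice that the order of the steps matters. If the cap were removed before the 5’ phosphates, every RNA in the tube would lose its 5’ phosphate and nothing would be tagged.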
4.2 Why TSS-seq is useful
As we just saw, RNA-seq data can identify exons, segments of the genome that are transcribed and remain in mRNA (introns are transcribed but removed from mRNA). Often, RNA-seq data will also reveal how exons are linked together (with split alignments). That said, RNA-seq data is not always sufficient for gene identification purposes. To identify an individual transcription unit (a single gene) with more confidence, it is better to know the position of both the exons and the TSS. To see this for yourself, review RNA-seq data corresponding to a small region of the genome (Figure 4.2). The NCBI RefSeq gene prediction track is hidden on purpose. As you can see, the histogram and sequence read alignments, including the split alignments, are displayed. How many genes do you think are in this region? What is your rationale?
In Figure 4.2, there are two main groups of split alignments. It would be reasonable to guess that there are two genes in this region. To see the true number of genes click here. This link takes you to the same genome region as in Figure 4.2 but with the gene prediction and base position tracks visible. If you guessed wrong, do you see why you were misled?
4.3 How TSS-seq Data is Visualized
The UCSC Genome Browser contains numerous evidence tracks displaying TSS-seq data. We will focus on one that was generated by the FANTOM5 Consortium of labs in Japan (FANTOM stands for Functional Annotation of the Mammalian Genome). Click the link to learn more about FANTOM5. To view this evidence track click the TSS-seq session link.
Your genome browser should now display the “Max Counts of CAGE[29] reads” evidence track for BBS1. This evidence track displays TSS-seq data in histogram form (Figure 4.3). The track settings are configured such that the Y-axis is set at an absolute value: 350 read counts. Setting the Y-axis at 350 is sufficient for most genes. For highly expressed genes, it is better to set the Y-axis to “auto-scale to data view”. Specifically, go to the track settings page. Scan down until you see “Data view scaling:”. Change the pulldown menu from “use vertical viewing range setting” to “auto-scale to data view”. Then click submit. If you do this for BBS1, the Y-axis should end at 69.0667. Before you move on, change the Y-axis back to “use vertical viewing range setting”.
Zoom out 10X. The genome browser window is now focused on a larger portion of the genome surrounding BBS1 (Figure 4.4). Notice that there are both red peaks and blue peaks. There are also tall peaks and tiny peaks.
Recall that genes can be found on the top strand or the bottom strand because RNA polymerase can use either strand as a template for transcription. If RNA polymerase uses the bottom strand as template, it moves from left to right (5’ to 3’ relative to the top strand) and produces an mRNA that matches the top strand sequence and is complementary to the bottom strand. If RNA polymerase uses the top strand as template, it moves from right to left (5’ to 3’ relative to the bottom strand) and produces an mRNA that matches the bottom strand sequence and is complementary to the top strand. Histogram peaks displayed in red represent the TSSs of top strand genes. Histogram peaks displayed in blue represent the TSSs of bottom strand genes.
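The strand rules above can be checked with a short sketch. It assumes the top strand is written 5’ to 3’ from left to right, uses "+" for top strand genes and "-" for bottom strand genes, and works on a made-up six-base sequence. The function and variable names are hypothetical; the color mapping simply restates the red/blue convention described in the text.

```python
# Peak colors as described in the text: red = top strand, blue = bottom strand.
PEAK_COLOR = {"+": "red", "-": "blue"}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the opposite strand, also written 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def mrna_from_region(top_strand, gene_strand):
    """Return the mRNA (5' to 3') made from a region given as its top strand.

    gene_strand is "+" for a top strand gene (bottom strand is the template)
    or "-" for a bottom strand gene (top strand is the template).
    """
    if gene_strand == "+":
        matching_strand = top_strand
    elif gene_strand == "-":
        matching_strand = reverse_complement(top_strand)
    else:
        raise ValueError("gene_strand must be '+' or '-'")
    # The mRNA matches the non-template strand, with U in place of T.
    return matching_strand.replace("T", "U")


top = "ATGGCC"  # made-up example sequence
print(mrna_from_region(top, "+"), PEAK_COLOR["+"])  # AUGGCC red
print(mrna_from_region(top, "-"), PEAK_COLOR["-"])  # GGCCAU blue
```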
The taller a histogram peak at a given location, the more sequence reads start at that spot. The higher the read count, the more reliable the data and the more likely it is that the peak represents a true TSS. In fact, the tiny peaks observed in Figure 4.4 (find each red asterisk) may reflect background noise[30] and may not represent true TSSs. That said, the absence of a peak at a location where one was predicted to exist does not mean that the gene prediction is incorrect. This might be an instance where the gene was not expressed in the tissue sample that was used to generate the TSS-seq data. There is a common saying, “Absence of evidence is not evidence of absence”.
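To make the link between reads and peak height concrete, here is a minimal sketch of how such a histogram could be built from aligned reads. It is not the FANTOM5 pipeline. It assumes each alignment is a `(start, end, strand)` tuple in 0-based, half-open coordinates, that the 5’ end of each read marks the TSS, and that a cutoff of 3 reads separates signal from noise. The coordinates, the function names, and the `min_reads` cutoff are all made up for illustration.

```python
from collections import Counter


def tss_histogram(alignments):
    """Count how many reads begin (5' end) at each position on each strand."""
    counts = Counter()
    for start, end, strand in alignments:
        # A top strand read begins at its leftmost base;
        # a bottom strand read begins at its rightmost base.
        five_prime = start if strand == "+" else end - 1
        counts[(strand, five_prime)] += 1
    return counts


def likely_tss(counts, min_reads=3):
    """Keep only peaks with at least min_reads reads (arbitrary cutoff)."""
    return {peak: n for peak, n in counts.items() if n >= min_reads}


alignments = [
    (100, 135, "+"), (100, 135, "+"), (100, 135, "+"),  # tall top strand peak
    (102, 137, "+"),                                    # tiny peak, maybe noise
    (465, 500, "-"), (465, 500, "-"), (465, 500, "-"),  # tall bottom strand peak
]
counts = tss_histogram(alignments)
print(likely_tss(counts))  # {('+', 100): 3, ('-', 499): 3}
```

A position with no reads simply never appears in `counts`, which is why a missing peak says nothing about whether a gene exists there.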
Now zoom in to the TSS for BBS1 (Figure 4.5). You can see that the position in the genome where the majority of experimentally defined TSSs are located is at the predicted transcriptional start site. This is not always the case. Also, notice that there is a range of TSSs for BBS1. This is normal.
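One way to describe a cluster of nearby start sites like the one at BBS1 is to report the most-used position, how much of the signal it accounts for, and how far it sits from the predicted TSS. The sketch below does this for a made-up cluster. The positions, read counts, and the `summarize_tss_cluster` function are hypothetical and are not taken from the BBS1 data.

```python
def summarize_tss_cluster(position_counts, annotated_tss):
    """Summarize a cluster given as {genomic position: read count}."""
    dominant = max(position_counts, key=position_counts.get)
    total = sum(position_counts.values())
    return {
        "dominant": dominant,
        "fraction_at_dominant": position_counts[dominant] / total,
        "range": (min(position_counts), max(position_counts)),
        "offset_from_annotation": dominant - annotated_tss,
    }


# Made-up cluster: most reads start at 1005, a few start nearby.
cluster = {1000: 2, 1003: 5, 1005: 40, 1008: 3}
summary = summarize_tss_cluster(cluster, annotated_tss=1005)
print(summary["dominant"], summary["range"], summary["offset_from_annotation"])
# 1005 (1000, 1008) 0
```

An offset of 0 corresponds to the situation in Figure 4.5, where the tallest peak sits on the predicted start site. A nonzero offset would correspond to the cases where it does not.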
Again, this is a great example of an evidence track that provides unbiased experimental evidence for the existence of a gene. It provides the position of the transcriptional start sites (TSSs) that were in use in the samples tested. The TSSs identified by TSS-seq were not predicted to exist but were experimentally determined without prior knowledge (no bias).
That said, TSS-seq data is not always sufficient to determine the number of individual genes in a given region of the genome. For example, review the TSS-seq histogram data in Figure 4.6 and try to predict the number of genes in this region. I have omitted the gene prediction track to illustrate a point.
Now click here to see how many genes there really are in the genomic region shown in Figure 4.6. How does this differ from what you thought? Do you see why you were misled? Notice that the link has been configured to include the gene prediction track along with the RNA-seq and TSS-seq data tracks. Do you see how the RNA-seq data combined with the TSS-seq data provides a more complete picture of the true number of genes present? After all, there are genes that utilize alternative transcription initiation sites (review Chapter 2, “One gene, many splice variants”)!
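The reasoning here can be sketched in code. The example below is a simplified heuristic, not an established gene-finding method: it assumes that RNA-seq split alignments have already told us which exons each TSS peak is connected to, and it treats TSS peaks whose transcripts share at least one exon as alternative start sites of the same gene. The peak names, exon names, and the `group_tss_peaks` function are hypothetical.

```python
def group_tss_peaks(tss_to_exons):
    """Group TSS peaks whose transcripts share at least one exon.

    tss_to_exons maps each TSS peak name to the set of exons that split
    alignments connect to it. Returns a list of groups of peak names,
    one group per inferred transcription unit.
    """
    units = []  # each entry: (set of peak names, set of exons)
    for peak, exons in tss_to_exons.items():
        members, merged_exons = {peak}, set(exons)
        untouched = []
        for unit_peaks, unit_exons in units:
            if unit_exons & merged_exons:
                # Shared exon: treat as the same gene with alternative TSSs.
                members |= unit_peaks
                merged_exons |= unit_exons
            else:
                untouched.append((unit_peaks, unit_exons))
        units = untouched + [(members, merged_exons)]
    return [sorted(peaks) for peaks, _ in units]


# Made-up region with three TSS peaks.
tss_to_exons = {
    "peak_A": {"exon1", "exon2", "exon3"},
    "peak_B": {"exon2", "exon3"},  # alternative start site, same exons downstream
    "peak_C": {"exon7", "exon8"},
}
print(group_tss_peaks(tss_to_exons))  # [['peak_A', 'peak_B'], ['peak_C']]
```

Counting peaks alone would suggest three genes in this made-up region. Adding the exon connections from RNA-seq suggests two.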
4.3.1 Test Your Understanding
Some of the TYU questions will require the TSS-seq data displayed in the figure below (the gene prediction track is hidden).
Click on the link above to see the questions.
4.3.2 For Discussion
Click link to see the discussion questions.
4.4 Take Home Messages
- TSS-seq data is similar to RNA-seq data, but with TSS-seq only the mRNA segment closest to the 5’ cap is sequenced.
- Whereas TSS-seq data identifies the transcription start sites (TSSs) of expressed genes, RNA-seq data identifies exons.
- TSS-seq data combined with RNA-seq data identifies each transcription unit with more confidence than either data type alone.
© 2024, Maria Gallegos. All rights reserved.
27. RNA ligase is similar to DNA ligase in that both require a 5’ phosphate and a 3’ hydroxyl to form a phosphodiester bond between two nucleic acid chains. The main difference: RNA ligase links two RNA chains together while DNA ligase links two DNA chains together.
28. Short single-stranded RNA chain.
29. CAGE is short for “Cap Analysis Gene Expression”.
30. Molecular biology is messy. Each enzymatic step listed in Figure 4.1 is not likely to achieve 100% success. For example, partial mRNAs that fail to have their 5’ phosphates removed can be sequenced alongside all the true full-length mRNAs.