4 Finding Genes with TSS-seq

In this chapter, you will learn how TSS sequencing can also be used to determine where genes are located in the genome. To follow along with the text and to answer the “Test Your Understanding” questions, start with the “TSS-seq” session link. You can also use this link as a starting point to view TSS-seq data for any gene of interest.

4.1 How TSS-seq is done

TSS stands for Transcriptional Start Site, also known as the +1 position of transcription²⁶. In this chapter, you will learn that TSS-seq is a technique in which RNA is sequenced near transcriptional start sites. You will also learn how the sequence data is displayed in the UCSC Genome Browser and how this data can be used to infer where the beginnings of genes are located in the genome. Finally, you will see how TSS-seq together with RNA-seq (Chapter 3) provides a more complete picture of where individual genes are located.

How might one identify the transcriptional start sites of most genes in the genome without prior knowledge of where those start sites are located? Recall that mRNA has two unique features: 1) a 5’ cap and 2) a 3’ poly A tail. While both can be exploited to purify mRNA away from all other RNA types, the 5’ cap can be used specifically to pinpoint the transcriptional start site (TSS), as it is only added to the beginning of a message (mRNA). By sequencing only the portion of the mRNA that is next to the 5’ cap, you can identify the TSS! This approach is dubbed TSS-seq. The resulting sequence reads are then aligned to a reference genome as is done for RNA-seq.

The steps necessary to prepare mRNA for TSS-seq are outlined in Figure 4.1. First, poly-A containing RNA (mRNA) is extracted from an organism, tissue or cell of interest as described previously (Figure 3.2). Due to the fragility of RNA in general, RNA extractions typically include full length mRNA (with a 5’ cap) and partially degraded mRNA (with a 5’ phosphate instead). You do not want to sequence the 5’ ends of partially degraded mRNA as these molecules do not pinpoint the location of the TSS but instead map to the interior of a gene.

Figure 4.1: TSS-seq protocol. A) All mRNA possessing a poly A tail is separated from total RNA. The resulting sample contains both full-length mRNA with a 5’ cap and partially degraded mRNA with a 5’ phosphate (P-). B) An enzyme that removes 5’ phosphates is added. See the P change to OH. C) An enzyme is added that removes the 5’ cap structure. See the black circle change to a P. D) RNA ligase attaches an RNA oligo (green rectangle) to all RNAs that had a 5’ phosphate in C). E) The addition of reverse transcriptase and random oligos produces cDNA fragments. Those cDNA fragments copied from RNAs that once had a 5’ cap all have the same sequence at their 3’ ends.

How can one distinguish full-length mRNA from partially degraded fragments? Scientists have devised a clever method. First, they add an enzyme that removes 5’ phosphates from all partially degraded RNA molecules in the tube. Full-length, capped mRNAs do not possess a 5’ phosphate and are thus impervious to²⁷ this step. Next, they add an enzyme that removes the 5’ cap from full-length mRNAs. Now the full-length mRNAs that were once capped possess a 5’ phosphate instead²⁸! Next, RNA ligase²⁹ is used to attach a single-stranded RNA oligo³⁰ to the 5’ end of only those RNAs that have 5’ phosphates. Finally, all the RNA in the tube is converted into cDNA using reverse transcriptase and random primers (as described previously). Importantly, only those cDNAs copied from RNAs that once had a 5’ cap are tagged with the oligo-derived sequence at their 3’ end (NOTE: Do you understand why this sequence tag is now at the 3’ end of the DNA?). This 3’ sequence tag can now be exploited for sequencing, as the sequencing reaction is primed with a DNA oligo that base pairs³¹ to the sequence tag. The short sequence reads that are collected (about 35 bp in length) are then computationally aligned to the genome and counted as described previously (Chapter 3).
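The enzymatic steps above can be followed with a small Python sketch. It is a toy model written under stated assumptions, not a laboratory protocol: the class, the function names, the oligo sequence and the two example RNAs are all invented for illustration, and reverse transcription is simplified to a full-length copy rather than random-primed fragments.

```python
from dataclasses import dataclass

# Hypothetical oligo sequence, invented for illustration only.
RNA_OLIGO = "GGACU"


@dataclass
class RNA:
    five_prime: str       # "cap", "P" (phosphate), "OH" or "oligo"
    sequence: str         # RNA sequence written 5' to 3'
    tagged: bool = False  # True once the RNA oligo has been ligated


def remove_5prime_phosphates(rna: RNA) -> RNA:
    """Step B: degraded fragments lose their 5' phosphate (P -> OH)."""
    if rna.five_prime == "P":
        return RNA("OH", rna.sequence, rna.tagged)
    return rna


def remove_cap(rna: RNA) -> RNA:
    """Step C: removing the cap leaves a 5' phosphate (cap -> P)."""
    if rna.five_prime == "cap":
        return RNA("P", rna.sequence, rna.tagged)
    return rna


def ligate_oligo(rna: RNA) -> RNA:
    """Step D: RNA ligase attaches the oligo only to 5' phosphate ends."""
    if rna.five_prime == "P":
        return RNA("oligo", RNA_OLIGO + rna.sequence, True)
    return rna


def first_strand_cdna(rna_seq: str) -> str:
    """Step E (simplified): the cDNA is the reverse complement of the RNA."""
    pairs = {"A": "T", "U": "A", "G": "C", "C": "G"}
    return "".join(pairs[base] for base in reversed(rna_seq))


def run_protocol(rna: RNA) -> RNA:
    """Apply steps B, C and D in order."""
    for step in (remove_5prime_phosphates, remove_cap, ligate_oligo):
        rna = step(rna)
    return rna


# Two made-up molecules: one full-length (capped), one partially degraded.
tagged = run_protocol(RNA("cap", "AUGGCC"))
untagged = run_protocol(RNA("P", "GCC"))

cdna = first_strand_cdna(tagged.sequence)
tag_in_cdna = first_strand_cdna(RNA_OLIGO)

print(tagged.tagged, untagged.tagged)  # only the once-capped RNA is tagged
print(cdna.endswith(tag_in_cdna))      # the tag sits at the 3' end of the cDNA
```

Because the cDNA is read in the opposite direction to the RNA it was copied from, whatever was attached to the 5’ end of the RNA ends up, as its complement, at the 3’ end of the cDNA.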

4.2 Why TSS-seq is useful

As we just saw, RNA-seq data can identify segments of the genome that are transcribed and remain in mRNA (introns are transcribed but removed from mRNA). In other words, this type of data is excellent for identifying the position of exons. Often, RNA-seq data will also reveal how exons are linked together (with split alignments). That said, it is not always sufficient for gene identification purposes. To identify an individual transcription unit (a single gene) with more confidence, it is better to know the position of both the exons and the TSS. To see this for yourself, review the RNA-seq data corresponding to a small region of the genome (Figure 4.2). The NCBI RefSeq gene prediction track is hidden on purpose. As you can see, the histogram and the sequence read alignments, including the split alignments, are displayed. How many genes do you think are in this region? What is your rationale?

Figure 4.2: RNAseq data corresponding to a small region in the genome with the base position and gene prediction tracks hidden. Guess how many genes you think map to this region of the human genome. Continue reading below to discover the answer.

In Figure 4.2, there are two main groups of split alignments. It would be reasonable to guess that there are two genes in this region. To see the true number of genes click here. This link takes you to the same genome region as in Figure 4.2 but with the gene prediction and base position tracks visible. If you guessed wrong, do you see why you were misled?

4.3 Understanding the TSS-seq Evidence Track

The UCSC Genome Browser contains numerous evidence tracks displaying TSS-seq data. We will focus on one that was generated by the FANTOM5 Consortium of labs in Japan (FANTOM stands for Functional Annotation of the Mammalian Genome). Click the link to learn more about FANTOM5. FANTOM5 is a custom evidence track that I added by clicking on “Add Custom Tracks” then searching for “FANTOM5”. To view this evidence track click the link for the TSS-seq session. You can also use this link as a starting point for questions related to TSS-seq data for any gene.

Your genome browser should now display the “Max Counts of CAGE³² reads” evidence track for BBS1. This evidence track displays TSS-seq data in histogram form only (Figure 4.3). The track settings are configured such that the Y-axis is set at 350 read counts. This can be changed at the track settings page for this evidence track. Setting the Y-axis at 350 is sufficient for most genes. For some highly expressed genes it is better to let the Y-axis autoscale. Go to the track settings page. Scan down until you see “Data view scaling:”. Change the pulldown menu from “use vertical viewing range setting” to “autoscale to data view”. Then click submit. If you do this for BBS1, the Y-axis should end at 69.0667. Before you move on, change the setting back to “use vertical viewing range setting”.

Figure 4.3: TSSseq data for BBS1.
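The difference between the two “Data view scaling” options can be sketched in a few lines of Python. This is a simplified model and not the browser's own code: the function names and read counts are invented here, and the assumption that a bar taller than a fixed axis is simply cut off at the top is mine.

```python
def y_axis_top(counts_in_view, mode="fixed", fixed_top=350):
    """Return the top of the Y-axis for the peaks currently in view."""
    if mode == "autoscale":
        # Autoscale: the tallest peak in the window sets the axis.
        return max(counts_in_view)
    # Fixed viewing range: the axis top stays at the configured value.
    return fixed_top


def drawn_fraction(count, axis_top):
    """Fraction of the track height a bar fills (clipped at the axis top)."""
    return min(count, axis_top) / axis_top


# Made-up read counts for a window containing a highly expressed gene.
toy_counts = [12, 900, 40]

top_fixed = y_axis_top(toy_counts)
top_auto = y_axis_top(toy_counts, mode="autoscale")

print(top_fixed, top_auto)
print(drawn_fraction(900, top_fixed), drawn_fraction(900, top_auto))
print(drawn_fraction(40, top_fixed), drawn_fraction(40, top_auto))
```

In this toy window the fixed axis clips the 900-read peak, while autoscaling shows it in full at the cost of making the smaller peaks look shorter.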

Zoom out 10X. The genome browser window is now focused on a larger portion of the genome surrounding BBS1 (Figure 4.4). Notice that there are both red peaks and blue peaks. There are also tall peaks and tiny peaks.

Recall that genes can be found on the top strand or the bottom strand, as RNA polymerase can use either strand as a template for transcription. If RNA polymerase uses the bottom strand as a template, then it moves from left to right (5’ to 3’) and produces an mRNA that matches the top strand sequence and is complementary to the bottom strand. If RNA polymerase uses the top strand as a template, then it moves from right to left (also 5’ to 3’ relative to the bottom strand) and produces an mRNA that matches the bottom strand sequence. Histogram peaks displayed in red represent the TSS of top strand genes. Histogram peaks displayed in blue represent the TSS of bottom strand genes.
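The color rule in the paragraph above can be written down as a small lookup table. The Python below is only a restatement of that rule for quick reference; the dictionary and function names are invented for illustration.

```python
# Each entry restates the rule from the text for one peak color.
PEAK_COLORS = {
    "red": {
        "gene_strand": "top",
        "template_strand": "bottom",
        "polymerase_direction": "left to right",
    },
    "blue": {
        "gene_strand": "bottom",
        "template_strand": "top",
        "polymerase_direction": "right to left",
    },
}


def describe_peak(color: str) -> str:
    """Summarize what a red or blue TSS-seq peak tells you."""
    info = PEAK_COLORS[color.lower()]
    return (
        f"{color.lower()} peak: {info['gene_strand']} strand gene, "
        f"template is the {info['template_strand']} strand, "
        f"RNA polymerase moves {info['polymerase_direction']}"
    )


print(describe_peak("red"))
print(describe_peak("blue"))
```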

The taller a histogram peak at a given location, the more sequence reads map to that region. The more read counts, the more reliable the data and the more likely the data represents a true TSS. In fact, the tiny peaks observed in Figure 4.4 (find each red asterisk) likely reflect background noise³³ and may not represent the position of a TSS at all. That said, the absence of a peak at a location where one was predicted to exist does not mean that the gene prediction is incorrect. This might be an instance where a gene was not expressed in the tissue sample that was used to generate the TSS-seq data. There is a common saying, “Absence of evidence is not evidence of absence”.

Figure 4.4: TSSseq data of a larger region surrounding BBS1. Each asterisk (my annotation) highlights sequence reads that are not likely to represent true transcriptional start sites.
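The link between read counts and peak height can be illustrated with a short Python sketch. It assumes, for illustration only, that each aligned read has been reduced to a (strand, position) pair marking its 5’ end; the reads and the cutoff of 3 are made up, and a real analysis would choose its threshold with care.

```python
from collections import Counter


def tss_histogram(read_starts):
    """Count how many reads begin at each (strand, position) pair."""
    return Counter(read_starts)


def prominent_peaks(histogram, min_reads):
    """Keep only the sites supported by at least min_reads reads."""
    return {site: n for site, n in histogram.items() if n >= min_reads}


# Made-up 5' ends of aligned reads: (strand, genome position).
toy_reads = [("+", 100)] * 8 + [("+", 101)] * 3 + [("-", 250)]

histogram = tss_histogram(toy_reads)
peaks = prominent_peaks(histogram, min_reads=3)

print(histogram)
print(peaks)  # the single read at ("-", 250) is treated as likely noise
```

A site that fails the cutoff is not proof that no TSS exists there; as noted above, the gene may simply not be expressed in the sample.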

Now zoom in to the TSS for BBS1 (Figure 4.5). You can see that the majority of experimentally defined TSSs are located at the predicted transcriptional start site. This is not always the case. Also, notice that there is a range of TSSs for BBS1. This is normal.

Figure 4.5: A close up view of the TSS-seq histogram data surrounding the predicted transcriptional start site for BBS1 (as defined by the NCBI refseq evidence track). The translational start site of BBS1 is shown in green.

If you zoom in even farther you can determine the exact nucleotide position of the TSS used most often in BBS1. The first templated ribonucleotide of BBS1 is the “G” at position 66,278,106 on chromosome 11 (Figure 4.6). To see how the nucleotide position is aligned to the nucleotide and the TSS-seq histogram data in the UCSC genome browser window, see the same image but annotated (Figure 4.7).

Figure 4.6: An even closer view of the TSS-seq histogram data surrounding the predicted transcriptional start site for BBS1. At this zoom level, the genome sequence and individual nucleotide positions are visible.

Figure 4.7: An annotated version of the same view of BBS1 as above.
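Finding the most frequently used TSS and its first templated nucleotide amounts to picking the tallest bar and reading the base beneath it. The Python sketch below shows the idea on invented numbers. The positions, counts and bases are toy values and not BBS1 data, the function names are hypothetical, and it assumes the displayed genome sequence is the top strand, so a bottom strand gene needs the complementary base.

```python
def most_used_tss(counts):
    """Return the position with the highest read count."""
    return max(counts, key=counts.get)


def first_templated_nucleotide(top_strand_base, gene_strand):
    """Return the first RNA base (G, A, U or C) for a TSS.

    top_strand_base is the base shown at the TSS position, assumed to be
    the top strand; gene_strand is "top" or "bottom".
    """
    complement = {"A": "T", "T": "A", "G": "C", "C": "G"}
    base = top_strand_base.upper()
    if gene_strand == "bottom":
        # mRNA from a bottom strand gene matches the bottom strand.
        base = complement[base]
    # RNA uses U where DNA has T.
    return "U" if base == "T" else base


# Toy histogram: position -> read count, plus the top strand bases.
toy_counts = {1000: 4, 1001: 27, 1002: 9}
toy_sequence = {1000: "C", 1001: "G", 1002: "A"}

position = most_used_tss(toy_counts)
print(position)
print(first_templated_nucleotide(toy_sequence[position], "top"))
print(first_templated_nucleotide(toy_sequence[position], "bottom"))
```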

Again, this is a great example of an evidence track that provides unbiased experimental evidence for the existence of a gene. It provides the position of all transcriptional start sites (TSS) within the genome. Each TSS identified by TSS-seq was not predicted to exist but was experimentally determined without prior knowledge (no bias).

That said, TSS-seq data alone is not sufficient to determine the number of individual genes in a given region of the genome. For example, review the histogram data in Figure 4.8 and try to predict the number of genes in this region. Again, I have omitted the gene prediction track. Moreover, there are individual genes that utilize alternative transcription initiation sites (review Chapter 2, “One gene, many splice variants”)!

Figure 4.8: This is an image of the TSSseq evidence track for a small region of the human genome. All other tracks have been hidden. Count the number of transcription start sites, then guess how many genes you think these TSS peaks correspond to.

Only when a sufficient amount of RNA-seq data combined with TSS-seq data is available can one determine the number of genes in a given genomic region with confidence. Click here to review the gene prediction track and RNA-seq data for the genomic region shown in Figure 4.8 to see how the RNA-seq data combined with the TSS-seq data provides a more complete picture of the true number of genes present. How does this differ from what you thought? Do you see why you were misled?

4.3.1 Test Your Understanding

The first three questions require the TSS-seq data displayed in the figure below.

How many prominent transcription start sites (>50 Max Count Reads) can be found in the genome region displayed above?
Among the TSS peaks displayed in the figure above, how many use the top strand as a template for transcription?
Among the TSS peaks displayed in the figure above, how many involve RNA polymerase traveling from left to right?

Search for a gene called DPF2. Then zoom into the region where transcription initiation occurs MOST prominently.

Where is this experimentally defined TSS peak relative to the predicted TSS for DPF2? To find the experimentally defined TSS, look at the TSS-seq data. For the predicted TSS, look at the NCBI Refseq gene prediction track³⁴.
Now determine the exact nucleotide position where transcription initiation occurs most often for DPF2 (enter a specific number).
Now determine the first templated nucleotide (G, A, U or C) for DPF2. Choose the one that is observed most often.

4.3.2 Test Your Understanding

Search for a gene called ANO9. Then zoom into the region where transcription initiation occurs MOST prominently.

Where is this experimentally defined TSS peak relative to the predicted TSS for ANO9? To find the experimentally defined TSS, look at the TSS-seq data. For the predicted TSS, look at the NCBI Refseq gene prediction track³⁵.
Now determine the exact nucleotide position where transcription initiation occurs most often for ANO9 (enter a specific number).
Now determine the first templated nucleotide (G, A, U or C) for ANO9. Choose the one that is observed most often.

Search for a gene called ZDHHC24. Then zoom into the region where transcription initiation occurs MOST prominently.

Where is this experimentally defined TSS peak relative to the predicted TSS for ZDHHC24? To find the experimentally defined TSS, look at the TSS-seq data. For the predicted TSS, look at the NCBI Refseq gene prediction track³⁶.
Now determine the exact nucleotide position where transcription initiation occurs most often for ZDHHC24 (enter a specific number).
Now determine the first templated nucleotide (G, A, U or C) for ZDHHC24. Choose the one that is observed most often.

4.3.3 For Discussion

What surprised you the most as you reviewed the TSS-seq data (there are no correct answers)?
Explain in your own words why I say that RNA-seq and TSS-seq are unbiased molecular techniques that can be used to pinpoint the position of genes.
Explain in your own words why TSS-seq data alone can be misleading when attempting to identify genes in a given region of the genome.
Explain in your own words why RNA-seq data alone can be misleading when attempting to identify genes in a given region of the genome.

4.4 Take Home Messages

TSS-seq data is similar to RNA-seq data but with TSS-seq only the mRNA segment closest to the 5’ cap is sequenced.
Thus, TSS-seq data identifies transcription start sites (TSS).
Finally, TSS-seq data combined with RNA-seq data is able to identify each transcription unit with more confidence.