4 Finding Genes with TSS-seq

The transcriptional start site or TSS marks the position of the first (+1) templated nucleotide of a transcript. In this chapter, you will learn how TSS sequencing can be used to locate the TSS. To follow along with the text and to answer the “Test Your Understanding” questions, start with the “TSS-seq” session link. You can also use this link as a starting point to view TSS-seq data for any gene of interest.

4.1 How TSS-seq is done

TSS-seq is a special type of RNA sequencing in which the only DNA sequenced originates from the 5’ end of the mRNAs present in a sample at the time of extraction. Review the RNA sequencing protocol from the previous chapter. In brief, RNA sequencing involves sequencing DNA fragments that originate from full-length mRNAs, thereby producing sequence reads that correspond to all exons. How can RNA sequencing be modified so that the only fragments sequenced come from the 5’ end of the mRNA? Recall that mRNA has two unique features: 1) a 5’ cap and 2) a 3’ poly A tail. While both can be exploited to purify mRNA away from all other RNA types, the 5’ cap can be used specifically to pinpoint the transcriptional start site (TSS) because it is added only to the beginning of a message (mRNA). By sequencing only the portion of the mRNA that is next to the 5’ cap, you can identify the TSS!

The steps necessary to prepare mRNA for TSS-seq are outlined in Figure 4.1. First, poly-A containing RNA (mRNA) is extracted from an organism, tissue or cell of interest as described previously (Figure 3.2). Due to the fragility of RNA in general, RNA extractions typically include full length mRNA (with a 5’ cap) and partially degraded mRNA (with a 5’ phosphate instead). You do not want to sequence the 5’ ends of partially degraded mRNA as these molecules do not pinpoint the location of the TSS but instead map to the interior of a gene.


Figure 4.1: TSS-seq protocol. A) All mRNA possessing a poly A tail is separated from total RNA. The resulting sample contains both full-length mRNA with a 5’ cap and partially degraded mRNA with a 5’ phosphate (P-). B) An enzyme that removes 5’ phosphates is added. See the P change to OH. C) An enzyme is added that removes the 5’ cap structure. See the black circle change to a P. D) RNA ligase attaches an RNA oligo (green rectangle) to all RNAs that had a 5’ phosphate in C). E) The addition of reverse transcriptase and random oligos produces cDNA fragments. Those cDNA fragments derived from RNAs that once had a 5’ cap all have the same sequence at their 3’ ends.


How can one distinguish full-length mRNA from partially degraded fragments? Scientists have devised a clever method. First, they add an enzyme that removes 5’ phosphates from all partially degraded RNA molecules in the tube. Full-length, capped mRNAs do not possess a 5’ phosphate and are thus impervious to[1] this step. Next, they add an enzyme that removes the 5’ cap from full-length mRNAs. Now the full-length mRNAs that were once capped possess a 5’ phosphate instead[2]! Next, RNA ligase[3] is used to attach a single-stranded RNA oligo[4] to the 5’ end of only those RNAs that have 5’ phosphates. Finally, all the RNA in the tube is converted into cDNA using a reverse transcriptase and random primers (as described previously). Importantly, only those cDNAs derived from RNAs that once had a 5’ cap are tagged with a sequence derived from the RNA oligo at their 3’ end (NOTE: Do you understand why this sequence is now at the 3’ end of the DNA?). This 3’ sequence tag can now be exploited for sequencing because the sequencing reaction is primed with a DNA oligo that base pairs[5] to the sequence tag. The short sequence reads that are collected (about 35 bp in length) are then computationally aligned to the genome.
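The logic of the enzymatic steps can be traced with a short Python sketch. It is a deliberately simplified model, not a description of any real software: the oligo sequence and the two example RNAs are invented, each RNA is reduced to the chemical state of its 5’ end, and reverse transcription is assumed to copy the whole molecule (in the real protocol random primers produce shorter fragments). It also shows why the tag ends up at the 3’ end of the cDNA: the cDNA is the reverse complement of the RNA.

```python
from dataclasses import dataclass

# Hypothetical oligo sequence, invented for this sketch.
OLIGO = "GUUCAGAGUUCUACAG"


@dataclass(frozen=True)
class RNA:
    name: str
    five_prime: str  # "cap", "P" (phosphate), "OH" (hydroxyl), or "oligo"
    seq: str         # written 5' -> 3'


def remove_phosphate(rna):
    """Step B: a 5' phosphate becomes a 5' hydroxyl; capped RNA is untouched."""
    end = "OH" if rna.five_prime == "P" else rna.five_prime
    return RNA(rna.name, end, rna.seq)


def remove_cap(rna):
    """Step C: removing the cap leaves a 5' phosphate behind."""
    end = "P" if rna.five_prime == "cap" else rna.five_prime
    return RNA(rna.name, end, rna.seq)


def ligate_oligo(rna):
    """Step D: the oligo is attached only to RNAs with a 5' phosphate."""
    if rna.five_prime == "P":
        return RNA(rna.name, "oligo", OLIGO + rna.seq)
    return rna


def reverse_transcribe(rna):
    """Step E (simplified): return the cDNA, written 5' -> 3'."""
    pairs = {"A": "T", "U": "A", "G": "C", "C": "G"}
    return "".join(pairs[base] for base in reversed(rna.seq))


def prepare(rna):
    """Apply steps B, C and D in order."""
    return ligate_oligo(remove_cap(remove_phosphate(rna)))


def cdna_is_tagged(rna):
    """True if the cDNA carries the oligo-derived tag at its 3' end."""
    tag = reverse_transcribe(RNA("oligo", "OH", OLIGO))
    return reverse_transcribe(prepare(rna)).endswith(tag)


full_length = RNA("full_length", "cap", "AUGGCCAAGUUU")
degraded = RNA("degraded", "P", "CCAAGUUU")

print(cdna_is_tagged(full_length))  # True
print(cdna_is_tagged(degraded))     # False
```

Swapping the order of steps B and C in `prepare` would strip the phosphate from the decapped mRNA as well, and nothing would be tagged. That is why the order of the two enzymes matters.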

4.2 Why TSS-seq is useful

As we just saw, RNA-seq data can identify exons, segments of the genome that are transcribed and remain in mRNA (introns are transcribed but removed from mRNA). Often, RNA-seq data will also reveal how exons are linked together (with split alignments). That said, it is not always sufficient for gene identification purposes. To identify an individual transcription unit (a single gene) with more confidence, it is better to know the position of both the exons and the TSS. To see this for yourself, review RNA-seq data corresponding to a small region of the genome (Figure 4.2). The NCBI RefSeq gene prediction track is hidden on purpose. As you can see, the histogram and sequence read alignments, including the split alignments, are displayed. How many genes do you think are in this region? What is your rationale?


Figure 4.2: RNA-seq data corresponding to a small region in the genome with the base position and gene prediction tracks hidden. Guess how many genes you think map to this region of the human genome. Continue reading below to discover the answer.


In Figure 4.2, there are two main groups of split alignments. It would be reasonable to guess that there are two genes in this region. To see the true number of genes click here. This link takes you to the same genome region as in Figure 4.2 but with the gene prediction and base position tracks visible. If you guessed wrong, do you see why you were misled?
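The reasoning behind that guess (exons joined by split alignments probably belong to the same gene) can be written as a small grouping routine. The sketch below is only an illustration: the exon names and links are hypothetical and are not taken from Figure 4.2. It treats each split alignment as a link between two exons and counts the connected groups.

```python
def count_exon_groups(exons, split_links):
    """Count groups of exons that are connected by split alignments."""
    parent = {exon: exon for exon in exons}

    def find(exon):
        # Follow parent pointers to the representative of the group.
        while parent[exon] != exon:
            parent[exon] = parent[parent[exon]]
            exon = parent[exon]
        return exon

    for left, right in split_links:
        parent[find(left)] = find(right)

    return len({find(exon) for exon in exons})


# Hypothetical exons and split alignments, for illustration only.
exons = ["e1", "e2", "e3", "e4", "e5"]
split_links = [("e1", "e2"), ("e2", "e3"), ("e4", "e5")]

naive_gene_guess = count_exon_groups(exons, split_links)
print(naive_gene_guess)  # 2
```

The number of groups is only a guess at the number of genes. It depends on which split alignments happen to be present in the data, and it says nothing about where each transcript starts.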

4.3 How TSS-seq Data is Visualized

The UCSC Genome Browser contains numerous evidence tracks displaying TSS-seq data. We will focus on one that was generated by the FANTOM5 Consortium of labs in Japan (FANTOM stands for Functional Annotation of the Mammalian Genome). Click the link to learn more about FANTOM5. To view this evidence track click the TSS-seq session link.

Your genome browser should now display the “Max Counts of CAGE[6] reads” evidence track for BBS1. This evidence track displays TSS-seq data in histogram form (Figure 4.3). The track settings are configured such that the Y-axis is set at an absolute value: 350 read counts. Setting the Y-axis at 350 is sufficient for most genes. For highly expressed genes it is better to set the Y-axis to “auto-scale to data view”. Specifically, go to the track settings page. Scan down until you see “Data view scaling:”. Change the pulldown menu from “use vertical viewing range setting” to “auto-scale to data view”. Then click submit. If you do this for BBS1, the Y-axis should end at 69.0667. Before you move on, change the Y-axis back to “use vertical viewing range setting”.
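The difference between the two scaling modes can be sketched in a few lines of Python. This is a conceptual model only, not the Genome Browser's actual code: the function names are invented, and apart from 350 and 69.0667 the counts are made up.

```python
def y_axis_top(counts_in_view, mode="fixed", fixed_top=350.0):
    """Return the value at the top of the Y-axis for the current view."""
    if mode == "auto":
        # Auto-scale: the tallest bar in view defines the top of the axis.
        return max(counts_in_view, default=0.0)
    # Fixed: the axis always ends at the same absolute value.
    return fixed_top


def drawn_height(count, top):
    """Fraction of the track height a bar occupies; taller bars are clipped."""
    if top <= 0:
        return 0.0
    return min(count, top) / top


# Hypothetical counts in view; the tallest matches the BBS1 value in the text.
counts = [3.0, 12.5, 69.0667]

fixed_top = y_axis_top(counts, mode="fixed")  # 350.0
auto_top = y_axis_top(counts, mode="auto")    # 69.0667

# With the fixed axis the tallest bar fills under a fifth of the track height.
print(round(drawn_height(69.0667, fixed_top), 3))  # 0.197
```

Under the fixed setting, any bar above 350 would be drawn at full height and could not be told apart from an even taller one. That is the reason auto-scaling suits highly expressed genes.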


Figure 4.3: TSS-seq data for BBS1.


Zoom out 10X. The genome browser window is now focused on a larger portion of the genome surrounding BBS1 (Figure 4.4). Notice that there are both red peaks and blue peaks. There are also tall peaks and tiny peaks.

Recall that genes can be found on the top strand or the bottom strand, as RNA polymerase can use either strand as the template for transcription. If RNA polymerase uses the bottom strand as template, then it moves from left to right (5’ to 3’ relative to the top strand) and produces an mRNA that matches the top strand sequence and is complementary to the bottom strand. If RNA polymerase uses the top strand as template, then it moves from right to left (also 5’ to 3’, this time relative to the bottom strand) and produces an mRNA that matches the bottom strand sequence. Histogram peaks displayed in red represent the TSS of top strand genes. Histogram peaks displayed in blue represent the TSS of bottom strand genes.
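A few lines of Python make the strand relationships concrete. The six-base sequence is invented for illustration, and the color table simply restates the convention described above for this evidence track.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

# Color convention described in the text for this evidence track.
PEAK_COLOR = {"top": "red", "bottom": "blue"}


def bottom_strand(top):
    """Bottom strand written 5' -> 3': the reverse complement of the top."""
    return top.translate(COMPLEMENT)[::-1]


def mrna_for_gene(top, gene_strand):
    """mRNA (5' -> 3') of a gene located on the given strand."""
    # The mRNA matches the strand the gene is on, with U in place of T.
    matching = top if gene_strand == "top" else bottom_strand(top)
    return matching.replace("T", "U")


top = "ATGGCA"  # hypothetical top-strand sequence, written 5' -> 3'

print(bottom_strand(top))            # TGCCAT
print(mrna_for_gene(top, "top"))     # AUGGCA
print(mrna_for_gene(top, "bottom"))  # UGCCAU
```

Note that the bottom-strand mRNA begins at the right-hand end of the region as it is drawn in the browser, which is why a bottom strand gene has its TSS on the right.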

The taller a histogram peak at a given location, the more sequence reads start at that spot. The more read counts, the more reliable the data and the more likely the data represents a true TSS. In fact, the tiny peaks observed in Figure 4.4 (find each red asterisk) may reflect background noise[7] and may not represent a true TSS. That said, the absence of a peak at a location where one was predicted to exist does not mean that the gene prediction is incorrect. This might be an instance where the gene was not expressed in the tissue sample that was used to generate the TSS-seq data. There is a common saying, “Absence of evidence is not evidence of absence”.
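The way aligned reads become a histogram can be sketched as follows. This is an illustrative simplification rather than the FANTOM5 pipeline: the reads and the cutoff of 3 are invented, coordinates are assumed to be 1-based and inclusive, and each read is assumed to be reported on the strand of its transcript, so that its 5’ end is the leftmost base for a top strand read and the rightmost base for a bottom strand read.

```python
from collections import Counter


def five_prime_position(start, end, strand):
    """Genome position of the 5' end of an aligned read."""
    return start if strand == "+" else end


def tss_histogram(reads):
    """Count how many reads begin at each (strand, position)."""
    return Counter(
        (strand, five_prime_position(start, end, strand))
        for start, end, strand in reads
    )


def candidate_tss(histogram, min_count):
    """Keep positions with at least min_count reads; drop likely noise."""
    return sorted(key for key, n in histogram.items() if n >= min_count)


# Hypothetical 35 bp reads: (start, end, strand).
reads = (
    [(100, 134, "+")] * 5
    + [(102, 136, "+")] * 2
    + [(500, 534, "-")] * 4
    + [(300, 334, "+")]
)

histogram = tss_histogram(reads)
print(candidate_tss(histogram, min_count=3))  # [('+', 100), ('-', 534)]
```

The single read at position 300 falls below the cutoff, much like the tiny peaks marked in Figure 4.4. Any cutoff is a judgment call: set too high, it would also discard a real TSS from a weakly expressed gene.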


Figure 4.4: TSS-seq data of a larger region surrounding BBS1. Each asterisk (my annotation) highlights sequence reads that are not likely to represent true transcriptional start sites.


Now zoom in to the TSS for BBS1 (Figure 4.5). You can see that the position in the genome where the majority of experimentally defined TSSs are located is at the predicted transcriptional start site. This is not always the case. Also, notice that there is a range of TSSs for BBS1. This is normal.


Figure 4.5: A close-up view of the TSS-seq histogram data surrounding the predicted transcriptional start site for BBS1 (as defined by the NCBI RefSeq evidence track). The translational start site of BBS1 is shown in green.


Again, this is a great example of an evidence track that provides unbiased experimental evidence for the existence of a gene. It provides the position of all transcriptional start sites (TSS) within the genome. Each TSS identified by TSS-seq was not predicted to exist but was experimentally determined without prior knowledge (no bias).

That said, TSS-seq data is not sufficient to determine the number of individual genes in a given region of the genome. For example, review the histogram data in Figure 4.6 and try to predict the number of genes in this region. Again, I have omitted the gene prediction track. Moreover, there are individual genes that utilize alternative transcription initiation sites (review Chapter 2, “One gene, many splice variants”)!


Figure 4.6: This is an image of the TSS-seq evidence track for a small region of the human genome. All other tracks have been hidden. Count the number of transcription start sites, then guess how many genes you think these TSS peaks correspond to.


Only when a sufficient amount of RNA-seq data combined with TSS-seq data is available can one determine the number of genes in a given genomic region with confidence. Click here to review the gene prediction track and RNA-seq data for the genomic region shown in Figure 4.6 to see how the RNA-seq data combined with the TSS-seq data provides a more complete picture of the true number of genes present. How does this differ from what you thought? Do you see why you were misled?
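One way to picture how the two data types complement each other is to assign each TSS peak to the group of linked exons it sits in. The sketch below is a toy heuristic under stated assumptions, not an established gene-finding method: the exon groups, their coordinates, the peaks, and the 50 bp slack are all hypothetical.

```python
def assign_peaks(groups, peaks, slack=50):
    """Map each exon group to the same-strand TSS peaks within its span."""
    assigned = {name: [] for name in groups}
    for strand, position in peaks:
        for name, (group_strand, start, end) in groups.items():
            in_span = start - slack <= position <= end + slack
            if strand == group_strand and in_span:
                assigned[name].append(position)
    return assigned


# Hypothetical exon groups from RNA-seq split alignments: (strand, start, end).
groups = {
    "group_A": ("+", 1000, 9000),
    "group_B": ("-", 12000, 20000),
}

# Hypothetical TSS-seq peaks: (strand, position).
peaks = [("+", 1000), ("+", 4200), ("-", 20010)]

support = assign_peaks(groups, peaks)
print(support)  # {'group_A': [1000, 4200], 'group_B': [20010]}
```

In this made-up example there are three peaks but only two exon groups. Two of the peaks fall within the same group, which is what alternative transcription initiation sites of a single gene would look like. Counting peaks alone would have overestimated the number of genes.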

4.3.1 Test Your Understanding

The first two questions require the TSS-seq data displayed in the figure below (the gene prediction track is hidden).

  • How many prominent transcription start sites (>50 Max Count Reads) can be found in the genome region displayed above?
  • Among the TSS peaks displayed in the figure above, how many use the top strand as a template for transcription?

For the following questions, you will search for the indicated gene using the Search Window. To view all the transcript variants associated with the gene you may have to zoom out. Review the orientation wedges to determine which end is predicted to harbor the transcriptional start sites (TSS). This is where you will look for the TSS-seq data (if present).

  • Search for a gene called PC. PC is predicted to produce three transcript variants (two long, one short). Which predicted transcriptional start site(s) are supported by TSS-seq data?
  • Search for a gene called ACTN3. ACTN3 is predicted to produce two transcript variants (one long, one short). Which predicted transcriptional start site(s) are supported by TSS-seq data?
  • Search for a gene called AIP. AIP is predicted to produce three transcript variants (two long, one short). Which predicted transcriptional start site(s) are supported by TSS-seq data?
  • Search for a gene called DCLK1. DCLK1 is predicted to produce five transcript variants (three long, two short). Which predicted transcriptional start site(s) are supported by TSS-seq data?

4.3.2 For Discussion

  1. Explain, in your own words, why TSS-seq data alone can be misleading when attempting to count the number of genes present in a given region of the genome.
  2. Explain, in your own words, why RNA-seq data alone can be misleading when attempting to count the number of genes in a given region of the genome.

4.4 Take Home Messages

  • TSS-seq data is similar to RNA-seq data, but with TSS-seq only the mRNA segment closest to the 5’ cap is sequenced.
  • Thus, TSS-seq data identifies transcription start sites (TSS).
  • Finally, TSS-seq data combined with RNA-seq data is able to identify each transcription unit with more confidence.

© 2023, Maria Gallegos. All rights reserved.


  1. Not affected by

  2. At this point, full-length mRNAs possess a 5’ phosphate while partially degraded mRNAs possess a 5’ hydroxyl, so they are still different and distinguishable

  3. RNA ligase is similar to DNA ligase in that both require a 5’ phosphate and a 3’ hydroxyl to form a phosphodiester bond between two nucleic acid chains. The main difference: RNA ligase links two RNA chains together while DNA ligase links two DNA chains together

  4. Short single-stranded RNA chain

  5. The phrase “base pairs” here is used as a verb to mean “hybridizes to” or “anneals to”

  6. CAGE is short for “Cap Analysis Gene Expression”

  7. Molecular biology is messy. Each enzymatic step listed in Figure 4.1 is not likely to achieve 100% success. For example, partial mRNAs that fail to have their 5’ phosphates removed can be sequenced alongside all the true full-length mRNAs.