5 DNA motifs for transcription initiation

The expression of a gene starts with transcription. During transcription, one contiguous segment of genomic DNA is used to make a single RNA transcript. But how does the transcription machinery know where transcription should begin? This chapter reviews what we know about DNA sequences that promote transcription initiation and how these sequences provide bioinformatic evidence (albeit weak) for the position of genes in the genome. To follow along in the text and to answer “Test Your Understanding” questions, use the “TSS-Seq” Session link.

5.1 The Core Promoter

There are five different RNA polymerases that catalyze transcription in eukaryotes (RNA Pol I, II, III IV and V). Transcription of protein coding genes, like BBS1, require RNA Pol II and will be our focus. Transcription begins once RNA pol II binds near a transcriptional start site (TSS) of a gene. But RNA polymerase II on its own cannot recognize a TSS. Transcription initiation also requires a number of so-called General Transcription Factors or GTFs ³⁷ and a core promoter sequence.

A core promoter is defined as the minimal DNA sequence that directs accurate initiation of transcription. There are two main types of core promoters: focused and dispersed (Danino et al. 2015). A focused core promoter (also called a “sharp peak” or “narrow peak” promoter) contains a single predominant TSS that is confined to a small number of nucleotides. A dispersed promoter, by contrast, contains a large number of transcriptional start sites of equal potency that are dispersed over a 50 to 100 nucleotide region. This type of promoter is also called a “broad peak” or “wide peak” promoter. Both terms “sharp peak” and “broad peak” essentially describe the shape of the TSS-seq histogram data (See Chapter 4). In reality, TSS-seq data and other genome-wide studies of transcription initiation indicate that these two types of core promoters are in fact two ends of a continuum. In other words, promoter types cannot be categorized easily and also include promoters of mixed character (i.e. “broad with peak”) (Figure 5.1).

The left schematic from Vo ngoc et al. 2017 illustrates the three main types of promoters that are found in animals: Focused, Dispersed and Mixed. The right image from Carninci et al. 2006 displays TSS seq data for a number of genes that illustrate each type of promoter.

Figure 5.1: The left schematic from Vo ngoc et al. 2017 illustrates the three main types of promoters that are found in animals: Focused, Dispersed and Mixed. The right image from Carninci et al. 2006 displays TSS seq data for a number of genes that illustrate each type of promoter.

Focused core promoters were the first described and are the best understood. In humans, they are about 80 nt in length and flank³⁸ the TSS, the so-called +1 position of transcription (Figure 5.2). Each includes a set of short, DNA sequences called core promoter motifs³⁹. These DNA motifs serve as binding sites for a subset of GTFs (namely TFIID and TFIIB). Once TFIID and TFIIB bind a focused core promoter, they recruit and stabilize other GTFs which together recruit and stabilize RNA polymerase II to the TSS. This large, multiprotien complex (called the preinitiation complex or PIC) initiates basal levels of transcription⁴⁰. In fact, GTFs are also called Basal Transcription Factors.

A schematic of a focused core promoter and the GTFs/RNA polymerase II that bind to it. The horizontal line is the genomic DNA. **TATA**, **Inr**, **MTE** and **DPE** are DNA sequence motifs positioned along the core promoter as shown. The **Inr** spans the TSS (+1) while the **TATA-box** is upstream and the **MTE** and **DPE** motifs are downstream. Image from Danino et al. 2015.

Figure 5.2: A schematic of a focused core promoter and the GTFs/RNA polymerase II that bind to it. The horizontal line is the genomic DNA. TATA, Inr, MTE and DPE are DNA sequence motifs positioned along the core promoter as shown. The Inr spans the TSS (+1) while the TATA-box is upstream and the MTE and DPE motifs are downstream. Image from Danino et al. 2015.

The first core promoter motif identified was the TATA-box. The TATA-box recruits TFIID to the core promoter. Initially, the TATA-box was thought to be an essential motif that all core promoters possess (Read any textbook). We now know they are only present in a minority of core promoters. For example, only 24% of human genes have a TATA-box. The core promoter motif found most often is the Initiator (Inr). This DNA motif spans the TSS and also recruits TFIID. That said, nearly half of human promoters lack both a TATA-box and an Inr! The take home message? There are no universal sequence motif required for transcription initiation in Eukaryotes. Not only that, but the sequence of each core promoter motif (i.e. Inr) is somewhat variable. For example, the TATA-box in ACTA2 is TATATAA while the TATA-box in HERPUD1 is TATAAAA (ACTA2 and HERPUD1 are names of two distinct human genes).

5.2 What is a Consensus Sequence?

By aligning a large number of well-characterized sequence motifs, one can search for sequence patterns and/or regions of sequence conservation⁴¹. Any region of conservation or pattern observed can then be written as a consensus sequence. A consensus sequence (also known as a canonical sequence) is defined as the “the most frequent residues, either nucleotide or amino acid, that is found at each position in a multiple sequence alignment” (Wikipedia). For a simple example, see Figure 5.3. Notice that the multiple sequence alignment contains only G, A, T and C while the consensus sequence contains additional letters (Y, N and R). Y,N and R belong to an agreed upon list of IUPAC nucleotide codes also known as IUPAC ambiguity codes. Here, Y = C or T, N = any nucleotide and R = G or A.

This is a simple multiple sequence alignment (MSA) that includes only 4 sequences. The consensus sequence corresponding to this alignment is written directly below. It was created by examining the observed frequency of each nucleotide present in each column of sequence in the MSA.

Figure 5.3: This is a simple multiple sequence alignment (MSA) that includes only 4 sequences. The consensus sequence corresponding to this alignment is written directly below. It was created by examining the observed frequency of each nucleotide present in each column of sequence in the MSA.

Figure 5.4 lists the consensus sequences for each core promoter motif bound by TFIID in mammals. In this figure the Inr consensus sequence is listed as BBCA(+1)BW (where B = C, G or T; W= A or T and the A is at the +1 position of the TSS). That said, there have been so many exceptions to this consensus sequence that some argue it should be reduced to YR(+1) where Y = C or T and R = G or A and is at the +1 position of the TSS (Haberle and Stark 2018)⁴²!

Figure 5.4: A list of the consensus sequences for each core promoter motif bound by TFIID in mammals.

5.3 Searching for a consensus sequence

You can search for a consensus sequence containing IUPAC codes using an evidence track called, “Short Match”. Before you open Short Match, open the saved “TSS-Seq” Session then zoom in to view the sequence surrounding the TSS for BBS1. Now scroll down to find an evidence track entitled, “Short Match” (1) within the section entitled, “Mapping and Sequencing” (Figure 5.5). Change this evidence track from “hide” to “pack” and click “refresh” (2). A new evidence track will open (3). Click on the gray rectangle at left to open the track settings page for this track (4).

How to search for a consensus sequence: 1) change the **Short Match** evidence track from **hide** to **pack**. 2) Click **Refresh**. 3) A new evidence track will open. 4) Click on the gray rectangle to open the track settings page. Now see figure below.

Figure 5.5: How to search for a consensus sequence: 1) change the Short Match evidence track from hide to pack. 2) Click Refresh. 3) A new evidence track will open. 4) Click on the gray rectangle to open the track settings page. Now see figure below.

Finally, input any sequence (i.e. ATG) into the search window (1) provided at the Track Settings page for Short Match (Figure 5.6) then hit “Submit” (2).

Figure 5.6: 1) Enter a consensus sequence in the window provided. 2) Click Submit.

In Figure 5.7, I searched for the Inr consensus sequence, BBCA(+1)BW, in the genomic region surrounding the human gene (CCT2) TSS. Each match is displayed in the Short Match evidence track as a thick black line. The position (i.e. 69,979,214) and orientation (- or +) of each match is written on the left of each line. In this example, the putative⁴³ Inr is the one highlighted with a red asterisk. What is my evidence? This is the only Inr consensus sequence that spans the predicted and experimentally-defined TSS at position 69,979,236. It is also found on the plus strand of the genomic DNA and CCT2 is a plus strand gene. Finally, the “A” of the consensus sequence is positioned at the predicted and experimentally defined TSS (See red circle in Figure 5.7).

Horizonatal black bars within the 'Short Match' evidence track describe the position of each consensus sequence (BBCA(+1)BW) found within this genomic region. Information about the precise nucleotide position and orientation is provided on the left (see boxed consensus on the left). The putative Inr for CCT2 is highlighted with a red asterix. This consensus sequence is found on the plus strand (same as CCT2) and is positioned directly below the experimentally defined TSS at 69,979,236 with the A at the putative +1 position.

Figure 5.7: Horizonatal black bars within the ‘Short Match’ evidence track describe the position of each consensus sequence (BBCA(+1)BW) found within this genomic region. Information about the precise nucleotide position and orientation is provided on the left (see boxed consensus on the left). The putative Inr for CCT2 is highlighted with a red asterix. This consensus sequence is found on the plus strand (same as CCT2) and is positioned directly below the experimentally defined TSS at 69,979,236 with the A at the putative +1 position.

Now look what happens when I search for the degenerate Inr consensus sequence (YR). This is an extremely short consensus sequence and so many are expected to be present in any genomic region just by chance. That said, ALL the motifs found by Short Match are labeled as + strand!!! You should know intuitively this is impossible. There MUST be YR sequences on the bottom strand, too. In fact, Y (C/T) is complementary to R (G/A). Thus, where a YR sequence is found on the plus strand a YR sequence will always be found on the bottom strand. You can see this for yourself in Figure 5.8. Remember to read the bottom strand from right to left (5’ to 3’). For example, a CA dinucleotide in the top strand is a YR consensus sequence but so is its complementary sequence in the bottom strand: TG.

Select YR consensus sequences identified by the Short Match evidence track are boxed in red. Recall, Y = C or T while R=G or A. If you read the top strand sequence (from left to right) you can confirm that each is a YR consensus sequence. If you read the sequence (from right to left) for the exact same position in the minus strand, you will see it is also a YR sequence. Why do they label all YR Short Matches as plus strand? Convenience? I don't know.

Figure 5.8: Select YR consensus sequences identified by the Short Match evidence track are boxed in red. Recall, Y = C or T while R=G or A. If you read the top strand sequence (from left to right) you can confirm that each is a YR consensus sequence. If you read the sequence (from right to left) for the exact same position in the minus strand, you will see it is also a YR sequence. Why do they label all YR Short Matches as plus strand? Convenience? I don’t know.

5.3.1 Test your understanding

Use IUPAC nucleotide codes to rewrite the following consensus sequence: T/A, G/A/C, A, G/A, C/T, T, G/A/T/C, T (where “/” = “or”).

For the following questions start with the “TSS-Seq” session link, zoom into the region containing the BBS1 TSS then change the Short Match evidence track from “hide” to “pack”. Remember the core promoter is defined as the region that includes the TSS plus 40 nt upstream and downstream (80 nt in total).

Use the ShortMatch evidence track to search for a canonical TATA-box consensus sequence (TATAWAW) within the BBS1 core promoter region. If one or more are found,
– Are any on the plus strand as expected?
– Are any of the plus strand motifs positioned approximately 30 nt upstream of the predicted TSS as expected?
Use the Short Match evidence track to search for a canonical Inr consensus sequence (BBCABW) within the BBS1 core promoter region. If one or more are found,
– Are any on the plus strand as expected?
– Do any of the plus strand motifs span the TSS as expected?
If a BBCABW consensus sequence is NOT found spanning the the TSS of BBS1, is there a degenerate Inr consensus sequence (YR) spanning the TSS as expected?
Use the Short Match evidence track to search for a canonical MTE consensus sequence (CSARCSSAACGS) within the BBS1 core promoter region. If one or more are found,
– Are any on the plus strand as expected?
– Are any of the plus strand motifs positioned downstream of the TSS as expected?
Use the Short Match evidence track to search for a canonical DPE consensus sequence (DSWYVY) within the BBS1 core promoter region. If one or more are found,
– Are any on the plus strand as expected?
– Are any of the plus strand motifs positioned downstream of the TSS as expected?

5.3.2 Test your understanding

For the following questions start with the “TSS-Seq” session link, zoom into the region containing the MYH3 TSS then change the Short Match evidence track from “hide” to “pack”. Remember the core promoter is defined as the region that includes the TSS plus 40 nt upstream and downstream (80 nt in total). Also, MYH3 is a minus strand gene.

Use the ShortMatch evidence track to search for a canonical TATA-box consensus sequence (TATAWAW) within the MYH3 core promoter region. If one or more are found,
– Are any on the minus strand as expected?
– Are any of the minus strand motifs positioned approximately 30 nt upstream of the predicted TSS as expected?
Use the Short Match evidence track to search for a canonical Inr consensus sequence (BBCABW) within the MYH3 core promoter region. If one or more are found,
– Are any on the minus strand as expected?
– Do any of the minus strand motifs span the TSS as expected?
If a BBCABW consensus sequence is NOT found spanning the the TSS of MYH3, is there a degenerate Inr consensus sequence (YR) spanning the TSS as expected?
Use the Short Match evidence track to search for a canonical MTE consensus sequence (CSARCSSAACGS) within the MYH3 core promoter region. If one or more are found,
– Are any on the minus strand as expected?
– Are any of the minus strand motifs positioned downstream of the TSS as expected?
Use the Short Match evidence track to search for a canonical DPE consensus sequence (DSWYVY) within the MYH3 core promoter region. If one or more are found,
– Are any on the minus strand as expected?
– Are any of the minus strand motifs positioned downstream of the TSS as expected?

5.4 Sequence Logos

Sequence conservation, trends and patterns revealed by a multiple sequence alignment (MSA) can also be visualized with a “sequence logo” where the predominant residue is drawn as the tallest and placed at the top among all the residues found at a given position in an alignment. For example, let’s say at position 3 of an MSA there is a T in every single sequence (For a simple example see Figure 5.9). In a sequence logo, this would be displayed as a T of maximum height (illustrating that the T in that position is invariant and important). Now let’s say position 6 has an A in 75% of the sequences and a T in the remaining 25%. At this position of a sequence logo you would see an A on top of the T, the A would be proportionally larger than the T but the overall height of the two letters combined would be shorter than the T in position 3. And in the extreme case where G, A, T or C are found in equal proportion that position in the sequence logo would be left blank (see position 4 in Figure 5.9)! This indicates that that position of the alignment is utterly uninformative. In a traditional consensus sequence using IUPAC codes, this would be written as “N” for any nucleotide. One can also draw a sequence logo using frequency for the Y-axis. This type of sequence logo is more intuitive but many think it is less able to emphasize important sequence trends. What do you think?

Figure 5.9: The hypothetical multiple sequence alignment is redrawn as a consensus sequence including IUPAC codes, or as a sequence logo with either information bits or frequency as the Y axis.

In 2006, Carninci et al. performed an unbiased, systematic analysis of all core promoter sequences identified by TSS-seq data obtained from RNA extracted multiple human tissue types. In their analysis confirmed the diversity of promoter types classifying them into four discrete categories (Figure 5.1) including the two extremes: Single Predominant Peak (SP) and Broad Peak (BR). They aligned core promoter sequences by category, placing the +1 position of the TSS in a single column then adding the surrounding sequences to create a large multiple sequence alignment. They then created a sequence logo for each multiple sequence alignment (MSA). Two are displayed in Figure 5.10. As you can see, the SP promoters are more likely to contain a TATA-box-like sequence about 30 nt upstream of the so-called Inr and there is a strong bias for a purine (G or A) at the +1 position of the TSS. Broad Peak (BR) promoters were found to be similar to SP promoters only in that they have a a strong bias for a purine at the +1 position of the TSS but they clearly lack a TATA-box and are enriched overall with Gs and Cs. Take home message: There are no universal promoter motifs. There may be sequence trends but there is a tremendous amount of sequence diversity at the core promoter. Clearly it would be very difficult to identify genes based solely on the presence of conserved core promoter sequence motifs.

What I have not discussed in this chapter is how a cell “knows” when and which tissue a gene should be transcribed. That requires far more than the core promoter sequence and can sometimes involves DNA sequence on other chromosomes! This is a topic for another course.

Sequence Logos for Single Predominent Peak (SP) and Broad Peak (BR) promoter types. The position of the TATA-box and Inr are highlighted in the SP class of promoters. The +1 represents the TSS.

Figure 5.10: Sequence Logos for Single Predominent Peak (SP) and Broad Peak (BR) promoter types. The position of the TATA-box and Inr are highlighted in the SP class of promoters. The +1 represents the TSS.

5.4.1 Test Your Understanding

Below is a sequence logo drawn from a multiple sequence alignment (MSA) involving a subset of well-characterized metazoan TATA-boxes⁴⁴ (Haberle and Stark, 2018).

In this manual, I write the consensus sequence for the TATA-box as TATAWAW. It has also been written as TATAWAAR with an A in the 7th position (Kadonaga 2012). Based on the sequence logo above, which consensus sequence is a better match (ignore the “R” in TATAWAAR when you answer this question as the sequence logo above is only 7 nt long)
In both consensus sequences (TATAWAW and TATAWAAR), an A is placed in the fourth position. Given the sequence logo above what other letter is sometimes present in that position?
In the fifth position of both consensus sequences is a W. This implies that a T or A are possible at this position. Given the sequence logo how frequently is the T vs the A observed?