6 RNA motifs for processing and translation

In this chapter, you will review the cis-acting sequences that are required for pre-mRNA processing and mRNA translation. Again, we will focus on BBS1 as our model gene to find out if BBS1 has all the cis-acting sequences necessary for these events. To follow along with the text and to answer “Test Your Understanding” questions, you can use the “Gene Structure” session link.

6.1 The Addition of a 5’ guanosine cap

Transcripts created by RNA polymerase II are transformed into messenger RNA (mRNA) as they are synthesized. This transformation process is called pre-mRNA processing. The first step involves the addition of a 5’ cap structure. Specifically, a 7-methyl guanosine becomes covalently attached to the first templated ribonucleotide of all pre-mRNA transcripts as they emerge (5’ end first) from the transcription bubble (Figure 6.1). Importantly, there are no conserved (cis-acting) sequences required for this event to occur. Now, let’s review the mRNA sequence of a random human gene: SPTBN2 (NM_006946.4). Notice that this mRNA (displayed in FASTA format) begins with an A. This is the first templated ribonucleotide for SPTBN2 mRNA. My point: All mRNA transcripts will have a 7-methylguanosine cap at the 5’ end but you will not see it in the mRNA sequence provided by NCBI or other sequence databases online.


Original image from Peter J. Russell, iGenetics: Copyright © Pearson Education, Inc., publishing as Benjamin Cummings. Additional annotation added here for clarity.

Figure 6.1: Original image from Peter J. Russell, iGenetics: Copyright © Pearson Education, Inc., publishing as Benjamin Cummings. Additional annotation added here for clarity.


6.2 Splicing

The second step in pre-mRNA processing is the removal of or splicing out of introns and the merging or joining of the flanking exons on either side. Introns are spliced out by the spliceosome. In brief, specific components of the spliceosome first bind to the 5’ splice site then the 3’ splice site. Then the spliceosome pulls the two adjacent exons together to facilitate excision of the intron and fusion of the two exons (Figure 6.2).


From Schneider et al. 2010. The spliceosome includes the U1, U2, U4, U5 and U6 small nuclear ribonucleoproteins (snRNPs). SR proteins (in red) and splicing enhancer sequences within the exon are also utilized for accurate splicing. THis is particularly important with long introns.

Figure 6.2: From Schneider et al. 2010. The spliceosome includes the U1, U2, U4, U5 and U6 small nuclear ribonucleoproteins (snRNPs). SR proteins (in red) and splicing enhancer sequences within the exon are also utilized for accurate splicing. THis is particularly important with long introns.


To identify cis-acting sequences required for splicing, scientists have created multiple sequence alignments involving the 5’ and 3’ ends of of introns. The result of one such analysis is displayed below as a sequence logo (Figure 6.3). This particular sequence logo was created by aligning the 5’ and 3’ ends of all human introns below a certain length. This sequence logo clearly illustrates that the first and last two nucleotides of nearly all eukaryotic introns are GT and AG, respectively. The GT (or GU as it is found in the unspliced RNA) is at the 5’ end of the intron and is called the 5’ splice site (SS) or splice donor site. The AG is at the 3’ end of the intron and is called the 3’ splice site (SS) or splice acceptor site. Moreover, additional sequences surrounding the 5’ and 3’ splice sites are enriched for certain nucleotides. This sequence conservation indicates that these sequences are critical for splicing. In fact, we now know that the snRNA components of the spliceosome hybridize to the 5’ and 3’ splice sites. Moreover, mutations that map to the 5’ and 3’ splice sites in genes known to cause human disease are typically pathogenic38.


This sequence logo was created by aligning 5' and 3' splice site junctions of *all* human introns below a certain size. The Y axis represents frequency. The intron is bracketed as shown. Modified from Lim and Burge 2001.

Figure 6.3: This sequence logo was created by aligning 5’ and 3’ splice site junctions of all human introns below a certain size. The Y axis represents frequency. The intron is bracketed as shown. Modified from Lim and Burge 2001.

6.2.1 Test Your Understanding


Click the link above to see the TYU questions.


6.3 Intron position - Anything is possible

A common misconception about splicing is that introns are positioned between codons. Not true. When introns first “invaded” genomes they did so randomly, “blind” to the concept of a codon. In fact, the first intron of BBS1 splits codon 16 after the second nucleotide of the codon (See Figure 6.4). What are the consequences in the display? Notice that the last codons of exon one code for amino acids C-G-A-E-S. You can also find this sequence in the +3 reading frame except there is an R instead of the final S! This is because the last nucleotide of the codon that codes for R is in the intron! This will be spliced out! Because of splicing, the last nucleotide of codon 16 is at the beginning of exon 2! You can use the three reading frame display to figure out what would happen if intron one of BBS1 was not spliced out. Given that the ribosome cannot distinguish exon from intron (it’s just RNA), it would continue to translate the +3 reading frame as if it were coding sequence. As a result, a single incorrect amino acid (R) would be added to the growing polypeptide chain before a stop codon would be encountered! This underscores the importance that splicing be precise. One mistake and the reading frame of an mRNA will likely get thrown out of whack. This explains why splice site mutations are often pathogenic in human disease genes.


This is a close up view of the exon/intron junction of exon 1 for BBS1. Notice there is an S (serine) in the BBS1 polypeptide at the end of exon 1 while there is an R in the +3 reading frame! The descrepcancy is due to the fact that the last nucleotide of the codon that codes for R is in the intron which will be spliced out. The real last nucleotide of this codon is at the beginning of exon 2!

Figure 6.4: This is a close up view of the exon/intron junction of exon 1 for BBS1. Notice there is an S (serine) in the BBS1 polypeptide at the end of exon 1 while there is an R in the +3 reading frame! The descrepcancy is due to the fact that the last nucleotide of the codon that codes for R is in the intron which will be spliced out. The real last nucleotide of this codon is at the beginning of exon 2!


If your gene prediction track does not have the amino acids displayed (See 1 and compare to the figure above), then you will need to configure it anew as described here. 2) Right click on the gene prediction schematic and a pull down menu will appear. Choose 'Configure Refseq Curated'. 3) A new yellow box will appear. Click on the pulldown menu for Color track by codons and choose 'genomic codons'. 4) Click 'Apply' then 5) Click 'OK'

Figure 6.5: If your gene prediction track does not have the amino acids displayed (See 1 and compare to the figure above), then you will need to configure it anew as described here. 2) Right click on the gene prediction schematic and a pull down menu will appear. Choose ‘Configure Refseq Curated’. 3) A new yellow box will appear. Click on the pulldown menu for Color track by codons and choose ‘genomic codons’. 4) Click ‘Apply’ then 5) Click ‘OK’


6.3.1 Test Your Understanding


Click the link above to see the TYU questions.

NOTE: These TYU questions require that the amino acids coded by the coding exons are displayed in the gene prediction track as shown in Figure 6.4. If you do not see them, there is something amiss with your gene prediction track settings. See Figure 6.5 to learn how to resolve this problem.


6.4 Cleavage and Polyadenylation

Termination of transcription involves two coupled reactions: 1) RNA cleavage to release the RNA transcript from the RNA polymerase machinery and 2) addition of a poly(A) tail39 to the 3’ end of the newly released message. Proper cleavage creates a transcript of the correct length. Addition of the poly A tail helps transport the mRNA out to the cytoplasm and improves stability and translation efficiency.

Like splicing, cleavage and polyadenylation requires specific RNA sequence motifs. These RNA motifs recruit the cleavage and polyadenylation machinery40 to the RNA. Sequence motifs important for cleavage flank41 the cleavage site (Figure 6.6a). The best conserved motif is called the Poly Adenylation Signal (or PAS), a sequence consisting of six ribonucleotides positioned 15-30 nucleotides upstream of the cleavage site within the 3’ UTR (NOTE: The cleavage site can be found by looking for end of the transcript). A sequence logo of aligned PAS sequences from humans and flies illustrates this high level of conservation (Figure 6.6b) and reveals the consensus sequence as it would be found in the nontemplate strand of the genome42: AWTAAA (where W = T/A).
a) A schematic of an RNA (in red) just after cleavage but before polyadennylation. The 5' end of the RNA extends to the bottom left. An open triangle points to the site of cleavage. Notice that the PAS (AAUAAA) is upstream of the cleavage site. The G/U-rich region is downstream. CTD, CFI, CFII, CPSF, CSTF and PAP make up the cleavage and polyadenylation machinery. The **PAP** is the **P**oly **A** **P**olymerase, the enzyme that adds adenine ribonucleotides to the 3' end of the message. b) These sequence logos illustrate how conserved the PAS really is even between humans (top) and flies (bottom).

Figure 6.6: a) A schematic of an RNA (in red) just after cleavage but before polyadennylation. The 5’ end of the RNA extends to the bottom left. An open triangle points to the site of cleavage. Notice that the PAS (AAUAAA) is upstream of the cleavage site. The G/U-rich region is downstream. CTD, CFI, CFII, CPSF, CSTF and PAP make up the cleavage and polyadenylation machinery. The PAP is the Poly A Polymerase, the enzyme that adds adenine ribonucleotides to the 3’ end of the message. b) These sequence logos illustrate how conserved the PAS really is even between humans (top) and flies (bottom).


6.4.1 Test Your Understanding


Click the link above to see the TYU questions.


Take home message: Splicing, cleavage and polyadenlyation are distinct from cap addition. The capping enzyme does not require a specific DNA or RNA sequence for 5’ cap addition to occur. On the other hand, splicing, cleavage and polyadenylation do require specific sequence motifs. These sequences are present in the nontemplate or sense strand of the genome43 but are recognized by the RNA processing machinery as RNA.

6.5 Translation Initiation

Once processed, mRNA exits the nucleus to be used as a template to create a polypeptide via a process called translation. Translation is mediated in part by the ribosome44. In eukaryotes, the small ribosomal subunit and associated factors first assemble at the mRNA cap structure then scan along the 5’ UTR until a start codon is found. Choosing the correct start codon is critical as it determines the reading frame and thus the polypeptide sequence! During the scanning process, ribosomal proteins within the small ribosomal subunit search for then interact with specific nucleotides both upstream and downstream of the start codon (Llacer et al. 2018). In other words, context matters.

The importance of context in start codon selection by the ribosome was first suggested by Marilyn Kozak. She aligned 699 well-characterized start codons derived from a set of vertebrate genes to search for sequence conservation upstream and downstream of the ATG (Kozak 1987). She then converted this large multiple sequence alignment into a nucleotide frequency table45 (Figure 6.7A). The most frequent nucleotides she found at each position from -6 to +4 is called the Kozak consensus sequence: GCCACCATGG.

Now we have sequence for entire genomes! In 2015, Cenik et al. created a sequence logo for a similar region surrounding the start codon for all protein coding genes in the human genome (Figure 6.7B). The results are surprisingly similar. Again, the fact that sequence conservation exists suggests that the sequence surrounding the ATG plays an important role in translation initiation.


A) A multiple sequence alignment of 699 vertebrate genes displayed as a nucleotide frequency table (The ATG is not included since it is invariant). This schematic is from Kozak, 1987 with the most frequently observed nucleotides at positions -6 to +4 highlighted in green. B) This sequence logo was created 28 yeats later from the complete set of human protein coding genes. Notice the similarities!

Figure 6.7: A) A multiple sequence alignment of 699 vertebrate genes displayed as a nucleotide frequency table (The ATG is not included since it is invariant). This schematic is from Kozak, 1987 with the most frequently observed nucleotides at positions -6 to +4 highlighted in green. B) This sequence logo was created 28 yeats later from the complete set of human protein coding genes. Notice the similarities!


A recent study in zebrafish supports this hypothesis. In this study, Grzegorski et al. inserted an optimal translation initiation sequence (GCCACCATGG) into a reporter gene46 then measured the efficiency of translation by visualizing the amound of reporter gene product made. This was the control. They then compared the expression of this control reporter gene to ones with other translation initiation sequences (i.e. GCAAACATGG, GCAGTCATGG, CTTTCTATGC or CGGTGTATGC). They discovered that a reporter gene fused to a translational initiation sequence found less frequently in the genome than the canonical Kozak consensus sequence (GCCACCATGG) was translated to lower levels than the control. By contrast, a reporter gene containing a translational initiation sequence found more frequently in the genome than the Kozak consensus was translated to higher levels than the control. Their results are shown in Figure 6.8.


Figure taken from Grzegorski et al. 2014. The Y-axis represents expression levels of the experimental reporter gene divided by expression levels of the control reporter gene containing the canonical kozak consensus sequence

Figure 6.8: Figure taken from Grzegorski et al. 2014. The Y-axis represents expression levels of the experimental reporter gene divided by expression levels of the control reporter gene containing the canonical kozak consensus sequence


Additional studies found that nucleotides in two highly conserved positions exert the strongest effect: a G residue following the ATG codon (position +4) and a purine three nucleotides upstream (position -3). Thus, overall, a good start codon is one found in the following context: RNNATGG (Where R is a purine and N is any nucleotide). Whereas an adequate start codon is RNNATGY or YNNATGG (where Y is a pyrimidine) (Kozak 1997).

6.5.1 Test Your Understanding


Click the link above to see the TYU questions.

6.6 uORFs

uORF stands for upstream Open Reading Frame (pronounced you-orf). A gene is said to have a uORF if it has a start codon in the 5’ UTR that is followed by an in-frame stop codon OR if it has is a start codon in the 5’ UTR that is out-of-frame with the main start codon (Figure 6.9).


Paraphrased from Calvo et al. 2009. A) A schematic representation of an mRNA transcript with two uORFs (red arrows), one fully upstream and one overlapping the main coding sequence (black arrow). uORFs were defined by Calvo et al. as a start codon (AUG) in the 5' UTR followed by an in-frame stop codon (arrowhead) that precedes the **end** of the main coding sequence. Calvo et al. 2009 further refines the definition to only include uORFs that code for a minimum peptide length of 3 amino acids (or a coding sequence minimum length of 9 nt). (B) Number of uORFs in human and mouse RefSeq transcripts.

Figure 6.9: Paraphrased from Calvo et al. 2009. A) A schematic representation of an mRNA transcript with two uORFs (red arrows), one fully upstream and one overlapping the main coding sequence (black arrow). uORFs were defined by Calvo et al. as a start codon (AUG) in the 5’ UTR followed by an in-frame stop codon (arrowhead) that precedes the end of the main coding sequence. Calvo et al. 2009 further refines the definition to only include uORFs that code for a minimum peptide length of 3 amino acids (or a coding sequence minimum length of 9 nt). (B) Number of uORFs in human and mouse RefSeq transcripts.


uORFs, in general, have the capacity to reduce translation initiation from the main start codon (Calvo et al. 2009). Since they are found in a large number of protein coding genes (nearly 50% of human genes are thought to have a uORF), this is thought to be “by design”. In other words, it is thought that the presence of a uORF has an important purpose and is subject to positive selection during evolution as a way to keep gene expression levels appropriately low. Two examples are described in Figure 6.10.


Paraphrased from Kozak, 2005. Small upstream ORFs or uORFs are thought to down-regulate translation by imposing an inefficient reinitiation mechanism. This constraint on translation ensures against harmful overproduction of potent or toxic proteins. Two examples are described here. A) The presence of an overlapping uORF allows only low-level production of proinsulin in chick embryos. In the adult pancreas where more proinsulin protein is needed, a more efficiently translated form of mRNA is produced via a downstream promoter. B) The human mdm2 oncogene has the ability to cause cancer when it is overexpressed. Overexpression of the mdm2 oncogene in tumor cells is caused by a switch in the transcriptional start site which eliminates two small uORFs, thereby elevating translation 20-fold. The first example illustrates a naturally occuring, developmental switch in gene expression. The second example only occurs in the disease state. In fact, the expression from the crytpic downstream promoter results from the presence of a p53-responsive promoter region that is preferentially utilized when p53 levels increase, a common occurance in tumor cells (Landers 1997)

Figure 6.10: Paraphrased from Kozak, 2005. Small upstream ORFs or uORFs are thought to down-regulate translation by imposing an inefficient reinitiation mechanism. This constraint on translation ensures against harmful overproduction of potent or toxic proteins. Two examples are described here. A) The presence of an overlapping uORF allows only low-level production of proinsulin in chick embryos. In the adult pancreas where more proinsulin protein is needed, a more efficiently translated form of mRNA is produced via a downstream promoter. B) The human mdm2 oncogene has the ability to cause cancer when it is overexpressed. Overexpression of the mdm2 oncogene in tumor cells is caused by a switch in the transcriptional start site which eliminates two small uORFs, thereby elevating translation 20-fold. The first example illustrates a naturally occuring, developmental switch in gene expression. The second example only occurs in the disease state. In fact, the expression from the crytpic downstream promoter results from the presence of a p53-responsive promoter region that is preferentially utilized when p53 levels increase, a common occurance in tumor cells (Landers 1997)


6.7 Recognizing uORFs

To determine if a gene of interest (GOI) has a uORF in its 5’ UTR, you need to focus your genome browser on the 5’ UTR of your GOI then make sure your base prediction track is set to full. So long as you are zoomed in close enough to see the green boxes (start codons) and red boxes (stop codons) in the three reading frames, you can recognize most uORFs. For an example of what a uORF would look like, see Figure 6.11. At some point you may come across a gene with an intron in the 5’ UTR. If you find an upstream start codon in the 5’ UTR, this likely means that the gene has a uORF. To be sure, you would need to find the positive evidence. This could be complicated by the fact that the intron might split the uORF.


MDM2 has two uORFs in the 5' UTR as highlighted here in the bottom image. Both uORFs by definition are upstream of the predicted start codon as highlighted in the top image. Notice a uORF is defined as a start codon in the 5' UTR that is followed by a stop codon **in the same reading frame**.

Figure 6.11: MDM2 has two uORFs in the 5’ UTR as highlighted here in the bottom image. Both uORFs by definition are upstream of the predicted start codon as highlighted in the top image. Notice a uORF is defined as a start codon in the 5’ UTR that is followed by a stop codon in the same reading frame.


6.7.1 Test Your Understanding


Click the link above to see the TYU questions.

6.8 Homework

If your last name begins with A-M click the link provided to download your homework assignment as a DOCX file or a PDF file. The file includes additional instructions and a template for the MSA, frequency table and consensus sequence.

If your last name begins with N-Z click the link provided to download your homework assignment as a DOCX file or a PDF file. The file includes additional instructions and a template for the MSA, frequency table and consensus sequence.

In brief, create a multiple sequence alignment (MSA) that includes either the 5’ splice site and surrounding sequence of BBS1 or the 3’ splice site and surrounding sequence of BBS1 depending on your last name. HINT: See Figure 6.12 to see how much surrounding sequence to include and how to easily jump from exon to exon. Use the template provided (see file) to create the MSA, frequency table and consensus sequence using IUPAC codes. Finally, use the “WEB LOGO” sequence logo generator to create a sequence logo of your multiple sequence alignment. Upload both pages of your homework as a single PDF to CANVAS by the due date.

Also, see Figure 6.13 for additional tips on how to use “WEB LOGO” to create a sequence logo.


An example of the nucleotides you will include in your multiple sequence alignment (MSA) are boxed. For example, if you are creating an MSA of the 5' splice site and surrounding sequence you will always include the last two nucleotides of the upstream exon and the first 5 nucleotides of the intron (See box at left). If you are creating an MSA of the 3' splice site and surrounding sequence you will always include the last five nucleotides of the intron and the first 2 nucleotides of the downstreanm exon (See box at right). To quickly jump to the next intron-exon junction, click on the open arrowhead on the far right (green arrows point to the open arrowheads). Another way to quickly view 5’ and 3’ splice site sequences is by clicking on the gene schematic in the NCBI Refseq track and then clicking on the 'View details of parts of alignment within browser window' link. No additional instructions given for this method.

Figure 6.12: An example of the nucleotides you will include in your multiple sequence alignment (MSA) are boxed. For example, if you are creating an MSA of the 5’ splice site and surrounding sequence you will always include the last two nucleotides of the upstream exon and the first 5 nucleotides of the intron (See box at left). If you are creating an MSA of the 3’ splice site and surrounding sequence you will always include the last five nucleotides of the intron and the first 2 nucleotides of the downstreanm exon (See box at right). To quickly jump to the next intron-exon junction, click on the open arrowhead on the far right (green arrows point to the open arrowheads). Another way to quickly view 5’ and 3’ splice site sequences is by clicking on the gene schematic in the NCBI Refseq track and then clicking on the ‘View details of parts of alignment within browser window’ link. No additional instructions given for this method.


To use the Weblogo site successfully, follow the instructions as outlined above.

Figure 6.13: To use the Weblogo site successfully, follow the instructions as outlined above.


© 2024, Maria Gallegos. All rights reserved.


  1. disease causing↩︎

  2. polyadenylation↩︎

  3. a large multiprotein complex↩︎

  4. are found on either side of↩︎

  5. For a plus strand gene like BBS1, the nontemplate strand of the genome is the plus strand↩︎

  6. For a plus strand gene, like BBS1, the nontemplate or sense strand is the plus strand.↩︎

  7. The ribosome is comprised of two large complexes (one large and one small) made up of protein and RNA. Translation also requires tRNAs and of course, amino acids!↩︎

  8. Sequence logos were not “invented” until 1990 by Schneider and Stephens↩︎

  9. In molecular biology, reporter genes are those that can easily be visualized by eye when translated into protein. For example, lacZ is a good reporter because its gene product, B-galactosidase, turns one of its substrates blue. Importantly, there is a linear relationship between the amount of blue pigment produced and the amount of B-galactosidase produced. GFP is also a good reporter which can be visualized by fluorescence microscopy. Reporter genes are used for a large variety of purposes but in this study it was used to measure translation efficiency↩︎