6 RNA motifs for processing and translation
An RNA transcript cannot be translated as is. It must be modified first to produce a messenger RNA (mRNA). This modification process is called pre-mRNA processing and includes:
- The addition of a 5’ methyl guanosine cap (5’ cap).
- The removal of introns (splicing).
- RNA cleavage (to terminate transcription) and
- The addition of a poly(A) tail.
In fact, pol II transcription and RNA processing are closely coordinated with mRNA transport out of the nucleus. This ensures that only fully processed mRNAs make it to the cytoplasm for translation. In this chapter, you will learn about RNA sequence motifs that are required for pre-mRNA processing (in the nucleus) and translation (in the cytoplasm). Again, we will focus on BBS1 as our model gene to answer the question: Does BBS1 have all the sequence elements required for these processes? Let’s see. To follow along with the text and to answer “Test Your Understanding” questions, you can use the “Gene Structure” session link.
6.1 The Addition of a 5’ guanosine cap
The first step in pre-mRNA processing involves the addition of a 5’ cap structure. This modified guanosine ribonucleotide becomes covalently attached to the first templated ribonucleotide of all pre-mRNA transcripts once the 5’ end of the nascent[41] RNA chain dissociates from the genomic DNA (Figure 6.1). There are no specific sequences (sequence motifs) required for this event to occur. Interestingly, this modification is not reflected in the sequence databases online. Let’s review the mRNA sequence of a random human gene: SPTBN2 (NM_006946.4). Notice that this mRNA (displayed in FASTA format) begins with an A. This is because this is the first templated ribonucleotide for SPTBN2.
6.2 Splicing
The second step in pre-mRNA processing is the removal (splicing out) of introns and the joining of the flanking exons. Introns are spliced out by the splicing machinery (the spliceosome, a large complex of proteins and RNAs). To initiate splicing, the spliceosome must recognize and bind to the 5’ and 3’ ends of each intron in the RNA transcript[42]. Then the spliceosome pulls the two adjacent exons together to facilitate the excision of the intron and fusion of the two exons (Figure 6.2).
To visualize specific RNA motifs required for splicing, scientists aligned the 5’ and 3’ ends of thousands of introns one on top of the other to create a multiple sequence alignment. This allows patterns of sequence conservation to emerge. The result of one such analysis is displayed below as a sequence logo (Figure 6.3). This particular sequence logo was created by aligning the 5’ and 3’ ends of all human introns below a certain length. Do you see which nucleotides are invariant? This sequence logo clearly illustrates that splicing DOES require specific sequences for the spliceosome to recognize the 5’ and 3’ ends of the intron.
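The idea behind a frequency-based sequence logo can be sketched in a few lines of Python. This is only an illustration, not the method used to produce Figure 6.3: the five "intron ends" below are made-up toy sequences, and the function names are hypothetical. The sketch counts how often each nucleotide appears in each column of an alignment and reports the columns that are invariant.

```python
from collections import Counter

def column_frequencies(aligned):
    """Return one {nucleotide: frequency} dict per alignment column."""
    if len({len(s) for s in aligned}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    n = len(aligned)
    freqs = []
    for column in zip(*aligned):  # walk the alignment column by column
        counts = Counter(column)
        freqs.append({base: counts[base] / n for base in "ACGT"})
    return freqs

def invariant_columns(freqs):
    """Return (column index, base) for columns where one base has frequency 1.0."""
    return [(i, base)
            for i, col in enumerate(freqs)
            for base, f in col.items()
            if f == 1.0]

# Hypothetical toy data: the first six nucleotides of five made-up introns.
toy_5prime_ends = ["GTAAGT", "GTGAGT", "GTAAAA", "GTATGC", "GTCGGT"]

freqs = column_frequencies(toy_5prime_ends)
print(invariant_columns(freqs))  # columns where every sequence agrees
print(freqs[4])                  # a column that is conserved but not invariant
```

In a real analysis the input would be thousands of aligned intron ends, and the height of each letter in the logo would be drawn from frequencies like these.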
6.2.1 Test Your Understanding
Use the sequence logo in Figure 6.3 to answer the following questions (Note: The Y axis for this logo represents frequency):
- Based on what you know so far about how sequence is typically displayed, where is the 5’ splice site? On the left or right side of the figure?
- This sequence logo demonstrates that there are two nucleotides that are found at the 5’ end of nearly 100% of all introns. This is called the 5’ splice site (5’ SS or 5’ splice donor). Based on this sequence logo, what is the sequence of the 5’ splice site?
- This sequence logo demonstrates that there are two nucleotides that are found at the 3’ end of nearly 100% of all introns. This is called the 3’ splice site (3’ SS or 3’ splice acceptor). Based on this sequence logo, what is the sequence of the 3’ splice site?
- Based on this sequence logo, what is the probability of observing a G immediately 5’ (upstream) of the 5’ splice site?
- Based on this sequence logo, what is the probability of observing a T immediately 3’ (downstream) of the 5’ splice site?
- Based on this sequence logo, what is the probability of observing an A immediately 3’ (downstream) of the 3’ splice site?
As you discovered, the first and last two nucleotides of nearly all eukaryotic introns are GT and AG, respectively. The GT (or GU as it is found in the unspliced RNA) is at the 5’ end of the intron and is called the 5’ splice site (SS) or splice donor site. The AG is at the 3’ end of the intron and is called the 3’ splice site (SS) or splice acceptor site. This nearly universal sequence conservation at these two sites suggests the following: 1) the process of splicing arose early during the evolution of the first eukaryote and 2) the GT and AG sequences are required for proper splicing. In fact, we now know that RNA components of the spliceosome hybridize to the 5’ and 3’ splice sites. Moreover, mutations that map to the 5’ and 3’ splice sites in genes known to cause human disease are typically pathogenic[43].
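The GT...AG rule is simple enough to check programmatically. The following is a minimal sketch, assuming the intron is supplied 5’ to 3’ as it appears in the nontemplate strand (or as RNA); the function name and the example sequences are hypothetical.

```python
def has_canonical_splice_sites(intron):
    """True if the intron starts with GT (GU in RNA) and ends with AG."""
    seq = intron.upper().replace("U", "T")  # accept RNA as well as DNA
    return len(seq) >= 4 and seq.startswith("GT") and seq.endswith("AG")

# Made-up example introns (not taken from any real gene).
print(has_canonical_splice_sites("GTAAGTCCCTTTCAG"))  # canonical 5' and 3' ends
print(has_canonical_splice_sites("GCAAGTCCCTTTCAG"))  # 5' splice site altered
```

A check like this only tests the two invariant dinucleotides. It says nothing about the less conserved positions that surround them in the sequence logo.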
6.3 Intron position - Anything is possible
A common misconception about splicing is that introns are positioned between codons. Not true. When introns first “invaded” the genome during early eukaryotic evolution, introns were “blind” to the concept of a codon. In fact, the first intron of BBS1 splits the last codon of exon 1 at position +2[44]. How can you see this for yourself in the UCSC genome browser? First, open the link to the Gene Structure session, then change the “Base Position” track to “Full” (see “Reading Frames” in Module One for a reminder of how to do this). Now zoom way in to view the last few codons of exon 1 (Figure 6.4). They code for the amino acids C-G-A-E-S. (If you do not see the amino acids within the gene prediction track as shown in Figure 6.4, your track settings may have changed. Figure 6.5 describes how to fix that problem.)
Now review the three possible reading frames for this segment of genomic DNA. You can see that the BBS1 amino acids match the +3 reading frame (last row). Now compare the amino acid sequence in the +3 reading frame to the amino acid sequence in BBS1. There is an S in the BBS1 protein after the E, while there is an R in the +3 reading frame! This is because only the first two nucleotides (A-G) of the codon coding for serine come from exon 1. The last nucleotide of this codon is found at the beginning of exon 2! Again, one would say that the intron splits this codon at position +2 (XX-intron-X, where XXX is a single codon).
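The S versus R difference can be worked through with a short sketch. It assumes that intron 1 begins with the canonical GT, so the nucleotide that follows A-G in the unspliced genomic sequence is a G. The first nucleotide of exon 2 is not given in the text, so both nucleotides that would complete a serine codon (T or C) are tried. The codon table below contains only the four codons needed here, and the function name is hypothetical.

```python
# Subset of the standard genetic code: only the codons that begin with AG.
CODON_TABLE = {"AGT": "S", "AGC": "S", "AGA": "R", "AGG": "R"}

def complete_split_codon(exon_tail, next_nt):
    """Join the partial codon at the end of an exon with the next nucleotide."""
    codon = exon_tail + next_nt
    return codon, CODON_TABLE[codon]

exon1_tail = "AG"    # from the text: the first two nucleotides of the split codon
intron_start = "GT"  # assumption: canonical 5' splice site

# Reading straight through the genomic DNA (as the browser's +3 frame does):
print(complete_split_codon(exon1_tail, intron_start[0]))

# After splicing, the third nucleotide comes from exon 2 instead.
# Assumption: it is T or C, the only choices that give serine.
for nt in ("T", "C"):
    print(complete_split_codon(exon1_tail, nt))
```

Under these assumptions the unspliced reading gives AGG (R), while either spliced possibility gives a serine codon, which matches what the browser shows.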
6.3.1 Test Your Understanding
- The last codon in exon 2 of BBS1 is split by intron 2. This split codon normally codes for alanine. What is the sequence of this codon once splicing is complete?
- At what position is the last codon of exon 2 split by intron 2? The +1 or the +2?
- If splicing were to fail (intron 2 is not removed), what amino acid would be added after the “L” in the sequence S-A-C-L at end of exon 2?
- If splicing were to fail (intron 2 is not removed), how many amino acids would be added after the “L” before a stop codon is encountered? (NOTE: This is the same L as the one mentioned in the above question)
- Where is intron 3 of BBS1 positioned (at position +1 of a codon, at position +2 of a codon or between codons)?
- If splicing were to fail (intron 3 is not removed), how many incorrect amino acids would be added until a stop codon is encountered?
- Where is intron 7 of BBS1 positioned (at position +1 of a codon, at position +2 of a codon or between codons)?
- If splicing were to fail (intron 7 is not removed), how many incorrect amino acids would be added until a stop codon is encountered?
In general, a codon can be split by an intron at any position: the +1 position, the +2 position or between codons. In Chapter 2, you also learned that exons within a single gene can jump from one reading frame to another. The act of splicing brings these exons back into a single reading frame. Thus, it is critical that splicing is accurate. One mistake and the reading frame of an mRNA will get out of whack. The protein sequence will change and will most likely be shorter than normal. This explains why splice site mutations are often pathogenic in human disease genes.
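Where an intron falls relative to the codons follows from simple arithmetic: add up the number of coding nucleotides that precede the intron and take the remainder after dividing by 3. The sketch below illustrates this with hypothetical exon lengths (they are not the real BBS1 values), counting only the coding portion of each exon.

```python
from itertools import accumulate

def intron_positions(coding_exon_lengths):
    """Classify each intron as '+1', '+2' or 'between codons'.

    coding_exon_lengths: number of coding nucleotides in each exon, in order.
    The remainder of the running total divided by 3 tells how many
    nucleotides of the current codon precede the intron.
    """
    labels = {0: "between codons", 1: "+1", 2: "+2"}
    running_totals = list(accumulate(coding_exon_lengths))[:-1]  # no intron after the last exon
    return [labels[total % 3] for total in running_totals]

# Hypothetical coding lengths for a four-exon gene.
print(intron_positions([47, 74, 59, 40]))
```

With these toy numbers, 47 coding nucleotides precede intron 1 (15 complete codons plus 2 nucleotides), so intron 1 splits a codon at the +2 position.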
6.4 Cleavage and Polyadenylation
Termination of transcription involves two coupled reactions: 1) RNA cleavage to release the RNA transcript from the RNA polymerase machinery and 2) addition of a poly(A) tail[45] to the 3’ end of the newly released message. Proper cleavage creates a transcript of the correct length. Addition of the poly(A) tail helps transport the mRNA out to the cytoplasm and improves stability and translation efficiency.
Like splicing, cleavage and polyadenylation require specific RNA sequence motifs. These RNA motifs recruit the cleavage and polyadenylation machinery[46] to the RNA. Sequence motifs important for cleavage flank[47] the cleavage site (Figure 6.6a). The best conserved motif is called the Poly Adenylation Signal (or PAS), a sequence consisting of six ribonucleotides positioned 15-30 nucleotides upstream of the cleavage site within the 3’ UTR (NOTE: The cleavage site can be found by looking for the end of the transcript). A sequence logo of aligned PAS sequences from humans and flies illustrates this high level of conservation (Figure 6.6b) and reveals the consensus sequence as it would be found in the nontemplate strand of the genome[48]: AWTAAA (where W = T/A).
6.4.1 Test Your Understanding
The following questions ask you to locate then describe the PAS for a variety of genes. To begin, open the “Gene structure” session and use the Short Match evidence track to search for a PAS consensus sequence (AWTAAA).
- Search the 3’ UTR of BBS1 for a PAS consensus sequence (AWTAAA). Write out your PAS sequence from 5’ to 3’ as it would be found in the nontemplate strand of the genome.
- How many nucleotides are there between the putative BBS1 PAS identified above and the cleavage site? (For the putative position of cleavage, review the gene prediction evidence track.)
- Is the putative BBS1 PAS upstream or downstream of the cleavage site?
- Is the putative BBS1 PAS located where it should be?
- Search the 3’ UTR of ZDHHC24 (long isoform) for a PAS consensus sequence (AWTAAA). Don’t forget! This is a minus strand gene near BBS1. Write out your PAS sequence from 5’ to 3’ as it would be found in the nontemplate strand of the genome.
- How many nucleotides are there between the putative ZDHHC24 (long isoform) PAS and the cleavage site? (For the position of cleavage, review the gene prediction evidence track.)
- Is the putative ZDHHC24 (long isoform) PAS upstream or downstream of the cleavage site?
- Is the putative ZDHHC24 (long isoform) PAS located where it should be?
- Search the 3’ UTR of ZDHHC24 (short isoform) for a PAS consensus sequence (AWTAAA). Write out your PAS sequence from 5’ to 3’ as it would be found in the nontemplate strand of the genome.
- How many nucleotides are there between the putative ZDHHC24 (short isoform) PAS and the cleavage site? (For the position of cleavage, review the gene prediction evidence track.)
- Is the putative ZDHHC24 (short isoform) PAS upstream or downstream of the cleavage site?
- Is the putative ZDHHC24 (short isoform) PAS located where it should be?
- Find the MYH3 gene and check which strand it is on. Then search the 3’ UTR of MYH3 for a PAS consensus sequence (AWTAAA). Write out your PAS sequence from 5’ to 3’ as it would be found in the nontemplate strand of the genome.
- How many nucleotides are there between the putative MYH3 PAS and the cleavage site? (For the position of cleavage, review the gene prediction evidence track.)
- Is the putative MYH3 PAS upstream or downstream of the cleavage site?
- Is the putative MYH3 PAS located where it should be?
Take home message: Splicing, cleavage and polyadenylation are distinct from cap addition. The capping enzyme does not require a specific DNA or RNA sequence for 5’ cap addition to occur. On the other hand, splicing, cleavage and polyadenylation do require specific sequence motifs. These sequences are present in the nontemplate or sense strand of the genome[49] but are recognized by the RNA processing machinery as RNA.
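The PAS search from section 6.4 can also be sketched in code. This is an illustration only, not a replacement for the Short Match track. It assumes the 3’ UTR is supplied as plus strand genomic DNA, that the last nucleotide of the transcript marks the cleavage site, and that "distance" means the number of nucleotides between the end of the motif and the end of the transcript. The function names and the test sequence are hypothetical.

```python
import re

# AWTAAA, where W = A or T. The lookahead lets overlapping matches be reported.
PAS = re.compile(r"(?=(A[AT]TAAA))")

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]

def find_pas(utr3, strand="+"):
    """Find PAS consensus matches in a 3' UTR given as plus strand genomic DNA.

    For a minus strand gene the nontemplate strand is the minus strand,
    so the sequence is reverse complemented before searching.
    Returns (motif, gap, in_window) tuples, where gap is the number of
    nucleotides between the motif and the end of the transcript (assumed
    cleavage site) and in_window is True when 15 <= gap <= 30.
    """
    seq = utr3.upper()
    if strand == "-":
        seq = reverse_complement(seq)
    hits = []
    for match in PAS.finditer(seq):
        motif = match.group(1)
        gap = len(seq) - (match.start() + len(motif))
        hits.append((motif, gap, 15 <= gap <= 30))
    return hits

# Made-up 3' UTR: a PAS followed by 20 nucleotides before the transcript ends.
toy_utr = "CCGG" + "AATAAA" + "C" * 20
print(find_pas(toy_utr))

# The same toy gene placed on the minus strand (input is still plus strand DNA).
print(find_pas(reverse_complement(toy_utr), strand="-"))
```

Both calls report the same motif, written 5’ to 3’ as it appears in the nontemplate strand, which is the orientation the questions above ask for.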
6.5 Translation Initiation
Once processed, mRNA exits the nucleus to be used as a template to create a polypeptide via a process called translation. Translation is mediated in part by the ribosome[50]. In eukaryotes, the small ribosomal subunit and associated factors first assemble at the mRNA cap structure then scan along the 5’ UTR until a start codon is found. Choosing the correct start codon is critical as it determines the reading frame and thus the polypeptide sequence! During the scanning process, ribosomal proteins within the small ribosomal subunit search for, then interact with, specific nucleotides both upstream and downstream of the start codon (Llacer et al. 2018). In other words, context matters.
The importance of context in start codon selection by the ribosome was first suggested by Marilyn Kozak. She aligned 699 well-characterized start codons derived from a set of vertebrate genes to search for sequence conservation upstream and downstream of the ATG (Kozak 1987). She then converted this large multiple sequence alignment into a nucleotide frequency table[51] (Figure 6.7A). The most frequent nucleotides she found at each position from -6 to +4 are called the Kozak consensus sequence: GCCACCATGG.
Now we have sequence for entire genomes! In 2015, Cenik et al. created a sequence logo for a similar region surrounding the start codon for all protein coding genes in the human genome (Figure 6.7B). The results are surprisingly similar. Again, the fact that sequence conservation exists suggests that the sequence surrounding the ATG plays an important role in translation initiation.
A recent study in zebrafish supports this hypothesis. In this study, Grzegorski et al. inserted an optimal translation initiation sequence (GCCACCATGG) into a reporter gene[52] then measured the efficiency of translation by visualizing the amount of reporter gene product made. This was the control. They then compared the expression of this control reporter gene to ones with other translation initiation sequences (i.e. GCAAACATGG, GCAGTCATGG, CTTTCTATGC or CGGTGTATGC). They discovered that a reporter gene fused to a translational initiation sequence found less frequently in the genome than the canonical Kozak consensus sequence (GCCACCATGG) was translated to lower levels than the control. By contrast, a reporter gene containing a translational initiation sequence found more frequently in the genome than the Kozak consensus was translated to higher levels than the control. Their results are shown in Figure 6.8.
Additional studies found that nucleotides in two highly conserved positions exert the strongest effect: a G residue following the ATG codon (position +4) and a purine three nucleotides upstream (position -3). Thus, overall, a good start codon is one found in the following context: RNNATGG (where R is a purine and N is any nucleotide), whereas an adequate start codon is one found in the context RNNATGY or YNNATGG (where Y is a pyrimidine) (Kozak 1997).
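These context rules translate directly into a small classifier. The sketch below is a hedged illustration, with a hypothetical function name, that looks only at position -3 and position +4 and applies the three patterns exactly as written above. It assumes the sequence is given 5’ to 3’ as nontemplate strand DNA and that the caller supplies the index of the A in the ATG.

```python
PURINES = set("AG")
PYRIMIDINES = set("CT")

def kozak_context(seq, atg_index):
    """Classify a start codon as 'good', 'adequate' or 'neither'.

    seq: DNA, 5' to 3', nontemplate strand.
    atg_index: 0-based index of the A of the ATG (position +1).
    """
    seq = seq.upper()
    if atg_index < 3 or atg_index + 3 >= len(seq):
        raise ValueError("need 3 nucleotides upstream and 1 downstream of the ATG")
    if seq[atg_index:atg_index + 3] != "ATG":
        raise ValueError("no ATG at the given index")
    minus3 = seq[atg_index - 3]  # position -3 (there is no position 0)
    plus4 = seq[atg_index + 3]   # position +4, right after the ATG
    if minus3 in PURINES and plus4 == "G":
        return "good"            # RNNATGG
    if minus3 in PURINES and plus4 in PYRIMIDINES:
        return "adequate"        # RNNATGY
    if minus3 in PYRIMIDINES and plus4 == "G":
        return "adequate"        # YNNATGG
    # Everything else, including RNNATGA, matches none of the three patterns.
    return "neither"

print(kozak_context("GCCACCATGG", 6))  # the Kozak consensus itself
print(kozak_context("CTTTCTATGC", 6))  # one of the weaker sequences from the zebrafish study
```

Applied to the Kozak consensus GCCACCATGG, position -3 is an A (a purine) and position +4 is a G, so the start codon is classified as good.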
6.5.1 Test Your Understanding
The following questions ask you to evaluate the predicted start codon for a variety of genes. To identify the predicted start codon, review the gene prediction track.
- Is the predicted BBS1 start codon found in a good context (RNNATGG), adequate context (RNNATGY or YNNATGG) or neither?
- Now, find the DPP3 gene just upstream of BBS1. Is the predicted DPP3 start codon found in a good context (RNNATGG), adequate context (RNNATGY or YNNATGG) or neither?
- Now, find the ZDHHC24 gene just downstream of BBS1. Is the predicted ZDHHC24 start codon found in a good context (RNNATGG), adequate context (RNNATGY or YNNATGG) or neither?
- Now, find the MYH3 gene (use the search window). Is the predicted MYH3 start codon found in a good context (RNNATGG), adequate context (RNNATGY or YNNATGG) or neither?
6.6 uORFs
uORF stands for upstream Open Reading Frame (pronounced you-orf). A gene is said to have a uORF if it has a start codon in the putative 5’ UTR followed by an in-frame stop codon that precedes the end of the main coding sequence or so-called primary ORF (Figure 6.9).
uORFs, in general, have the capacity to reduce translation initiation from the bona fide start codon (Calvo et al. 2009). Since they are found in a large number of protein coding genes (nearly 50% of human genes are thought to have a uORF), this is thought to be “by design”. In other words, it is thought that the presence of a uORF has an important purpose and is subject to positive selection during evolution as a way to keep gene expression levels appropriately low. Two examples are described in Figure 6.10.
6.7 Recognizing uORFs
To determine if a gene of interest (GOI) has a uORF in its 5’ UTR, you need to focus your genome browser on the 5’ UTR of your GOI, then make sure your Base Position track is set to full. So long as you are zoomed in close enough to see the green boxes (start codons) and red boxes (stop codons) in the three reading frames, you can recognize uORFs. For an example of what a uORF would look like, see Figure 6.11. At some point you may come across a gene with an intron in the 5’ UTR (it’s rare). Just keep in mind that these 5’ UTRs are more difficult to assess because the uORF may switch reading frames if it spans the intron.
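The same search can be sketched in code. This minimal example assumes you already have the spliced mRNA sequence (so introns in the 5’ UTR are not an issue) and the index of the main start codon. It reports every ATG in the 5’ UTR that is followed by an in-frame stop codon, and skips upstream ATGs that run in frame into the main ATG without a stop, since those would share the stop codon of the primary ORF. The function name and the toy sequences are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_uorfs(mrna, main_start):
    """Return (start, end) index pairs for uORFs in the 5' UTR.

    mrna: spliced transcript, 5' to 3' (DNA or RNA letters).
    main_start: 0-based index of the A of the main start codon.
    end is the index just past the uORF's stop codon.
    """
    seq = mrna.upper().replace("U", "T")
    uorfs = []
    for start in range(0, main_start - 2):  # the ATG must lie fully within the 5' UTR
        if seq[start:start + 3] != "ATG":
            continue
        for i in range(start + 3, len(seq) - 2, 3):  # step codon by codon
            if i == main_start:
                break  # in frame with the main ORF and no stop in between
            if seq[i:i + 3] in STOP_CODONS:
                uorfs.append((start, i + 3))
                break
    return uorfs

# Toy transcript: 5' UTR with a short uORF (ATG GCC TAA), then the main ORF.
toy_mrna = "GGC" + "ATGGCCTAA" + "CC" + "ATGGCTTGA"
print(find_uorfs(toy_mrna, main_start=14))
```

Once a uORF is found, the context of its ATG can be judged against the good (RNNATGG) and adequate (RNNATGY or YNNATGG) patterns in the same way as for the main start codon.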
6.7.1 Test Your Understanding
- Does BBS1 have a uORF in the 5’ UTR?
- If yes, is the uORF found in a good context (RNNATGG), adequate context (RNNATGY or YNNATGG) or neither?
- Search for the gene, TAS2R3. It has a uORF in its 5’ UTR that is known to impact translation of the primary ORF (Calvo et al. 2009). Is the ATG of the uORF found in a good context (RNNATGG), adequate context (RNNATGY or YNNATGG) or neither?
- Search for the gene, SFXN3. It has a uORF in its 5’ UTR that is known to impact translation of the primary ORF (Calvo et al. 2009). Is the ATG of the uORF found in a good context (RNNATGG), adequate context (RNNATGY or YNNATGG) or neither?
- Search for the gene, ADH5. It has a uORF in its 5’ UTR that is known to impact translation of the primary ORF (Calvo et al. 2009). There are two start codons in-frame with a downstream stop codon within the 5’ UTR. Evaluate the context of these two start codons to determine if these start codons are found in a good context (RNNATGG), adequate context (RNNATGY or YNNATGG) or neither.
- Search for the gene, UCP2. It has a uORF in its 5’ UTR that is known to impact translation of the primary ORF (Calvo et al. 2009). There are three start codons in frame with a downstream stop codon. Evaluate the context of these three start codons to determine if these start codons are found in a good context (RNNATGG), adequate context (RNNATGY or YNNATGG) or neither.
6.8 For Discussion
- How did you go about identifying a uORF when you answered your “Test Your Understanding” questions above?
6.9 Homework
Create a multiple sequence alignment (MSA), a consensus sequence and a sequence logo.
Click this link to download a DOCX file or a PDF file for instructions and a template of the table. Also, review the figures below for additional information on how to identify the sequence you will be retrieving to create your MSA and how to use the WEB LOGO website to create a sequence logo.
© 2023, Maria Gallegos. All rights reserved.
41. new, just coming into existence
42. also known as splice site junctions
43. disease causing
44. A codon is 3 nucleotides in length. When an intron splits a codon at the +2 position, this means that the intron is positioned after the second nucleotide of the codon. An intron can split a codon at the +1 or the +2 position. When an intron is positioned between codons, one can say it is at the 0 or the +3 position. A decision has to be made about which of the two should be used.
45. polyadenylation
46. a large multiprotein complex
47. are found on either side of
48. For a plus strand gene like BBS1, the nontemplate strand of the genome is the plus strand.
49. For a plus strand gene, like BBS1, the nontemplate or sense strand is the plus strand.
50. The ribosome is composed of two complexes (one large and one small) made up of protein and RNA. Translation also requires tRNAs and, of course, amino acids!
51. Sequence logos were not “invented” until 1990 by Schneider and Stephens.
52. In molecular biology, reporter genes are those that can easily be visualized by eye when translated into protein. For example, lacZ is a good reporter because its gene product, β-galactosidase, turns one of its substrates blue. Importantly, there is a linear relationship between the amount of blue pigment produced and the amount of β-galactosidase produced. GFP is also a good reporter, which can be visualized by fluorescence microscopy. Reporter genes are used for a large variety of purposes, but in this study they were used to measure translation efficiency.