2 Understanding Gene Structure

To follow along with the text and to answer the “Test Your Understanding” questions click the BBS1 session link. You can also use this link as a starting point to view the gene structure for any gene of interest, simply by typing the name of your gene of interest in the search window.

2.1 Exons and Introns

During gene expression, one continuous strand of genomic DNA is used as a template to produce a single-stranded RNA transcript. This process is called transcription. The sequence of this RNA transcript is identical to the nontemplate strand of genomic DNA (except that RNA has uracil in place of thymine). Before the transcript is exported to the cytoplasm, it is processed, which includes the removal of internal segments called introns (or intervening regions). The RNA segments that remain are called exons (or expressed regions). The exons are then spliced back together in their original order. To view the exon/intron organization of BBS1, click on the BBS1 session link and review the schematic within the NCBI RefSeq Gene Prediction Evidence Track. Notice that the BBS1 gene consists of “boxes” connected by lines (Figure 2.1). The boxes represent exons (two examples are highlighted with a red asterisk). The lines represent introns (see bracket). Now, hover your mouse over various parts of the gene schematic within the Browser window. If you pause long enough, a popup window will appear indicating which intron or exon you are pointing to! The popup window also reveals the location of the beginning of the gene (i.e. the location of exon 1). This is the part of the gene that is transcribed first. For BBS1, exon 1 is located on the left side of the gene schematic.

Introns also have the same polarity. For BBS1, the beginning of each intron is also on the left side. This is the splice donor end of the intron. The other end contains the splice acceptor (you will learn more about intron structure in chapter 6).
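The transcription and splicing steps described in this section can be sketched in a few lines of Python. This is only a toy illustration: the sequences are invented (they are not BBS1), and the function names `transcribe` and `splice` are hypothetical helpers, not part of any genome browser or library.

```python
# Toy example: invented sequences, not real BBS1 sequence.

def transcribe(nontemplate_dna: str) -> str:
    """Return RNA matching the nontemplate strand, with U in place of T."""
    return nontemplate_dna.upper().replace("T", "U")

def splice(pre_mrna: str, exons: list[tuple[int, int]]) -> str:
    """Join exon segments in order; the introns between them are dropped.

    Each exon is a (start, end) pair, 0-based and end-exclusive.
    """
    return "".join(pre_mrna[start:end] for start, end in sorted(exons))

exon1, intron1, exon2 = "GGTTAC", "GTAAGTCCAG", "ATGGCC"
nontemplate = exon1 + intron1 + exon2
exons = [(0, len(exon1)), (len(exon1) + len(intron1), len(nontemplate))]

pre_mrna = transcribe(nontemplate)   # unspliced transcript
mrna = splice(pre_mrna, exons)       # exons joined, intron removed
print(pre_mrna)
print(mrna)
```

The spliced transcript is shorter than the unspliced transcript by exactly the length of the intron.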


Figure 2.1: Understanding the gene prediction evidence track schematic. Boxes represent exons (two examples are bracketed and highlighted with a red asterisk). The lines represent introns (one example is bracketed and highlighted with a purple asterisk).


2.1.1 Test your understanding

  • How many exons does BBS1 contain?
  • How many introns does BBS1 contain?
  • Which exon is the longest (as judged by eye)?
  • Find the splice donor end of intron 7. Which exon is it adjacent to?
  • Find SOD1 in the UCSC Genome Browser. Which SOD1 intron is the shortest (as judged by eye)?

2.2 Coding and Noncoding Exons

The spliced transcript or messenger RNA (mRNA) is used as a template to create a polypeptide. This process is called translation and is mediated in part by the ribosome. Ribosomes “read” each codon within the mRNA to create a polypeptide. That said, an mRNA is more than just codons! In fact, codons only occupy the central region of an mRNA. The ends of the mRNA are noncoding regions or untranslated regions (UTRs). The noncoding region at the beginning of the gene (5’ end) is called the 5’ untranslated region (5’ UTR) while the noncoding region at the end of the gene (3’ end) is called the 3’ untranslated region (3’ UTR). Importantly, the transitions from noncoding to coding and back to noncoding do not necessarily coincide with exon boundaries. Thus, exons can be categorized as coding exons, noncoding exons or exons of mixed character. In fact, exon 1 of BBS1 is part noncoding (5’ UTR) and part coding. The same is true for exon 17 (the last exon).
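One way to make the three exon categories concrete is to compare each exon with the span of the coding region. The sketch below assumes a made-up four-exon transcript; the coordinates and the helper name `classify_exon` are hypothetical and do not describe BBS1.

```python
def classify_exon(exon, cds):
    """Label an exon as 'noncoding', 'coding' or 'mixed'.

    exon and cds are (start, end) pairs on the same coordinate axis,
    0-based and end-exclusive. cds runs from the first base of the
    start codon to the last base of the stop codon.
    """
    start, end = exon
    cds_start, cds_end = cds
    # Number of exon bases that fall inside the coding region.
    overlap = max(0, min(end, cds_end) - max(start, cds_start))
    if overlap == 0:
        return "noncoding"
    if overlap == end - start:
        return "coding"
    return "mixed"

# Hypothetical transcript: coordinates are invented for illustration.
exons = [(0, 50), (200, 320), (500, 650), (900, 1100)]
cds = (260, 960)
labels = [classify_exon(exon, cds) for exon in exons]
print(labels)
```

In this toy transcript the first exon is entirely 5’ UTR, the second and last exons are of mixed character, and the third is coding only.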

Let’s see how the noncoding regions are illustrated in the UCSC Genome Browser. “Drag-and-Select” exon 1 of BBS1 until your window looks similar to Figure 2.2. At this zoom level, you should be able to see the amino acid sequence that exon 1 codes for (Figure 2.2: M-A-A-A-S-S-S-D-S-D-A-C-G-A-E-S). Notice that the amino acid sequence does not start at the beginning of exon 1. Moreover, the noncoding portion of exon 1 is not drawn as thick as the coding portion of exon 1. This thinner portion of exon 1 is the 5’ untranslated region (5’ UTR). To confirm that you are in the right place, check that the BBS1 5’ UTR begins with the sequence G-G-T-T etc.


Figure 2.2: Exon 1 (red box) of BBS1 is part noncoding exon (blue box) and part coding exon (green box)


2.2.1 Test your understanding

  • How long is the 5’ UTR of BBS1 (manually count the number of nucleotides)?
  • Consider this arrow: <— representing a single strand of DNA. Is the 3’ end of this DNA on the right or left?
  • Search for the gene, MDM4 in the UCSC Genome Browser. What type of exon is exon 1 (noncoding, coding or both - of mixed character)?
  • Search for the gene, MDM4 in the UCSC Genome Browser. What type of exon is exon 2 (noncoding, coding or both - of mixed character)?

To view the entire BBS1 gene again, click on the named session link for BBS1 Gene Structure. Notice that the box representing the last exon of BBS1 also consists of both a thick and a thin portion. The thin portion at the end of the BBS1 gene is also noncoding. Since this is the 3’ end of BBS1, this portion of BBS1 is called the 3’ untranslated region or 3’ UTR. Now “Drag-and-Select” the coding portion of the last exon. If your view does not look like Figure 2.3, you may have selected the entire last exon when you should have selected just the coding portion. See the difference? Now you can clearly see that the amino acid sequence stops before the end of the last exon, where the exon schematic transitions from thick to thin. NOTE: The stop codon is represented by an asterisk. In many genes, the first and last exons consist of both noncoding and coding sequence while the central exons are coding only. As you explore more of the genome, you will see there are exceptions. Sometimes the first few exons and/or the last few exons are completely noncoding!


Figure 2.3: The coding portion of the last exon of BBS1. The first four nucleotides of the 3’UTR are G-A-C-C, written with the 5’ end on the left.


2.3 Coding Exons Determine Protein Sequence

A polypeptide is a sequence of amino acids linked together by peptide bonds. The amino acid sequence is determined by the order of codons present in the central portion of a mature mRNA, the coding exons. Each codon consists of three consecutive bases. Since there are four nucleotides total (G, A, T and C), the equation, 4^3, indicates that there are a total of 64 possible codons. Most of the 64 codons code for an amino acid. Three do not and therefore act as STOP signals also known as termination codons. Figure 2.4 is one way to display the genetic code. In this graphic, codons are read from the center out to the periphery (from 5’ to 3’). For example, G-C-A codes for Alanine. Ala and A are the standard three letter and single letter codes, respectively. In this genetic code, Uracil replaces Thymine as ribosomes scan mRNA molecules for codons, not DNA.
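The counts in this paragraph can be checked with a short Python sketch that builds the standard genetic code as a lookup table. It is written with the DNA alphabet (T rather than U) because that is what the browser displays; the variable names are arbitrary choices for this example.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

stop_codons = sorted(c for c, aa in CODON_TABLE.items() if aa == "*")

print(len(CODON_TABLE))       # number of possible codons, 4**3
print(stop_codons)            # codons that act as STOP signals
print(CODON_TABLE["GCA"])     # single letter code for the G-C-A example
```

Replacing every T with U in the keys gives the same table in the RNA alphabet used by Figure 2.4.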


Figure 2.4: Image from Wikipedia (https://commons.wikimedia.org/wiki/File:Aminoacids_table.svg)


Use whatever method you’d like to get back to a close-up view of exon 1 of BBS1 (Figure 2.2). Notice that the first amino acid of BBS1 is methionine (M) and is shaded in green. In fact, the first amino acid of most proteins is methionine because the A-U-G codon is the start codon in most cases. You can confirm that this is true for BBS1 by looking at the three nucleotides positioned directly above the “M” (you should see A-T-G, not A-U-G, since you are viewing genomic DNA). The second codon is G-C-C. The third codon is G-C-T. Both code for Alanine (Ala, A).

In BBS1, it is clear that the codons are read from left to right in non-overlapping groups of three nucleotides. Since the codons are found in the top strand, you can also say that these codons are read from 5’ to 3’. In fact, codons are always read from 5’ to 3’ in non-overlapping groups of three nucleotides. Now let’s re-examine the last coding exon of BBS1 (Figure 2.3). Notice that the stop codon at the end of BBS1 is represented by an “*” and highlighted in red.
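Reading in non-overlapping groups of three can be sketched as follows. The nine nucleotides are the first three BBS1 codons named in the text; the lookup table is deliberately partial (only those three codons), and `codons` is a hypothetical helper written for this example.

```python
# Only the three codons named in the text; a full table holds all 64.
PARTIAL_TABLE = {"ATG": "M", "GCC": "A", "GCT": "A"}

def codons(seq: str, offset: int = 0) -> list[str]:
    """Split seq into non-overlapping triplets, read 5' to 3' from offset."""
    return [seq[i:i + 3] for i in range(offset, len(seq) - 2, 3)]

coding_start = "ATGGCCGCT"    # first three codons of BBS1, per the text
triplets = codons(coding_start)
peptide = "".join(PARTIAL_TABLE[c] for c in triplets)

print(triplets)
print(peptide)
print(codons(coding_start, offset=1))   # starting one base later gives different triplets
```

Shifting the starting point by a single nucleotide produces an entirely different set of triplets, which is why the position of the start codon matters so much.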

2.3.1 Test your understanding

The codons below are written as a DNA sequence from 5’ on the left to 3’ on the right (5’ to 3’). Use the genetic code found in Figure 2.4 to answer the top three questions (open in a new tab). Use the UCSC genome browser to answer the remaining questions.

  • What amino acid does the codon, CGG, code for?
  • What amino acid does the codon, TTC, code for?
  • What amino acid does the codon, CAT, code for?
  • Write out the DNA sequence of the 5th codon for BBS1
  • What does the 5th codon of BBS1 code for?
  • Write out the DNA sequence of the 8th codon for BBS1
  • What does the 8th codon of BBS1 code for?
  • Write out the DNA sequence of the stop codon for BBS1
  • EXTRA CREDIT! Write out the DNA sequence of the 16th codon

2.4 Top strand and bottom strand genes

We learned in the last section that the codons for BBS1 can be read in the top strand of the genome (+ strand). BBS1 is what we call a top strand gene. This means that the bottom strand of the genome (- strand) is the template strand for BBS1 and the top strand is the sense strand. Sometimes the sense strand is also called the nontemplate strand. The bottom line is that the sequence of the sense strand of a gene is identical to the sequence of the transcript (except, of course, that Uracil replaces Thymine).

That said, the sense strand of a gene is not ALWAYS the top strand of the genome in a genome browser. To easily determine if a gene is a top strand gene or a bottom strand gene, you need to look for so-called “orientation wedges”. Click on the Named Session for BBS1 Gene Structure, then zoom out 3X to look at BBS1 and its neighboring genes. Your NCBI RefSeq gene track should look something like Figure 2.5.


Figure 2.5: BBS1 and its neighboring genes. BBS1 is flanked by DPP3 on the left and ZDHHC24 on the right. DPP3 is upstream. ZDHHC24 is downstream.


Now “Drag-and-Select” the first half of BBS1 to zoom in somewhat. Notice the “orientation wedges” spaced regularly along all introns (within the Browser and in Figure 2.6). They are either oriented to the right (red box, Figure 2.6) or to the left (green box, Figure 2.6). These “orientation wedges” indicate the orientation of the gene on the chromosome. They tell you where the 5’ end of the gene is and they indicate which strand of chromosomal DNA is the sense strand for a given gene (top or bottom, + or -).


Figure 2.6: The orientation wedges are visible within introns. When they point to the right (red box), the beginning of the gene is on the left and the gene is a top strand gene (plus strand). When they point to the left (green box), the beginning of the gene is on the right and the gene is a bottom strand gene (minus strand).


When the orientation wedges point to the right, “) ) ) )” (e.g. see BBS1),

  1. The beginning of the gene is on the left.
  2. The top strand (+ strand) is the sense strand (nontemplate strand).
  3. The bottom strand (- strand) is the template strand.

When the orientation wedges point to the left, “( ( ( (”, the reverse is true (e.g. see ZDHHC24),

  1. The beginning of the gene is on the right.
  2. The top strand (+ strand) is the template strand.
  3. The bottom strand (- strand) is the sense strand (nontemplate strand).

These orientation wedges not only represent the orientation of the gene on the chromosome but also the direction of transcription. The beginning of a gene (the 5’ end) is always transcribed first. RNA polymerase, the enzyme that catalyzes RNA transcription, moves from the beginning of the gene towards the end of the gene. Thus, for BBS1, RNA polymerase moves from left to right using the bottom strand as a template. For ZDHHC24, RNA polymerase moves from right to left using the top strand as a template!
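The rules above amount to a two-row lookup, which can be written as a small Python function. The function name and dictionary keys are hypothetical; only the two gene examples (BBS1 and ZDHHC24) come from the text.

```python
def describe_orientation(wedge_direction: str) -> dict:
    """Summarize what a wedge direction ('right' or 'left') implies for a gene."""
    if wedge_direction == "right":
        return {
            "gene_begins_on": "left",
            "sense_strand": "top (+)",
            "template_strand": "bottom (-)",
            "polymerase_travels": "left to right",
        }
    if wedge_direction == "left":
        return {
            "gene_begins_on": "right",
            "sense_strand": "bottom (-)",
            "template_strand": "top (+)",
            "polymerase_travels": "right to left",
        }
    raise ValueError("wedge_direction must be 'right' or 'left'")

# Wedge directions for the two genes discussed in the text.
for gene, wedges in (("BBS1", "right"), ("ZDHHC24", "left")):
    print(gene, describe_orientation(wedges))
```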

2.4.1 Test your understanding

To answer the following set of questions, go to my BBS1 Gene Structure named session then zoom out 10X.

  • Which strand (top or bottom) is the sense strand for ACTN3?
  • Where is the beginning of the ACTN3 gene as displayed in the browser, on the left or right?
  • Which direction does RNA polymerase travel along genomic DNA during transcription of ACTN3 (as displayed in the browser), to the left or to the right?
  • Which strand (top or bottom) is the sense strand for DPP3?
  • Where is the beginning of the DPP3 gene as displayed in the browser, on the left or right?
  • Which direction does RNA polymerase travel along DPP3 during transcription (as displayed in the browser), to the left or to the right?
  • Which strand (top or bottom) is the sense strand for CTSF?
  • Where is the beginning of the CTSF gene, on the left or right as displayed in the browser?
  • Which direction does RNA polymerase travel along CTSF during transcription (as displayed in the browser), to the left or to the right?

Now let’s zoom into the beginning of the coding portion of exon 1 for ZDHHC24: Click on my named session for BBS1, then zoom out 3X to find the beginning of ZDHHC24. “Drag-and-Select” as many times as necessary to zoom into the first half of the first coding exon for ZDHHC24. Your NCBI RefSeq gene track should look something like Figure 2.7. If not, you may have zoomed into the first half of exon 1, which is noncoding. Or maybe you are at the wrong end of the gene! Make sure you understand what went wrong and how to find the right spot.


Figure 2.7: This is the first half of the first coding exon of ZDHHC24. The noncoding portion of exon 1 extends to the right this time because ZDHHC24 is a bottom strand (or minus strand) gene. The red oval highlights the strand you are currently viewing. Since the arrow points to the right, the top strand (5’ on the left) is displayed. You can toggle this arrow with your mouse.


Figure 2.8: Again, this is the first half of the first coding exon of ZDHHC24. The red oval highlights the strand you are currently viewing. Since the arrow points to the left, the bottom strand (5’ on the right) is displayed. Notice that the DNA sequence is gray. Compare to the top image. This is a visual reminder of which strand you are viewing.


Look at the three nucleotides centered above the methionine (colored in green on the right side of Figure 2.7). Notice that the nucleotides, read from left to right, are NOT A-T-G as you might expect but instead C-A-T! This is because you are “reading” the top strand of DNA, and ZDHHC24 is a bottom strand gene. To view the bottom strand of DNA, click on the far left arrow within the Base Position Track (see the red oval, Figure 2.7). Notice that the orientation of the arrow flips and the genome sequence in your “Base Position” Track turns light gray (red oval, Figure 2.8). Now there is an A-T-G directly above the methionine (M), but instead of reading it from left to right, you must read it from right to left! This is because the bottom strand is oriented with its 5’ end on the right. You ALWAYS read and write sequence from 5’ to 3’.
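The relationship between the C-A-T seen on the top strand and the A-T-G of the bottom strand is a reverse complement, which can be sketched in Python. The helper name `reverse_complement` is simply a label chosen for this example.

```python
# Base-pairing rules: A pairs with T, C pairs with G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    """Return the opposite strand, written 5' to 3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]

top_strand_triplet = "CAT"    # what the top strand shows above the ZDHHC24 methionine
bottom_strand_codon = reverse_complement(top_strand_triplet)
print(bottom_strand_codon)
```

Complementing gives the paired bases and reversing puts the result in the 5’ to 3’ direction, so the top strand triplet C-A-T corresponds to the start codon A-T-G on the bottom strand.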

2.4.2 Test your understanding

  • Write out the DNA sequence of the 3rd codon for ZDHHC24 (Always write sequence from 5’ to 3’).
  • Write out the DNA sequence of the 5th codon for ZDHHC24.
  • Write out the DNA sequence of the stop codon used by the short mRNA form of ZDHHC24.*
  • Write out the DNA sequence of the stop codon used by the long mRNA form of ZDHHC24 (Notice: the stop codon is not always in the last exon!).*

    HINT: To confirm that you have chosen your codon sequence correctly, use the genetic code to confirm that the codon you chose codes for the amino acid found at that position in the gene (NOTE: genetic code tables are usually written with U in place of T).

Don’t forget, DNA is double-stranded. The two DNA strands anneal (hybridize) in an antiparallel orientation. The 5’ end of the top strand is on the left. The 5’ end of the bottom strand is on the right. Some genes, like BBS1, create an RNA transcript with a sequence that matches the top strand. Other genes, like ZDHHC24, create an RNA transcript with a sequence that matches the bottom strand.

2.5 One gene, many splice variants

A single RNA transcript is sometimes spliced in a variety of different ways to produce more than one unique mRNA. This is called alternative splicing and each mRNA produced is called a splice variant. Since each splice variant has the capacity to produce a unique protein or isoform, the roughly 20,000 protein-coding genes present in the human genome can actually produce many more than 20,000 unique proteins!

To see this for yourself, click on the BBS1 session link, then zoom out exactly 10X to explore a larger section of chromosome 11. Notice that more than one gene is present in the browser window. Again, the name of each gene is given on the left side of each gene/transcript schematic. Regions of the genome that are transcribed to produce multiple mRNAs are indicated by a series of box/line schematics that are stacked vertically. How do I know these stacks represent a single gene? Each graphical representation in the stack has the same gene name (Figure 2.9).


Figure 2.9: The DPP3 gene produces three unique mRNAs. See the red box. One obvious difference is highlighted with a red arrow. The 2nd and 3rd splice forms have an exon not included in the topmost splice form listed.


Alternative splicing is only one way that multiple unique mRNAs are produced from a single region of DNA (a single gene). There are three main mechanisms: 1) use of an alternative transcriptional start site (Figure 2.10), 2) use of an alternative transcriptional termination site (Figure 2.11) and 3) alternative splicing. For example, DPP3 begins and ends transcription at the same location, but the RNA transcript that is produced is spliced in three distinct ways to produce three alternative splice forms (Figure 2.9).
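The three mechanisms can be told apart by comparing the exon coordinates of two transcripts from the same gene. The sketch below is a simplification that assumes a top strand gene (coordinates increase from 5’ to 3’) and uses invented coordinates; `compare_transcripts` is a hypothetical helper, not a browser feature.

```python
def compare_transcripts(a, b):
    """Return the mechanisms that could explain how transcripts a and b differ.

    Each transcript is a list of exon (start, end) pairs sorted 5' to 3'.
    """
    mechanisms = set()
    if a[0][0] != b[0][0]:
        mechanisms.add("alternative transcription start site")
    if a[-1][1] != b[-1][1]:
        mechanisms.add("alternative transcription termination site")

    def introns(exons):
        # Each intron runs from the end of one exon to the start of the next.
        return [(exons[i][1], exons[i + 1][0]) for i in range(len(exons) - 1)]

    if introns(a) != introns(b):
        mechanisms.add("alternative splicing")
    return mechanisms

# Invented coordinates for illustration only.
full_length = [(0, 100), (300, 400), (600, 800)]
exon_skipped = [(0, 100), (600, 800)]
late_start = [(50, 100), (300, 400), (600, 800)]

print(compare_transcripts(full_length, exon_skipped))
print(compare_transcripts(full_length, late_start))
```

Two transcripts can differ by more than one mechanism at once, in which case the returned set contains several entries.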


Figure 2.10: Transcription initiation for the LGALS12 gene can be found at two distinct places according to the NCBI RefSeq gene database. The red arrow points to the two transcription start sites


Figure 2.11: Transcription termination for the TTC17 gene occurs at two different sites according to the NCBI RefSeq gene database. Each is highlighted with a red arrow


Again, multiple mRNA isoforms from the same gene can (but don’t always) produce multiple, unique protein isoforms. In some cases, these unique proteins differ enough to function distinctly. For a hilarious illustration of this concept see Figure 2.12.


Figure 2.12: Concept/Art by Allan James from the University of Glasgow


2.5.1 Test Your Understanding

  • How many unique mRNAs does PELI3 produce?
  • Are the various mRNA forms produced by PELI3 the result of alternative splicing, alternative transcription initiation, alternative transcription termination or some combination?
  • How many unique mRNAs does MRPL11 produce?
  • Are the various mRNA forms the result of alternative splicing, alternative transcription initiation, alternative transcription termination or some combination?
  • How many unique mRNAs does ACTN3 produce?
  • Are the various mRNA forms the result of alternative splicing, alternative transcription initiation, alternative transcription termination or some combination?
  • How are the various mRNA forms produced by the ZDHHC24 gene? Do they result from alternative splicing, transcription initiation, transcription termination or a combination of these events?

2.6 Reading Frames

A reading frame begins with a start codon. This is the first codon “read” by the ribosome. All subsequent codons are read in nonoverlapping groups of three. Thus, the start codon “sets” the reading frame. A reading frame ends only when a stop codon is encountered that is in the same reading frame as the start codon. This is the last codon “read” by the ribosome.
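This definition translates directly into a short search: find a start codon, then step forward three bases at a time until a stop codon appears in the same frame. The sketch below assumes the first A-T-G in the sequence is the start codon and uses an invented sequence; `first_reading_frame` is a hypothetical helper name.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def first_reading_frame(seq: str):
    """Return (start, end) of the frame set by the first ATG, or None.

    end is exclusive and includes the in-frame stop codon. None is
    returned when there is no ATG or no stop codon in the same frame.
    """
    seq = seq.upper()
    start = seq.find("ATG")
    if start == -1:
        return None
    # Step through non-overlapping codons in the frame set by the start codon.
    for i in range(start, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            return start, i + 3
    return None

# Invented sequence: contains an out-of-frame TGA that must be ignored.
seq = "GGTTATGGCTGACTAACC"
frame = first_reading_frame(seq)
print(frame)
print(seq[frame[0]:frame[1]])
```

The T-G-A inside G-C-T-G-A-C is skipped because it is not in the same reading frame as the start codon; the frame ends only at the in-frame T-A-A.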

Imagine you isolate and sequence a short random stretch of DNA pulled from the ocean. It is likely genomic DNA and you want to see if it corresponds to a coding portion of a gene. Specifically, you want to see if this short stretch of DNA codes for a previously studied protein. Of course, you have no clue which strand might be used as a template for transcription nor where the start codon is. Here is a thought question: What is the maximum number of unique polypeptides that a single segment of double-stranded DNA could theoretically produce? The answer will be revealed below.

To view all possible reading frames associated with a given segment of DNA, zoom in to the first exon of BBS1. Next, right click on the gray rectangle corresponding to the Base Position track (see arrow, Figure 2.13). A pop-up menu will appear. Choose “full”.


Figure 2.13: How to view all three reading frames for a single strand of DNA. Right click on the gray rectangle on the left of the Base Position Track (red arrow). A pulldown menu will open. Choose Full.


You should now see three reading frames displayed below the nucleotide sequence. Why three? This is the maximum number of unique polypeptides that can be theoretically produced from a given segment of the top strand of DNA. See what happens when you toggle to the bottom strand of DNA (click the arrow —> on the far left of the DNA sequence). Now you can view the three reading frames that correspond to the bottom strand of DNA. Both views are shown one on top of the other in Figure 2.14. Notice they are different.



Figure 2.14: Once you change the base position track to Full, the three possible reading frames for a single DNA strand will be displayed. When the arrow points to the right, the three reading frames will be of the top strand (top image). Notice the start codon in green matches the start codon in the third reading frame (NOTE: This image comes from the 2013 genome assembly and will look different for you.) In fact, nearly all of the amino acids in exon one match the amino acids in the third reading frame. We expect one of the three top strand reading frames to match BBS1 since BBS1 is a top strand gene. However, when the arrow points to the left (circled in red), the three reading frames displayed are for the bottom strand (bottom image). Now none of the amino acids in the three reading frames match the amino acids found in BBS1.


The three reading frames corresponding to the top strand are numbered +1 (top), +2 (middle) and +3 (bottom). The three reading frames corresponding to the bottom strand (when shown) are numbered -1 (top), -2 (middle) and -3 (bottom). For convenience, the positions of all possible start codons are highlighted in green and all possible stop codons (*) are highlighted in red.
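Three frames per strand and two strands give six possible reading frames for any double-stranded segment. The sketch below enumerates them for a nine-nucleotide example. One assumption to note: the labels here are numbered relative to the first base of the snippet, so "+1" in this output need not be the frame the browser labels +1.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def six_frames(top_strand: str) -> dict:
    """Split both strands into codons for each of the three offsets."""
    top = top_strand.upper()
    bottom = top.translate(COMPLEMENT)[::-1]    # bottom strand, written 5' to 3'
    frames = {}
    for sign, strand in (("+", top), ("-", bottom)):
        for offset in range(3):
            label = f"{sign}{offset + 1}"       # label relative to this snippet only
            frames[label] = [strand[i:i + 3]
                             for i in range(offset, len(strand) - 2, 3)]
    return frames

frames = six_frames("ATGGCCGCT")
for label, codon_list in frames.items():
    print(label, codon_list)
```

Each of the six frames yields a different series of codons, which is why a segment of double-stranded DNA could in theory encode up to six different polypeptides.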

Now, zoom out until you can see the first three exons of BBS1 (Figure 2.15). At this level of zoom, you should still be able to read the amino acid sequence within the gene schematic and within the base position track. Notice there are many “start” and “stop” codons scattered throughout this segment of the genome (red and green boxes). This should not be surprising given that start and stop codons are only three nucleotides in length, so they will occur simply by chance. The ability of an ATG or TGA, TAA, TAG to function as a start or stop codon, respectively, depends on context. For example, the real start codon for a given gene is typically the first start codon encountered in an mRNA as the ribosome scans the mRNA from 5’ to 3’ (more about this later).
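How often should these triplets appear by chance? A rough estimate is possible under a strong simplifying assumption that is not true of real genomes: all four bases are equally likely and independent of their neighbors.

```python
from fractions import Fraction

total_codons = 4 ** 3                      # 64 possible triplets
p_stop = Fraction(3, total_codons)         # TAA, TAG or TGA
p_atg = Fraction(1, total_codons)          # ATG only

# Average number of codons between chance stop codons in one reading frame.
expected_gap = 1 / p_stop

print(p_stop, p_atg)
print(float(expected_gap))
```

Under this assumption a stop codon is expected about once every 21 codons in any given frame of random sequence, and an ATG about once every 64 codons.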

Figure 2.15: All ATG codons are colored green regardless of whether they function as a start codon. Similarly, all TGA, TAA and TAG codons are colored red regardless of whether they function as a stop codon. How these triplets function depends on context.


Now, notice that the amino acid sequence within the BBS1 gene schematic always matches one of the three reading frames. For our example, the coding portion of exon 1 begins with a methionine (highlighted in green). The +3 reading frame also has a methionine in this same position. Compare a few more amino acids in BBS1 and the +3 reading frame. You should see that all match (except one). Thus, coding exon 1 utilizes the +3 reading frame! By the way, do you see the exception?

Which reading frame do the other coding exons utilize? Is it always the same reading frame? Or does it change from exon to exon? New trick: To easily view exons farther downstream (3’ of the current view), click on the open arrowhead on the far right (see the hot pink arrow at the far right of the image, Figure 2.15). Also, if you hover over the open arrowhead and don’t click, a small pop-up will appear with information about which exon you are jumping to! This may come in handy while answering the following questions.

2.6.1 Test your understanding

Recall: The three reading frames corresponding to the top strand (plus strand) are called +1 (top), +2 (middle) and +3 (bottom). The three reading frames corresponding to the bottom strand (minus strand) are called -1 (top), -2 (middle) and -3 (bottom). Answer the following questions about BBS1 (a top strand gene) and ZDHHC24 (a bottom strand gene).

  • Which reading frame does exon 2 of BBS1 use?
  • Which reading frame does exon 3 of BBS1 use?
  • Which reading frame does exon 4 of BBS1 use?
  • Which reading frame does exon 5 of BBS1 use?
  • Which reading frame does exon 1 of the long form of ZDHHC24 use?
  • Which reading frame does exon 2 of the long form of ZDHHC24 use?
  • Which reading frame does exon 3 of the long form of ZDHHC24 use?
  • Which reading frame does exon 4 of the long form of ZDHHC24 use?
  • Review your answers above. Are the reading frames used by BBS1 the same for each exon examined? Are the reading frames used by ZDHHC24 the same for each exon examined?

2.7 How to View and Retrieve Gene Product Sequences

To retrieve BBS1 gene product sequences (or any gene product sequence) from the UCSC Genome Browser, click on the schematic for the BBS1 transcript in the “Gene and Gene Predictions” track. The top half of the new page contains numerous links to pages that provide sequence information associated with this gene (Figure 2.16). For example, to view information specifically about the BBS1 mRNA, click on the NM_024649.5 link. To view information specifically about the protein sequence, click on the NP_078925.3 link. NM_024649.5 and NP_078925.3 are known as accession codes. Accession codes that begin with NM_ correspond to mRNA sequences, while those that begin with NP_ correspond to protein sequences. Both sequence pages are in so-called “GenBank format”. This format includes useful annotations that can be “read” by sequence analysis software programs. The mRNA or protein sequence is at the very bottom of the page, so scroll down. Alternatively, click the “FASTA” link to see the sequence in a simpler format. One thing you might notice: there are no uracil bases (U) in the mRNA sequence! Sequence databases store mRNA sequences using the DNA alphabet and do not convert thymines (T) to uracils (U) for display.
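The accession code convention can be expressed as a small parser. This sketch handles only the two prefixes mentioned above (NM_ and NP_); the function names are hypothetical, and the short sequence passed to `as_rna` is the start of the BBS1 coding sequence quoted earlier in the chapter, not a full database record.

```python
def describe_accession(accession: str) -> dict:
    """Split an accession code such as NM_024649.5 into its parts."""
    prefix, _, rest = accession.partition("_")
    number, _, version = rest.partition(".")
    # Only the two prefixes discussed in the text are recognized here.
    kinds = {"NM": "mRNA", "NP": "protein"}
    return {
        "molecule": kinds.get(prefix, "unknown"),
        "number": number,
        "version": version,
    }

def as_rna(database_sequence: str) -> str:
    """Rewrite a database mRNA sequence (stored with T) using U."""
    return database_sequence.upper().replace("T", "U")

print(describe_accession("NM_024649.5"))
print(describe_accession("NP_078925.3"))
print(as_rna("ATGGCCGCT"))
```

The number after the period is the version of the record, so NM_024649.5 is the fifth version of that mRNA entry.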

Finally, to get an overview of how the BBS1 mRNA aligns with the genomic sequence, click on the “View details of parts of alignment within browser window” link. Read the text on that page to determine what each type of highlighting means, although you may be able to deduce the meanings yourself.


Figure 2.16: When you click on a gene/transcript schematic in the gene prediction track you are taken to its gene information page. BBS1 only has one isoform and so there is only one gene information page. In other words, this information is transcript specific and depends on which isoform you click on. Useful links are highlighted. Some links will help you answer Test Your Understanding questions. Explore!


2.7.1 Test Your Understanding

  • List the first four nucleotides of the BBS1 mRNA according to the accession record, NM_024649.5 (Answer found in FASTA or GenBank format).
  • How long (in bp) is the BBS1 spliced transcript (mRNA) according to the accession record, NM_024649.5 (Answer found in GenBank format only)?
  • List the first four amino acids of the BBS1 protein according to the accession record, NP_078925.3 (Answer found in FASTA or GenBank format).
  • How long (in amino acids) is the protein according to the accession record, NP_078925.3 (Answer found in GenBank format only)?
  • In general, what is the difference between FASTA and GenBank formats?
  • EXTRA CREDIT!!! How long is the BBS1 unspliced transcript (the pre-mRNA)? (HINT: You will find this information in the gene information page for BBS1, although you will not find the phrase “unspliced transcript” there. That said, the length of the unspliced transcript is equivalent to the length of the _______)

2.8 For Discussion

Click on the following link to see a small section of BBS1 within the UCSC Genome Browser that includes 4 exons and 3 introns: link. Use this section of BBS1 to answer the following questions:

  • What do you notice in general about the distribution of stop codons in introns? To see if there is a pattern, estimate the number of stop codons in each reading frame (+1, +2 and +3) within the introns. You are looking for trends. If you don’t see any trends, review more introns for BBS1 by scrolling to the right. It is particularly useful to examine long introns.
  • What do you notice in general about the distribution of stop codons in exons? To see if there is a pattern, count the number of stop codons in each reading frame (+1, +2 and +3) within exons. Again, you are looking for trends. It is particularly useful to examine long coding exons.
  • Do the same type of analysis but for ATG codons. Is the distribution pattern you observe for start codons similar to or different from the pattern observed for stop codons?

If you created a graph or table (in Excel or by hand) to answer these questions, take a picture and DM me on Slack for extra credit.

2.9 Addendum

Now that you have more familiarity with gene structure, the following tutorials are highly recommended. They were created by the UCSC Genome Browser team. These are also worth revisiting before you embark on your Independent Project at the end of the term.

© 2019, Maria Gallegos. All rights reserved.