2 Understanding Gene Structure

In this chapter, you will learn how genes are structured and how to recognize the different parts of a gene as illustrated in the UCSC genome browser. To follow along with the text and to answer the “Test Your Understanding” questions, start with the BBS1 session link. You can also use this link as a starting point to view gene structure for any gene of interest by typing the name of your gene of interest in the search window.

2.1 Exons and Introns

During gene expression, one continuous strand of genomic DNA is used as a template to produce a single-stranded RNA transcript. This process is called transcription. The sequence of this RNA transcript is identical to the nontemplate strand of genomic DNA (although RNA has uracil in place of thymine). Before the transcript is exported to the cytoplasm, it is processed, which includes the removal of internal segments called introns (or intervening regions). The RNA segments that remain are called exons (or expressed regions). The exons are then spliced together in their original order. To view the exon/intron organization of BBS1, click on the BBS1 session link and review the schematic within the NCBI RefSeq Gene Prediction Evidence Track. Notice that the BBS1 gene consists of “boxes” connected by lines (Figure 2.1). The boxes represent exons (two examples are highlighted with a red bracket/asterisk). The lines represent introns (one example is highlighted with a purple bracket/asterisk). Now, hover your mouse over various parts of the gene schematic within the browser window. If you pause long enough, a popup window will appear indicating which intron or exon you are pointing to and the total number of each. Thus, this popup window also reveals the location of the beginning of a gene (i.e., exon 1 marks the beginning). This is the part of the gene that is transcribed first. For BBS1, exon 1 is located on the left side of the gene schematic.
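The two steps described above (transcription from the template strand, then removal of introns) can be sketched in a few lines of Python. This is a toy illustration only: the sequences, the exon coordinates, and the function names `transcribe` and `splice` are made up for this example and do not come from BBS1.

```python
# Toy sketch: sequences and coordinates are invented, not taken from BBS1.
DNA_COMPLEMENT = str.maketrans("ACGT", "TGCA")
RNA_PAIRING = {"A": "U", "T": "A", "G": "C", "C": "G"}

def transcribe(template_3to5):
    """Build an RNA (written 5'->3') from a template strand written 3'->5'."""
    return "".join(RNA_PAIRING[base] for base in template_3to5)

def splice(pre_mrna, exons):
    """Join exons in order; exons are 0-based, end-exclusive (start, end) pairs."""
    return "".join(pre_mrna[start:end] for start, end in exons)

# A made-up gene: exon 1, one intron, exon 2 (nontemplate strand, 5'->3').
exon1, intron1, exon2 = "GGTTCA", "GTAAGTCCAG", "ATGGC"
nontemplate = exon1 + intron1 + exon2

# The template strand is the base-by-base complement (read here 3'->5').
template_3to5 = nontemplate.translate(DNA_COMPLEMENT)

pre_mrna = transcribe(template_3to5)
exon_coords = [(0, len(exon1)), (len(exon1) + len(intron1), len(nontemplate))]
mrna = splice(pre_mrna, exon_coords)

print(pre_mrna)  # same as the nontemplate strand, with U in place of T
print(mrna)      # exons only, joined in order
```

Note that the unspliced transcript equals the nontemplate strand with U in place of T, which is why the nontemplate strand is the one you normally read in a browser.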


Figure 2.1: Understanding the gene prediction evidence track schematic. Boxes represent exons (two examples are bracketed and highlighted with a red asterisk). The lines represent introns (one example is bracketed and highlighted with a purple asterisk).


2.1.1 Test your understanding

  • How many exons does BBS1 contain?
  • How many introns does BBS1 contain?
  • Which BBS1 exon is the longest (as judged by eye)?
  • Find SOD1 in the UCSC Genome Browser. How many exons does SOD1 contain?
  • How many introns does SOD1 contain?
  • Which SOD1 exon is the longest (as judged by eye)?
  • Fill in the blank: In any given gene, there is always ___________ intron than there are exons (one more or one less).
  • In your own words, explain why your answer to the above question is always true.
  • Fill in the blank: Transcribed regions (also transcripts) corresponding to protein coding genes always begin and end with ___________.

2.2 Coding and Noncoding Exons

The spliced transcript, or messenger RNA (mRNA), is used as a template to create a polypeptide [1]. This process is called translation and is mediated in part by the ribosome [2]. Ribosomes “read” each codon [3] within the mRNA to create a polypeptide. That said, they don’t read from the beginning of the transcript but from a start codon located internally. Moreover, the ribosome does not read to the end of the transcript but only until it encounters a stop codon. Thus, an mRNA is more than just codons! In fact, codons only occupy the central region of an mRNA. The ends of the mRNA not read by the ribosome are noncoding regions or untranslated regions (UTRs). The noncoding region at the beginning of the mRNA (5’ end) is called the 5’ untranslated region (5’ UTR), while the noncoding region at the end of the mRNA (3’ end) is called the 3’ untranslated region (3’ UTR). Importantly, the transitions from noncoding to coding and back to noncoding do not have to coincide with exon boundaries. Thus, an exon can be categorized as a coding exon, a noncoding exon or an exon of mixed character. In fact, exon 1 of BBS1 is part noncoding (5’ UTR) and part coding. The same is true for exon 17 of BBS1 (the last exon).
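One way to make the coding/noncoding/mixed distinction concrete is to compare each exon's coordinates with the span that runs from the start codon to the stop codon. The sketch below is hypothetical: the coordinates are invented, and `classify_exon` is an illustrative helper, not something the Genome Browser provides.

```python
def classify_exon(exon, cds):
    """Classify an exon relative to the coding span.

    exon and cds are (start, end) pairs, 0-based and end-exclusive.
    cds runs from the first base of the start codon to the last base
    of the stop codon.
    """
    exon_start, exon_end = exon
    cds_start, cds_end = cds
    overlap = min(exon_end, cds_end) - max(exon_start, cds_start)
    if overlap <= 0:
        return "noncoding"   # lies entirely within a UTR
    if overlap == exon_end - exon_start:
        return "coding"      # lies entirely between start and stop
    return "mixed"           # part UTR, part coding

# Made-up gene with five exons; the coding span starts inside exon 2
# and ends inside exon 4.
exons = [(0, 100), (200, 300), (400, 500), (600, 700), (800, 900)]
cds = (250, 650)
labels = [classify_exon(exon, cds) for exon in exons]
print(labels)
```

In the browser, the same information is drawn rather than computed: thick boxes are coding, thin boxes are noncoding, and a mixed exon shows both.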

Let’s see how noncoding regions are illustrated in the UCSC Genome Browser. “Drag-and-Select” exon 1 of BBS1 until your window looks similar to Figure 2.2. At this zoom level, you should be able to see the amino acid sequence that exon 1 codes for (Figure 2.2: M-A-A-A-S-S-S-D-S-D-A-C-G-A-E-S) [4]. Again, notice that the amino acid sequence does not start at the beginning of exon 1. Moreover, the noncoding portion of exon 1 is not drawn as thick as the coding portion of exon 1. This thinner portion of exon 1 is the 5’ untranslated region (5’ UTR). To confirm that you are in the right place, check that the BBS1 5’ UTR begins with the sequence G-G-T-T. This is also the beginning of the transcript.


Figure 2.2: Exon 1 (red box) of BBS1 is part noncoding exon (blue box) and part coding exon (green box)


To view the entire BBS1 gene again, click on the named session link for BBS1 Gene Structure. Notice that the box representing the last exon of BBS1 also consists of both a thick and a thin portion. The thin portion at the end of the BBS1 gene is also noncoding. Since this is the 3’ end of BBS1, this portion of BBS1 is called the 3’ untranslated region or 3’ UTR. Now “Drag-and-Select” the coding portion of the last exon. If your view does not look like Figure 2.3, you may have selected the entire last exon when you should have selected only its coding portion. See the difference? Now you can clearly see that the amino acid sequence stops before the end of the last exon and the exon schematic transitions from thick to thin. NOTE: The stop codon is represented by an asterisk. In many genes, the first and last exons consist of both noncoding and coding sequence while the central exons are coding only. As you explore more of the genome, you will see there are exceptions. Sometimes the first few exons and/or the last few exons are completely noncoding!


Figure 2.3: The coding portion of the last exon of BBS1 ends with the stop codon, a red codon containing an asterisk. The 3’ UTR begins directly after the stop codon.


2.2.1 Test your understanding

  • Use Drag and Select to zoom into the BBS1 5’ UTR (zoom in until you see individual nucleotide bases). The first two nucleotides of the BBS1 5’ UTR are GG (check to confirm). What are the last two nucleotides of the 5’ UTR?
  • The BBS1 3’ UTR is fairly long. The end of the 3’ UTR is also the end of the transcript. Use Drag and Select to zoom into the beginning of the BBS1 3’ UTR (zoom in until you see individual nucleotide bases). What are the first two nucleotides of the 3’ UTR?
  • Search for the gene, SOD1 in the UCSC Genome Browser. Characterize each exon as coding, noncoding or of mixed character (Choose one for each).

2.3 Coding Exons Determine Protein Sequence

A polypeptide is a sequence of amino acids linked together by peptide bonds. The amino acid sequence is determined by the order of codons present in the central portion of a mature mRNA, the coding portion of the exons. Each codon consists of three consecutive bases. Since there are four nucleotides in total (G, A, T and C), there are a total of 64 possible codons (4 x 4 x 4). Most of the 64 codons code for an amino acid. Three do not and therefore act as STOP signals, also known as termination codons. Figure 2.4 is one way to display the genetic code. In this graphic, RNA codons are read from the center out to the periphery (from 5’ to 3’). For example, G-C-A codes for Alanine. Ala and A are the standard three-letter and single-letter codes, respectively. U-U-G codes for Leucine. Don’t forget, Uracil replaces Thymine in an RNA molecule, which explains why the genetic code contains Uracil and not Thymine. That said, genome browsers and sequence databases typically display RNA sequences with T rather than U (to see an example, view the mRNA sequence corresponding to BBS1).
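The 4 x 4 x 4 arithmetic and the two lookups mentioned above can be checked with a short Python sketch. It builds the standard genetic code from a compact 64-letter string (a common way to write the table, with codons ordered U, C, A, G at each position); the variable names are illustrative.

```python
from itertools import product

# Standard genetic code, RNA alphabet. Codons are generated in the order
# UUU, UUC, UUA, UUG, UCU, ... and paired with the one-letter amino acid
# codes below ("*" marks a stop codon).
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODONS = ["".join(triplet) for triplet in product(BASES, repeat=3)]
GENETIC_CODE = dict(zip(CODONS, AMINO_ACIDS))

stop_codons = [codon for codon, aa in GENETIC_CODE.items() if aa == "*"]

print(len(GENETIC_CODE))    # number of possible codons
print(stop_codons)          # the three termination codons
print(GENETIC_CODE["GCA"])  # Alanine, single-letter code A
print(GENETIC_CODE["UUG"])  # Leucine, single-letter code L
```

To look up a codon written as DNA, replace each T with U first (or build the same table with `BASES = "TCAG"`).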


Figure 2.4: Image from Wikipedia (https://commons.wikimedia.org/wiki/File:Aminoacids_table.svg)


Click on the BBS1 session link, then use the “Drag-and-Select” method to get a close-up view of exon 1. Notice that the first amino acid of BBS1 is methionine (M) and is shaded in green (Figure 2.2). In fact, the first amino acid of the vast majority of proteins is methionine as A-U-G is the start codon in most cases. You can confirm that this is true for BBS1 by looking at the three nucleotides positioned directly above the “M” (NOTE: You should see A-T-G not A-U-G since you are viewing genomic DNA). The second codon is G-C-C. The third codon is G-C-T. Both code for Alanine (Ala, A).

In general, ribosomes “read” mRNA codons from 5’ to 3’ in non-overlapping groups of three nucleotides. Since the codons of BBS1 are found in the top strand where the 5’ end of the DNA is on the left, these codons are read from left to right. Now zoom out then drag and select to view the last coding exon of BBS1 (Figure 2.3). Notice that the stop codon is represented by an “*” and highlighted in red. This is where translation ends.

In summary, ribosomes and supporting factors “read” codons from 5’ to 3’ in a process called translation. mRNA translation begins at the start codon and ends at the stop codon to produce a single polypeptide chain.
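As a rough sketch of this summary, the function below starts at the first AUG in an mRNA, reads non-overlapping triplets, and stops at the first in-frame stop codon. The mRNA is a made-up toy sequence, and choosing the first AUG is a simplifying assumption; real start codon selection depends on context.

```python
from itertools import product

# Standard genetic code in the RNA alphabet ("*" marks a stop codon).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
GENETIC_CODE = dict(
    zip(("".join(t) for t in product("UCAG", repeat=3)), AMINO_ACIDS)
)

def translate(mrna):
    """Translate from the first AUG to the first in-frame stop codon."""
    start = mrna.find("AUG")
    if start == -1:
        return ""  # no start codon, so nothing is translated
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = GENETIC_CODE[mrna[i:i + 3]]
        if amino_acid == "*":
            break  # stop codon: translation ends here
        protein.append(amino_acid)
    return "".join(protein)

# Toy mRNA: 5' UTR + coding region + stop codon + 3' UTR (all invented).
five_prime_utr = "GGUUC"
coding = "AUGGCCGCUUCA"
stop = "UGA"
three_prime_utr = "CCAUAA"
mrna = five_prime_utr + coding + stop + three_prime_utr

protein = translate(mrna)
print(protein)
```

Neither UTR contributes any amino acids, even though the 3’ UTR in this toy example happens to contain another stop triplet.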

2.3.1 Test your understanding

The codons below are written as a DNA sequence from 5’ on the left to 3’ on the right (5’ to 3’). Use the genetic code in Figure 2.4 to answer the first two questions. Use the UCSC genome browser to answer the remaining questions.

  • What amino acid does the codon, CGG, code for?
  • What amino acid does the codon, TTC, code for?
  • Write out the DNA sequence of the 5th codon for BBS1
  • What does the 5th codon of BBS1 code for?
  • EXTRA CREDIT! Write out the DNA sequence of the 16th codon
  • EXTRA CREDIT! What is unusual about the 16th codon?

2.4 One gene, many splice variants

A single RNA transcript is sometimes spliced in a variety of different ways to produce more than one unique mRNA. This is called alternative splicing and each mRNA produced is called a splice variant. Since each splice variant has the potential to produce a unique protein or isoform, the roughly 20,000 protein-coding genes in the human genome can actually produce many more than 20,000 unique proteins!
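Here is a minimal sketch of the idea, assuming a made-up transcript with three exons. Including or skipping exon 2 yields two different mRNAs from the same pre-mRNA. The exon sequences and the helper `build_mrna` are hypothetical.

```python
# Toy exons (RNA, 5'->3'); the sequences are invented for illustration.
exons = {
    1: "GGUUCAUGGCC",
    2: "GCUUCAGAU",
    3: "AGCGAUUGAACC",
}

def build_mrna(exon_numbers):
    """Join the chosen exons in the order given."""
    return "".join(exons[number] for number in exon_numbers)

variant_full = build_mrna([1, 2, 3])     # all three exons retained
variant_skipped = build_mrna([1, 3])     # exon 2 skipped

print(variant_full)
print(variant_skipped)
print(len(variant_full) - len(variant_skipped))  # length of the skipped exon
```

Both variants share the same first and last exons, which is the pattern to look for when you compare stacked schematics that carry the same gene name.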

To see how the UCSC genome browser illustrates splice variants, click on the BBS1 session link, then zoom out exactly 10X to explore a larger section of chromosome 11 [5]. Now more than one gene is present in the browser window. Again, the name of each gene is given on the left side of each gene/transcript schematic. Regions of the genome that are transcribed to produce multiple mRNAs are indicated by a series of box/line schematics that are stacked vertically. How do I know these stacks represent a single gene? Each graphical representation in the stack has the same gene name (Figure 2.5).


Figure 2.5: Notice how DPP3, highlighted with a red box, begins and ends transcription at the same location but the RNA transcript that is produced is spliced in three distinct ways to produce three alternative splice forms. One obvious difference is highlighted with a red arrow. The 2nd and 3rd splice forms have an exon not included in the topmost splice form listed.


Alternative splicing is only one way that multiple unique mRNAs are produced from a single region of DNA (a single gene). There are three main methods:

  1. Use of an alternative transcriptional start site (Figure 2.6),
  2. Use of an alternative transcriptional termination site (Figure 2.7) and
  3. Alternative splicing (Figure 2.5).

Figure 2.6: Transcription initiation for the LGALS12 gene can be found at two distinct places according to the NCBI RefSeq gene database. The red arrow points to the two transcription start sites


Figure 2.7: Transcription termination for the TTC17 gene occurs at two different sites according to the NCBI RefSeq gene database. Each is highlighted with a red arrow


Again, multiple mRNA variants produced by the same gene can (but don’t always) produce multiple, unique protein isoforms. In some cases, these unique proteins differ enough to function distinctly. For a hilarious illustration of this concept see Figure 2.8.


Figure 2.8: Concept/Art by Allan James from the University of Glasgow


2.4.1 Test Your Understanding

To answer the first three questions, use your BBS1 session link to locate BBS1 then zoom out 10X to find the neighboring genes.

  • How many unique mRNAs does PELI3 produce?
  • How many unique mRNAs does MRPL11 produce?
  • How many unique mRNAs does NPAS4 produce?
  • EXTRA CREDIT: Use the search window to locate ACVR1. How many unique mRNAs does ACVR1 produce?

2.5 Top strand and bottom strand genes

We learned in a previous section that BBS1 codons can be viewed in the top strand of the genome (also known as the + strand). This is because BBS1 uses the bottom strand (- strand) as the template during transcription. This produces a transcript identical in sequence to the top strand of DNA [6]. I like to call BBS1 a top strand gene. For top strand genes, the top strand can be described as the sense strand, coding strand, informational strand or nontemplate strand. My favorites are “nontemplate strand” and “informational strand” [7]. For top strand genes, the bottom strand (- strand) is known as the template strand.

Interestingly, the informational/nontemplate strand of a gene is not ALWAYS the top strand. To determine if a gene is a top strand gene or a bottom strand gene, you need to look for so-called “orientation wedges” in the gene schematic. First, click on the BBS1 session link, then zoom out 3X to look at BBS1 and its neighboring genes. Your NCBI RefSeq gene track should look something like Figure 2.9.


Figure 2.9: BBS1 and its neighboring genes.


Now “Drag-and-Select” the first half of BBS1 to zoom in. Notice the “orientation wedges” spaced regularly along all introns (Figure 2.10). They are either oriented to the right (red box, Figure 2.10) or to the left (green box, Figure 2.10). These “orientation wedges” indicate the orientation of the gene on the chromosome. They tell you where the 5’ end of the gene is and they indicate which strand of chromosomal DNA is the informational strand.


Figure 2.10: The orientation wedges are visible within introns. When they point to the right (red box), the gene is a top strand gene (plus strand). When they point to the left (green box), the gene is a bottom strand gene (minus strand).


When the orientation wedges point to the right, “) ) ) )” (e.g., see BBS1),

  1. The beginning of the gene is on the left.
  2. The top strand (+ strand) is the informational strand (nontemplate strand).
  3. The bottom strand (- strand) is the template strand.

When the orientation wedges point to the left, “( ( ( (”, the reverse is true (e.g., see ZDHHC24),

  1. The beginning of the gene is on the right.
  2. The top strand (+ strand) is the template strand.
  3. The bottom strand (- strand) is the informational strand (nontemplate strand).
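The two sets of rules above amount to a small lookup keyed on the gene's strand. The sketch below encodes them in Python. The function name `describe_orientation` and the dictionary keys are hypothetical, chosen only for this illustration.

```python
def describe_orientation(strand):
    """Summarize the rules above for a top strand (+) or bottom strand (-) gene."""
    if strand == "+":
        return {
            "wedges_point": "right",
            "gene_begins_on": "left",
            "informational_strand": "top (+)",
            "template_strand": "bottom (-)",
        }
    if strand == "-":
        return {
            "wedges_point": "left",
            "gene_begins_on": "right",
            "informational_strand": "bottom (-)",
            "template_strand": "top (+)",
        }
    raise ValueError("strand must be '+' or '-'")

top_strand_gene = describe_orientation("+")      # e.g., a gene like BBS1
bottom_strand_gene = describe_orientation("-")   # e.g., a gene like ZDHHC24

print(top_strand_gene["template_strand"])
print(bottom_strand_gene["gene_begins_on"])
```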

2.5.1 Test your understanding

To answer the following set of questions, go to the BBS1 session link then zoom out 10X to see neighboring genes. Drag and select to zoom into individual genes as needed.

  • Which strand (top or bottom) is the informational/nontemplate strand for PELI3?
  • Where would you look to locate the start codon of PELI3 (as displayed in the browser), near the left or right side of the gene schematic?
  • How are the unique mRNA variants of PELI3 created (Alternative transcription initiation, Alternative transcription termination or Alternative splicing)?
  • Which strand (top or bottom) is the informational/nontemplate strand for MRPL11?
  • Where would you look to locate the start codon of MRPL11 (as displayed in the browser), near the left or right side of the gene schematic?
  • How are the unique mRNA variants of MRPL11 created (Alternative transcription initiation, Alternative transcription termination or Alternative splicing)?
  • Which strand (top or bottom) is the informational/nontemplate strand for ZDHHC24?
  • Where would you look to locate the start codon of ZDHHC24 (as displayed in the browser), near the left or right side of the gene schematic?
  • How are the unique mRNA variants of ZDHHC24 created (Alternative transcription initiation, Alternative transcription termination or Alternative splicing)?


Now let’s look at a bottom strand gene (ZDHHC24) more closely. Click on my saved session for BBS1, then zoom out 3X to search for the beginning of ZDHHC24 (it should be on the right). “Drag-and-Select” as many times as necessary to zoom into the first half of the first coding exon for ZDHHC24. Your NCBI RefSeq gene track should look something like Figure 2.11. If not, you may have zoomed into the first half of exon 1, which is noncoding. Or maybe you are at the wrong end of the gene!


Figure 2.11: This is the first half of the first coding exon of ZDHHC24. The noncoding portion of exon 1 extends to the right this time because ZDHHC24 is a bottom strand (or minus strand) gene. The red oval highlights the strand you are currently viewing. Since the arrow points to the right, the top strand (5’ on the left) is displayed. You can toggle this arrow with your mouse.


Figure 2.12: Again, this is the first half of the first coding exon of ZDHHC24. The red oval highlights the strand you are currently viewing. Since the arrow points to the left, the bottom strand (5’ on the right) is displayed. Notice that the DNA sequence is gray. Compare to the top image. This is a visual reminder of which strand you are viewing.


Look at the three nucleotides centered above the Methionine (colored in green on the right side of Figure 2.11). Notice that the nucleotides listed from left to right are NOT A-T-G as you might expect but instead C-A-T! This is because you are “reading” the top strand of the genomic DNA and ZDHHC24 is a bottom strand gene. To view the bottom strand of the genomic DNA, click on the far left arrow within the Base Position track (see the red oval, Figure 2.11). Notice that the orientation of the arrow will flip and the genome sequence in your Base Position track will become light gray (red oval, Figure 2.12). Now there is an A-T-G directly above the Methionine (M) but only if you read it from right to left! This is because the bottom strand of DNA is oriented with its 5’ end on the right.
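The C-A-T versus A-T-G observation is an instance of taking the reverse complement: complement each base, then read the result in the opposite direction. A minimal Python sketch (the helper name `reverse_complement` is just a label for this example):

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna):
    """Return the opposite strand, written 5'->3'."""
    return dna.translate(COMPLEMENT)[::-1]

top_strand_triplet = "CAT"  # what you read left to right on the top strand
bottom_strand_triplet = reverse_complement(top_strand_triplet)

print(bottom_strand_triplet)  # the same position read 5'->3' on the bottom strand
```

Applying the function twice returns the original sequence, which is a quick sanity check when you work out bottom strand codons by hand.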

2.5.2 Test your understanding

  • Write out the DNA sequence for the 3rd codon for ZDHHC24 (Written 5’ to 3’ as is standard practice).
  • Write out the DNA sequence for the 5th codon for ZDHHC24 (Written 5’ to 3’ as is standard practice).
  • Write out the DNA sequence for the stop codon used by the short mRNA variant of ZDHHC24 (Written 5’ to 3’ as is standard practice).
  • EXTRA CREDIT! Write out the DNA sequence for the stop codon used by the long mRNA variant of ZDHHC24 (Written 5’ to 3’ as is standard practice).

    HINT: To confirm that you have identified the codon sequence correctly, use the genetic code to confirm that the codon codes for the expected amino acid or stop.


In summary, DNA is double-stranded. Some genes, like BBS1, create an RNA transcript with a sequence that matches the top strand. Other genes, like ZDHHC24, create an RNA transcript with a sequence that matches the bottom strand. Also, the two genomic DNA strands anneal (hybridize) in an antiparallel orientation. The 5’ end of the top strand is on the left by convention. Thus, the 5’ end of the bottom strand is on the right.

2.6 Reading Frames

Translation begins at the start codon. This is the first codon “read” by the ribosome. All subsequent codons are read in nonoverlapping groups of three. Thus, the start codon “sets” the reading frame. A reading frame ends only when a stop codon is encountered that is in the same reading frame as the start codon. This is the last codon “read” by the ribosome.
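To see how the start codon “sets” the reading frame, the sketch below walks along a made-up DNA sequence in steps of three from a chosen starting position and stops at the first stop codon it lands on. The sequence and the helper `read_codons` are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def read_codons(dna, start_index):
    """Read non-overlapping triplets from start_index up to the first in-frame stop.

    Returns the list of codons (stop codon included), or None if no
    in-frame stop codon is found.
    """
    codons = []
    for i in range(start_index, len(dna) - 2, 3):
        codon = dna[i:i + 3]
        codons.append(codon)
        if codon in STOP_CODONS:
            return codons
    return None

# Toy sequence. It contains TGA triplets at positions 3 and 7, but they are
# out of frame relative to the ATG at position 2, so they are never read.
dna = "CCATGATTGACTAAGG"
start = dna.find("ATG")
reading_frame = read_codons(dna, start)

print(start)
print(reading_frame)
```

Shifting the starting position by a single base changes every codon that follows, which is why only a stop codon in the same frame as the start codon ends the reading frame.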

Imagine you isolate and sequence a short random stretch of DNA pulled from the ocean. It is likely genomic DNA and you want to see if it corresponds to a coding portion of a gene. Specifically, you want to see if this short stretch of DNA codes for a previously studied protein. Of course, you have no clue which strand might be used as a template for transcription nor where the start codon is. Here is a thought question: What is the maximum number of polypeptide sequences that can hypothetically be created by a single segment of double-stranded DNA? Your answer will be revealed below.

To view all possible reading frames associated with a given segment of DNA, zoom in to the first exon of BBS1. Next, right click on the gray rectangle corresponding to the Base Position track (see arrow, Figure 2.13). A pop-up menu will appear. Choose “full”.


Figure 2.13: How to view all three reading frames for a single strand of DNA. Right click on the gray rectangle on the left of the Base Position Track (red arrow). A pulldown menu will open. Choose Full.


You should now see three reading frames displayed below the nucleotide sequence. Why three? This is the maximum number of unique polypeptides that can be theoretically produced from a given segment of the top strand of DNA. See what happens when you toggle to the bottom strand of DNA (click the arrow —> on the far left of the DNA sequence). Now you can view the three reading frames that correspond to the bottom strand of DNA. Both views are shown one on top of the other in Figure 2.14. Notice they are different.



Figure 2.14: Once you change the base position track to Full, the three possible reading frames for a single DNA strand will be displayed. When the arrow points to the right, the three reading frames will be of the top strand (top image). Notice the start codon in green matches the start codon in the third reading frame (NOTE: This image comes from the 2013 genome assembly and will look different for you.) In fact, nearly all of the amino acids in exon one match the amino acids in the third reading frame. We expect one of the three top strand reading frames to match BBS1 since BBS1 is a top strand gene. When the arrow points to the left (circled in red), however, the three reading frames displayed are for the bottom strand (bottom image). Now none of the amino acids in the three reading frames match the amino acids found in BBS1.


The three reading frames corresponding to the top strand are numbered +1 (top), +2 (middle) and +3 (bottom). The three reading frames corresponding to the bottom strand are numbered -1 (top), -2 (middle) and -3 (bottom). For convenience, the positions of all possible start codons are highlighted in green and all possible stop codons (*) are highlighted in red.
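A short Python sketch shows where the six frames come from: three starting offsets on the top strand plus three on its reverse complement. The input is a toy sequence. One assumption to flag: the labels here number frames by offset from the 5’ end of each strand within this short string, which need not match the row order the browser shows for a given window.

```python
from itertools import product

# Standard genetic code in the DNA alphabet ("*" marks a stop codon).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
GENETIC_CODE = dict(
    zip(("".join(t) for t in product("TCAG", repeat=3)), AMINO_ACIDS)
)
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna):
    """Return the opposite strand, written 5'->3'."""
    return dna.translate(COMPLEMENT)[::-1]

def translate_frame(dna, offset):
    """Translate every complete triplet starting at the given offset (0, 1 or 2)."""
    return "".join(
        GENETIC_CODE[dna[i:i + 3]] for i in range(offset, len(dna) - 2, 3)
    )

def six_frames(dna):
    """Translate all three frames of each strand."""
    bottom = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate_frame(dna, offset)
        frames[f"-{offset + 1}"] = translate_frame(bottom, offset)
    return frames

frames = six_frames("ATGGCAGCTTGA")  # toy sequence, not from BBS1
for label in ("+1", "+2", "+3", "-1", "-2", "-3"):
    print(label, frames[label])
```

Only one of the six outputs begins with M and ends with a stop, mirroring the way only one browser frame lines up with the annotated protein.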

Now, zoom out until you can see the first three exons of BBS1 (Figure 2.15). At this level of zoom, you should still be able to read the amino acid sequence within the gene schematic and within the base position track. Notice there are many “start” and “stop” codons scattered throughout this segment of the genome (green and red boxes, respectively). This should not be surprising given that start and stop codons are only three nucleotides in length and thus occur frequently by chance. That said, the ability of an ATG or a TGA, TAA or TAG to function as a start or stop codon, respectively, depends on context. For example, the real start codon for a given gene is typically the first start codon encountered in an mRNA as the ribosome scans the mRNA from 5’ to 3’ (more about this later).
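A back-of-the-envelope calculation shows why chance matches are so common. The sketch below assumes a simplified model in which all four bases are equally likely and independent of one another, which real genomes only approximate.

```python
from fractions import Fraction

# Simplifying assumption: each base is A, C, G or T with probability 1/4,
# independently of its neighbors.
p_specific_triplet = Fraction(1, 4) ** 3

p_start = p_specific_triplet      # ATG is one specific triplet
p_stop = 3 * p_specific_triplet   # TAA, TAG or TGA

# Average number of triplets you would read before hitting one of each kind.
mean_triplets_per_start = 1 / p_start
mean_triplets_per_stop = 1 / p_stop

print(p_start, p_stop)
print(mean_triplets_per_start, float(mean_triplets_per_stop))
```

Under this model a stop triplet turns up about once every 21 triplets in any frame, so a long stretch without one is a hint that the frame may be coding.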

Figure 2.15: All ATG codons are colored green regardless of whether they function as a start codon. Similarly, all TGA, TAA and TAG codons are colored red regardless of whether they function as a stop codon. How these triplets function depends on context.


Now, notice that the amino acid sequence within the coding exons of BBS1 always matches one of the three reading frames. For example, the coding portion of exon 1 begins with a methionine (highlighted in green). The +3 reading frame also has a methionine in this same position. Compare a few more amino acids in BBS1 and the +3 reading frame. You should see that all match (except one - do you see which one?). Thus, coding exon 1 utilizes the +3 reading frame!

Which reading frame do the other coding exons utilize? Is it always the same reading frame? Or does it change from exon to exon? New trick: To easily view exons farther downstream (or 3’) of your current position, click on the open arrowhead on the far right (see the hot pink arrow at the far right of the image, Figure 2.15). Also, if you hover over the open arrowhead and don’t click, a small popup will appear with information about which exon you are jumping to! This may come in handy while answering the following questions.

2.6.1 Test your understanding

Recall: The three reading frames corresponding to the top strand (plus strand) are called +1 (top), +2 (middle) and +3 (bottom). The three reading frames corresponding to the bottom strand (minus strand) are called -1 (top), -2 (middle) and -3 (bottom). Answer the following questions about BBS1 (a top strand gene) and ZDHHC24 (a bottom strand gene).

  • Which reading frame does exon 2 of BBS1 use?
  • Which reading frame does exon 3 of BBS1 use?
  • Which reading frame does exon 4 of BBS1 use?
  • Which reading frame does exon 1 of the long form of ZDHHC24 use?
  • Which reading frame does exon 2 of the long form of ZDHHC24 use?
  • EXTRA CREDIT! How would the amino acid sequence of BBS1 change if intron 1 were not spliced out?

2.7 How to View and Retrieve Gene Product Sequences

To retrieve BBS1 gene product sequences (or any gene product sequence) from the UCSC genome browser, click on the schematic for the BBS1 transcript in the “Gene and Gene Predictions track”. The top half of the new page contains numerous links to pages that provide sequence information associated with this gene (Figure 2.16).


Figure 2.16: When you click on a gene/transcript schematic in the gene prediction track you are taken to its gene information page. BBS1 only has one isoform and so there is only one gene information page. In other words, this information is transcript specific and depends on which isoform you click on. Useful links are highlighted. Some links will help you answer Test Your Understanding questions. Explore!


For example, to view information about the BBS1 mRNA, click on the NM_024649.5 link. To view information specifically about the protein sequence, click on the NP_078925.3 link. NM_024649.5 and NP_078925.3 are so-called accession codes. Accession codes that begin with NM_ correspond to mRNA sequences, while those that begin with NP_ correspond to protein sequences. By default, both sequence pages are displayed in “GenBank format”. This format includes useful annotations that can be “read” by sequence analysis software programs. The mRNA or protein sequence is at the very bottom of the page, so scroll down. Alternatively, click the “FASTA” link to see the sequence in a simpler format. One thing you might notice: There are no uracil bases (U) in the mRNA sequence! This is typical. Sequence databases do not convert thymines (T) to uracils (U) just for display purposes. Finally, to see how the mRNA aligns with the genomic sequence, click on “View details of parts of alignment within browser window”. Read the text to determine what the highlighting means, although it may be intuitive.
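The prefix rule and the T-versus-U convention are easy to express in code. The sketch below covers only the two prefixes discussed in this chapter; the helper names `molecule_type` and `as_rna` are hypothetical, and the short sequence passed to `as_rna` is a made-up fragment, not the BBS1 record.

```python
# Only the two prefixes discussed in this chapter are listed here.
PREFIX_TO_MOLECULE = {
    "NM_": "mRNA",
    "NP_": "protein",
}

def molecule_type(accession):
    """Return the molecule type implied by an accession code's prefix."""
    return PREFIX_TO_MOLECULE.get(accession[:3], "unknown")

def as_rna(database_sequence):
    """Rewrite a database mRNA sequence (stored with T) using U instead."""
    return database_sequence.upper().replace("T", "U")

print(molecule_type("NM_024649.5"))
print(molecule_type("NP_078925.3"))
print(as_rna("ggttcatg"))  # made-up fragment, shown only to illustrate the T-to-U swap
```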

2.7.1 Test Your Understanding

  • List the first four nucleotides of the BBS1 mRNA according to the accession record, NM_024649.5 (Answer found in FASTA or Genbank format).
  • How long (in bp) is the BBS1 spliced transcript (mRNA) according to the accession record, NM_024649.5 (Answer found in Genbank format only)?
  • List the first four amino acids of the BBS1 protein according to the accession record, NP_078925.3 (Answer found in FASTA or Genbank format).
  • How long (in amino acids) is the protein according to the accession record, NP_078925.3 (Answer found in Genbank format only)?
  • EXTRA CREDIT!!! How long is the BBS1 unspliced transcript (the pre-mRNA)? (HINT: You will find this information in the gene information page for BBS1, although you will not find the phrase “unspliced transcript” there. That said, the length of the unspliced transcript is equivalent to the length of the _______)

2.8 Addendum

Now that you have more familiarity with gene structure, the following tutorials are highly recommended. They were created by the UCSC Genome Browser team. These are also worth revisiting before you embark on your Independent Project at the end of the term.

© 2023, Maria Gallegos. All rights reserved.


  1. A chain of amino acids linked together by peptide bonds. While the words “protein” and “polypeptide” are often used interchangeably, the term protein is more vague. It could refer to a single polypeptide or it might mean multiple polypeptides bound together in a complex. On the other hand, polypeptide always refers to a single sequence of amino acids.

  2. A cytoplasmic machine (composed of RNA and protein) which uses the mRNA as a template to synthesize a polypeptide.

  3. A set of three adjacent nucleotides in an mRNA molecule that either specifies the incorporation of an amino acid in a growing polypeptide chain or signals the end of polypeptide synthesis. Codons with the latter function are called termination codons (Snustad).

  4. IMPORTANT: If you do not see the amino acid sequence displayed over the exon schematic, you may need to zoom in farther. That said, if you are zoomed in enough to see the genome sequence and you still don’t see the amino acid sequence, then something else is wrong. Right click on the gray rectangle corresponding to the NCBI RefSeq Gene Prediction track and choose “Configure RefSeq Curated”. Then change “Color track by codons:” from “OFF” to “genomic codons”.

  5. Notice that at this zoom level some exons simply look like vertical lines. This is because exons are typically shorter than introns (at least in mammals), so at this zoom level exons look like lines instead of boxes.

  6. Except that Uracil replaces Thymine.

  7. “Coding strand” is misleading because it implies that the entire transcript is coding - not true. “Sense strand” is just not very informative. I like “informational strand” because even though the 5’ UTR, 3’ UTR and introns lack codons, they can and do contain information important for gene expression - to be discussed later. You can’t argue with the term “nontemplate strand”. It’s clearly accurate.