8 Finding Genes with Orthology
Genetic model organisms like yeast, worms, flies, fish and mice play a critical role in the advancement of biomedical research. Research in model organisms can be cheaper and faster than research on humans, and it is undoubtedly more ethical! Model organisms are particularly useful for studying gene function in vivo. A list of advantages and disadvantages for each of the most common species studied today is shown in Figure 8.1.
While on the surface humans appear to have little in common with mice, fish, flies, worms or fungi, this approach works. Why? Cell Theory states that “all cells come from pre-existing cells”. In fact, a significant percentage of genes from these ancestral cells persist in some form in eukaryotes today. Despite nearly 1 billion years of evolution, many gene products are recognizable at the amino acid sequence level and in some cases are functionally interchangeable!
For example, a human gene called BCL2 is known to inhibit programmed cell death (Vaux et al. 1988). Programmed cell death is a biological process that has been observed in all animals studied to date. When certain cells are no longer needed, they “commit suicide” by activating this intracellular death program. But programmed cell death is tightly regulated. Too much cell death is also problematic. In 1992, Vaux et al. discovered that when human BCL2 fused downstream of a worm heat-shock promoter[54] was injected into a worm, its progeny[55] had less programmed cell death following heat shock! See Figure 8.2 A and B.
A worm gene called ced-9 functions similarly. In loss-of-function ced-9 mutants, more cells die, suggesting that wild-type ced-9 (like wild-type BCL2) inhibits programmed cell death. In subsequent experiments, Hengartner and Horvitz partially rescued ced-9 loss-of-function mutants with the human BCL2 gene! In other words, ced-9 mutant worms expressing human BCL2 looked more like wild type (Figure 8.2 C). Dr. Vaux, the first author, explained in an interview: “This meant that the human BCL2 protein must be interacting with the worm’s cell death mechanism. The fact that human BCL2 could work in a worm suggested that human BCL2 can interact with whatever proteins the worm CED-9 protein interacts with. That, in turn, suggests that not only the gene but also the pathway of cell death is likely to be universal.” That is one billion years of evolution separating worms and humans! This is strong evidence that human BCL2 and worm ced-9 are orthologs. Again, this is a term that describes two genes in distinct species that evolved from an ancestral gene present in the last common ancestor.
In 2014, Dr. Ethan Perlstein created Perlara, a biotech Public Benefit Corporation that aimed to accelerate the discovery of treatments for rare genetic diseases through the use of simple model organisms. Although the company has since shut down[56], it had some success.
Perlara created yeast, worm and fly models of a rare congenital disorder of glycosylation called PMM2-CDG[57], which is caused by mutations in the PMM2 gene[58]. From their initial studies, they knew they needed a drug that would boost the enzymatic activity of PMM2. To identify such a drug, Perlara performed a drug repurposing screen[59] using the worm model it had created for PMM2-CDG. Epalrestat, a drug normally used to treat diabetic neuropathy, looked promising and was later shown to boost PMM2 enzyme function in PMM2-CDG patient fibroblasts[60]! Finally, in early 2020, Dr. Ethan Perlstein began treatment of the first human patient, Maggie Carmichael. After 6 months, the change was dramatic (Figure 8.3). To read more, you can subscribe to Ethan’s Substack.
The bottom line: The accurate identification of an orthologous pair between a model organism and humans is a prerequisite for creating a human disease model. A simple and widely used approach to identify putative orthologs is known as the reciprocal best hit (RBH) method[61]. In fact, one study concludes that a high proportion of RBH pairs are likely to be true orthologs (Dalquen and Dessimoz, 2013).
8.1 Performing a Reciprocal Best Hit Analysis
To find a putative ortholog of a human gene in a species of interest, you start with the protein it encodes (its gene product) and use it as a query[62] to search a protein database from a model organism of your choice (e.g., C. elegans). This is the first BLASTP[63] in your reciprocal BLASTP analysis. Your goal is to identify the protein in your model organism of choice that is most similar to the human protein you used as a query. This is your top hit, also known as your best hit. You then use the protein sequence corresponding to the “top hit” gene as a query to search the human protein database. This is called the reciprocal BLASTP. If the “top hit” in the reciprocal BLASTP is the same gene as the one used as a query in your first BLASTP, then you have found a “Reciprocal Best Hit” (Figure 8.4). In other words, a reciprocal best hit analysis succeeds at identifying a pair of putative orthologs when it identifies a pair of genes that are each other’s best hits.
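The reciprocal logic described above can be sketched in a few lines of Python. This is a minimal illustration, not a replacement for the web-based workflow in this chapter: the protein IDs and bit scores below are made up, and the "best hit" is assumed to be the hit with the highest bit score.

```python
def best_hit(hits):
    """Return the subject ID with the highest bit score, or None if there are no hits."""
    if not hits:
        return None
    return max(hits, key=lambda h: h["bit_score"])["subject"]


def is_reciprocal_best_hit(query_id, first_search, reciprocal_search):
    """Check whether query_id and its top hit are each other's best hits.

    first_search maps a human protein ID to its hits in the model organism.
    reciprocal_search maps a model-organism protein ID to its hits in human.
    Returns (is_rbh, top_hit_from_first_search).
    """
    top = best_hit(first_search.get(query_id, []))
    if top is None:
        return False, None
    back = best_hit(reciprocal_search.get(top, []))
    return back == query_id, top


# Hypothetical hit tables: IDs and scores are invented for illustration only.
first_search = {
    "human_A": [
        {"subject": "worm_X", "bit_score": 210.0},
        {"subject": "worm_Y", "bit_score": 95.5},
    ],
    "human_B": [
        {"subject": "worm_X", "bit_score": 70.1},
    ],
}
reciprocal_search = {
    "worm_X": [
        {"subject": "human_A", "bit_score": 205.0},
        {"subject": "human_B", "bit_score": 60.2},
    ],
}

is_rbh, top_hit = is_reciprocal_best_hit("human_A", first_search, reciprocal_search)
print(is_rbh, top_hit)
```

In this toy data, human_A and worm_X pick each other, so they form a reciprocal best hit. human_B also picks worm_X, but worm_X picks human_A in return, so human_B and worm_X do not.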
8.2 First BLASTP
To start, click the session link for Gene Structure. Click on the BBS1 schematic within the NCBI RefSeq Gene Predictions Track to open the gene information page. In the gene information page, click on the Protein Accession ID link (Format: NP_XXXXX.X) to open the protein information page. At the top right corner of the protein information page, under the section entitled “Analyze this sequence”, click on “Run BLAST”. A BLASTP page will open. On the BLASTP page, notice that the protein accession code for BBS1 has been entered into the query sequence window. This is the “Query ID” for your first BLASTP search. Keep all default settings except the following: Choose “Reference proteins (refseq_protein)” from the pull-down menu for your Database. Then begin to type “Caenorhabditis elegans” into the query window for “Organism”. As you type, a drop-down menu will appear for your convenience. Choose the correct species from this list. Caenorhabditis is not easy to spell, so it is best to just type “elegans”! Finally, scroll to the bottom of the BLASTP page, click “BLAST” and wait (Figure 8.5).
Your results page will appear after some time. It will contain multiple tabs near the top of the page including “Descriptions”, “Graphic Summary”, “Alignments” and “Taxonomy”. You are welcome to “click around” and play, but for now we will stick with the “Descriptions” tab. This tab includes a table that lists all your “hits” that have significant similarity to your query sequence. The table includes seven columns. In the “Description” column you will find the common or systematic name of the protein and the species in brackets. Columns 2, 3 and 5 quantify how similar your query is to the protein listed. More intuitive measures of sequence similarity include “Query Cover” and “Percent Identity”. “Query Cover” describes what fraction of your query aligns with the “hit”. “Percent Identity” describes the fraction of aligned positions that are identical between the two sequences. For reference, ced-9 and BCL2 are thought to be orthologs but they are only 25% identical. The last column provides the protein accession ID for the “hit”. The “top hit” is simply the hit in the top row (row 1). It will typically have (in sum) the largest “Max Score”, “Total Score”, “Query Cover” and “Percent Identity”. It should also have the lowest “E value”. To see the pairwise alignment of your top hit and query, click the “Alignments” tab. Scroll down to see the complete alignment. Each alignment includes a description of the hit including the protein name (descriptive or systematic) and the species in brackets. The next line includes the “Sequence ID” (Protein RefSeq Accession ID) and the “Length” of the protein (in our example, this is the number of amino acids).
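For readers who want to script the same search, the form settings above map onto request parameters. The sketch below only builds the request and does not send it. The parameter names (CMD, PROGRAM, DATABASE, QUERY, ENTREZ_QUERY) follow NCBI's BLAST URL API as I understand it and should be checked against the current NCBI documentation before use. The accession is the placeholder format used in this chapter, not a real ID.

```python
from urllib.parse import urlencode

# Assumed endpoint of the NCBI BLAST URL API; confirm before relying on it.
BLAST_URL = "https://blast.ncbi.nlm.nih.gov/Blast.cgi"


def build_blastp_submission(accession, organism):
    """Build the query string for a BLASTP search restricted to one organism."""
    params = {
        "CMD": "Put",                    # submit a new search
        "PROGRAM": "blastp",             # protein query vs. protein database
        "DATABASE": "refseq_protein",    # the reference proteins database
        "QUERY": accession,              # protein accession used as the query
        "ENTREZ_QUERY": f"{organism}[Organism]",  # limit hits to one species
    }
    return urlencode(params)


# "NP_XXXXX.X" is a placeholder, exactly as written in the text above.
query_string = build_blastp_submission("NP_XXXXX.X", "Caenorhabditis elegans")
print(f"{BLAST_URL}?{query_string}")
```

For the reciprocal search, the same helper would be called with the worm top hit's accession and "Homo sapiens" as the organism.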
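To make “Percent Identity” and “Query Cover” concrete, here is a small sketch that computes both for a single pairwise alignment. The sequences are toy examples, and the query length of 600 is a made-up value. The query cover reported in the “Descriptions” table can combine several aligned segments, so treat this single-segment version as a simplification.

```python
def percent_identity(aligned_query, aligned_subject):
    """Percent of alignment columns in which both sequences have the same residue."""
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have the same length")
    identical = sum(
        1 for q, s in zip(aligned_query, aligned_subject) if q == s and q != "-"
    )
    return 100.0 * identical / len(aligned_query)


def query_cover(query_start, query_end, query_length):
    """Percent of the query sequence that falls inside the aligned region."""
    return 100.0 * (query_end - query_start + 1) / query_length


# Toy alignment ("-" marks a gap); not taken from any real BLAST result.
aligned_query = "MKT-LLVA"
aligned_subject = "MRTALL-A"

identity = percent_identity(aligned_query, aligned_subject)
cover = query_cover(196, 255, 600)  # 600 is a hypothetical query length
print(identity, cover)
```

Five of the eight columns are identical, so the percent identity is 62.5. An aligned region spanning residues 196 to 255 of a 600-residue query covers 10 percent of that query.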
8.3 Understanding a Pairwise Alignment
Now that you have done your first BLASTP, you need to understand how to “read” a pairwise alignment. Figure 8.6 includes a small portion of a pairwise alignment between two proteins. The first row of the alignment contains the query sequence (from amino acids 196 to 255). This sequence is aligned to, and stacked on top of, a protein sequence that has significant sequence similarity with the query (from amino acids 183 to 240). This amino acid sequence is displayed in the “Sbjct” row. The middle row of the alignment in Figure 8.6 displays the degree of matching between the two sequences at each position. An amino acid (in single letter code) is placed in each column where the amino acids in the “Query” and “Sbjct” are identical. A “+” is placed in each column where the amino acids in the “Query” and “Sbjct” are similar but not identical (a conserved substitution). Columns where the amino acids in the “Query” and “Sbjct” are unrelated are left blank. A gap in the alignment is represented by “-”. Each gap suggests that, during the process of evolution, one or more complete codons were either deleted from the gene whose sequence shows the gap or inserted into the other gene.
To better understand how the numbering system works in a pairwise alignment, see if you can confirm the following statements. In Figure 8.6, there is a “Q” at position 197 of the Query sequence. Q197 aligns with an “S” at position 184 of the Sbjct. The “I” at position 200 of the Query is identical to the “I” at position 187 of the Sbjct. The “M” at position 203 of the Query is not identical to, but is conserved with, the “I” at position 190 in the Sbjct (methionine (M) and isoleucine (I) are both hydrophobic). Finally, there is a “P” at position 200 of the Sbjct. Can’t confirm? Note that gaps in an alignment are NOT counted.
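The rules for the middle row can be written as a short function. This is a sketch under a loud assumption: BLASTP decides what counts as “similar” using a substitution matrix (BLOSUM62 by default), whereas the code below uses a small hand-picked set of similar amino acid pairs purely for illustration. The sequences are toy examples, not the alignment in Figure 8.6.

```python
# Illustrative subset of similar amino acid pairs; NOT the full scoring matrix.
SIMILAR_PAIRS = {
    frozenset(pair)
    for pair in ("IL", "IV", "LV", "IM", "LM", "VM", "KR", "DE", "FY", "ST")
}


def match_line(aligned_query, aligned_subject):
    """Build the middle row of a pairwise alignment, one symbol per column."""
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have the same length")
    symbols = []
    for q, s in zip(aligned_query, aligned_subject):
        if q == "-" or s == "-":
            symbols.append(" ")   # gap: nothing to compare
        elif q == s:
            symbols.append(q)     # identical: show the amino acid
        elif frozenset((q, s)) in SIMILAR_PAIRS:
            symbols.append("+")   # similar but not identical
        else:
            symbols.append(" ")   # unrelated: leave blank
    return "".join(symbols)


query = "QIMK-E"
sbjct = "SIIRAD"
middle = match_line(query, sbjct)
print(query)
print(middle)
print(sbjct)
```

Here the second column is identical (I), the M/I, K/R and E/D columns are marked “+”, and the Q/S column and the gap column are left blank.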
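The rule that gaps are not counted can be checked with a short helper that assigns a position number to every column of one aligned row. The sequence below is a toy example. The start value of 196 is borrowed from the Query row described for Figure 8.6 only to make the numbers look familiar.

```python
def residue_positions(aligned_seq, start):
    """Return the sequence position for each alignment column.

    Gap columns ("-") get None and do not advance the counter.
    """
    positions = []
    position = start
    for residue in aligned_seq:
        if residue == "-":
            positions.append(None)
        else:
            positions.append(position)
            position += 1
    return positions


# Toy aligned row with one gap; numbering starts at residue 196.
row = "SQ-IK"
numbered = residue_positions(row, 196)
print(list(zip(row, numbered)))
```

The residue after the gap is numbered 198, not 199, because the gap column is skipped when counting.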
8.3.1 Test Your Understanding
Click the link above to see the TYU questions.
8.4 Reciprocal BLASTP
Now that you have performed your first BLASTP using BBS1 as your query, you are ready to see if the best hit in worms will identify the human BBS1 gene as the top hit in a “reciprocal BLASTP”. As a first step, retrieve the protein information page for your top hit from the first BLASTP. You can either go to the “Descriptions” tab and click on the link to the RefSeq Accession ID (Format: NP_XXXXX.X) in the “Accession” column for your top hit (Figure 8.7), or you can go to the “Alignments” tab and click on the RefSeq Accession ID link listed for “Sequence ID:”. Either will take you to the same protein information page for your top hit in C. elegans. Then click on the “Run BLAST” link as before. A BLASTP submission page will open. On the BLASTP submission page, the accession code for the worm top hit will have been entered into the query sequence window for you. This is your new “Query ID” for your reciprocal BLASTP. Keep all default settings except the following: Choose “Reference proteins (refseq_protein)” from the pull-down menu for your Database. Then type “Homo sapiens” into the query window for “Organism” (Figure 8.7). As you type, a drop-down menu will appear for your convenience. Choose the correct species from this list. Finally, scroll to the bottom of the BLASTP page and click “BLAST”. Once your results page opens, look at the top hit. Is it the BBS1 gene?
8.4.1 Test Your Understanding
Click the link above to see the TYU questions.
8.5 Understanding your RBH analysis
In the section above, I walked you through how to do an RBH analysis starting with the human BBS1 protein as the query in your first BLASTP. The interpretation was straightforward. The protein RefSeq ID used as the query in the first BLASTP was the same protein RefSeq ID obtained as the top hit in the reciprocal BLASTP. What does it mean when they don’t match? There are two possibilities. Either the two protein RefSeq IDs correspond to two distinct genes, and the gene pair are NOT orthologs, or the two protein RefSeq IDs correspond to the same gene, and the gene pair are orthologs (Figure 8.8). Compare this outcome with Figure 8.4.
When the RefSeq protein accession IDs do not match, you need to confirm that they are gene products of different genes before you claim that the gene pair is NOT orthologous. Recall that a gene can produce multiple protein isoforms! Thus, one gene can have multiple protein accession IDs, but only one of them can be the top hit. Where can you find this information? One way is to go to the protein information page for each protein RefSeq ID and search (Ctrl+F) for “/gene” to find the gene annotation buried at the bottom of the page (Figure 8.9).
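The “/gene” search can also be done on the text of a protein record. The sketch below assumes you have copied the record text into a Python string and that the gene annotation appears as a `/gene="..."` qualifier, as it does in the flat-file view of the protein information page. The excerpt is invented, and the gene name GENE1 is a placeholder.

```python
import re


def gene_names(record_text):
    """Return the sorted, de-duplicated gene names found in /gene="..." qualifiers."""
    return sorted(set(re.findall(r'/gene="([^"]+)"', record_text)))


# Invented excerpt laid out like the feature table of a protein record.
sample_record = '''
FEATURES             Location/Qualifiers
     Protein         1..100
                     /product="example protein"
     CDS             1..100
                     /gene="GENE1"
                     /coded_by="EXAMPLE:1..303"
'''

print(gene_names(sample_record))
```

Running the same function on the records for two different protein accession IDs and comparing the results tells you whether both proteins come from the same gene.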
Take-Home Message:
When one does an RBH analysis, protein sequences are used to perform each BLASTP (first and reciprocal). But a reciprocal best hit analysis successfully identifies a pair of orthologs when it identifies a pair of genes that are each other’s best hit. The BLASTP is used as a tool to identify those genes.
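The difference between comparing protein IDs and comparing genes can be shown with a small sketch. All accession IDs and gene names below are hypothetical placeholders. The accession-to-gene table stands in for the gene annotation you would look up on each protein information page.

```python
def same_gene(accession_a, accession_b, accession_to_gene):
    """Return True if two protein accession IDs are products of the same gene."""
    gene_a = accession_to_gene[accession_a]
    gene_b = accession_to_gene[accession_b]
    return gene_a == gene_b


# Hypothetical lookup table: two isoforms of GENE1 and one product of GENE2.
accession_to_gene = {
    "NP_HYPOTHETICAL1.1": "GENE1",
    "NP_HYPOTHETICAL2.1": "GENE1",
    "NP_HYPOTHETICAL3.1": "GENE2",
}

first_query = "NP_HYPOTHETICAL1.1"         # query used in the first BLASTP
reciprocal_top_hit = "NP_HYPOTHETICAL2.1"  # top hit of the reciprocal BLASTP

ids_match = first_query == reciprocal_top_hit
genes_match = same_gene(first_query, reciprocal_top_hit, accession_to_gene)
print(ids_match, genes_match)
```

In this example the protein IDs differ, but both map to the same gene, so the result still counts as a reciprocal best hit at the gene level.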
8.5.1 Test Your Understanding
Click the link above to see the TYU questions.
8.6 Homework
[54] A promoter that is activated by heat. With heat shock, the promoter will initiate high levels of transcription!
[55] Offspring or descendants of an animal or plant (online dictionary).
[56] An early backer and partner withdrew from a licensing deal, which triggered a fatal downward spiral.
[57] According to Wikipedia, glycosylation is a process whereby a carbohydrate is attached to a hydroxyl or other functional group of another molecule (often a lipid or protein).
[58] According to Genetics Home Reference, the PMM2 gene produces an enzyme called phosphomannomutase 2. This enzyme is involved in glycosylation (see earlier footnote), which acts by attaching groups of sugar molecules to proteins.
[59] Searching through a library of drugs already available and approved for a different disorder.
[60] PMM2-CDG patient fibroblasts are skin cells taken from patients with the PMM2 disorder.
[61] Sometimes called the bidirectional best hit (BBH) method.
[62] Defined by the online dictionary simply as “a question”. In our context, your human protein sequence implies the question: Are there proteins in my model system of choice that are similar to this human protein sequence?
[63] The BLAST algorithm finds and aligns regions of similarity between biological sequences. BLASTP searches protein databases with protein queries. BLASTN searches nucleotide databases with nucleotide sequences.