Chapter 3 Workshop: Discovering a new gene in a DNA sequence

3.1 Using WWW resources to discover a new gene

The data in this Nucleotide database include complete chromosomes, individual genes, expressed sequence tags (ESTs) and various other DNA/RNA molecules and fragments of molecules. Some of these DNA sequences have been very well described and annotated, e.g. sequences of human chromosomes. However, many DNA sequences in the database have hardly been studied at all and present an opportunity for us to discover something new. Examples include some microbial genome sequences and a wealth of metagenomics sequences. These are a good place to go ‘fishing’ for new genes. In fact there is a name for this: ‘bioprospecting’.

Photo of people fishing

A genome (or metagenome) sequence typically contains segments called genes that encode a protein (or a functional RNA), separated by intergenic regions. Generally, the genes evolve relatively slowly, so they will share sequence similarity with genes from other organisms, even if that similarity is limited. On the other hand, intergenic regions often evolve relatively quickly and may share little or no sequence similarity with any other organisms. So, we are going to look for islands of relatively conserved sequence to help identify new genes.

3.2 The data

A good place to access DNA sequence data is the NCBI Nucleotide database (https://www.ncbi.nlm.nih.gov/nuccore/). You can do simple (as well as quite sophisticated) searches for sequence data here. But perhaps the most useful way to access sequence data is via a BLAST search. You might already be familiar with BLAST from your previous studies.

“The Basic Local Alignment Search Tool (BLAST) finds regions of local similarity between sequences. The program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of matches. BLAST can be used to infer functional and evolutionary relationships between sequences as well as help identify members of gene families.”

From: https://blast.ncbi.nlm.nih.gov/Blast.cgi

3.3 Discover a previously unknown gene

The data in this Nucleotide database include complete chromosomes, individual genes, expressed sequence tags (ESTs) and various other DNA/RNA molecules and fragments of molecules. Some of these DNA sequences have been very well described and annotated, e.g. sequences of human chromosomes. However, many DNA sequences in the database have hardly been studied at all and present an opportunity for us to discover something new. Examples include some microbial genome sequences and a wealth of metagenomics sequences. These are a good place to go ‘fishing’ for new genes. In fact there is a name for this: ‘bioprospecting’.See, for example, see Ikeda (2017).

A genome (or metagenome) sequence typically contains genes that encode a protein (or a functional RNA), separated by intergenic regions. The genes evolve relatively slowly, so they will show some sequence similarity with genes from other closely related organisms. Furthermore, the amino-acid sequences of proteins evolve more slowly than the underlying DNA sequence; this means that we can detect amino-acid sequences similarity even between quite distantly related organisms. On the other hand, intergenic regions often evolve relatively quickly and may share little or no sequence similarity with any other organisms. So, we are going to look for islands of relatively conserved sequence to help identify new genes.

3.3.1 Choose a previously uncharacterised DNA sequence

The first step towards discovering a new gene is to choose a DNA sequence from the public databases that has not already undergone extensive study and gene prediction.

Here are some example DNA sequences that you could use for this exercise (choose just one):

A contig from a normal human genome:
- https://www.ncbi.nlm.nih.gov/sites/entrez?cmd=Search&db=nucleotide&term=JAAIVS010000007.1
- https://www.ncbi.nlm.nih.gov/sites/entrez?cmd=Search&db=nucleotide&term=JAAIVS010000027.1
A contig from a human tumour genome:
- https://www.ncbi.nlm.nih.gov/nuccore/1932101380
- https://www.ncbi.nlm.nih.gov/nuccore/JABWGJ010000008.1
A contig from the genome of the black-handed spider monkey (Ateles geoffroyi):
- https://www.ncbi.nlm.nih.gov/nuccore/JAKFHY010000027.1
A contig from the assembly of a metagenome (from a lake):
- https://www.ncbi.nlm.nih.gov/nuccore/PDVJ01000554.1.
A contig from a butterfly genome:
- https://www.ncbi.nlm.nih.gov/sites/entrez?cmd=Search&db=nucleotide&term=JAXCMA010000013.1

If you prefer not to use one of those suggested examples, you are welcome to find an alternative yourself. A good source of such sequences are the various metagenomics studies of microbiomes. Try searching for a suitable example at the NCBI Nucleotide web portal: https://www.ncbi.nlm.nih.gov/nucleotide.

At the NCBI website you could simply search for ‘metagenome’ or something more specific like ‘coral metagenome’, ‘alkali sediment metagenome’; use your imagination! Then choose a specific DNA sequence from this metagenome/genome to work on, from among those returned by the search. You might need to use some trial and error to navigate around the NCBI website, clicking on a few links before you get to a specific sequence.

Ideally, you will choose a DNA sequence that is between about 4 kbp and 45 kbp in size. This ensures that the sequence is big enough to include at least one gene and is not so big that the analysis will be slow and cumbersome.

You should ensure that your chosen sequence has not already been annotated with predicted genes. This is so that you can discover something truly novel that has not already been discovered/predicted and entered into the public databases.

For the DNA sequence that you have chosen to study, can you find the following information?

Question	Please write you answer in this space
Where did this DNA sample come from?
What method was used to sequence the DNA?
Accession number of this sequence?
How long is this DNA sequence?
Do we know what organism this DNA came from?
Are there any research papers associated with this DNA sequence?
Who generated the data?

Now, we are going to try to discover a new protein-coding gene on this DNA sequence. In other words, we are going to annotate the naked DNA sequence with the positions of a gene. The underlying principle is that the protein encoded by this gene shares some sequence similarity with already-known proteins. So we are going to search for patches of sequence similarity between our naked DNA sequence and a database of amino-acid sequences from known proteins.

To achieve this, we will perform a BLAST search using your chosen DNA sequence as the query against the databases of protein amino acid sequences at the NCBI. The idea is that your DNA sequence will contain ‘islands’ of protein-coding sequences, surrounded by a ‘sea’ of intergenic non-coding sequence.

Remember, there are several ‘flavours’ of BLAST (BLASTN, BLASTP, TBLASTN, BLASTX, TBLASTX); which one should you use and why? (Clue: BLASTX takes a DNA query sequence and searches against a database of proteins). You can perform your BLAST search at https://blast.ncbi.nlm.nih.gov/Blast.cgi as in the example in the image below.

Notice that for the query you can insert either the actual DNA sequence or just the accession number. When you are ready, hit the “BLAST” button and wait for the results. By the way, if your patience is limited, you might wish to search against the Swiss-Prot database rather than the default database because searching this smaller database will be a bit quicker than searching the much larger one. On the other hand, searching against the smaller, less-comprehensive database might yield less information. So, you could try both (in different tabs in your web browser).

In the example BLAST result below, we can see four regions of the DNA sequence that show clear similarity. This strongly suggests there is homology between the product of our query sequence and previously described proteins in the SwissProt database. Since these previously-known proteins are homologous to our query protein, it is likely that they share similar structure and function.

Let’s choose one of these for further investigation (click on one of the four red blocks that each represent a block of sequence similarity):

We are now going to perform a series of steps in order to extract the DNA sequence of our newly discovered gene. Note that this gene sequence is a substring of the original query DNA sequence that we started with. In other words, the protein-coding gene sequence occupies only a small part of the genomic DNA sequence; their might be other genes elsewhere in the DNA sequence. In the example above, our gene lies around positions 76,476 and 77,696 in the original 188,848-bp sequence.

There is more than one way to extract the gene sequence from the larger DNA query sequence. For example, we could try to manually copy and paste this substring onto the clipboard. However, the steps below are probably easier and less error-prone.

In the example above, we can see that protein accession P07323.2 is homologous with our newly discovered gene. So, let’s use TBLASTN to get the sequence of our new gene by aligning the protein against the DNA sequence. This time we will use the “Align two or more sequences” option and we will use previously described protein P07323.2 as the query and metagenomic DNA sequence accession PDVJ01000001.1 as the ‘subject sequence’ against which we will search the query:

The result of this example looks like this:

This clearly shows that about 85% of the query protein’s length has detectable sequence similarity with our query DNA sequence. Now if we scroll down we can see the sequence alignment:

Note the “Download” icon (in the figure above). This allows us to obtain the sequence of our new gene:

Here is the resulting nucleotide sequence of our new gene:

Notice that this contains just the sequence of the actual gene and we have discarded the non-coding DNA sequence from upstream and downstream of the gene. We can now perform analyses on this gene without confusing or confounding results arising from the up- and downstream flanking sequences.

We can now try to gain insights into the structure, function and evolution of this gene by pasting its sequence, or the amino-acid sequence of its encoded protein, into various publicly available webservers …

Once you have extracted the sequence of the gene that you have discovered, try to answer the following questions about it. Some hints are given below about online tools that you could use to address these questions. BLAST searches will help to answer many of these. The most useful thing to do first is to translate from DNA to protein. Experiment with the tools mentioned below, copying and pasting your new protein sequence. They are all reasonably easy to use; but please ask for help if you get stuck. Discuss your findings with other students, the demonstrators and lecturer.

You may wish to come back to the questions about structure and function after completeing the subsequent workshop, which is all about predicting functions of proteins from their amino-acid sequences.

Question	Please write you answer in this space (it may be useful for your coursework)
What is the predicted function?
Does it have any significance in pathogenesis, biotechnology, fundamental biological processes? Hint: PubMed database might be helpful.
Can we predict the structure of the protein?
Does it contain any structural domains?
Is the gene part of an operon or cluster of functionally related genes?
Do homologues occur in only closely related organisms or do they occur more widely across all domains of life?
Can you generate a multiple sequence alignment for your protein and its homologues?
From the alignment, can you build a phylogenetic tree?
What organism (taxonomic group) does your protein come from? The phylogenetic tree might help you to answer this.

Suggested online analysis tools (additional information about some of these is provided below):

Suggested tool	Web address
Translate DNA to protein	https://web.expasy.org/translate/
STRING (see mini-tutorial below)	https://string-db.org/
Pfam	http://pfam.xfam.org/
Find open reading frames in your gene sequence	http://www.ncbi.nlm.nih.gov/gorf/gorf.html

3.4 Finding and displaying open reading frames (ORFs)

A double-stranded DNA sequence has six different reading frames (three forward and three reverse). An open reading frame (ORF) is the part of a reading frame that has the potential to code for a protein or peptide. An ORF is a continuous stretch of DNA beginning with a start codon usually methionine (ATG) and ending with a stop codon (TAA, TAG or TGA in most genomes).

For this task, you will use the NCBI ORF Finder at http://www.ncbi.nlm.nih.gov/gorf/gorf.html

Exercise: Use this tool to find the ORFs in the human opsin-5 transcript sequence (accession number NM_181744) or in your newly discovered gene.

So, in your newly discovered gene, can you find a single ORF that encodes the newly discovered protein? If so, then in which reading frame does the ORF fall? If not, then why not? (You may wish to draw a diagram here)

By now, I hope that you are now convinced that it is possible to make new discoveries in biology surprisingly easily by exploring freely available data and using freely available analysis tools. This is thanks to the efforts of software developers, curators and database managers as well as the researchers who deposit their data into these public repositories. The techniques that you explored today can lead you to important insights, based entirely on systematic, yet quite simple, analysis of publicly available DNA and protein sequence data. It doesn’t always require any expensive equipment or specialised laboratory, just a computer with an internet connection and some curiosity!

References

Ikeda, Haruo. 2017. “Natural products discovery from micro-organisms in the post-genome era†.” Bioscience, Biotechnology, and Biochemistry 81 (1): 13–22. https://doi.org/10.1080/09168451.2016.1248366.