Section 9 Week 8: Microbial genomics and infectious disease (continued)

9.1 Online resources for SARS-CoV-2 genomics

A massive international SARS-CoV-2 genome sequencing effort has been mobilised in response to the global pandemic that began in Wuhan, China in December 2019. Consequently, this virus is now almost certainly the most sequenced biological entity on earth.

Laboratories across the world are sequencing huge numbers of virus samples. Typically, the viral RNA genome is amplified from patients’ swabs by RT-PCR and the resulting amplified DNA is sequenced on a second- or third-generation sequencing instrument. At Exeter, we currently use the Oxford Nanopore MinION sequencing platform. The Exeter effort is part of the COG-UK consortium, which coordinates SARS-CoV-2 sequencing across the United Kingdom and disseminates the resulting data so that it can be combined with and compared to data from across the globe.

Data from COG-UK are imported into GISAID, a database of viral sequences from all over the world. This initiative was originally set up to share sequence data from influenza viruses but has proved well suited as the go-to repository for sharing SARS-CoV-2 data.

The huge collection of SARS-CoV-2 genome sequence data and metadata (such as geographical location, date of collection, etc.) in GISAID can be accessed by simply downloading the whole dataset, which allows you to write your own code to analyse the data yourself. Alternatively, you can use web-based tools on the GISAID website to explore the data, for example the PhyloDynamics tool. There is also the fantastic Nextstrain web portal, which generates informative and highly visual analyses of the GISAID data.
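If you do download the data, a few lines of code are enough to start exploring it. The following is a minimal Python sketch, not an official GISAID tool. It assumes a tab-separated metadata table with columns named date and lineage (hypothetical names; check the header of the file you actually download), and it uses a tiny invented table in place of real data. It counts how many genomes were assigned to each lineage in each month.

```python
import csv
import io
from collections import Counter

# Invented example table standing in for a real metadata download.
# The column names (strain, date, country, lineage) are assumptions.
EXAMPLE_TSV = (
    "strain\tdate\tcountry\tlineage\n"
    "sample_1\t2020-10-03\tUnited Kingdom\tB.1.177\n"
    "sample_2\t2020-10-19\tUnited Kingdom\tB.1.177\n"
    "sample_3\t2020-12-05\tUnited Kingdom\tB.1.1.7\n"
    "sample_4\t2020-12-11\tUnited Kingdom\tB.1.1.7\n"
    "sample_5\t2020-12-20\tUnited Kingdom\tB.1.177\n"
)


def count_lineages_by_month(handle):
    """Return a Counter keyed by (YYYY-MM, lineage)."""
    counts = Counter()
    for row in csv.DictReader(handle, delimiter="\t"):
        month = row["date"][:7]  # "2020-10-03" -> "2020-10"
        counts[(month, row["lineage"])] += 1
    return counts


counts = count_lineages_by_month(io.StringIO(EXAMPLE_TSV))
for (month, lineage), n in sorted(counts.items()):
    print(month, lineage, n)
```

To run this on a real file, you would pass an open file handle, for example open("metadata.tsv", newline=""), instead of the io.StringIO object. The file name here is also just a placeholder.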

In these exercises, we are going to take a brief look at what can be learned from large-scale SARS-CoV-2 genome sequencing on a national and international scale using some web-based tools.

9.1.1 The COVID-19 Genomics UK (COG-UK) consortium

Navigate your browser to the COG-UK website at

  • How many SARS-CoV-2 genomes have been sequenced by COG-UK to date?

Now, let’s take a look at the COG-UK data and analyses. Navigate your web browser to: You will see links to several tools that use data from COG-UK.

9.1.2 SARS-CoV-2 lineages

First, let’s take a look at Microreact. Click on the link to the Microreact website. You should see something like this:

Start with the phylogenetic tree on the right side of the page. It looks a bit of a mess because it contains hundreds of thousands of nodes and branches. This phylogenetic tree represents the evolutionary relationships between the sequenced viruses. RNA viruses such as SARS-CoV-2 evolve quickly, though this virus evolves more slowly than some other RNA viruses such as HIV and influenza viruses.
Over time, more and more mutations occur, so each branch of the tree splits into sub-branches, the sub-branches later split into sub-sub-branches, and so on. Each lineage carries a specific set of mutations that it has acquired since diverging from the original ancestral virus from Wuhan, China. The major branches represent “lineages” of the virus and these lineages have been assigned names. You may have heard of some of them, for example the so-called “Kent” lineage. Although some lineages have informal names such as “Kent,” “Bristol,” or “South African,” there is also a formal naming system called PANGO, which is described in a research paper by Rambaut et al. (2020). Under this system, the “Kent” lineage is B.1.1.7. Before B.1.1.7 came to dominate the virus population in the UK, the most common lineage was the confusingly similarly named B.1.177. Towards the middle and top of the Microreact page, you can see plots that recount the rise and fall of various lineages during the course of the pandemic.

  • Which is the most common lineage of SARS-CoV-2 in the UK today?
  • Can you find which were the most common lineages one month ago? Six months ago? (Hint: look for the small calendar icon below the main graphic)
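PANGO names are hierarchical: each dot-separated number denotes a descendant of the name to its left. The short Python sketch below (the function names are our own, not part of any PANGO software) lists the ancestors implied by a name, which makes it easier to tell apart similar-looking names such as B.1.1.7 and B.1.177. It deliberately ignores the alias names that the scheme uses to keep long names short, so treat it as an illustration only.

```python
def parent_lineages(name):
    """List the ancestors implied by a dotted lineage name, nearest first."""
    parts = name.split(".")
    return [".".join(parts[:i]) for i in range(len(parts) - 1, 0, -1)]


def is_descendant(name, ancestor):
    """True if `ancestor` appears among the ancestors implied by `name`."""
    return ancestor in parent_lineages(name)


print(parent_lineages("B.1.1.7"))  # ['B.1.1', 'B.1', 'B']
print(parent_lineages("B.1.177"))  # ['B.1', 'B']
print(is_descendant("B.1.177", "B.1.1"))  # False
```

So, despite the similar names, B.1.1.7 sits within B.1.1 whereas B.1.177 branches directly from B.1.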

Now, let’s take a more global view. We will look at the PANGO lineages report website here:

Notice that it generates plots for the geographic distribution of “lineages of concern.” If you read/watch/listen to the news then you have probably heard of some of these.

  • Which lineages of concern have been found in the UK?
  • Can you find any information about why these lineages cause concern?

9.1.3 SARS-CoV-2 mutations

We have already seen that over time, the virus population diversified as it acquired more and more new mutations. We have also seen that each lineage is associated with a characteristic set of mutations. Let’s take a close look at an example.

From the PANGO lineages report website, let’s click on the link to the report for lineage B.1.1.7. This will take us to a page where we can see a list of the mutations associated with this lineage:

  • aa:orf1ab:T1001I
  • aa:orf1ab:A1708D
  • aa:orf1ab:I2230T
  • del:11288:9
  • del:21765:6
  • del:21991:3
  • aa:S:N501Y
  • aa:S:A570D
  • aa:S:P681H
  • aa:S:T716I
  • aa:S:S982A
  • aa:S:D1118H
  • aa:Orf8:Q27*
  • aa:Orf8:R52I
  • aa:Orf8:Y73C
  • aa:N:D3L
  • aa:N:S235F

So, what does this mean? Let’s break it down. It might be helpful to take a look at the organisation of the SARS-CoV-2 genome, for example in the Ensembl genome browser here.

You can see that the genome consists of a single “chromosome” approximately 30,000 nucleotides long. Within that nucleotide sequence are several open reading frames, or genes, that are known to encode proteins. The largest of these are ORF1a and ORF1b, which overlap each other. For interaction with the host, arguably the most important is the S gene, which encodes the spike protein that sits on the surface of the virus particle.

Going back to the list of mutations above:

  • aa:orf1ab:T1001I means a change from T (threonine) to I (isoleucine) at amino-acid number 1001 in the protein encoded by ORF1ab.
  • aa:S:N501Y means a change from N (asparagine) to Y (tyrosine) at position 501 in the spike protein.
  • … and so on.
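The notation is regular enough to be decoded by a short program. Here is a minimal Python sketch (our own helper functions, not part of any lineage-assignment software) that splits a mutation string into its parts and spells out the amino acids. We assume that the two numbers in a del entry are the start position in the genome and the length of the deletion in nucleotides, and that * stands for a stop codon.

```python
import re
from typing import NamedTuple

AMINO_ACIDS = {
    "A": "alanine", "R": "arginine", "N": "asparagine", "D": "aspartate",
    "C": "cysteine", "Q": "glutamine", "E": "glutamate", "G": "glycine",
    "H": "histidine", "I": "isoleucine", "L": "leucine", "K": "lysine",
    "M": "methionine", "F": "phenylalanine", "P": "proline", "S": "serine",
    "T": "threonine", "W": "tryptophan", "Y": "tyrosine", "V": "valine",
    "*": "stop",
}


class AminoAcidChange(NamedTuple):
    gene: str
    ref: str
    position: int
    alt: str


class Deletion(NamedTuple):
    start: int   # assumed: genome position where the deletion starts
    length: int  # assumed: number of nucleotides deleted


# Reference residue, residue number, replacement residue (or * for stop).
AA_PATTERN = re.compile(r"([A-Z])(\d+)([A-Z*])")


def parse_mutation(text):
    """Turn a string such as 'aa:S:N501Y' or 'del:21765:6' into a record."""
    kind, *fields = text.split(":")
    if kind == "aa" and len(fields) == 2:
        gene, change = fields
        match = AA_PATTERN.fullmatch(change)
        if match:
            ref, position, alt = match.groups()
            return AminoAcidChange(gene, ref, int(position), alt)
    elif kind == "del" and len(fields) == 2:
        start, length = fields
        return Deletion(int(start), int(length))
    raise ValueError(f"unrecognised mutation string: {text!r}")


def describe(text):
    """Describe a mutation string in plain English."""
    m = parse_mutation(text)
    if isinstance(m, Deletion):
        return f"deletion of {m.length} nucleotides starting at genome position {m.start}"
    return f"{AMINO_ACIDS[m.ref]} to {AMINO_ACIDS[m.alt]} at residue {m.position} of {m.gene}"


for s in ["aa:orf1ab:T1001I", "aa:S:N501Y", "del:21765:6"]:
    print(s, "->", describe(s))
```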

The S:N501Y mutation is particularly interesting because mutations at this residue have arisen independently more than once during the pandemic; that is, they have arisen in more than one lineage. Moreover, lineages that carry these mutations seem to overtake other lineages, suggesting that the mutations confer greater transmissibility. There is some evidence that mutations in this part of the spike protein can alter the binding affinity between the viral spike protein and the human ACE2 receptor (hACE2), but the molecular details are beyond the scope of what we are looking at today.

On the COG-UK website, there is a link to CoVal. CoVal is a repository of amino-acid replacement mutations identified in SARS-CoV-2 genome sequences, mapped onto the cryo-EM-derived protein structures. CoVal provides information on the demographic distribution of these mutations and reports co-occurring mutations. Unfortunately, it appears that CoVal is currently offline; otherwise it would have been fun to investigate the S:N501 mutations there.

9.1.4 Nextstrain

Finally, let’s follow the link on the COG-UK website to the Nextstrain website and navigate to the section about SARS-CoV-2. This will bring you to

You should see something like this:

Now we can start to see some of the value in such a large-scale and global effort in genome sequencing of this virus.

  • We have already seen that genome sequencing reveals mutations that have occurred since the start of the pandemic.
  • In turn, that reveals different lineages of the virus that have emerged. That is particularly important as we see so-called “lineages of concern,” those that may be more highly contagious and/or more deadly.
  • We can also see whether mutations (detected by genome sequencing) might allow the virus to escape from the immunity provided by vaccines and/or to evade detection (e.g. in PCR-based testing).
  • Here we can see another benefit from tracking lineages and mutations: we can start to understand how the virus is spreading within and between countries and continents.

Let’s take a closer look at the diversity plot at the bottom of the page:

This shows a schematic diagram of the 30-kb genome, with the positions of the genes marked. Along the length of the genome, sequence diversity is plotted on the vertical axis. Simply speaking, the higher the diversity, the more frequent mutations are at that position in the genome. Notice that the pattern does not seem to be entirely random. For example, amino-acid changes in the spike (S) protein seem to be more prevalent than the average over the genome; this might suggest that natural selection is favouring mutants that have modified S proteins with altered interactions with the host. In the future, after widespread vaccination, we might expect selection to favour mutants that escape vaccine-induced immunity. However, that would depend on such mutations arising in the first place, and escape would probably require a combination of mutations that would not spontaneously arise all at once. So let’s not be too pessimistic. In any case, genomic surveillance is essential so that we can detect potential ‘escape’ mutations and allow rapid intervention.
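One common way to put a number on per-site diversity is the Shannon entropy of each column of a sequence alignment: a column where every sequence has the same letter scores zero, and the score rises as the letters become more mixed. The Python sketch below illustrates the idea on a tiny invented alignment. It is not Nextstrain's own code, and whether the plot you are looking at shows exactly this measure depends on the display options selected.

```python
import math
from collections import Counter


def site_entropy(column):
    """Shannon entropy (in bits) of the characters in one alignment column."""
    counts = Counter(column)
    total = sum(counts.values())
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())


def entropy_profile(sequences):
    """Entropy at each position of equal-length, aligned sequences."""
    return [site_entropy(column) for column in zip(*sequences)]


# Invented four-sequence alignment, for illustration only.
alignment = ["AATG", "AATG", "TATG", "AACG"]
profile = entropy_profile(alignment)
print([round(h, 3) for h in profile])  # [0.811, 0.0, 0.811, 0.0]
```

Positions 2 and 4 are identical in every sequence and so have zero diversity, whereas positions 1 and 3 each have one variant sequence out of four.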

Note that you can use the sliders to zoom in on a smaller part of the genome. Let’s take a closer look at the genomic region encoding the spike protein:

If you hover the cursor over the plot, you can locate and click on specific mutations.

  • Can you now find and click on the N501 site in the S protein?

If you do so, you will notice that the phylogenetic tree plot and the geographic map get modified such that they now highlight the S:N501 mutant lineages.
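If you are unsure where to look along the genome axis, an amino-acid position can be converted into nucleotide coordinates, because each residue is encoded by three consecutive nucleotides. The Python sketch below assumes that the S gene starts at nucleotide 21563, which is its annotated start in the Wuhan-Hu-1 reference genome (NC_045512.2); the function name is our own.

```python
# Assumption: 1-based start of the S gene in the reference genome NC_045512.2.
S_GENE_START = 21563


def codon_coordinates(residue, gene_start=S_GENE_START):
    """Return the 1-based (first, last) genome positions of a residue's codon."""
    first = gene_start + 3 * (residue - 1)
    return first, first + 2


print(codon_coordinates(501))  # (23063, 23065)
```

So, under that assumption, the codon for residue 501 of the spike protein lies at about position 23,063 on the genome axis.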

Now, here’s the fun part: click on where it says “Play.” This will play back past events!

At first, all of the sequenced viruses have an N at position 501 in the S protein. When we reach about July 2020, we start to see viruses with a T at this position (S:N501T). Then, in September 2020, we start to see viruses that have a Y at this position (S:N501Y). Notice how the N501Y mutation independently crops up in several different branches of the phylogenetic tree at around the same time. Also notice how it increases in prevalence at diverse geographical locations around the world.
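The same kind of summary can be computed directly from a table of observations. The Python sketch below uses a handful of invented records (the dates and the lineage labels lineage_A to lineage_D are placeholders, not real data) to find when each residue at position 501 was first seen and which lineages carry it.

```python
from datetime import date

# Invented records: (collection date, lineage label, residue at S:501).
observations = [
    (date(2020, 3, 14), "lineage_A", "N"),
    (date(2020, 7, 21), "lineage_B", "T"),
    (date(2020, 9, 20), "lineage_C", "Y"),
    (date(2020, 10, 8), "lineage_D", "Y"),
    (date(2020, 11, 2), "lineage_A", "N"),
]


def summarise(records):
    """Return (first date seen, set of lineages) for each residue."""
    first_seen = {}
    lineages = {}
    for when, lineage, residue in records:
        if residue not in first_seen or when < first_seen[residue]:
            first_seen[residue] = when
        lineages.setdefault(residue, set()).add(lineage)
    return first_seen, lineages


first_seen, lineages = summarise(observations)
for residue in sorted(first_seen, key=first_seen.get):
    print(residue, first_seen[residue].isoformat(), sorted(lineages[residue]))
```

Finding the same residue in several lineages hints at independent origins, but only the phylogenetic tree can confirm that the mutation arose separately rather than being inherited from a shared ancestor.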

Here is the moment in late 2020 when the S:N501Y mutation appears to really take off, becoming very common in the UK and much of Europe and almost completely dominant in southern Africa: