Section 5 Week 4: DNA sequencing and assembly

Note that in the 2020/21 academic year, we could not hold the practical class in person, so completion of this task is optional. However, completing these exercises will help you to understand the concept of sequence assembly better.

You should be able to complete this easily within the three-hour session so there is plenty of time to ask questions to the instructor(s) and to finish the previous practical(s). You may wish to work in pairs or small groups or individually.

The point of this practical is to consolidate your understanding of de novo sequence assembly, which we are also covering in the lectures. The emphasis is on understanding the underlying principles rather than the software implementations. Therefore, you will perform the alignment and assembly tasks ‘manually’ with pen and paper or using laminated paper and scissors.

For these tasks you can manipulate the sequences using pen and paper, or using a word processor or text editor on your computer, or you can cut out the sequences with scissors and arrange them on your desk.

The sequences are reproduced on the last pages of this hand-out so you can remove the pages and cut them up. They are also provided on laminated paper, which you can cut up if you wish.

By the end of these exercises, you should have a much stronger understanding of the basic concepts of sequence assembly.

5.1 Assembly Exercise 1: A simple sequence assembly example

The following sequence reads come from shotgun sequencing of the genome of a virus isolated from a dragonfly. Try to assemble these 10 sequence reads into a single contiguous sequence (contig):

5’ TTCTATATAGGTGCCACTGCCACTGCTCCACCGTA 3’

5’ GATAGCCTTCTATATAGATGCCACTGCCACTGCTC 3’

5’ AGCGGTGGCAGTGGCACCTATATAGAAGGCTATCG 3’

5’ CCTATATAGAAGGCTATCGGAGATAAGACTACTTA 3’

5’ TATGGGAGATAAGACTACTTAATATTATTCTCTAC 3’

5’ GAGCAGTGGCAGTGGCACCTATATAGAAGGCTATC 3’

5’ AGAAGAATATTAAGTAGTCTTATCTCCGATAGCCT 3’

5’ CGATAGCCTTCTATATAGGTGGCACTGCCACTGCT 3’

5’ CGGAGATAAGACTACTTAATATTATTCTCTACGGT 3’

5’ CCGTAGAGAATAATATTAAGTAGTCTTATCTCCGA 3’

(Note: in the practical, you will be provided with a card and scissors so that you can cut out these sequences and arrange them on your desk.)

You could try using one of the approaches that we covered in the lectures, such as the greedy algorithm or overlap-layout-consensus (OLC), or you could try to devise a method yourself. You might be able to just align the sequences together to form a contig. It is probably not feasible to attempt a k-mer / de Bruijn graph with this example.

Note that this is an extremely simplified example. In a ‘real’ genome sequence assembly task, you would typically be handling many millions of sequence reads and most genomes are far bigger than this toy example.

When assembling this sequence, remember what you have learned about DNA and sequencing. For example, recall that:

* DNA is double-stranded
* DNA sequencing is not 100% accurate
* DNA molecules can be linear or circular
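The first two reminders can be made concrete with a short Python sketch. It is not the method from the lectures, just one possible way to compute a reverse complement and to measure a suffix-prefix overlap while tolerating a few mismatches. The function names, the `min_len` threshold and the toy sequences are illustrative assumptions, not part of the exercise data.

```python
def reverse_complement(seq):
    """Return the sequence of the opposite strand, read 5' to 3'."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(seq))


def suffix_prefix_overlap(a, b, min_len=5, max_mismatches=0):
    """Length of the longest suffix of `a` that matches a prefix of `b`.

    Up to `max_mismatches` differing bases are tolerated, which is one
    simple way to allow for sequencing errors. Returns 0 if no overlap
    of at least `min_len` bases is found.
    """
    for length in range(min(len(a), len(b)), min_len - 1, -1):
        mismatches = sum(x != y for x, y in zip(a[-length:], b[:length]))
        if mismatches <= max_mismatches:
            return length
    return 0


# Toy reads (hypothetical, not taken from the exercise).
read_a = "AAACGTACGT"
read_b = "CGTACCTTTT"  # differs by one base from the end of read_a

print(reverse_complement("ATGC"))                                    # GCAT
print(suffix_prefix_overlap(read_a, read_b, min_len=3))              # 3
print(suffix_prefix_overlap(read_a, read_b, 3, max_mismatches=1))    # 7
```

When comparing two reads from the exercise, it may help to test each read and its reverse complement, since a read can come from either strand.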

5.2 Assembly Exercise 2: Assembly of a repetitive region using unpaired and paired reads

As we discussed in the lectures, assembling repetitive sequences is difficult. If the sequence reads are shorter than the repetitive sequence, then multiple repeats can get collapsed together, yielding an incorrect sequence.

Try assembling these 34 reads:

TATATATATATATATA TATATATATATATATA CGCATATATATATATA TATATATATATATATA ATATATATATATATAT ATATATATATATATAT TATATATATATATATA TATATATATATATATG TATATATATATATATA TATATATATATATATA TATATATATATATATA TATATATATATATGCC TATATATATATATATA TATATATATGCCGATT CATATATATATATATA TATATATATATATATA CGCGCATATATATATA TATATATATATATATA ATATATATATATATAT ATATATATATGCCGAT CGCGCATATATATATA TATATATATATATATA TATATATATATATATA TATATATATATATGCC ATATATATATATATAT ATATATATATATATAT ATATATATATATATAT ATATATATATATGCCG ATATATATATATATAT ATATATATATGCCGAT TATATATATATATATA TATATATATGCCGATT TATATATATATATATA TATATATATATATGCC

(Note: in the practical, you will be provided with a card and scissors so that you can cut out these sequences and arrange them on your desk.)

What was the answer? Perhaps you assembled the sequence as CGCGCATATATATATATATATATGCCGATT? Actually, the original genome sequence was: CGCGCATATATATATATATATATATATATATATATATATATATATATATATATATATATGCCGATT. See how the repeats have been collapsed.

The problem of assembling repetitive sequences
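One way to see why the collapse happens is to check that every possible 16-base read from the original genome also occurs in the shorter, collapsed sequence. The sketch below does this in Python, using the two sequences quoted above. The helper name `all_reads` is an illustrative assumption, and the read length of 16 is taken from the reads listed in this exercise.

```python
def all_reads(genome, read_len):
    """Every distinct substring of length `read_len` (an idealised read set)."""
    return {genome[i:i + read_len] for i in range(len(genome) - read_len + 1)}


# Sequences copied from the text above.
original = "CGCGCATATATATATATATATATATATATATATATATATATATATATATATATATATATGCCGATT"
collapsed = "CGCGCATATATATATATATATATGCCGATT"

reads = all_reads(original, 16)

# Reads from the original genome that the collapsed assembly cannot explain.
unexplained = [read for read in reads if read not in collapsed]
print(unexplained)  # []
```

Because no read contradicts the collapsed sequence, unpaired reads of this length cannot tell the two candidate assemblies apart.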

However, using paired reads, we can overcome this problem, at least partially. To see how this works, try assembling these 17 paired reads:

TATATATATATATATAnnnnnnnnnnnnnnnnnnnnTATATATATATATATA

CGCATATATATATATAnnnnnnnnnnnnnnnnnnnnTATATATATATATATA

ATATATATATATATATnnnnnnnnnnnnnnnnnnnnATATATATATATATAT

TATATATATATATATAnnnnnnnnnnnnnnnnnnnnTATATATATATATATG

TATATATATATATATAnnnnnnnnnnnnnnnnnnnnTATATATATATATATA

TATATATATATATATAnnnnnnnnnnnnnnnnnnnnTATATATATATATGCC

TATATATATATATATAnnnnnnnnnnnnnnnnnnnnTATATATATGCCGATT

CATATATATATATATAnnnnnnnnnnnnnnnnnnnnTATATATATATATATA

CGCGCATATATATATAnnnnnnnnnnnnnnnnnnnnTATATATATATATATA

ATATATATATATATATnnnnnnnnnnnnnnnnnnnnATATATATATGCCGAT

CGCGCATATATATATAnnnnnnnnnnnnnnnnnnnnTATATATATATATATA

TATATATATATATATAnnnnnnnnnnnnnnnnnnnnTATATATATATATGCC

ATATATATATATATATnnnnnnnnnnnnnnnnnnnnATATATATATATATAT

ATATATATATATATATnnnnnnnnnnnnnnnnnnnnATATATATATATGCCG

ATATATATATATATATnnnnnnnnnnnnnnnnnnnnATATATATATGCCGAT

TATATATATATATATAnnnnnnnnnnnnnnnnnnnnTATATATATGCCGATT

TATATATATATATATAnnnnnnnnnnnnnnnnnnnnTATATATATATATGCC

(Note: in the practical, you will be provided with scissors and a card so that you can cut out these sequences and arrange them on your desk.)

You should find that assembly using paired reads is more accurate than assembly using unpaired reads, though it is still far from perfect because the number of repeats remains ambiguous. This illustrates one of the advantages of using paired reads, i.e. where we sequence both ends of each DNA fragment rather than sequencing just one end (we routinely do this when using the short-read Illumina method of DNA sequencing). An alternative would be to generate longer reads such that an individual read spans the whole repetitive region.
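The constraint that pairing adds can be sketched in Python as follows. The sketch makes several simplifying assumptions that are not stated in the exercise: the flanking sequences are taken to be CGCGC and GCCGATT, the repeat unit is taken to be AT, and the gap between the two reads of a pair is taken to be exactly the 20 bases drawn as `n` (real sequencing libraries have variable fragment sizes). Only the distinct pairs from the list above are included, and the names `candidate` and `fits` are illustrative.

```python
import re

GAP = 20  # assumed exact distance between the two reads of a pair

# Distinct pairs copied from the list above.
PAIRS = """
TATATATATATATATAnnnnnnnnnnnnnnnnnnnnTATATATATATATATA
CGCATATATATATATAnnnnnnnnnnnnnnnnnnnnTATATATATATATATA
ATATATATATATATATnnnnnnnnnnnnnnnnnnnnATATATATATATATAT
TATATATATATATATAnnnnnnnnnnnnnnnnnnnnTATATATATATATATG
TATATATATATATATAnnnnnnnnnnnnnnnnnnnnTATATATATATATGCC
TATATATATATATATAnnnnnnnnnnnnnnnnnnnnTATATATATGCCGATT
CATATATATATATATAnnnnnnnnnnnnnnnnnnnnTATATATATATATATA
CGCGCATATATATATAnnnnnnnnnnnnnnnnnnnnTATATATATATATATA
ATATATATATATATATnnnnnnnnnnnnnnnnnnnnATATATATATGCCGAT
ATATATATATATATATnnnnnnnnnnnnnnnnnnnnATATATATATATGCCG
""".split()


def candidate(n_units):
    """Hypothetical genome with `n_units` copies of AT between the flanks."""
    return "CGCGC" + "AT" * n_units + "GCCGATT"


def fits(genome, pair, gap=GAP):
    """True if both reads of the pair can be placed `gap` bases apart."""
    left, right = re.split("n+", pair)
    offset = len(left) + gap
    last_start = len(genome) - offset - len(right)
    return any(
        genome.startswith(left, pos) and genome.startswith(right, pos + offset)
        for pos in range(last_start + 1)
    )


# Repeat counts (out of 1..40) that are consistent with every pair.
consistent = [n for n in range(1, 41)
              if all(fits(candidate(n), pair) for pair in PAIRS)]
print(min(consistent))  # 27
print(len(consistent))  # 14, i.e. every count from 27 to 40 also fits
```

Under these assumptions the pairs rule out short, collapsed candidates, because a pair spanning 52 bases cannot be placed in them. They set only a lower bound on the number of repeat units, however, which is why the repeat count remains ambiguous.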

5.3 Assembly Exercise 3: Building a de Bruijn graph

For this exercise, it is best to use a pen/pencil and paper. A blank sheet of paper is included in the booklet for this purpose. Your task is to generate a de Bruijn graph from the following ten sequence reads:

   CGTTAGTG
    GTTAGTGG
     TTAGTGGC
      TAGTGGCA
       AGTGGCAA
        GTGGCAAG
         TGGCAAGC
       AGTGGCAT
        GTGGCATT
         TGGCATTC

Notice that, for convenience, the reads have been aligned (but not assembled). If you have time, you could also try making the overlap graph (i.e. part of the OLC method).
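If you would like to check your hand-drawn graph afterwards, the construction can be sketched in a few lines of Python. The choice of k = 4 is an assumption made here for illustration (the exercise does not fix a value of k), and the function name is hypothetical. In this sketch each node is a (k-1)-mer and each k-mer found in a read contributes one edge from its prefix to its suffix.

```python
def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer node to the set of nodes that follow it."""
    graph = {}
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            prefix, suffix = kmer[:-1], kmer[1:]
            graph.setdefault(prefix, set()).add(suffix)
            graph.setdefault(suffix, set())  # keep nodes with no outgoing edge
    return graph


reads = [
    "CGTTAGTG", "GTTAGTGG", "TTAGTGGC", "TAGTGGCA", "AGTGGCAA",
    "GTGGCAAG", "TGGCAAGC", "AGTGGCAT", "GTGGCATT", "TGGCATTC",
]

graph = de_bruijn_graph(reads, k=4)

# Print each node with its successors, in alphabetical order.
for node in sorted(graph):
    print(node, "->", ", ".join(sorted(graph[node])) or "(end)")

# Nodes with more than one successor are where the graph forks.
branching = sorted(node for node, successors in graph.items()
                   if len(successors) > 1)
print("Branching nodes:", branching)
```

Try changing `k` to see how the number of nodes and the position of the fork change.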

5.4 Exploring an assembly graph using Bandage software

In case you finish the main assembly tasks with lots of time to spare, you may wish to explore some real genome assembly results using the Bandage software.

During the lectures and outside reading, you have learned about the k-mer (de Bruijn) graph approach to assembling a genome sequence from short sequence reads generated by ‘next-generation’ sequencing methods such as Illumina MiSeq. Now, we are going to take a look at the end results of assembling a bacterial genome sequence. Originally, we sequenced the genome of this bacterial strain using Illumina HiSeq, which generates pairs of short sequence reads. You have been provided with the results of assembling these short sequence reads using an assembly program called SPAdes.

The Illumina-based data are available here:

https://universityofexeteruk-my.sharepoint.com/:f:/r/personal/d_j_studholme_exeter_ac_uk/Documents/Teaching/bio2092/practicals/BIO2092%20practical%20data/Practical%205%20(Week%208)%20Sequence%20assembly/Xanthomonas%20vasicola%20NCPPB1060/Illumina-based%20assembly?csf=1&e=TVFx3I

or, via ELE, here:

https://vle.exeter.ac.uk/mod/url/view.php?id=722134

Navigate to the OneDrive folder containing data for genome assembly, and download the files:

Once you have downloaded these files, first take a look at the contents of the contigs.fasta and the scaffolds.fasta files using a text editor or word processor (e.g. Notepad or Microsoft Word). These files contain the final results of the assembly process; that is, the nucleotide sequences of the assembled contigs and scaffolds.

Next, take a look at the Assembly statistics folder.

In particular, download and open the report.html file, which contains a summary of several measures of assembly quality. Make sure that you download the file (e.g. to your Downloads folder) and open the downloaded copy rather than trying to open it within OneDrive.

Note the N50 value. This is a measure of how contiguous the assembly is: it is the length of the shortest contig in the set of longest contigs that together make up at least half of the total size of the assembly. What do you think the steepness of the curve indicates about the assembly? These statistics were generated by running the Quast software on the contigs.fasta and scaffolds.fasta files.
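To make the N50 statistic concrete, here is a minimal Python sketch that computes it from contig lengths. It is not how Quast is implemented. The function names and the toy FASTA records are illustrative assumptions.

```python
def fasta_lengths(lines):
    """Return the length of each sequence in FASTA-formatted lines."""
    lengths = []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            lengths.append(0)           # a header line starts a new record
        elif line and lengths:
            lengths[-1] += len(line)    # sequence may span several lines
    return lengths


def n50(lengths):
    """Length of the contig at which the running total of contig lengths,
    taken from longest to shortest, first reaches half of the assembly."""
    half = sum(lengths) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half:
            return length
    return 0


# Toy assembly (hypothetical): five contigs of 6, 5, 4, 3 and 2 bases.
toy_fasta = [">c1", "ACGTAC", ">c2", "ACGTA", ">c3", "ACGT",
             ">c4", "ACG", ">c5", "AC"]
lengths = fasta_lengths(toy_fasta)
print(lengths)       # [6, 5, 4, 3, 2]
print(n50(lengths))  # 5, because 6 + 5 = 11 is at least half of 20
```

To try this on the real data, you could pass an open file to the same function, for example `fasta_lengths(open("contigs.fasta"))`, assuming the file is in your working directory. The result may differ slightly from the report if Quast was run with a minimum contig length filter.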

Now, let’s use the Bandage software to examine the assembly graph. This graph is derived from the k-mer graph but each node in the graph corresponds to a contig. In other words, where possible, adjacent k-mers in the k-mer graph have been combined to yield contigs. However, ambiguities in the assembly graph (forks, loops, branches, etc.) remain.

Run the Bandage software, which should already be installed on your PC. If it is not installed or not working, you can download the latest version from here: https://github.com/rrwick/Bandage/releases/. Click on the appropriate file for your computer. Assuming that you are using a Windows PC, this will likely be Bandage_Windows_v0_8_1.zip.

Once you have downloaded the .zip file, you need to unzip it (try right-clicking on the filename). Then find the executable file called Bandage.exe. Double click on that file and it will launch the Bandage software.

Earlier, you downloaded data from the OneDrive and, hopefully, unzipped it. You are going to need the file called assembly_graph.gfa, so make sure you know where this file is.

Use the File -> Load graph menu item in Bandage to load the graph file that you downloaded earlier (assembly_graph.gfa):

Now, when you click on the Draw graph button, you should see the assembly graph. It will look something like this:

Note the zoom control, which allows you to zoom in and take a closer look at a smaller part of the graph.

Also, it would be a good idea to switch on the Length, Name and Depth labels (see the ‘node labels’ section).

It is kind of fun to explore this assembly graph. Some things you can do:

* Use the controls to zoom in, zoom out and move around the graph.
* You can also label the nodes in the graph with their names, sequence depths, and lengths.
* You can select nodes (contigs) and perform BLAST searches with the selected contigs (using the Output menu item).

What you are looking at is an assembly graph. Each node (i.e. vertex) is essentially a contig (contiguous sequence). The connecting lines (i.e. edges) indicate that the ends of the contigs overlap. Notice how difficult it would be to infer the original genome sequence from this tangle of overlapping sequences!
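The same node-and-edge structure can be read directly from the graph file. The Python sketch below assumes that the file follows version 1 of the GFA format, in which tab-separated lines beginning with `S` describe segments (nodes) and lines beginning with `L` describe links (edges). The toy records are invented for illustration and are not taken from assembly_graph.gfa; the function name is hypothetical.

```python
from collections import Counter


def parse_gfa(lines):
    """Collect segment lengths and links from GFA version 1 lines."""
    segments = {}  # segment name -> sequence length
    links = []     # (from_segment, to_segment)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "S":
            segments[fields[1]] = len(fields[2])
        elif fields[0] == "L":
            links.append((fields[1], fields[3]))
    return segments, links


# Toy graph (hypothetical): segment 1 forks into segments 2 and 3.
toy_gfa = [
    "H\tVN:Z:1.0",
    "S\t1\tACGTACGT",
    "S\t2\tCGTTT",
    "S\t3\tCGTAA",
    "L\t1\t+\t2\t+\t3M",
    "L\t1\t+\t3\t+\t3M",
]

segments, links = parse_gfa(toy_gfa)

# Count how many links touch each segment.
degree = Counter()
for source, target in links:
    degree[source] += 1
    degree[target] += 1

print(len(segments), "segments and", len(links), "links")
for name, length in segments.items():
    print(f"segment {name}: {length} bases, {degree[name]} link(s)")
```

Segments that touch many links correspond to the tangled regions that you can see in Bandage.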

If you want to further investigate any of the nodes, you can click on a node to select it and then go to the Output menu to copy its sequence to clipboard.

From there, you can go to the NCBI’s BLAST website and paste the sequence into the query box. A BLASTN search against the NR database might reveal more about this sequence.

There is no specific task for you to do here. I just want you to get a feel for the complexity of the assembled data and to appreciate that the resulting assembly is simply a model of the original genome – and is likely to be imperfect!