Chapter 1 Week 1 Lab


Pre-lab Assignments/Readings

  • Watch the welcome video in the Canvas media gallery.

Homework for Next Week

  • Fill in all of the answer blocks in the lab 1 Rmd.
  • Complete all tasks in the Homework section (bottom of the Rmd).
  • Knit the complete Rmd as a pdf or html file and submit it on Canvas. (To download from RStudio Cloud, click the checkbox next to a file in the “Files” tab in the lower right quadrant, click on the “More” gear, then click “Export…” and save it to your computer.)
  • Read this short news piece from some UC Davis faculty. https://biology.ucdavis.edu/news/language-biology-how-heck-do-scientists-assemble-genome
  • Respond to the Week 1 discussion post on Piazza.

Learning Objectives

  • Apply genome assembly approaches to text.
  • Think formally about using algorithms to solve a problem.
  • Practice using R to manipulate and analyze strings.
  • Generate an Rmd file and associated html or pdf file.

1.0.1 Overview

Today’s work will be a combination of discussion, individual work, and small group work. We will use this Rmd file to take notes. Over the quarter, the individual labs will help us build up a full “lab notebook” and provide a space for your code, reflection, and synthesis.

1.0.2 R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:

3+5
## [1] 8

More on that to come later.

1.0.3 Genome assembly – a literature perspective

When you perform DNA sequencing, you don’t immediately end up with long DNA sequences that can be easily connected to data from genomic databases. Instead, you begin by getting a set of short reads. Each read is a short sequence of DNA usually ranging from ~50 to 500 base pairs, depending on the sequencing technology used. We will jump into some reads of real DNA sequences from our halophiles next week. But this week, let’s build our intuition with something closer to our everyday experience, regular text.

We (your instructors) will share some short “reads” of text on a Google Sheet. Each read here is only 20 characters long, and consists of the letters a to z, with underscores _ indicating spaces between words. Just like DNA, the characters here have meaning. These reads were created by sequencing a text “gene”. For example, here_is_the_rest_of_this sentence_as_two_reads.


Pause here for instructions before going on.


1.0.3.1 Your objective: assemble this text

Here are a few features you’ll want to keep in mind.

  • Many of these reads overlap. So multiple reads should cover most parts of the text.
  • In fact, the average depth of coverage is 5X. This means that 5 reads, on average, should cover any letter in the text.
  • Sequencing is not free from errors. There was a 1% error rate in our "sequencing.


1.0.4 Algorithmic thinking

1.0.4.1 5-minute task – making toast

Take 5 minutes to think about and write down the steps required to making toast, starting with a bag of bread and a toaster. Imagine you are explaining this to a 5-year-old who is a reasonable human being, but has never used a toaster by themselves.

1.0.4.2 Class discussion

Instructors will guide.



1.0.5 Getting acquainted with R

So far we haven’t gotten into the real benefits of using an R Markdown document. One of the greatest benefits is that you can easily embed R code, run it, make plots, and use all of R’s functionality while creating an easy to read document. Let’s begin with some basics. R Markdown files will automatically interpret R code contained within chunks sandwiched by three back tick symbols.

R is basically a fancy calculator. But it can store objects such as variables or functions, which is useful. Instead of having to arithmetic over and over again, you can use functions in R to do your bidding. Let’s define some objects.

# assigning the value 3 to the variable x
x <- 3

# assigning the value 7 to the variable y

y = 7

# evaluating the sum of x and y
x + y
## [1] 10

We defined two different objects here. x was defined as the number 3, and y was defined as the number 7. Note that we can assign values to a variable using either <- or =. The equals sign may seem more intuitive, but equals signs have some other meanings in other contexts in R, so we’ll tend to use <-. As a shortcut, you can type either alt and - or cmd and - to make the symbol <-, if you have a pc or mac, respectively.

As you go through, you can use ctrl and enter to run a single line of code, or click the green play button at the right corner of the code block to run the whole block. You can also just copy and paste into the console, but that is slow and will eventually drive you insane.

Let’s define something more complicated.

# storing a string of base pairs at the variable dna_fragment
dna_fragment <- "ACTTCTTCACGCTGCTACGTACGATATTAACGGCCGATCACGATCGACGTAGCTAGCTAGCT"
dna_fragment
## [1] "ACTTCTTCACGCTGCTACGTACGATATTAACGGCCGATCACGATCGACGTAGCTAGCTAGCT"

Now we have a string stored as an object named dna_fragment – a string is a sequence of characters. These are always surrounded by quotation marks in R.
Even with this short string, counting the letters might get tedious. What functions can can R offer us?

fragment_length <- nchar(dna_fragment)
fragment_length  #this is the total number of characters in our string
## [1] 62

When you call functions in R, they will always have their argument (their input) surrounded by round brackets, (). So we can see that nchar() is a function. It can take a single strings as an argument, as we saw above. But it can also take a vector of multiple strings. Let’s define a vector and check it out.

# storing a vector with multiple strings
dna_fragments <- c("ATACCAT","GTTTGAGATC","CC")
dna_fragments
## [1] "ATACCAT"    "GTTTGAGATC" "CC"
#counting the number of characters in each string in the vector
nchar(dna_fragments)
## [1]  7 10  2
dna_fragments2 <- c("CCCCCCC",dna_fragments, 17)

Two things to unpack. First, we just used R’s most fundamental function, c(). This creates a vector out of the entries you put inside. Second, you just used the function nchar() on a whole vector, instead of a single string. Many functions in R can work with a single value, or a vector of values. These are called vectorized functions, and they’re convenient for doing a lot of work with a small bit of code.

What about the number of each of the 4 letters (our four different DNA base pairs)? As you’ll continually find out, R works better with vectors of individual characters (where each character in the string is its own entry in an array) than it does with single strings. Let’s reformat our original string, dna_fragment, using the function strsplit().

#breaking a string into individual characters
dna_list <- strsplit(dna_fragment,split = "")
dna_list # this looks almost like a vector, but it's actually a list
## [[1]]
##  [1] "A" "C" "T" "T" "C" "T" "T" "C" "A" "C" "G" "C" "T" "G" "C" "T" "A" "C" "G"
## [20] "T" "A" "C" "G" "A" "T" "A" "T" "T" "A" "A" "C" "G" "G" "C" "C" "G" "A" "T"
## [39] "C" "A" "C" "G" "A" "T" "C" "G" "A" "C" "G" "T" "A" "G" "C" "T" "A" "G" "C"
## [58] "T" "A" "G" "C" "T"
# we can recover the vector in two ways
unlist(dna_list)  # using this function that smooshes all list entries 
##  [1] "A" "C" "T" "T" "C" "T" "T" "C" "A" "C" "G" "C" "T" "G" "C" "T" "A" "C" "G"
## [20] "T" "A" "C" "G" "A" "T" "A" "T" "T" "A" "A" "C" "G" "G" "C" "C" "G" "A" "T"
## [39] "C" "A" "C" "G" "A" "T" "C" "G" "A" "C" "G" "T" "A" "G" "C" "T" "A" "G" "C"
## [58] "T" "A" "G" "C" "T"
# into a vector, OR

dna_list[[1]]  # it turns out the first entry in our list was a 
##  [1] "A" "C" "T" "T" "C" "T" "T" "C" "A" "C" "G" "C" "T" "G" "C" "T" "A" "C" "G"
## [20] "T" "A" "C" "G" "A" "T" "A" "T" "T" "A" "A" "C" "G" "G" "C" "C" "G" "A" "T"
## [39] "C" "A" "C" "G" "A" "T" "C" "G" "A" "C" "G" "T" "A" "G" "C" "T" "A" "G" "C"
## [58] "T" "A" "G" "C" "T"
# a vector of characters

# let's create a new object to store this vector
dna_vec <- unlist(dna_list)
dna_vec
##  [1] "A" "C" "T" "T" "C" "T" "T" "C" "A" "C" "G" "C" "T" "G" "C" "T" "A" "C" "G"
## [20] "T" "A" "C" "G" "A" "T" "A" "T" "T" "A" "A" "C" "G" "G" "C" "C" "G" "A" "T"
## [39] "C" "A" "C" "G" "A" "T" "C" "G" "A" "C" "G" "T" "A" "G" "C" "T" "A" "G" "C"
## [58] "T" "A" "G" "C" "T"

Now our data is a vector. What have we gained? Well, we can use some basic R functions like table() to make counts of the unique entries inside. We can easily take a subset of the vector, for example to just look at the last 6 entries.

table(dna_vec) # a table of the counts of each unique value in the vector
## dna_vec
##  A  C  G  T 
## 15 18 13 16
dna_vec[31] # the 31st entry in the vector
## [1] "C"
dna_vec[c(1,5,12)] # the 1st, 5th, and 12th entries
## [1] "A" "C" "C"
dna_vec[57:62]  #the final 6 entries in dna_vec
## [1] "C" "T" "A" "G" "C" "T"

Oh, the table() function is nice. It counts the unique entries in a vector (and can do more complex counting too). Then we took advantage of a nice feature of vectors; it’s easy to select smaller sets of entries inside. For vectors you use the square brackets, [], and can enter a single integer or a vector of multiple integers. The code 57:62 does exactly what you think. It makes a vector with the numbers 57,58, …,62.

Your turn to fiddle. Try these out on your own, and post questions in the chat if you’re getting stuck. We’ll put you into breakout rooms where you can ask questions and share screens with smaller groups of peers as needed.

##################################
###  Your code below           
##################################

# create a vector named dog, consisting of the values 5, 7, 9, and 126.3
# remember to wrap the values in the c() function


# use the function mean() to calculate the mean of the vector dog


# use the command ?mean to see R's documentation of the function.
# ? before a function name pulls a useful help file.
# mean() has some extra arguments that you can use. 
# what can you do by setting a trim argument?


# below is a string stored as the object bird
# use strsplit() and table() to count the number of occurrences of the letter r
bird <- "chirp_chirp_i_am_a_bird_chirp"


###
# extra things to play with, time permitting
# check out the help file for strsplit()


# we used the argument split="" to split every character apart.
# try using strsplit() with split="_" on the variable bird.


# what is your output?  (use the # symbol at the start of the line
# to write your response as a comment, otherwise
# R will get confused when you create your final document)



# Can you imagine any scenarios where this might be useful?



##################################
##################################

Pause here and wait for instructions on the next step.


1.0.5.1 A first step into bioinformatics

Okay, we now have almost enough tools to do some bioinformatics. Let’s compare two DNA sequences.

# a different dna sequence (conveniently already in vector form)
dna_vec_2 <- c("A", "G", "T", "T", "C", "T", "T", "C", "A", "C", "G", "C", 
"T", "G", "C", "T", "A", "G", "G", "T", "A", "C", "G", "A", "T", 
"A", "T", "T", "A", "A", "C", "G", "G", "C", "C", "G", "A", "T", 
"C", "A", "C", "G", "C", "T", "C", "G", "A", "C", "G", "T", "A", 
"G", "C", "T", "A", "T", "C", "T", "A", "G", "C", "T")

# checking which entries are equal 
matches <-  dna_vec == dna_vec_2
matches
##  [1]  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [13]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [25]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [37]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE
## [49]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE
## [61]  TRUE  TRUE
# adding up the matches, and dividing by the length (the number of characters)
sum(matches)/length(matches)
## [1] 0.9354839

Check in with you lab neighbors. What did we just do? What is that final value?

We created an interesting vector called matches from the command dna_vec == dna_vec_2. It looks a little weird: a vector of TRUE and FALSE. As mentioned in the online reading, comparisons in R using symbols like <, ==, and != are like asking a question. For vectors, R makes the comparison entry by entry. So dna_vec == dna_vec_2 asked: Is the first entry in dna_vec the same as the first entry in dna_vec_2? Is the second entry the same in both? And so on. We can see that they were mostly the same, but the occasional FALSE values reveal places where these two DNA sequences differed. The output from this kind of comparison is a logical vector, so-called because we call TRUE and FALSE logical values that come from making logical comparisons betweem values.

We also took advantage of another useful feature in R. TRUE and FALSE are equivalent to 1 and 0, so you can find the number of TRUE values by just taking the sum() of your logical vector. We went ahead and divided by the length (the number of entries) of the vector, to get the overall fraction of the DNA sequences that were identical. This fraction comes up repeatedly during genomic analysis. It is often multiplied by 100 to become a percent, and is called percent identity. You will start seeing this value next week when we try to figure out which species have sequences that best match our halophile genomes.

# Following the approach above, find the percent identity 
# for these two sequences
seq1 <- "CCTAACGTCAGCAGATCAAACGCCGATTC"
seq2 <- "CCGAAGGACTAGCTAGCTAGCGCGGATTG"

##################################
###  Your code below           
##################################



# After calculating it, what do you think? Are these sequences similar?
# We'll gradually get more and more context for this as the course goes on.

##################################
##################################

Next week we’ll try out a simple genome assembly in R. To get ready for that, you’ll be getting used to wrangling and assembling a few short reads for your homework. Below we’ll combine a few tools we’ve seen above to get things started.

# Above we compared two sequences that were already reasonably
# well aligned. But actual reads coming from DNA sequencing start
# at random places. So, as we saw using the text, we instead want 
# to look for overlaps.

seq3 <- c("A", "T", "G", "C", "A", "G", "A", "C", "T", "C", "A", "A", 
"C", "T", "G", "T", "G")
seq4 <- c("A", "A", "C", "T", "G", "T", "G", "A", "C", "G", "C", "G", 
"C", "G", "C", "G", "A", "A", "G")

# To find the best overlap, we would need to slide the sequences along
# each other and check the percent identity each time.  
# Let's not worry about that yet.

# Instead, let's use a logical comparison
seq3[11:17] == seq4[1:7]
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE
# the last 7 base pairs in seq3 are idential to the first 7 in seq4.

# That is vanishingly unlikely to have occurred by chance, so we strongly
# suspect that these two reads overlap in the real genome.

# So, how do we combine them?  The c() function, yet again.

seq_joined <- c(seq3,seq4[8:19])  # using all of seq3, and the latter
# part of seq4


# one final tool, paste()... and also paste0()
# check out the help file for paste(), using ?paste


# we'll need to set the argument collapse = "", which
# tells the function to smush everything back to a single string

paste(seq_joined, sep = "", collapse = "")
## [1] "ATGCAGACTCAACTGTGACGCGCGCGAAG"
# paste0 saves some writing -- it defines sep = "" by default

paste0(seq_joined, collapse = "")
## [1] "ATGCAGACTCAACTGTGACGCGCGCGAAG"
# don't worry about sep yet; it's only relevant when you're pasting 
# multiple vectors.

1.1 Homework for Week 2, due Wednesday, 4/8, at 10:59pm, 10 points

All work should be completed inside of this Rmd file, then the file should be knit and uploaded to Canvas as a pdf or html file.


1.1.1 Finish up labwork (2 points)

Make sure that any short responses or coding snippets above have been filled in with your work.


1.1.2 Online discussion (part of Week 1 participation points)

Respond to the Week 1 discussion post on Piazza.


1.1.3 Writing (4 points)

1.1.3.1 Genome assembly algorithm, and practicing with R Markdown

In 200-300 words, write out a detailed algorithm for how you would assemble a set of reads into a genome. Keep in mind the concepts we discussed regarding algorithms, including quality control, validation, and conditional statements.

In addition, in a few sentences (keeping the article in mind, https://biology.ucdavis.edu/news/language-biology-how-heck-do-scientists-assemble-genome), explain why it might be helpful to have the genome sequences of more than one individual of a particular species.

Try to pay attention to formatting. Check out this RMarkdown cheat sheet, https://rstudio.com/wp-content/uploads/2015/02/rmarkdown-cheatsheet.pdf, and use it to pick a few places in your response to add some style, for example creating a bulleted list:

  • check
  • out
  • my
  • list,

or adding italics to empasize something, or even adding a new header inside your response.

1.1.3.2 Your response here


1.1.4 Coding (4 points)

Below we define a vector that contains 3 short reads stored as vectors. Each read overlaps with at least one other, and the overlaps are always exactly 10 base pairs long and perfectly identical. Your task is to write R code to assemble these 3 reads into a single long contig, and then condense it down to a single string. Name that string genome. Finally, use an R function to count the numbers of A, T, G, and C nucleotides in your new string genome, and calculate the proportion of the nucleotides that are G or C. That is called the GC content of your sequence, and it’s a useful number to use when comparing sequences.

This main goal of this task is to get you fiddling around in R. Use comments (#) in the R code to make it clear what should be happening at each step. Even if something goes wrong or you just can’t make it run, clear comments that make your logic transparent are worth almost full credit. Don’t forget the value of the help function ? if a function isn’t working how you expect it to. If you’re feeling stuck, try reviewing the code above. Everything you need is something that we worked on during Lab 1. But also feel free to check with peers, post on Piazza, google, and email your instructors as needed.

lab1_hw_reads <- c("GTTATCTAAGACCCATCTTCTCACTGGTCACTCACTCCACTGGCATATTC",
                   "CGAATCTTTTTGTTGTTCGAGAGGCCTGTGACACCCCTGGGTTATCTAAG",
                   "TGGCATATTCGTCAAACAGTTCTGATGCCTGATACAACTGACAAATCTCA")

##################################
###  Your code below           
##################################

# Make sure you include lots of comments!
# Before each of line code where you're doing someting new,
# use # to add a comment to explain what you're doing.

# Remember to create the string named genome and also to 
# calculate GC content.




##################################
##################################

1.2 Code Appendix

When you knit, this automatically includes all of the inline code you wrote in the coding sections while you were exploring in this week’s lab. You don’t need to do anything here.

##################################
###  Your code below           
##################################

# create a vector named dog, consisting of the values 5, 7, 9, and 126.3
# remember to wrap the values in the c() function


# use the function mean() to calculate the mean of the vector dog


# use the command ?mean to see R's documentation of the function.
# ? before a function name pulls a useful help file.
# mean() has some extra arguments that you can use. 
# what can you do by setting a trim argument?


# below is a string stored as the object bird
# use strsplit() and table() to count the number of occurrences of the letter r
bird <- "chirp_chirp_i_am_a_bird_chirp"


###
# extra things to play with, time permitting
# check out the help file for strsplit()


# we used the argument split="" to split every character apart.
# try using strsplit() with split="_" on the variable bird.


# what is your output?  (use the # symbol at the start of the line
# to write your response as a comment, otherwise
# R will get confused when you create your final document)



# Can you imagine any scenarios where this might be useful?



##################################
##################################
# Following the approach above, find the percent identity 
# for these two sequences
seq1 <- "CCTAACGTCAGCAGATCAAACGCCGATTC"
seq2 <- "CCGAAGGACTAGCTAGCTAGCGCGGATTG"

##################################
###  Your code below           
##################################



# After calculating it, what do you think? Are these sequences similar?
# We'll gradually get more and more context for this as the course goes on.

##################################
##################################
lab1_hw_reads <- c("GTTATCTAAGACCCATCTTCTCACTGGTCACTCACTCCACTGGCATATTC",
                   "CGAATCTTTTTGTTGTTCGAGAGGCCTGTGACACCCCTGGGTTATCTAAG",
                   "TGGCATATTCGTCAAACAGTTCTGATGCCTGATACAACTGACAAATCTCA")

##################################
###  Your code below           
##################################

# Make sure you include lots of comments!
# Before each of line code where you're doing someting new,
# use # to add a comment to explain what you're doing.

# Remember to create the string named genome and also to 
# calculate GC content.




##################################
##################################

# Use this space to write the code for assembling the mystery sequences
# Assemble mystery2, 3, 4, 5, 6, and 7, and store them as
# a2, a3, a4, a5, a6, a7

##################################
###  Your code below           
##################################

# example for mystery2
a2 <- assemble(mystery2, min_overlap = 5)



##################################
##################################

# Here you have a pre-written function to calculate GC content.
# But, the lines of code ARE IN THE WRONG ORDER!
# Read it over, move the code into the right order,
# and add comments to explain each step.

##################################
###  Your code below           
##################################

count_gc <- function(assembly)
{
    GC <- C_total + G_total
    
    C_total <- sum(assembly_vec == "C")
    
    assembly_vec <- unlist(strsplit(assembly, split = ""))
    
    G_total <- sum(assembly_vec == "G")
    
    return(GC)
}


# When that's complete, use the code from count_gc, 
# but add a few lines so that you return the fraction
# of all nucleotides that are GC, not the total count
# I.e. it should be a number between 0 and 1.

fraction_gc <- function(assembly)
{
    
}

# VALIDATION
# your answer for fraction_gc(a1) should be 0.47 (rounded)
# your answer for fraction_gc(a2) should be 0.59 (rounded)


##################################
##################################
##################################
###  Your code below           
##################################
# use this space to load your FASTA file and get the data requested above.

# template for loading your data as a DNAStringSet
# you'll need to edit the filename to match the FASTA file for
# your strain
my_strain <- readDNAStringSet("data/ENTER_YOUR_STRAIN.fasta")

# for the example (strain 1926), we used the command
# strain_1926 <- readDNAStringSet("data/strain1926.fasta")
# we commented the line above so it won't automatically run






##################################
##################################
# storing our contig lengths as a vector
c_lengths <- width(strain_1926)

total <- sum(c_lengths)

# creating a cumulative sum
c_sum <- cumsum(c_lengths)
plot(c_sum) + abline(a = total/2, b = 0)

# here we have a function to calculate L50, but 
# the lines are out of order!  Let's fix it.

L_50 <- function(contig_lengths)
{
  contig_ind <- which(contig_sum > sum(contig_lengths)*0.5)
  return(L)
  L <- min(contig_ind)
  contig_sum <- cumsum(contig_lengths)
}


##################################
###       Your code below      
##################################
N_50 <- function(contig_lengths)
{
  # enter your code here
  # hint: you can use other functions inside this
  # function. For example, L_50() might be useful
}


# also, load your own strain into the environment right now as well
# uncomment the next line and replace with the correct numbers for your strain

# strain_XXXX <- readDNAStringSet("data/strainXXXX.fasta")

##################################
##################################

##################################
###  Your code below           
##################################

gc_content <- function(string_set)
{
  # add your code for the function here.
  # the R example just above this should be useful.
  
  # your input, which we named "string_set", will be a DNAStringSet
}


## below are some extra challenges if you have spare time
## you do not need to complete this as part of the HW/labwork.

# 1) Although it's rare, there will sometimes be a non-zero value
# in the "other" column from alphabetFrequency(). That means that
# adding up the G and C columns isn't quite right. Revisit
# your gc_content() function and adjust it to only calculate GC 
# based on the known A, T, C, and G nucleotides and ignore "others".

# 2) In the Piazza discussion for Week 2, it has come up that
# it's harder to amplify high GC content DNA segments. So one part 
# of Illumina sequencing, the bridge amplification step, might not work as
# well for species or genomic regions with high GC. What are some potential
# consequences of this? For a sample with multiple species, how might 
# this influence the number of contigs you get for each species?


##################################
##################################
#### Make the same plot for your strain
# You will first need to create a data frame by creating a
# vector for GC and a vector for contig length

##################################
###  Your code below           
##################################

# create a new vector that holds the contig lengths for your strain

# create a new vector with the GC of each contig

# Optional: use data.frame() to create a data frame with a column for each of those
# two vectors (make sure you store this as a new object in your environment)


# create your plot, either from the data frame columns, or
# directly from the vectors
# Make sure you have a title with your strain number


##################################
##################################

##############################
###    Your code below     
##############################

# enter the code for your plotting (and loading your data) here


##############################
##############################

##################################
###  Your code below           ###
##################################

# These tasks are all working with strain 1926

# 1) create a new DNAStringSet called strain_1926_f5000
# that has only contigs of length >= 5000


# 2) create a new data frame called strain_1926_gc_narrow
# that has only contigs of length >= 2000 and gc content < 0.5



##################################
##################################
# Our comments below are reminders of the key steps

##########################
### Your work below    ###
##########################

# Load your data and create a data frame with contig length and gc content
# e.g. strain_XXXX <- readDNAStringSet("data/strainXXXX.fasta")
# df_XXXX <- data.frame(contig_lengths = width(strain_XXXX),gc = gc_content(strain_XXXX))



# Plot the data and look for quality issues
# e.g. ggplot(data = df_xxxx, aes(x = contig_lengths, y = gc)) +
#  geom_point() + 
#  scale_x_log10()



# Use logical comparisons like >,<,>=, etc. combined with &
# to get a TRUE/FALSE vector of the entries you want to keep



# Use that vector to select only certain entries from your 
# DNAStringSet



# Save your new DNAStringSet as a fasta file with the following naming 
# convention: strainXXXX_clean.fasta (replace XXXX with your strain number)



##########################
##########################
# Your task, build the same tree using the infB gene.
# then build a tree combining data from both genes.

##################################
###  Your code below           ###
##################################

# building a tree from infB
# first create a distance matrix. You should
# only need to make a small change to the code above.


# then, use that distance matrix with hclust()


# finally, plot the results



##################################
##################################

##################################
###    Your work below         ###
##################################





##################################
##################################
# we are back to our old friend the DNAStringSet

# let's focus just on the second contig (for arbitrary reasons)
# we can select it with the double brackets
s26_2 <- s26[[2]]

# now s26_1 is in our environment as a single DNAString
# let's use our gc_content_single() function to get the overall GC content
gc26_2 <- gc_content_single(s26_2)

# it's 0.39

# as a quick example, let's get the GC content for just the first 500 bp
# all we have to do is use square brackets to select the base pairs from 
# position 1 up to 500.
gc_content_single(s26_2[1:500])

## live-coding will be written in this chunk