Section 2 Week 1: Genomics fundamentals

This week we take an introductory overview of genomics. While watching the documentary film “Genome: book of life”, you will learn how the genome represents a record of our evolutionary past. The film explores how the genome contains relics of our ancient origins in Archaea and Bacteria, and how a quadrupling of the genome enabled the evolution of the vertebrates. It also examines how the genome has been shaped by ‘successful mistakes’ and the evidence for “fossil genes”.

2.1 Human genes of archaeal and bacterial origin

2.1.1 Elongation factor 1-alpha 1 (eEF1a1)

Recall the discussion of eEF1a1 in the video. Can you remember what its relevance to genomic evolution is?

The elongation factor 1 protein (EF-1 or eEF1a1) is a member of the translational GTPases (trGTPases). This is an ancient superfamily of proteins that has existed since before the last common ancestor of all life. In other words, these proteins existed in the early life-forms that are the ancestors of all life today, before organisms split into the eukaryotes, bacteria and archaea. That is quite a bold claim; what is the evidence to support it? We are going to take a look at the phylogenetic evidence for the ancient origin of this protein superfamily. If you wish to know more about the evolution of these proteins, see the review by Atkinson (2015).

This superfamily of proteins also includes elongation factor EF-G and initiation factor 2 (IF-2), among other related proteins. Here are the amino-acid sequences of some examples of EF-1 (labelled EF-1A), EF-G and IF-2 from archaea, humans and bacteria:

>Archaea_EF-G 
MGARVKVVSEIEKIMRNIDQIRNIGIIAHVDHGKTTTSDSLLAAAGIISERIAGEALVLDYLNVEKQRGITVKSANVSLY
HEYEGKPYVINLIDTPGHVDFSGKVTRSLRVLDGAIVVVDAVEGVMTQTETVIRQALEERVRPILFINKVDRLIKELKLP
PEKIQQRFVEIIKEVNNLIDLYAEPEFRKKWKLDPNAGMVAFGSAKDKWGISVPQVKKKGITFREIIQAYEKGKEAVAEL
SKKMPLHETLLDMVIKFVPNPREAQRYRIPKIWKGDINSEIGQAMLNADPDGPLVFFINDVRIEKAGLVATGRVFSGTLR
SGEEVYLLNAGKKSRLLQVSIYMGPFREVTKEIPAGNIGAVMGFEDVRAGETVVSLGYEENAAPFESLRYVSEPVVTIAV
EPVKIQDLPKMIEALRKLTIEDPNLVVKINEETGEYLLSGMGPLHLEIALTMLREKFGVEVKASPPIVVYRETVRQQSRV
FEGKSPNKHNKLYISVEPLNEETITLIQNGAVTEDQDPKDRARILADKAGWDYNEARKIWAIDENINVFVDKTAGVQYLR
EVKDTIIAGFRLALKEGPLAAEPVRGVKVVLHDAVIHEDPVHRGPGQLYPAVRNAIWAGILDGRPTLLEPLQKLDIRAPM
EYLSNITAVLTRKRGRIINVETTGVMARIIAAIPVAESFDLAGELRSATAGRAFWGVEFYGWAPVPDQMLQDLIAKIRQR
KGLPPSPPKIDDLIGP
>Human_EF-G 
MVNFTVDQIRAIMDKKANIRNMSVIAHVDHGKSTLTDSLVCKAGIIASARAGETRFTDTRKDEQERCITIKSTAISLFYE
LSENDLNFIKQSKDGAGFLINLIDSPGHVDFSSEVTAALRVTDGALVVVDCVSGVCVQTETVLRQAIAERIKPVLMMNKM
DRALLELQLEPEELYQTFQRIVENVNVIISTYGEGESGPMGNIMIDPVLGTVGFGSGLHGWAFTLKQFAEMYVAKFAAKG
EGQLGPAERAKKVEDMMKKLWGDRYFDPANGKFSKSATSPEGKKLPRTFCQLILDPIFKVFDAIMNFKKEETAKLIEKLD
IKLDSEDKDKEGKPLLKAVMRRWLPAGDALLQMITIHLPSPVTAQKYRCELLYEGPPDDEAAMGIKSCDPKGPLMMYISK
MVPTSDKGRFYAFGRVFSGLVSTGLKVRIMGPNYTPGKKEDLYLKPIQRTILMMGRYVEPIEDVPCGNIVGLVGVDQFLV
KTGTITTFEHAHNMRVMKFSVSPVVRVAVEAKNPADLPKLVEGLKRLAKSDPMVQCIIEESGEHIIAGAGELHLEICLKD
LEEDHACIPIKKSDPVVSYRETVSEESNVLCLSKSPNKHNRLYMKARPFPDGLAEDIDKGEVSARQELKQRARYLAEKYE
WDVAEARKIWCFGPDGTGPNILTDITKGVQYLNEIKDSVVAGFQWATKEGALCEENMRGVRFDVHDVTLHADAIHRGGGQ
IIPTARRCLYASVLTAQPRLMEPIYLVEIQCPEQVVGGIYGVLNRKRGHVFEESQVAGTPMFVVKAYLPVNESFGFTADL
RSNTGGQAFPQCVFDHWQILPGDPFDNSSRPSQVVAETRKRKGLKEGIPALDNFLDKL
>Bacteria_EF-G 
MVKFTAEELRRIMDFKHNIRNMSVIAHVDHGKSTLTDSLVAAAGIIAQEVAGDVRMTDTRADEAERGITIKSTGISLYYE
MTDESLRNFKGERNGNEYLINLIDSPGHVDFSSEVTAALRITDGALVVVDCVEGVCVQTETVLRQALGERIRPVLTVNKM
DRCFLELQVDGEEAYQTFQRVIENANVIMATYEDPLLGDVQVYPEKGTVAFSAGLHGWAFTLTNFAKMYASKFGVDESKM
MERLWGENFFDPATKKWTTKNSGTASCKRGFVQFCYEPIKQIINTCMNDQKDKLWPMLQKLGVTMKSDEKDLMGKALMKR
VMQTWLPASTALLEMMIYHLPSPAIAQRYRVENLYEGPLDDAYANAIRNCDPEGPLMLYVSKMIPASDKGRFFAFGRVFA
GKVCTGMKVRIMGPNYVPGEKKDLYVKNIQRTVIWMGKRQETVEDVPCGNTVAMVGLDQYITKNATLTNEKEVDAHPIRA
MKFSVSPVVRVAVQCKVASDLPKLVEGLKRLAKSDPMVVCSIEESGEHIIAGAGELHLEICLKDLQDDFMGGAEIIKSDP
VVSFRETVLEKSTRTVMSKSPNKHNRLYMEARPMEEGLAEAIDDGRIGPRDDPKVRSKILAEEFGWDKDLAKKIWCFGPE
TTGPNMVVDMCKGVQYLNEIKDSVVAGFQWASKEGALAEENMRGICFEVCDVVLHSDAIHRGGGQVIPTARRVIYASQLT
AKPRLLEPVYLVEIQAPEQALGGIYSVLNQKRGHVFEEMQRPGTPLYNIKAYLPVVESFGFSGTLRAATSGQAFPQCVFD
HWEMMSSDPLEVGSQANQLVLDIRKRKGLKEQMTPLSEFEDKL
>Archaea_EF-1A 
MAKEKPILNVAFIGHVDAGKSTTVGRLLLDGGAIDPQLIVRLRKEAEEKGKAGFEFAYVMDGLKEERERGVTIDVAHKKF
PTAKYEVTIVDCPGHRDFIKNMITGASQADAAVLVVNVDDAKSGIQPQTREHVFLSRTLGITQLAVAINKMDTVNFSEAD
YNEMKKMLGDQLLKMVGFNPDNINFIPVASLLGDNVFKKSDKTPWYNGPTLAEVIDGFQPPEKPTTLPLRLPIQDVYSIT
GVGTVPVGRVETGIIKPGDKVVFEPAGAVGEIKTVEMHHEQLPSAEPGDNIGFNVRGVGKKDIKRGDVLGHTTNPPTVAA
DFTAQIVVLQHPSVMTVGYTPVFHAHTAQIACTFMELQKKLNPATGEVLEENPDFLKAGDAAIVKLMPTKPLVMESVKEI
PQLGRFAIRDMGMTVAAGMAIQVTAKNK
>Human_EF-1A 
MGKEKTHINIVVIGHVDSGKSTTTGHLIYKCGGIDKRTIEKFEKEAAEMGKGSFKYAWVLDKLKAERERGITIDISLWKF
ETSKYYVTIIDAPGHRDFIKNMITGTSQADCAVLIVAAGVGEFEAGISKNGQTREHALLAYTLGVKQLIVGVNKMDSTEP
PYSQKRYEEIVKEVSTYIKKIGYNPDTVAFVPISGWNGDNMLEPSANMPWFKGWKVTRKDGNASGTTLLEALDCILPPTR
PTDKPLRLPLQDVYKIGGIGTVPVGRVETGVLKPGMVVTFAPVNVTTEVKSVEMHHEALSEALPGDNVGFNVKNVSVKDV
RRGNVAGDSKNDPPMEAAGFTAQVIILNHPGQISAGYAPVLDCHTAHIACKFAELKEKIDRRSGKKLEDGPKFLKSGDAA
IVDMVPGKPMCVESFSDYPPLGRFAVRDMRQTVAVGVIKAVDKKAAGAGKVTKSAQKAQKAK
>Bacteria_EF-1A 
MGKEKTHINIVVIGHVDSGKSTTTGHLIYKCGGIDKRTIEKFEKEAQEMGKGSFKYAWVLDKLKAERERGITIDIALWKF
ETSKYYVTIIDAPGHRDFIKNMITGTSQADCAVLVVACGTGEFEAGISKNGQTREHALLAQTLGVKQLIVACNKMDSTEP
AFSEARFNEVTNEVSNYIKKIGYNPKAVAFVPISGFNGDNMLEPSPNMPWFKGWNVERKEGNASGKTLLEALDAVIPPSR
PTDKPLRLPLQDVYKIGGIGTVPVGRVETGILKPGMVVTFAPQNITTEVKSVEMHHEALQEALPGDNVGFNVKNVSIKEI
RRGSVASDSKNDPAKETKQFTAQVIIMNHPGQIAAGYTPVLDCHTAHIACKFAELKEKVDRRSGKKVEDNPKFLKSGDAG
IIDLIPTKPLCVETFTEYPPLGRFAVRDMRQTVAVGVIKGVEKTEGGAGKVTKAAQKAGTAPAGKKK
>Bacteria_IF-2 
MATINQVIDQETAQLVAEEMGHKVILRRENELEEAVMSDRDTGAAAEPRAPVVTIMGHVDHGKTSLLDYIRSTKVASGEA
GGITQHIGAYHVETENGMITFLDTPGHAAFTSMRARGAQATDIVVLVVAADDGVMPQTIEAIQHAKAAQVPVVVAVNKID
KPEADPDRVKNELSQYGILPEEWGGESQFVHVSAKAGTGIDELLDAILLQAEVLELKAVRKGMASGAVIESFLDKGRGPV
ATVLVREGTLHKGDIVLCGFEYGRVRAMRNELGREVLEAGPSIPVEILGLSGVPAAGDEVTVVRDEKKAREVALYRQGKF
REVKLARQQKSKLENMFANMTEGEVHEVNIVLKADVQGSVEAISDSLLKLSTDEVKVKIIGSGVGGITETDATLAAASNA
ILVGFNVRADASARKVIEAESLDLRYYSVIYNLIDEVKAAMSGMLSPELKQQIIGLAEVRDVFKSPKFGAIAGCMVTEGV
VKRHNPIRVLRDNVVIYEGELESLRRFKDDVNEVRNGMECGIGVKNYNDVRTGDVIEVFEIIEIQRTIA
>Archaea_IF-2 
MAGDKGGGDGERRLRQPIVVVLGHVDHGKTTLLDKIRRTAVAAKEAGGITQHIGASIVPADVIEKIAEPLKKVIPVKLVI
PGLLFIDTPGHELFSNLRRRGGSVADFAILVVDIMEGFKPQTYEALELLKERRVPFLIAANKIDRIPGWKPNPDAPFIET
IRRQDPKVREILEQRVYEIVGKMYEAGLPAELFTRIKDFRRKIAIVPVSARTGEGIPELLAVLAGLTQTYLKERLRYAEG
PAKGVVLEVKEMQGFGTVVDAVIYDGVLKKEDIIVVGGREGPIVTRVRALLMPAPLQDIRSREARFVQVDRVYAAAGVRI
AAPGLDDVIAGSPIYAAESEEEARKLMEAVQREIEELRFRTENIGVVVKADTLGTLEALVEALRRRGVPVRLADIGPVSR
SDVLDAAVTRKIDPYLGVVLAFNVKVLPEAEEEASRAGVKIFRESMIYKLIEDYEEWVKKEKEAERLKALNSLIRPGKFR
ILPGYVFRRSDPAIVGVEVLGGVIRPGYPVMDSQGRELGRIMAIKDRDRSLEEARLGAAVAVSIQGRILIGRHANEGDIL
YTNVPAQHAYKILTEFKDLVSKDELDVLREIAEIKRRAADHEYNKVLLRLKIKRVSQ

Let’s see how they are related to each other by aligning the sequences and then building a phylogenetic tree.

Navigate your web browser to https://www.ebi.ac.uk/Tools/msa/clustalo/. In the sequence box, paste the sequences given above and press the “Submit” button. After a few minutes, this will generate a multiple sequence alignment of your sequences. In the sequence alignment, note how similar the human, bacterial and archaeal EF-1 sequences are to each other, even though they are separated by a huge evolutionary distance.

Now click on the “Phylogenetic tree” tab, and you will see an estimate of the evolutionary relationships between these proteins. Notice that human EF-1 is closer to the EF-1 proteins from archaea and bacteria than it is to the other human protein in the tree. This suggests that humans (eukaryotes), bacteria and archaea diverged from each other more recently than EF-1 diverged from EF-G (called EF-2 in eukaryotes and archaea) and from IF-2. This is indeed a very ancient superfamily of proteins and EF-1 is a very ancient protein, yet its amino-acid sequence has remained relatively unchanged over hundreds of millions, or even billions, of years.
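The conservation seen in the alignment can also be glimpsed without any alignment software. The following is a minimal sketch that searches the first 40 residues of four of the sequences above for a short shared pattern. The pattern `HVD.GK[ST]` and the helper name `find_motif` are assumptions chosen by inspecting the sequences by eye; they are not part of the Clustal Omega workflow and are no substitute for it.

```python
import re

# First 40 residues of four of the sequences listed above (copied verbatim).
N_TERMINI = {
    "Archaea_EF-G": "MGARVKVVSEIEKIMRNIDQIRNIGIIAHVDHGKTTTSDS",
    "Human_EF-1A": "MGKEKTHINIVVIGHVDSGKSTTTGHLIYKCGGIDKRTIE",
    "Archaea_EF-1A": "MAKEKPILNVAFIGHVDAGKSTTVGRLLLDGGAIDPQLIV",
    "Archaea_IF-2": "MAGDKGGGDGERRLRQPIVVVLGHVDHGKTTLLDKIRRTA",
}

# Illustrative pattern (an assumption): H-V-D, any residue, G-K, then S or T.
PATTERN = "HVD.GK[ST]"


def find_motif(sequence, pattern=PATTERN):
    """Return (1-based start, matched text) for the first match, or None."""
    match = re.search(pattern, sequence)
    if match is None:
        return None
    return match.start() + 1, match.group(0)


for name, seq in N_TERMINI.items():
    print(name, find_motif(seq))
```

If the same short pattern turns up in proteins from all three domains of life, that is consistent with (though far weaker evidence than) the similarity shown by the full multiple sequence alignment.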

2.2 Whole-genome duplications and the evolution of the vertebrates

In the documentary film, you learned that whole-genome duplication was important for the evolution of the vertebrates. The existence of redundant copies of genes allowed the evolution of new complexity in structure and behaviour. The film mentioned the work of McLysaght, Hokamp, and Wolfe (2002), who showed that this gene-duplication activity occurred in a burst around 350 to 650 million years ago. There were two rounds of duplication in the lineage leading to the vertebrates and a third round in the lineage leading to the teleost fish; for more details, see Singh and Isambert (2020). But not all gene duplications arise from events as dramatic as duplication of the whole genome. Some arise by tandem duplication.

Whatever its origin, a newly duplicated gene may face different fates:

  • One of the copies could lose its function and become a pseudogene (or “fossil gene”). This means that the gene has become degraded by accumulating mutations that inactivate its function. (Note that there are other mechanisms of forming pseudogenes, but they don’t concern us at the moment.)

  • Both copies can maintain the same function.

  • The two copies can accumulate different mutations, so that each duplicated gene takes on a different subset of the roles previously performed by the original gene, a process known as subfunctionalization.

  • One of the copies can evolve a novel function, a process known as neofunctionalization. The other copy can still retain the original function. As pointed out in A. Lesk (2017), it is generally easier for evolution to ‘recruit’ and adapt an existing gene to a new function than to invent a new one from scratch.

Activity: Can you find an example of a gene that has arisen from neofunctionalization and an example of a gene that has undergone subfunctionalization?

2.3 Colour vision

In the documentary video, you learned that there are three types of photoreceptor proteins in the cones of our eyes, which respond to red, green and blue light respectively. The genes encoding these proteins are OPN1LW (red), OPN1MW (green) and OPN1SW (blue). The corresponding proteins have the same names: OPN1LW, OPN1MW and OPN1SW. How are these three proteins related to each other? And where do they reside in our genome?

2.3.1 Genomic locations of the opsin genes

To locate the opsin genes within the human genome, we are going to use the Ensembl web portal. Ensembl is a large and complex database of genomic data and presents these data in a powerful web-based genome browser.

Activity: Navigate in your web browser to the Ensembl web page here: https://www.ensembl.org/Homo_sapiens/Info/Index.

In the top-left of the page you will find a search box. Use this to search for genes matching ‘OPN1LW’. This should take you to a description of the human OPN1LW gene, whose Ensembl accession number is ‘ENSG00000102076’. The accession number is a unique identifier (or ‘catalogue number’) that unambiguously identifies a specific gene. We will see many more accession numbers for various objects in various databases as we proceed through our journey into genomics data over the coming weeks.

  • Can you obtain the genomic location of this OPN1LW gene?
  • On which chromosome is it located?
  • What are the start and end positions of the gene on that chromosome?
  • In which direction is the gene transcribed?

Hint: the answer is on the Ensembl web page that you are looking at. If you can’t find it, click here, which will take you to the correct location on the genome.

Now let’s take a look at the OPN1MW and OPN1SW genes.

  • Can you obtain the genomic locations of the OPN1MW and OPN1SW genes?
  • On which chromosomes are they located?
  • What are the start and end positions?

What do you notice about the locations of these three genes relative to each other?

2.3.2 Phylogenetic relationships between opsin proteins

Now that you have ascertained the genomic locations of the opsin genes, let’s examine their protein sequences and assess how closely related they are to each other.

From the previous section, you now know how to search for specific genes in the Ensembl database. You should be able to figure out how to search for proteins too.

Activity:

  • Obtain the amino-acid sequences of the OPN1LW, OPN1MW and OPN1SW proteins in FASTA format.
  • Copy and paste these into the multiple alignment tool that you used previously for the EF-1 analysis: https://www.ebi.ac.uk/Tools/msa/clustalo/.
  • Create a multiple sequence alignment and phylogenetic tree by pressing the appropriate buttons.
  • Which two opsins are most closely related to each other?
  • How does this relate to their genomic positions?
  • What is the explanation for this pattern of phylogeny and genomic position?
      • What is the role of whole-genome duplication?
      • What is the role of tandem duplication?
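To put a number on “most closely related”, one simple measure is the percentage of identical residues between each pair of aligned sequences. The sketch below assumes you have copied aligned sequences (with `-` for gaps) out of the alignment output; the three short sequences in `toy_alignment` are made-up placeholders, not real opsin data, and `percent_identity` is a hypothetical helper name.

```python
from itertools import combinations


def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percentage of identical residues over columns where neither row has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for res_a, res_b in zip(aligned_a, aligned_b):
        if res_a == gap or res_b == gap:
            continue  # skip columns containing a gap
        compared += 1
        if res_a == res_b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


# Made-up aligned sequences, for illustration only.
toy_alignment = {
    "seqA": "MKTALLVG",
    "seqB": "MKTALIVG",
    "seqC": "MRS-LIAG",
}

for (name_a, seq_a), (name_b, seq_b) in combinations(toy_alignment.items(), 2):
    print(f"{name_a} vs {name_b}: {percent_identity(seq_a, seq_b):.1f}% identical")
```

Percent identity is a crude measure compared with the tree-building method used by the web tool, but the pair with the highest identity is usually the pair that sits closest together in the tree.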

2.3.3 Ohnologues (or ohnologs)

Ohnologues are pairs of genes that are related to each other through duplication of a common ancestral gene during a whole-genome duplication. They are named after the famous geneticist Susumu Ohno.

Activity: Find out whether OPN1LW, OPN1MW and OPN1SW are ohnologues of each other.

To find the answer, let’s use the OHNOLOGS v2 database, here: http://ohnologs.curie.fr/. Enter the name of the gene into the Search function and try to interpret the output. Note that “2R WGD” refers to the two rounds of whole-genome duplication in the lineage leading to the vertebrates. This distinguishes it from the three rounds in the lineage leading to the teleost fish. Also note that the database software uses two different levels of strictness (strict versus relaxed) when deciding whether genes are ohnologous, and gives a confidence score.

2.3.4 Further reading

If you want to take your exploration of the evolution of opsins to the next level, you could try expanding your phylogenetic analysis to include the mouse green and ultra-violet sensing opsins. You can read more about the evolution of opsins in a review paper by Dulai et al. (1999).

2.4 Genomic variation

These activities are based on the “Exercises and Problems” listed at the end of Chapter 1 in A. Lesk (2017).

2.4.1 Gene density

Calculate the approximate density of protein-coding genes in the human genome. Base your estimate on the following statistics:

  • Total genome size: ~ 3 × 10^9 base pairs (bp).
  • Total number of genes: ~ 3 × 10^4 genes.
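Once you have attempted the calculation by hand, you could check your arithmetic with a few lines of code. This is a minimal sketch that uses only the two approximate figures given above; the variable names are arbitrary.

```python
# Approximate figures from the list above.
GENOME_SIZE_BP = 3 * 10**9   # total genome size in base pairs
GENE_COUNT = 3 * 10**4       # total number of genes

# Average spacing: how many base pairs of genome there are per gene.
bp_per_gene = GENOME_SIZE_BP // GENE_COUNT

# Density expressed per megabase (1 Mb = 10^6 bp).
genes_per_megabase = GENE_COUNT / (GENOME_SIZE_BP / 10**6)

print(f"About one gene every {bp_per_gene:,} bp")
print(f"About {genes_per_megabase:.0f} genes per Mb")
```

Bear in mind that this is an average over the whole genome; as the chromosome-band exercise later in this section shows, genes are not evenly spread.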

2.4.2 Nucleotide diversity

  1. Estimate the number of differences in the total genome sequence between two randomly chosen humans.
  2. Estimate roughly the number of differences in the total genome sequence between a human and a chimpanzee.

Base your estimates on the following statistics:

  • A randomly chosen pair of humans will show an average nucleotide diversity of between 1 bp in 1000 and 1 bp in 1500.
  • A human and a chimpanzee differ at approximately 1 bp in 100.
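The estimate is simply the genome size divided by the average spacing between differences. The sketch below assumes a genome size of about 3 × 10^9 bp (the figure from the gene-density exercise) and compares a single copy of each genome; `expected_differences` is a hypothetical helper name.

```python
GENOME_SIZE_BP = 3 * 10**9  # approximate genome size (assumed, from section 2.4.1)


def expected_differences(genome_size_bp, bp_per_difference):
    """Number of differing sites if one site in every `bp_per_difference` differs."""
    return genome_size_bp // bp_per_difference


# Two randomly chosen humans: between 1 bp in 1500 and 1 bp in 1000.
human_low = expected_differences(GENOME_SIZE_BP, 1500)
human_high = expected_differences(GENOME_SIZE_BP, 1000)

# Human versus chimpanzee: approximately 1 bp in 100.
human_chimp = expected_differences(GENOME_SIZE_BP, 100)

print(f"Human vs human: {human_low:,} to {human_high:,} differences")
print(f"Human vs chimpanzee: about {human_chimp:,} differences")
```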

2.5 Genome browsing

Earlier in these tasks, you encountered the Ensembl genome browser. We are going to visit this resource again to tackle these questions, which are based on exercises at the end of Chapter 1 in A. Lesk (2017).

2.5.1 Chromosome bands

Which of the following bands on human chromosome 16 are gene-rich?

  • p13.3?
  • q22.1?
  • q11.2?

To find the answer, use Ensembl’s karyotype plotter here: https://www.ensembl.org/Homo_sapiens/Location/Chromosome?r=16%3A1-1000.

Do you notice any other interesting patterns?
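The Ensembl links in this section encode the genomic region being viewed in an `r` parameter of the URL, in the form chromosome:start-end. The following sketch pulls that region out of a link so that you can see exactly which stretch of the chromosome it points to. The function name `region_from_url` is hypothetical, and the parsing is an assumption based only on the two URLs that appear in this section, not on any official Ensembl specification.

```python
import re
from urllib.parse import unquote, urlsplit


def region_from_url(url):
    """Return (chromosome, start, end) from the 'r' parameter, or None if absent."""
    query = urlsplit(url).query
    # The links in this section separate parameters with '&' or ';'.
    for part in re.split(r"[;&]", query):
        key, _, value = part.partition("=")
        if key == "r":
            # unquote turns '%3A' back into ':'.
            chromosome, _, span = unquote(value).partition(":")
            start, _, end = span.partition("-")
            return chromosome, int(start), int(end)
    return None


url = ("https://nov2020.archive.ensembl.org/Homo_sapiens/Location/View"
       "?db=core;r=16:152045-194547;g=ENSG00000188536")
chromosome, start, end = region_from_url(url)
print(f"Chromosome {chromosome}, positions {start:,} to {end:,}")
print(f"Region length: {end - start + 1:,} bp")
```

Editing the start and end positions in the URL by hand is a quick way to move or zoom the browser view to a region of your choice.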

2.5.2 Understanding the genome browser image

Figure 1.31 from Lesk’s textbook (A. Lesk 2017) corresponds approximately to this in the Ensembl database: https://nov2020.archive.ensembl.org/Homo_sapiens/Location/View?db=core;r=16:152045-194547;g=ENSG00000188536.

On that image, can you:

2.5.3 Understanding a “flat file” description of a gene

Here is the sequence of the human alpha-globin gene as shown in Figure 1.34 of Lesk’s textbook (A. Lesk 2017):

        1 tgcccccgcg ccccaagcat aaaccctggc gcgctcgcgg cccggcactc ttctggtccc
       61 cacagactca gagagaaccc accatggtgc tgtctcctgc cgacaagacc aacgtcaagg
      121 ccgcctgggg taaggtcggc gcgcacgctg gcgagtatgg tgcggaggcc ctggagaggt
      181 gaggctccct cccctgctcc gacccgggct cctcgcccgc ccggacccac aggccaccct
      241 caaccgtcct ggccccggac ccaaacccca cccctcactc tgcttctccc cgcaggatgt
      301 tcctgtcctt ccccaccacc aagacctact tcccgcactt cgacctgagc cacggctctg
      361 cccaggttaa gggccacggc aagaaggtgg ccgacgcgct gaccaacgcc gtggcgcacg
      421 tggacgacat gcccaacgcg ctgtccgccc tgagcgacct gcacgcgcac aagcttcggg
      481 tggacccggt caacttcaag gtgagcggcg ggccgggagc gatctgggtc gaggggcgag
      541 atggcgcctt cctcgcaggg cagaggatca cgcgggttgc gggaggtgta gcgcaggcgg
      601 cggctgcgga cctgggccct cggccccact gaccctcttc tctgcacagc tcctaagcca
      661 ctgcctgctg gtgaccctgg ccgcccacct ccccgccgag ttcacccctg cggtgcacgc
      721 ctccctggac aagttcctgg cttctgtgag caccgtgctg acctccaaat accgttaagc
      781 tggagcctcg gtggccatgc ttcttgcccc ttgggcctcc ccccagcccc tcctcccctt
      841 cctgcacccg tacccccgtg gtctttgaat aaagtctgag tgggcggcag cctgtgtgtg

Using either the textbook or using https://www.ncbi.nlm.nih.gov/nuccore/v00491, mark the regions of the sequence corresponding to the three exons.
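When working with a flat-file sequence like the one above, it helps to strip out the position numbers and spaces so that you can extract any region by its coordinates. The sketch below does this for the first two lines of the listing (copied verbatim); you could paste in the full listing in the same way. The function names are hypothetical, and coordinates are treated as 1-based and inclusive, which is an assumption that matches the numbering shown at the start of each line.

```python
def flatfile_to_sequence(text):
    """Join flat-file sequence lines, dropping position numbers and whitespace."""
    return "".join(ch for ch in text if ch.isalpha()).lower()


def subsequence(sequence, start, end):
    """Return the region from start to end (1-based, inclusive)."""
    return sequence[start - 1:end]


# First two lines of the listing above.
listing = """
        1 tgcccccgcg ccccaagcat aaaccctggc gcgctcgcgg cccggcactc ttctggtccc
       61 cacagactca gagagaaccc accatggtgc tgtctcctgc cgacaagacc aacgtcaagg
"""

sequence = flatfile_to_sequence(listing)
print(len(sequence))                  # number of bases parsed
print(subsequence(sequence, 61, 70))  # the first block of the second line
print(sequence.find("atg") + 1)       # 1-based position of the first 'atg'
```

Once you have the exon coordinates from the textbook or the NCBI record, `subsequence` lets you pull out each exon and check that it begins and ends where you expect.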