Chapter 3 Week 1 Part A - Introduction to the Shell

3.1 Learning Objectives

At the conclusion of today’s workshop, students are expected to be able to:

  • Compare and contrast command line interfaces with graphical user interfaces

  • Articulate the principles of Bash programming, including the common syntax of commands

  • Execute common commands to solve routine problems using Bash

3.2 Workshop setup

Please execute the following commands immediately after logging into the terminal. You will learn what these commands do in this and future workshops.

  • mkdir bms5021

  • cp -r /apps/data/bms5021/week_1 bms5021/

3.4 Creating files and directories

We can create files and directories using the touch and mkdir commands. touch creates an empty file and mkdir creates a new directory. Let’s give it a try:
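As a minimal sketch of the two commands, the example below works inside a temporary directory. The names project, notes and todo.txt are hypothetical and chosen only for illustration.

```shell
# Work in a throwaway directory so nothing in your home directory is touched.
demo_dir=$(mktemp -d)
cd "$demo_dir"

# mkdir creates a directory; -p also creates any missing parent directories.
mkdir -p project/notes

# touch creates an empty file if it does not already exist.
touch project/notes/todo.txt

# List the result; -l shows file sizes, so the new file appears with 0 bytes.
ls -l project/notes
```

If the file already exists, touch leaves its contents alone and only updates its timestamp.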

Exercise 4:

Within the ~/bms5021/week_1 directory, create a directory called “src”. Within this directory, create an empty file called script_1.sh

3.5 Copying, moving and deleting files

There are occasions when we want to copy, move or even delete files from the shell. This can be achieved using the cp, mv and rm commands, respectively.

  • The basic syntax for cp and mv is the same: cp source_file target_file and mv source_file target_file. rm takes the name of the file to delete: rm file
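The following sketch shows the three commands side by side in a temporary directory. All file and directory names (draft.txt, backup.txt, archive and so on) are hypothetical.

```shell
# Work in a throwaway directory.
demo_dir=$(mktemp -d)
cd "$demo_dir"
touch draft.txt

# cp makes a copy: draft.txt and backup.txt both exist afterwards.
cp draft.txt backup.txt

# mv renames (or moves) a file: draft.txt no longer exists afterwards.
mv draft.txt final.txt

# If the target is an existing directory, the file is moved into it.
mkdir archive
mv backup.txt archive/

# Directories need the -r (recursive) option for both cp and rm.
cp -r archive archive_copy
rm final.txt
rm -r archive_copy

# Show what is left: only archive/backup.txt.
ls -R
```

Note that rm does not use a recycle bin, so deleted files cannot be recovered. Check the file names carefully before running it.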

Exercise 5:

Within the `~/bms5021/week_1` directory, create a couple of files using the touch command. Explore the cp and mv commands by moving these files around and creating new copies. When you are done, clean up the files you have created with rm.

3.6 Wild cards

Sometimes when we are performing tasks in Bash we would like to match patterns in text strings such as file names. For instance, I might have a large set of fastq files from a paired-end sequencing run. Read pairs are usually denoted by R1 and R2. I can use wild cards to list all the R1 fastq files in a directory with the following command:

ls *R1.fastq

There are several wildcards in Bash:

| Wildcard | Description | Example | Output |
|----------|-------------|---------|--------|
| `*` | Matches any number of characters. | `ls *.pdf` | All files with the .pdf file extension |
| `?` | Matches any single character. | `ls sample_1.?am` | All files with an extension that starts with any character and is followed by am. Examples might include sample_1.bam or sample_1.sam |
| `[]` | Matches any single character within a range or set | `ls sample_[123].sam` | Returns sample_1.sam, sample_2.sam, and sample_3.sam if they exist |
| `{}` | Expands a comma-separated list of strings or characters | `ls sample_1.{sam,bam}` | Returns sample_1.sam and sample_1.bam |

These can be integrated into any commands where finding strings and patterns is useful. Several wildcards can be combined within a single string, making them a very powerful tool.
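To see each wildcard in action, the sketch below creates a handful of empty files in a temporary directory and lists them with different patterns. The file names are hypothetical and exist only for this demonstration.

```shell
# Create some empty demonstration files in a throwaway directory.
demo_dir=$(mktemp -d)
cd "$demo_dir"
touch sample_1.sam sample_1.bam sample_2.sam sample_3.sam sample_10.sam notes.pdf

# * matches any number of characters: every file ending in .sam.
ls *.sam

# ? matches exactly one character: sample_1.bam and sample_1.sam.
ls sample_1.?am

# [] matches one character from a set: sample_1.sam and sample_2.sam only.
ls sample_[12].sam

# {} expands to each listed string in turn.
ls sample_1.{sam,bam}

# Wildcards can be combined: samples 1 to 3 with a .bam or .sam extension.
ls sample_[1-3].?am
```

Because [] matches a single character, sample_[1-3].sam does not match sample_10.sam.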

Exercise 6:

Let’s try a few examples. In the ~/bms5021/week_1/data/sequences directory you will find a set of DNA sequences in FASTA format. Each file contains a different number of sequences. FASTA files contain a descriptor line, which is indicated by the > symbol, and a line containing the sequence of nucleotides. We can use the wc tool to count the number of lines in a file. Use the wc tool and wild-card expressions to determine the number of sequences in each file. (Hint: if you are unsure about wc, check out the manual using man wc.)

Solution

grep ">" *.fasta | wc -l

This returns the total number of sequences across all of the files. To get a separate count for each file, use grep -c ">" *.fasta, which prints one count per file.

3.7 Viewing file content

There are several ways we can work with the contents of files in Bash. To view files we can use a suite of built-in tools, including cat, less, head and tail.

| Command | Description | Use case |
|---------|-------------|----------|
| cat | This command will concatenate files and return the contents to standard output (the terminal). | To view the entire contents of a file; to combine several files |
| less | This command will display the contents of a file in your terminal window. You can move through the file using the up and down arrows. You exit the program by typing q. | To view and navigate the contents of a file |
| head | This command will print out the first n lines of a file. By default this is 10. Different options can be set to modify this. | View the header of a file |
| tail | This command will print out the last n lines of a file. By default this is 10. Different options can be set to modify this. | View the end of a file |
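The sketch below illustrates head, tail and cat on a small generated file. The temporary file and its contents (the numbers 1 to 100, produced with seq) are assumptions made for the demonstration and are not part of the workshop data.

```shell
# Build a 100-line demonstration file containing the numbers 1 to 100.
demo_file=$(mktemp)
seq 1 100 > "$demo_file"

# head prints the first 10 lines by default; -n changes how many.
head -n 3 "$demo_file"

# tail does the same for the end of the file.
tail -n 2 "$demo_file"

# cat prints whole files one after another, so it can also combine them.
# Here the same file is concatenated with itself and the lines are counted.
cat "$demo_file" "$demo_file" | wc -l

# less is interactive, so it is left commented out here:
# less "$demo_file"
```

For large files, head and tail are usually a safer first look than cat, which prints everything to the terminal at once.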

Exercise 7:

In the ~/bms5021/week_1/data/vcf directory you will find a large variant call file (we will learn what the contents of this file mean in later weeks). Using the commands above, explore this file. What do you think is the most effective way to navigate this file?

Solution

My typical approach here is to use head and tail first to inspect the top and bottom of the file, and then to use less to navigate it more easily.

3.8 Searching file content

Sometimes we will want to search for and extract certain phrases from files. For this we can use the grep tool. grep searches for a pattern in a file and prints every matching line to the terminal. The grep tool has several handy options, including regular expression matching, inverted search, and match counting. This is an important tool that you will use in several workflows, so check out the manual to see all the relevant options.
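A minimal sketch of these options is shown below. It uses a small made-up FASTA-style file; the sequence names and bases are invented for illustration and are not taken from the workshop data.

```shell
# Write a tiny demonstration file with three records.
demo_file=$(mktemp)
printf '>seq_1\nACGTTGCA\n>seq_2\nGGGAAACC\n>seq_3\nTTGTTGAA\n' > "$demo_file"

# Basic search: print every line that contains the pattern.
grep "TTG" "$demo_file"

# -c counts matching lines instead of printing them.
grep -c ">" "$demo_file"

# -v inverts the search: print lines that do NOT match.
grep -v ">" "$demo_file"

# -E enables extended regular expressions: three A's or three G's in a row.
grep -E "A{3}|G{3}" "$demo_file"
```

Options can be combined, for example grep -vc to count the lines that do not match.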

Exercise 8:

Earlier we counted the number of sequences in each fasta file in ~/bms5021/week_1/data/sequence. Now, using grep and the wildcards that you learned earlier:

  • Search for all sequences in the fasta files containing the string TTG.

  • Search for all sequences that don’t contain TTG

  • Search for all sequences that contain AAA in mock_sequence_1.fasta to mock_sequence_5.fasta

3.9 Editing files - Sed

There are two main ways we can edit files:

  • Interactively using tools like nano and vim

  • Non-interactively using tools like sed

Let’s talk about sed first. This tool is most useful when incorporated into scripts, as it can search for patterns and perform operations such as replacing the string, deleting the string and inserting characters before or after the string. The power of the tool lies in its ability to perform these operations at speed and scale. A sed command has the following general syntax:

  • sed [OPTIONS] 'command(s)' input_file(s)

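The most common application of sed is the substitute command:

  • Substitute - This command can be used to replace patterns with new strings. The syntax for sed substitute is sed 's/old_text/new_text/g' input_file. Here s selects the substitute command, and the trailing g replaces every match on a line rather than only the first.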
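The effect of the g flag is easiest to see on a small example. The sketch below uses a made-up file in which the word sample appears twice on the first line; the file contents and the replacement word patient are hypothetical.

```shell
# Write a two-line demonstration file.
demo_file=$(mktemp)
printf 'sample_1 sample_2\nsample_3\n' > "$demo_file"

# Without g, only the first match on each line is replaced.
sed 's/sample/patient/' "$demo_file"

# With g, every match on each line is replaced.
sed 's/sample/patient/g' "$demo_file"

# sed prints the result to standard output; the input file is unchanged.
cat "$demo_file"
```

Because sed writes to standard output by default, the edited text has to be redirected to a new file if you want to keep it.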

Exercise 9:

Some genome reference builds add the chr prefix to chromosome names, while others omit it. The VCF file we have provided contains the chr prefix. Using sed and other commands, perform the following tasks:

  • Identify the number of variants in this file. (Hint: you might consider using the grep and wc tools.)

  • Using the sed tool, remove the chr prefix from both the header and the variant entries.

3.10 Editing files - Nano

We can also edit files interactively, much like you’d use Word or Notepad in a GUI-based interface. To get started, watch this short video on how to use nano:

https://youtu.be/cLyUZAabf40

Exercise 10:

Let’s construct our first bash script using nano. Remember earlier we created a file called script_1.sh in the src directory. Using nano:

  • Open the file

  • Type #!/bin/bash on the first line. This is what we call a shebang. It provides the path to the Bash interpreter and should be at the top of all of your scripts.

  • On a new line, write a line of code that uses the echo tool to print out “This is my first script in bms5021”. If you are not familiar with this tool, use the man tool to review the manual.

  • Save the script and execute it by typing bash script_1.sh into the terminal. If you get a path error, you may need to use cd to ensure you are in the right directory.

Exercise 11:

Let’s now circle back to sed. Using sed substitute:

  • Replace bms5021 with Applied Bioinformatics

  • Use the > operator to save the output to a new file called script_2.sh

  • Execute the new script.