Chapter 4 Week 1 Part B - Introduction to shell scripting and pipelines

In the second part of this workshop, we will apply the skills we have learned thus far and extend them to scripting. We will be constructing basic pipelines, using redirects and writing shell scripts which we can execute to perform a series of functions.

4.1 Learning Objectives

At the conclusion of today’s workshop, students are expected to be able to:

  • Combine a series of commands to manipulate common files and save the corresponding outputs using pipes and redirects

  • Construct and execute Bash scripts to solve multi-step programming problems

  • Formulate basic loops to execute a series of commands over multiple files

4.2 Workshop setup

To download today’s workshop data, enter the following commands:

  • cd ~/bms5021/week_1/data/

  • wget http://lochlanf.github.io/workshop_2.tar.gz

  • tar -xf workshop_2.tar.gz

4.3 Redirects and pipes

In workshop A we learned several useful commands, including grep, sed, wc, and head. These and other tools are often most powerful when linked together in a pipeline, with the result saved to disk. For example, we might be interested in finding out how many variants in a .vcf file map to chromosome 1. To perform this task, we would first need to run a grep search to isolate the variants located on chromosome 1 and then use wc to count them. The most efficient way to do this is to use a pipe. A pipe passes the standard output of one command to the standard input of another. In bash, pipes use the | operator (SHIFT+\ on most keyboards). Here is an example of how we might perform the task outlined above using pipes:

grep -v '#' HG002.vcf | grep 'chr1' | wc -l

Exercise 1: At your table discuss what is occurring at each step of this pipeline. Explain the logic behind these steps.
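Once you have discussed the pipeline, it can help to run it on a file small enough to check by hand. The sketch below builds a tiny, made-up VCF (toy.vcf is a hypothetical file, not part of the workshop data) and runs the same three steps. It also shows a stricter variant, since the plain pattern chr1 matches chr10, chr11 and so on as well. The -w and ^ additions are one possible way to tighten the search, not the only one.

```shell
# Build a tiny, made-up VCF in a temporary directory (hypothetical data).
tmpdir=$(mktemp -d)
printf '##fileformat=VCFv4.2\n' > "$tmpdir/toy.vcf"
printf '#CHROM\tPOS\tID\tREF\tALT\n' >> "$tmpdir/toy.vcf"
printf 'chr1\t100\t.\tT\tC\n' >> "$tmpdir/toy.vcf"
printf 'chr1\t200\t.\tA\tG\n' >> "$tmpdir/toy.vcf"
printf 'chr10\t300\t.\tT\tC\n' >> "$tmpdir/toy.vcf"
printf 'chr2\t400\t.\tG\tA\n' >> "$tmpdir/toy.vcf"

# Same shape as the pipeline above: drop header lines, keep chr1, count.
loose=$(grep -v '#' "$tmpdir/toy.vcf" | grep 'chr1' | wc -l)

# Stricter variant: ^ anchors each pattern to the start of the line and
# -w requires chr1 to be a whole word, so chr10 is not counted.
strict=$(grep -v '^#' "$tmpdir/toy.vcf" | grep -w '^chr1' | wc -l)

echo "loose pattern:  $loose"
echo "strict pattern: $strict"
```

On this toy file the loose pattern counts three lines (the chr10 variant is included) while the strict pattern counts two.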

Redirects are used to pass information to and from bash streams. Each program has its own default behaviour for each stream, but this can be modified using redirects. Let's recap what the standard bash streams are:

| Stream | File descriptor | Explanation | Redirect operator |
|---|---|---|---|
| Standard Input (stdin) | 0 | This stream provides input to a program. Data can be passed to stdin by typing into the terminal or by redirecting a file/text. | < |
| Standard Output (stdout) | 1 | This is the stream to which output data from a program is written. Depending on the program, this stream can return the output data to the terminal (ls, cat etc.), write it to a file (many bioinformatics tools with output options) or be silent (e.g. mv, cp etc.). | > |
| Standard Error (stderr) | 2 | This is the error stream. Programs pass error messages to this stream. | 2> |

As outlined above, the default behaviour of a program with respect to streams can be changed. We can use redirect operators to pass information to a command (redirecting stdin, <), write output to a file (redirecting stdout, >), write errors to a log (redirecting stderr, 2>), and pass the output of one command to a subsequent program (piping, |).
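Here is a minimal sketch of each operator in action. The file names (animals.txt, errors.log, no_such_file.txt) are made up for illustration and everything is written to a temporary directory. The >> operator, which appends to a file instead of overwriting it, is included because it behaves like > in every other respect.

```shell
tmpdir=$(mktemp -d)

# stdout (>): write the output of echo to a file instead of the terminal.
echo "rabbit" > "$tmpdir/animals.txt"
# Append (>>): add a line to the end of the file without overwriting it.
echo "fox" >> "$tmpdir/animals.txt"

# stdin (<): wc reads the file contents through standard input.
n_lines=$(wc -l < "$tmpdir/animals.txt")

# stderr (2>): ls complains about the missing file on stderr, and the
# message goes to the log instead of the terminal. "|| true" simply
# lets the example carry on after ls reports failure.
ls "$tmpdir/no_such_file.txt" 2> "$tmpdir/errors.log" || true

# Pipe (|): the stdout of sort becomes the stdin of head.
first=$(sort "$tmpdir/animals.txt" | head -n 1)

echo "lines: $n_lines, first after sorting: $first"
```

After running this, errors.log holds the message from ls, and nothing from that command appears in the terminal.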

Exercise 2: Let's practice using redirects and pipes.

  • In week_1/data/workshop_2 you will find a file called animals.csv. Use one of the viewing commands we learned last week to take a peek at this file. Our task is to search for entries containing the string rabbit and sort the output by the third column, which corresponds to the number of rabbits observed on that date.

    • You will need to use two commands, with standard output of the first piped to the second command. (Hint: The second command will be sort. You will need to provide a few options for this to work successfully.) Review the manual to gain some insight into the options of sort.

    • On what date were the fewest rabbits observed?

  • Now let's try an example using the vcf file in the week_1/data/vcf directory. Let's imagine that we are only interested in T>C single nucleotide substitutions:

    • Construct a pipeline that will remove the header from the vcf file, subset variants with a T reference allele and a C alternative allele, and count the number of variants.

    • Redirect the output to a file called T_to_C_cts.txt.

    • Note: You will need to use the awk command to achieve this. Here is an example of an awk command that will return entries with the string hello in the fourth column from a file called example.txt: awk '$4 == "hello"' example.txt. The $4 indicates the fourth column. Check out the manual for awk for more details.
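To get a feel for awk before tackling the vcf file, the sketch below applies the same kind of column test to two small, made-up files (example.txt and example.csv here are hypothetical and are created in a temporary directory). It also shows two features that are not covered above and are offered as suggestions only: combining conditions with &&, and setting the field separator with -F for comma-separated files.

```shell
tmpdir=$(mktemp -d)
# Hypothetical whitespace-delimited file with four columns.
printf 'a 1 x hello\nb 2 y world\nc 3 z hello\n' > "$tmpdir/example.txt"

# Print every line whose fourth column is exactly "hello".
awk '$4 == "hello"' "$tmpdir/example.txt"

# Combine two conditions with && and pipe the result to wc to count lines.
n=$(awk '$4 == "hello" && $2 > 1' "$tmpdir/example.txt" | wc -l)
echo "matching lines: $n"

# For comma-separated files, set the field separator with -F.
# The {print $1} action prints only the first column of matching lines.
printf 'a,1,hello\nb,2,world\n' > "$tmpdir/example.csv"
first_col=$(awk -F',' '$3 == "hello" {print $1}' "$tmpdir/example.csv")
echo "first column of match: $first_col"
```

By default awk splits columns on any run of spaces or tabs, which is why no -F option is needed for the first file.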

4.4 Constructing basic FOR loops

Loops are essential constructs that allow us to repeat tasks efficiently, making them fundamental tools in automating repetitive tasks. In bioinformatics this might mean trimming reads from multiple samples or calling variants across different patients. The most basic bash loop is the FOR loop.

A “FOR” loop repeatedly executes a block of code, once for each value in a defined sequence or list of items. In bash scripting, the “FOR” loop allows you to perform a specific action for each item in a list, array, or specified range of values. Here is an example of a FOR loop:

for i in 1 2 3 4 5
do
    echo "This is iteration number $i"
done
## This is iteration number 1
## This is iteration number 2
## This is iteration number 3
## This is iteration number 4
## This is iteration number 5

Let's deconstruct this loop.

  • First, the values 1 to 5 are assigned to the variable i, one at a time (i.e. first i=1 and the body of the loop is run, then i=2, and so on).

  • do marks the start of the body of the loop, which holds the commands that bash runs on each iteration.

  • The body of the loop can be composed of any number of commands; in this case we are just running the echo program.

  • done marks the end of the body of the loop.

  • This process repeats until all items in the list have been iterated over.
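The list after in does not have to be typed out by hand. As a rough sketch, the two loops below build their lists from a command (seq) and from a wildcard. The .txt files are made up and created in a temporary directory, and the $(( )) arithmetic syntax is an extra that has not been covered in this workshop.

```shell
# seq 1 5 prints the numbers 1 to 5, which become the list for the loop.
total=0
for i in $(seq 1 5)
do
    total=$((total + i))   # $(( )) performs integer arithmetic
done
echo "Sum of 1 to 5: $total"

# A wildcard expands to the list of matching files (hypothetical files here).
tmpdir=$(mktemp -d)
touch "$tmpdir/a.txt" "$tmpdir/b.txt" "$tmpdir/c.txt"
n_files=0
for f in "$tmpdir"/*.txt
do
    # basename strips the directory part of the path.
    echo "Found $(basename "$f")"
    n_files=$((n_files + 1))
done
echo "Number of files: $n_files"
```

Quoting "$f" inside the loop keeps file names that contain spaces in one piece.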

Loops are best constructed in scripts. Last workshop we created a basic script to print text to the terminal. Let's have a go at creating another script, but this time our script will contain a FOR loop.

Exercise 3: FOR loops

  • Last workshop we created a directory called src within week_1. Using cd, navigate to this directory and create a new script called my_first_loop.sh using nano.

  • Create a FOR loop that will find the pattern CCC in the .dat files in the creatures directory in week_1/data/workshop_2

  • Save the output to a file called CCC_Creatures.txt. Make sure you append the sequences for each creature to this file, with the sequences separated by the file name.

  • Hint: You will need to use echo, grep, redirects (specifically the >> redirect), wildcards and variables for this exercise.

4.5 Constructing basic WHILE loops

The second type of loop that we can employ in bash is the WHILE loop. The while loop is used when you want to repeat a block of code for as long as a certain condition is true. We can use while loops in combination with the read command to construct powerful loops. These loops read a file one line at a time and continue until there are no lines remaining. For these loops to work we must provide an input file via the standard input stream. Here is an example:

Suppose we have an input text file that contains the states of Australia, each on a new line. We can use a while read loop to print each of these lines to the terminal using the following code:

# -r stops read from treating backslashes as escape characters
while read -r states
do
    echo "${states}"
done < example.txt
## Victoria
## New South Wales
## Tasmania
## Queensland
## South Australia
## Western Australia
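The body of a while read loop can do more than echo each line. As a minimal sketch, the loop below numbers each line as it is read and keeps a running count. The input file is a shortened, made-up copy of example.txt written to a temporary directory, and the variable names n and state are arbitrary.

```shell
tmpdir=$(mktemp -d)
# Hypothetical input file with one state per line.
printf 'Victoria\nNew South Wales\nTasmania\n' > "$tmpdir/example.txt"

n=0
while read -r state
do
    n=$((n + 1))              # count the lines as they are read
    echo "${n}: ${state}"     # quotes keep multi-word names intact
done < "$tmpdir/example.txt"
echo "Read $n lines"
```

Because the file is supplied with < after done, every read inside the loop takes the next line from that file, and the loop stops when read runs out of lines.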

Exercise 4: Constructing a basic WHILE read loop

  • In the week_1/data/workshop_2/sequences directory you will find some fastq files. Using ls, grep, sed and redirects, create a text file called samples.txt that contains the name of each sample without the .fastq suffix.

  • Use a while read loop over samples.txt to search for the sequence headers in each sample's fastq file (indicated by the @ symbol) and redirect them to a new file for each sample, named in the format sample_sequence_headers.txt.

4.6 Setting variables

In bash we can assign values to variables. Bash variables are used to store data, such as text strings or numbers, within a Bash script or in the command line. They provide a way to reference and manipulate data as the script or command runs. Variables are very useful as they allow us to assign a string (such as a directory) to a variable and avoid typing it multiple times when writing a script. Here is an example of how we assign and access variables:

#!/bin/bash
variable1="My First Variable"
echo $variable1
echo ${variable1}_text
## My First Variable
## My First Variable_text

In this example, I have assigned the string My First Variable to variable1. Note that there are no spaces on either side of the = sign. I can then access that variable using the $ operator. The curly braces on the last line mark where the variable name starts and ends, which lets us separate it from any text placed directly before or after it.
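To see why the curly braces matter, compare what happens with and without them in the sketch below. The names with_braces, without_braces, project and data are made up for illustration, and the path is only an example of building one variable from another.

```shell
variable1="My First Variable"

# With braces, the variable name ends at the closing brace.
with_braces="${variable1}_text"

# Without braces, bash looks for a variable called variable1_text.
# No such variable has been set, so the result is an empty string.
without_braces="$variable1_text"

echo "with braces:    '${with_braces}'"
echo "without braces: '${without_braces}'"

# Variables can also be built from other variables, e.g. a directory path.
project="$HOME/bms5021/week_1"
data="${project}/data"
echo "$data"
```

Underscores, letters and digits are all valid in variable names, which is why bash reads variable1_text as one name unless the braces say otherwise.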

Exercise 5: Creating and using variables.

  • Create a new directory called results in the week_1 directory.

  • In the src directory, create a new script called variables.sh.

  • In your script, set two variables, results and data, that hold the absolute paths of the results and data directories.

  • Use the echo command to call these variables and print them out.

  • Execute your script.

We can also use command line arguments to set variables that are provided when executing a script. These are indexed according to the position of the argument. For example, consider the following script called course.sh:

#!/bin/bash
echo "The title of this course is ${1}"
echo "The code for this course is ${2}"

When executing the script, we can pass information to $1 and $2 by providing additional arguments:

bash course.sh Introductory_Bioinformatics bms5021

## The title of this course is Introductory_Bioinformatics
## The code for this course is bms5021
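Positional arguments are often copied into named variables at the top of a script so that the rest of the script is easier to read. The sketch below is one way this might look. The script name args_demo.sh and the variable names title and code are made up. The << 'EOF' block (a here-document, used only to write the script file from within this example), the ${2:-unknown} default value and the $# argument count are extras that are not covered above.

```shell
tmpdir=$(mktemp -d)
# Write a hypothetical script, args_demo.sh, to a temporary directory.
cat > "$tmpdir/args_demo.sh" << 'EOF'
#!/bin/bash
# Copy the positional arguments into named variables. If the second
# argument is missing, code falls back to the default value "unknown".
title="${1}"
code="${2:-unknown}"
echo "Received $# argument(s)"
echo "Title: ${title}, code: ${code}"
EOF

# Two arguments: both variables are set from the command line.
bash "$tmpdir/args_demo.sh" Introductory_Bioinformatics bms5021

# One quoted argument: the quotes keep the two words together as $1,
# and code takes its default value.
out=$(bash "$tmpdir/args_demo.sh" "Introductory Bioinformatics")
echo "$out"
```

Arguments are split on spaces, so any value that contains a space must be quoted when the script is executed.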

Exercise 6: Creating and using variables provided as arguments

  • In the src directory, create a new script called variables_2.sh.

  • Write a script that uses command line arguments to set the results and data variables.

  • Use the echo command to call these variables and print them out.

  • Execute your script.