Chapter 4 Week 1 Part B - Introduction to shell scripting and pipelines
In the second part of this workshop, we will apply the skills we have learned thus far and extend them to scripting. We will be constructing basic pipelines, using redirects and writing shell scripts which we can execute to perform a series of functions.
4.1 Learning Objectives
At the conclusion of today’s workshop, students are expected to be able to:
Combine a series of commands to manipulate common files and save the corresponding outputs using pipes and redirects
Construct and execute Bash scripts to solve multi-step programming problems
Formulate basic loops to execute a series of commands over multiple files
4.2 Workshop setup
To download today’s workshop data, enter the following commands:
cd ~/bms5021/week_1/data/
wget http://lochlanf.github.io/workshop_2.tar.gz
tar -xf workshop_2.tar.gz
4.3 Redirects and pipes
In workshop A we learned several useful commands, including grep, sed, wc, and head. These and other tools are often most powerful when linked together in a pipeline, with the result saved to disk. For example, we might be interested in finding out how many variants in a .vcf file map to chromosome 1. To perform this task, we would first perform a grep search to isolate variants that are located on chromosome 1 and then use wc to count them. The most efficient way to do this is to use a pipe. A pipe passes the standard output of one command to the standard input of another. In bash, pipes use the | operator (SHIFT+\ on most keyboards). Here is an example of how we might perform the task outlined above using pipes:
grep -v '^#' HG002.vcf | grep -w '^chr1' | wc -l
Exercise 1: At your table discuss what is occurring at each step of this pipeline. Explain the logic behind these steps.
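One way to see what each stage of a pipeline contributes is to run it one stage at a time on a small file. The sketch below builds a made-up, tab-separated toy file (toy.vcf and its contents are assumptions for illustration, not the workshop data) and counts its chr1 entries in two ways. It also illustrates that an unanchored pattern such as chr1 matches chr10 as well, whereas anchoring the pattern to the start of the line (^) and matching whole words (-w) does not.

```shell
#!/bin/bash
# Build a tiny stand-in for a VCF file (hypothetical content, tab-separated).
workdir=$(mktemp -d)
printf '##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\n' > "$workdir/toy.vcf"
printf 'chr1\t100\t.\tT\tC\nchr1\t200\t.\tA\tG\n' >> "$workdir/toy.vcf"
printf 'chr10\t300\t.\tT\tC\nchr2\t400\t.\tG\tA\n' >> "$workdir/toy.vcf"

# Stage 1 only: drop header lines (those starting with '#').
grep -v '^#' "$workdir/toy.vcf"

# Stages 1 and 2: keep lines containing 'chr1' (the chr10 line matches too).
grep -v '^#' "$workdir/toy.vcf" | grep 'chr1'

# Full pipeline, first with the unanchored pattern, then with an anchored
# whole-word pattern that excludes chr10.
loose_count=$(grep -v '^#' "$workdir/toy.vcf" | grep 'chr1' | wc -l)
strict_count=$(grep -v '^#' "$workdir/toy.vcf" | grep -w '^chr1' | wc -l)
echo "unanchored pattern: $loose_count"
echo "anchored whole-word pattern: $strict_count"
```

Running the stages separately like this is a useful habit when a pipeline returns a count you did not expect.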
Redirects are used to pass information to and from bash streams. Each program handles streams differently and has its own default behaviour; however, this can be modified using redirects. Let's recap what the standard bash streams are:
Stream | File Descriptor | Explanation | Redirect operator |
---|---|---|---|
Standard Input (stdin) | 0 | This stream provides input to a program. Data can be passed to stdin by typing into the terminal or by redirecting a file/text. | < |
Standard Output (stdout) | 1 | This is the stream to which output data from a program is written. Depending on the program, this stream can return the output data to the terminal (ls, cat etc.), write it to a file (many bioinformatics tools with output options) or be silent (e.g. mv, cp). | > |
Standard Error (stderr) | 2 | This is the error stream. Programs pass error messages to this stream. | 2> |
As outlined above, the default behaviour of a program with respect to streams can be changed. We can use the redirect operators to pass information to a command (redirecting stdin, <), write information to a file (redirecting stdout, >), pass the output of a command to a subsequent program (piping, |), and write errors to a log (redirecting stderr, 2>).
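The redirect operators can be tried out safely in a throwaway directory. The following is a minimal sketch, not part of the workshop data: the report function and the file names result.txt and errors.log are hypothetical, chosen only so that one command produces output on both stdout and stderr. It also uses >>, which appends to a file rather than overwriting it.

```shell
#!/bin/bash
# Work in a throwaway directory so no real files are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical helper that writes one line to stdout and one line to stderr.
report() {
  echo "3 variants found"
  echo "warning: 1 line skipped" >&2
}

# > sends stdout to a file (creating or overwriting it).
# stderr is not redirected here, so the warning still reaches the terminal.
report > result.txt

# 2> sends stderr to a log, so each stream now goes to its own file.
report > result.txt 2> errors.log

# >> appends to an existing file instead of overwriting it.
echo "run complete" >> result.txt

# < feeds the contents of a file to a command's stdin.
line_count=$(wc -l < result.txt)
echo "result.txt has $line_count lines"
```

Keeping stdout and stderr in separate files means a results file never gets mixed up with warning messages.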
Exercise 2: Let's practice using redirects and pipes.
In week_1/data/workshop_2 you will find a file called animals.csv. Use one of the viewing functions we learned last week to take a peek at this file. Our task is to search for entries containing the string rabbit and sort the output by the third column, which corresponds to the number of rabbits observed on that date.
You will need to use two commands, with the standard output of the first piped to the second command. (Hint: The second command will be sort. You will need to provide a few options for this to work successfully.) Review the manual to gain some insight into the options of sort.
On what date was the smallest number of rabbits observed?
Now let's try an example using the vcf file in the week_1/data/vcf directory. Let's imagine that we are only interested in T>C single nucleotide substitutions:
Construct a pipeline that will remove the header from the vcf file, subset variants with a T reference allele and a C alternative allele, and count the number of variants.
Redirect the output to a file called T_to_C_cts.txt
Note: You will need to use the awk command to achieve this. Here is an example of an awk command that will return entries with the string hello in the fourth column from a file called example.txt: awk '$4 == "hello"' example.txt. The $4 indicates the fourth column. Check out the manual for awk for more details.
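As a rough illustration of how awk column tests behave, the sketch below uses a made-up table (fruit.txt and its three columns are assumptions, not workshop files). It shows a single column test, two tests combined with &&, and the result of a test piped to wc.

```shell
#!/bin/bash
workdir=$(mktemp -d)
# Hypothetical tab-separated table with columns: name, colour, count.
printf 'apple\tred\t4\npear\tgreen\t7\n' > "$workdir/fruit.txt"
printf 'grape\tred\t12\napple\tgreen\t9\n' >> "$workdir/fruit.txt"

# Print rows whose second column is exactly "red".
awk '$2 == "red"' "$workdir/fruit.txt"

# && requires both tests to be true: first column "apple" AND second "green".
awk '$1 == "apple" && $2 == "green"' "$workdir/fruit.txt"

# awk writes to stdout, so its output can be piped like any other command.
red_count=$(awk '$2 == "red"' "$workdir/fruit.txt" | wc -l)
echo "rows with red in column 2: $red_count"
```

By default awk splits each line on whitespace (spaces or tabs), which is why no delimiter option is needed for this tab-separated file.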
4.4 Constructing basic FOR loops
Loops are essential constructs that allow us to repeat tasks efficiently, making them fundamental tools in automating repetitive tasks. In bioinformatics this might mean trimming reads from multiple samples or calling variants across different patients. The most basic bash loop is the FOR loop.
A “FOR” loop repeatedly executes a block of code once for each value in a defined sequence or list of items. In bash scripting, the “FOR” loop allows you to perform a specific action for each item in a list, array, or specified range of values. Here is an example of a FOR loop:
for i in 1 2 3 4 5
do
echo "This is iteration number $i"
done
## This is iteration number 1
## This is iteration number 2
## This is iteration number 3
## This is iteration number 4
## This is iteration number 5
Let's deconstruct this loop.
First, we assign the values 1 to 5 to the variable i. This is done iteratively (i.e. first i=1, the body of the loop is run, and then i=2, and so on).
do instructs the bash interpreter to complete all commands listed in the body of the loop.
The body of the loop can comprise any number of commands; in this case we are just running the echo program.
done tells the interpreter that the body of the loop is complete.
This process repeats until all items in the list have been iterated over.
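The list that a FOR loop iterates over does not have to be typed out by hand. A wildcard expands to the matching file names, and the loop then runs once per file. The sketch below assumes three made-up files (sample_A.txt, sample_B.txt and sample_C.txt are hypothetical) and reports how many lines each contains. The basename command, which strips the directory and an optional suffix from a path, is used here only to tidy the printed name.

```shell
#!/bin/bash
workdir=$(mktemp -d)
# Three hypothetical sample files containing 1, 2 and 3 lines.
printf 'a\n' > "$workdir/sample_A.txt"
printf 'a\nb\n' > "$workdir/sample_B.txt"
printf 'a\nb\nc\n' > "$workdir/sample_C.txt"

# The wildcard expands to the matching file names, which become the list.
for file in "$workdir"/sample_*.txt
do
  name=$(basename "$file" .txt)   # e.g. sample_A
  lines=$(wc -l < "$file")        # number of lines in this file
  echo "$name has $lines line(s)"
done > "$workdir/summary.txt"     # one redirect captures the whole loop

cat "$workdir/summary.txt"
```

Placing the redirect after done captures the output of every iteration in a single file, so nothing is overwritten between iterations.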
Loops are best constructed in scripts. Last workshop we created a basic script to print text to the terminal. Let's have a go at creating another script, but this time our script will contain a FOR loop.
Exercise 3: FOR loops
Last workshop we created a directory called src within week_1. Using cd, navigate to this directory and create a new script called my_first_loop.sh using nano.
Create a FOR loop that will find the pattern CCC in the .dat files in the creatures directory in week_1/data/workshop_2.
Save the output to a file called CCC_Creatures.txt. Make sure you append the sequences for each creature to this file, with each set of sequences separated by the file name.
Hint: You will need to use echo, grep, redirects (specifically the >> redirect), wildcards and variables for this exercise.
4.5 Constructing basic WHILE loops
The second type of loop that we can employ in bash is the WHILE loop. The while loop is used when you want to repeat a block of code while a certain condition is true. We can use while loops in combination with the read command to construct powerful loops. These loops read each line of a file and continue until there are no lines remaining. For these loops to work we must provide an input file via the standard input stream. Here is an example:
Suppose we have an input text file, example.txt, that contains the states of Australia, each on a new line. We can use a while read loop to print each of these lines to the terminal using the following code:
while read states
do
echo ${states}
done < example.txt
## Victoria
## New South Wales
## Tasmania
## Queensland
## South Australia
## Western Australia
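A slightly more defensive variant of this pattern is sketched below. It is an illustration rather than the workshop's required form: it creates its own short example.txt in a temporary directory, and the counter variable count is an addition for demonstration. The -r option stops read from treating backslashes as escape characters, IFS= preserves leading and trailing whitespace on each line, and the double quotes keep each line intact when it is printed.

```shell
#!/bin/bash
workdir=$(mktemp -d)
# A shortened, stand-in version of the input file.
printf 'Victoria\nNew South Wales\nTasmania\n' > "$workdir/example.txt"

count=0
# Each pass reads one line from stdin into the variable state.
# The loop ends when read reaches the end of the file.
while IFS= read -r state
do
  count=$((count + 1))
  echo "${count}: ${state}"
done < "$workdir/example.txt"

echo "Read $count lines"
```

Because the file is supplied with a redirect rather than a pipe, the loop runs in the current shell and count keeps its value after done.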
Exercise 4: Constructing a basic WHILE read loop
In the week_1/data/workshop_2/sequences directory you will find some fastq files. Using ls, grep, sed and redirects, create a text file called samples.txt that contains the sample names without the .fastq suffix.
Use a while read loop to search for the headers in each sequence (indicated by the @ symbol) and redirect them to a new file which has a file name format of sample_sequence_headers.txt
4.6 Setting variables
In bash we can assign values to variables. Bash variables are used to store data, such as text strings or numbers, within a Bash script or in the command line. They provide a way to reference and manipulate data as the script or command runs. Variables are very useful as they allow us to assign a string (such as a directory) to a variable and avoid typing it multiple times when writing a script. Here is an example of how we assign and access variables:
#!/bin/bash
variable1="My First Variable"
echo $variable1
echo ${variable1}_text
## My First Variable
## My First Variable_text
In this example, I have assigned the string My First Variable to variable1. I can then access that variable using the $ operator. The curly braces on the final line mark where the variable name ends, allowing us to place text directly before or after the variable.
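The effect of the curly braces is easiest to see by leaving them out. In the sketch below, the variable names and values (sample, data_dir, HG002_sorted.vcf) are made up for illustration. Without braces, bash reads $sample_sorted as the name of a different variable, which has never been set and therefore expands to nothing.

```shell
#!/bin/bash
# Assignments must not have spaces around the = sign.
sample="HG002"

# Without braces bash looks for a variable named sample_sorted, which is
# unset, so only the ".vcf" text survives.
without_braces="$sample_sorted.vcf"

# With braces the variable name ends at the closing brace.
with_braces="${sample}_sorted.vcf"

echo "Without braces: $without_braces"
echo "With braces: $with_braces"

# Double quotes keep a value containing spaces together as one string.
data_dir="/tmp/my data"
echo "Data directory: ${data_dir}"
```

Using braces and double quotes routinely avoids both problems, even in cases where they are not strictly required.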
Exercise 5: Creating and using variables.
Create a new directory called results in the week_1 directory.
In the src directory, create a new script called variables.sh
In your script set two variables: results and data, pointing to the absolute path of each of these directories
Use the echo command to call these variables and print them out
Execute your script
We can also use command line arguments to set variables that are provided when executing a script. These are indexed according to the position of the argument. For example, consider the following script called course.sh:
#!/bin/bash
echo "The title of this course is ${1}"
echo "The code for this course is ${2}"
When executing the script, we can pass information to $1 and $2 by providing additional arguments:
bash course.sh Introductory_Bioinformatics bms5021
## The title of this course is Introductory_Bioinformatics
## The code for this course is bms5021
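Positional arguments can be checked before they are used. The sketch below wraps the same two echo commands in a function so that it can be run as a single block. The function name describe_course and the usage message are hypothetical. Inside a function, $1, $2 and $# refer to the function's own arguments in the same way that they refer to a script's arguments at the top level of a script. $# holds the number of arguments supplied.

```shell
#!/bin/bash
# Hypothetical function standing in for the body of a script like course.sh.
describe_course() {
  # $# is the number of arguments; stop early if it is not exactly two.
  if [ "$#" -ne 2 ]; then
    echo "Usage: course.sh <title> <code>" >&2
    return 1
  fi
  # Copy the positional arguments into descriptively named variables.
  local title="$1"
  local code="$2"
  echo "The title of this course is ${title}"
  echo "The code for this course is ${code}"
}

# Quoting lets a single argument contain spaces.
describe_course "Introductory Bioinformatics" bms5021
```

Calling describe_course with only one argument prints the usage message to stderr and returns a non-zero status instead of printing an incomplete sentence.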
Exercise 6: Creating and using variables provided as arguments
In the src directory, create a new script called variables_2.sh
Write a script that uses command line arguments to set the results and data variables
Use the echo command to call these variables and print them out
Execute your script