Chapter 3 Week 1 Part A - Introduction to the Shell
3.1 Learning Objectives
At the conclusion of today’s workshop, students are expected to be able to:
- Compare and contrast command line interfaces with graphical user interfaces
- Articulate the principles of Bash programming, including the common syntax of commands
- Execute common commands to solve routine problems using Bash
3.2 Workshop setup
Please execute the following commands immediately after logging into the terminal. You will learn what these commands do in this and future workshops.
mkdir bms5021
cp -r /apps/data/bms5021/week_1 bms5021/
3.4 Creating files and directories
We can create files and directories using the touch and mkdir commands. touch will create an empty file and mkdir will create a new directory. Let’s give it a try:
Exercise 4:
Within the ~/bms5021/week_1 directory, create a directory called “src”. Within this directory, create an empty file called script_1.sh.
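To see how the two commands behave before attempting the exercise, here is a minimal sketch run in a throwaway directory. The names scratch_dir and notes.txt are placeholders chosen for illustration and are not part of the course data.

```bash
# Work in a temporary directory (an assumption of this sketch) so it is safe to re-run.
workdir=$(mktemp -d)
cd "$workdir"

# mkdir creates a new, empty directory.
mkdir scratch_dir

# touch creates an empty file if it does not already exist.
touch scratch_dir/notes.txt

# ls -l lists the directory so you can confirm the file exists and has size 0.
ls -l scratch_dir
```

If you run touch on a file that already exists, its contents are left alone and only its timestamp is updated.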
3.5 Copying, moving and deleting files
There are occasions when we want to copy, move or even delete files from the shell. This can be achieved using the cp, mv and rm commands, respectively.
- The basic syntax for cp and mv is as follows:
cp source_file target_file
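The three commands can be sketched together as follows. This is only an illustration in a temporary directory; the file names original.txt, copy.txt and renamed.txt are made up for the example.

```bash
# Work in a temporary directory so no real files are affected.
workdir=$(mktemp -d)
cd "$workdir"

touch original.txt

# cp leaves the source in place and creates a second file.
cp original.txt copy.txt

# mv renames (or relocates) the file; the old name no longer exists.
mv copy.txt renamed.txt

# rm deletes a file permanently; the shell has no recycle bin.
rm original.txt

# Only renamed.txt should remain.
ls
```

Because rm cannot be undone, it is worth checking the file name with ls before deleting anything.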
Exercise 5:
Within the ~/bms5021/week_1 directory, create a couple of files using the touch command. Explore the cp and mv commands by moving these files around and creating new copies. When you are done, clean up the files you have created with rm.
3.6 Wildcards
Sometimes when we are performing tasks in Bash we would like to match patterns in file names. For instance, I might have a large set of fastq files from a paired-end sequencing run. Read pairs are usually denoted by R1 and R2. I can use wildcards to list all the R1 fastq files in a directory using the following command:
ls *R1.fastq
There are several wildcards in Bash:
Wildcard | Description | Example | Output |
---|---|---|---|
* | Matches any number of characters. | ls *.pdf | All files with the .pdf file extension |
? | Matches any single character. | ls sample_1.?am | All files with an extension that starts with any character and is followed by am. Examples might include sample_1.bam or sample_1.sam |
[] | Matches any single character within a range or set | ls sample_[123].sam | Returns sample_1.sam, sample_2.sam, and sample_3.sam if they exist |
{} | Expands a comma-separated list of strings or characters | ls sample_1.{sam,bam} | Returns sample_1.sam and sample_1.bam |
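The patterns in the table can be tried out safely on empty files. The sketch below is one way to do that; the file names are invented to mirror the table, and echo is used instead of ls simply because the shell expands the pattern before the command runs, so echo shows exactly what each pattern turned into.

```bash
# Work in a temporary directory with empty, hypothetical files.
workdir=$(mktemp -d)
cd "$workdir"
touch sample_1.sam sample_1.bam sample_2.sam sample_3.sam sample_4.sam report.pdf

star=$(echo *.pdf)                 # *  : any number of characters
question=$(echo sample_1.?am)      # ?  : exactly one character
bracket=$(echo sample_[123].sam)   # [] : one character from the set 1, 2, 3
brace=$(echo sample_1.{sam,bam})   # {} : each listed string in turn (Bash feature)

# Print one expansion per line.
printf '%s\n' "$star" "$question" "$bracket" "$brace"
```

Note that sample_4.sam is not matched by the [123] pattern, because 4 is not in the set.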
Exercise 6:
Let’s try a few examples. In the ~/bms5021/week_1/data/sequences directory you will find a set of DNA sequences in FASTA format. Each file contains a different number of sequences. FASTA files contain a descriptor line, which is indicated by the > symbol, and a line containing the sequence of nucleotides. We can use the wc tool to count the number of lines in a file. Use the wc tool and wildcard expressions to determine the number of sequences in each file. (Hint: if you are unsure about wc, check out the manual using man wc.)
Solution
grep ">" *.fasta | wc -l
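The solution above prints a single total for all files combined. If you would like a separate count for each file, one possible approach is sketched below on two toy FASTA files. The file names and sequences are made up for the example, so the counts will differ from the real course data.

```bash
# Work in a temporary directory with two toy FASTA files (hypothetical content).
workdir=$(mktemp -d)
cd "$workdir"
printf '>seq1\nACGT\n>seq2\nTTGA\n' > toy_a.fasta
printf '>seq1\nGGCC\n' > toy_b.fasta

# wc -l with a wildcard prints a line count per file plus a total.
# If every record is exactly two lines, sequences = lines / 2.
wc -l *.fasta

# grep -c counts the matching lines separately for each file.
per_file=$(grep -c ">" *.fasta)
echo "$per_file"

# Piping grep into wc -l gives one total across all files.
total=$(grep ">" *.fasta | wc -l)
echo "$total"
```

Counting the > lines is more robust than halving the line count, since it still works when a sequence is wrapped over several lines.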
3.7 Viewing file content
There are several ways we can work with the contents of files in Bash. To view files we can leverage a suite of standard tools, including cat, less, head and tail.
Command | Description | Use case |
---|---|---|
cat | This command will concatenate files and return the contents to standard output (the terminal). | |
less | This command will display the contents of a file in your terminal window. You can move through the file using the up and down arrows. You exit the program by typing q. | |
head | This command will print out the first n lines of a file. By default this is 10. Different options can be set to modify this. | |
tail | This command will print out the last n lines of a file. By default this is 10. Different options can be set to modify this. | |
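The non-interactive commands in the table can be sketched as follows (less is left out because it needs a live terminal). The file numbers.txt is a made-up 20-line file used only for illustration.

```bash
# Work in a temporary directory and build a 20-line toy file.
workdir=$(mktemp -d)
cd "$workdir"
seq 1 20 > numbers.txt

first_three=$(head -n 3 numbers.txt)      # -n 3: first 3 lines instead of the default 10
last_two=$(tail -n 2 numbers.txt)         # -n 2: last 2 lines instead of the default 10
default_head=$(head numbers.txt | wc -l)  # with no option, head prints 10 lines

echo "$first_three"
echo "$last_two"
echo "$default_head"

# cat prints whole files back to back, so it suits short files.
cat numbers.txt numbers.txt | wc -l
```

The -n option is the usual way to change how many lines head and tail print; see man head and man tail for the others.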
Exercise 7:
In the ~/bms5021/week_1/data/vcf directory you will find a large variant call file (we will learn what the contents of this file mean in later weeks). Using the commands above, explore this file. What do you think is the most effective way to navigate this file?
Solution
My typical approach here is to use head and tail first to inspect the top and bottom of the file, and then to use less to navigate it more easily.
3.8 Searching file content
Sometimes we will want to search for and extract certain phrases from files. For this we can use the grep tool. grep will search for patterns in a file and return the matching lines to the terminal. The grep tool has several handy options, including regular expression matching, inverted search, and match counting. This is an important tool that you will use in several workflows; check out the manual to see all the relevant options.
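The options mentioned above can be sketched on a small toy file. The file name toy.txt and its four lines are invented for this example; the flags themselves (-v, -c, -E) are standard grep options.

```bash
# Work in a temporary directory with a four-line toy file.
workdir=$(mktemp -d)
cd "$workdir"
printf 'ACGTTG\nGGGCCC\nTTGAAA\nCCCAAA\n' > toy.txt

matches=$(grep "TTG" toy.txt)          # lines containing TTG
inverted=$(grep -v "TTG" toy.txt)      # -v: lines that do NOT contain TTG
count=$(grep -c "AAA" toy.txt)         # -c: number of matching lines
regex=$(grep -E "^(GGG|CCC)" toy.txt)  # -E: extended regular expression

printf '%s\n' "$matches" "$inverted" "$count" "$regex"
```

Note that -c counts matching lines, not individual matches, so a line containing the pattern twice is still counted once.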
Exercise 8:
Earlier we counted the number of sequences in each fasta file in ~/bms5021/week_1/data/sequence. Now, using grep and the wildcards that you learned earlier:
- Search for all sequences in the fasta files containing the string TTG.
- Search for all sequences that don’t contain TTG.
- Search for all sequences that contain AAA in mock_sequence_1.fasta to mock_sequence_5.fasta.
3.9 Editing files - Sed
There are two main ways we can edit files:
- Interactively, using tools like nano and vim
- Non-interactively, using tools like sed
Let’s talk about sed first. This tool is most useful when incorporated into scripts, as it can search for patterns and perform operations such as replacing the string, deleting the string, and inserting characters before and after the string. The power of the tool lies in its ability to perform these operations at speed and scale. sed commands follow this general syntax:
sed [OPTIONS] 'command(s)' input_file(s)
The most common application of sed is the substitute command:
- Substitute - This command can be used to replace patterns with new strings. The trailing g replaces every match on a line rather than only the first. The syntax for sed substitute is:
sed 's/old_text/new_text/g' input_file
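A minimal sketch of the substitute syntax in action, using a one-line toy file whose name and contents are invented for the example. It compares the command with and without the trailing g.

```bash
# Work in a temporary directory with a one-line toy file.
workdir=$(mktemp -d)
cd "$workdir"
printf 'old_text and old_text again\n' > toy.txt

# Without g, only the first match on each line is replaced.
first_only=$(sed 's/old_text/new_text/' toy.txt)

# With g, every match on the line is replaced.
all_matches=$(sed 's/old_text/new_text/g' toy.txt)

echo "$first_only"
echo "$all_matches"

# sed writes to standard output; the input file itself is unchanged.
cat toy.txt
```

Because the input file is left untouched, you can experiment with a substitution freely and only redirect the output to a new file once it looks right.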
Exercise 9:
Some genome reference builds contain the chr prefix on chromosome names, while others omit it. The vcf file we have provided you contains the chr prefix. Using sed and other commands, perform the following tasks:
- Identify the number of variants in this file. (Hint: you might consider using the grep and wc tools.)
- Using the sed tool, remove the chr prefix from both the header and the variant entries.
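One possible way to approach these tasks is sketched below on a tiny, made-up VCF. It assumes that header lines start with # and that the header names chromosomes in lines of the form ##contig=<ID=chr1>. The real course file may reference chromosomes in other ways, so check its header with head or less before reusing the patterns.

```bash
# Work in a temporary directory with a toy VCF (hypothetical content, shortened columns).
workdir=$(mktemp -d)
cd "$workdir"
printf '##fileformat=VCFv4.2\n##contig=<ID=chr1>\n#CHROM\tPOS\tID\tREF\tALT\n' > toy.vcf
printf 'chr1\t100\t.\tA\tG\nchr1\t200\t.\tC\tT\nchr2\t300\t.\tG\tA\n' >> toy.vcf

# Variant entries are the lines that do not start with #.
n_variants=$(grep -vc "^#" toy.vcf)
echo "$n_variants"

# First expression: strip chr at the start of variant lines.
# Second expression: strip chr inside the assumed ##contig header form.
sed -e 's/^chr//' -e 's/ID=chr/ID=/' toy.vcf > toy_nochr.vcf

# Count the lines that still contain chr (|| true because grep exits non-zero on no match).
remaining=$(grep -c "chr" toy_nochr.vcf || true)
echo "$remaining"
```

Anchoring the first pattern with ^ limits the change to the chromosome column, so any other text that happens to contain chr is left alone.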
3.10 Editing files - Nano
We can also edit files interactively, much like you’d use Word or Notepad in a GUI-based interface. To get started, watch this short video on how to use nano:
Exercise 10:
Let’s construct our first bash script using nano. Remember that earlier we created a file called script_1.sh in the src directory. Using nano:
- Open the file.
- Type #!/bin/bash on the first line. This is what we call a shebang. It provides the path to the bash interpreter and should be at the top of all of your scripts.
- On a new line, write a line of code that uses the echo tool to print out “This is my first script in bms5021”. If you are not familiar with this tool, use the man tool to review the manual.
- Save the script and execute it by typing bash script_1.sh into the terminal. If you get a path error, you may need to use cd to ensure you are in the right directory.
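For reference, the finished script might look like the sketch below. Since nano is interactive, a here-document stands in for typing the two lines in the editor, and a temporary directory stands in for your src directory. Both are assumptions made so the example can run on its own.

```bash
# Work in a temporary stand-in for the src directory.
workdir=$(mktemp -d)
mkdir "$workdir/src"
cd "$workdir/src"

# The here-document writes the same two lines you would type in nano.
cat > script_1.sh <<'EOF'
#!/bin/bash
echo "This is my first script in bms5021"
EOF

# Run the script with the bash interpreter and show what it prints.
output=$(bash script_1.sh)
echo "$output"
```

Running the script as bash script_1.sh works even if the file is not marked as executable.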
Exercise 11:
Let’s now circle back to sed. Using sed substitute:
- Replace bms5021 with Applied Bioinformatics.
- Use the > operator to save the output to a new file called script_2.sh.
- Execute the new script.
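A sketch of how these steps could fit together, assuming the substitution is applied to the script_1.sh written in Exercise 10. The script is recreated in a temporary directory here so the example is self-contained.

```bash
# Work in a temporary directory and recreate the script from Exercise 10.
workdir=$(mktemp -d)
cd "$workdir"
printf '#!/bin/bash\necho "This is my first script in bms5021"\n' > script_1.sh

# Substitute the text and redirect sed's output into a new file.
sed 's/bms5021/Applied Bioinformatics/g' script_1.sh > script_2.sh

# Execute the new script; script_1.sh itself is unchanged.
output=$(bash script_2.sh)
echo "$output"
```

The > operator overwrites script_2.sh if it already exists, so choose the output name with care.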