Chapter 3 Week 1 Part A - Introduction to the Shell
3.1 Learning Objectives
At the conclusion of today’s workshop, students are expected to be able to:
- Compare and contrast command line interfaces with graphical user interfaces
- Articulate the principles of Bash programming, including the common syntax of commands
- Execute common commands to solve routine problems using Bash
3.2 Workshop setup
Please execute the following commands immediately after logging into the terminal. You will learn what these commands do in this and future workshops.
mkdir bms5021
cp -r /apps/data/bms5021/week_1 bms5021/
3.4 Creating files and directories
We can create files and directories using the touch and mkdir commands. touch creates an empty file and mkdir creates a new directory. Let's give it a try:
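As a minimal sketch of these two commands, the example below works in a throwaway directory created with mktemp, so it does not touch your course files. The names `scratch`, `project`, `results` and `notes.txt` are hypothetical and are not part of the workshop data.

```shell
# Work in a temporary scratch directory (hypothetical; not part of the course data)
scratch=$(mktemp -d)
cd "$scratch"

# mkdir creates a new directory; -p also creates any missing parent directories
mkdir -p project/results

# touch creates an empty file if it does not already exist
touch project/notes.txt

# List what was created, recursively
ls -R project
```

Running touch on a file that already exists does not empty it; it only updates the file's timestamp.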
Exercise 4:
Within the ~/bms5021/week_1 directory, create a directory called “src”. Within this directory, create an empty file called script_1.sh.
3.5 Copying, moving and deleting files
There are occasions when we want to copy, move or even delete files from the shell. This can be achieved using the cp, mv and rm commands, respectively.
- The basic syntax for cp and mv is as follows: cp source_file target_file
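The sketch below illustrates how the three commands behave, again in a temporary directory so that nothing real is modified. All file and directory names (`draft.txt`, `final.txt`, `archive`) are made up for this illustration.

```shell
# Throwaway directory so nothing real is modified (all names are hypothetical)
scratch=$(mktemp -d)
cd "$scratch"
touch draft.txt
mkdir archive

# cp leaves the source in place and creates a second file
cp draft.txt draft_copy.txt

# mv renames a file when the target is a new file name...
mv draft_copy.txt final.txt

# ...and relocates it when the target is an existing directory
mv final.txt archive/

# rm deletes permanently; the shell has no recycle bin
rm draft.txt

ls -R
```

Because rm cannot be undone, it is worth checking a path with ls before deleting it.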
Exercise 5:
Within the `~/bms5021/week_1` directory, create a couple of files using the touch command. Explore the cp and mv commands by moving these files around and creating new copies. When you are done, clean up the files you have created with rm.
3.6 Wild cards
Sometimes when we are performing tasks in Bash we would like to match patterns in file names. For instance, I might have a large set of fastq files from a paired-end sequencing run. Read pairs are usually denoted by R1 and R2. I can use wild cards to list all the R1 fastq files in a directory with the following command:
ls *R1.fastq
There are several wildcards in Bash:
| Wildcard | Description | Example | Output |
|---|---|---|---|
| * | Matches any number of characters. | ls *.pdf | All files with the .pdf file extension |
| ? | Matches any single character. | ls sample_1.?am | All files with an extension that starts with any character and is followed by am. Examples might include sample_1.bam or sample_1.sam |
| [] | Matches any single character within a range or set | ls sample_[123].sam | Returns sample_1.sam, sample_2.sam, and sample_3.sam if they exist |
| {} | Expands a comma-separated list of strings or characters | ls sample_1.{sam,bam} | Returns sample_1.sam and sample_1.bam |
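To see what each pattern in the table expands to, the sketch below creates a few empty files with hypothetical names in a temporary directory and uses echo in place of ls. The shell expands the pattern before the command runs, so echo simply prints the resulting list of names.

```shell
# Temporary directory with a few empty, hypothetical files to match against
scratch=$(mktemp -d)
cd "$scratch"
touch sample_1.sam sample_1.bam sample_2.sam sample_3.sam report.pdf

# The shell expands each pattern first; echo then prints the matching names
pdf_files=$(echo *.pdf)
echo "$pdf_files"

# ? stands for exactly one character
echo sample_1.?am

# [123] stands for one character out of the listed set
numbered_sams=$(echo sample_[123].sam)
echo "$numbered_sams"

# {sam,bam} is brace expansion, a Bash feature: it generates both names
# whether or not the files exist
echo sample_1.{sam,bam}
```

Previewing a pattern with echo before passing it to a command such as rm is a cheap way to avoid matching more files than intended.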
Exercise 6:
Let’s try a few examples. In the ~/bms5021/week_1/data/sequences directory you will find a set of DNA sequences in FASTA format. Each file contains a different number of sequences. FASTA files contain a descriptor line, which is indicated by the > symbol, and a line containing the sequence of nucleotides. We can use the wc tool to evaluate the number of lines in a file. Use the wc tool and wild-card expressions to determine the number of sequences in each file. (Hint: if you are unsure about wc, check out the manual using man wc.)
Solution
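grep ">" *.fasta | wc -l
This prints the total number of sequences across all of the files. For a separate count per file, use grep -c ">" *.fasta instead.
3.7 Viewing file content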
There are several ways we can work with the contents of files in Bash. To view files we can use a suite of built-in tools, including cat, less, head and tail.
| Command | Description | Use case |
|---|---|---|
| cat | This command will concatenate files and return the contents to standard output (the terminal). | |
| less | This command will display the contents of a file in your terminal window. You can move through the file using the up and down arrows. You exit the program by typing q. | |
| head | This command will print out the first n lines of a file. By default this is 10. Different options can be set to modify this. | |
| tail | This command will print out the last n lines of a file. By default this is 10. Different options can be set to modify this. | |
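The commands in the table can be tried on a file with known content. The sketch below is only an illustration: it builds hypothetical files in a temporary directory, using seq to write the numbers 1 to 100, one per line.

```shell
# Temporary directory with small, hypothetical files of known content
scratch=$(mktemp -d)
cd "$scratch"
seq 1 100 > numbers.txt
printf 'first file\n' > a.txt
printf 'second file\n' > b.txt

# cat prints files one after another (it concatenates them)
cat a.txt b.txt

# head prints the first 10 lines by default; -n changes how many
head numbers.txt
first_three=$(head -n 3 numbers.txt)
echo "$first_three"

# tail does the same from the end of the file
last_line=$(tail -n 1 numbers.txt)
echo "$last_line"

# less is interactive, so run it yourself: less numbers.txt (press q to quit)
```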
Exercise 7:
In the ~/bms5021/week_1/data/vcf directory you will find a large variant call file (we will learn what the contents of this file mean in later weeks). Using the commands above, explore this file. What do you think is the most effective way to navigate this file?
Solution
My typical approach here is to use head and tail first to inspect the top and bottom of the file, and then to use less to navigate it more easily.
3.8 Searching file content
Sometimes we will want to search for and extract certain phrases from files. For this we can use the grep tool. grep searches for a pattern in a file and returns the matching lines to the terminal. The grep tool has several handy options, including regular expression matching, inverted search, and match counting. This is an important tool that you will use in several workflows, so check out the manual to see all the relevant options.
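The options mentioned above can be sketched on a toy example. The file `toy.fasta` and its three sequences are invented for this illustration and are not part of the workshop data.

```shell
# A tiny hypothetical FASTA file, created only for this illustration
scratch=$(mktemp -d)
cd "$scratch"
printf '>seq1\nATTGCA\n>seq2\nCCCAAA\n>seq3\nTTGAAA\n' > toy.fasta

# Plain search: print every line that contains TTG
grep "TTG" toy.fasta

# -v inverts the search: print every line that does NOT contain TTG
# (note that this includes the header lines)
grep -v "TTG" toy.fasta

# -c counts matching lines instead of printing them (here, the headers)
header_count=$(grep -c ">" toy.fasta)
echo "$header_count"

# -E enables extended regular expressions: lines ending in three As
grep -E "A{3}$" toy.fasta
```

Quoting the pattern matters: an unquoted > would be interpreted by the shell as output redirection rather than passed to grep.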
Exercise 8:
Earlier we counted the number of sequences in each fasta file in ~/bms5021/week_1/data/sequence. Now, using grep and the wildcards that you learned earlier:
- Search for all sequences in the fasta files containing the string TTG.
- Search for all sequences that don’t contain TTG.
- Search for all sequences that contain AAA in mock_sequence_1.fasta to mock_sequence_5.fasta.
3.9 Editing files - Sed
There are two main ways we can edit files:
- Interactively, using tools like nano and vim
- Non-interactively, using tools like sed
Let's talk about sed first. This tool is most useful when incorporated into scripts, as it can search for patterns and perform operations such as replacing a string, deleting it, or inserting characters before or after it. The power of the tool lies in its ability to perform these operations at speed and scale. sed uses the following general syntax:
sed [OPTIONS] 'command(s)' input_file(s)
The most common application of sed is the substitute command:
- Substitute - This command can be used to replace patterns with new strings. The syntax for sed substitute is sed 's/old_text/new_text/g' input_file
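A minimal sketch of the substitute command on a made-up file is shown below. The file names `pets.txt` and `dogs.txt` are hypothetical. By default sed prints the edited text to the terminal and leaves the input file unchanged, so the last step uses the > operator to save the result to a new file.

```shell
# Temporary directory with a small, hypothetical input file
scratch=$(mktemp -d)
cd "$scratch"
printf 'cat sat on the mat\nanother cat, another cat\n' > pets.txt

# Replace every occurrence of "cat" with "dog"; output goes to the terminal
sed 's/cat/dog/g' pets.txt

# Without the trailing g, only the first match on each line is replaced
sed 's/cat/dog/' pets.txt

# The input file is not modified; redirect with > to keep the result
sed 's/cat/dog/g' pets.txt > dogs.txt
cat dogs.txt
```

Never redirect the output to the same file that sed is reading, because the shell empties that file before sed opens it.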
Exercise 9:
Some genome reference builds add the chr prefix to chromosome names, while others omit it. The vcf file we have provided contains the chr prefix. Using sed and other commands, perform the following tasks:
- Identify the number of variants in this file (Hint: you might consider using the grep and wc tools).
- Using the sed tool, remove the chr prefix from both the header and the variant entries.
3.10 Editing files - Nano
We can also edit files interactively, much like you’d use Word or Notepad in a GUI-based interface. To get started, watch this short video on how to use nano:
Exercise 10:
Let’s construct our first bash script using nano. Remember earlier we created a file called script_1.sh in the src directory. Using nano:
- Open the file.
- Type #!/bin/bash. This is what we call a shebang. It provides the path to the bash interpreter and should be at the top of all of your scripts.
- On a new line, write a line of code that uses the echo tool to print out “This is my first script in bms5021”. If you are not familiar with this tool, use the man tool to review the manual.
- Save the script and execute it by typing bash script_1.sh into the terminal. If you get a path error, you may need to use cd to ensure you are in the right directory.
Exercise 11:
Let’s now circle back to sed. Using sed substitute:
- Replace bms5021 with Applied Bioinformatics.
- Use the > operator to save the output to a new file called script_2.sh.
- Execute the new script.