Chapter 3 Week 1 Part A - Introduction to the Shell

3.1 Learning Objectives

At the conclusion of today’s workshop students are expected to be able to:

Compare and contrast command line interfaces with graphical user interfaces
Articulate the principles of Bash programming, including the common syntax of commands
Execute common commands to solve routine problems using Bash

3.2 Workshop setup

Please execute the following commands immediately after logging into the terminal. You will learn what these commands do in this and future workshops.

mkdir bms5021
cp -r /apps/data/bms5021/week_1 bms5021/

3.3 Navigating on the command line

Navigating on the command line is an important skill, and mastering it will substantially reduce the time you spend troubleshooting basic path-related errors. Before we introduce the basic commands used for navigation, it is important to introduce the concept of relative and absolute paths. Both can be used to specify the location of a directory or file, but there is an important distinction between them.

Absolute path: The absolute path of a file or directory begins at the root directory and specifies the full location of a file in the file system. This is the exact path from the root / to the file, and does not rely on the current working directory. Absolute paths always begin with a forward slash /. Here is an example of an absolute path:

/home/lfennell/bms5021

Relative path: The relative path of a file begins at the current working directory and tracks the path from this directory to a given file. Relative paths do not begin with /. Imagine I am currently in the home directory. The relative path to the bms5021/week_1 subdirectory would be:

bms5021/week_1

When things go wrong: By far the most common error we see when navigating or pointing to a file occurs when the path to that file is set incorrectly. If this is the case you will see the following error: No such file or directory. If you see this error you need to double check the accuracy of the path you have provided.
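As a minimal sketch of the distinction, the commands below build a throwaway directory tree and reach the same directory by an absolute and by a relative path. The sandbox created with mktemp and the names project and data are hypothetical and are not part of the workshop data. The last command deliberately misspells a path to trigger the error described above.

```shell
# Create a throwaway sandbox (hypothetical names, not workshop data)
sandbox=$(mktemp -d)
mkdir -p "$sandbox/project/data"

# Absolute path: starts at / and works from any directory
cd "$sandbox/project/data"
via_absolute=$(pwd -P)

# Relative path: interpreted from the current working directory
cd "$sandbox"
cd project/data
via_relative=$(pwd -P)

echo "absolute route: $via_absolute"
echo "relative route: $via_relative"

# A misspelled path ("projct") produces the "No such file or directory" error
error_message=$(ls "$sandbox/projct" 2>&1)
echo "$error_message"
```

Both routes should print the same location, because a relative path is simply resolved against the directory you are standing in.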

There are three key commands that we use for navigation.

pwd - This command will tell you where you are currently working.

Exercise 1:

Use the pwd command to print out your current working directory to the terminal. What do you notice about the path that is returned? Is it absolute or relative? How can you tell?

cd - This command will change your current working directory (hence the name cd). The syntax for this command is cd desired_path. desired_path can be absolute or relative. You will use this command to move around within your virtual environment.

Let’s imagine we are in our home directory, which in this case is /users/student1.

Current working directory	Command	New working directory	Explanation
`/users/student1`	`cd .`	`/users/student1`	The `.` symbol stands for your present directory. `cd` will not alter the working directory in response to this.
`/users/student1`	`cd ..`	`/users`	The `..` symbol will move you one level up the tree to the parent directory.
`/users/student1`	`cd week_1`	`/users/student1/week_1`	This command will move you to the `week_1` directory, which resides in your working directory.
`/users/student1/week_1`	`cd ~`	`/users/student1`	The `~` symbol represents your home directory, so we are returned there.
`/users/student1`	`cd /`	`/`	The `/` symbol represents the root of the filesystem. This command will take you to the very top of the tree!
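The shortcuts in the table can be tried safely in a sandbox. This is only a sketch: the temporary directory made with mktemp stands in for /users, and the student1 and week_1 directories inside it are created purely for illustration.

```shell
# Throwaway sandbox standing in for /users (hypothetical, for illustration)
sandbox=$(mktemp -d)
mkdir -p "$sandbox/student1/week_1"
cd "$sandbox/student1"
start=$(pwd)

cd .            # "." is the current directory, so nothing changes
after_dot=$(pwd)

cd week_1       # relative path into a subdirectory
cd ..           # ".." moves back up to the parent directory
after_dotdot=$(pwd)

cd /            # the root of the filesystem
at_root=$(pwd)

cd ~            # "~" expands to your home directory
at_home=$(pwd)

echo "start:        $start"
echo "after cd .:   $after_dot"
echo "after cd ..:  $after_dotdot"
echo "after cd /:   $at_root"
echo "after cd ~:   $at_home"
```

Running pwd after each cd, as done here, is a cheap habit that prevents most path errors.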

Exercise 2:

Use the cd command to navigate to the week_1 directory. You can use either relative or absolute paths to achieve this. Note that if you use the absolute path, you should take note of your current working directory as noted in exercise 1.

ls - The last command in this series is the ls command. This command will list the contents of your current working directory. This is akin to viewing the contents of a directory using a graphical user interface.

Exercise 3:

Use the ls command to list the contents of the week_1 directory. There are several options associated with this command. You can view these (and options associated with any other command) using the man command. Try out a few, especially the -a option. What do you notice?

3.4 Creating files and directories

We can create files and directories using the touch and mkdir commands. touch will create an empty file and mkdir will create a new directory. Let’s give it a try:
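Before the exercise, here is a brief sketch of both commands in a throwaway sandbox. The directory name results and the file name notes.txt are hypothetical and chosen only for illustration.

```shell
# Work in a throwaway sandbox so nothing in your home directory changes
sandbox=$(mktemp -d)
cd "$sandbox"

mkdir results               # create a new, empty directory
touch results/notes.txt     # create a new, empty file inside it

ls results                  # prints: notes.txt
```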

Exercise 4:

Within the ~/bms5021/week_1 directory, create a directory called “src”. Within this directory, create an empty file called script_1.sh

3.5 Copying, moving and deleting files

There are occasions when we want to copy, move or even delete files from the shell. This can be achieved using the cp, mv and rm commands, respectively.

The basic syntax for cp and mv is as follows: cp source_file target_file
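The following sketch shows the three commands acting on hypothetical files in a throwaway sandbox. Note that rm deletes immediately and there is no recycle bin to recover from, so it is worth practising somewhere disposable first.

```shell
# Throwaway sandbox with one empty file (hypothetical names)
sandbox=$(mktemp -d)
cd "$sandbox"
touch original.txt

cp original.txt copy.txt     # original.txt remains; copy.txt is a new copy
mv copy.txt renamed.txt      # copy.txt no longer exists; it is now renamed.txt
mkdir archive
mv renamed.txt archive/      # when the target is a directory, the file moves into it
rm original.txt              # permanently delete original.txt

ls -R                        # list what is left, including subdirectories
```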

Exercise 5:

Within the `~/bms5021/week_1` directory, create a couple of files using the touch command. Explore the cp and mv commands by moving these files around and creating new copies. When you are done clean up the files you have created with rm.

3.6 Wild cards

Sometimes when we are performing tasks in Bash we would like to identify patterns in text strings. For instance, I might have a large set of fastq files from a paired-end sequencing run. Read pairs are usually denoted by R1 and R2. I can use wild cards to identify all the R1 fastq files in a directory by using the following command:

ls *R1.fastq

There are several wildcards in Bash:

Wildcard	Description	Example	Output
`*`	Matches any number of characters.	`ls *.pdf`	All files with the `.pdf` file extension
`?`	Matches any single character.	`ls sample_1.?am`	All files with an extension that starts with any character and is followed by `am`. Examples might include `sample_1.bam` or `sample_1.sam`
`[]`	Matches any single character within a range or set.	`ls sample_[123].sam`	Returns `sample_1.sam`, `sample_2.sam`, and `sample_3.sam` if they exist
`{}`	Expands a comma-separated list of strings or characters.	`ls sample_1.{sam,bam}`	Expands to `sample_1.sam` and `sample_1.bam`, which are listed if they exist

These can be integrated into any command where finding strings and patterns is useful. Several wildcards can be combined within a single string, making them a very powerful tool.
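To see what each pattern expands to without touching real data, the sketch below creates a few empty, hypothetically named files in a throwaway sandbox and uses echo to print the result of each expansion. It assumes the commands are run in Bash, since brace expansion is not available in every shell.

```shell
# Throwaway sandbox with a few empty files (hypothetical names)
sandbox=$(mktemp -d)
cd "$sandbox"
touch sample_1.sam sample_1.bam sample_2.sam sample_3.sam notes.txt

star=$(echo *.sam)                  # any number of characters before .sam
question=$(echo sample_1.?am)       # exactly one character before "am"
set_match=$(echo sample_[12].sam)   # a single character from the set 1 or 2

echo "$star"        # sample_1.sam sample_2.sam sample_3.sam
echo "$question"    # sample_1.bam sample_1.sam
echo "$set_match"   # sample_1.sam sample_2.sam

# Braces generate names from the list whether or not the files exist
echo sample_9.{sam,bam}             # sample_9.sam sample_9.bam
```

Printing a pattern with echo before handing it to a command such as rm is a simple way to check what it will match.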

Exercise 6:

Let’s try a few examples. In the ~/bms5021/week_1/data/sequences directory you will find a set of DNA sequences in FASTA format. Each file contains a different number of sequences. FASTA files contain a descriptor line, which is indicated by the > symbol, and a line containing the sequence of nucleotides. We can use the wc tool to evaluate the number of lines in a file. Use the wc tool and wild-card expressions to determine the number of sequences in each file. (Hint: if you are unsure about wc, check out the manual using man wc.)

Solution

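grep -c ">" *.fasta

The -c option makes grep report the number of matching lines for each file separately. Piping the matches into wc instead, as in grep ">" *.fasta | wc -l, gives the total number of sequences across all files rather than a count per file.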

3.7 Viewing file content

There are several ways we can work with the contents of files in Bash. To view files we can leverage a suite of built-in tools, including cat, less, head and tail.

Command	Description	Use case
`cat`	This command will concatenate files and return the contents to standard output (the terminal).	To view the entire contents of a file; to combine several files
`less`	This command will display the contents of a file in your terminal window. You can move through the file using the up and down arrows. You exit the program by typing `q`.	To view and navigate the contents of a file
`head`	This command will print out the first n lines of a file. By default this is 10. Different options can be set to modify this.	View the header of a file
`tail`	This command will print out the last n lines of a file. By default this is 10. Different options can be set to modify this.	View the end of a file
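The sketch below tries head, tail and cat on a hypothetical 100-line file generated with seq, so the output is easy to predict. less is interactive and is therefore only mentioned in a comment.

```shell
# Throwaway sandbox with a 100-line file containing the numbers 1 to 100
sandbox=$(mktemp -d)
cd "$sandbox"
seq 1 100 > numbers.txt

default_lines=$(head numbers.txt | wc -l)   # head prints 10 lines by default
first_three=$(head -n 3 numbers.txt)        # -n sets the number of lines
last_two=$(tail -n 2 numbers.txt)

echo "$first_three"    # 1, 2 and 3, one per line
echo "$last_two"       # 99 and 100, one per line

# cat joins files end to end: 100 lines + 5 lines = 105 lines
seq 101 105 > more_numbers.txt
combined_lines=$(cat numbers.txt more_numbers.txt | wc -l)
echo "$combined_lines"

# To page through the file interactively, run: less numbers.txt (press q to quit)
```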

Exercise 7:

In the ~/bms5021/week_1/data/vcf directory you will find a large variant call file (we will learn what the contents of this file mean in later weeks). Using the commands above, explore this file. What do you think is the most effective way to navigate this file?

Solution

My typical approach here is to use head and tail first to inspect the top and bottom of the file, and then to use less to navigate it more easily.

3.8 Searching file content

Sometimes we will want to search for and extract certain phrases from files. For this we can use the grep tool. grep will search for a pattern in a file and print the lines that match to the terminal. The grep tool has several handy options, including regular expression matching, inverted search, and match counting. This is an important tool that you will use in several workflows. Check out the manual to see all the relevant options.
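As a small sketch of these options, the commands below search a hypothetical toy file that lists a gene name and its chromosome on each line. The file name genes.txt and the sandbox are made up for illustration.

```shell
# Throwaway sandbox with a small toy file (hypothetical, for illustration)
sandbox=$(mktemp -d)
cd "$sandbox"
printf '%s\n' 'BRCA1 chr17' 'TP53 chr17' 'KRAS chr12' 'EGFR chr7' > genes.txt

matches=$(grep "chr17" genes.txt)             # lines that contain chr17
match_count=$(grep -c "chr17" genes.txt)      # -c counts the matching lines
inverted=$(grep -v "chr17" genes.txt)         # -v keeps lines that do NOT match
regex=$(grep -E 'chr(7|12)$' genes.txt)       # -E enables extended regular expressions

echo "$matches"       # BRCA1 chr17 and TP53 chr17
echo "$match_count"   # 2
echo "$inverted"      # KRAS chr12 and EGFR chr7
echo "$regex"         # KRAS chr12 and EGFR chr7
```

The $ in the last pattern anchors the match to the end of the line, which is why chr17 is not picked up by the chr7 alternative.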

Exercise 8:

Earlier we counted the number of sequences in each fasta file in ~/bms5021/week_1/data/sequence. Now, using grep and the wildcards that you learned earlier:

Search for all sequences in the fasta files containing the string TTG.

Search for all sequences that don’t contain TTG

Search for all sequences that contain AAA in mock_sequence_1.fasta to mock_sequence_5.fasta

3.9 Editing files - Sed

There are two main ways we can edit files:

Interactively using tools like nano and vim
Non-interactively using tools like sed

Let’s talk about sed first. This tool is most useful when incorporated into scripts, as it can search for patterns and perform operations such as replacing the string, deleting the string and inserting characters before and after the string. The power of the tool lies in its ability to perform these operations at speed and scale. sed commands follow this general syntax:

sed [OPTIONS] 'command(s)' input_file(s)

The most common application of sed is the substitute command.

Substitute - This command can be used to replace patterns with new strings. The syntax for sed substitute is sed 's/old_text/new_text/g' input_file. The trailing g tells sed to replace every match on a line rather than only the first.
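The sketch below runs the substitute command on a hypothetical two-line file, with and without the g flag, and then saves the result to a new file. The file names and the words being swapped are made up for illustration.

```shell
# Throwaway sandbox with a small toy file (hypothetical, for illustration)
sandbox=$(mktemp -d)
cd "$sandbox"
printf '%s\n' 'sample_A sample_B' 'control_A' > names.txt

first_only=$(sed 's/sample/patient/' names.txt)     # no g: first match on each line
every_match=$(sed 's/sample/patient/g' names.txt)   # g: every match on each line

echo "$first_only"     # patient_A sample_B, then control_A
echo "$every_match"    # patient_A patient_B, then control_A

# sed writes to standard output and leaves names.txt untouched;
# redirect the output with > to keep the edited text in a new file
sed 's/sample/patient/g' names.txt > renamed.txt
cat renamed.txt
```

Lines without a match, such as control_A here, pass through unchanged.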

Exercise 9:

Some genome reference builds add the chr prefix to chromosome names, while others omit it. The VCF file we have provided contains the chr prefix. Using sed and other commands, perform the following tasks:

Identify the number of variants in this file (Hint: you might consider using the grep and wc tools).

Using the sed tool, remove the chr prefix from both the header and the variant entries.

3.10 Editing files - Nano

We can also edit files interactively, much like you’d use Word or Notepad in a graphical user interface. To get started, watch this short video on how to use nano:

https://youtu.be/cLyUZAabf40

Exercise 10:

Let’s construct our first bash script using nano. Remember earlier we created a file called script_1.sh in the src directory. Using nano:

Open the file

Type #!/bin/bash. This is what we call a shebang. It provides the path to the Bash interpreter and should be at the top of all of your scripts.

On a new line, write a line of code that uses the echo tool to print out “This is my first script in bms5021”. If you are not familiar with this tool, use the man tool to review the manual.

Save the script and execute it by typing bash script_1.sh into the terminal. If you get a path error, you may need to use cd to ensure you are in the right directory.

Exercise 11:

Let’s now circle back to sed. Using sed substitute:

Replace bms5021 with Applied Bioinformatics

Use the > operator to save the output to a new file called script_2.sh

Execute the new script.