SnakemakeThe why and the how
AG Walther
2024-07-05
Why reproducibility?
You see a computational tool you want to use.
You spend a full day installing it.
How can we all solve this?
by starting small
There are programming languages to solve this exact problem.
Why would you use it
You work in bioinformatics
You know python
You want to start quickly.
Your work is based on files.
Good documentation.
Why wouldn’t you use it
You do not need to make a workflow
If your dependencies can’t be installed using conda.
If your programme needs to be used by non experts.
Get Snakemake with the following commands1
# Global instalation, assuming a bashlike environment...
conda install -n base -c conda-forge mamba
mamba create -c conda-forge--c bioconda -n snakemakes snakemake
mamba activate snakemakeLets get started with an example problem…
Two strains in an evolution experiment.
We have a culture of Ancestor bacteria, we apply an evolution experiment to two cultures.
$ cd data
$ wget https://osf.io/2jc4a/download -O download.tar.gz
$ tar -xzf download.tar.gz
$ ls data
anc_R1.fastq.gz anc_R2.fastq.gz evol1_R1.fastq.gz
evol1_R2.fastq.gz evol2_R1.fastq.gz evol2_R2.fastq.gzNow quality control:
We need fastqc and multiqc.
How?
From prefix.dev.
Then add the following contents \(\dots\)
name: quality-control
channels:
- conda-forge
- bioconda
dependencies:
- fastqc # quality scoring
- multiqc # for reporting
- fastp # Trimming readsFrom now we need to run with the --use-conda parameter.
Now we type the rule\(\dots\)
$ open workflow/Snakefile
Let’s try \(\dots\)
rule fastqc:
conda: "envs/qc.yml"
input: "data/data/{seqname}.fastq.gz"
output: directory("results/quality-control")
shell: "mkdir -p {output}; fastqc -i {input} -o {output}"Test it \(\dots\)
rule fastqc:
conda: "envs/qc.yml"
input: expand("data/data/{seqname}.fastq.gz",
seqname=glob_wildcards("data/data/{seqname}.fastq.gz").seqname)
output: directory("results/quality-control/fastqc")
shell: "mkdir {output}; fastqc {input} -o {output}"Wildcards are determined from the output!
rule multiqc:
conda: "envs/qc.yml"
input: rules.fastqc.output
output: directory("results/quality-control/multiqc")
shell: "multiqc {input} -o {output}"Multiqc output
see the documentation for more information
cli options to Snakemake.killall -TERM snakemakesnakemake --dry-run --printshellcmdssnakemake -c 1 --forceall --dag | dot -Tpng > dag.pngPypi dependenciesSo now we have snakemake, conda and possibly other dependencies not yet handled \(\dots\).
pixi package manager$ pixi init project
✔ Initialized project in /home/user/project
➜ cd project
$ pixi add python numpy cmake
✔ Added python 3.11.4.*
✔ Added numpy 1.25.2.*
✔ Added cmake 3.26.4.*
$ pixi run python --version
Python 3.11.4
$ pixi add python==3.10
✔ Added python ==3.10.0.*
$ pixi run python --version
Python 3.10.toml file[project]
name = "project"
version = "0.1.0"
description = "Example project"
authors = ["Pixi <hi@prefix.dev>"]
channels = ["conda-forge"]
platforms = ["linux-64", "win-64", "osx-arm64", "osx-64"]
[tasks]
start = "python main.py"
[dependencies]
python = "3.11.4.*"
numpy = "1.25.2.*"
cmake = "3.26.4.*"$ tree
├── LICENSE
├── pixi.toml
├── pixi.lock
├── README.md
|── results/
|── data/
|── workflow/
| ├── envs/
| |── scripts/
| └── Snakefile
|──── .snakemake/
└──── .pixi/
Well enough to explain code and design simple rules.
you escape them as such:
This presentation:
mpi-snakemake-overview.
The tutorial for readmapping:
snakemake tutorial.
Example workflow for the raven cluster:
snakemake cluster.