tags |
---|
ggg, ggg2023, ggg201b |
[toc]
Quick overview: what are we trying to do?
- find out where reads map to a reference genome
- "pile up" the reads to see where there are differences
- calculate significant differences
Today we will run through a ~complete workflow to do this for one E. coli sample, and then edit our snakemake workflow so we can more easily do additional samples.
If you are looking for reference information on snakemake, here is some draft book text on learning snakemake that you might find interesting. I am drafting a second section here.
Start your binder:
Now, go to Lab 1 - Running some things on binder: link
Edit Snakefile
and add:
and add output: "SRR2584857_1.fastq.gz"
to the download_data
rule.
Be sure to match the indentation within the rule.
It should look like this:
rule download_data:
output: "SRR2584857_1.fastq.gz"
shell: """
curl -JLO https://osf.io/4rdza/download
"""
Now try running the rule:
snakemake -j 1 -p SRR2584857_1.fastq.gz
Hey look, it doesn't do anything! Why??
Try removing the file: rm SRR2584857_1.fastq.gz
. Now run the rule again, and again.
What's happening here is that snakemake is "smart" enough to know that if the output file exists, it doesn't need to be recreated. So once you tell it what the output of a rule is, it gets smarter about running it.
To the map_reads
rule, add:
input: "ecoli-rel606.fa.gz", "SRR2584857_1.fastq.gz"
What does this do? (And does it work?)
Try something: remove the reads file, SRR2584857_1.fastq.gz
like so:
rm SRR2584857_1.fastq.gz
and now try running map_reads:
snakemake -j 1 -p map_reads
Snakemake will automatically figure out what rules to run based on matching input and output blocks, so you no longer need to first run download_data and then run map_reads!
To the download_genome
rule, add:
output: "ecoli-rel606.fa.gz"
and to the map_reads
rule, adjust it to look like this:
input:
ref="ecoli-rel606.fa.gz",
reads="SRR2584857_1.fastq.gz"
What does this do?
This tells snakemake that map_reads
depends on having the input files ecoli-rel606.fa.gz
and SRR2584857_1.fastq.gz
in this directory, and that download_genome
can produce the ecoli file!!
What's extra cool is that snakemake will now automatically run the rule that outputs this file (which is download_genome
) before running map_reads
!
Try it:
rm *.gz *.sam
snakemake -p -j 1 map_reads
and you'll see that it runs two things: first the download_genome
rule, then the map_reads
rule.
What is the output of the rule map_reads
, and how would you find that out in general?
Add "output:" with the correct filename to the rule map_reads
.
Hint: use the RStudio file browser window to look at what changes in the directory when you run the command. You can also use the RStudio terminal window to run ls -l --sort=time
after running the rule.
A few quick notes:
- input and output (and other things) can be in any order, as long as they are before shell.
- you can either put things all on one line, or form a block by indenting.
- you can make lists for multiple input or output files by separating filenames with a comma
- rule names can be any valid variable, which basically means letters and underscores; you can use numbers after a first character.
If you look at the rules, you'll see that input and output filenames are
specified multiple times. That's rude! (And potentially confusing, if you
change the input:
block and forget to change the shell:
block, or
something. Not that I've ever done that.)
Luckily, snakemake lets you use template variables in the shell:
block
so that you can just say {output}
and it will figure out what to put there.
For download_data
, change the shell command to be:
wget https://osf.io/4rdza/download -O {output}
(This will also help us out later in other ways, as you'll see!)
You can do the same with {input}
.
How would you fix the rules download_genome
and map_reads
to have
shell commands that make use of {input}
and {output}
?
Note: you can rerun everything by doing rm *.sam *.gz
followed by
snakemake -j 1 map_reads
How would you fix the rules sam_to_bam
, sort_bam
and call_variants
to have the appropriate input: and output:, and {input}
and {output}
?
Hint: look at the shell command for those rules.
If you've fixed the rules properly, you should be able to run everything up to the rule call_variants
by just specifying call_variants
- try it!
rm *.gz *.sam
snakemake -p -j 1 call_variants
You can tell snakemake to delete everything it knows how to make for a particular rule (including all preceding rules) by running
snakemake -j 1 --delete-all-output call_variants
(This can also be a good way to check to make sure you have all the output: information right, because you'll have files left over if you forgot to put them in output: :)
You don't actually need to use the rule names (this will be important later on!). You can just tell snakemake what file you want produced, and run that.
So:
snakemake -p -j 1 SRR2584857_1.ecoli-rel606.vcf
will also work to run the rule call_variants
, but you don't have to remember the rule name.
(Later on we'll see why this is important for other things.)
By default, if you don't specify a rule snakemake runs the first rule in the file. (This is actually the only case where the order of rules in the Snakefile matters!)
So it's conventional to define a rule named all
at the top; try it!
rule all:
input:
"SRR2584857_1.ecoli-rel606.vcf"
Question: why specify the input for this rule, and no output or shell command!?
(There's no good reason. it's just a cute trick that matches the way snakemake thinks. 🤷)
- Snakefile contains a snakemake workflow definition
- the rules specify steps in the workflow
- At the moment (and in general), they run shell commands
- You can "decorate" the rules to tell snakemake how they depend on each other.
- This decoration comes in the form of "input:" and "output:" lists of one vor more files, quoted, separated by commas.
- You can then use
{input}
and{output}
in your shell commands, too! - input and output are how you connect rules: by saying which rules take which files as inputs and/or produce what outputs.
- Snakemake cares about tabs :)
...and why is this all useful, anyway? We'll explore that in a bit more detail next Friday!
Once you're done creating that sorted bam file, you can also run
samtools index SRR2584857_1.ecoli-rel606.bam.sorted
samtools tview -p ecoli:4314717 --reference ecoli-rel606.fa SRR2584857_1.ecoli-rel606.bam.sorted
to actually look at the aligned reads.
- Snakefile contains a snakemake workflow definition
- the rules specify steps in the workflow
- At the moment (and in general), they run shell commands - this can include R and Python scripts, for example.
- You can "decorate" the rules to tell snakemake how they depend on each other.
- There are other reasons to do this that we'll get to next week!
Revisit: what workflow are we running?