Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: python #158

Open
wants to merge 25 commits into
base: main
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
25 commits
Select commit Hold shift + click to select a range
e1e892d
fix: removed superflous conditional
cmeesters Sep 26, 2024
8b2a28a
fix: merge conflict
cmeesters Sep 27, 2024
70786dc
fix: bcftools seemed missing in environment.yaml
cmeesters Oct 6, 2024
f8ceb9d
feat: added seaborn to environment.yaml
cmeesters Oct 6, 2024
8f30ac8
feat: new python rule demonstrating the run directive
cmeesters Oct 7, 2024
91e0c97
feat: added Snakefile to illustrate the python run directive and a so…
cmeesters Oct 7, 2024
4a5a453
feat: introducing the parson problem
cmeesters Oct 7, 2024
3787145
feat: extended reach of the Python section's scope
cmeesters Oct 7, 2024
9ffbd7b
fix: typo
cmeesters Oct 7, 2024
90f0a55
fix: added pip install of pysam to environment yaml file
cmeesters Oct 7, 2024
99ba91e
fix: 06_Snakefile_run task setting now syntactically correct
cmeesters Oct 7, 2024
e9d905e
feat: new slide introducing the Parson problem from the last commits …
cmeesters Oct 7, 2024
697a4f1
feat: added results positions file to illustrate the solution
cmeesters Oct 7, 2024
4f35fa5
fix: removed outdated solution file 06
cmeesters Oct 7, 2024
612a7ec
fix: added filnalized version of for the run directive
cmeesters Oct 7, 2024
013cb83
feat: no cloze for 07 script, but a solution
cmeesters Oct 7, 2024
c6f061f
feat: plot quals to rule creation, only discussing script in the future
cmeesters Oct 7, 2024
4153c67
fix: bcftools with minimum version
cmeesters Oct 7, 2024
0b36482
fix: removed doubled bcftools entry
cmeesters Oct 7, 2024
eb3b42b
feat: added exercise to move run to script content
cmeesters Oct 7, 2024
b7074ae
fix: typo
cmeesters Oct 7, 2024
be6da61
fix: grammar
cmeesters Oct 7, 2024
4fe7d7f
fix: layout glitches
cmeesters Oct 10, 2024
e7ede28
fix: layout (tabs to spaces)
cmeesters Oct 10, 2024
0328110
fix: a little clearer message
cmeesters Oct 10, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Binary file added images/results/positions.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
6 changes: 4 additions & 2 deletions setup_creators/environment.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -8,11 +8,13 @@ dependencies:
- jinja2
- matplotlib
- graphviz
- bcftools =1.19
- bcftools >=1.19
- samtools =1.19.2
- bwa =0.7.17
# - pysam =0.22
- pip:
- pysam
# at the time of writing - 7. Feb 24 - pysam will require
# a lower python version than snakemake, install pysam
# using pip
- pygments
- seaborn
83 changes: 83 additions & 0 deletions setup_creators/solutions/06_Snakefile_run
Original file line number Diff line number Diff line change
@@ -0,0 +1,83 @@
# our samples are pre-configured
SAMPLES = ["A", "B"]

rule all:
input:
"calls/all.vcf",
"calls/positions.png"

rule bwa_map:
input:
"data/genome.fa",
"data/samples/{sample}.fastq"
output:
"mapped_reads/{sample}.bam"
shell:
"bwa mem {input} | samtools view -Sb - > {output}"


rule samtools_sort:
input:
"mapped_reads/{sample}.bam"
output:
"sorted_reads/{sample}.bam"
shell:
"samtools sort -T sorted_reads/{wildcards.sample} "
"-O bam {input} > {output}"


rule samtools_index:
input:
"sorted_reads/{sample}.bam"
output:
"sorted_reads/{sample}.bam.bai"
shell:
"samtools index {input}"


rule bcftools_call:
input:
fa="data/genome.fa",
bam=expand("sorted_reads/{sample}.bam", sample=SAMPLES),
bai=expand("sorted_reads/{sample}.bam.bai", sample=SAMPLES)
output:
"calls/all.vcf"
shell:
"bcftools mpileup -f {input.fa} {input.bam} | "
"bcftools call -mv - > {output}"

rule plot_positions:
input:
rules.bcftools_call.output
output:
"calls/positions.png"
run:
import matplotlib
matplotlib.use("Agg") # to suppress interactive plotting
import matplotlib.pyplot as plt
import numpy as np
from pysam import VariantFile
import seaborn as sns
#TODO: this parameter needs to be configurable
# see one of the next exercises
window_size = 500
cmeesters marked this conversation as resolved.
Show resolved Hide resolved

pos = [record.pos for record in VariantFile(input[0])]
# setup windows
bins = np.arange(0, max(pos), window_size)

# use window midpoints as x coordinate
x = (bins[1:] + bins[:-1])/2

# compute variant density in each window
h, _ = np.histogram(pos, bins=bins)
y = h / window_size

# plot
fig, ax = plt.subplots(figsize=(12, 3))
sns.despine(ax=ax, offset=10)
ax.plot(x, y)
ax.set_xlabel('Chromosome position (bp)')
ax.set_ylabel('Variant density (bp$^{-1}$)')
plt.savefig(output[0])

cmeesters marked this conversation as resolved.
Show resolved Hide resolved
93 changes: 93 additions & 0 deletions setup_creators/solutions/07_Snakefile_script
Original file line number Diff line number Diff line change
@@ -0,0 +1,93 @@
# our samples are pre-configured
SAMPLES = ["A", "B"]


rule all:
input:
"calls/all.vcf",
"calls/positions.png",
"calls/quals.svg"

rule bwa_map:
input:
"data/genome.fa",
"data/samples/{sample}.fastq"
output:
"mapped_reads/{sample}.bam"
shell:
"bwa mem {input} | samtools view -Sb - > {output}"
Comment on lines +17 to +18
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Incorrect usage of {input} in shell command

In the rule bwa_map, the shell command uses {input} directly, which may not correctly reference the input files as expected. Since {input} is a list of input files, you should reference the individual elements explicitly.

Apply this diff to fix the shell command:

-            "bwa mem {input} | samtools view -Sb - > {output}"
+            "bwa mem {input[0]} {input[1]} | samtools view -Sb - > {output}"

This ensures that bwa mem receives the reference genome and the sample FASTQ file as separate arguments.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
shell:
"bwa mem {input} | samtools view -Sb - > {output}"
shell:
"bwa mem {input[0]} {input[1]} | samtools view -Sb - > {output}"



rule samtools_sort:
input:
"mapped_reads/{sample}.bam"
output:
"sorted_reads/{sample}.bam"
shell:
"samtools sort -T sorted_reads/{wildcards.sample} "
"-O bam {input} > {output}"


rule samtools_index:
input:
"sorted_reads/{sample}.bam"
output:
"sorted_reads/{sample}.bam.bai"
shell:
"samtools index {input}"


rule bcftools_call:
input:
fa="data/genome.fa",
bam=expand("sorted_reads/{sample}.bam", sample=SAMPLES),
bai=expand("sorted_reads/{sample}.bam.bai", sample=SAMPLES)
output:
"calls/all.vcf"
shell:
"bcftools mpileup -f {input.fa} {input.bam} | "
"bcftools call -mv - > {output}"

rule plot_positions:
input:
rules.bcftools_call.output
output:
"calls/positions.png"
run:
import matplotlib
matplotlib.use("Agg") # to suppress interactive plotting
import matplotlib.pyplot as plt
import numpy as np
from pysam import VariantFile
import seaborn as sns
#TODO: this parameter needs to be configurable
# see one of the next exercises
window_size = 500
cmeesters marked this conversation as resolved.
Show resolved Hide resolved

pos = [record.pos for record in VariantFile(input[0])]
# setup windows
bins = np.arange(0, max(pos), window_size)
cmeesters marked this conversation as resolved.
Show resolved Hide resolved

# use window midpoints as x coordinate
x = (bins[1:] + bins[:-1])/2

# compute variant density in each window
h, _ = np.histogram(pos, bins=bins)
y = h / window_size

# plot
fig, ax = plt.subplots(figsize=(12, 3))
sns.despine(ax=ax, offset=10)
ax.plot(x, y)
ax.set_xlabel('Chromosome position (bp)')
ax.set_ylabel('Variant density (bp$^{-1}$)')
plt.savefig(output[0])

rule plot_quals:
input:
"calls/all.vcf"
output:
"calls/quals.svg"
script:
"scripts/plot-quals.py"

Original file line number Diff line number Diff line change
@@ -1,10 +1,12 @@
# our samples are pre-configured
configfile: "config/config.yaml"
SAMPLES = ["A", "B"]
cmeesters marked this conversation as resolved.
Show resolved Hide resolved


rule all:
input:
"plots/quals.svg",
"calls/all.vcf"
"calls/all.vcf",
"calls/positions.png",
"calls/quals.svg"

rule bwa_map:
input:
Expand Down Expand Up @@ -38,20 +40,27 @@ rule samtools_index:
rule bcftools_call:
input:
fa="data/genome.fa",
bam=expand("sorted_reads/{sample}.bam", sample=config["samples"]),
bai=expand("sorted_reads/{sample}.bam.bai", sample=config["samples"])
bam=expand("sorted_reads/{sample}.bam", sample=SAMPLES),
bai=expand("sorted_reads/{sample}.bam.bai", sample=SAMPLES)
output:
"calls/all.vcf"
shell:
"bcftools mpileup -f {input.fa} {input.bam} | "
"bcftools call -mv - > {output}"

rule plot_positions:
input:
rules.bcftools_call.output
output:
"calls/positions.png"
script:
"scripts/plot-positions.py"

rule plot_quals:
input:
"calls/all.vcf"
"calls/all.vcf"
output:
"plots/quals.svg"
"calls/quals.svg"
script:
"scripts/plot-quals.py"

"scripts/plot-quals.py"

92 changes: 92 additions & 0 deletions setup_creators/tutorial/06_Snakefile_run
Original file line number Diff line number Diff line change
@@ -0,0 +1,92 @@
# our samples are pre-configured
SAMPLES = ["A", "B"]

#TASK: This is an almost workfing Snakefile!
# But it is scrambled. Bring it
# into order and argue for each
# component, _why_ it should be
# placed where you think it is
# to be placed.
# Next, run Snakemake with the
# '--debug' flag to find out,
# where the error is.

rule all:
input:
"calls/all.vcf",

rule bwa_map:
output:
"mapped_reads/{sample}.bam"
input:
"data/genome.fa",
"data/samples/{sample}.fastq"
shell:
"bwa mem {input} | samtools view -Sb - > {output}"
Comment on lines +21 to +25
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Use named inputs in bwa_map for clarity and correctness

By specifying named inputs, you can ensure the correct file order and improve the readability of your code. This is particularly important because bwa mem expects the reference genome and the read files in a specific order.

Apply this diff to refactor the input and shell command:

 rule bwa_map:
     output:
         "mapped_reads/{sample}.bam"
     input:
-        "data/genome.fa",
-        "data/samples/{sample}.fastq"
+        ref="data/genome.fa",
+        reads="data/samples/{sample}.fastq"
     shell:
-        "bwa mem {input} | samtools view -Sb - > {output}"
+        "bwa mem {input.ref} {input.reads} | samtools view -Sb - > {output}"
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
input:
"data/genome.fa",
"data/samples/{sample}.fastq"
shell:
"bwa mem {input} | samtools view -Sb - > {output}"
input:
ref="data/genome.fa",
reads="data/samples/{sample}.fastq"
shell:
"bwa mem {input.ref} {input.reads} | samtools view -Sb - > {output}"



rule samtools_sort:
input:
"mapped_reads/{sample}.bam"
output:
"sorted_reads/{sample}.bam"
shell:
"samtools sort -T sorted_reads/{wildcards.sample} "
"-O bam {input} > {output}"

rule bcftools_call:
input:
fa="data/genome.fa",
bam=expand("sorted_reads/{sample}.bam", sample=SAMPLES),
bai=expand("sorted_reads/{sample}.bam.bai", sample=SAMPLES)
output:
"calls/all.vcf"
shell:
"bcftools mpileup -f {input.fa} {input.bam} | "
"bcftools call -mv - > {output}"

rule plot_positions:
input:
rules.bcftools_call.output
output:
"calls/positions.png"
run:
import matplotlib
matplotlib.use("Agg") # to suppress interactive plotting
import matplotlib.pyplot as plt
import numpy as np
from pysam import VariantFile
import seaborn as sns
#TODO: This parameter needs to be configurable
# See one of the next exercises.
# For the time being set it to a value in
# the hundreds.
window_size = ...
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Assign a numeric value to window_size to prevent errors

The variable window_size is currently set to ..., which is an Ellipsis object in Python. This will cause a TypeError when used in numerical operations like np.arange. Assign a numeric value to window_size as suggested in the comments.

Apply this diff to fix the error:

-window_size = ...
+window_size = 100  # Set to a value in the hundreds as per the comment
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
window_size = ...
window_size = 100 # Set to a value in the hundreds as per the comment


pos = [record.pos for record in VariantFile(input[0])]
# setup windows
bins = np.arange(0, max(pos), window_size)

# use window midpoints as x coordinate
x = (bins[1:] + bins[:-1])/2

# compute variant density in each window
h, _ = np.histogram(pos, bins=bins)
y = h / window_size

# plot
fig, ax = plt.subplots(figsize=(12, 3))
sns.despine(ax=ax, offset=10)
ax.plot(x, y)
ax.set_xlabel('Chromosome position (bp)')
ax.set_ylabel('Variant density (bp$^{-1}$)')
plt.savefig(output[0])

rule samtools_index:
input:
"sorted_reads/{sample}.bam"
output:
"sorted_reads/{sample}.bam.bai"
shell:
"samtools index {input}"

Loading
Loading