Empty .fasta file at beginning of Bismark alignment #305
Comments
Weird! When you look in the work directory for the BismarkIndex task, does the Fasta file look normal there? I don't think that this is a bug in this pipeline particularly, it sounds to me more likely to be a core-Nextflow problem.
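A rough sketch of one way to do that check, assuming a standard Nextflow `work/` directory. The helper name `report_fasta_sizes` and the list of file extensions are assumptions for illustration, not part of the pipeline. It relies on `[ -s ]` following symlinks, so inputs staged as symlinks are judged by their targets.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list FASTA files (regular files or symlinks) under a
# directory and report whether each one has content.
report_fasta_sizes() {
    local dir="$1"
    find "$dir" \( -name '*.fa' -o -name '*.fasta' -o -name '*.fna' \) \
        \( -type f -o -type l \) -print |
    while IFS= read -r f; do
        # [ -s ] is true only if the file exists and is larger than zero bytes;
        # it follows symlinks, so a dangling link is also reported as EMPTY.
        if [ -s "$f" ]; then
            echo "OK    $f"
        else
            echo "EMPTY $f"
        fi
    done
}
```

Calling `report_fasta_sizes work` from the launch directory would then print one line per FASTA file found in the task directories.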
|
Same bug for me this morning when running locally with Docker.
Reproducible example:
OS: Ubuntu LTS 20.03
Dataset: SRR10532131
nextflow run nf-core/methylseq \
-profile docker \
--outdir out \
--max_cpus 12 \
--input ./samplesheet.csv \
--max_memory 30GB \
--fasta ./GRCh38_latest_genomic.fa \
-bg \
-with-report
EDIT : Is it possible that the Mine looked like this :
|
Thanks for posting a reproducible example - sorry Phil I missed your initial response. I've certainly seen the bug with the standard 3 column samplesheet ( |
@PatrickMaclean Could it be linked with nf-core/methylseq#637 somehow? @ewels How do you change the stageInMode? The docs only list the different modes, not how to change them. To be honest, I'm a bit confused/overwhelmed by the 6 ways I've found online to specify a custom genome:
Not sure I understand the point of all these options. It seems like there are priorities of some kind between them, but the hierarchy is not clear to me. |
Hi @GDelevoye, Ok, here goes:
These three are the same thing, just different ways of setting the
These are two different ways of getting pre-built Bismark reference genome indices into Nextflow. AWS-iGenomes is simply a bucket on s3 with a load of common refs + indices hosted, and nf-core pipelines ship with a config that points to these locations. So if you do RefGenie is a similar thing which we are hoping to migrate to, with assets on s3. It's also a standalone CLI tool for managing local reference genomes. We have a plugin in the
This is different to everything else. Earlier versions of the pipeline included the option to specify a different reference genome for every sample. Then this was dropped, but we want to add it back again (I don't think that's been done yet.... right? Ties into the probable bug described here). It'll end up being functionally the same as the stuff above, but defined on a per-sample basis if you want it to be. This is all usage questions really, most of this issue relates to what is likely a specific bug in particular usage of the above. |
@PatrickMaclean could you share how you have your sample-sheet formatted? |
Just the minimum columns - sample, fastq_1, fastq_2 - haven't ever specified a genome on the sample sheet. |
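For reference, a minimal sketch of such a three-column sheet. The sample and file names below are placeholders, not the ones used in this thread, and the output file name `samplesheet_minimal.csv` is arbitrary.

```shell
#!/usr/bin/env bash
# Sketch: write a minimal samplesheet with only the sample, fastq_1 and
# fastq_2 columns, then confirm the header line.
# All names in the data row are hypothetical placeholders.
cat > samplesheet_minimal.csv <<'EOF'
sample,fastq_1,fastq_2
sample1,sample1_R1.fastq.gz,sample1_R2.fastq.gz
EOF

header=$(head -n 1 samplesheet_minimal.csv)
if [ "$header" = "sample,fastq_1,fastq_2" ]; then
    echo "header OK"
else
    echo "unexpected header: $header" >&2
fi
```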
Pretty sure that the igenomes thing is a red herring (likely the AWS download didn't work because it's in Europe and you're in a different region - this is a known limitation). If you're using Also pretty sure that sample-sheet related things are a red-herring. Fields are loaded by header column title, so order and presence of non-required columns should have no effect. This is the key question I'm missing:
Basically, trying to isolate the stage at which this fasta file is being truncated. Is it the bismark index generation step where it's getting messed up, or is it after that - somewhere in the process of staging that file as an input for the alignment process. Sometimes weird bugs can happen where a process incorrectly tries to write over a file - if it's hardlinked into the work directory then this can also overwrite the original source file. This is why I was wondering about
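On the earlier question of how to change the staging mode: one option, sketched here under assumptions, is a small custom config passed with Nextflow's `-c` option. The file name `stage_copy.config` is arbitrary, and whether copying actually avoids the empty file in this case is untested. The run command is only printed, not executed.

```shell
#!/usr/bin/env bash
# Sketch: write a custom Nextflow config that stages input files as copies
# instead of the default symlinks. The file name is an arbitrary choice.
cat > stage_copy.config <<'EOF'
// Applies to every process in the run.
process {
    stageInMode = 'copy'
}
EOF

# Print (not run) an example command that would pick up the config via -c.
echo "nextflow run nf-core/methylseq -profile docker --input ./samplesheet.csv --outdir out --fasta ./genome.fa -c stage_copy.config"
```

Copying a multi-gigabyte genome into every task directory costs disk space and time, so this is more of a diagnostic switch than a permanent setting.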
|
Ok, trying to replicate this.
Getting the data
Fetching the data, links from https://sra-explorer.info/
#!/usr/bin/env bash
curl -L ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR105/031/SRR10532131/SRR10532131_1.fastq.gz -o SRR10532131_EM-seq_10_ng_replicate_1_1.fastq.gz
curl -L ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR105/031/SRR10532131/SRR10532131_2.fastq.gz -o SRR10532131_EM-seq_10_ng_replicate_1_2.fastq.gz
Wasn't sure exactly where your reference fasta came from, so downloaded a GRCh38 fasta file from AWS-iGenomes:
wget https://ngi-igenomes.s3.eu-west-1.amazonaws.com/igenomes/Homo_sapiens/NCBI/GRCh38/Sequence/WholeGenomeFasta/genome.fa
Ended up with this:
$ ls -lh
total 18G
-rw-r--r-- 1 gitpod gitpod 300 May 3 20:52 download.sh
-rw-r--r-- 1 gitpod gitpod 3.0G Apr 13 2017 genome.fa
-rw-r--r-- 1 gitpod gitpod 166 May 3 21:00 samplesheet.csv
-rw-r--r-- 1 gitpod gitpod 7.2G May 3 20:56 SRR10532131_EM-seq_10_ng_replicate_1_1.fastq.gz
-rw-r--r-- 1 gitpod gitpod 7.9G May 3 21:00 SRR10532131_EM-seq_10_ng_replicate_1_2.fastq.gz
Sample sheet
Based on the above, which has unused columns (
Command
Running on Gitpod, but tried to keep it basically the same:
nextflow run nf-core/methylseq \
-profile docker \
--outdir out \
--max_cpus 4 \
--input ./samplesheet.csv \
--max_memory 8GB \
--fasta ./genome.fa \
-with-report
....and, GitPod is going to take about 100 years to run indexing on a full genome + processing a full dataset :/ Come back soon to hear the exciting finale! (or probably not, just that it wouldn't run on GitPod resources).. |
(bonus points if anyone can replicate using |
Hi, I'm having the same issue while trying to run a test using sample E. coli data. I was able to get the pipeline to complete successfully using
My files are:
My sample sheet is attached. The command I'm using is:
nextflow run nf-core/methylseq -profile docker --input samplesheet_test_GVFedit.csv --outdir test_results_bisbt2 --fasta GCF_000005845.2_ASM584v2_genomic.fasta --fasta_index GCF_000005845.2_ASM584v2_genomic.fasta.fai –aligner bismark –save_trimmed --save_align_intermeds
Thanks in advance! |
Hi, I have a similar problem: the pipeline doesn't find the reference genome although the genome is there and the file is not empty. This happens both
$ nextflow run nf-core/methylseq --input /scratch/project_2010912/hannu/sample_list_test.csv --fasta /scratch/project_2010912/hannu/salmon_major_chromosomes.fasta --save_reference --outdir /scratch/project_2010912/methylseq_output --multiqc_title test_report -profile singularity -resume
Workflow execution completed unsuccessfully!
The exit status of the task that caused the workflow execution to fail was: 2.
The full error message was:
Error executing process > 'NFCORE_METHYLSEQ:METHYLSEQ:BISMARK:BISMARK_ALIGN (14)'
Caused by:
Process `NFCORE_METHYLSEQ:METHYLSEQ:BISMARK:BISMARK_ALIGN (14)` terminated with an error exit status (2)
Command executed:
bismark \
-1 14_1_val_1.fq.gz -2 14_2_val_2.fq.gz \
--genome BismarkIndex \
--bam \
--bowtie2 --multicore 4
cat <<-END_VERSIONS > versions.yml
"NFCORE_METHYLSEQ:METHYLSEQ:BISMARK:BISMARK_ALIGN":
bismark: $(echo $(bismark -v 2>&1) | sed 's/^.*Bismark Version: v//; s/Copyright.*$//')
END_VERSIONS
Command exit status:
2
Command output:
(empty)
Command error:
INFO: Environment variable SINGULARITYENV_TMPDIR is set, but APPTAINERENV_TMPDIR is preferred
INFO: Environment variable SINGULARITYENV_NXF_TASK_WORKDIR is set, but APPTAINERENV_NXF_TASK_WORKDIR is preferred
INFO: Environment variable SINGULARITYENV_NXF_DEBUG is set, but APPTAINERENV_NXF_DEBUG is preferred
Bowtie 2 seems to be working fine (tested command 'bowtie2 --version' [2.4.5])
Output format is BAM (default)
Alignments will be written out in BAM format. Samtools found here: '/usr/local/bin/samtools'
Reference genome folder provided is BismarkIndex/ (absolute path is '/scratch/project_2010912/hannu/work/a3/eb731a5587a70df3a2881d8e50a37b/BismarkIndex/)'
FastQ format assumed (by default)
Input files to be analysed (in current folder '/scratch/project_2010912/hannu/work/d9/a345e5eae651450e6e1bc1e4f30f69'):
14_1_val_1.fq.gz
14_2_val_2.fq.gz
Library is assumed to be strand-specific (directional), alignments to strands complementary to the original top or bottom strands will be ignored (i.e. not performed!)
Summary of all aligner options: -q --score-min L,0,-0.2 --ignore-quals --no-mixed --no-discordant --dovetail --maxins 500
Running Bismark Parallel version. Number of parallel instances to be spawned: 4
Current working directory is: /scratch/project_2010912/hannu/work/d9/a345e5eae651450e6e1bc1e4f30f69
Now reading in and storing sequence information of the genome specified in: /scratch/project_2010912/hannu/work/a3/eb731a5587a70df3a2881d8e50a37b/BismarkIndex/
Failed to read from sequence file salmon_major_chromosomes.fasta No such file or directory
Work dir:
/scratch/project_2010912/hannu/work/d9/a345e5eae651450e6e1bc1e4f30f69
Tip: you can replicate the issue by changing to the process work dir and entering the command `bash .command.run`
$ nextflow run nf-core/methylseq --input /scratch/project_2010912/hannu/sample_list_test.csv --fasta /scratch/project_2010912/hannu/salmon_major_chromosomes.fasta --save_reference --outdir /scratch/project_2010912/methylseq_output --multiqc_title test_report --aligner bwameth
Workflow execution completed unsuccessfully!
The exit status of the task that caused the workflow execution to fail was: null.
The full error message was:
Error executing process > 'NFCORE_METHYLSEQ:METHYLSEQ:PREPARE_GENOME:SAMTOOLS_FAIDX'
Caused by:
Not a valid path value type: groovyx.gpars.dataflow.DataflowVariable (DataflowVariable(value=/scratch/project_2010912/hannu/salmon_major_chromosomes.fasta))
Tip: when you have fixed the problem you can continue the execution adding the option `-resume` to the run command line |
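For the first failure above ("No such file or directory" for a FASTA inside BismarkIndex/), one thing that may be worth checking is whether the file in the index directory is a symlink whose target cannot be resolved. A minimal sketch, with a hypothetical helper name `check_links`; it only inspects links on the host and says nothing about what is visible inside the container.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list symlinks under a directory and report whether
# each target currently resolves.
check_links() {
    local dir="$1"
    find "$dir" -type l -print |
    while IFS= read -r link; do
        # [ -e ] follows the link, so it fails for a dangling symlink.
        if [ -e "$link" ]; then
            echo "RESOLVES $link -> $(readlink "$link")"
        else
            echo "DANGLING $link -> $(readlink "$link")"
        fi
    done
}
```

It could be pointed at the index directory printed in the log, i.e. the `BismarkIndex/` path reported as the reference genome folder.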
The second |
that bug with bwameth has been fixed and runs successfully
|
Description of the bug
Hi - thanks for all your work on the pipeline.
I have a recurrent issue which is easy to overcome but I presume is caused by a bug somewhere. When I specify a .fasta reference using --fasta, I'm finding that the pipeline fails at the beginning of alignment because the generated Bismark index contains an empty version of the supplied .fasta file - see error message below:
Inside the work dir, BismarkIndex/ contains a correctly named, empty .fasta. I overcome this by copying the original .fasta into the working directory and restarting the pipeline with -resume. Not a major issue but it took me a little while to figure it out.
Thanks
Patrick
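The workaround described above could be scripted roughly as follows. This is a sketch under assumptions: the helper name `restore_empty_fasta` is hypothetical, and it assumes the empty file is a regular file with the same base name as the original reference.

```shell
#!/usr/bin/env bash
# Hypothetical helper: replace zero-byte regular files that share the
# original FASTA's base name with a copy of the original.
restore_empty_fasta() {
    local src="$1" dir="$2" name
    name=$(basename "$src")
    # -size 0 selects empty files; symlinks are skipped by -type f.
    find "$dir" -type f -name "$name" -size 0 -print |
    while IFS= read -r f; do
        cp "$src" "$f" && echo "restored $f"
    done
}
```

After something like `restore_empty_fasta /path/to/reference.fasta work`, the pipeline would be restarted with `-resume` as described.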
Command used and terminal output
System information
Methylseq V2.3.0
Script name main.nf
Script ID d420d96c87e85cb9eb0749a6d4f01610
Workflow session 2fbd7f38-7eaf-41c1-967c-8bdef68bdc9d
Workflow profile standard
Nextflow version version 22.10.1, build 5828 (27-10-2022 16:58 UTC)
Executor: Slurm
Container engine: Singularity
OS: Unix