Documentation request for dorado polish per-base quality score output #1200

Open
eernst opened this issue Dec 27, 2024 · 3 comments
Labels: documentation, enhancement, polish

Comments

@eernst

eernst commented Dec 27, 2024

Issue Report

Please describe the issue:

Passing the --qualities flag produces FASTQ output in which the vast majority of bases in the polished assembly have the quality character ! (Phred Q0). Is this the intended output? Are non-zero Q scores only written for regions where polishing edits were made?
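For readers unfamiliar with the encoding: FASTQ quality strings use the Phred+33 convention, so the character ! (ASCII 33) decodes to Q0, which nominally corresponds to an error probability of 1. A minimal sketch of that conversion follows; the function names are illustrative and not part of dorado.

```python
PHRED_OFFSET = 33  # Phred+33 (Sanger) offset used in FASTQ quality strings


def phred_from_char(ch: str) -> int:
    """Decode one FASTQ quality character into its Phred score."""
    return ord(ch) - PHRED_OFFSET


def error_probability(q: int) -> float:
    """Convert a Phred score into the error probability it encodes."""
    return 10 ** (-q / 10)


if __name__ == "__main__":
    q = phred_from_char("!")
    print(q, error_probability(q))
```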

Steps to reproduce the issue:

Run dorado polish with per-base quality output

Run environment:

  • Dorado version: 0.9.0
  • Dorado command: dorado polish --qualities --ignore-read-groups ...
  • Operating system: Rocky Linux 8.10
  • Hardware (CPUs, Memory, GPUs): 56-core Intel x86_64, 1.5TB, 2x Tesla V100 32GB
  • Source data type (e.g., pod5 or fast5 - please note we always recommend converting to pod5 for optimal basecalling performance): merged multilibrary aligned BAM (with move table)
  • Source data location (on device or networked drive - NFS, etc.): NFS
  • Details about data (flow cell, kit, read lengths, number of reads, total dataset size in MB/GB/TB):
  • Dataset to reproduce, if applicable (small subset of data to share as a pod5 to reproduce the issue):

Logs

  • Please provide output trace of dorado (run dorado with -v, or -vv on a small subset)

HalfPhoton added the documentation, enhancement, and polish labels on Jan 2, 2025
@svc-jstone
Contributor

Hi @eernst,
dorado polish writes ! at positions with zero coverage, where the output bases are by default copied (gap-filled) from the input draft sequence; in those positions ! serves as a dummy QV.
Does your data have many low coverage regions?
If not, it could be a bug.
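One way to check whether the ! positions coincide with zero-coverage regions is to count Q0 bases that fall inside intervals with non-zero depth. The sketch below assumes depth intervals in 0-based, half-open BED coordinates (for example, parsed from a per-base depth file), and it assumes polished and draft coordinates are close enough to compare directly, which only holds approximately once polishing introduces insertions or deletions. The function name is hypothetical.

```python
from typing import Iterable, Tuple


def q0_with_coverage(qual: str, depth_intervals: Iterable[Tuple[int, int, int]]) -> int:
    """Count '!' qualities at positions whose reported depth is above zero.

    depth_intervals holds (start, end, depth) tuples for a single contig,
    0-based and half-open, as in BED (an assumption of this sketch).
    """
    count = 0
    for start, end, depth in depth_intervals:
        if depth > 0:
            # str.count accepts start/end bounds, so no slicing copy is needed
            count += qual.count("!", start, end)
    return count
```

If the gap-filling explanation accounts for all Q0 bases, this count should be close to zero for every contig.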

@eernst
Author

eernst commented Jan 7, 2025

Hi @svc-jstone, thanks for following up.

The coverage is even and high across the target assembly, which contains both haplotypes assembled by hifiasm.

For example, one contig with length 21,725,053 has 20,572,861 !s in the dorado polish fastq output, with a mean coverage of 69.64 as computed by mosdepth with default alignment filters (-F 1796) on the dorado aligner output. >99% of the bases in the contig have coverage 40X or higher.
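The figures above work out to roughly 94.7% of the contig's bases carrying a ! quality (20,572,861 of 21,725,053). A per-contig tally like this can be reproduced with a short script; the sketch below assumes the FASTQ uses four lines per record with sequence and quality each on a single line, and the function name is illustrative.

```python
import sys
from typing import Iterable, Iterator, Tuple


def q0_counts(lines: Iterable[str]) -> Iterator[Tuple[str, int, int]]:
    """Yield (record name, number of '!' qualities, sequence length).

    Assumes four-line FASTQ records (header, sequence, '+', qualities).
    """
    it = iter(lines)
    for header in it:
        header = header.rstrip("\n")
        if not header:
            continue  # tolerate blank lines between records
        seq = next(it, "").rstrip("\n")
        next(it, "")  # '+' separator line
        qual = next(it, "").rstrip("\n")
        name = header[1:].split()[0]
        yield name, qual.count("!"), len(seq)


if __name__ == "__main__":
    # Usage: python q0_counts.py polished.fastq
    with open(sys.argv[1]) as handle:
        for name, n_q0, length in q0_counts(handle):
            frac = n_q0 / length if length else 0.0
            print(f"{name}\t{n_q0}\t{length}\t{frac:.4f}")
```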

The data are simplex 5 kHz reads with the move table emitted into the basecalled BAM.

Commands (edited for brevity) were:
dorado aligner -t 192 assembly.hic.hap1+2.p_ctg.fasta.gz reads.kit14.5khz.simplex.bam | samtools sort -T ~/tmp/samtools-sort -@ 192 -m 3G > reads.kit14.5khz.simplex.aligned.bam

dorado polish --qualities --ignore-read-groups -t 48 -o assembly.hic.hap1+2.p_ctg.polished.fastq reads.kit14.5khz.simplex.aligned.bam assembly.hic.hap1+2.p_ctg.fasta.gz

@svc-jstone
Contributor

Thanks! In this case there might be a bug somewhere. Any chance you can share data for a small contig which exhibits this issue so we can look into it locally?

3 participants