strange handling of SVs by bcftools norm --fasta-ref #2330
Can you provide a small test case? It is not possible to comment on these specific cases without seeing the data.
Sure. Here are two vcfs with the variant I mentioned. One of the vcfs is before bcftools normalization, and the other is after.
And I'll just reiterate that this vcf represents variants in a pangenome graph. So perhaps relying on a single reference to normalize, instead of the GFA of the graph, is causing issues. Let me know if you need more test examples. I'm a little busy, but I'm happy to help. You may be interested in this correspondence about other related issues.
ComparativeGenomicsToolkit/cactus#1557 (comment): https://github.com/ComparativeGenomicsToolkit/cactus/issues/1557#issuecomment-2528989634
Maggs
Unfortunately, this is not sufficient; we need the fasta reference and the input VCF file.
Hi bcftools,
I'm part of a project building a pangenome and we noticed some strange output by bcftools norm in terms of how it handles structural variants. I thought you may have some good recommendations, or would want to be aware of the strange behavior.
Here is the command I ran:
bcftools norm --fasta-ref $REF_FASTA input.vcf -o output.vcf
We noticed two main problems. First, the POS of larger structural variants is in some cases shifted many more base pairs than we'd anticipate. For example, an 8887 bp insertion at CHR1:41577 is shifted 140 bp to CHR1:41437.
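For illustration, here is a minimal Python sketch of the generic trim-and-extend left-alignment procedure commonly described for VCF normalization. It is not bcftools' actual implementation, and the function name `left_align` and the toy sequence are hypothetical. It only shows how an insertion whose sequence matches a repeat in the flanking reference can slide left by many bases; whether that explains the 140 bp shift above is an assumption that would need checking against the real reference.

```python
def left_align(ref_seq, pos, ref, alt):
    """Left-align one biallelic variant (sketch, not bcftools code).

    ref_seq: chromosome sequence as a string (hypothetical toy input)
    pos:     1-based POS of the variant
    ref/alt: REF and ALT allele strings
    """
    changed = True
    while changed:
        changed = False
        # Trim a shared trailing base from both alleles.
        if ref and alt and ref[-1] == alt[-1]:
            ref, alt = ref[:-1], alt[:-1]
            changed = True
        # If an allele became empty, extend both alleles one base to the left.
        if not ref or not alt:
            if pos == 1:
                raise ValueError("cannot extend left of position 1 in this sketch")
            pos -= 1
            base = ref_seq[pos - 1]
            ref, alt = base + ref, base + alt
            changed = True
    # Trim shared leading bases, keeping at least one base per allele.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


# Toy reference with a CA repeat at positions 4-11.
toy_ref = "TTGCACACACAGG"
# A "CA" insertion reported at the right end of the repeat (POS 11).
result = left_align(toy_ref, 11, "A", "ACA")
print(result)
```

In this toy case the insertion moves from POS 11 to POS 3, the left edge of the repeat, even though the inserted sequence itself is unchanged.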
We also noticed a particularly difficult case. In the original vcf (prior to normalization/left alignment), there are 5 structural variants distributed across two sites. At CHR1:671683, the individuals Moly and Tany have a >200 bp insertion. The third individual, Pach, has the reference genotype. At site CHR1:671691, all three individuals have a >150 bp insertion. After normalization and left alignment, all of the variants at the second site (CHR1:671691) are reassigned to CHR1:671683. This makes it appear as if there are conflicting alleles at the same site in our vcf.
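One way to locate such cases is to list every site that carries more than one record after normalization. The following is a minimal sketch that assumes an uncompressed, tab-delimited VCF. The function name `records_by_site` and the toy records are hypothetical, and the alleles are shortened placeholders rather than the real data.

```python
from collections import defaultdict


def records_by_site(vcf_lines):
    """Return {(CHROM, POS): [(REF, ALT), ...]} for sites with more than one record."""
    sites = defaultdict(list)
    for line in vcf_lines:
        # Skip blank lines and header lines.
        if not line.strip() or line.startswith("#"):
            continue
        chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
        sites[(chrom, int(pos))].append((ref, alt))
    return {site: alleles for site, alleles in sites.items() if len(alleles) > 1}


# Hypothetical toy records (not the real alleles from this issue).
toy_vcf = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "CHR1\t41437\t.\tA\tAGGT\t.\t.\t.\n",
    "CHR1\t671683\t.\tT\tTACGT\t.\t.\t.\n",
    "CHR1\t671683\t.\tT\tTGGCC\t.\t.\t.\n",
]
shared = records_by_site(toy_vcf)
print(shared)
```

On a real file the same function can be given an open file handle, for example `records_by_site(open("output.vcf"))`, and the result compared between the input and the normalized output.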
I'm aware that there are options such as --rm-dup, or ways to collapse variants. However, that removes information we know is there based on the other outputs from the pangenome. For example, I would like to avoid representing the inserted sequence in Moly as one 200 bp insertion when we know that at least 350 bp are inserted in this region. Any feedback is appreciated. Thank you!