strange handling of SVs by bcftools norm --fasta-ref #2330

Open
maggs-x opened this issue Dec 6, 2024 · 3 comments


maggs-x commented Dec 6, 2024

Hi bcftools,

I'm part of a project building a pangenome, and we noticed some strange output from bcftools norm in how it handles structural variants. I thought you might have some good recommendations, or would want to be aware of the strange behavior.

Here is the command I ran:

bcftools norm --fasta-ref $REF_FASTA input.vcf -o output.vcf

We noticed two main problems. First, the POS of larger structural variants is in some cases shifted many more base pairs than we'd anticipate. For example, an 8887 bp insertion at CHR1:41577 is shifted 140 bp away to CHR1:41437.

Second, we noticed a particularly difficult case. In the original VCF (prior to normalization/left alignment), there are 5 structural variants distributed across two sites. At CHR1:671683, the individuals Moly and Tany have a >200 bp insertion. The third individual, Pach, has the reference genotype. At CHR1:671691, all three individuals have a >150 bp insertion. After normalization and left alignment, all of the variants at the second site (CHR1:671691) are reassigned to CHR1:671683. This makes it appear as if there are conflicting alleles at the same site in our VCF.
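
To illustrate how both observations can arise, here is a minimal sketch of generic indel left-alignment (trim shared trailing bases, extend to the left when an allele becomes empty). It is not bcftools' actual code: the function name `left_align`, the toy reference and the toy alleles are made up for illustration, and it only handles a single biallelic record. It shows that an insertion whose sequence matches the repeat to its left can slide a long way, and can land on the same POS as an unrelated insertion that cannot move.

```python
def left_align(ref_seq, pos, ref, alt):
    """Left-align one biallelic indel (illustrative sketch only).

    ref_seq: contig sequence as a string; pos: 1-based VCF POS.
    Returns the normalized (pos, ref, alt).
    """
    if ref_seq[pos - 1:pos - 1 + len(ref)] != ref:
        raise ValueError("REF allele does not match the reference sequence")
    changed = True
    while changed:
        changed = False
        # Drop a trailing base shared by both alleles.
        if ref and alt and ref[-1] == alt[-1]:
            ref, alt = ref[:-1], alt[:-1]
            changed = True
        # An allele became empty: pull in the preceding reference base.
        if not ref or not alt:
            if pos == 1:
                raise ValueError("cannot extend past the contig start")
            pos -= 1
            base = ref_seq[pos - 1]
            ref, alt = base + ref, base + alt
            changed = True
    # Drop shared leading bases, keeping at least one base per allele.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


# Hypothetical toy contig: positions 5-12 are a CA repeat.
toy_ref = "ACGTCACACACAGT"

# Insertion of GGG after position 4: nothing to slide along, so it stays put.
first = left_align(toy_ref, 4, "T", "TGGG")
# Insertion of CA after position 12: slides left through the CA repeat.
second = left_align(toy_ref, 12, "A", "ACA")

print(first)   # (4, 'T', 'TGGG')
print(second)  # (4, 'T', 'TCA')
```

In this toy case the second insertion moves 8 bp and ends up at the same POS as the first with a different ALT allele, which resembles the pattern at CHR1:671683 and CHR1:671691. Whether the real records behave this way for the same reason can only be checked against the actual VCF and reference.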

I'm aware that there are options to remove duplicates (--rm-dup, for example) or collapse variants. However, that removes information we know is there based on the other outputs from the pangenome. For example, I would like to avoid representing the inserted sequence in Moly as one 200 bp insertion when we know that at least 350 bp are inserted in this region. Any feedback is appreciated. Thank you!

Member

pd3 commented Dec 9, 2024

Can you provide a small test case? It is not possible to comment on these specific cases without seeing the data.

Author

maggs-x commented Dec 10, 2024 via email

Member

pd3 commented Dec 11, 2024

Unfortunately, this is not sufficient. We need the fasta reference and the input VCF file.
