I'm using bcftools 1.20 to split multiallelic variants in one .vcf, merge it with a second .vcf, and then recover the multiallelic variants. For some variants, reference alleles are being counted as missing in the final file. Here are the ACs for one position along the way:
file_A.vcf.gz
chr21:1000000:G:T,GT
allele: G GT T missing
AC: 20431 5617 2 0
file_B.vcf.gz
chr21:1000000:G:GT
allele: G GT missing
AC: 1775 387 0
bcftools norm -a --atom-overlaps . --check-ref s -f reference_fasta.fa -m -both --multi-overlaps 0 -o file_A2.vcf.gz -O z file_A.vcf.gz
file_A2.vcf.gz
chr21:1000000:G:GT
allele: G GT missing
AC: 20433 5617 0
chr21:1000000:G:T
allele: G T missing
AC: 26048 2 0
bcftools norm --check-ref s -f reference_fasta.fa -o file_B2.vcf.gz -O z file_B.vcf.gz
file_B2.vcf.gz
chr21:1000000:G:GT
allele: G GT missing
AC: 1775 387 0
(I create a file named 'file_list.txt' with the names of file_A2.vcf.gz and file_B2.vcf.gz)
bcftools merge -m none -O z -o file_C.vcf.gz -l file_list.txt
file_C.vcf.gz:
chr21:1000000:G:GT
allele: G GT missing
AC: 22208 6004 0
chr21:1000000:G:T
allele: G T missing
AC: 26048 2 2162
bcftools norm -m +any -o file_C2.vcf.gz -O z file_C.vcf.gz
file_C2.vcf.gz
chr21:1000000:G:T,GT
allele: G T GT missing
AC: 20431 2 6004 1775
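If the reference alleles from file_B were retained, the counts in file_C2 would simply be the per-file sums. A quick arithmetic sketch using only the counts listed above (the variable names are illustrative, not part of any tool):

```shell
# Expected allele counts in file_C2 if nothing were lost:
# sums of the file_A.vcf.gz and file_B.vcf.gz counts reported above.
exp_ref=$((20431 + 1775))   # G  (file_A + file_B)
exp_gt=$((5617 + 387))      # GT (file_A + file_B)
exp_t=$((2 + 0))            # T  (present only in file_A)

echo "expected: G=$exp_ref GT=$exp_gt T=$exp_t missing=0"
echo "observed: G=20431 GT=6004 T=2 missing=1775"

# The shortfall in G equals the number of alleles reported as missing.
echo "ref alleles reported as missing: $((exp_ref - 20431))"
```

The shortfall matches the 1775 reference alleles from file_B exactly, while the GT and T counts come through intact.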
I'm seeing this same pattern (the ref alleles from file_B appear as missing in file_C2) for a number of variants. Is there a way that I can get bcftools to keep them as actual ref alleles? It's likely that I just need to use the correct options, but I've tried many combinations without success.
Certainly. The attached .zip contains two files (file_A.vcf.gz and file_B.vcf.gz) with one variant, chr21:14483696, that displays the behavior described.
The .zip also contains the files that were created during norming and merging. The exact commands that I used are included in the attached .txt file, as well as the code I used to count alleles. The reference file is too big to attach (of course), but it can be found here.
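For readers without the attachment, here is a minimal sketch of one way per-allele counts like those above could be tallied. It is not necessarily the code in the attached .txt file. The helper name `count_alleles` is hypothetical; it reads one genotype string per line and counts allele indexes (0 = REF, 1 = first ALT, and so on, `.` = missing).

```shell
# count_alleles (hypothetical helper): read genotype strings such as 0/1, 1|2 or ./.
# from stdin, one per line, and print "<allele index><TAB><count>" for each index seen.
count_alleles() {
  tr '/|' '\n\n' |                 # put each allele of a genotype on its own line
    sort | uniq -c |               # tally identical allele indexes
    awk '{print $2 "\t" $1}'       # reorder to: allele index, count
}

# Assumed usage (requires bcftools and an indexed file); note that this pools
# all records at the position, so split records would be counted together:
#   bcftools query -r chr21:1000000 -f '[%GT\n]' file_C2.vcf.gz | count_alleles
```

The indexes in the output follow the ALT order of the record, so for `G:T,GT` index 1 is T and index 2 is GT.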