A null vcf file #377

Closed
dudududu12138 opened this issue Mar 18, 2024 · 3 comments

Comments

@dudududu12138

Hello,
I just tested pggb on simulated data. I simulated a reference file (.fasta) containing a reference sequence of 3941 bp. Then I simulated a query file (.fasta) containing a query sequence of 600 bp. The query sequence was taken from bases 1-300 and 800-1200 of the reference sequence. I used these 2 sequences to simulate a deletion, but the final result is an empty (null) VCF file.
This is my command:

pggb -i tmp.fa -o output -n 2 -t 2  -p 50 -s 100  -V 'ref'

Below is the final VCF file:
(image: screenshot of the final VCF file)
And below is my simulated data, which contains the reference and query sequences:
tmp.txt
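
For readers who want to reproduce this kind of test, here is a minimal sketch of building a query by joining two ranges of a reference, which simulates a deletion of the bases in between. The file names, the `q1` header, and the tiny 20 bp sequence are all hypothetical; the real `tmp.fa` is only available as the attachment above. Coordinates are 1-based and inclusive.

```shell
# Hypothetical 20 bp reference split over two lines (not the data from this thread).
printf '>r1\nAAAAACCCCC\nGGGGGTTTTT\n' > demo_ref.fa

# Join all sequence lines, then keep bases 1-5 and 11-15.
# Bases 6-10 are dropped, so the query carries a 5 bp deletion.
awk '
  !/^>/ { seq = seq $0 }
  END {
    print ">q1"
    print substr(seq, 1, 5) substr(seq, 11, 5)
  }
' demo_ref.fa > demo_query.fa

cat demo_query.fa
```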

@subwaystation
Member

@dudududu12138 I can reproduce this. Your *.seqwish.gfa should still contain both of your input sequences as paths. However, the query path is lost after the smoothxg step. @AndreaGuarracino, do you have any idea why?
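
One way to check for a lost path is to compare the path names in two GFA files. The sketch below relies only on the GFA v1 convention that path records start with `P` and carry the path name in the second tab-separated field. The files `before.gfa` and `after.gfa` are hypothetical stand-ins for the seqwish and smoothxg outputs, whose actual names depend on the pggb version and input.

```shell
# Hypothetical miniature graphs: "before" has two paths, "after" has lost one.
printf 'H\tVN:Z:1.0\nS\t1\tACGT\nP\tref\t1+\t*\nP\tquery\t1+\t*\n' > before.gfa
printf 'H\tVN:Z:1.0\nS\t1\tACGT\nP\tref\t1+\t*\n' > after.gfa

# Print the sorted path names of a GFA file (second field of each P line).
list_paths() {
  awk -F'\t' '$1 == "P" { print $2 }' "$1" | sort
}

list_paths before.gfa > before.paths
list_paths after.gfa > after.paths

# Lines only in the first list are paths that disappeared.
comm -23 before.paths after.paths
```

With these demo files the only name reported is `query`.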

@AndreaGuarracino
Member

Working on that! pangenome/smoothxg#206

@AndreaGuarracino
Member

@dudududu12138, I've fixed the issue. Please update your pggb installation (I see you are not using the latest version), making sure to update smoothxg as well.

For your specific case, I suggest 3 modifications:

  • use PanSN for sequence naming
  • edit the pggb script to add -c 100 where wfmash is called (unfortunately, this parameter is not exposed in pggb's interface)
  • add -l 0 to your command line, giving: pggb -i tmp.fa -o output -n 2 -t 2 -p 50 -s 100 -l 0 -V 'ref'
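
As a rough illustration of the first point, PanSN names follow the pattern `sample#haplotype#contig`, for example `ref#1#r1` and `query#1#q1`. The sketch below rewrites FASTA headers into that form. The original headers of `tmp.fa` are not shown in this thread, so the input headers `r1` and `q1`, the file names, and the contig-to-sample mapping are assumptions made for the demo.

```shell
# Hypothetical input; replace with your own FASTA.
printf '>r1\nACGTACGT\n>q1\nACGT\n' > demo.fa

# Rewrite each header as sample#haplotype#contig.
# Assumed mapping: contig "r1" belongs to sample "ref", anything else to "query".
awk '
  /^>/ {
    contig = substr($0, 2)
    sample = (contig == "r1") ? "ref" : "query"
    print ">" sample "#1#" contig
    next
  }
  { print }
' demo.fa > demo.pansn.fa

grep '^>' demo.pansn.fa
```

Sequence lines pass through unchanged; only the header lines are rewritten.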

The 2nd and 3rd modifications are needed because pggb and its tools are tuned for longer sequences. If you apply all 3 changes, you will get a VCF file like this:

##fileformat=VCFv4.2
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##INFO=<ID=CONFLICT,Number=.,Type=String,Description="Sample names for which there are multiple paths in the graph with conflicting alleles">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Total number of alternate alleles in called genotypes">
##INFO=<ID=AF,Number=A,Type=Float,Description="Estimated allele frequency in the range (0,1]">
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of samples with data">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=LV,Number=1,Type=Integer,Description="Level in the snarl tree (0=top level)">
##INFO=<ID=PS,Number=1,Type=String,Description="ID of variant corresponding to parent snarl">
##INFO=<ID=AT,Number=R,Type=String,Description="Allele Traversal as path in graph">
##contig=<ID=ref#1#r1,length=3941>
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	query#1#q1
ref#1#r1	300	>1>3	CCCTTGTGATCTGCTTAGTTCCCACCCCCCTTTAAGAATTCAATAGAGAAGCCAGACGCAAAACTACAGATATCGTATGAGTCCAGTTTTGTGAAGTGCCTAGAATAGTCAAAATTCACAGAGACAGAAGCAGTGGTCGCCAGGAATGGGGAAGCAAGGCGGAGTTGGGCAGCTCGTGTTCAATGGGTAGAGTTTCAGGCTGGGGTGATGGAAGGGTGCTGGAAATGAGTGGTAGTGATGGCGGCACAACAGTGTGAATCTACTTAATCCCACTGAACTGTATGCTGAAAAATGGTTTAGACGGTGAATTTTAGGTTATGTATGTTTTACCACAATTTTTAAAAAGCTAGTGAAAAGCTGGTAAAAAGAAAGAAAAGAGGCTTTTTTAAAAAGTTAAATATATAAAAAGAGCATCATCAGTCCAAAGTCCAGCAGTTGTCCCTCCTGGAATCCGTTGGCTTGCCTCCGGCATTTTTGGCCCTTGCCTTTTAGGGTTGCCAGATTAAAAGACAGGATGCCCAGCTAGTTTGAATTTTAGATAAACAACGAATAATTTCGTAGCATAAATATGTCCCAAGCTTAGTTTGGGACATACTTATGC	C	60	.	AC=1;AF=1;AN=1;AT=>1>2>3,>1>3;NS=1;LV=0	GT	1
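
To sanity-check a record like the one above, the size of a deletion can be read off as the length of REF minus the length of ALT. This is a minimal sketch on a hypothetical one-record `demo.vcf`, not on the output from this thread. It assumes a single ALT allele per record (no commas in the ALT column).

```shell
# Hypothetical VCF with one deletion: REF "CAAAA" becomes ALT "C".
printf '##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\ts1\nchr1\t5\t.\tCAAAA\tC\t60\t.\t.\tGT\t1\n' > demo.vcf

# Skip header lines; report CHROM, POS and deleted length where REF is longer than ALT.
awk -F'\t' '
  !/^#/ {
    d = length($4) - length($5)
    if (d > 0) print $1, $2, d
  }
' demo.vcf > demo.deletions.txt

cat demo.deletions.txt
```

For the demo record this prints `chr1 5 4`, a 4 bp deletion following the anchor base at position 5.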
