A null vcf file #377

dudududu12138 · 2024-03-18T07:46:25Z

Hello,
I just test pggb on a simulated data. I simulated a reference file(.fasta), the length of this reference is 3941bp. Then I simulated a query file(.fasta),the length of this query is 600bp. This query sequence was taken from bases 1-300 and 800-1200 of the ref sequence. I use this 2 sequences to simulate deletion. But the final result report a null vcf file.
This is my code:

pggb -i tmp.fa -o output -n 2 -t 2  -p 50 -s 100  -V 'ref'

Below is the final vcf file:

And below is my simulated data which contains the reference and query sequences:
tmp.txt

The text was updated successfully, but these errors were encountered:

subwaystation · 2024-03-18T09:12:08Z

@dudududu12138 I can reproduce. Your *.seqwish.gfa should contain still both your input sequences as paths. However, the query path is lost after the smoothxg step. @AndreaGuarracino you have any idea why?

AndreaGuarracino · 2024-03-18T16:42:12Z

Working on that! pangenome/smoothxg#206

AndreaGuarracino · 2024-03-18T17:58:47Z

@dudududu12138, I've fixed the issue. Please update your pggb installation (I see you are not using the latest), being sure to update also smoothxg.

For your specific case, I suggest 3 modifications:

use PanSN for sequence naming
hack pggb script to add -c 100 where wfmash is called (unfortunately, this parameter is not exposed in pggb's interface)
add -l 0 to your command line, so pggb -i tmp.fa -o output -n 2 -t 2 -p 50 -s 100 -l 0 -V 'ref'

The 2nd and 3rd modifications are needed because pggb and its tools are tuned for longer sequences. If you apply all those 3 changes, you will get such a VCF file:

##fileformat=VCFv4.2
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##INFO=<ID=CONFLICT,Number=.,Type=String,Description="Sample names for which there are multiple paths in the graph with conflicting alleles">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Total number of alternate alleles in called genotypes">
##INFO=<ID=AF,Number=A,Type=Float,Description="Estimated allele frequency in the range (0,1]">
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of samples with data">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=LV,Number=1,Type=Integer,Description="Level in the snarl tree (0=top level)">
##INFO=<ID=PS,Number=1,Type=String,Description="ID of variant corresponding to parent snarl">
##INFO=<ID=AT,Number=R,Type=String,Description="Allele Traversal as path in graph">
##contig=<ID=ref#1#r1,length=3941>
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	query#1#q1
ref#1#r1	300	>1>3	CCCTTGTGATCTGCTTAGTTCCCACCCCCCTTTAAGAATTCAATAGAGAAGCCAGACGCAAAACTACAGATATCGTATGAGTCCAGTTTTGTGAAGTGCCTAGAATAGTCAAAATTCACAGAGACAGAAGCAGTGGTCGCCAGGAATGGGGAAGCAAGGCGGAGTTGGGCAGCTCGTGTTCAATGGGTAGAGTTTCAGGCTGGGGTGATGGAAGGGTGCTGGAAATGAGTGGTAGTGATGGCGGCACAACAGTGTGAATCTACTTAATCCCACTGAACTGTATGCTGAAAAATGGTTTAGACGGTGAATTTTAGGTTATGTATGTTTTACCACAATTTTTAAAAAGCTAGTGAAAAGCTGGTAAAAAGAAAGAAAAGAGGCTTTTTTAAAAAGTTAAATATATAAAAAGAGCATCATCAGTCCAAAGTCCAGCAGTTGTCCCTCCTGGAATCCGTTGGCTTGCCTCCGGCATTTTTGGCCCTTGCCTTTTAGGGTTGCCAGATTAAAAGACAGGATGCCCAGCTAGTTTGAATTTTAGATAAACAACGAATAATTTCGTAGCATAAATATGTCCCAAGCTTAGTTTGGGACATACTTATGC	C	60	.	AC=1;AF=1;AN=1;AT=>1>2>3,>1>3;NS=1;LV=0	GT	1

AndreaGuarracino mentioned this issue Mar 18, 2024

Fix path embedding pangenome/smoothxg#206

Merged

AndreaGuarracino mentioned this issue Mar 18, 2024

update smoothxg #380

Merged

AndreaGuarracino closed this as completed Mar 18, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

A null vcf file #377

A null vcf file #377

dudududu12138 commented Mar 18, 2024

subwaystation commented Mar 18, 2024

AndreaGuarracino commented Mar 18, 2024

AndreaGuarracino commented Mar 18, 2024

A null vcf file #377

A null vcf file #377

Comments

dudududu12138 commented Mar 18, 2024

subwaystation commented Mar 18, 2024

AndreaGuarracino commented Mar 18, 2024

AndreaGuarracino commented Mar 18, 2024