[WIP] Improve off-reference coordinate and rGFA support #3891

Closed
glennhickey wants to merge 21 commits

glennhickey
Contributor

The overall goal here is to allow variant calling within large insertions. With the HPRC human pangenomes, we presently project to hg38 using vg surject for Giraffe/DeepVariant, or using vg's VCF projection for deconstruct and call. In all cases, this makes looking at variants within insertions difficult: they could be lost in surject or get buried in giant nested alleles in the VCF.

As far as the vg tools are concerned, we could support variants within insertions provided they get unique "reference" paths to project onto. rGFA is what minigraph uses to define these paths: each node in the graph gets a "stable" coordinate on one of the input haplotypes. This is effectively a path cover of the graph that gives each position a unique coordinate.
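To make the "stable coordinate" idea concrete, here is a minimal Python sketch that reads the rGFA tags off a single segment line. It assumes the tags defined by the rGFA format (`SN` for the stable sequence name, `SO` for the stable offset, `SR` for the rank) and an inline sequence rather than `*`. The function name and the example line are made up for illustration and are not part of vg or this PR.

```python
def parse_rgfa_segment(line):
    """Parse one rGFA S-line into a dict holding its stable coordinate.

    Assumes the sequence is given inline (not '*') and that the
    SN, SO and SR tags are all present.
    """
    fields = line.rstrip("\n").split("\t")
    if fields[0] != "S":
        raise ValueError("not a segment line")
    node_id, sequence = fields[1], fields[2]
    tags = {}
    for field in fields[3:]:
        # Optional GFA fields look like NAME:TYPE:VALUE
        name, tag_type, value = field.split(":", 2)
        tags[name] = int(value) if tag_type == "i" else value
    return {
        "node": node_id,
        "length": len(sequence),
        "stable_name": tags["SN"],    # which input sequence the node sits on
        "stable_offset": tags["SO"],  # offset of the node on that sequence
        "rank": tags["SR"],           # 0 for the reference, >0 otherwise
    }


# Hypothetical segment: a 7 bp node placed at offset 100 on HG002#0#chr1.
example_line = "S\ts2\tGATTACA\tSN:Z:HG002#0#chr1\tSO:i:100\tSR:i:1"
segment = parse_rgfa_segment(example_line)
```

Because every node carries exactly one (`SN`, `SO`) pair, any base in the graph can be named by a single coordinate, which is the property the projection tools need.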

There are some challenges with incorporating this into vg:

  • Since we also (unlike minigraph) store the full haplotype as a path, we have some redundant information. For example, we could have a haplotype path HG002#0#chr1 for all of chromosome 1, but then we may also need a reference path HG002#0#chr1[100-200] that defines an interval in the cover. This may confuse tools that aren't expecting both.
  • rGFA itself is annoying, as it is yet another way (in addition to P and W lines) to store paths in GFA. But to round-trip it into and out of vg, we need to bake a flag into the path names, which are already overloaded with the current metadata.
  • Having a cover for 100% of the sequence in the graph (like minigraph) is going to be impractical for base-level graphs (i.e., SNPs would require 1bp path fragments).
  • Things get more complicated with multiple references. If I have a graph with GRCh38 and CHM13 reference-sense paths and add an rGFA cover based on GRCh38, how do I easily tell surject to ignore CHM13?
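The redundancy described in the first bullet can be illustrated with a small Python sketch that splits a fragment name such as `HG002#0#chr1[100-200]` into its base path name and interval. The `name[start-end]` shape is taken from the example above. The helper names are hypothetical, and this is not how vg parses path metadata.

```python
import re

# base name, then "[start-end]" at the very end of the string
FRAGMENT_RE = re.compile(r"^(?P<base>.+?)\[(?P<start>\d+)-(?P<end>\d+)\]$")


def split_fragment_name(path_name):
    """Return (base_name, (start, end)), or (path_name, None) for a full path."""
    match = FRAGMENT_RE.match(path_name)
    if match is None:
        return path_name, None
    return match.group("base"), (int(match.group("start")), int(match.group("end")))


def group_by_base(path_names):
    """Group full paths and cover fragments that share the same base name."""
    groups = {}
    for name in path_names:
        base, interval = split_fragment_name(name)
        groups.setdefault(base, []).append(interval)
    return groups


groups = group_by_base(["HG002#0#chr1", "HG002#0#chr1[100-200]"])
```

Here `groups` holds both the full haplotype (interval `None`) and the cover fragment under the same key, which is exactly the situation a tool has to be prepared for.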

As of right now, this PR adds an option to vg paths to compute the rGFA cover using a greedy algorithm, allowing specification of a minimum fragment size. This is nearly enough to start playing around with some applications. But there's still work to be done on the metadata questions above before it can be merged.
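For intuition only, a greedy cover with a minimum fragment size might look like the Python sketch below. It is not the algorithm implemented in this PR. It assumes paths are given in priority order (reference first), that each path visits a node at most once, and that offsets are half-open base intervals along the path. All names and the toy graph are invented for illustration.

```python
def greedy_cover(paths, node_lengths, min_fragment=1):
    """Assign each node to at most one path fragment, greedily.

    paths: list of (path_name, [node ids]) in priority order.
    node_lengths: dict mapping node id -> length in bases.
    Returns a list of (path_name, start, end, [node ids]) fragments.
    """
    covered = set()
    fragments = []
    for name, nodes in paths:
        offset = 0      # position along this path, in bases
        run = []        # current run of not-yet-covered nodes
        run_start = 0
        # The trailing None is a sentinel that flushes the final run.
        for node in list(nodes) + [None]:
            if node is not None and node not in covered:
                if not run:
                    run_start = offset
                run.append(node)
            elif run:
                run_len = sum(node_lengths[n] for n in run)
                # Runs shorter than min_fragment are left uncovered.
                if run_len >= min_fragment:
                    fragments.append((name, run_start, run_start + run_len, run))
                    covered.update(run)
                run = []
            if node is not None:
                offset += node_lengths[node]
    return fragments


# Toy graph: nodes 2 and 3 are the two alleles of a SNP, node 5 is a 50 bp insertion.
node_lengths = {1: 10, 2: 1, 3: 1, 4: 10, 5: 50, 6: 10}
paths = [
    ("GRCh38#0#chr1", [1, 2, 4, 6]),
    ("HG002#0#chr1", [1, 3, 4, 5, 6]),
]
cover_min10 = greedy_cover(paths, node_lengths, min_fragment=10)
cover_min1 = greedy_cover(paths, node_lengths, min_fragment=1)
```

With `min_fragment=10` the insertion gets its own fragment on HG002#0#chr1 at offsets 21 to 71, while the 1bp SNP allele is left uncovered. With `min_fragment=1` the SNP allele becomes a separate 1bp fragment, which is the proliferation of tiny fragments that the minimum size is meant to avoid.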

@glennhickey
Contributor Author

Replaced by #4113

@glennhickey closed this on Oct 4, 2023