Multiple consensus paths #37

Open
brettChapman opened this issue Nov 2, 2020 · 2 comments
Labels
question Further information is requested

Comments

@brettChapman

Hi Erik

I've followed your PGGB updates and am now using smoothxg with added consensus paths. I've loaded the graph into vg and sequenceTubeMap. I notice I have 8 different consensus paths in my graph, and when I run vg deconstruct I get variants called on each of these consensus paths. Do I have multiple consensus paths because of breaks and jumps in the graph, based on -C (which I have set to 10,100,1000,10000, like in PGGB)? Would lowering the -C parameter reduce the number of consensus paths generated?

Would you mind explaining the benefit of having the consensus paths in the graph? Is it basically the graph collapsed down to represent all common regions across the pangenome? How could I use these paths to investigate the pangenome? From what I can see from the alignments, the consensus is representing the most common paths (core sequences). Would this be an accurate description of the consensus paths? Thanks.

@ekg
Collaborator

ekg commented Nov 2, 2020

Hi Brett,

The idea with the consensus graphs is to build lower-resolution versions of the pangenome that are still very "close" in terms of sequence content to the genomes in the graph. These low-resolution versions of the graph can help us inspect the graph in interactive systems, or compare it to other graphs. They are faster to work with than the full graph, which has advantages in many settings.

Right now, the consensus sequences are a kind of reference set of coordinates that cover the graph. The idea is that we can go from a low resolution graph to find the corresponding region of the full graph or MAF. We look up the consensus paths in the given region of the consensus graph at a given C threshold. We'd then subset the base graph to these consensus paths, or search in the MAF, etc.
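
As a rough sketch of that lookup (not smoothxg's own code), the snippet below reads the `P` lines of a GFA v1 file and collects the segments touched by consensus paths, which is the set you would keep when subsetting the base graph. The `Consensus_` name prefix, the function names, and the toy graph are assumptions made for illustration; check the path names in your own graph before relying on the prefix.

```python
def read_paths(gfa_text):
    """Return {path_name: [segment_id, ...]} from GFA v1 P lines."""
    paths = {}
    for line in gfa_text.splitlines():
        fields = line.split("\t")
        if fields[0] != "P" or len(fields) < 3:
            continue
        # Each step looks like "12+" or "12-"; drop the orientation sign.
        paths[fields[1]] = [step[:-1] for step in fields[2].split(",")]
    return paths


def consensus_segments(gfa_text, prefix="Consensus_"):
    """Collect segment IDs touched by paths whose name starts with prefix.

    The default prefix is an assumption, not a documented guarantee.
    """
    keep = set()
    for name, steps in read_paths(gfa_text).items():
        if name.startswith(prefix):
            keep.update(steps)
    return keep


# Toy graph: two sample paths through a SNP bubble plus one consensus path.
toy_gfa = "\n".join([
    "S\t1\tACGT",
    "S\t2\tG",
    "S\t3\tT",
    "S\t4\tCCA",
    "P\tsampleA\t1+,2+,4+\t*",
    "P\tsampleB\t1+,3+,4+\t*",
    "P\tConsensus_0\t1+,2+,4+\t*",
])

print(sorted(consensus_segments(toy_gfa)))
```

The resulting segment set could then be handed to whatever tool you use to extract a subgraph, or used to pick the matching blocks out of the MAF.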

There are some quirks. The consensus path set contains both the heaviest-bundle POA consensus paths from the block MSAs represented in the MAF file and "Link" paths. A link path either walks from the end of one consensus path to the beginning of another, or carries sequence variation (approximately) greater than the given C threshold that would otherwise be fully contained within a single consensus. This latter kind includes large SVs of all types. These alleles are aggregated progressively by working through the set of potential links in order of frequency and divergence from the reference.
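
To give a feel for how the C threshold acts on the variation-carrying links, here is a toy illustration only. It is not the actual selection logic, which is approximate and also weighs frequency and divergence. The function name and the allele lengths are made up; the thresholds are the `-C 10,100,1000,10000` values from the question.

```python
def links_kept(allele_lengths, thresholds=(10, 100, 1000, 10000)):
    """For each threshold C, list the allele lengths (bp) greater than C.

    This is a simplified stand-in for the real, approximate criterion.
    """
    return {c: [n for n in allele_lengths if n > c] for c in thresholds}


# Hypothetical allele lengths in bp, from a small indel up to a large SV.
toy_lengths = [3, 45, 250, 1200, 52000]

for c, kept in links_kept(toy_lengths).items():
    print(c, kept)
```

In this toy setting, each larger C retains fewer and larger alleles, which is the sense in which the higher-C consensus graphs are lower resolution.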

The exact naming and organization of these paths aren't fully implemented yet and will probably evolve. The link paths aren't yet embedded in the graph, but they should be.

@brettChapman
Author

Thanks for the explanation. I'll keep an eye on how the use of these consensus paths evolves over time.

@AndreaGuarracino AndreaGuarracino added the question Further information is requested label Feb 8, 2021