Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sub-path support #29

Open
glennhickey opened this issue Aug 20, 2019 · 9 comments
Open

Sub-path support #29

glennhickey opened this issue Aug 20, 2019 · 9 comments

Comments

@glennhickey
Copy link
Contributor

Subpath information is useful when working with subgraphs. It will also help with rGFA compatibility (SN/SO fields).

The following ideas emerged from the last meeting:

  • A subpath is a path.
  • It contains information linking it to the name and positional offset of its source path
  • It represents a contiguous range in its source path
  • All paths in a graph (regardless of whether or not they are subpaths) must have unique non-empty names. There's no naming structure enforced beyond this, though implementations will probably adopt conventions for subpath naming under the hood.

The subpath interface would be added to PathHandleGraph. It could look something like

/// is a path a subpath
virtual bool is_subpath(path_handle_t path_handle) const = 0;

/// get the name of the source path
virtual string get_source_path_name(path_handle_t path_handle) const = 0;

/// get the offset in the source path
virtual off_t get_source_path_offset(path_handle_t path_handle) const = 0;

I'm not sure if, implementation-wise, it's better to combine the above? In theory, they could be a single method like

/// get subpath information, not a subpath if returned name is empty
virtual pair<string, off_t> get_source_path_position(path_handle_t path_handle) const = 0;

Similar story for querying. We can provide default implementations of these using the above methods:

/// get subpaths for a path
vector<path_handle_t> get_subpaths(const string& source_path_name) const;

/// get_subpaths for a range in a path (start/end 0-based inclusive).  if last flag is true, only paths
/// fully contained in the range are returned
vector<path_handle_t> get_subpaths_for_range(const string& source_path_name, off_t start, off_t end, bool fully_contained = false)

I think implementing this should be pretty straightforward, but making sure that existing algorithms that modify paths use it properly may take some doing.

@ekg @jeizenga @adamnovak thoughts?

@glennhickey
Copy link
Contributor Author

There's probably a method (that could have a default implementation) or flag needed in the above to walk up all the way to the "root" source path and/or compute the equivalent to rGFA's SR/Rank in the case of multiple nesting layers.

@adamnovak
Copy link
Member

In this formulation, I think we definitely need a set of functions which operate on either the path itself if not a subpath, or the parent path if there is one. When we get the path name and offset for output (or read them in as input), we want to automatically articulate things in terms of the base path, and interpret them in terms of the subpaths we actually happen to have available. Something like get_root_path_name(path_handle_t) and get_offset_in_root_path(occurrence_handle_t). We also would want some more accessors for working with root paths. We want to be able to iterate over the subpaths on a root path even when we don't have any idea of their coordinates along the path, for example. We also want to be able to visit every occurrence along a root path in order, regardless of the subpath that each comes from, with some notion of when we cross a discontinuity. And we should be able to iterate over all of the root paths, so we don't waste our time looping over all paths and filtering.

Given that, I'm not sure that we don't want a handle instead of just a string to identify root paths that have subpaths but are not themselves present in the graph.

I think the Right Way to structure things might be to make all actual paths into subpaths, and to separate the root path namespace from the subpath namespace. You can no longer have subpaths of subpaths; they just become possibly overlapping subpaths of the same ultimate root path. You can have a subpath called "chr1" with an offset of 0 that runs the length of chr1, but it is still a subpath and not the actual "chr1" root path, and the only thing that makes it better than some other subpath of the actual "chr1" root path is that it is probably longer. Subpaths would then be very firmly views and not actual paths, so updating one would update all other overlapping subpaths. A design like this would make it harder to accidentally ignore the existence of subpaths.

@glennhickey
Copy link
Contributor Author

Good points. I like get_root_path_name(path_handle_t) and get_offset_in_root_path(occurrence_handle_t).

Personally, I'm mostly interested in getting global coordinates out of subgraphs, for which the above two methods are sufficient. In addition to vg chunk, this allows for things like extracting an MHC graph from a genome graph, and focusing on it transparently using real chr6 coordinates. I think these methods would also be sufficient for a fairly simple 2-pass subgraph merging algorithm.

The interface where we can iterate and query root paths transparently via a collection of subgraphs sounds cool, but seems like a far taller order implementation wise. What would the use case be for it?

@glennhickey
Copy link
Contributor Author

I guess on my MHC example, you may want to go from a root path position to a step in the subgraph which would indeed require some kind of root path handle to do consistently.

@glennhickey
Copy link
Contributor Author

glennhickey commented Aug 21, 2019

Can we do things The Right Way while being backward compatible with the existing interface? Perhaps if we keep everything as is and assume the current path interface gives handles to root paths. Then behind the scenes the implementation can manage subpaths if they exist. An additional subpath interface could be used to map root-path steps to their actual subpaths and detect discontinuities during iteration, examine the subpaths themselves etc.

@adamnovak
Copy link
Member

The interface where we can iterate and query root paths transparently via a collection of subgraphs sounds cool, but seems like a far taller order implementation wise. What would the use case be for it?

I'm not quite sure what you mean here; I think you meant to say "subpaths" and not "subgraphs"? The use case for being able to query the root path name and coordinates from a handle to a subpath is for VCF output when calling on a subgraph. The use case for being able to see the root path through transparent subpath views is any time we want to modify a path, because we want the modification to affect all the subpaths to maintain consistency.

Can we do things The Right Way while being backward compatible with the existing interface?

I don't think we can have the existing interface work for root paths, because the existing iteration interface has no way to signal a discontinuity. We don't want to feed a graph with only parts of a root path into some existing algorithm and have it ignore the discontinuities between the actually-provided subpath parts of the path, without intending to. I think we need a new interface for iterating along the represented parts of a root path, or through the subpaths of a root path, that doesn't let the iteratee ignore the discontinuities accidentally.

@jeizenga
Copy link
Contributor

One comment about the suggestions above: if we're going to do efficient range queries we'll also need to build a BBT index over subpath segments. I don't think that's horrible, tbh, but it is a design consideration.

I think we also need to extend the MutablePathHandleGraph interface to have some methods for creating subpaths. The implementation here actually seems somewhat more challenging to me. Off the top of my head, I see two options here: one offset-based, one step-based.

path_handle_t create_subpath(const string& parent_name, const size_t& offset, const size_t& length);
path_handle_t create_subpath(const PathHandleGraph& from, const step_handle_t& begin, const step_handle_t& end);

In the first option, there doesn't seem to be any simple mechanism for ensuring consistency between the subpath and the parent path. For example, what if the offset does not fall at a node boundary? Where do the steps come from, and how would we ensure that they match the parent?

In the second option, the only way to calculate the offset is by traversing the entire prefix of the path. That's not difficult to implement, but it's not very efficient.

Any suggestions for improving the mutable interface?

@ekg
Copy link
Member

ekg commented Aug 21, 2019 via email

@jeizenga
Copy link
Contributor

BBT = balanced binary tree

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants