Implement GBZ database backend #367

Open
adamnovak opened this issue Nov 20, 2023 · 2 comments
@adamnovak

@jltsiren has a version of the GBZ that is actually a database (meaning it supports fast random access, potentially by URL), and he has code that can query coordinates and dump a subgraph together with the local paths, indicating which path is the reference.

We should add this as a backend for the tube map. Instead of vg chunk, we could call Jouni's implementation when we see our graph is a GBZ database. If the database is a URL, we should use Jouni's code against that URL, rather than downloading it to the server and running against it locally.

Ideally, we would do the extraction client-side in WebAssembly, but to start we might have to run Jouni's tool on the server, until we can get a WASM build of Jouni's implementation.
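As a rough illustration of the dispatch described above, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the function name `choose_backend`, the backend labels, and the idea of recognizing a GBZ database by a `.gbz.db` suffix are not taken from the tube map or from Jouni's code, and whether the tool can actually query a URL directly is still open.

```python
from urllib.parse import urlparse


def choose_backend(graph_location: str) -> str:
    """Pick an extraction backend for a graph given as a path or a URL.

    Hypothetical labels: "vg-chunk", "gbz-db-local", "gbz-db-remote".
    """
    parsed = urlparse(graph_location)
    is_url = parsed.scheme in ("http", "https")
    # For URLs, look only at the path component so query strings are ignored.
    path = parsed.path if is_url else graph_location

    # Assumption: a GBZ database is recognized by its file name suffix.
    if not path.endswith(".gbz.db"):
        return "vg-chunk"

    # Remote databases are queried in place rather than downloaded first.
    return "gbz-db-remote" if is_url else "gbz-db-local"
```

A real implementation would more likely inspect the file contents than trust the file name.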

@jltsiren

The current prototype is a separate repository, but I'm eventually going to make it an optional feature in the Rust GBWT.

@jltsiren

jltsiren commented Nov 29, 2023

If you want to try it, the prototype is starting to be somewhat functional. Things can change without warning, so you may have to update gbz-base and its dependencies and rebuild the database frequently. You can do that with the following commands; even 8 GB of memory should be enough for an HPRC v1.1 graph:

git pull
cargo update
cargo build --release
target/release/gbz2db --overwrite graph.gbz

You can query the database like this:

target/release/query --sample GRCh38 --contig chr1 --interval 1234500..1234600 \
    --context 100 --distinct graph.gbz.db > out.gfa

Only generic and reference paths are indexed for querying. You can replace --interval with --offset if necessary. Sequence offsets are 0-based and intervals are semiopen. All haplotypes except the queried one will have sample name unknown. Option --distinct outputs only distinct haplotypes and stores the weight (number of duplicates) using tag WT.
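Because offsets are 0-based and intervals are semiopen, coordinates copied from a 1-based, closed source (such as a genome browser) need shifting before they go into `--interval`. The sketch below shows one way a server-side caller might do that and assemble the argument list. The helper names and the default binary path are assumptions; only the flags shown in the command above are used, and the command is built but not executed.

```python
def to_semiopen(start_1based: int, end_1based: int) -> tuple:
    """Convert a 1-based closed interval to a 0-based semiopen one."""
    if start_1based < 1 or end_1based < start_1based:
        raise ValueError("expected 1 <= start <= end")
    # Only the start shifts; the closed end equals the exclusive end.
    return start_1based - 1, end_1based


def build_query_argv(db_path, sample, contig, start, end,
                     context=100, distinct=True,
                     binary="target/release/query"):
    """Build the argument list for the query tool (hypothetical helper).

    start and end are already 0-based and semiopen.
    """
    argv = [
        binary,
        "--sample", sample,
        "--contig", contig,
        "--interval", f"{start}..{end}",
        "--context", str(context),
    ]
    if distinct:
        argv.append("--distinct")
    argv.append(db_path)
    return argv


start, end = to_semiopen(1234501, 1234600)
argv = build_query_argv("graph.gbz.db", "GRCh38", "chr1", start, end)
```

Passing `argv` as a list to `subprocess.run` avoids shell quoting problems with user-supplied sample or contig names.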

The above command should output something like this:

H       VN:Z:1.1        RS:Z:GRCh38
S       49698   TGCAGGTCACTGACCCACTGACCACAGATCACCACTCTCTCCCGCCCCCGTTTCTTCCTCCTGAGACCACTGCTGCCTTCAAA
S       49699   G
S       49700   T
S       49701   GTCCAAGAACAAACTGTGCACACACGGGCGGTCAGTCCTGGATTCAAGGTGGCCCATCGGAGCTCGAACTCAACGGCCTGACCTAGGGACGCCTCCAGTCCCACCCCTACCTTTTGGAGGTGGGAGAATCACC
S       49702   C
S       49703   T
S       49704   GGGTCCCCAGCCCTCGATCCTCACGAGTCCT
S       49705   G
S       49706   A
S       49707   TTCCTCAAGCAATTCCAAACTCCAGAGAAGACATCCCAGGTAAAATCGGATGGCAGGGACCTGGGACACCGCTGTGACGGGCAGCCCCTGGGCTCGAGTCCACAACTCCCACGGGAGGGCAGGGCCCCATCCTGCTCACACC
L       49698   +       49699   +       *
L       49698   +       49700   -       *
L       49699   +       49701   +       *
L       49700   -       49701   +       *
L       49701   +       49702   +       *
L       49701   +       49703   +       *
L       49702   +       49704   +       *
L       49703   +       49704   +       *
L       49704   +       49705   +       *
L       49704   +       49706   +       *
L       49705   +       49707   +       *
L       49706   +       49707   +       *
W       GRCh38  0       chr1    1234375 1234767 >49698>49699>49701>49702>49704>49705>49707      WT:i:79
W       unknown 1       chr1    0       392     >49698>49699>49701>49702>49704>49706>49707      WT:i:5
W       unknown 2       chr1    0       392     >49698>49699>49701>49703>49704>49705>49707      WT:i:5
W       unknown 3       chr1    0       392     >49698<49700>49701>49702>49704>49705>49707      WT:i:1
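To consume this output, for example when building tracks for the tube map, the W-lines can be parsed into walks with their weights. The following is a minimal sketch, not part of gbz-base: the `Walk` class and `parse_walks` function are hypothetical, fields are assumed to be tab-separated as in standard GFA, and a walk without a `WT` tag is assumed to have weight 1.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Walk:
    sample: str
    haplotype: int
    contig: str
    start: Optional[int]
    end: Optional[int]
    steps: List[Tuple[str, bool]]  # (node id, is_reverse)
    weight: int


# A walk is a sequence of steps such as ">49698" or "<49700".
STEP = re.compile(r"([<>])([^<>]+)")


def _maybe_int(field: str) -> Optional[int]:
    """GFA allows '*' for an unknown coordinate."""
    return None if field == "*" else int(field)


def parse_walks(gfa_text: str) -> List[Walk]:
    """Extract W-lines from GFA text, reading the WT:i tag as the weight."""
    walks = []
    for line in gfa_text.splitlines():
        fields = line.split("\t")
        if fields[0] != "W" or len(fields) < 7:
            continue
        weight = 1  # assumption: no WT tag means a single haplotype
        for tag in fields[7:]:
            if tag.startswith("WT:i:"):
                weight = int(tag[len("WT:i:"):])
        steps = [(node, orient == "<") for orient, node in STEP.findall(fields[6])]
        walks.append(Walk(fields[1], int(fields[2]), fields[3],
                          _maybe_int(fields[4]), _maybe_int(fields[5]),
                          steps, weight))
    return walks
```

Summing `weight` over all walks gives the number of haplotypes the distinct walks stand for.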
