This repository has been archived by the owner on Oct 15, 2020. It is now read-only.

Bgen to zarr (v1) #21

Closed · wants to merge 1 commit

Conversation

eric-czech (Collaborator) commented:

#16

This adds two functions:

  • rechunk_to_zarr
    • Saves a dataset with no chunking in the samples dimension into a temporary store with some desired sample chunking
    • Provides options for packing (via pack_variables) and compressing variables (via encode_variables) into a more efficient representation
  • rechunk_from_zarr
    • Applies rechunking to a desired variant chunk length from the temporary store
    • Has a companion unpack_variables function that can be used to undo the original packing

The first function takes a Dataset that a user would have created via read_bgen, and the second returns a Dataset that the user can then save elsewhere (a rough usage sketch is below). This isn't the full bgen_to_zarr implementation, but it covers all of the inner workings that would be needed to add a layer on top analogous to vcf_to_zarr. There is enough code here though that I wanted to push it up for review first.
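Roughly, the intended usage would look like the following sketch. The import path, store paths, and chunk-size keyword names are illustrative assumptions, not the exact signatures in this PR:

```python
# Illustrative sketch only -- keyword names and paths are assumptions.
# rechunk_to_zarr, rechunk_from_zarr, and unpack_variables are the functions added in this PR.
from sgkit_bgen import read_bgen  # assumed import path for the existing reader

# Read with short variant chunks and no chunking in the samples dimension
ds = read_bgen("example.bgen", chunks=(50, -1))

# Pack/encode and write to a temporary store with the desired sample chunking
rechunk_to_zarr(ds, "intermediate.zarr", chunk_width=10_000)

# Read back from the temporary store with the desired variant chunk length
ds2 = rechunk_from_zarr("intermediate.zarr", chunk_length=10_000)

# Optionally undo the packing, then save the result wherever you like
ds2 = unpack_variables(ds2)
ds2.to_zarr("final.zarr")
```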

tomwhite (Collaborator) left a comment:

This looks like it has a useful set of components to implement bgen_to_zarr, but I'm a bit confused about the relationship between rechunk_to_zarr and rechunk_from_zarr. The former only rechunks samples (width), while the latter rechunks both variants (length) and samples (width). Would it be possible to have a single rechunking operation? This might become clearer with documentation and use cases.

For VCF, vcf_to_zarr is broken down into vcf_to_zarrs and zarrs_to_dataset, which are really about managing parallelism (and Dask operations). I'm wondering if there are primitives that both high-level to_zarr functions for VCF and BGEN share that will make them more consistent to users? (There may not be, but we should try to make the high-level functions consistent at least.)

The review also included an inline comment on this snippet:

```python
    return ds


def unpack_variables(ds: Dataset, dtype: Any = "float32") -> Dataset:
```

Could use DType from sgkit.typing here.
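For example, something like this (assuming the DType alias is importable from sgkit.typing, as the suggestion implies):

```python
from xarray import Dataset

from sgkit.typing import DType  # dtype alias suggested above


def unpack_variables(ds: Dataset, dtype: DType = "float32") -> Dataset:
    ...
```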

@eric-czech (Collaborator, Author) replied:

I'm a bit confused about the relationship between rechunk_to_zarr and rechunk_from_zarr. The former only rechunks samples (width), while the latter rechunks both variants (length) and samples (width). Would it be possible to have a single rechunking operation? This might become clearer with documentation and use cases.

It is definitely confusing w/o more docs, but this is the intended flow:

  1. Run read_bgen(..., chunks=(50, -1)) yourself, where 50 is some short chunk height, since all of the samples will need to fit in memory for those 50 rows (see the back-of-the-envelope sketch after this list)
  2. Run rechunk_to_zarr with a new chunk width (the target chunk length is still checked here so that you won't get performance warnings about split blocks on read)
  3. Run rechunk_from_zarr to get a dataset with the overall desired chunking
  4. Write the dataset out however you'd like
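As a back-of-the-envelope check on why the chunk height in step 1 needs to stay small (the sample count here is a hypothetical biobank-scale number, not taken from this PR):

```python
# Memory needed to hold one 50-variant slab with no chunking in the samples dimension.
chunk_height = 50          # variants per chunk, as in step 1
n_samples = 500_000        # hypothetical sample count
n_genotypes = 3            # probabilities stored per call
bytes_per_value = 4        # float32

slab_bytes = chunk_height * n_samples * n_genotypes * bytes_per_value
print(f"{slab_bytes / 1e9:.1f} GB per slab")  # 0.3 GB -> taller chunks scale this up linearly
```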

Would it be possible to have a single rechunking operation?

I had started trying to do something like what you did, where you set the chunks encoding for the store before writing smaller arrays to it. That didn't work when operating on a whole dataset though. I'm sure it would work if I instead made a loop for the reads and did the appends myself, but I'm not sure the extra code is worth it in that case -- I would definitely say yes if there were a way to read from cloud stores w/o a FUSE mount. I'll give it one more go this week and see if there's an elegant way to write into a store with a longer chunk length without having to first load all the necessary chunks into memory.
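For reference, one way the "loop the reads and do the appends myself" idea could look, using xarray's append_dim support (the dimension name, slab size, and store path are assumptions; this is not code from the PR):

```python
# Hypothetical sketch: write a store with a longer variant chunk length by loading
# one slab at a time. `ds` is the dataset produced by read_bgen above.
store = "output.zarr"
slab = 5_000  # target variant chunk length; each slab is loaded fully into memory

n_variants = ds.dims["variants"]
for start in range(0, n_variants, slab):
    piece = ds.isel(variants=slice(start, start + slab)).compute()
    if start == 0:
        piece.to_zarr(store, mode="w")               # create the store with this chunking
    else:
        piece.to_zarr(store, append_dim="variants")  # append subsequent slabs
```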

For VCF, vcf_to_zarr is broken down into vcf_to_zarrs and zarrs_to_dataset, which are really about managing parallelism (and Dask operations). I'm wondering if there are primitives that both high-level to_zarr functions for VCF and BGEN share that will make them more consistent to users?

Hm, I don't have a good answer to that one, but I do think it's a little different in this case since there isn't any custom reading/writing code involved (yet). I don't see any reason we couldn't have a bgen_to_zarrs function with the same interface; I just didn't have a need for it in my UKB experiments since I wanted any looping / range subsets to live in the control plane (i.e. the Snakemake pipeline).

The optimize/de-optimize variable functionality (pack_variables/unpack_variables) could be worth making a shared primitive across the formats. Even after compression, packing reduces the space used to about 20% of the original for the bgen data I tested with.
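As a rough illustration of the kind of packing involved (a generic quantization example, not necessarily the exact scheme in pack_variables/unpack_variables), storing probabilities as uint8 instead of float32 already shrinks the raw bytes 4x before the compressor sees them:

```python
import numpy as np

# Generic lossy packing illustration -- not necessarily the PR's pack_variables scheme.
rng = np.random.default_rng(0)
probs = rng.random((100, 10, 3), dtype=np.float32)  # (variants, samples, genotype probabilities)

packed = np.round(probs * 255).astype(np.uint8)     # float32 -> uint8: 4x smaller before compression
unpacked = (packed / 255.0).astype(np.float32)      # lossy round trip

assert np.max(np.abs(probs - unpacked)) <= 0.5 / 255 + 1e-6  # max quantization error ~0.002
```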

@eric-czech changed the title from "Bgen to zarr (part 1)" to "Bgen to zarr (v1)" on Sep 22, 2020
@eric-czech mentioned this pull request on Sep 22, 2020
@eric-czech (Collaborator, Author) commented:

Closing in favor of #22.

@eric-czech closed this on Sep 22, 2020