This repository has been archived by the owner on Oct 15, 2020. It is now read-only.

Bgen to zarr (v1) #21

Closed · wants to merge 1 commit

Conversation

eric-czech (Collaborator) commented:

#16

This adds two functions:

  • rechunk_to_zarr
    • Saves a dataset with no chunking in the samples dimension into a temporary store with some desired sample chunking
    • Provides options for packing (via pack_variables) and compressing variables (via encode_variables) into a more efficient representation
  • rechunk_from_zarr
    • Applies rechunking to a desired variant chunk length from the temporary store
    • Has a companion unpack_variables function that can be used to undo the original packing

The first function takes a Dataset that a user would have created via read_bgen, and the second returns a Dataset that the user can then save elsewhere (a rough usage sketch is below). This isn't the full bgen_to_zarr implementation, but it covers all of the inner workings that would be needed to add a layer on top analogous to vcf_to_zarr. There is enough code here though that I wanted to push it up for review first.
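Roughly, the intended usage would look like the following sketch. The import path, store paths, and chunk-size keyword names are illustrative assumptions, not the exact signatures in this PR:

```python
# Illustrative sketch only -- keyword names and paths are assumptions.
# rechunk_to_zarr, rechunk_from_zarr, and unpack_variables are the functions added in this PR.
from sgkit_bgen import read_bgen  # assumed import path for the existing reader

# Read with short variant chunks and no chunking in the samples dimension
ds = read_bgen("example.bgen", chunks=(50, -1))

# Pack/encode and write to a temporary store with the desired sample chunking
rechunk_to_zarr(ds, "intermediate.zarr", chunk_width=10_000)

# Read back from the temporary store with the desired variant chunk length
ds2 = rechunk_from_zarr("intermediate.zarr", chunk_length=10_000)

# Optionally undo the packing, then save the result wherever you like
ds2 = unpack_variables(ds2)
ds2.to_zarr("final.zarr")
```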

tomwhite (Collaborator) left a comment:

This looks like it has a useful set of components to implement bgen_to_zarr, but I'm a bit confused about the relationship between rechunk_to_zarr and rechunk_from_zarr. The former only rechunks samples (width), while the latter rechunks both variants (length) and samples (width). Would it be possible to have a single rechunking operation? This might become clearer with documentation and use cases.

For VCF, vcf_to_zarr is broken down into vcf_to_zarrs and zarrs_to_dataset, which are really about managing parallelism (and Dask operations). I'm wondering if there are primitives that both high-level to_zarr functions for VCF and BGEN share that will make them more consistent to users? (There may not be, but we should try to make the high-level functions consistent at least.)

The review also included an inline comment on this snippet:

```python
    return ds


def unpack_variables(ds: Dataset, dtype: Any = "float32") -> Dataset:
```

Could use DType from sgkit.typing here.
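For example, something like this (assuming the DType alias is importable from sgkit.typing, as the suggestion implies):

```python
from xarray import Dataset

from sgkit.typing import DType  # dtype alias suggested above


def unpack_variables(ds: Dataset, dtype: DType = "float32") -> Dataset:
    ...
```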

@eric-czech (Collaborator, Author) replied:

I'm a bit confused about the relationship between rechunk_to_zarr and rechunk_from_zarr. The former only rechunks samples (width), while the latter rechunks both variants (length) and samples (width). Would it be possible to have a single rechunking operation? This might become clearer with documentation and use cases.

It is definitely confusing w/o more docs, but this is the intended flow:

  1. Run read_bgen(..., chunks=(50, -1)) yourself, where 50 is some short chunk height, since all of the samples will need to fit in memory for those 50 rows (see the back-of-the-envelope sketch after this list)
  2. Run rechunk_to_zarr with a new chunk width (the target chunk length is still checked here so that you won't get performance warnings about split blocks on read)
  3. Run rechunk_from_zarr to get a dataset with the overall desired chunking
  4. Write the dataset out however you'd like
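As a back-of-the-envelope check on why the chunk height in step 1 needs to stay small (the sample count here is a hypothetical biobank-scale number, not taken from this PR):

```python
# Memory needed to hold one 50-variant slab with no chunking in the samples dimension.
chunk_height = 50          # variants per chunk, as in step 1
n_samples = 500_000        # hypothetical sample count
n_genotypes = 3            # probabilities stored per call
bytes_per_value = 4        # float32

slab_bytes = chunk_height * n_samples * n_genotypes * bytes_per_value
print(f"{slab_bytes / 1e9:.1f} GB per slab")  # 0.3 GB -> taller chunks scale this up linearly
```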

Would it be possible to have a single rechunking operation?

I had started trying to do something like what you did, where you set the chunks encoding for the store before writing smaller arrays to it. That didn't work when operating on a whole dataset though. I'm sure it would work if I instead made a loop for the reads and did the appends myself, but I'm not sure the extra code is worth it in that case -- I would definitely say yes if there were a way to read from cloud stores w/o a FUSE mount. I'll give it one more go this week and see if there's an elegant way to write into a store with a longer chunk length without having to first load all the necessary chunks into memory.
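For reference, one way the "loop the reads and do the appends myself" idea could look, using xarray's append_dim support (the dimension name, slab size, and store path are assumptions; this is not code from the PR):

```python
# Hypothetical sketch: write a store with a longer variant chunk length by loading
# one slab at a time. `ds` is the dataset produced by read_bgen above.
store = "output.zarr"
slab = 5_000  # target variant chunk length; each slab is loaded fully into memory

n_variants = ds.dims["variants"]
for start in range(0, n_variants, slab):
    piece = ds.isel(variants=slice(start, start + slab)).compute()
    if start == 0:
        piece.to_zarr(store, mode="w")               # create the store with this chunking
    else:
        piece.to_zarr(store, append_dim="variants")  # append subsequent slabs
```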

For VCF, vcf_to_zarr is broken down into vcf_to_zarrs and zarrs_to_dataset, which are really about managing parallelism (and Dask operations). I'm wondering if there are primitives that both high-level to_zarr functions for VCF and BGEN share that will make them more consistent to users?

Hm, I don't have a good answer to that one, but I do think it's a little different in this case since there isn't any custom reading/writing code involved (yet). I don't see any reason we couldn't have a bgen_to_zarrs function with the same interface; I just didn't have a need for it in my UKB experiments since I wanted any looping / range subsets to live in the control plane (i.e. the Snakemake pipeline).

The optimize/de-optimize variable functionality (pack_variables/unpack_variables) could be worth making a shared primitive across the formats. Even after compression, packing reduces the space used to about 20% of the original for the bgen data I tested with.
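As a rough illustration of the kind of packing involved (a generic quantization example, not necessarily the exact scheme in pack_variables/unpack_variables), storing probabilities as uint8 instead of float32 already shrinks the raw bytes 4x before the compressor sees them:

```python
import numpy as np

# Generic lossy packing illustration -- not necessarily the PR's pack_variables scheme.
rng = np.random.default_rng(0)
probs = rng.random((100, 10, 3), dtype=np.float32)  # (variants, samples, genotype probabilities)

packed = np.round(probs * 255).astype(np.uint8)     # float32 -> uint8: 4x smaller before compression
unpacked = (packed / 255.0).astype(np.float32)      # lossy round trip

assert np.max(np.abs(probs - unpacked)) <= 0.5 / 255 + 1e-6  # max quantization error ~0.002
```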

@eric-czech changed the title from "Bgen to zarr (part 1)" to "Bgen to zarr (v1)" on Sep 22, 2020
@eric-czech mentioned this pull request on Sep 22, 2020
@eric-czech (Collaborator, Author) commented:

Closing in favor of #22.

@eric-czech closed this on Sep 22, 2020