Conversation
Branch updated: 64cd634 → d132756 → 4ee4ee2
This looks like it has a useful set of components to implement `bgen_to_zarr`, but I'm a bit confused about the relationship between `rechunk_to_zarr` and `rechunk_from_zarr`. The former only rechunks samples (width), while the latter rechunks both variants (length) and samples (width). Would it be possible to have a single rechunking operation? This might become clearer with documentation and use cases.

For VCF, `vcf_to_zarr` is broken down into `vcf_to_zarrs` and `zarrs_to_dataset`, which are really about managing parallelism (and Dask operations). I'm wondering if there are primitives that both high-level `to_zarr` functions for VCF and BGEN share that would make them more consistent for users? (There may not be, but we should try to make the high-level functions consistent at least.)
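For reference, a minimal xarray sketch of the two chunking patterns described above. The dimension names follow sgkit conventions, and the function-name comments are purely illustrative; nothing here is taken from the PR's actual signatures.

```python
import numpy as np
import xarray as xr

# Toy dataset with sgkit-style dimensions.
ds = xr.Dataset(
    {"call_dosage": (("variants", "samples"), np.random.rand(100, 50))}
)

# rechunk_to_zarr-style: rechunk only along samples (width),
# leaving the variants chunking untouched.
ds_width = ds.chunk({"samples": 10})

# rechunk_from_zarr-style: rechunk along both variants (length)
# and samples (width).
ds_both = ds.chunk({"variants": 20, "samples": 10})
```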
```python
    return ds


def unpack_variables(ds: Dataset, dtype: Any = "float32") -> Dataset:
```
Could use `DType` from `sgkit.typing` here.
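Concretely, that suggestion would look something like this (a sketch; `DType` is the alias exported by `sgkit.typing`):

```python
from sgkit.typing import DType
from xarray import Dataset


def unpack_variables(ds: Dataset, dtype: DType = "float32") -> Dataset:
    ...
```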
It is definitely confusing w/o more docs, but this is the intended flow:
I had started trying to do something like what you did where you set the chunks encoding for the store before writing smaller arrays to it. That didn't work when operating on a whole dataset though. I'm sure it would work if I instead made a loop for the reads and did the appends myself. I'm not sure if the extra code is worth it in that case -- I would definitely say yes if there was a way to read from cloud stores w/o a FUSE mount. I'll give that one more go though this week and see if there's an elegant way to write into a store with a longer chunk length without having to first load all the necessary chunks into memory.
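For what it's worth, a minimal sketch of the loop-and-append approach described above, using xarray's Zarr backend. The store path, variable name, dimension sizes, and batch structure are all made up for illustration; the point is that the target chunk length is set via `encoding` on the first write and subsequent batches are appended along the variants dimension.

```python
import numpy as np
import xarray as xr

store = "example.zarr"  # hypothetical path; could also be an fsspec mapping
chunk_length = 1000     # desired (longer) chunk size along variants


def make_batch(i: int) -> xr.Dataset:
    # Stand-in for a batch read from bgen; 100 variants per batch.
    data = np.random.default_rng(i).random((100, 50)).astype("float32")
    return xr.Dataset({"call_dosage": (("variants", "samples"), data)})


for i in range(5):
    batch = make_batch(i)
    if i == 0:
        # Set the target chunking via encoding on the first write only;
        # Zarr keeps these chunks for subsequent appends.
        encoding = {"call_dosage": {"chunks": (chunk_length, 50)}}
        batch.to_zarr(store, mode="w", encoding=encoding)
    else:
        # Later batches are appended along variants without re-specifying encoding.
        batch.to_zarr(store, append_dim="variants")
```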
Hm, I don't have a good answer to that one, but I do think it's a little different in this case since there isn't any custom reading/writing code involved (yet). I don't see any reason we couldn't have a

The optimize/de-optimize variables functionality could be something worthy of being a shared primitive across the formats. Even after compression, that reduces space used to about 20% of the original for the bgen data I tested with.
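As an aside, a rough illustration of the kind of "optimize/de-optimize" packing being discussed, assuming a probability variable named `call_genotype_probability` and a simple uint8 quantization; the actual `pack_variables`/`unpack_variables` in this PR may work quite differently.

```python
import xarray as xr

# Illustrative only: quantize a float probability variable to uint8 on the
# way in, and restore an approximate float version on the way out.


def pack(ds: xr.Dataset) -> xr.Dataset:
    gp = ds["call_genotype_probability"]
    packed = (gp * 255).round().astype("uint8")
    return ds.assign(call_genotype_probability=packed)


def unpack(ds: xr.Dataset, dtype: str = "float32") -> xr.Dataset:
    gp = ds["call_genotype_probability"].astype(dtype) / 255
    return ds.assign(call_genotype_probability=gp)
```

Quantizing a float probability down to a single byte is where most of the size reduction would come from, with the compressor doing the rest.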
Closing in favor of #22.
#16
This adds two functions:

- `rechunk_to_zarr`, packing variables (via `pack_variables`) and compressing variables (via `encode_variables`) into a more efficient representation
- `rechunk_from_zarr`, along with an `unpack_variables` function that can be used to undo the original packing

The first function takes a `Dataset` that a user would have created via `read_bgen`, and the second returns a `Dataset` that a user could then save elsewhere. This isn't the full `bgen_to_zarr` implementation, but it is all the inner workings that would be needed to add a layer on top like `vcf_to_zarr`. This is enough code, though, that I wanted to push it up for review first.
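A sketch of how the two functions might be strung together, based on the description above. The import locations, signatures, and store paths are assumptions for illustration; only the function names come from this PR.

```python
# Import locations and call signatures are assumed, not taken from the PR.
from sgkit_bgen import read_bgen
from sgkit_bgen.bgen_reader import rechunk_from_zarr, rechunk_to_zarr

# Read the bgen into an xarray Dataset.
ds = read_bgen("example.bgen")

# Stage 1: write to an intermediate Zarr store, packing/encoding variables
# along the way.
rechunk_to_zarr(ds, "intermediate.zarr")

# Stage 2: read back with the new chunking and unpack, producing a Dataset
# the user can then save wherever they like.
ds2 = rechunk_from_zarr("intermediate.zarr")
ds2.to_zarr("final.zarr")
```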