Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Vendor kerchunk readers? #377

Closed
TomNicholas opened this issue Jan 13, 2025 · 4 comments
Closed

Vendor kerchunk readers? #377

TomNicholas opened this issue Jan 13, 2025 · 4 comments
Labels
dependencies Updates a dependency internals Kerchunk Relating to the kerchunk library / specification itself references generation Reading byte ranges from archival files zarr-python Relevant to zarr-python upstream

Comments

@TomNicholas
Copy link
Member

TomNicholas commented Jan 13, 2025

We still have dependency issues with kerchunk not supporting zarr-python v3, but icechunk requiring it (see #321). It unfortunately doesn't look like these will be resolved in kerchunk any time soon.

Right now we only actually use kerchunk code directly within VirtualiZarr for:

  • the netCDF3 reader
  • the HDF5 reader (which we now have Sean's alternative HDF reader for)
  • the FITS reader (which is cool but fewer people use)

We really want to be able to pin zarr-python>=3.0.0, for so many reasons. But right now doing so would break those readers in VirtualiZarr because of the kerchunk incompatibility.

I realised yesterday that the most expedient thing to do here might be to just vendor (i.e. copy-paste) the code for the FITS and netCDF3 readers into VirtualiZarr.

The implementations for these are:

  • the only part of kerchunk we need to keep right now to make our tests pass (as we don't currently have a working TIFF reader - see open_virtual_dataset fails to open tiffs #291),
  • they already work with zarr v3 (I think - the netCDF3 implementation doesn't even import zarr so that one really should work),
  • they have barely been updated in over a year so we're not likely to miss out on regular new changes,
  • they're an easily abstracted component that we can fix/replace later,
  • They're not that big, ~300 lines each.

That would allow us to go full steam ahead with all the other things we need to do to be able to pin zarr-python>=3.0.0 (e.g. #374, #182, #175).

FYI @bamford - I see you're the last person that committed to the FITS reader (fsspec/kerchunk#531), so you will want to be aware of us potentially forking it!

@TomNicholas TomNicholas added dependencies Updates a dependency internals Kerchunk Relating to the kerchunk library / specification itself references generation Reading byte ranges from archival files zarr-python Relevant to zarr-python upstream labels Jan 13, 2025
@maxrjones
Copy link
Member

We still have dependency issues with kerchunk not supporting zarr-python v3, but icechunk requiring it (see #321). It unfortunately doesn't look like these will be resolved in kerchunk any time soon.

@martindurant offered to work on kerchunk <-> zarr-python v3 compatibility over in fsspec/kerchunk#516 (comment). While I understand there may be times in which vendoring is necessary, I'm a -1 on this approach unless we first try to resolve any communication gaps about the urgency of kerchunk adopting zarr-python v3 for downstream packages.

@TomNicholas understanding that the ship has sailed for kerchunk compatibility prior to the zarr-python v3 release, can you provide a timeline on which you'd wait on Martin finishing the work started in fsspec/kerchunk#516? e.g., a day, one week, two weeks? This would help guide the conversation of whether an extreme solution like vendering most of kerchunk's reader code is necessary. Thanks for your continuous work on making sure this ecosystem moves as fast as possible!

@martindurant, would you be able to offer some insight into your potential timeline for working on fsspec/kerchunk#516? Thank you for offering to take that up - it will be really impactful for people wanting to use Kerchunk and VirtualiZarr with Icechunk!

@martindurant
Copy link
Member

would you be able to offer some insight into your potential timeline for working on

I am assessing now. The combine module will take some time for sure, but that's not necessary for Vzarr.

@TomNicholas
Copy link
Member Author

I'm a -1 on this approach unless we first try to resolve any communication gaps about the urgency of kerchunk adopting zarr-python v3 for downstream packages.

@maxrjones that's totally fair, and thank you for doing the work of reaching back out again here. I had not seen your comment on @mpiannucci 's branch (now subscribed) when I raised this issue - the last activity I had seen was in November.

Obviously if @martindurant is able to get get kerchunk updated to work with zarr-python v3 then that would be perfect! The suggestion in this issue is that by vendoring (a relatively small amount of) code we could remove any time pressure on Martin, and move forwards ourselves.

The combine module will take some time for sure, but that's not necessary for Vzarr.

Exactly - in VirtualiZarr we're currently blocked by components we don't actually need.

can you provide a timeline on which you'd wait on Martin finishing the work

I'm unlikely to have time to actually write code for this library for the next 3 weeks, but after that I want to prioritize pushing it forwards, so that people can start getting their data into Icechunk without unnecessary friction. So let's say a month? If it doesn't work by then I would prefer to just vendor things so we can move forwards. Obviously if kerchunk compatibility with v3 is achieved sometime after that we can still easily un-vendor and re-introduce the optional dependency (which is the ideal situation).

@maxrjones
Copy link
Member

Closing with the plan to wait on a kerchunk release rather than vendoring the readers

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
dependencies Updates a dependency internals Kerchunk Relating to the kerchunk library / specification itself references generation Reading byte ranges from archival files zarr-python Relevant to zarr-python upstream
Projects
Development

No branches or pull requests

3 participants