
netcdf4-python support for s3 #95

Open
bnlawrence opened this issue Jun 16, 2023 · 18 comments
@bnlawrence
Collaborator

We believe that netcdf-c should support natural operations on a file held on an S3 storage system. We need to test and demonstrate that we can do this with netcdf4-python (perhaps via a version with a specially compiled C library).

(We need this to avoid having to rebuild all of cf-python with h5netcdf, with unknown implications for functionality and performance.)

@valeriupredoi valeriupredoi added the enhancement New feature or request label Jun 20, 2023
@valeriupredoi
Collaborator

@bnlawrence @davidhassell @markgoddard looks like we are very, very close to an S3-enabled netCDF4-python: Unidata/netcdf4-python#1264

@valeriupredoi
Copy link
Collaborator

that is now possible via this beautiful thing done by my bud @zklaus in conda-forge/libnetcdf-feedstock#178 - we'll have to wait for the netCDF4-python folk to build new wheels and conda binaries, but that's now just a matter of time, and not a matter of "if" 😁

@zklaus

zklaus commented Jul 6, 2023

Cheers 😄. Note that conda-forge/libnetcdf-feedstock#178 is not quite enough; we also need conda-forge/libnetcdf-feedstock#180.

@valeriupredoi
Collaborator

Right on, cheers, Klaus! But the big heavy lifting was what you've done in 178; I'm assuming it's now just a matter of time, not of will, anymore 😁 I owe you a 🍺 bud

@valeriupredoi
Collaborator

also pay attention to what Klaus mentions here conda-forge/libnetcdf-feedstock#178 (comment)

@valeriupredoi
Collaborator

valeriupredoi commented Jul 18, 2023

gentlemen @bnlawrence @davidhassell @markgoddard: conda-forge/libnetcdf-feedstock#180 has now been merged, so in theory we should be good to go to use netCDF4-python with S3 - although in light of #95 (comment) it looks like some mucking about will still be needed 😁 Of course, one needs to wait for wheels now ☸️

@markgoddard

Something I noticed while working on byte order and missing data support - Kerchunk's SingleHdf5ToZarr uses h5py to read the netcdf metadata when creating a Zarr array. There definitely seems to be some oddness there in what we get back. In particular, when a byte order/endianness is specified on a variable, it always seems to show as big endian in the Zarr array. I have a workaround for this in #132, but I expect there are other issues, in particular around compressed data.
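
For illustration, a minimal sketch of where the two libraries can disagree - hedged, since it assumes a hypothetical local file test.nc containing a variable x written with an explicit byte order:

    import h5py
    import netCDF4

    # netCDF4-python reports the byte order recorded for the variable
    with netCDF4.Dataset("test.nc") as ds:
        print(ds["x"].endian())        # "little", "big", or "native"

    # ...while the numpy dtype h5py hands back (roughly what ends up in
    # the Zarr metadata via SingleHdf5ToZarr) carries its own flag
    with h5py.File("test.nc", "r") as f:
        print(f["x"].dtype.byteorder)  # "<", ">", or "=" (native)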

@valeriupredoi
Collaborator

in S3-netCDF4 news, there is bad news - S3 support had to be reverted from libnetcdf since it was apparently causing flakiness (conda-forge/libnetcdf-feedstock#188). I created a test branch to see when/how that's ready to work: https://github.com/valeriupredoi/PyActiveStorage/tree/try_netcdf4

@valeriupredoi
Collaborator

valeriupredoi commented Sep 18, 2023

OK, in light of the bad news listed above, I got some fairly solid progress done on this: it turns out we don't actually need S3 support in netCDF4 per se; all we have to do is be clever about how we read it as a remote file:

    import netCDF4
    import s3fs

    fs = s3fs.S3FileSystem(anon=True)  # adjust endpoint/credentials as needed
    with fs.open(s3_testfile_uri, 'rb') as s3f:
        nc_bytes = s3f.read()

    ds = netCDF4.Dataset('inmemory.nc', memory=nc_bytes)  # in-memory Dataset

will read the S3 file into memory and hand us a very nice netCDF4 Dataset at the client: you can see it in action in this GitHub Actions run https://github.com/valeriupredoi/PyActiveStorage/actions/runs/6225016972/job/16894516376 - the problem is that calling netCDF4.Dataset on the memory object dumps all the data into a real and heavy netCDF4-like item. All we need to do is understand how not to do that 😁

@bnlawrence
Collaborator Author

I think the problem with that is that you've read the entire file across the network. Ideally we'd be using range-get and only getting metadata in the first instance, so it would remain lazy for actual data.
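
For comparison, a minimal sketch of a lazier pattern, assuming an s3fs endpoint and a hypothetical object name: handing h5py the open (unread) file object means only the byte ranges actually touched get fetched.

    import h5py
    import s3fs

    fs = s3fs.S3FileSystem(anon=True)
    # no .read() here: bytes are fetched via range GETs only on access
    s3f = fs.open("s3://bucket/file.nc", "rb")  # hypothetical object
    f = h5py.File(s3f, "r")   # h5py accepts file-like objects
    print(list(f.keys()))     # touches only header/metadata ranges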

@zklaus

zklaus commented Sep 19, 2023

You should be able to construct a simple https URL using bucket name and object name as https://localhost:9000/bucketname/object. Then you should be able to use #byte-range to open the whole thing without downloading it.

The caveats are:

  • for this to work, the policy needs to be set correctly on MinIO (see Public URL access not working minio/minio#7393)
  • this negates part of the S3 advantage, since the URL must be public and might not offer the same resilience (particularly wrt different hosts...)

But for this testing, that is appropriate. Hopefully, we'll resolve the S3 support before things move into production, pending the resolution of Unidata/netcdf-c#2739.
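
A minimal sketch of that approach, assuming a libnetcdf built with byte-range support and the placeholder host/bucket/object above (in netCDF4-python, byte-range access is selected with a "#mode=bytes" URL suffix):

    import netCDF4

    # the "#mode=bytes" fragment makes libnetcdf read the object with
    # HTTP range GETs instead of downloading the whole file
    url = "https://localhost:9000/bucketname/object#mode=bytes"
    ds = netCDF4.Dataset(url)
    print(ds.variables.keys())
    ds.close()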

@valeriupredoi
Collaborator

valeriupredoi commented Sep 19, 2023

@bnlawrence exactly - I think there is a way to MacGyver that 😁 @zklaus cheers, but the problem is not solved by creating a public URL - we need support for s3:// file URLs in netCDF4-python itself. I'll try to see what I can do with what we currently have, hoping this will be sorted out upstream, like you did before (and that was so close!) 🍺

@markgoddard

> Something I noticed while working on byte order and missing data support - Kerchunk's SingleHdf5ToZarr uses h5py to read the netcdf metadata when creating a Zarr array. There definitely seems to be some oddness there in what we get back. In particular, when a byte order/endianness is specified on a variable, it always seems to show as big endian in the Zarr array. I have a workaround for this in #132, but I expect there are other issues, in particular around compressed data.

Just bumping this ^ While it may be an easy enough switch to netcdf4-python in PyActiveStorage when reading the metadata, swapping out h5py for netcdf4-python in kerchunk would be more difficult, and potentially more contentious.

@bnlawrence
Collaborator Author

bnlawrence commented Sep 19, 2023

Good point wrt the kerchunk h5py dependency ... need to think about that.

Without doing proper due diligence, I suspect that the kerchunk-to-Zarr transition uses the endian specification to ensure that Zarr always sees the data with the same endianness, regardless of what is in the netcdf file ... IIRC they do that in the chunk load step.

@valeriupredoi
Collaborator

> Something I noticed while working on byte order and missing data support - Kerchunk's SingleHdf5ToZarr uses h5py to read the netcdf metadata when creating a Zarr array. There definitely seems to be some oddness there in what we get back. In particular, when a byte order/endianness is specified on a variable, it always seems to show as big endian in the Zarr array. I have a workaround for this in #132, but I expect there are other issues, in particular around compressed data.
>
> Just bumping this ^ While it may be an easy enough switch to netcdf4-python in PyActiveStorage when reading the metadata, swapping out h5py for netcdf4-python in kerchunk would be more difficult, and potentially more contentious.

good move to bump it! These things happen because we cobbled this package together from a bunch of other tools that have their own peculiarities, for which others have their own workarounds. About endianness specifically, no clue, but about the other thing @markgoddard has raised (the chunk gets downloaded by SingleHdf5ToZarr) I think we have to be very careful with how we build the fs that kerchunk uses - see the sketch below
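
A sketch of the kind of care that means in practice, following the storage options kerchunk's documentation suggests (the s3:// URL is hypothetical): caching the first block keeps repeated metadata reads off the network, while disabling fill caching stops fsspec from eagerly pulling chunk data.

    import fsspec
    from kerchunk.hdf import SingleHdf5ToZarr

    url = "s3://bucket/file.nc"  # hypothetical object
    # "first" caches the first block (HDF5 superblock and metadata);
    # fill_cache=False avoids dragging chunk data over the network
    so = dict(mode="rb", anon=True, default_fill_cache=False,
              default_cache_type="first")
    with fsspec.open(url, **so) as inf:
        refs = SingleHdf5ToZarr(inf, url, inline_threshold=100).translate()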

@valeriupredoi
Collaborator

> Something I noticed while working on byte order and missing data support - Kerchunk's SingleHdf5ToZarr uses h5py to read the netcdf metadata when creating a Zarr array. There definitely seems to be some oddness there in what we get back. In particular, when a byte order/endianness is specified on a variable, it always seems to show as big endian in the Zarr array. I have a workaround for this in #132, but I expect there are other issues, in particular around compressed data.
>
> Just bumping this ^ While it may be an easy enough switch to netcdf4-python in PyActiveStorage when reading the metadata, swapping out h5py for netcdf4-python in kerchunk would be more difficult, and potentially more contentious.

In fact, I'd make this a separate issue 👍

@bnlawrence
Collaborator Author

+1 for a separate issue.

@valeriupredoi
Collaborator

just keeping this alive via a wee summary: here are the netcdf-c pain points that would need to be fixed - Unidata/netcdf-c#2739 and Unidata/netcdf-c#2777 - plus my comment #95 (comment) pointing to the reversal of S3 capability in libnetcdf due to flakiness.

As you can see from the second issue listed there, there are already strides towards Zarr support via a reference filesystem and chunks.
