
netcdf4-python support for s3 #95

Open
bnlawrence opened this issue Jun 16, 2023 · 18 comments
@bnlawrence
Collaborator

We believe that netcdf-c should support natural operations on a file held on an S3 storage system. We need to test and demonstrate that we can do this with netcdf4-python (perhaps via a version with a specially compiled C library).

(We need this to avoid having to rebuild all of cf-python with h5netcdf, with unknown implications for functionality and performance.)

@valeriupredoi valeriupredoi added the enhancement New feature or request label Jun 20, 2023
@valeriupredoi
Collaborator

@bnlawrence @davidhassell @markgoddard looks like we are very, very close to an S3-enabled netCDF4-python: Unidata/netcdf4-python#1264

@valeriupredoi
Copy link
Collaborator

that is now possible via this beautiful thing done by my bud @zklaus in conda-forge/libnetcdf-feedstock#178 - we'll have to wait for the netCDF4-python folk to build new wheels and conda binaries, but that's now just a matter of time, and not a matter of "if" 😁

@zklaus

zklaus commented Jul 6, 2023

Cheers 😄. Note that conda-forge/libnetcdf-feedstock#178 is not quite enough; we also need conda-forge/libnetcdf-feedstock#180.

@valeriupredoi
Collaborator

Right on, cheers, Klaus! But the big heavy lifting was what you've done in 178; I'm assuming it's now just a matter of time, not of will, anymore 😁 I owe you a 🍺 bud

@valeriupredoi
Collaborator

also pay attention to what Klaus mentions here conda-forge/libnetcdf-feedstock#178 (comment)

@valeriupredoi
Collaborator

valeriupredoi commented Jul 18, 2023

gentlemen @bnlawrence @davidhassell @markgoddard: conda-forge/libnetcdf-feedstock#180 has now been merged, so in theory we should be good to go to use netCDF4-python with S3 - although in light of #95 (comment) it looks like some mucking about will still be needed 😁 Of course, one needs to wait for wheels now ☸️

@markgoddard

Something I noticed while working on byte order and missing data support - Kerchunk's SingleHdf5ToZarr uses h5py to read the netcdf metadata when creating a Zarr array. There definitely seems to be some oddness there in what we get back. In particular, when a byte order/endianness is specified on a variable, it always seems to show as big endian in the Zarr array. I have a workaround for this in #132, but I expect there are other issues, in particular around compressed data.
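
For illustration, a minimal sketch of where the two libraries can disagree - hedged, since it assumes a hypothetical local file test.nc containing a variable x written with an explicit byte order:

    import h5py
    import netCDF4

    # netCDF4-python reports the byte order recorded for the variable
    with netCDF4.Dataset("test.nc") as ds:
        print(ds["x"].endian())        # "little", "big", or "native"

    # ...while the numpy dtype h5py hands back (roughly what ends up in
    # the Zarr metadata via SingleHdf5ToZarr) carries its own flag
    with h5py.File("test.nc", "r") as f:
        print(f["x"].dtype.byteorder)  # "<", ">", or "=" (native)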

@valeriupredoi
Collaborator

in S3-netCDF4 news, there is bad news - S3 support had to be reverted from libnetcdf since it was apparently causing flakiness (conda-forge/libnetcdf-feedstock#188). I created a test branch to see when/how that's ready to work: https://github.com/valeriupredoi/PyActiveStorage/tree/try_netcdf4

@valeriupredoi
Collaborator

valeriupredoi commented Sep 18, 2023

OK, in light of the bad news listed above, I got some fairly solid progress done on this: it turns out we don't actually need S3 support in netCDF4 per se; all we have to do is be clever about how we read it as a remote file:

    import netCDF4
    import s3fs

    fs = s3fs.S3FileSystem(anon=True)  # adjust endpoint/credentials as needed
    with fs.open(s3_testfile_uri, 'rb') as s3f:
        nc_bytes = s3f.read()

    ds = netCDF4.Dataset('inmemory.nc', memory=nc_bytes)  # in-memory Dataset

will read the S3 file into memory and hand us a very nice netCDF4 Dataset at the client: you can see it in action in this GitHub Actions run https://github.com/valeriupredoi/PyActiveStorage/actions/runs/6225016972/job/16894516376 - the problem is that calling netCDF4.Dataset on the memory object dumps all the data into a real and heavy netCDF4-like item. All we need to do is understand how not to do that 😁

@bnlawrence
Collaborator Author

I think the problem with that is that you've read the entire file across the network. Ideally we'd be using range-get and only getting metadata in the first instance, so it would remain lazy for actual data.
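
For comparison, a minimal sketch of a lazier pattern, assuming an s3fs endpoint and a hypothetical object name: handing h5py the open (unread) file object means only the byte ranges actually touched get fetched.

    import h5py
    import s3fs

    fs = s3fs.S3FileSystem(anon=True)
    # no .read() here: bytes are fetched via range GETs only on access
    s3f = fs.open("s3://bucket/file.nc", "rb")  # hypothetical object
    f = h5py.File(s3f, "r")   # h5py accepts file-like objects
    print(list(f.keys()))     # touches only header/metadata ranges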

@zklaus

zklaus commented Sep 19, 2023

You should be able to construct a simple https URL using bucket name and object name as https://localhost:9000/bucketname/object. Then you should be able to use #byte-range to open the whole thing without downloading it.

The caveats are:

  • for this to work, the policy needs to be set correctly on MinIO (see Public URL access not working minio/minio#7393)
  • this negates part of the S3 advantage, since the URL must be public and might not offer the same resilience (particularly wrt different hosts...)

But for this testing, that is appropriate. Hopefully, we'll resolve the S3 support before things move into production, pending the resolution of Unidata/netcdf-c#2739.
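
A minimal sketch of that approach, assuming a libnetcdf built with byte-range support and the placeholder host/bucket/object above (in netCDF4-python, byte-range access is selected with a "#mode=bytes" URL suffix):

    import netCDF4

    # the "#mode=bytes" fragment makes libnetcdf read the object with
    # HTTP range GETs instead of downloading the whole file
    url = "https://localhost:9000/bucketname/object#mode=bytes"
    ds = netCDF4.Dataset(url)
    print(ds.variables.keys())
    ds.close()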

@valeriupredoi
Collaborator

valeriupredoi commented Sep 19, 2023

@bnlawrence exactly - I think there is a way to MacGyver that 😁 @zklaus cheers, but the problem is not solved by creating a public URL - we need support for s3:// file URLs in netCDF4-python itself. I'll try to see what I can do with what we currently have, hoping this will be sorted out upstream, like you did before (and that was so close!) 🍺

@markgoddard

> Something I noticed while working on byte order and missing data support - Kerchunk's SingleHdf5ToZarr uses h5py to read the netcdf metadata when creating a Zarr array. There definitely seems to be some oddness there in what we get back. In particular, when a byte order/endianness is specified on a variable, it always seems to show as big endian in the Zarr array. I have a workaround for this in #132, but I expect there are other issues, in particular around compressed data.

Just bumping this ^ While it may be an easy enough switch to netcdf4-python in PyActiveStorage when reading the metadata, swapping out h5py for netcdf4-python in kerchunk would be more difficult, and potentially more contentious.

@bnlawrence
Collaborator Author

bnlawrence commented Sep 19, 2023

Good point wrt the kerchunk h5py dependency ... need to think about that.

Without doing proper due diligence, I suspect that the kerchunk-to-Zarr transition uses the endian specification to ensure that Zarr always sees the data with the same endianness, regardless of what is in the netcdf file ... IIRC they do that in the chunk load step.

@valeriupredoi
Collaborator

> Something I noticed while working on byte order and missing data support - Kerchunk's SingleHdf5ToZarr uses h5py to read the netcdf metadata when creating a Zarr array. There definitely seems to be some oddness there in what we get back. In particular, when a byte order/endianness is specified on a variable, it always seems to show as big endian in the Zarr array. I have a workaround for this in #132, but I expect there are other issues, in particular around compressed data.
>
> Just bumping this ^ While it may be an easy enough switch to netcdf4-python in PyActiveStorage when reading the metadata, swapping out h5py for netcdf4-python in kerchunk would be more difficult, and potentially more contentious.

good move to bump it! These things happen because we cobbled this package together from a bunch of other tools that have their own peculiarities, for which others have their own workarounds. About endianness specifically, no clue, but about the other thing @markgoddard has raised (the chunk gets downloaded by SingleHdf5ToZarr) I think we have to be very careful with how we build the fs that kerchunk uses - see the sketch below
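
A sketch of the kind of care that means in practice, following the storage options kerchunk's documentation suggests (the s3:// URL is hypothetical): caching the first block keeps repeated metadata reads off the network, while disabling fill caching stops fsspec from eagerly pulling chunk data.

    import fsspec
    from kerchunk.hdf import SingleHdf5ToZarr

    url = "s3://bucket/file.nc"  # hypothetical object
    # "first" caches the first block (HDF5 superblock and metadata);
    # fill_cache=False avoids dragging chunk data over the network
    so = dict(mode="rb", anon=True, default_fill_cache=False,
              default_cache_type="first")
    with fsspec.open(url, **so) as inf:
        refs = SingleHdf5ToZarr(inf, url, inline_threshold=100).translate()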

@valeriupredoi
Collaborator

> Something I noticed while working on byte order and missing data support - Kerchunk's SingleHdf5ToZarr uses h5py to read the netcdf metadata when creating a Zarr array. There definitely seems to be some oddness there in what we get back. In particular, when a byte order/endianness is specified on a variable, it always seems to show as big endian in the Zarr array. I have a workaround for this in #132, but I expect there are other issues, in particular around compressed data.
>
> Just bumping this ^ While it may be an easy enough switch to netcdf4-python in PyActiveStorage when reading the metadata, swapping out h5py for netcdf4-python in kerchunk would be more difficult, and potentially more contentious.

In fact, I'd make this a separate issue 👍

@bnlawrence
Collaborator Author

+1 for a separate issue.

@valeriupredoi
Collaborator

just keeping this alive via a wee summary: here are the netcdf-c pain points that would need to be fixed - Unidata/netcdf-c#2739 and Unidata/netcdf-c#2777 - plus my comment #95 (comment) pointing to the reversal of S3 capability in libnetcdf due to flakiness.

As you can see from the second issue listed there, there are already strides towards Zarr support via a reference filesystem and chunks.
