
Explicitly calculate dtype element size in netCDF3 records #466

Merged (7 commits) — Jun 24, 2024

Conversation

@martindurant (Member)

Fixes #465

@rsignell-usgs please test. This is not what I thought was happening, and I don't know why numpy's dt.itemsize can't be trusted here.
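(One plausible way `dt.itemsize` can disagree with the sum of the field sizes is alignment padding in structured dtypes; the thread doesn't state that this is the exact cause in the netCDF3 record code, so the sketch below is only an illustration of the general pitfall.)

```python
import numpy as np

# Two structured dtypes with identical fields: one packed, one C-aligned.
packed = np.dtype([("flag", "i1"), ("value", "i4")])
aligned = np.dtype([("flag", "i1"), ("value", "i4")], align=True)

# Explicit element size: sum the individual field sizes (1 + 4 = 5 bytes).
field_bytes = sum(packed.fields[name][0].itemsize for name in packed.names)

print(packed.itemsize)   # 5 -- no padding
print(aligned.itemsize)  # 8 -- includes 3 bytes of alignment padding
print(field_bytes)       # 5 -- the explicitly calculated size
```

Summing the field sizes explicitly, as this PR's title describes, sidesteps any padding that `itemsize` may include.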

@rsignell

@martindurant I tried testing, but I don't know whether it's not working or simply user error opening in xarray (cell [8] in https://gist.github.com/rsignell/cb6e3ed842abedb797e2cd8ccc39169c).

@martindurant (Member, Author)

Should have been:

so = dict(anon=True, skip_instance_cache=True)

(i.e., remove the mode= argument). This produces:

<xarray.Dataset>
Dimensions:            (depth: 40, lat: 3251, lon: 4500, time: 1)
Coordinates:
  * depth              (depth) float64 0.0 2.0 4.0 6.0 ... 3e+03 4e+03 5e+03
  * lat                (lat) float64 -80.0 -79.96 -79.92 ... 89.92 89.96 90.0
  * lon                (lon) float64 -180.0 -179.9 -179.8 ... 179.8 179.8 179.9
  * time               (time) datetime64[ns] 2014-01-01T12:00:00
Data variables:
    salinity           (time, depth, lat, lon) float32 dask.array<chunksize=(1, 40, 3251, 4500), meta=np.ndarray>
    salinity_bottom    (time, lat, lon) float32 dask.array<chunksize=(1, 3251, 4500), meta=np.ndarray>
    surf_el            (time, lat, lon) float32 dask.array<chunksize=(1, 3251, 4500), meta=np.ndarray>
    water_temp         (time, depth, lat, lon) float32 dask.array<chunksize=(1, 40, 3251, 4500), meta=np.ndarray>
    water_temp_bottom  (time, lat, lon) float32 dask.array<chunksize=(1, 3251, 4500), meta=np.ndarray>
    water_u            (time, depth, lat, lon) float32 dask.array<chunksize=(1, 40, 3251, 4500), meta=np.ndarray>
    water_u_bottom     (time, lat, lon) float32 dask.array<chunksize=(1, 3251, 4500), meta=np.ndarray>
    water_v            (time, depth, lat, lon) float32 dask.array<chunksize=(1, 40, 3251, 4500), meta=np.ndarray>
    water_v_bottom     (time, lat, lon) float32 dask.array<chunksize=(1, 3251, 4500), meta=np.ndarray>
Attributes:
    Conventions:               CF-1.6 NAVO_netcdf_v1.1
    classification_authority:  not applicable
    classification_level:      UNCLASSIFIED
    distribution_statement:    Approved for public release. Distribution unli...
    downgrade_date:            not applicable
    field_type:                instantaneous
    history:                   archv2ncdf3z
    institution:               Naval Oceanographic Office
    source:                    HYCOM archive file

Note from our discussion: you can split any of the big arrays on the first dimension, which here is depth, length 40 (kerchunk.utils.subchunk). So long as you choose an exact divisor, e.g., 8 or 10, it will work. The on-disk size unsplit is 1.17e9 bytes, so it is probably worth doing this. I'm not actually sure whether the split would still be possible after combine; better to do it on the simple JSON/dict from this first stage. (I hope I have accounted for subchunking where the very first dimension is 1, and so can be ignored.)
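(The exact-divisor constraint above can be sketched numerically; this illustrates only the arithmetic of choosing a factor, not the kerchunk.utils.subchunk call itself, and the 1.17e9-byte figure is the one quoted in this thread.)

```python
def exact_divisors(n: int) -> list[int]:
    """Factors that split a dimension of length n into equal whole chunks."""
    return [d for d in range(2, n + 1) if n % d == 0]

depth = 40            # length of the first splittable dimension here
unsplit_bytes = 1.17e9  # on-disk size of one unsplit chunk, per the thread

for f in exact_divisors(depth):
    print(f"factor {f:2d}: {depth // f} depth levels per chunk, "
          f"~{unsplit_bytes / f / 1e6:.0f} MB per chunk")
```

A factor that does not divide 40 exactly (say, 7) would leave a ragged final chunk, which the reference-splitting scheme cannot represent.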

@rsignell mentioned this pull request on Jun 23, 2024
@martindurant (Member, Author)

Test should be fixed by fsspec/filesystem_spec#1634

@martindurant (Member, Author)

I'll merge this now and we can open any further issues that might arise.

@martindurant merged commit ae692fe into fsspec:main on Jun 24, 2024
5 checks passed
@martindurant deleted the complex_cdf3_dt branch on June 24, 2024 at 15:20
@rsignell

Woohoo! It's working!

(screenshot attached)

@martindurant (Member, Author)

We'll need to decide what the best size is in the end. Having smaller chunks means more requests and a bigger reference set on disk. Since it's pretty easy to generate, we could make multiple variants and profile them.
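(The tradeoff described above can be roughed out before profiling; the chunk size is the ~1.17 GB figure from earlier in the thread, and the variable count is taken from the dataset repr, but the request and reference counts are back-of-the-envelope estimates, not measurements.)

```python
# Smaller chunks -> more requests per full-variable read and a bigger
# reference set on disk; larger chunks -> fewer, heavier requests.
unsplit_bytes = 1.17e9  # one unsplit (time, depth, lat, lon) chunk
n_vars_4d = 4           # salinity, water_temp, water_u, water_v

for factor in (1, 8, 10, 40):
    refs = factor * n_vars_4d
    print(f"factor {factor:2d}: {unsplit_bytes / factor / 1e6:7.1f} MB/chunk, "
          f"{refs:3d} references across the 4-D variables")
```

Generating one reference set per candidate factor and timing a representative read against each, as suggested above, would turn these estimates into real numbers.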

Linked issue (may be closed by merging): NetCDF file has one time step, kerchunk-generated reference has nine time steps?