
Explicitly calculate dtype element size in netCDF3 records #466

Merged (7 commits) — Jun 24, 2024

Conversation

@martindurant (Member)

Fixes #465

@rsignell-usgs please test. This is not what I thought was happening, and I don't know why numpy's dt.itemsize can't be trusted here.
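(One plausible way `dt.itemsize` can disagree with the sum of the field sizes is alignment padding in structured dtypes; the thread doesn't state that this is the exact cause in the netCDF3 record code, so the sketch below is only an illustration of the general pitfall.)

```python
import numpy as np

# Two structured dtypes with identical fields: one packed, one C-aligned.
packed = np.dtype([("flag", "i1"), ("value", "i4")])
aligned = np.dtype([("flag", "i1"), ("value", "i4")], align=True)

# Explicit element size: sum the individual field sizes (1 + 4 = 5 bytes).
field_bytes = sum(packed.fields[name][0].itemsize for name in packed.names)

print(packed.itemsize)   # 5 -- no padding
print(aligned.itemsize)  # 8 -- includes 3 bytes of alignment padding
print(field_bytes)       # 5 -- the explicitly calculated size
```

Summing the field sizes explicitly, as this PR's title describes, sidesteps any padding that `itemsize` may include.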

@rsignell

@martindurant I tried testing, but I don't know whether it's not working or simply user error opening in xarray (cell [8] in https://gist.github.com/rsignell/cb6e3ed842abedb797e2cd8ccc39169c).

@martindurant (Member, Author)

Should have been:

so = dict(anon=True, skip_instance_cache=True)

(i.e., remove the mode= argument). This produces:

<xarray.Dataset>
Dimensions:            (depth: 40, lat: 3251, lon: 4500, time: 1)
Coordinates:
  * depth              (depth) float64 0.0 2.0 4.0 6.0 ... 3e+03 4e+03 5e+03
  * lat                (lat) float64 -80.0 -79.96 -79.92 ... 89.92 89.96 90.0
  * lon                (lon) float64 -180.0 -179.9 -179.8 ... 179.8 179.8 179.9
  * time               (time) datetime64[ns] 2014-01-01T12:00:00
Data variables:
    salinity           (time, depth, lat, lon) float32 dask.array<chunksize=(1, 40, 3251, 4500), meta=np.ndarray>
    salinity_bottom    (time, lat, lon) float32 dask.array<chunksize=(1, 3251, 4500), meta=np.ndarray>
    surf_el            (time, lat, lon) float32 dask.array<chunksize=(1, 3251, 4500), meta=np.ndarray>
    water_temp         (time, depth, lat, lon) float32 dask.array<chunksize=(1, 40, 3251, 4500), meta=np.ndarray>
    water_temp_bottom  (time, lat, lon) float32 dask.array<chunksize=(1, 3251, 4500), meta=np.ndarray>
    water_u            (time, depth, lat, lon) float32 dask.array<chunksize=(1, 40, 3251, 4500), meta=np.ndarray>
    water_u_bottom     (time, lat, lon) float32 dask.array<chunksize=(1, 3251, 4500), meta=np.ndarray>
    water_v            (time, depth, lat, lon) float32 dask.array<chunksize=(1, 40, 3251, 4500), meta=np.ndarray>
    water_v_bottom     (time, lat, lon) float32 dask.array<chunksize=(1, 3251, 4500), meta=np.ndarray>
Attributes:
    Conventions:               CF-1.6 NAVO_netcdf_v1.1
    classification_authority:  not applicable
    classification_level:      UNCLASSIFIED
    distribution_statement:    Approved for public release. Distribution unli...
    downgrade_date:            not applicable
    field_type:                instantaneous
    history:                   archv2ncdf3z
    institution:               Naval Oceanographic Office
    source:                    HYCOM archive file

Note from our discussion: you can split any of the big arrays on the first dimension, which here is depth, length 40 (kerchunk.utils.subchunk). So long as you choose an exact divisor, e.g., 8 or 10, it will work. The on-disk size unsplit is 1.17e9 bytes, so it is probably worth doing this. I'm not actually sure whether the split would still be possible after combine; better to do it on the simple JSON/dict from this first stage. (I hope I have accounted for subchunking where the very first dimension is 1, and so can be ignored.)
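(The exact-divisor constraint above can be sketched numerically; this illustrates only the arithmetic of choosing a factor, not the kerchunk.utils.subchunk call itself, and the 1.17e9-byte figure is the one quoted in this thread.)

```python
def exact_divisors(n: int) -> list[int]:
    """Factors that split a dimension of length n into equal whole chunks."""
    return [d for d in range(2, n + 1) if n % d == 0]

depth = 40            # length of the first splittable dimension here
unsplit_bytes = 1.17e9  # on-disk size of one unsplit chunk, per the thread

for f in exact_divisors(depth):
    print(f"factor {f:2d}: {depth // f} depth levels per chunk, "
          f"~{unsplit_bytes / f / 1e6:.0f} MB per chunk")
```

A factor that does not divide 40 exactly (say, 7) would leave a ragged final chunk, which the reference-splitting scheme cannot represent.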

@rsignell mentioned this pull request on Jun 23, 2024
@martindurant (Member, Author)

Test should be fixed by fsspec/filesystem_spec#1634

@martindurant (Member, Author)

I'll merge this now and we can open any further issues that might arise.

@martindurant merged commit ae692fe into fsspec:main on Jun 24, 2024
5 checks passed
@martindurant deleted the complex_cdf3_dt branch on June 24, 2024 at 15:20
@rsignell

Woohoo! It's working!

(screenshot attached)

@martindurant (Member, Author)

We'll need to decide what the best size is in the end. Having smaller chunks means more requests and a bigger reference set on disk. Since it's pretty easy to generate, we could make multiple variants and profile them.
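(The tradeoff described above can be roughed out before profiling; the chunk size is the ~1.17 GB figure from earlier in the thread, and the variable count is taken from the dataset repr, but the request and reference counts are back-of-the-envelope estimates, not measurements.)

```python
# Smaller chunks -> more requests per full-variable read and a bigger
# reference set on disk; larger chunks -> fewer, heavier requests.
unsplit_bytes = 1.17e9  # one unsplit (time, depth, lat, lon) chunk
n_vars_4d = 4           # salinity, water_temp, water_u, water_v

for factor in (1, 8, 10, 40):
    refs = factor * n_vars_4d
    print(f"factor {factor:2d}: {unsplit_bytes / factor / 1e6:7.1f} MB/chunk, "
          f"{refs:3d} references across the 4-D variables")
```

Generating one reference set per candidate factor and timing a representative read against each, as suggested above, would turn these estimates into real numbers.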

Linked issue (may be closed by merging): NetCDF file has one time step, kerchunk-generated reference has nine time steps?