Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How to use subchunking #467

Closed
rsignell opened this issue Jun 23, 2024 · 10 comments
Closed

How to use subchunking #467

rsignell opened this issue Jun 23, 2024 · 10 comments

Comments

@rsignell
Copy link

rsignell commented Jun 23, 2024

Using the fixes in #466, I'm able to overcome the NetCDF 64-bit offset issue in #465.

With this data, I'd like to use kerchunk.utils.subchunk since the data chunk size on variables like 'salinity' is about 1GB:
image

The syntax for kerchunk.utils.subchunk is subchunk(store, variable, factor) where factor is:

factor: int
        the number of chunks each input chunk turns into. Must be an exact divisor
        of the original largest dimension length.

In this dataset the biggest dimension is lon=4500, which is also the fastest varying dimension.

But when I tried factor=10, it didn't work. What am I doing wrong? Notebook here
Screenshot 2024-06-23 064213

@martindurant
Copy link
Member

The problem, as I hinted in a previous comment, is the leading 1, dimension for time - which clearly doesn't divide to anything. I'll need to tweak the code to delve to deeper dimention(s) than the first.

@martindurant
Copy link
Member

I should have explained:

the original largest dimension length

this means in terms of memory layout (slowest varying), not length. So, time is the biggest, with length 1, followed by depth.

@martindurant
Copy link
Member

I should also note, that if only zarr allowed intra-chunk ranges for uncompressed data, none of this would be necessary.

@rsignell
Copy link
Author

rsignell commented Jun 23, 2024

So does that mean we can't subchunk this data because the length of time is 1? (e.g. only factor=1 is possible, which of course wouldn't do anything)

@martindurant
Copy link
Member

we can't subchunk this data because the length of time is 1

Correct, for the current implementation. Of course, (1, 40, ...) -> (1, 4, ...) should of course be possible, if I get around to making the required changes.

@rsignell
Copy link
Author

That would be great, as I think this would then actually allow this dataset to be accessed in a reasonable way!

@martindurant
Copy link
Member

I have put a potential fix into #466 (which still has a non-related outstanding test failure)

@rsignell
Copy link
Author

I gave this potential fix a try, but now NetCDF3ToZarr seems broken?

file0 = 'hycom-gofs-3pt1-reanalysis/2014/hycom_GLBv0.08_538_2014010112_t000.nc'
d = NetCDF3ToZarr("s3://" + file0, storage_options={"anon": True},
                  inline_threshold=400, version=2).translate()

produces the error

---------------------------------------------------------------------------
ClientError                               Traceback (most recent call last)
File [~/miniforge3/envs/pangeo/lib/python3.11/site-packages/s3fs/core.py:113](http://localhost:8888/lab/tree/repos/opendata-coawst/miniforge3/envs/pangeo/lib/python3.11/site-packages/s3fs/core.py#line=112), in _error_wrapper(func, args, kwargs, retries)
    112 try:
--> 113     return await func(*args, **kwargs)
    114 except S3_RETRYABLE_ERRORS as e:

File [~/miniforge3/envs/pangeo/lib/python3.11/site-packages/aiobotocore/client.py:411](http://localhost:8888/lab/tree/repos/opendata-coawst/miniforge3/envs/pangeo/lib/python3.11/site-packages/aiobotocore/client.py#line=410), in AioBaseClient._make_api_call(self, operation_name, api_params)
    410     error_class = self.exceptions.from_code(error_code)
--> 411     raise error_class(parsed_response, operation_name)
    412 else:

ClientError: An error occurred (InvalidAccessKeyId) when calling the GetObject operation: The AWS Access Key Id you provided does not exist in our records.

The above exception was the direct cause of the following exception:

PermissionError                           Traceback (most recent call last)
File [~/miniforge3/envs/pangeo/lib/python3.11/site-packages/fsspec/asyn.py:245](http://localhost:8888/lab/tree/repos/opendata-coawst/miniforge3/envs/pangeo/lib/python3.11/site-packages/fsspec/asyn.py#line=244), in _run_coros_in_chunks.<locals>._run_coro(coro, i)
    244 try:
--> 245     return await asyncio.wait_for(coro, timeout=timeout), i
    246 except Exception as e:

File [~/miniforge3/envs/pangeo/lib/python3.11/asyncio/tasks.py:442](http://localhost:8888/lab/tree/repos/opendata-coawst/miniforge3/envs/pangeo/lib/python3.11/asyncio/tasks.py#line=441), in wait_for(fut, timeout)
    441 if timeout is None:
--> 442     return await fut
    444 if timeout <= 0:

File [~/miniforge3/envs/pangeo/lib/python3.11/site-packages/s3fs/core.py:1128](http://localhost:8888/lab/tree/repos/opendata-coawst/miniforge3/envs/pangeo/lib/python3.11/site-packages/s3fs/core.py#line=1127), in S3FileSystem._cat_file(self, path, version_id, start, end)
   1126         resp["Body"].close()
-> 1128 return await _error_wrapper(_call_and_read, retries=self.retries)

File [~/miniforge3/envs/pangeo/lib/python3.11/site-packages/s3fs/core.py:145](http://localhost:8888/lab/tree/repos/opendata-coawst/miniforge3/envs/pangeo/lib/python3.11/site-packages/s3fs/core.py#line=144), in _error_wrapper(func, args, kwargs, retries)
    144 err = translate_boto_error(err)
--> 145 raise err

File [~/miniforge3/envs/pangeo/lib/python3.11/site-packages/s3fs/core.py:113](http://localhost:8888/lab/tree/repos/opendata-coawst/miniforge3/envs/pangeo/lib/python3.11/site-packages/s3fs/core.py#line=112), in _error_wrapper(func, args, kwargs, retries)
    112 try:
--> 113     return await func(*args, **kwargs)
    114 except S3_RETRYABLE_ERRORS as e:

File [~/miniforge3/envs/pangeo/lib/python3.11/site-packages/s3fs/core.py:1115](http://localhost:8888/lab/tree/repos/opendata-coawst/miniforge3/envs/pangeo/lib/python3.11/site-packages/s3fs/core.py#line=1114), in S3FileSystem._cat_file.<locals>._call_and_read()
   1114 async def _call_and_read():
-> 1115     resp = await self._call_s3(
   1116         "get_object",
   1117         Bucket=bucket,
   1118         Key=key,
   1119         **version_id_kw(version_id or vers),
   1120         **head,
   1121         **self.req_kw,
   1122     )
   1123     try:

File [~/miniforge3/envs/pangeo/lib/python3.11/site-packages/s3fs/core.py:365](http://localhost:8888/lab/tree/repos/opendata-coawst/miniforge3/envs/pangeo/lib/python3.11/site-packages/s3fs/core.py#line=364), in S3FileSystem._call_s3(self, method, *akwarglist, **kwargs)
    364 additional_kwargs = self._get_s3_method_kwargs(method, *akwarglist, **kwargs)
--> 365 return await _error_wrapper(
    366     method, kwargs=additional_kwargs, retries=self.retries
    367 )

File [~/miniforge3/envs/pangeo/lib/python3.11/site-packages/s3fs/core.py:145](http://localhost:8888/lab/tree/repos/opendata-coawst/miniforge3/envs/pangeo/lib/python3.11/site-packages/s3fs/core.py#line=144), in _error_wrapper(func, args, kwargs, retries)
    144 err = translate_boto_error(err)
--> 145 raise err

PermissionError: The AWS Access Key Id you provided does not exist in our records.

The above exception was the direct cause of the following exception:

ReferenceNotReachable                     Traceback (most recent call last)
Cell In[7], line 2
      1 d = NetCDF3ToZarr("s3://" + flist[0], storage_options={"anon": True},
----> 2                   inline_threshold=400, version=2).translate()

File [~/miniforge3/envs/pangeo/lib/python3.11/site-packages/kerchunk/netCDF3.py:288](http://localhost:8888/lab/tree/repos/opendata-coawst/miniforge3/envs/pangeo/lib/python3.11/site-packages/kerchunk/netCDF3.py#line=287), in NetCDF3ToZarr.translate(self)
    280 z.attrs.update(
    281     {
    282         k: v.decode() if isinstance(v, bytes) else str(v)
   (...)
    285     }
    286 )
    287 if self.threshold:
--> 288     out = inline_array(out, self.threshold, remote_options=self.storage_options)
    290 if isinstance(out, LazyReferenceMapper):
    291     out.flush()

File [~/miniforge3/envs/pangeo/lib/python3.11/site-packages/kerchunk/utils.py:230](http://localhost:8888/lab/tree/repos/opendata-coawst/miniforge3/envs/pangeo/lib/python3.11/site-packages/kerchunk/utils.py#line=229), in inline_array(store, threshold, names, remote_options)
    226 fs = fsspec.filesystem(
    227     "reference", fo=store, **(remote_options or {}), skip_instance_cache=True
    228 )
    229 g = zarr.open_group(fs.get_mapper(), mode="r+")
--> 230 _inline_array(g, threshold, names=names or [])
    231 return fs.references

File [~/miniforge3/envs/pangeo/lib/python3.11/site-packages/kerchunk/utils.py:193](http://localhost:8888/lab/tree/repos/opendata-coawst/miniforge3/envs/pangeo/lib/python3.11/site-packages/kerchunk/utils.py#line=192), in _inline_array(group, threshold, names, prefix)
    187 if cond1 or cond2:
    188     original_attrs = dict(thing.attrs)
    189     arr = group.create_dataset(
    190         name=name,
    191         dtype=thing.dtype,
    192         shape=thing.shape,
--> 193         data=thing[:],
    194         chunks=thing.shape,
    195         compression=None,
    196         overwrite=True,
    197         fill_value=thing.fill_value,
    198     )
    199     arr.attrs.update(original_attrs)

File [~/miniforge3/envs/pangeo/lib/python3.11/site-packages/zarr/core.py:800](http://localhost:8888/lab/tree/repos/opendata-coawst/miniforge3/envs/pangeo/lib/python3.11/site-packages/zarr/core.py#line=799), in Array.__getitem__(self, selection)
    798     result = self.get_orthogonal_selection(pure_selection, fields=fields)
    799 else:
--> 800     result = self.get_basic_selection(pure_selection, fields=fields)
    801 return result

File [~/miniforge3/envs/pangeo/lib/python3.11/site-packages/zarr/core.py:926](http://localhost:8888/lab/tree/repos/opendata-coawst/miniforge3/envs/pangeo/lib/python3.11/site-packages/zarr/core.py#line=925), in Array.get_basic_selection(self, selection, out, fields)
    924     return self._get_basic_selection_zd(selection=selection, out=out, fields=fields)
    925 else:
--> 926     return self._get_basic_selection_nd(selection=selection, out=out, fields=fields)

File [~/miniforge3/envs/pangeo/lib/python3.11/site-packages/zarr/core.py:968](http://localhost:8888/lab/tree/repos/opendata-coawst/miniforge3/envs/pangeo/lib/python3.11/site-packages/zarr/core.py#line=967), in Array._get_basic_selection_nd(self, selection, out, fields)
    962 def _get_basic_selection_nd(self, selection, out=None, fields=None):
    963     # implementation of basic selection for array with at least one dimension
    964 
    965     # setup indexer
    966     indexer = BasicIndexer(selection, self)
--> 968     return self._get_selection(indexer=indexer, out=out, fields=fields)

File [~/miniforge3/envs/pangeo/lib/python3.11/site-packages/zarr/core.py:1343](http://localhost:8888/lab/tree/repos/opendata-coawst/miniforge3/envs/pangeo/lib/python3.11/site-packages/zarr/core.py#line=1342), in Array._get_selection(self, indexer, out, fields)
   1340 if math.prod(out_shape) > 0:
   1341     # allow storage to get multiple items at once
   1342     lchunk_coords, lchunk_selection, lout_selection = zip(*indexer)
-> 1343     self._chunk_getitems(
   1344         lchunk_coords,
   1345         lchunk_selection,
   1346         out,
   1347         lout_selection,
   1348         drop_axes=indexer.drop_axes,
   1349         fields=fields,
   1350     )
   1351 if out.shape:
   1352     return out

File [~/miniforge3/envs/pangeo/lib/python3.11/site-packages/zarr/core.py:2179](http://localhost:8888/lab/tree/repos/opendata-coawst/miniforge3/envs/pangeo/lib/python3.11/site-packages/zarr/core.py#line=2178), in Array._chunk_getitems(self, lchunk_coords, lchunk_selection, out, lout_selection, drop_axes, fields)
   2177     if not isinstance(self._meta_array, np.ndarray):
   2178         contexts = ConstantMap(ckeys, constant=Context(meta_array=self._meta_array))
-> 2179     cdatas = self.chunk_store.getitems(ckeys, contexts=contexts)
   2181 for ckey, chunk_select, out_select in zip(ckeys, lchunk_selection, lout_selection):
   2182     if ckey in cdatas:

File [~/miniforge3/envs/pangeo/lib/python3.11/site-packages/zarr/storage.py:1435](http://localhost:8888/lab/tree/repos/opendata-coawst/miniforge3/envs/pangeo/lib/python3.11/site-packages/zarr/storage.py#line=1434), in FSStore.getitems(self, keys, contexts)
   1432     continue
   1433 elif isinstance(v, Exception):
   1434     # Raise any other exception
-> 1435     raise v
   1436 else:
   1437     # The function calling this method may not recognize the transformed
   1438     # keys, so we send the values returned by self.map.getitems back into
   1439     # the original key space.
   1440     results[keys_transformed[k]] = v

ReferenceNotReachable: Reference "depth/0" failed to fetch target ['s3://hycom-gofs-3pt1-reanalysis/2014/hycom_GLBv0.08_538_2014010112_t000.nc', 4640, 320]

@martindurant
Copy link
Member

Sorry, messed up passing remote options for inlining.

@rsignell
Copy link
Author

Fixed in #466

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants