-
Notifications
You must be signed in to change notification settings - Fork 78
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
How to use subchunking #467
Comments
The problem, as I hinted in a previous comment, is the leading 1, dimension for time - which clearly doesn't divide to anything. I'll need to tweak the code to delve to deeper dimention(s) than the first. |
I should have explained:
this means in terms of memory layout (slowest varying), not length. So, time is the biggest, with length 1, followed by depth. |
I should also note, that if only zarr allowed intra-chunk ranges for uncompressed data, none of this would be necessary. |
So does that mean we can't subchunk this data because the length of |
Correct, for the current implementation. Of course, (1, 40, ...) -> (1, 4, ...) should of course be possible, if I get around to making the required changes. |
That would be great, as I think this would then actually allow this dataset to be accessed in a reasonable way! |
I have put a potential fix into #466 (which still has a non-related outstanding test failure) |
I gave this potential fix a try, but now NetCDF3ToZarr seems broken? file0 = 'hycom-gofs-3pt1-reanalysis/2014/hycom_GLBv0.08_538_2014010112_t000.nc'
d = NetCDF3ToZarr("s3://" + file0, storage_options={"anon": True},
inline_threshold=400, version=2).translate() produces the error ---------------------------------------------------------------------------
ClientError Traceback (most recent call last)
File [~/miniforge3/envs/pangeo/lib/python3.11/site-packages/s3fs/core.py:113](http://localhost:8888/lab/tree/repos/opendata-coawst/miniforge3/envs/pangeo/lib/python3.11/site-packages/s3fs/core.py#line=112), in _error_wrapper(func, args, kwargs, retries)
112 try:
--> 113 return await func(*args, **kwargs)
114 except S3_RETRYABLE_ERRORS as e:
File [~/miniforge3/envs/pangeo/lib/python3.11/site-packages/aiobotocore/client.py:411](http://localhost:8888/lab/tree/repos/opendata-coawst/miniforge3/envs/pangeo/lib/python3.11/site-packages/aiobotocore/client.py#line=410), in AioBaseClient._make_api_call(self, operation_name, api_params)
410 error_class = self.exceptions.from_code(error_code)
--> 411 raise error_class(parsed_response, operation_name)
412 else:
ClientError: An error occurred (InvalidAccessKeyId) when calling the GetObject operation: The AWS Access Key Id you provided does not exist in our records.
The above exception was the direct cause of the following exception:
PermissionError Traceback (most recent call last)
File [~/miniforge3/envs/pangeo/lib/python3.11/site-packages/fsspec/asyn.py:245](http://localhost:8888/lab/tree/repos/opendata-coawst/miniforge3/envs/pangeo/lib/python3.11/site-packages/fsspec/asyn.py#line=244), in _run_coros_in_chunks.<locals>._run_coro(coro, i)
244 try:
--> 245 return await asyncio.wait_for(coro, timeout=timeout), i
246 except Exception as e:
File [~/miniforge3/envs/pangeo/lib/python3.11/asyncio/tasks.py:442](http://localhost:8888/lab/tree/repos/opendata-coawst/miniforge3/envs/pangeo/lib/python3.11/asyncio/tasks.py#line=441), in wait_for(fut, timeout)
441 if timeout is None:
--> 442 return await fut
444 if timeout <= 0:
File [~/miniforge3/envs/pangeo/lib/python3.11/site-packages/s3fs/core.py:1128](http://localhost:8888/lab/tree/repos/opendata-coawst/miniforge3/envs/pangeo/lib/python3.11/site-packages/s3fs/core.py#line=1127), in S3FileSystem._cat_file(self, path, version_id, start, end)
1126 resp["Body"].close()
-> 1128 return await _error_wrapper(_call_and_read, retries=self.retries)
File [~/miniforge3/envs/pangeo/lib/python3.11/site-packages/s3fs/core.py:145](http://localhost:8888/lab/tree/repos/opendata-coawst/miniforge3/envs/pangeo/lib/python3.11/site-packages/s3fs/core.py#line=144), in _error_wrapper(func, args, kwargs, retries)
144 err = translate_boto_error(err)
--> 145 raise err
File [~/miniforge3/envs/pangeo/lib/python3.11/site-packages/s3fs/core.py:113](http://localhost:8888/lab/tree/repos/opendata-coawst/miniforge3/envs/pangeo/lib/python3.11/site-packages/s3fs/core.py#line=112), in _error_wrapper(func, args, kwargs, retries)
112 try:
--> 113 return await func(*args, **kwargs)
114 except S3_RETRYABLE_ERRORS as e:
File [~/miniforge3/envs/pangeo/lib/python3.11/site-packages/s3fs/core.py:1115](http://localhost:8888/lab/tree/repos/opendata-coawst/miniforge3/envs/pangeo/lib/python3.11/site-packages/s3fs/core.py#line=1114), in S3FileSystem._cat_file.<locals>._call_and_read()
1114 async def _call_and_read():
-> 1115 resp = await self._call_s3(
1116 "get_object",
1117 Bucket=bucket,
1118 Key=key,
1119 **version_id_kw(version_id or vers),
1120 **head,
1121 **self.req_kw,
1122 )
1123 try:
File [~/miniforge3/envs/pangeo/lib/python3.11/site-packages/s3fs/core.py:365](http://localhost:8888/lab/tree/repos/opendata-coawst/miniforge3/envs/pangeo/lib/python3.11/site-packages/s3fs/core.py#line=364), in S3FileSystem._call_s3(self, method, *akwarglist, **kwargs)
364 additional_kwargs = self._get_s3_method_kwargs(method, *akwarglist, **kwargs)
--> 365 return await _error_wrapper(
366 method, kwargs=additional_kwargs, retries=self.retries
367 )
File [~/miniforge3/envs/pangeo/lib/python3.11/site-packages/s3fs/core.py:145](http://localhost:8888/lab/tree/repos/opendata-coawst/miniforge3/envs/pangeo/lib/python3.11/site-packages/s3fs/core.py#line=144), in _error_wrapper(func, args, kwargs, retries)
144 err = translate_boto_error(err)
--> 145 raise err
PermissionError: The AWS Access Key Id you provided does not exist in our records.
The above exception was the direct cause of the following exception:
ReferenceNotReachable Traceback (most recent call last)
Cell In[7], line 2
1 d = NetCDF3ToZarr("s3://" + flist[0], storage_options={"anon": True},
----> 2 inline_threshold=400, version=2).translate()
File [~/miniforge3/envs/pangeo/lib/python3.11/site-packages/kerchunk/netCDF3.py:288](http://localhost:8888/lab/tree/repos/opendata-coawst/miniforge3/envs/pangeo/lib/python3.11/site-packages/kerchunk/netCDF3.py#line=287), in NetCDF3ToZarr.translate(self)
280 z.attrs.update(
281 {
282 k: v.decode() if isinstance(v, bytes) else str(v)
(...)
285 }
286 )
287 if self.threshold:
--> 288 out = inline_array(out, self.threshold, remote_options=self.storage_options)
290 if isinstance(out, LazyReferenceMapper):
291 out.flush()
File [~/miniforge3/envs/pangeo/lib/python3.11/site-packages/kerchunk/utils.py:230](http://localhost:8888/lab/tree/repos/opendata-coawst/miniforge3/envs/pangeo/lib/python3.11/site-packages/kerchunk/utils.py#line=229), in inline_array(store, threshold, names, remote_options)
226 fs = fsspec.filesystem(
227 "reference", fo=store, **(remote_options or {}), skip_instance_cache=True
228 )
229 g = zarr.open_group(fs.get_mapper(), mode="r+")
--> 230 _inline_array(g, threshold, names=names or [])
231 return fs.references
File [~/miniforge3/envs/pangeo/lib/python3.11/site-packages/kerchunk/utils.py:193](http://localhost:8888/lab/tree/repos/opendata-coawst/miniforge3/envs/pangeo/lib/python3.11/site-packages/kerchunk/utils.py#line=192), in _inline_array(group, threshold, names, prefix)
187 if cond1 or cond2:
188 original_attrs = dict(thing.attrs)
189 arr = group.create_dataset(
190 name=name,
191 dtype=thing.dtype,
192 shape=thing.shape,
--> 193 data=thing[:],
194 chunks=thing.shape,
195 compression=None,
196 overwrite=True,
197 fill_value=thing.fill_value,
198 )
199 arr.attrs.update(original_attrs)
File [~/miniforge3/envs/pangeo/lib/python3.11/site-packages/zarr/core.py:800](http://localhost:8888/lab/tree/repos/opendata-coawst/miniforge3/envs/pangeo/lib/python3.11/site-packages/zarr/core.py#line=799), in Array.__getitem__(self, selection)
798 result = self.get_orthogonal_selection(pure_selection, fields=fields)
799 else:
--> 800 result = self.get_basic_selection(pure_selection, fields=fields)
801 return result
File [~/miniforge3/envs/pangeo/lib/python3.11/site-packages/zarr/core.py:926](http://localhost:8888/lab/tree/repos/opendata-coawst/miniforge3/envs/pangeo/lib/python3.11/site-packages/zarr/core.py#line=925), in Array.get_basic_selection(self, selection, out, fields)
924 return self._get_basic_selection_zd(selection=selection, out=out, fields=fields)
925 else:
--> 926 return self._get_basic_selection_nd(selection=selection, out=out, fields=fields)
File [~/miniforge3/envs/pangeo/lib/python3.11/site-packages/zarr/core.py:968](http://localhost:8888/lab/tree/repos/opendata-coawst/miniforge3/envs/pangeo/lib/python3.11/site-packages/zarr/core.py#line=967), in Array._get_basic_selection_nd(self, selection, out, fields)
962 def _get_basic_selection_nd(self, selection, out=None, fields=None):
963 # implementation of basic selection for array with at least one dimension
964
965 # setup indexer
966 indexer = BasicIndexer(selection, self)
--> 968 return self._get_selection(indexer=indexer, out=out, fields=fields)
File [~/miniforge3/envs/pangeo/lib/python3.11/site-packages/zarr/core.py:1343](http://localhost:8888/lab/tree/repos/opendata-coawst/miniforge3/envs/pangeo/lib/python3.11/site-packages/zarr/core.py#line=1342), in Array._get_selection(self, indexer, out, fields)
1340 if math.prod(out_shape) > 0:
1341 # allow storage to get multiple items at once
1342 lchunk_coords, lchunk_selection, lout_selection = zip(*indexer)
-> 1343 self._chunk_getitems(
1344 lchunk_coords,
1345 lchunk_selection,
1346 out,
1347 lout_selection,
1348 drop_axes=indexer.drop_axes,
1349 fields=fields,
1350 )
1351 if out.shape:
1352 return out
File [~/miniforge3/envs/pangeo/lib/python3.11/site-packages/zarr/core.py:2179](http://localhost:8888/lab/tree/repos/opendata-coawst/miniforge3/envs/pangeo/lib/python3.11/site-packages/zarr/core.py#line=2178), in Array._chunk_getitems(self, lchunk_coords, lchunk_selection, out, lout_selection, drop_axes, fields)
2177 if not isinstance(self._meta_array, np.ndarray):
2178 contexts = ConstantMap(ckeys, constant=Context(meta_array=self._meta_array))
-> 2179 cdatas = self.chunk_store.getitems(ckeys, contexts=contexts)
2181 for ckey, chunk_select, out_select in zip(ckeys, lchunk_selection, lout_selection):
2182 if ckey in cdatas:
File [~/miniforge3/envs/pangeo/lib/python3.11/site-packages/zarr/storage.py:1435](http://localhost:8888/lab/tree/repos/opendata-coawst/miniforge3/envs/pangeo/lib/python3.11/site-packages/zarr/storage.py#line=1434), in FSStore.getitems(self, keys, contexts)
1432 continue
1433 elif isinstance(v, Exception):
1434 # Raise any other exception
-> 1435 raise v
1436 else:
1437 # The function calling this method may not recognize the transformed
1438 # keys, so we send the values returned by self.map.getitems back into
1439 # the original key space.
1440 results[keys_transformed[k]] = v
ReferenceNotReachable: Reference "depth/0" failed to fetch target ['s3://hycom-gofs-3pt1-reanalysis/2014/hycom_GLBv0.08_538_2014010112_t000.nc', 4640, 320] |
Sorry, messed up passing remote options for inlining. |
Fixed in #466 |
Using the fixes in #466, I'm able to overcome the NetCDF 64-bit offset issue in #465.
With this data, I'd like to use
kerchunk.utils.subchunk
since the data chunk size on variables like 'salinity' is about 1GB:The syntax for
kerchunk.utils.subchunk
issubchunk(store, variable, factor)
where factor is:In this dataset the biggest dimension is
lon=4500
, which is also the fastest varying dimension.But when I tried
factor=10
, it didn't work. What am I doing wrong? Notebook hereThe text was updated successfully, but these errors were encountered: