
write_multiscale() writes to local path on Linux instead of cloud bucket when using dask>=2023.1.0 #282

Open · carshadi opened this issue Jun 5, 2023 · 4 comments


carshadi commented Jun 5, 2023

Please see dask/dask#10324, which is the source of this problem.

Here is an MRE:

```python
import numpy as np
import dask.array as da
import zarr
from ome_zarr.writer import write_multiscale

store = zarr.storage.FSStore(url="s3://my-bucket/zarr-test.zarr", dimension_separator='/')
root_group = zarr.group(store=store, overwrite=True)
group = root_group.create_group(name="tile0")

d = da.ones(shape=(1, 1, 1000, 1000, 1000), dtype=np.uint16, chunks=(1, 1, 100, 100, 100))
pyramid = [d]

write_multiscale(pyramid, group, chunks=(1, 1, 100, 100, 100), compute=True)
```

The issue is that when an FSStore is passed to dask.array.to_zarr(url=group.store) and storage_options is non-empty, this code path is taken: https://github.com/dask/dask/blob/596d74dbc72db663efac606d42d5c93c5a917fb9/dask/array/core.py#L3710

On Linux, passing an FSStore instance to the FSStore constructor results in the store being created under the current working directory, e.g. ./my-bucket/zarr-test.zarr, instead of at the correct S3 path (the same is true for GCS paths).

This does not occur on Windows 10.

Reverting to dask 2022.12.1 fixes the issue.

@will-moore
Member

From dask/dask#10324 (comment), my understanding is that if you've already created the store (as we have in ome-zarr-py), then you shouldn't need to specify any other storage options, because they should all be incorporated into the store itself; in the case of chunks, the chunking of the dask array itself will be used.

The addition of storage_options to write_multiscale and write_image was in #161, before dask support was added.
I don't know what the use-case is for including these options in da.to_zarr(), or what options might be useful or needed at that point. The chunks are already used to rechunk the data, so they may not be necessary to include in the options.


carshadi commented Jun 6, 2023

I agree that specifying chunks in storage_options is not necessary, since that option is ignored by the FSStore constructor and the zarr chunks are based on the dask array chunks.
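A quick illustration of why a chunks entry in storage_options has nothing left to do: write_multiscale rechunks the dask array itself, and the zarr chunk layout is then derived from the array's chunks. The shapes below are illustrative, not taken from the report:

```python
import dask.array as da
import numpy as np

# The `chunks` argument to write_multiscale is applied by rechunking the
# dask array; the on-disk zarr chunks then follow the array's chunks, so a
# "chunks" key forwarded via storage_options would be ignored anyway.
arr = da.ones((1000, 1000), dtype=np.uint16)
arr = arr.rechunk((100, 100))

print(arr.chunks[0][:3])  # (100, 100, 100)
print(arr.numblocks)      # (10, 10)
```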

@will-moore
Member

If the intended/future behaviour of dask at this point (see dask/dask#10324 (comment)) is to "raise an informative error when url is an FSStore instance and storage_options are given", then we need to not pass any storage options.
That means we need to apply the storage_options earlier, when we create the store.

The call stack for that is:

- `parse_url()`
- `ZarrLocation.__init__()`
- `format.init_store()`
- `store = FSStore()`

So you'd need to pass storage_options to the creation of the store, either if you do it yourself (as in the example above) or in parse_url().

This would mean possibly removing the storage_options from write_multiscale() and write_image() as added in #161.

This would mean NOT going ahead with the deprecation of 'chunks' at #167.

Presumably we should also remove the line

`options["chunks"] = chunks_opt`

which is the source of the error in the code above, since that option is redundant after rechunking and is what makes options non-empty.

FSStore reference: https://zarr.readthedocs.io/en/stable/api/storage.html#zarr.storage.FSStore

cc @sbesson

@toloudis
Contributor

Interestingly, I just reproduced the same issue both on Windows and on Linux with dask 2023.6.0, when trying to convert a microscopy time series to zarr (TCZYX =~ (1000, 1, 150, 2000, 1000)). Just like @carshadi said, reverting to dask 2022.12.1 gets me up and running again.
