Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

(feat): read_elem_as_dask method #1469

Merged
merged 160 commits into from
Jul 23, 2024
Merged
Show file tree
Hide file tree
Changes from 152 commits
Commits
Show all changes
160 commits
Select commit Hold shift + click to select a range
d111f04
(feat): `read_elem_lazy` method
ilan-gold Apr 11, 2024
00be7f0
(revert): error message
ilan-gold Apr 11, 2024
fd635d7
(refactor): declare `is_csc` reading elem directly in h5
ilan-gold Apr 11, 2024
f5e7fda
(chore): `read_elem_lazy` -> `read_elem_as_dask`
ilan-gold Apr 12, 2024
ae5396c
(chore): remove string handling
ilan-gold Apr 12, 2024
664336a
(refactor): use `elem` for h5 where posssble
ilan-gold Apr 12, 2024
2370215
Merge branch 'main' into ig/read_dask_elem
ilan-gold Apr 17, 2024
52002b6
(chore): remove invlaud syntax
ilan-gold Apr 17, 2024
5ab1ad1
Merge branch 'ig/read_dask_elem' of github.com:scverse/anndata into i…
ilan-gold Apr 17, 2024
aa1006e
(fix): put dask import inside function
ilan-gold Apr 17, 2024
dda7d83
(refactor): try maybe open?
ilan-gold Apr 17, 2024
fd418f0
Merge branch 'main' into ig/read_dask_elem
ilan-gold May 27, 2024
23b0bfd
Merge branch 'main' into ig/read_dask_elem
ilan-gold May 27, 2024
97b8031
Merge branch 'main' into ig/read_dask_elem
ilan-gold Jun 3, 2024
1fc4cc3
(fix): revert `encoding-version`
ilan-gold Jun 3, 2024
5ca71ea
(chore): document `create_sparse_store` test function
ilan-gold Jun 3, 2024
3672c18
(chore): sort indices to prevent warning
ilan-gold Jun 3, 2024
33c3599
(fix): remove utility function `make_dask_array`
ilan-gold Jun 3, 2024
157e710
(chore): `read_sparse_as_dask_h5` -> `read_sparse_as_dask`
ilan-gold Jun 3, 2024
375000d
(feat): make params of `h5_chunks` and `stride`
ilan-gold Jun 3, 2024
241904a
(chore): add distributed test
ilan-gold Jun 3, 2024
42d0d22
(fix): `TypeVar` bind
ilan-gold Jun 3, 2024
0bba2c0
(chore): release note
ilan-gold Jun 4, 2024
0d0b43a
(chore): `0.10.8` -> `0.11.0`
ilan-gold Jun 5, 2024
762d4c6
Merge branch 'main' into ig/read_dask_elem
ilan-gold Jun 26, 2024
c935fe0
(fix): `ruff` for default `pytest.fixture` `scope`
ilan-gold Jun 26, 2024
23e0ea2
Apply suggestions from code review
ilan-gold Jul 1, 2024
5b96c77
(fix): `Any` to `DaskArray`
ilan-gold Jul 1, 2024
0907a4e
(fix): type `make_index` + fix undeclared
ilan-gold Jul 1, 2024
20ced16
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jul 1, 2024
36ae8f2
Merge branch 'main' into ig/read_dask_elem
ilan-gold Jul 1, 2024
bb6607e
fix rest
flying-sheep Jul 1, 2024
419691b
(fix): use `chunks` kwarg
ilan-gold Jul 2, 2024
a23df34
Merge branch 'main' into ig/read_dask_elem
ilan-gold Jul 2, 2024
fd2376a
(feat): expose `chunks` as an option to `read_elem_as_dask` via `data…
ilan-gold Jul 2, 2024
ae723d0
Merge branch 'ig/read_dask_elem' of github.com:scverse/anndata into i…
ilan-gold Jul 2, 2024
42b1093
(fix): `test_read_dispatched_null_case` test
ilan-gold Jul 2, 2024
78de057
(fix): disallowed spread syntax?
ilan-gold Jul 2, 2024
717b997
(refactor): reuse `compute_chunk_layout_for_axis_shape` functionality
ilan-gold Jul 2, 2024
2b86293
(fix): remove unneeded `slice` arguments
ilan-gold Jul 3, 2024
8d5a9df
(fix): revert message
ilan-gold Jul 3, 2024
449fc1a
(refactor): `make_index` -> `make_block_indexer`
ilan-gold Jul 3, 2024
1522de3
(fix): export from `experimental`
ilan-gold Jul 3, 2024
71c150d
(fix): `callback` signature for `test_read_dispatched_null_case
ilan-gold Jul 3, 2024
b441366
(chore): `get_elem_name` helper
ilan-gold Jul 3, 2024
0307a1d
(chore): use `H5Group` consistently
ilan-gold Jul 3, 2024
ee075cd
(refactor): make `chunks` public facing API instead of `dataset_kwargs`
ilan-gold Jul 3, 2024
89acec4
(fix): regsiter for group not array
ilan-gold Jul 3, 2024
48b7630
(chore): add warning test
ilan-gold Jul 3, 2024
8712582
(chore): make arg order consistent
ilan-gold Jul 3, 2024
cda8aa7
(feat): add `callback` typing for `read_dispatched`
ilan-gold Jul 5, 2024
e8f62f4
(chore): use `npt.NDArray`
ilan-gold Jul 5, 2024
f6e48ac
(fix): remove uneceesary union
ilan-gold Jul 5, 2024
4de3246
(chore): release note
ilan-gold Jul 5, 2024
ba817e0
(fix); try protocol docs
ilan-gold Jul 5, 2024
438d28d
(feat): create `InMemoryElem` + `DictElemType` to remove `Any`
ilan-gold Jul 5, 2024
296ea3f
(chore): refactor `DictElemType` -> `InMemoryArrayOrScalarType` for r…
ilan-gold Jul 5, 2024
cf13a57
(fix): use `Union`
ilan-gold Jul 5, 2024
d02ba49
(fix): more `Union`
ilan-gold Jul 5, 2024
6970a97
(refactor): `InMemoryElem` -> `InMemoryReadElem`
ilan-gold Jul 5, 2024
2282351
(chore): add needed types to public export + docs fix
ilan-gold Jul 5, 2024
810cd0a
Merge branch 'main' into ig/read_dask_elem
flying-sheep Jul 8, 2024
a996081
(chore): type `write_elem` functions
ilan-gold Jul 8, 2024
f6e457b
(chore): create `write_callback` protocol
ilan-gold Jul 8, 2024
a0b4057
Merge branch 'main' into ig/protocol_for_callback
ilan-gold Jul 8, 2024
4416526
(chore): export + docs
ilan-gold Jul 8, 2024
fbe44f0
(fix): add string descriptions
ilan-gold Jul 8, 2024
8c1f01d
(fix): try sphinx protocol doc
ilan-gold Jul 8, 2024
a7d412a
(fix): try ignoring exports
ilan-gold Jul 8, 2024
4d56396
(fix): remap callback internal usages
ilan-gold Jul 8, 2024
2012ee5
(fix): add docstring
ilan-gold Jul 8, 2024
f65f065
Discard changes to pyproject.toml
flying-sheep Jul 9, 2024
8f6ea49
re-add dep
flying-sheep Jul 9, 2024
155a21e
Fix docs
flying-sheep Jul 9, 2024
daae3e5
Almost works
flying-sheep Jul 9, 2024
c415ae4
works!
flying-sheep Jul 9, 2024
00010b8
(chore): use pascal-case
ilan-gold Jul 9, 2024
0bd87fc
(feat): type read/write funcs in callback
ilan-gold Jul 9, 2024
5997678
(fix): use generic for `Read` as well.
ilan-gold Jul 9, 2024
f208332
(fix): need more aliases
ilan-gold Jul 9, 2024
eb69fcb
Split table, format
flying-sheep Jul 9, 2024
477bbef
(refactor): move to `_types` file
ilan-gold Jul 9, 2024
103cad6
Merge branch 'ig/protocol_for_callback' of github.com:scverse/anndata…
ilan-gold Jul 9, 2024
8d23f6f
bump scanpydoc
flying-sheep Jul 9, 2024
9b647c2
Some basic syntax fixes
flying-sheep Jul 9, 2024
d6d01bc
Merge branch 'ig/protocol_for_callback' into ig/read_dask_elem
ilan-gold Jul 9, 2024
5ef93e1
(fix): change `Read{Callback}` type for kwargs
ilan-gold Jul 9, 2024
9cfe908
(chore): test `chunks `argument
ilan-gold Jul 9, 2024
99fc6db
(fix): type `read_recarray`
ilan-gold Jul 9, 2024
b5bccc3
(fix): `GroupyStorageType` not `StorageType`
ilan-gold Jul 9, 2024
e5ea2b0
(fix): little type fixes
ilan-gold Jul 9, 2024
6ac72d6
(fix): clarify `H5File` typing
ilan-gold Jul 9, 2024
989dc65
(fix): dask doc
ilan-gold Jul 9, 2024
36b0207
(fix): dask docs
ilan-gold Jul 9, 2024
dadfb4d
Merge branch 'ig/protocol_for_callback' into ig/read_dask_elem
ilan-gold Jul 9, 2024
ca6cf66
(fix): typing
ilan-gold Jul 9, 2024
eabaf35
(fix): handle case when `chunks` is `None`
ilan-gold Jul 9, 2024
4c398c3
(feat): add string-array reading
ilan-gold Jul 9, 2024
d6fc8a4
(fix): remove `string-array` because it is not tested
ilan-gold Jul 9, 2024
33aebb2
(refactor): clean up tests
ilan-gold Jul 10, 2024
701cd85
(fix): overfetching problem
ilan-gold Jul 10, 2024
43b21a2
Fix circular import
flying-sheep Jul 11, 2024
0e22449
add some typing
flying-sheep Jul 11, 2024
ec546f4
fix mapping types
flying-sheep Jul 11, 2024
7c2e4da
Fix Read/Write
flying-sheep Jul 11, 2024
1ba5b99
Fix one more
flying-sheep Jul 11, 2024
49c0d49
unify names
flying-sheep Jul 11, 2024
3666735
claift ReadCallback signature
flying-sheep Jul 11, 2024
3a332ad
Fix type aliases
flying-sheep Jul 11, 2024
d0f4d13
(fix): clean up typing to use `RWAble`
ilan-gold Jul 11, 2024
6e89e14
Merge branch 'main' into ig/protocol_for_callback
ilan-gold Jul 11, 2024
ea29cfa
(fix): use `Union`
ilan-gold Jul 11, 2024
f4ff236
(fix): add qualname override
ilan-gold Jul 11, 2024
f50b286
(fix): ignore dask and masked array
ilan-gold Jul 11, 2024
712e085
(fix): ignore erroneous class warning
ilan-gold Jul 11, 2024
24dd18b
(fix): upgrade `scanpydoc`
ilan-gold Jul 11, 2024
79d3fdc
(fix): use `MutableMapping` instead of `dict` due to broken docstring
ilan-gold Jul 11, 2024
9a2be00
Merge branch 'ig/protocol_for_callback' into ig/read_dask_elem
ilan-gold Jul 11, 2024
d3bcddf
Add data docs
flying-sheep Jul 11, 2024
84fdc96
Revert "(fix): use `MutableMapping` instead of `dict` due to broken d…
flying-sheep Jul 11, 2024
2608bc3
(fix): add clarification
ilan-gold Jul 11, 2024
e551e18
Simplify
flying-sheep Jul 11, 2024
13e3bb1
Merge branch 'ig/protocol_for_callback' into ig/read_dask_elem
ilan-gold Jul 11, 2024
2935e45
Merge branch 'main' into ig/read_dask_elem
ilan-gold Jul 11, 2024
bf0be15
Merge branch 'ig/read_dask_elem' of github.com:scverse/anndata into i…
ilan-gold Jul 11, 2024
9d37fc8
Merge branch 'main' into ig/read_dask_elem
ilan-gold Jul 12, 2024
1ffe43e
(fix): remove double `dask` intersphinx
ilan-gold Jul 12, 2024
f9df5bc
(fix): remove `_types.DaskArray` from type checking block
ilan-gold Jul 12, 2024
a85da39
(refactor): use `block_info` for resolving fetch location
ilan-gold Jul 15, 2024
3bef77c
Merge branch 'ig/read_dask_elem' of github.com:scverse/anndata into i…
ilan-gold Jul 15, 2024
899184f
(fix): dtype for reading
ilan-gold Jul 15, 2024
efb70ec
(fix): ignore import cycle problem (why??)
ilan-gold Jul 16, 2024
118f43c
(fix): add issue
ilan-gold Jul 16, 2024
f742a0a
(fix): subclass `Reader` to remove `datasetkwargs`
ilan-gold Jul 18, 2024
ae68731
(fix): add message tp errpr
ilan-gold Jul 18, 2024
f5e7760
Update tests/test_io_elementwise.py
ilan-gold Jul 18, 2024
96b13a3
(fix): correct `self.callback` check
ilan-gold Jul 18, 2024
9c68e36
(fix): erroneous diffs
ilan-gold Jul 18, 2024
410aeda
(fix): extra `read_elem` `dataset_kwargs`
ilan-gold Jul 18, 2024
31a30c4
(fix): remove more `dataset_kwargs` nonsense
ilan-gold Jul 18, 2024
80fe8cb
(chore): add docs
ilan-gold Jul 18, 2024
b314248
(fix): use `block_info` for dense
ilan-gold Jul 18, 2024
02d4735
(fix): more erroneous diffs
ilan-gold Jul 18, 2024
6e5534a
(fix): use context again
ilan-gold Jul 18, 2024
d26cfe8
(fix): change size by dimension in tests
ilan-gold Jul 22, 2024
94e43a3
(refactor): clean up `get_elem_name`
ilan-gold Jul 22, 2024
5160016
(fix): try new sphinx for error
ilan-gold Jul 22, 2024
43da9a3
(fix): return type
ilan-gold Jul 22, 2024
9735ced
(fix): protocol for reading
ilan-gold Jul 22, 2024
f1730c3
(fix): bring back ignored warning
ilan-gold Jul 22, 2024
9861b56
Fix docs
flying-sheep Jul 22, 2024
235096a
almost fix typing
flying-sheep Jul 22, 2024
dce9f07
add wrapper
flying-sheep Jul 22, 2024
2725ef2
move into type checking
flying-sheep Jul 22, 2024
ffe89f0
(fix): small type fxes
ilan-gold Jul 22, 2024
6cb231e
Merge branch 'main' into ig/read_dask_elem
ilan-gold Jul 22, 2024
75a64fc
block info types
flying-sheep Jul 22, 2024
3f734fe
simplify
flying-sheep Jul 22, 2024
c4c2356
rename
flying-sheep Jul 22, 2024
cc67a9b
simplify more
flying-sheep Jul 22, 2024
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 2 additions & 2 deletions docs/_templates/autosummary/class.rst
Original file line number Diff line number Diff line change
Expand Up @@ -13,7 +13,7 @@
.. autosummary::
:toctree: .
{% for item in attributes %}
~{{ fullname }}.{{ item }}
~{{ name }}.{{ item }}
{%- endfor %}
{% endif %}
{% endblock %}
Expand All @@ -26,7 +26,7 @@
:toctree: .
{% for item in methods %}
{%- if item != '__init__' %}
~{{ fullname }}.{{ item }}
~{{ name }}.{{ item }}
{%- endif -%}
{%- endfor %}
{% endif %}
Expand Down
1 change: 1 addition & 0 deletions docs/api.md
Original file line number Diff line number Diff line change
Expand Up @@ -121,6 +121,7 @@ Low level methods for reading and writing elements of an `AnnData` object to a s

experimental.read_elem
experimental.write_elem
experimental.read_elem_as_dask
```

Utilities for customizing the IO process:
Expand Down
2 changes: 1 addition & 1 deletion docs/conf.py
Original file line number Diff line number Diff line change
Expand Up @@ -27,7 +27,7 @@
# default settings
templates_path = ["_templates"]
html_static_path = ["_static"]
source_suffix = [".rst", ".md"]
source_suffix = {".rst": "restructuredtext", ".md": "markdown"}
master_doc = "index"
default_role = "literal"
exclude_patterns = [
Expand Down
1 change: 1 addition & 0 deletions docs/release-notes/0.11.0.md
Original file line number Diff line number Diff line change
Expand Up @@ -8,6 +8,7 @@
* Add `should_remove_unused_categories` option to `anndata.settings` to override current behavior. Default is `True` (i.e., previous behavior). Please refer to the [documentation](https://anndata.readthedocs.io/en/latest/generated/anndata.settings.html) for usage. {pr}`1340` {user}`ilan-gold`
* `scipy.sparse.csr_array` and `scipy.sparse.csc_array` are now supported when constructing `AnnData` objects {pr}`1028` {user}`ilan-gold` {user}`isaac-virshup`
* Add `should_check_uniqueness` option to `anndata.settings` to override current behavior. Default is `True` (i.e., previous behavior). Please refer to the [documentation](https://anndata.readthedocs.io/en/latest/generated/anndata.settings.html) for usage. {pr}`1507` {user}`ilan-gold`
* Add :func:`~anndata.experimental.read_elem_as_dask` function to handle i/o with sparse and dense arrays {pr}`1469` {user}`ilan-gold`
* Add functionality to write from GPU {class}`dask.array.Array` to disk {pr}`1550` {user}`ilan-gold`

#### Bugfix
Expand Down
2 changes: 1 addition & 1 deletion pyproject.toml
Original file line number Diff line number Diff line change
Expand Up @@ -66,7 +66,7 @@ dev = [
"pytest-xdist",
]
doc = [
"sphinx>=4.4",
"sphinx>=7.4.6",
"sphinx-book-theme>=1.1.0",
"sphinx-autodoc-typehints>=2.2.0",
"sphinx-issues",
Expand Down
17 changes: 16 additions & 1 deletion src/anndata/_core/file_backing.py
Original file line number Diff line number Diff line change
Expand Up @@ -2,7 +2,7 @@

from collections.abc import Mapping
from functools import singledispatch
from pathlib import Path
from pathlib import Path, PurePosixPath
from typing import TYPE_CHECKING

import h5py
Expand Down Expand Up @@ -161,3 +161,18 @@ def _(x):
@filename.register(ZarrGroup)
def _(x):
return x.store.path


@singledispatch
def get_elem_name(x):
raise NotImplementedError(f"Not implemented for {type(x)}")


@get_elem_name.register(h5py.Group)
def _(x):
return x.name


@get_elem_name.register(ZarrGroup)
def _(x):
return PurePosixPath(x.path).name
6 changes: 5 additions & 1 deletion src/anndata/_io/specs/__init__.py
Original file line number Diff line number Diff line change
@@ -1,21 +1,25 @@
from __future__ import annotations

from . import methods
from . import lazy_methods, methods
from .registry import (
_LAZY_REGISTRY, # noqa: F401
_REGISTRY, # noqa: F401
IOSpec,
Reader,
Writer,
get_spec,
read_elem,
read_elem_as_dask,
write_elem,
)

__all__ = [
"methods",
"lazy_methods",
"write_elem",
"get_spec",
"read_elem",
"read_elem_as_dask",
"Reader",
"Writer",
"IOSpec",
Expand Down
169 changes: 169 additions & 0 deletions src/anndata/_io/specs/lazy_methods.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,169 @@
from __future__ import annotations

from contextlib import contextmanager
from pathlib import Path
from typing import TYPE_CHECKING

import h5py
import numpy as np
from scipy import sparse

import anndata as ad
from anndata._core.file_backing import filename, get_elem_name
from anndata.compat import H5Array, H5Group, ZarrArray, ZarrGroup

from .registry import _LAZY_REGISTRY, IOSpec

if TYPE_CHECKING:
from typing import Literal, Union

from anndata.compat import DaskArray

from .registry import DaskReader


@contextmanager
def maybe_open_h5(path_or_group: Path | ZarrGroup, elem_name: str):
if not isinstance(path_or_group, Path):
yield path_or_group
return
file = h5py.File(path_or_group, "r")
try:
yield file[elem_name]
finally:
file.close()


_DEFAULT_STRIDE = 1000
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok I just spent a long time trying to debug why a zarr store was being overfetched, and it turns out it wasn't. This number is just crazy high. So I'm not entirely sure what the best way forward is. This number is (probably) good for bulk tasks like filtering or PCA but absolutely useless for random access. For example, on a CSC matrix, you end up fetching 5% of the dataset if there are 20,000 genes. So we need to be VERY careful here. Not sure what the answer is, actually. I don't think exposing stride is great because it's specific to csr/csc and also a subset of the functionality of chunks, but the flipside is that resolving the size of a sparse matrix requires looking at attrs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exposing a per-matrix (i.e., layers) chunking is tough with a monolithic API #1247 but a monolithic API (i.e., read_backed) makes it easier to deal with telling people how to do things, which is currently very difficult. I'll need to think about that PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But with that PR, it doesn't really make sense to expose a method called read_elem_as_xarray because that's way too specific - only our dataframes work there. So I'm not sure what the answer is. Perhaps changing this to read_lazy or something is not a bad idea.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is probably a fine value for now. And agree on not exposing stride. Also don't know what the answer is.



def compute_chunk_layout_for_axis_shape(
chunk_axis_shape: int, full_axis_shape: int
) -> tuple[int, ...]:
n_strides, rest = np.divmod(full_axis_shape, chunk_axis_shape)
chunk = (chunk_axis_shape,) * n_strides
if rest > 0:
chunk += (rest,)
return chunk


@_LAZY_REGISTRY.register_read(H5Group, IOSpec("csc_matrix", "0.1.0"))
@_LAZY_REGISTRY.register_read(H5Group, IOSpec("csr_matrix", "0.1.0"))
@_LAZY_REGISTRY.register_read(ZarrGroup, IOSpec("csc_matrix", "0.1.0"))
@_LAZY_REGISTRY.register_read(ZarrGroup, IOSpec("csr_matrix", "0.1.0"))
def read_sparse_as_dask(
elem: H5Group | ZarrGroup,
*,
_reader: DaskReader,
chunks: tuple[int, ...] | None = None,
) -> DaskArray:
import dask.array as da

path_or_group = Path(filename(elem)) if isinstance(elem, H5Group) else elem
elem_name = get_elem_name(elem)
shape: tuple[int, int] = tuple(elem.attrs["shape"])
dtype = elem["data"].dtype
is_csc: bool = elem.attrs["encoding-type"] == "csc_matrix"

stride: int = _DEFAULT_STRIDE
if chunks is not None:
if len(chunks) != 2:
raise ValueError("`chunks` must be a tuple of two integers")
flying-sheep marked this conversation as resolved.
Show resolved Hide resolved
if chunks[int(not is_csc)] != shape[int(not is_csc)]:
flying-sheep marked this conversation as resolved.
Show resolved Hide resolved
raise ValueError(
"Only the major axis can be chunked. "
f"Try setting chunks to {((-1, _DEFAULT_STRIDE) if is_csc else (_DEFAULT_STRIDE, -1))}"
)
stride = chunks[int(is_csc)]

def make_dask_chunk(
flying-sheep marked this conversation as resolved.
Show resolved Hide resolved
block_info: Union[ # noqa: UP007
flying-sheep marked this conversation as resolved.
Show resolved Hide resolved
dict[
Literal[None],
dict[str, Union[tuple[int, ...], list[tuple[int, ...]]]], # noqa: UP007
],
None,
] = None,
):
# We need to open the file in each task since `dask` cannot share h5py objects when using `dask.distributed`
# https://github.com/scverse/anndata/issues/1105
if block_info is None:
raise ValueError("Block info is required")
with maybe_open_h5(path_or_group, elem_name) as f:
mtx = ad.experimental.sparse_dataset(f)
array_location = block_info[None]["array-location"]
flying-sheep marked this conversation as resolved.
Show resolved Hide resolved
index = (
slice(array_location[0][0], array_location[0][1]),
slice(array_location[1][0], array_location[1][1]),
)
chunk = mtx[index]
return chunk

shape_minor, shape_major = shape if is_csc else shape[::-1]
chunks_major = compute_chunk_layout_for_axis_shape(stride, shape_major)
chunks_minor = (shape_minor,)
chunk_layout = (
(chunks_minor, chunks_major) if is_csc else (chunks_major, chunks_minor)
)
memory_format = sparse.csc_matrix if is_csc else sparse.csr_matrix
da_mtx = da.map_blocks(
make_dask_chunk,
dtype=dtype,
chunks=chunk_layout,
meta=memory_format((0, 0), dtype=dtype),
)
return da_mtx


@_LAZY_REGISTRY.register_read(H5Array, IOSpec("array", "0.2.0"))
def read_h5_array(
elem: H5Array, *, _reader: DaskReader, chunks: tuple[int, ...] | None = None
) -> DaskArray:
import dask.array as da

path = Path(elem.file.filename)
elem_name = elem.name
shape = tuple(elem.shape)
dtype = elem.dtype
chunks: tuple[int, ...] = (
chunks if chunks is not None else (_DEFAULT_STRIDE,) * len(shape)
)

def make_dask_chunk(
flying-sheep marked this conversation as resolved.
Show resolved Hide resolved
block_info: Union[ # noqa: UP007
dict[
Literal[None],
dict[str, Union[tuple[int, ...], list[tuple[int, ...]]]], # noqa: UP007
],
None,
] = None,
):
if block_info is None:
raise ValueError("Block info is required")
with maybe_open_h5(path, elem_name) as f:
idx = ()
for i in range(len(shape)):
array_location = block_info[None]["array-location"][i]
idx += (slice(array_location[0], array_location[1]),)
return f[idx]

chunk_layout = tuple(
compute_chunk_layout_for_axis_shape(chunks[i], shape[i])
for i in range(len(shape))
)

return da.map_blocks(
make_dask_chunk,
dtype=dtype,
chunks=chunk_layout,
)


@_LAZY_REGISTRY.register_read(ZarrArray, IOSpec("array", "0.2.0"))
def read_zarr_array(
elem: ZarrArray, *, _reader: DaskReader, chunks: tuple[int, ...] | None = None
) -> DaskArray:
chunks: tuple[int, ...] = chunks if chunks is not None else elem.chunks
import dask.array as da

return da.from_zarr(elem, chunks=chunks)
Loading
Loading