Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

(feat): experimental read_backed method for zarr + hdf5 via read_dispatched #947

Closed
wants to merge 147 commits into from

Conversation

ilan-gold
Copy link
Contributor

@ilan-gold ilan-gold commented Mar 8, 2023

Creates a new experimental AnnDataBacked class as well as a helper read_backed method for reading of on-disk zarr and hdf5 datasets lazily, with a focus on usage over the internet. The primary focus is less on optimizing for performance of analysis than on reading metadata/making fetching of data quick.

Fixes #951 as well
Probably fixes #981 if an "experimental" PR can do that.

The main highlights of this PR in experimental.read_backed:

  • xarray for all dataframes
  • by-default lazy loading of all elements (dense array, sparse array, dataframes) except AwkwardArrays
  • Out-of-the-box, fully backed, works with both h5ad and zarr while also being compatible with both on-disk/remote storage (i.e., changing subelements should work if you want them to be e.g., numpy arrays although this isn't tested because the focus here is really on reading)
  • simple to_memory function on the AnnDataBacked class that brings whatever you want locally for use in scanpy including optional exclude keys for making the download faster by restricting it to exactly what you need

Outside of this PR but included in the overall work:

  • views of views for backed sparse matrices (and more targeted/efficient reading)
  • common abstract AnnData class beginning to define a sensible contract for new classes to build upon
  • backed X for zarr

ivirshup and others added 30 commits April 29, 2022 17:38
To fix error reporting, I've put the attempt to catch an error during IO on top of the `read_elem` method.
Since the decorator is sometimes used on functions, I modified it to be able to handle the signature of both a method and a function.

What's weird is that sometimes the decorator is being passed the arguments of a method, that has a name like a method, but is a function.
So that still needs to be fixed.
pyproject.toml Outdated Show resolved Hide resolved
@ilan-gold
Copy link
Contributor Author

Before this can be merged/reviewed, there are several blocking PRs:
#949
#765
ivirshup#4

@ilan-gold
Copy link
Contributor Author

ilan-gold commented Aug 11, 2023

Open question: If we want out-of-core indices on load (i.e., obs_names) and we decide to make obs a generator we need to figure out

  1. Is e.g., obs immutable or mutable? (cached_property or just property)
  2. If mutable, how do we do caching?

See pydata/xarray#1650 for the "real" way forward. This just mitigates the initial load of indices when calling read_backed because creating an xarray object entails loading the coords. Ideally, that just wouldn't happen so making our xarray objects lazily-generated (on top of already being "lazily-loaded") is a way to circumvent this i.e., users only pay for the indices when accessing e.g., obs

@ilan-gold
Copy link
Contributor Author

Note to self: we need the sparse_dataset changes (I think) because they allow for getting a representation of the matrix without actually reading it into memory. That is, for X in the current AnnData object, the following happens

def X(self) -> np.ndarray | sparse.spmatrix | ArrayView | None:
"""Data matrix of shape :attr:`n_obs` × :attr:`n_vars`."""
if self.isbacked:
if not self.file.is_open:
self.file.open()
X = self.file["X"]
if isinstance(X, h5py.Group):
X = sparse_dataset(X)
# This is so that we can index into a backed dense dataset with
# indices that aren’t strictly increasing
if self.is_view:
X = _subset(X, (self._oidx, self._vidx))
elif self.is_view and self._adata_ref.X is None:
X = None
elif self.is_view:
X = as_view(
_subset(self._adata_ref.X, (self._oidx, self._vidx)),
ElementRef(self, "X"),
)

This is problematic because it means that accessing X immediately reads it into memory. This is why I moved the indexing on to the BaseCompressedSparseDataset class here.

@ilan-gold
Copy link
Contributor Author

No need as #1247 seems like it will be the way to go.

@ilan-gold ilan-gold closed this Aug 26, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants