Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Manifest/Functional Store #427

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from
Draft
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
117 changes: 116 additions & 1 deletion virtualizarr/manifests/manifest.py
Original file line number Diff line number Diff line change
@@ -1,11 +1,22 @@
import json
import re
from collections.abc import ItemsView, Iterable, Iterator, KeysView, ValuesView
from collections.abc import (
AsyncIterator,
ItemsView,
Iterable,
Iterator,
KeysView,
ValuesView,
)
from pathlib import PosixPath
from typing import Any, Callable, NewType, Tuple, TypedDict, cast
from urllib.parse import urlparse, urlunparse

import numpy as np
import xarray as xr
from fsspec.asyn import AsyncFileSystem
from zarr.abc.store import RangeByteRequest, Store
from zarr.core.buffer import Buffer, BufferPrototype

from virtualizarr.types import ChunkKey

Expand Down Expand Up @@ -54,6 +65,110 @@
return ChunkEntry(path=path, offset=offset, length=length)


class ManifestStore(Store):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this class should be defined in it's own file, and potentially even made public (developer) API.

supports_writes: bool = False
supports_deletes: bool = False
supports_partial_writes: bool = False
supports_listing: bool = True

fs: AsyncFileSystem
vds: xr.Dataset
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically this could also be a DataTree, or maybe a dict of Datasets. But we might never need that.


def __init__(self, fs, vds, read_only=True):
super().__init__(read_only=read_only)
self.fs = fs
self.vds = vds

Check warning on line 80 in virtualizarr/manifests/manifest.py

View check run for this annotation

Codecov / codecov/patch

virtualizarr/manifests/manifest.py#L78-L80

Added lines #L78 - L80 were not covered by tests
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if some of the variables in the dataset are not virtual? Do we have any need to support that case?


async def clear(self) -> None:
self.fs = None
self.vds = None

Check warning on line 84 in virtualizarr/manifests/manifest.py

View check run for this annotation

Codecov / codecov/patch

virtualizarr/manifests/manifest.py#L83-L84

Added lines #L83 - L84 were not covered by tests

def __str__(self) -> str:
return f"manifest://{id(self.vds)}"

Check warning on line 87 in virtualizarr/manifests/manifest.py

View check run for this annotation

Codecov / codecov/patch

virtualizarr/manifests/manifest.py#L87

Added line #L87 was not covered by tests
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might want to build a custom repr using the xarray.Dataset repr here. Low priority suggestion though.


def __repr__(self) -> str:
return f"ManifestStore('{self}')"

Check warning on line 90 in virtualizarr/manifests/manifest.py

View check run for this annotation

Codecov / codecov/patch

virtualizarr/manifests/manifest.py#L90

Added line #L90 was not covered by tests

def __eq__(self, other: object) -> bool:
return (

Check warning on line 93 in virtualizarr/manifests/manifest.py

View check run for this annotation

Codecov / codecov/patch

virtualizarr/manifests/manifest.py#L93

Added line #L93 was not covered by tests
isinstance(other, type(self))
and self.fs == other.fs
and xr.testing.assert_identical(self.vds, other.vds)
)

async def get(
self,
key: str,
prototype: BufferPrototype,
byte_range: None = None, # could this optionally accept a RangeByteRequest?
) -> Buffer | None:
if not self._is_open:
await self._open()
print("key: ", key)
print("key split: ", key.split("/"))
array_name, _, chunk_key = key.split("/")

Check warning on line 109 in virtualizarr/manifests/manifest.py

View check run for this annotation

Codecov / codecov/patch

virtualizarr/manifests/manifest.py#L105-L109

Added lines #L105 - L109 were not covered by tests
# TODO: is this the best way?
url, offset, length = self.vds[array_name].data.manifest.dict()[chunk_key]
Comment on lines +110 to +111
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a reason to improve the API of ChunkManifest to make it more dict-like (separately from this PR). Although I guess we already have the entire numpy array containing all references already in memory.

value = prototype.buffer.from_bytes(

Check warning on line 112 in virtualizarr/manifests/manifest.py

View check run for this annotation

Codecov / codecov/patch

virtualizarr/manifests/manifest.py#L111-L112

Added lines #L111 - L112 were not covered by tests
await self.fs._cat_file(
url,
start=offset,
end=offset + length,
)
)
return value

Check warning on line 119 in virtualizarr/manifests/manifest.py

View check run for this annotation

Codecov / codecov/patch

virtualizarr/manifests/manifest.py#L119

Added line #L119 was not covered by tests

# TODO: need a get_v3_array_metadata method
# to handle key="zarr.json" and return the metadata for the array

async def get_partial_values(
self,
prototype: BufferPrototype,
key_ranges: Iterable[tuple[str, RangeByteRequest | None]],
) -> list[Buffer | None]:
key_ranges = list(key_ranges)
paths: list[str] = []
starts: list[int] = []
stops: list[int] = []
for key, _ in key_ranges:
array_name, _, chunk_key = key.split("/")
url, offset, length = self.vds[array_name].data.manifest.dict()[chunk_key]
paths.append(url)
starts.append(offset)
stops.append(offset + length)
res = await self.fs._cat_ranges(paths, starts, stops, on_error="return")
return [prototype.buffer.from_bytes(r) for r in res]

Check warning on line 140 in virtualizarr/manifests/manifest.py

View check run for this annotation

Codecov / codecov/patch

virtualizarr/manifests/manifest.py#L129-L140

Added lines #L129 - L140 were not covered by tests

async def exists(self, key: str) -> bool:
array_name, _, chunk_key = key.split("/")
url, _, _ = self.vds[array_name].data.manifest.dict()[chunk_key]
return await self.fs._exists(url)

Check warning on line 145 in virtualizarr/manifests/manifest.py

View check run for this annotation

Codecov / codecov/patch

virtualizarr/manifests/manifest.py#L143-L145

Added lines #L143 - L145 were not covered by tests
Comment on lines +142 to +145
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should instead look at the ._paths array in the manifest and return True if the path string is non-empty.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But I think we need to check the path for that specific chunk. The _paths array cannot retrieve the specific path for that key

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I mean is that there is a convention in the ChunkManifest class that a zarr store with missing chunks has those chunks represented using an empty path string in the manifest. (The numpy array implementation prevented me from just omitting those keys entirely.)

It's the readers' job to find out if a chunk doesn't actually exist, so by the time we get here we should already know which chunks exist in storage and which don't without having to reach out to the storage to find out.


async def list(self) -> AsyncIterator[str]:
for array_name in self.vds.data_vars:
for chunk_key in self.vds[array_name].data.manifest:
yield f"{array_name}/{chunk_key}"

Check warning on line 150 in virtualizarr/manifests/manifest.py

View check run for this annotation

Codecov / codecov/patch

virtualizarr/manifests/manifest.py#L148-L150

Added lines #L148 - L150 were not covered by tests

async def list_prefix(self, prefix: str) -> AsyncIterator[str]:
raise NotImplementedError

Check warning on line 153 in virtualizarr/manifests/manifest.py

View check run for this annotation

Codecov / codecov/patch

virtualizarr/manifests/manifest.py#L153

Added line #L153 was not covered by tests

async def list_dir(self, prefix: str) -> AsyncIterator[str]:
raise NotImplementedError

Check warning on line 156 in virtualizarr/manifests/manifest.py

View check run for this annotation

Codecov / codecov/patch

virtualizarr/manifests/manifest.py#L156

Added line #L156 was not covered by tests

async def delete(self, key: str) -> None:
raise NotImplementedError

Check warning on line 159 in virtualizarr/manifests/manifest.py

View check run for this annotation

Codecov / codecov/patch

virtualizarr/manifests/manifest.py#L159

Added line #L159 was not covered by tests

async def set(
self, key: str, value: Buffer, byte_range: tuple[int, int] | None = None
) -> None:
raise NotImplementedError

Check warning on line 164 in virtualizarr/manifests/manifest.py

View check run for this annotation

Codecov / codecov/patch

virtualizarr/manifests/manifest.py#L164

Added line #L164 was not covered by tests

async def set_partial_values(
self, key_start_values: Iterable[tuple[str, int, bytes]]
) -> None:
raise NotImplementedError

Check warning on line 169 in virtualizarr/manifests/manifest.py

View check run for this annotation

Codecov / codecov/patch

virtualizarr/manifests/manifest.py#L169

Added line #L169 was not covered by tests


def validate_and_normalize_path_to_uri(path: str, fs_root: str | None = None) -> str:
"""
Makes all paths into fully-qualified absolute URIs, or raises.
Expand Down
Loading