
Commit

Add codespell support (config, workflow to detect/not fix) and make it fix a few typos (#267)

* Add github action to codespell main on push and PRs

* Add pre-commit definition for codespell

* Manually fix the "Zar" typo, which is not yet known to codespell

but it should not stay unknown for long: codespell-project/codespell#3568

* Add rudimentary codespell config

* [DATALAD RUNCMD] run codespell throughout, fixing typos automagically (but ignoring the overall failure due to ambiguous ones)

=== Do not change lines below ===
{
 "chain": [],
 "cmd": "codespell -w || :",
 "exit": 0,
 "extra_inputs": [],
 "inputs": [],
 "outputs": [],
 "pwd": "."
}
^^^ Do not change lines above ^^^

* [DATALAD RUNCMD] Do interactive fixing of some ambiguous typos

=== Do not change lines below ===
{
 "chain": [],
 "cmd": "codespell -w -i 3 -C 2",
 "exit": 0,
 "extra_inputs": [],
 "inputs": [],
 "outputs": [],
 "pwd": "."
}
^^^ Do not change lines above ^^^
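For reference, a rough Python equivalent of the interactive command recorded above, assuming the usual codespell flag meanings (`-w` writes fixes in place, `-i 3` is the most interactive prompt level, `-C 2` shows two lines of context around each hit):

```python
import subprocess

# Re-run the interactive typo-fixing pass recorded in the DATALAD block above.
# check=False tolerates a non-zero exit, like the `|| :` in the earlier
# non-interactive run.
subprocess.run(["codespell", "-w", "-i", "3", "-C", "2"], check=False)
```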

* Mark remaining ignores for codespell

* Fix the inline comment spacing expected by Rust

Co-authored-by: Sebastián Galkin <[email protected]>

---------

Co-authored-by: Sebastián Galkin <[email protected]>
yarikoptic and paraseba authored Oct 15, 2024
1 parent 1ec39f3 commit d2cced7
Showing 22 changed files with 76 additions and 40 deletions.
6 changes: 6 additions & 0 deletions .codespellrc
@@ -0,0 +1,6 @@
[codespell]
# Ref: https://github.com/codespell-project/codespell#using-a-config-file
skip = .git*,*.svg,*.lock,*.css,.codespellrc
check-hidden = true
ignore-regex = ^\s*"image/\S+": ".*|\bND\b
ignore-words-list = crate
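The `ignore-regex` above does two jobs at once: it skips whole notebook output lines holding base64 image payloads, and the bare token `ND`. A minimal sketch of what it matches (the sample strings are invented for illustration):

```python
import re

# The ignore-regex from the .codespellrc above: a notebook "image/..." output
# line, or the standalone word "ND".
ignore = re.compile(r'^\s*"image/\S+": ".*|\bND\b')

assert ignore.match('  "image/png": "iVBORw0KGgoAAAANS..."')  # notebook image payload
assert ignore.search("coordinates may be ND in general")      # bare "ND" token
assert not ignore.search("WEEKEND plans")                     # \b blocks matches inside words
```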
25 changes: 25 additions & 0 deletions .github/workflows/codespell.yml
@@ -0,0 +1,25 @@
# Codespell configuration is within .codespellrc
---
name: Codespell

on:
push:
branches: [main]
pull_request:
branches: [main]

permissions:
contents: read

jobs:
codespell:
name: Check for spelling errors
runs-on: ubuntu-latest

steps:
- name: Checkout
uses: actions/checkout@v4
- name: Annotate locations with typos
uses: codespell-project/codespell-problem-matcher@v1
- name: Codespell
uses: codespell-project/actions-codespell@v2
5 changes: 5 additions & 0 deletions .pre-commit-config.yaml
@@ -7,3 +7,8 @@ repos:
entry: just pre-commit
language: system
pass_filenames: false
- repo: https://github.com/codespell-project/codespell
# Configuration for codespell is in .codespellrc
rev: v2.3.0
hooks:
- id: codespell
6 changes: 3 additions & 3 deletions README.md
@@ -72,10 +72,10 @@ Arrays have two fundamental properties:
In Zarr / Icechunk, arrays are split into **chunks**,
A chunk is the minimum unit of data that must be read / written from storage, and thus choices about chunking have strong implications for performance.
Zarr leaves this completely up to the user.
Chunk shape should be chosen based on the anticipated data access patten for each array
Chunk shape should be chosen based on the anticipated data access pattern for each array
An Icechunk array is not bounded by an individual file and is effectively unlimited in size.

For further organization of data, Icechunk supports **groups** withing a single repo.
For further organization of data, Icechunk supports **groups** within a single repo.
Groups are like folders which contain multiple arrays and/or other groups.
Groups enable data to be organized into hierarchical trees.
A common usage pattern is to store multiple arrays in a group representing a NetCDF-style dataset.
@@ -86,7 +86,7 @@ Arbitrary JSON-style key-value metadata can be attached to both arrays and group

Every update to an Icechunk store creates a new **snapshot** with a unique ID.
Icechunk users must organize their updates into groups of related operations called **transactions**.
For example, appending a new time slice to mutliple arrays should be done as a single transaction, comprising the following steps
For example, appending a new time slice to multiple arrays should be done as a single transaction, comprising the following steps
1. Update the array metadata to resize the array to accommodate the new elements.
2. Write new chunks for each array in the group.
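As a hedged sketch of that transaction pattern (the store construction, group layout, and array names here are assumptions; `store.commit` is the method this changeset documents, and the group/array calls follow the standard zarr-python API):

```python
import numpy as np
import zarr

def append_time_slice(store, new_temp: np.ndarray, new_precip: np.ndarray) -> str:
    """Append one time step to two arrays and commit it as a single snapshot."""
    group = zarr.open_group(store=store)
    for name, data in (("temp", new_temp), ("precip", new_precip)):
        arr = group[name]
        t = arr.shape[0]
        arr.resize((t + 1, *arr.shape[1:]))  # step 1: grow the time axis
        arr[t] = data                        # step 2: write the new chunks
    return store.commit("append one time slice")  # one transaction, one snapshot
```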

2 changes: 1 addition & 1 deletion docs/docs/contributing.md
@@ -50,7 +50,7 @@ pip install -e icechunk@.
```


### Rust Development Worflow
### Rust Development Workflow

TODO

6 changes: 3 additions & 3 deletions docs/docs/faq.md
@@ -163,7 +163,7 @@ HDF is widely used in high-performance computing.
---

Icechunk and HDF5 share the same data model: multidimensional arrays and metadata organized into a hierarchical tree structure.
This data model can accomodate a wide range of different use cases and workflows.
This data model can accommodate a wide range of different use cases and workflows.

Both Icechunk and HDF5 use the concept of "chunking" to split large arrays into smaller storage units.

@@ -238,7 +238,7 @@ The following table compares Zarr + Icechunk with TileDB Embedded in a few key a
| *versioning* | snapshots, branches, tags | linear version history | Icechunk's data versioning model is closer to Git's. |
| *unit of storage* | chunk | tile | (basically the same thing) |
| *minimum write* | chunk | cell | TileDB allows atomic updates to individual cells, while Zarr requires writing an entire chunk. |
| *sparse arrays* | :material-close: | :material-check: | Zar + Icechunk do not currently support sparse arrays. |
| *sparse arrays* | :material-close: | :material-check: | Zarr + Icechunk do not currently support sparse arrays. |
| *virtual chunk references* | :material-check: | :material-close: | Icechunk enables references to chunks in other file formats (HDF5, NetCDF, GRIB, etc.), while TileDB does not. |

Beyond this list, there are numerous differences in the design, file layout, and implementation of Icechunk and TileDB embedded
@@ -251,7 +251,7 @@ SafeTensors is a format developed by HuggingFace for storing tensors (arrays) sa
By the same criteria Icechunk and Zarr are also "safe", in that it is impossible to trigger arbitrary code execution when reading data.

SafeTensors is a single-file format, like HDF5,
SafeTensors optimizes for a simple on-disk layout that facilitates mem-map-based zero-copy reading in ML training pipleines,
SafeTensors optimizes for a simple on-disk layout that facilitates mem-map-based zero-copy reading in ML training pipelines,
assuming that the data are being read from a local POSIX filesystem
Zarr and Icechunk instead allow for flexible chunking and compression to optimize I/O against object storage.

2 changes: 1 addition & 1 deletion docs/docs/icechunk-python/version-control.md
@@ -3,4 +3,4 @@
COMING SOON!

In the meantime, you can read about [version control in Arraylake](https://docs.earthmover.io/arraylake/version-control),
which is very similar to version contol in Icechunk.
which is very similar to version control in Icechunk.
4 changes: 2 additions & 2 deletions docs/docs/icechunk-python/virtual.md
@@ -4,7 +4,7 @@ While Icechunk works wonderfully with native chunks managed by Zarr, there is lo

!!! warning

While virtual references are fully supported in Icechunk, creating virtual datasets currently relies on using experimental or pre-release versions of open source tools. For full instructions on how to install the required tools and ther current statuses [see the tracking issue on Github](https://github.com/earth-mover/icechunk/issues/197).
While virtual references are fully supported in Icechunk, creating virtual datasets currently relies on using experimental or pre-release versions of open source tools. For full instructions on how to install the required tools and their current statuses [see the tracking issue on Github](https://github.com/earth-mover/icechunk/issues/197).
With time, these experimental features will make their way into the released packages.

To create virtual Icechunk datasets with Python, the community utilizes the [kerchunk](https://fsspec.github.io/kerchunk/) and [VirtualiZarr](https://virtualizarr.readthedocs.io/en/latest/) packages.
@@ -83,7 +83,7 @@ virtual_ds = xr.concat(
# err (time, zlev, lat, lon) int16 64MB ManifestArray<shape=(31, 1, 72...
```

We have a virtual dataset with 31 timestamps! One hint that this worked correctly is that the readout shows the variables and coordinates as [`ManifestArray`](https://virtualizarr.readthedocs.io/en/latest/usage.html#manifestarray-class) instances, the represenation that `VirtualiZarr` uses for virtual arrays. Let's create an Icechunk store to write this dataset to.
We have a virtual dataset with 31 timestamps! One hint that this worked correctly is that the readout shows the variables and coordinates as [`ManifestArray`](https://virtualizarr.readthedocs.io/en/latest/usage.html#manifestarray-class) instances, the representation that `VirtualiZarr` uses for virtual arrays. Let's create an Icechunk store to write this dataset to.

!!! note

6 changes: 3 additions & 3 deletions docs/docs/overview.md
@@ -63,10 +63,10 @@ Arrays have two fundamental properties:
In Zarr / Icechunk, arrays are split into **chunks**,
A chunk is the minimum unit of data that must be read / written from storage, and thus choices about chunking have strong implications for performance.
Zarr leaves this completely up to the user.
Chunk shape should be chosen based on the anticipated data access patten for each array
Chunk shape should be chosen based on the anticipated data access pattern for each array
An Icechunk array is not bounded by an individual file and is effectively unlimited in size.

For further organization of data, Icechunk supports **groups** withing a single repo.
For further organization of data, Icechunk supports **groups** within a single repo.
Groups are like folders which contain multiple arrays and/or other groups.
Groups enable data to be organized into hierarchical trees.
A common usage pattern is to store multiple arrays in a group representing a NetCDF-style dataset.
@@ -77,7 +77,7 @@ Arbitrary JSON-style key-value metadata can be attached to both arrays and group

Every update to an Icechunk store creates a new **snapshot** with a unique ID.
Icechunk users must organize their updates into groups of related operations called **transactions**.
For example, appending a new time slice to mutliple arrays should be done as a single transaction, comprising the following steps
For example, appending a new time slice to multiple arrays should be done as a single transaction, comprising the following steps
1. Update the array metadata to resize the array to accommodate the new elements.
2. Write new chunks for each array in the group.

6 changes: 3 additions & 3 deletions docs/docs/spec.md
@@ -307,7 +307,7 @@ If another client attempts to initialize a repository in the same location, only

If the specific snapshot ID is known, a client can open it directly in read only mode.

1. Use the specified shapshot ID to fetch the snapshot file.
1. Use the specified snapshot ID to fetch the snapshot file.
1. Fetch desired attributes and values from arrays.

#### From Branch
@@ -316,13 +316,13 @@ Usually, a client will want to read from the latest branch (e.g. `main`).

1. List the object store prefix `refs/branch.$BRANCH_NAME/` to obtain the latest branch file in the sequence. Due to the encoding of the sequence number, this should be the _first file_ in lexicographical order.
1. Read the branch file JSON contents to obtain the snapshot ID.
1. Use the shapshot ID to fetch the snapshot file.
1. Use the snapshot ID to fetch the snapshot file.
1. Fetch desired attributes and values from arrays.
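A compact sketch of those steps (the `list_prefix`/`get_object` helpers and the JSON field name are assumptions; the first-in-lexicographic-order rule is taken from the spec text above):

```python
import json

def branch_snapshot_id(list_prefix, get_object, branch: str) -> str:
    """Resolve a branch name to its latest snapshot ID."""
    keys = sorted(list_prefix(f"refs/branch.{branch}/"))
    latest = keys[0]  # the sequence-number encoding sorts the newest file first
    ref = json.loads(get_object(latest))
    return ref["snapshot"]  # field name assumed; the file holds the snapshot ID
```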

#### From Tag

1. Read the tag file found at `refs/tag.$TAG_NAME/ref.json` to obtain the snapshot ID.
1. Use the shapshot ID to fetch the snapshot file.
1. Use the snapshot ID to fetch the snapshot file.
1. Fetch desired attributes and values from arrays.

### Write New Snapshot
2 changes: 1 addition & 1 deletion icechunk-python/notebooks/demo-dummy-data.ipynb
@@ -357,7 +357,7 @@
"id": "f2b6a3c6-9518-4ec4-921f-5303d4e851c7",
"metadata": {},
"source": [
"### Commiting when not on `HEAD` will fail."
"### Committing when not on `HEAD` will fail."
]
},
{
10 changes: 5 additions & 5 deletions icechunk-python/notebooks/demo-s3.ipynb
@@ -1064,16 +1064,16 @@
"text": [
"u\n",
"D2YNJWWTKW6DY8ECPJZG\n",
"commited; 43.68043661117554 seconds\n",
"committed; 43.68043661117554 seconds\n",
"um\n",
"V0RSK39P1EXKB37F6Z10\n",
"commited; 44.08490180969238 seconds\n",
"committed; 44.08490180969238 seconds\n",
"v\n",
"JNDCHT5MF2MWRHYY8Q1G\n",
"commited; 61.78669619560242 seconds\n",
"committed; 61.78669619560242 seconds\n",
"vm\n",
"GAKXY70VJ2NQ3ANMEE10\n",
"commited; 55.72252893447876 seconds\n"
"committed; 55.72252893447876 seconds\n"
]
}
],
@@ -1093,7 +1093,7 @@
" exists_ok=True,\n",
" )\n",
" print(store.commit(f\"wrote {var}\"))\n",
" print(f\"commited; {time.time() - tic} seconds\")"
" print(f\"committed; {time.time() - tic} seconds\")"
]
},
{
10 changes: 5 additions & 5 deletions icechunk-python/python/icechunk/__init__.py
@@ -80,7 +80,7 @@ def open_or_create(cls, *args: Any, **kwargs: Any) -> Self:
store = cls.create(storage, mode, *args, **kwargs)

assert(store)
# We dont want to call _open() becuase icechunk handles the opening, etc.
# We dont want to call _open() because icechunk handles the opening, etc.
# if we have gotten this far we can mark it as open
store._is_open = True

@@ -293,7 +293,7 @@ def commit(self, message: str) -> str:
This method will fail if:
* there is no currently checked out branch
* some other writer updated the curret branch since the repository was checked out
* some other writer updated the current branch since the repository was checked out
"""
return self._store.commit(message)

@@ -306,7 +306,7 @@ async def async_commit(self, message: str) -> str:
This method will fail if:
* there is no currently checked out branch
* some other writer updated the curret branch since the repository was checked out
* some other writer updated the current branch since the repository was checked out
"""
return await self._store.async_commit(message)

@@ -321,7 +321,7 @@ def distributed_commit(
This method will fail if:
* there is no currently checked out branch
* some other writer updated the curret branch since the repository was checked out
* some other writer updated the current branch since the repository was checked out
other_change_set_bytes must be generated as the output of calling `change_set_bytes`
on other stores. The resulting commit will include changes from all stores.
@@ -341,7 +341,7 @@ async def async_distributed_commit(
This method will fail if:
* there is no currently checked out branch
* some other writer updated the curret branch since the repository was checked out
* some other writer updated the current branch since the repository was checked out
other_change_set_bytes must be generated as the output of calling `change_set_bytes`
on other stores. The resulting commit will include changes from all stores.
2 changes: 1 addition & 1 deletion icechunk-python/src/errors.rs
@@ -10,7 +10,7 @@ use thiserror::Error;
/// A simple wrapper around the StoreError to make it easier to convert to a PyErr
///
/// When you use the ? operator, the error is coerced. But if you return the value it is not.
/// So for now we just use the extra operation to get the coersion instead of manually mapping
/// So for now we just use the extra operation to get the coercion instead of manually mapping
/// the errors where this is returned from a python class
#[allow(clippy::enum_variant_names)]
#[derive(Debug, Error)]
6 changes: 3 additions & 3 deletions icechunk-python/tests/test_zarr/test_api.py
@@ -55,7 +55,7 @@ async def test_open_array(memory_store: IcechunkStore) -> None:
assert z.shape == (100,)

# open array, overwrite
# _store_dict wont currently work with IcechunkStore
# _store_dict won't currently work with IcechunkStore
# TODO: Should it?
pytest.xfail("IcechunkStore does not support _store_dict")
store._store_dict = {}
@@ -66,7 +66,7 @@ async def test_open_array(memory_store: IcechunkStore) -> None:
# open array, read-only
store_cls = type(store)

# _store_dict wont currently work with IcechunkStore
# _store_dict won't currently work with IcechunkStore
# TODO: Should it?

ro_store = store_cls.open(store_dict=store._store_dict, mode="r")
@@ -96,7 +96,7 @@ async def test_open_group(memory_store: IcechunkStore) -> None:

# open group, read-only
store_cls = type(store)
# _store_dict wont currently work with IcechunkStore
# _store_dict won't currently work with IcechunkStore
# TODO: Should it?
pytest.xfail("IcechunkStore does not support _store_dict")
ro_store = store_cls.open(store_dict=store._store_dict, mode="r")
2 changes: 1 addition & 1 deletion icechunk/examples/low_level_dataset.rs
@@ -84,7 +84,7 @@ let zarr_meta1 = ZarrArrayMetadata {{
chunk_key_encoding: ChunkKeyEncoding::Slash,
fill_value: FillValue::Int32(0),
codecs: Codecs("codec".to_string()),
storage_transformers: Some(StorageTransformers("tranformers".to_string())),
storage_transformers: Some(StorageTransformers("transformers".to_string())),
dimension_names: Some(vec![
Some("x".to_string()),
Some("y".to_string()),
2 changes: 1 addition & 1 deletion icechunk/src/lib.rs
@@ -7,7 +7,7 @@
//! - There is a low level interface that speaks zarr keys and values, and is used to provide the
//! zarr store that will be used from python. This is the [`zarr::Store`] type.
//! - There is a translation language between low and high levels. When user writes to a zarr key,
//! we need to convert that key to the language of arrays and groups. This is implmented it the
//! we need to convert that key to the language of arrays and groups. This is implemented it the
//! [`zarr`] module
//! - There is an abstract type for loading and saving of the Arrow datastructures.
//! This is the [`Storage`] trait. It knows how to fetch and write arrow.
2 changes: 1 addition & 1 deletion icechunk/src/refs.rs
@@ -310,7 +310,7 @@ mod tests {

/// Execute the passed block with all test implementations of Storage.
///
/// Currently this function executes agains the in-memory and local filesystem object_store
/// Currently this function executes against the in-memory and local filesystem object_store
/// implementations.
async fn with_test_storages<
R,
4 changes: 2 additions & 2 deletions icechunk/src/repository.rs
@@ -58,7 +58,7 @@ pub struct RepositoryConfig {
pub inline_chunk_threshold_bytes: u16,
// Unsafely overwrite refs on write. This is not recommended, users should only use it at their
// own risk in object stores for which we don't support write-object-if-not-exists. There is
// teh posibility of race conditions if this variable is set to true and there are concurrent
// the possibility of race conditions if this variable is set to true and there are concurrent
// commit attempts.
pub unsafe_overwrite_refs: bool,
}
@@ -1399,7 +1399,7 @@ mod tests {
let non_chunk = ds.get_chunk_ref(&new_array_path, &ChunkIndices(vec![1])).await?;
assert_eq!(non_chunk, None);

// update old array use attriutes and check them
// update old array use attributes and check them
ds.set_user_attributes(
array1_path.clone(),
Some(UserAttributes::try_new(br#"{"updated": true}"#).unwrap()),
4 changes: 2 additions & 2 deletions icechunk/src/storage/object_store.rs
@@ -72,7 +72,7 @@ pub struct ObjectStorage {
}

impl ObjectStorage {
/// Create an in memory Storage implementantion
/// Create an in memory Storage implementation
///
/// This implementation should not be used in production code.
pub fn new_in_memory_store(prefix: Option<String>) -> ObjectStorage {
@@ -88,7 +88,7 @@ impl ObjectStorage {
}
}

/// Create an local filesystem Storage implementantion
/// Create an local filesystem Storage implementation
///
/// This implementation should not be used in production code.
pub fn new_local_store(prefix: &StdPath) -> Result<ObjectStorage, std::io::Error> {
2 changes: 1 addition & 1 deletion icechunk/src/zarr.rs
@@ -754,7 +754,7 @@ impl Store {
.list_prefix(prefix)
.await?
.map_ok(move |s| {
// If the prefix is "/", get rid of it. This can happend when prefix is missing
// If the prefix is "/", get rid of it. This can happen when prefix is missing
// the trailing slash (as it does in zarr-python impl)
let rem = &s[idx..].trim_start_matches('/');
let parent = rem.split_once('/').map_or(*rem, |(parent, _)| parent);
2 changes: 1 addition & 1 deletion icechunk/tests/test_s3_storage.rs
@@ -75,7 +75,7 @@ pub async fn test_chunk_write_read() -> Result<(), Box<dyn std::error::Error>> {
assert_eq!(Bytes::from_static(b"ello"), back);

let back = storage.fetch_chunk(&id, &ByteRange::to_offset(3)).await?;
assert_eq!(Bytes::from_static(b"hel"), back);
assert_eq!(Bytes::from_static(b"hel"), back); // codespell:ignore

let back = storage.fetch_chunk(&id, &ByteRange::bounded(1, 4)).await?;
assert_eq!(Bytes::from_static(b"ell"), back);
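In Python-slice terms, the byte-range semantics these assertions imply (assuming the chunk payload begins with `b"hello"`):

```python
data = b"hello, world"      # assumed payload; only the prefix matters here
assert data[:3] == b"hel"   # ByteRange::to_offset(3)
assert data[1:4] == b"ell"  # ByteRange::bounded(1, 4)
```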