nano-arrow #11179

Merged
merged 13 commits into from
Sep 19, 2023
5 changes: 3 additions & 2 deletions .github/workflows/lint-rust.yml
@@ -38,7 +38,8 @@ jobs:
save-if: ${{ github.ref_name == 'main' }}

- name: Run cargo clippy with all features enabled
-run: cargo clippy --workspace --all-targets --all-features -- -D warnings
+# not all features can combine with each other for nano-arrow
+run: cargo clippy --workspace --all-targets --exclude nano-arrow --all-features -- -D warnings

# Default feature set should compile on the stable toolchain
clippy-stable:
@@ -58,7 +59,7 @@ jobs:
save-if: ${{ github.ref_name == 'main' }}

- name: Run cargo clippy
-run: cargo clippy --workspace --all-targets -- -D warnings
+run: cargo clippy --workspace --all-targets --exclude nano-arrow -- -D warnings

rustfmt:
if: github.ref_name != 'main'
16 changes: 10 additions & 6 deletions Cargo.toml
@@ -28,7 +28,7 @@ bytemuck = { version = "1", features = ["derive", "extern_crate_alloc"] }
chrono = { version = "0.4", default-features = false, features = ["std"] }
chrono-tz = "0.8.1"
ciborium = "0.2"
-either = "1.8"
+either = "1.9"
futures = "0.3.25"
hashbrown = { version = "0.14", features = ["rayon", "ahash"] }
indexmap = { version = "2", features = ["std"] }
@@ -50,6 +50,12 @@ strum_macros = "0.25"
thiserror = "1"
url = "2.3.1"
version_check = "0.9.4"
+simdutf8 = "0.1.4"
+hex = "0.4.3"
+base64 = "0.21.2"
+fallible-streaming-iterator = "0.1.9"
+streaming-iterator = "0.1.9"
+
xxhash-rust = { version = "0.8.6", features = ["xxh3"] }
polars-core = { version = "0.33.2", path = "crates/polars-core", default-features = false }
polars-arrow = { version = "0.33.2", path = "crates/polars-arrow", default-features = false }
@@ -69,11 +75,9 @@ polars-json = { version = "0.33.2", path = "crates/polars-json", default-feature
polars = { version = "0.33.2", path = "crates/polars", default-features = false }

[workspace.dependencies.arrow]
-package = "arrow2"
-# git = "https://github.com/jorgecarleitao/arrow2"
-# rev = "7c93e358fc400bf3c0c0219c22eefc6b38fc2d12"
-# branch = ""
-version = "0.18.0"
+package = "nano-arrow"
+version = "0.1.0"
+path = "crates/nano-arrow"
default-features = false
features = [
"compute_aggregate",
4 changes: 2 additions & 2 deletions crates/Makefile
@@ -10,11 +10,11 @@ fmt: ## Run rustfmt and dprint

.PHONY: check
check: ## Run cargo check with all features
-cargo check --workspace --all-targets --all-features
+cargo check --workspace --all-targets --exclude nano-arrow --all-features

.PHONY: clippy
clippy: ## Run clippy with all features
-cargo clippy --workspace --all-targets --all-features
+cargo clippy --workspace --all-targets --exclude nano-arrow --all-features

.PHONY: clippy-default
clippy-default: ## Run clippy with default features
198 changes: 198 additions & 0 deletions crates/nano-arrow/Cargo.toml
@@ -0,0 +1,198 @@
[package]
name = "nano-arrow"
version = "0.1.0"
authors = ["Jorge C. Leitao <[email protected]>", "Apache Arrow <[email protected]>", "Ritchie Vink"]
edition.workspace = true
homepage.workspace = true
licence = "Apache 2.0 and MIT"
license.workspace = true
repository.workspace = true
description = "Minimal implementation of the Arrow specification forked from arrow2."

# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html

[dependencies]
bytemuck.workspace = true
chrono.workspace = true
# for timezone support
chrono-tz = { workspace = true, optional = true }
dyn-clone = "1"
either.workspace = true
foreign_vec = "0.1.0"
hashbrown.workspace = true
num-traits.workspace = true
simdutf8.workspace = true

# for decimal i256
ethnum = "1"

# To efficiently cast numbers to strings
lexical-core = { version = "0.8", optional = true }

fallible-streaming-iterator = { workspace = true, optional = true }
regex = { workspace = true, optional = true }
regex-syntax = { version = "0.7", optional = true }
streaming-iterator = { workspace = true }

indexmap = { workspace = true, optional = true }

arrow-format = { version = "0.8", optional = true, features = ["ipc"] }

hex = { workspace = true, optional = true }

# for IPC compression
lz4 = { version = "1.24", optional = true }
zstd = { version = "0.12", optional = true }

base64 = { workspace = true, optional = true }

# to write to parquet as a stream
futures = { version = "0.3", optional = true }

# to read IPC as a stream
async-stream = { version = "0.3.2", optional = true }

# avro support
avro-schema = { version = "0.3", optional = true }

# for division/remainder optimization at runtime
strength_reduce = { version = "0.2", optional = true }

# For instruction multiversioning
multiversion = { workspace = true, optional = true }

# Faster hashing
ahash.workspace = true

# Support conversion to/from arrow-rs
arrow-array = { version = ">=40", optional = true }
arrow-buffer = { version = ">=40", optional = true }
arrow-data = { version = ">=40", optional = true }
arrow-schema = { version = ">=40", optional = true }

[target.wasm32-unknown-unknown.dependencies]
getrandom = { version = "0.2", features = ["js"] }

# parquet support
[dependencies.parquet2]
version = "0.17"
optional = true
default_features = false
features = ["async"]

[dev-dependencies]
avro-rs = { version = "0.13", features = ["snappy"] }
criterion = "0.4"
crossbeam-channel = "0.5.1"
doc-comment = "0.3"
flate2 = "1"
# used to run formal property testing
proptest = { version = "1", default_features = false, features = ["std"] }
# use for flaky testing
rand = "0.8"
# use for generating and testing random data samples
sample-arrow2 = "0.1"
sample-std = "0.1"
sample-test = "0.1"
# used to test async readers
tokio = { version = "1", features = ["macros", "rt", "fs", "io-util"] }
tokio-util = { version = "0.7", features = ["compat"] }

[package.metadata.docs.rs]
features = ["full"]
rustdoc-args = ["--cfg", "docsrs"]

[features]
default = []
full = [
"arrow",
"io_ipc",
"io_flight",
"io_ipc_write_async",
"io_ipc_read_async",
"io_ipc_compression",
"io_parquet",
"io_parquet_compression",
"io_avro",
"io_avro_compression",
"io_avro_async",
"regex-syntax",
"compute",
# parses timezones used in timestamp conversions
"chrono-tz",
]
arrow = ["arrow-buffer", "arrow-schema", "arrow-data", "arrow-array"]
io_ipc = ["arrow-format"]
io_ipc_write_async = ["io_ipc", "futures"]
io_ipc_read_async = ["io_ipc", "futures", "async-stream"]
io_ipc_compression = ["lz4", "zstd"]
io_flight = ["io_ipc", "arrow-format/flight-data"]

# base64 + io_ipc because arrow schemas are stored as base64-encoded ipc format.
io_parquet = ["parquet2", "io_ipc", "base64", "futures", "fallible-streaming-iterator"]

io_parquet_compression = [
"io_parquet_zstd",
"io_parquet_gzip",
"io_parquet_snappy",
"io_parquet_lz4",
"io_parquet_brotli",
]

# sample testing of generated arrow data
io_parquet_sample_test = ["io_parquet"]

# compression backends
io_parquet_zstd = ["parquet2/zstd"]
io_parquet_snappy = ["parquet2/snappy"]
io_parquet_gzip = ["parquet2/gzip"]
io_parquet_lz4_flex = ["parquet2/lz4_flex"]
io_parquet_lz4 = ["parquet2/lz4"]
io_parquet_brotli = ["parquet2/brotli"]

# parquet bloom filter functions
io_parquet_bloom_filter = ["parquet2/bloom_filter"]

io_avro = ["avro-schema"]
io_avro_compression = [
"avro-schema/compression",
]
io_avro_async = ["avro-schema/async"]

# the compute kernels. Disabling this significantly reduces compile time.
compute_aggregate = ["multiversion"]
compute_arithmetics_decimal = ["strength_reduce"]
compute_arithmetics = ["strength_reduce", "compute_arithmetics_decimal"]
compute_bitwise = []
compute_boolean = []
compute_boolean_kleene = []
compute_cast = ["lexical-core", "compute_take"]
compute_comparison = ["compute_take", "compute_boolean"]
compute_concatenate = []
compute_filter = []
compute_hash = ["multiversion"]
compute_if_then_else = []
compute_take = []
compute_temporal = []
compute = [
"compute_aggregate",
"compute_arithmetics",
"compute_bitwise",
"compute_boolean",
"compute_boolean_kleene",
"compute_cast",
"compute_comparison",
"compute_concatenate",
"compute_filter",
"compute_hash",
"compute_if_then_else",
"compute_take",
"compute_temporal",
]
simd = []

[build-dependencies]
rustc_version = "0.4.0"

[package.metadata.cargo-all-features]
allowlist = ["compute", "compute_sort", "compute_hash", "compute_nullif"]
32 changes: 32 additions & 0 deletions crates/nano-arrow/src/README.md
@@ -0,0 +1,32 @@
# Crate's design

This document describes the design of this module, and thus of the overall crate.
Each module MAY have its own design document covering the specifics of that module; if it does,
the document MUST be that module's `README.md`.

## Equality

Array equality is not defined in the Arrow specification. This crate follows the intent of the specification, but there is neither a guarantee nor any verification that its definition matches that of other implementations, e.g. the C++ one.

There is a single source of truth about whether two arrays are equal, and that is via their
equality operators, defined on the module [`array/equal`](array/equal/mod.rs).

Implementations MUST use these operators for asserting equality, so that all testing follows the same definition of array equality.
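
As an illustration of the single-source-of-truth rule, the sketch below routes `==` (and therefore `assert_eq!`) through one `equal` function. `BoolArray`, its `Vec<bool>` validity mask, and the decision to ignore values stored under null slots are assumptions made for this example; they are not the crate's actual types or its definition of equality.

```rust
/// Hypothetical stand-in for an array type; not the crate's real `BooleanArray`.
#[derive(Debug, Clone)]
pub struct BoolArray {
    values: Vec<bool>,
    /// `None` means "no nulls"; `Some(v)` marks slot `i` as valid when `v[i]` is true.
    validity: Option<Vec<bool>>,
}

impl BoolArray {
    pub fn new(values: Vec<bool>, validity: Option<Vec<bool>>) -> Self {
        if let Some(v) = &validity {
            assert_eq!(v.len(), values.len(), "validity must match values in length");
        }
        Self { values, validity }
    }

    pub fn len(&self) -> usize {
        self.values.len()
    }

    /// Returns the value at slot `i`, or `None` if the slot is null.
    fn get(&self, i: usize) -> Option<bool> {
        let valid = self.validity.as_ref().map_or(true, |v| v[i]);
        if valid {
            Some(self.values[i])
        } else {
            None
        }
    }
}

/// The single place where equality is defined (playing the role of `array/equal`).
/// Assumption: values under null slots do not participate in the comparison.
pub fn equal(lhs: &BoolArray, rhs: &BoolArray) -> bool {
    lhs.len() == rhs.len() && (0..lhs.len()).all(|i| lhs.get(i) == rhs.get(i))
}

/// `==` delegates to `equal`, so tests and library code share one definition.
impl PartialEq for BoolArray {
    fn eq(&self, other: &Self) -> bool {
        equal(self, other)
    }
}
```

With this layout, two arrays that differ only in the bits hidden behind a null slot compare equal, and every test that uses `assert_eq!` inherits that behavior automatically.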

## Error handling

- Errors from an external dependency MUST be encapsulated in `External`.
- Errors from IO MUST be encapsulated in `Io`.
- This crate MAY return `NotYetImplemented` when the functionality does not exist, or it MAY panic with `unimplemented!`.
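
The rules above can be sketched with a small error enum. The variant names `External`, `Io` and `NotYetImplemented` come from the text; the variant payloads, the `from_external` helper, and the `parse_len` and `read_all` functions are hypothetical and only show how each kind of failure might be wrapped.

```rust
use std::fmt;

/// Sketch of an error type with the three variants named in the rules above.
#[derive(Debug)]
pub enum Error {
    /// The functionality does not exist yet.
    NotYetImplemented(String),
    /// Wraps an error from an external dependency, with some context.
    External(String, Box<dyn std::error::Error + Send + Sync>),
    /// Wraps an IO error.
    Io(std::io::Error),
}

impl fmt::Display for Error {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Error::NotYetImplemented(what) => write!(f, "not yet implemented: {what}"),
            Error::External(context, source) => write!(f, "external error ({context}): {source}"),
            Error::Io(source) => write!(f, "io error: {source}"),
        }
    }
}

impl std::error::Error for Error {}

/// IO errors convert automatically, so `?` can be used on `std::io` results.
impl From<std::io::Error> for Error {
    fn from(error: std::io::Error) -> Self {
        Error::Io(error)
    }
}

impl Error {
    /// Hypothetical helper that wraps any third-party error in `External`.
    pub fn from_external<E>(context: &str, error: E) -> Self
    where
        E: Into<Box<dyn std::error::Error + Send + Sync>>,
    {
        Error::External(context.to_string(), error.into())
    }
}

/// A non-IO failure from another component ends up in `External`.
pub fn parse_len(text: &str) -> Result<usize, Error> {
    text.trim()
        .parse::<usize>()
        .map_err(|e| Error::from_external("parsing length", e))
}

/// An IO failure ends up in `Io` through the `From` impl above.
pub fn read_all(mut reader: impl std::io::Read) -> Result<Vec<u8>, Error> {
    let mut buffer = Vec::new();
    reader.read_to_end(&mut buffer)?;
    Ok(buffer)
}
```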

## Logical and physical types

There is a strict separation between physical and logical types:

- physical types MUST be implemented via generics
- logical types MUST be implemented via variables (whose value is e.g. an `enum`)
- logical types MUST be declared and implemented on the `datatypes` module
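
A minimal sketch of this separation, under simplifying assumptions: the physical type is the generic parameter `T`, while the logical type is a runtime `DataType` value stored next to the data. The `NativeType` trait and its `supports` method are hypothetical names for this example, and the `DataType` enum is reduced to four variants.

```rust
/// Logical types are runtime values (here, variants of an `enum`).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DataType {
    Int32,
    Date32,
    Int64,
    Timestamp,
}

/// Physical types are compile-time generic parameters.
/// `supports` is a hypothetical check tying a logical type to its physical layout.
pub trait NativeType: Copy + std::fmt::Debug {
    fn supports(data_type: DataType) -> bool;
}

impl NativeType for i32 {
    fn supports(data_type: DataType) -> bool {
        matches!(data_type, DataType::Int32 | DataType::Date32)
    }
}

impl NativeType for i64 {
    fn supports(data_type: DataType) -> bool {
        matches!(data_type, DataType::Int64 | DataType::Timestamp)
    }
}

/// One generic struct serves every logical type that shares a physical layout.
#[derive(Debug, Clone)]
pub struct PrimitiveArray<T: NativeType> {
    data_type: DataType,
    values: Vec<T>,
}

impl<T: NativeType> PrimitiveArray<T> {
    /// Errors when the logical type cannot be represented by the physical type `T`.
    pub fn try_new(data_type: DataType, values: Vec<T>) -> Result<Self, String> {
        if !T::supports(data_type) {
            return Err(format!("{data_type:?} is not backed by this physical type"));
        }
        Ok(Self { data_type, values })
    }

    pub fn data_type(&self) -> DataType {
        self.data_type
    }

    pub fn values(&self) -> &[T] {
        &self.values
    }
}
```

In this sketch a `Date32` array and an `Int32` array are both `PrimitiveArray<i32>`: they share code paths and memory layout, and differ only in the `DataType` value they carry.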

## Source of undefined behavior

There is one, and only one, acceptable source of undefined behavior: FFI. It is impossible to prove that data passed via pointers is safe to consume; there is only a promise from the specification.
73 changes: 73 additions & 0 deletions crates/nano-arrow/src/array/README.md
@@ -0,0 +1,73 @@
# Array module

This document describes the overall design of this module.

## Notation:

- "array" in this module denotes any struct that implements the trait `Array`.
- "mutable array" in this module denotes any struct that implements the trait `MutableArray`.
- words in `code` denote existing terms in this implementation.

## Arrays:

- Every arrow array with a different physical representation MUST be implemented as a struct or generic struct.

- An array MAY have its own module. E.g. `primitive/mod.rs`

- An array with a null bitmap MUST implement it as `Option<Bitmap>`

- An array MUST be `#[derive(Clone)]`

- The trait `Array` MUST only be implemented by structs in this module.

- Every child array on the struct MUST be `Box<dyn Array>`.

- An array MUST implement `try_new(...) -> Result<Self>`. This method MUST error iff
the data does not follow the arrow specification, including any sentinel types such as utf8.

- An array MAY implement `unsafe try_new_unchecked` that skips validation steps that are `O(N)`.

- An array MUST implement either `new_empty()` or `new_empty(DataType)` that returns a zero-length instance of `Self`.

- An array MUST implement either `new_null(length: usize)` or `new_null(DataType, length: usize)` that returns a valid array of length `length` whose elements are all null.

- An array MAY implement `value(i: usize)` that returns the value at slot `i` ignoring the validity bitmap.

- Functions to create new arrays from native Rust SHOULD be named as follows:
  - `from`: from a slice of optional values (e.g. `AsRef<[Option<bool>]>` for `BooleanArray`)
  - `from_slice`: from a slice of values (e.g. `AsRef<[bool]>` for `BooleanArray`)
  - `from_trusted_len_iter`: from an iterator of trusted length of optional values
  - `from_trusted_len_values_iter`: from an iterator of trusted length of values
  - `try_from_trusted_len_iter`: from a fallible iterator of trusted length of optional values
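
To make the constructor conventions above concrete, here is a minimal sketch of one array type that follows them. `Int32Array` is defined locally for this example, the `Vec<bool>` validity mask stands in for `Bitmap`, and `String` stands in for the crate's error type; all three are assumptions rather than the crate's actual definitions.

```rust
/// Hypothetical array type following the naming rules above.
#[derive(Debug, Clone, PartialEq)]
pub struct Int32Array {
    values: Vec<i32>,
    /// Stand-in for `Option<Bitmap>`: `true` marks a valid slot.
    validity: Option<Vec<bool>>,
}

impl Int32Array {
    /// Validates the invariants and errors if they do not hold.
    pub fn try_new(values: Vec<i32>, validity: Option<Vec<bool>>) -> Result<Self, String> {
        if let Some(v) = &validity {
            if v.len() != values.len() {
                return Err(format!(
                    "validity has length {} but values has length {}",
                    v.len(),
                    values.len()
                ));
            }
        }
        Ok(Self { values, validity })
    }

    /// A zero-length array.
    pub fn new_empty() -> Self {
        Self { values: Vec::new(), validity: None }
    }

    /// A valid array of `length` slots, all of them null.
    pub fn new_null(length: usize) -> Self {
        Self { values: vec![0; length], validity: Some(vec![false; length]) }
    }

    pub fn len(&self) -> usize {
        self.values.len()
    }

    /// The value at slot `i`, ignoring the validity mask.
    pub fn value(&self, i: usize) -> i32 {
        self.values[i]
    }

    pub fn is_null(&self, i: usize) -> bool {
        self.validity.as_ref().map_or(false, |v| !v[i])
    }

    /// `from`: built from a slice of optional values.
    pub fn from<P: AsRef<[Option<i32>]>>(slice: P) -> Self {
        let slice = slice.as_ref();
        let values = slice.iter().map(|x| x.unwrap_or_default()).collect();
        let validity = slice.iter().map(|x| x.is_some()).collect();
        Self { values, validity: Some(validity) }
    }

    /// `from_slice`: built from a slice of non-null values.
    pub fn from_slice<P: AsRef<[i32]>>(slice: P) -> Self {
        Self { values: slice.as_ref().to_vec(), validity: None }
    }
}
```

Note that `value(i)` on a null slot still returns whatever is stored there (zero in this sketch), which is why callers are expected to consult the validity mask first.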

### Slot offsets

- An array MUST have an `offset: usize` measuring the number of slots by which the array is currently offset, if the specification requires one.

- An array MUST implement `fn slice(&self, offset: usize, length: usize) -> Self` that returns an offsetted and/or truncated clone of the array. This function MUST increase the array's offset if it exists.

- Conversely, `offset` MUST only be changed by `slice`.

The rationale for the above is that it enables us to be fully interoperable with the offset logic supported by the C data interface, while at the same time easily performing array slices
within Rust's type safety mechanism.
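
A minimal sketch of the offset rules, assuming a hypothetical `SlicedArray` that shares its buffer through an `Arc`: `slice` is the only method that changes `offset`, and it does so without copying the underlying values.

```rust
use std::sync::Arc;

/// Hypothetical array that tracks how many slots it is offset by.
#[derive(Debug, Clone)]
pub struct SlicedArray {
    values: Arc<[i32]>,
    offset: usize,
    length: usize,
}

impl SlicedArray {
    pub fn from_slice(values: &[i32]) -> Self {
        Self { values: Arc::from(values), offset: 0, length: values.len() }
    }

    pub fn len(&self) -> usize {
        self.length
    }

    pub fn offset(&self) -> usize {
        self.offset
    }

    /// The value at logical slot `i`, i.e. relative to the current offset.
    pub fn value(&self, i: usize) -> i32 {
        assert!(i < self.length, "index out of bounds");
        self.values[self.offset + i]
    }

    /// Returns an offset and/or truncated clone that shares the same buffer.
    /// This is the only place where `offset` changes.
    pub fn slice(&self, offset: usize, length: usize) -> Self {
        assert!(offset + length <= self.length, "slice is out of bounds");
        Self {
            values: Arc::clone(&self.values),
            offset: self.offset + offset,
            length,
        }
    }
}
```

Slicing a slice accumulates offsets: slicing at 1 and then at 1 again yields an array whose `offset` is 2 relative to the original buffer, which mirrors how the C data interface expresses offsets.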

### Mutable Arrays

- An array MAY have a mutable counterpart. E.g. `MutablePrimitiveArray<T>` is the mutable counterpart of `PrimitiveArray<T>`.

- An array with a mutable counterpart MUST have its own module, and the mutable counterpart MUST be declared in `{module}/mutable.rs`.

- The trait `MutableArray` MUST only be implemented by mutable arrays in this module.

- A mutable array MUST be `#[derive(Debug)]`

- A mutable array with a null bitmap MUST implement it as `Option<MutableBitmap>`

- Converting a `MutableArray` to its immutable counterpart MUST be `O(1)`. Specifically:
  - it must not allocate
  - it must not cause `O(N)` data transformations

This is achieved by converting mutable versions to immutable counterparts (e.g. `MutableBitmap -> Bitmap`).

The rationale is that `MutableArray`s can be used to perform in-place operations under
the arrow spec.
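
The conversion rule can be sketched as follows. `MutableI64Array` and `I64Array` are hypothetical types for this example, and `Arc<Vec<_>>` stands in for the crate's buffer and bitmap types. The element buffers are moved rather than copied, so the conversion does not depend on the number of elements; unlike the rule above, this simplified version does allocate a small constant-size `Arc` header.

```rust
use std::sync::Arc;

/// Hypothetical mutable array: growable values plus a lazily created validity mask.
#[derive(Debug, Default)]
pub struct MutableI64Array {
    values: Vec<i64>,
    validity: Option<Vec<bool>>,
}

impl MutableI64Array {
    pub fn new() -> Self {
        Self::default()
    }

    pub fn values(&self) -> &[i64] {
        &self.values
    }

    pub fn push(&mut self, value: Option<i64>) {
        match value {
            Some(v) => {
                self.values.push(v);
                if let Some(validity) = &mut self.validity {
                    validity.push(true);
                }
            }
            None => {
                // Create the mask on the first null; all earlier slots were valid.
                let len = self.values.len();
                let validity = self.validity.get_or_insert_with(|| vec![true; len]);
                validity.push(false);
                self.values.push(0);
            }
        }
    }
}

/// Hypothetical immutable counterpart with shared, read-only buffers.
#[derive(Debug, Clone)]
pub struct I64Array {
    values: Arc<Vec<i64>>,
    validity: Option<Arc<Vec<bool>>>,
}

impl I64Array {
    pub fn len(&self) -> usize {
        self.values.len()
    }

    pub fn values(&self) -> &[i64] {
        &self.values
    }

    pub fn get(&self, i: usize) -> Option<i64> {
        let valid = self.validity.as_ref().map_or(true, |v| v[i]);
        if valid {
            Some(self.values[i])
        } else {
            None
        }
    }
}

/// Freezing moves each buffer into an `Arc`; no element is copied or transformed.
impl From<MutableI64Array> for I64Array {
    fn from(mutable: MutableI64Array) -> Self {
        Self {
            values: Arc::new(mutable.values),
            validity: mutable.validity.map(Arc::new),
        }
    }
}
```

Because the `Vec` is moved into the `Arc`, the immutable array points at the same heap buffer the mutable array was writing to.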