Columnar data in Materialize #9463

antiguru · 2021-12-07T14:09:34Z

antiguru
Dec 7, 2021
Maintainer

Summary

This document motivates the need for columnar data in Materialize and outlines implementation requirements. A columnar representation stores related data in contiguous allocations, which avoids many small heap allocations. We believe Materialize would benefit in latency, throughput, and memory requirements from encoding arrangements and data on dataflow edges in a columnar shape.

Motivation

Materialize currently transfers and stores all in-memory data in a row-first format. For Rows, this means that a batch is a list of pointers to heap-allocated row contents, except that short rows can be stored inline. Accessing a row requires a pointer dereference into the heap, likely paired with a TLB miss and a cache miss.

A columnar data representation therefore brings improvements along several dimensions:

Latency: Accessing rows has a lower chance of cache misses.
Throughput: Positioning allocations adjacently allows the memory subsystem to fetch data more efficiently.
Memory consumption: By moving small allocations from the management of the allocator to a region allocator, we avoid the maintenance overhead and heap fragmentation otherwise incurred.
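To make the layout difference concrete, here is a minimal sketch that contrasts a row-first batch with a flat one. The types `RowWise` and `Flat` and the function `flatten` are made up for illustration; this is not the layout the columnation crate uses, only an instance of the general idea of replacing one allocation per row with a shared buffer.

```rust
/// Row-first layout: every row owns its own heap allocation.
/// (Hypothetical type, for illustration only.)
struct RowWise {
    rows: Vec<Vec<u8>>,
}

/// Flat layout: all row bytes share one allocation, and
/// `bounds[i]..bounds[i + 1]` delimits row `i`.
/// (Hypothetical type, for illustration only.)
struct Flat {
    bytes: Vec<u8>,
    bounds: Vec<usize>,
}

impl Flat {
    fn new() -> Self {
        Flat { bytes: Vec::new(), bounds: vec![0] }
    }

    /// Appends a row by copying its bytes to the end of the shared buffer.
    fn push(&mut self, row: &[u8]) {
        self.bytes.extend_from_slice(row);
        self.bounds.push(self.bytes.len());
    }

    /// Returns row `index` as a slice borrowed from the shared buffer.
    fn get(&self, index: usize) -> &[u8] {
        &self.bytes[self.bounds[index]..self.bounds[index + 1]]
    }

    /// Number of rows stored.
    fn len(&self) -> usize {
        self.bounds.len() - 1
    }
}

/// Converts a row-first batch into the flat layout.
fn flatten(batch: &RowWise) -> Flat {
    let mut flat = Flat::new();
    for row in &batch.rows {
        flat.push(row);
    }
    flat
}
```

Reading rows in order from `Flat` walks a single buffer front to back, whereas `RowWise` follows one pointer per row.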

On the flip side, a columnar data representation introduces new limitations:

Mutability: Data stored within a region allocator can only be accessed by immutable reference, which protects the integrity of the region. In particular, data cannot be moved or dropped because the datum does not own its backing memory.
Maintainability: Columnar data in Materialize requires a set of new or specialized APIs, which adds a maintenance burden.

Design

We propose to implement support for columnar data based on the columnation crate and containerized Timely/Differential. #9461 outlines what implementation steps are necessary to at least partially support columnar data on dataflow edges and within arrangements.

antiguru · 2021-12-14T10:44:11Z

antiguru
Dec 14, 2021
Maintainer Author

I added a comment on the current limitations Timely's API imposes on ownership: TimelyDataflow/timely-dataflow#426 (comment)

antiguru · 2021-12-17T15:55:32Z

antiguru
Dec 17, 2021
Maintainer Author

Interfaces used by Materialize that currently don't support columnar data:

Arranged::as_collection
- Arranged::flat_map_ref
  - Arranged::flat_map_batches
Collection::enter
Collection::leave
Collection::map
Collection::map_fallible
Collection::negate
CollectionBundle::flat_map_core
CollectionExt
Product is not Columnation
StreamCore::flat_map
StreamCore::flat_map_fallible
StreamCore::map_in_place
StreamCore::ok_err
StreamExt: Some inputs are Vec, others columnar data
Variable -> Collection::negate
dataflow_types::DataflowError: Columnation
dogs^3/half_join would need a variant for Columnation

A major pain point is that it's difficult to define operators that are generic over the container type they take as input and produce as output. I see three main reasons for this:

Some containers reveal data as owned data while others present data as references.
Some containers are constructed from owned data while others expect references.
Expressing to Rust that the input type of an operator is similar to its output type seems tricky, even with GATs. Ideally, I'd like to define an operator that takes a type T with a generic argument, where the operator reads T<A> and produces T<B>. GATs can represent this, but it gets more difficult if a container has additional type constraints, because it's hard to encode these constraints as restrictions on the implementation. I'm not even sure it's possible.
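The first point can be illustrated with a small sketch. Everything here is hypothetical and not part of Timely's API: `ReadContainer` is a made-up trait, `FlatStrings` is a made-up flat container, and `total_len` is a made-up operator. The sketch assumes a Rust version with stable GATs (1.65 or later).

```rust
/// Hypothetical trait: a container whose contents can be read.
trait ReadContainer {
    /// The form in which items are revealed; it may borrow from the container.
    type Item<'a> where Self: 'a;
    type Iter<'a>: Iterator<Item = Self::Item<'a>> where Self: 'a;
    fn read(&mut self) -> Self::Iter<'_>;
}

/// `Vec<T>` can give up ownership: reading drains it and yields owned `T`s.
impl<T> ReadContainer for Vec<T> {
    type Item<'a> = T where Self: 'a;
    type Iter<'a> = std::vec::Drain<'a, T> where Self: 'a;
    fn read(&mut self) -> Self::Iter<'_> {
        self.drain(..)
    }
}

/// Hypothetical flat container: all strings live in one buffer,
/// and `ends[i]` is the end offset of string `i`.
#[derive(Default)]
struct FlatStrings {
    bytes: String,
    ends: Vec<usize>,
}

impl FlatStrings {
    fn push(&mut self, item: &str) {
        self.bytes.push_str(item);
        self.ends.push(self.bytes.len());
    }
}

/// Iterator over a `FlatStrings`, yielding borrowed `&str` items.
struct FlatIter<'a> {
    bytes: &'a str,
    ends: std::slice::Iter<'a, usize>,
    start: usize,
}

impl<'a> Iterator for FlatIter<'a> {
    type Item = &'a str;
    fn next(&mut self) -> Option<&'a str> {
        let end = *self.ends.next()?;
        let bytes: &'a str = self.bytes;
        let item = &bytes[self.start..end];
        self.start = end;
        Some(item)
    }
}

/// The flat container cannot hand out owned items without copying,
/// so it reveals references into its buffer instead.
impl ReadContainer for FlatStrings {
    type Item<'a> = &'a str where Self: 'a;
    type Iter<'a> = FlatIter<'a> where Self: 'a;
    fn read(&mut self) -> Self::Iter<'_> {
        FlatIter { bytes: self.bytes.as_str(), ends: self.ends.iter(), start: 0 }
    }
}

/// An operator generic over both: it can only rely on what owned and
/// borrowed items have in common, here `AsRef<str>`.
fn total_len<'a, C>(container: &'a mut C) -> usize
where
    C: ReadContainer,
    C::Item<'a>: AsRef<str>,
{
    container.read().map(|item| item.as_ref().len()).sum()
}
```

In this sketch, `total_len` leaves a `Vec<String>` empty because reading drains it, while a `FlatStrings` keeps its contents. An operator that needs owned output would have to copy in the second case, which is the asymmetry described above.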

For these reasons, I think the best path forward is to gradually switch to columnar data where possible. Some operators need to be re-implemented for specific containers while others might be possible to turn into a generic variant, but I haven't determined a good approach.

Columnation for Product

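use columnation::{Columnation, Region};
use timely::order::Product;

/// Region for `Product`: one sub-region per coordinate.
#[derive(Default)]
pub struct ProductRegion<R0: Region, R1: Region> {
    region0: R0,
    region1: R1,
}

impl<R0: Region, R1: Region> Region for ProductRegion<R0, R1> {
    type Item = Product<R0::Item, R1::Item>;

    // Copies each coordinate into its own sub-region and reassembles the product.
    unsafe fn copy(&mut self, item: &Self::Item) -> Self::Item {
        Product::new(self.region0.copy(&item.outer), self.region1.copy(&item.inner))
    }

    // Clears both sub-regions.
    fn clear(&mut self) {
        self.region0.clear();
        self.region1.clear();
    }
}

// `Product` is columnar whenever both of its coordinates are.
impl<TOuter: Columnation, TInner: Columnation> Columnation for Product<TOuter, TInner> {
    type InnerRegion = ProductRegion<TOuter::InnerRegion, TInner::InnerRegion>;
}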

GATs for Containers

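trait Container {
    // The type of item stored in this container.
    type Item;
    // The same kind of container, holding items of type T.
    type Output<T: Clone + 'static>: Container<Item = T>;
}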

We can implement this for Vec<T> by including its restrictions on T, but this doesn't scale: a Columnation-backed container would additionally require T: Columnation. I don't know how to express additional type constraints on GATs outside of the associated type.
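A minimal sketch of how such a trait could be used for Vec, under some assumptions: the methods `into_items` and `from_items` and the operator `map_container` are hypothetical additions so that the example does something, and none of this is Timely's actual Container trait.

```rust
/// Sketch of the GAT-based trait, extended with two hypothetical methods.
trait Container: Sized {
    type Item;
    /// The same kind of container, holding items of type `T`.
    type Output<T: Clone + 'static>: Container<Item = T>;

    /// Hypothetical: unpack the container into its items.
    fn into_items(self) -> Vec<Self::Item>;
    /// Hypothetical: build the sibling container for item type `T`.
    fn from_items<T: Clone + 'static>(items: Vec<T>) -> Self::Output<T>;
}

/// For `Vec`, the bounds on the GAT parameter match what the impl needs.
impl<A: Clone + 'static> Container for Vec<A> {
    type Item = A;
    type Output<T: Clone + 'static> = Vec<T>;

    fn into_items(self) -> Vec<A> {
        self
    }
    fn from_items<T: Clone + 'static>(items: Vec<T>) -> Self::Output<T> {
        items
    }
}

/// A map operator that stays within the same kind of container.
fn map_container<C, B, F>(input: C, logic: F) -> C::Output<B>
where
    C: Container,
    B: Clone + 'static,
    F: FnMut(C::Item) -> B,
{
    let mapped: Vec<B> = input.into_items().into_iter().map(logic).collect();
    // The target item type has to be named explicitly here.
    C::from_items::<B>(mapped)
}

// A container that needs a stronger bound (say, `T: Columnation`) has no
// place to state it: the bound on `Output<T>` is fixed by the trait to
// `Clone + 'static`, and an impl cannot tighten it.
```

For example, `map_container(vec!["a".to_string(), "bcd".to_string()], |s: String| s.len())` produces a `Vec<usize>`. The closing comment marks the spot where the scaling problem described above shows up.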

antiguru · 2022-02-10T09:59:15Z

antiguru
Feb 10, 2022
Maintainer Author

Container monotonicity

In TimelyDataflow/timely-dataflow#444, I experimented a bit more with expressing to Rust that certain transformations go from a SpecificContainer<A> to a SpecificContainer<B>, where only the item type changes. We can express this as a trait:

trait MonotonicContainer<O>: Container {
    type Output: Container<Item=O>;
}

So far, so good, it allows us to express that for any item, the container type stays the same:

impl<T: ..., O: ...> MonotonicContainer<O> for Vec<T> {
    type Output = Vec<O>;
}

Using the trait is harder, though. Rust doesn't seem to pick up that the implementation of MonotonicContainer for a specific container is universal and requires more type annotations to determine which implementation to pick (from the single available one). This means that it doesn't add the value I'd hoped it would add.
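As a rough sketch of what using the trait looks like, assume a simplified `Container` with two hypothetical methods, `into_items` and `from_items`, and a hypothetical operator `map_container`. None of these are Timely's actual definitions, and the unbounded `impl<T, O>` stands in for the elided bounds above.

```rust
/// Simplified stand-in for the container trait (hypothetical methods).
trait Container: Sized {
    type Item;
    fn into_items(self) -> Vec<Self::Item>;
    fn from_items(items: Vec<Self::Item>) -> Self;
}

impl<T> Container for Vec<T> {
    type Item = T;
    fn into_items(self) -> Vec<T> {
        self
    }
    fn from_items(items: Vec<T>) -> Self {
        items
    }
}

/// The trait from the text: for output item type `O`, name the container
/// of the same kind that holds `O`.
trait MonotonicContainer<O>: Container {
    type Output: Container<Item = O>;
}

impl<T, O> MonotonicContainer<O> for Vec<T> {
    type Output = Vec<O>;
}

/// A map operator written against the trait. The output type has to be
/// spelled as a projection that mentions both `C` and `O`.
fn map_container<C, O, F>(input: C, logic: F) -> <C as MonotonicContainer<O>>::Output
where
    C: MonotonicContainer<O>,
    F: FnMut(C::Item) -> O,
{
    let mapped: Vec<O> = input.into_items().into_iter().map(logic).collect();
    <<C as MonotonicContainer<O>>::Output as Container>::from_items(mapped)
}
```

A call with every type spelled out looks like `map_container::<Vec<u32>, String, _>(vec![1, 2, 3], |x: u32| x.to_string())`. In this small case a closure pins down `O`; the annotations are written out here only to show the shape they take once the compiler asks for them.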

For this reason, I think it's better to have concrete implementations of traits for containers rather than generic implementations for type classes of containers. I actually implemented the Broadcast trait in a non-container-specific way, but its type constraints are mind-boggling (TODO: Show the code).
