Columnar data in Materialize #9463
Replies: 3 comments
-
I added a comment on the current limitations Timely's API imposes on ownership: TimelyDataflow/timely-dataflow#426 (comment)
-
Interfaces used by Materialize that currently don't support columnar data:
A major pain point is that it's difficult to define operators that are generic over the container type they take as input and produce as output. I see three main reasons for this:
For these reasons, I think the best path forward is to gradually switch to columnar data where possible. Some operators need to be re-implemented for specific containers, while others might be turned into generic variants, but I haven't determined a good approach.

Columnation for Product

    use columnation::{Columnation, Region};
    use timely::order::Product;

    /// Region for `Product` timestamps, with one sub-region per coordinate.
    #[derive(Default)]
    pub struct ProductRegion<R0: Region, R1: Region> {
        region0: R0,
        region1: R1,
    }

    impl<R0: Region, R1: Region> Region for ProductRegion<R0, R1> {
        type Item = Product<R0::Item, R1::Item>;

        // Copies both coordinates into their respective sub-regions.
        unsafe fn copy(&mut self, item: &Self::Item) -> Self::Item {
            Product::new(self.region0.copy(&item.outer), self.region1.copy(&item.inner))
        }

        fn clear(&mut self) {
            self.region0.clear();
            self.region1.clear();
        }
    }

    impl<TOuter: Columnation, TInner: Columnation> Columnation for Product<TOuter, TInner> {
        type InnerRegion = ProductRegion<TOuter::InnerRegion, TInner::InnerRegion>;
    }

GATs for Containers

    trait Container {
        // The item type the container holds; `Output` refers to it below.
        type Item;
        type Output<T: Clone + 'static>: Container<Item = T>;
    }

We can implement this for
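As a minimal sketch of the GAT idea above, the following implements such a trait for `Vec`. The trait is a simplified, hypothetical stand-in and not Timely's actual `Container` trait; the `map_items` method and its name are assumptions added only to show how `Output<T>` lets a transformation stay within the same kind of container.

```rust
/// Hypothetical, simplified container trait; not Timely's actual `Container`.
trait Container {
    /// The element type stored in this container.
    type Item;
    /// The same kind of container, holding items of type `T` instead.
    type Output<T: Clone + 'static>: Container<Item = T>;

    /// Applies `logic` to each item and collects the results into `Output<T>`.
    /// (Illustrative method, assumed for this sketch.)
    fn map_items<T, F>(&self, logic: F) -> Self::Output<T>
    where
        T: Clone + 'static,
        F: FnMut(&Self::Item) -> T;
}

impl<A: Clone + 'static> Container for Vec<A> {
    type Item = A;
    // Mapping a `Vec<A>` yields a `Vec<T>`: the container kind is preserved.
    type Output<T: Clone + 'static> = Vec<T>;

    fn map_items<T, F>(&self, logic: F) -> Vec<T>
    where
        T: Clone + 'static,
        F: FnMut(&A) -> T,
    {
        self.iter().map(logic).collect()
    }
}
```

With this in scope, `vec![1u32, 2, 3].map_items(|n| n.to_string())` has type `Vec<String>`. A columnar container would pick its own columnar type for `Output<T>`, which is where bounds on `T` start to differ between implementations.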
-
Container monotonicity
In TimelyDataflow/timely-dataflow#444, I experimented a bit more with expressing to Rust that certain transformations go from a
So far, so good: it allows us to express that, for any item, the container type stays the same:
Using the trait is harder, though. Rust doesn't seem to pick up that the implementation of
For this reason, I think it's better to have concrete implementations of traits for containers rather than generic implementations for type classes of containers. I actually implemented the
-
Summary
This document motivates the need for columnar data in Materialize and outlines implementation requirements. A columnar representation stores related data in a few contiguous allocations instead of many small heap allocations. We believe Materialize would benefit in latency, throughput, and memory requirements from encoding arrangements and data on dataflow edges in a columnar shape.
Motivation
Materialize currently transfers and stores all in-memory data in a row-first format. For Rows, this means that a batch is a list of pointers to heap-allocated row contents, with the exception that short rows can be stored inline. Accessing a row requires a pointer dereference to the heap, likely paired with a TLB and cache miss.
A columnar data representation therefore brings improvements in various dimensions:
On the flip side, a columnar data representation introduces new limitations:
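As a rough illustration of the two layouts contrasted in this section, the sketch below compares a row-first batch, where each row owns a separate heap allocation, with a flat batch that keeps all row bytes in one contiguous buffer. `RowBatch` and `FlatBatch` are hypothetical types invented for this example; they are neither Materialize's `Row` nor the `columnation` crate's API.

```rust
/// Row-first batch: every row is its own heap allocation, so reading a row
/// follows a pointer to a potentially distant memory location.
struct RowBatch {
    rows: Vec<Vec<u8>>,
}

/// Flat batch: all row bytes share one buffer, and `bounds[i]..bounds[i + 1]`
/// delimits the bytes of row `i`. (Hypothetical type for illustration.)
struct FlatBatch {
    bytes: Vec<u8>,
    bounds: Vec<usize>,
}

impl FlatBatch {
    /// Creates an empty batch; `bounds` always starts with offset 0.
    fn new() -> Self {
        FlatBatch { bytes: Vec::new(), bounds: vec![0] }
    }

    /// Appends a row by copying its bytes to the end of the shared buffer.
    fn push(&mut self, row: &[u8]) {
        self.bytes.extend_from_slice(row);
        self.bounds.push(self.bytes.len());
    }

    /// Number of rows stored.
    fn len(&self) -> usize {
        self.bounds.len() - 1
    }

    /// Borrows row `index` as a slice of the shared buffer.
    fn get(&self, index: usize) -> &[u8] {
        &self.bytes[self.bounds[index]..self.bounds[index + 1]]
    }
}

/// Copies a row-first batch into the flat layout.
fn flatten(batch: &RowBatch) -> FlatBatch {
    let mut flat = FlatBatch::new();
    for row in &batch.rows {
        flat.push(row);
    }
    flat
}
```

In the flat layout, the number of allocations is independent of the number of rows, and neighboring rows sit next to each other in memory. The trade-off is that rows can only be borrowed from the batch, not owned individually, which is one source of the ownership limitations discussed in this thread.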
Design
We propose to implement support for columnar data based on the columnation crate and containerized Timely/Differential. #9461 outlines what implementation steps are necessary to at least partially support columnar data on dataflow edges and within arrangements.