Skip to content
Frank McSherry edited this page Aug 1, 2015 · 13 revisions

This project is a modular implementation of timely dataflow in Rust.

There is a lot to say about timely dataflow, and it may be best to start with an introduction to writing timely dataflow programs.

This document contains explanations both of how timely dataflow (the concept) works, and how to use timely dataflow in Rust (the software). In the fullness of time, the project will get a better name, and the two will stop being conflated.

Motivation

Timely dataflow is a framework for data-parallel programming, with the goal of letting you write quite general programs that will scale to large numbers of independent workers (i.e. multiple threads, processes, and computers).

The main distinction between timely dataflow and other dataflow systems before it lies in its generality. Most dataflow systems constrain your dataflow to a directed acyclic graph, which makes control structures like loops difficult to write. This makes some sense, because dataflow is largely about excising control structures that are not part of the data themselves. Nonetheless, there are reasonable ways to write cyclic dataflow computations, and timely dataflow is about helping you do this without losing your sanity.

Core ideas

Timely dataflow is a dataflow programming model, which means that at its lowest level it views computations as a directed graph whose vertices are operators, and whose edges indicate channels along which data may move from one operator to another. Each batch of data bear a logical timestamp, which encodes what little control information we required (for example, in which iteration of a loop was this data produced).

In a perfect world, you should not be obliged to understand or reason about this dataflow representation; we have found several other programming idioms reduce nicely to dataflow, with the most common one being the collection or stream oriented programming you see in languages like SQL, LINQ, and Datalog. Timely dataflow is intentionally designed to let you program at different levels: you can write raw dataflow if you like, or built-in operators that resemble SQL operators, or mix and match as you like. You can even write your own custom operators in situ.

One goal of timely dataflow is to still let you write programs with common imperative programming idioms. Where possible, we would like to let you write something like a while loop, with a minimal set of restrictions to make the implementation possible in dataflow. And you should be able to wrap that while loop in another while loop, because that can be useful.

Tracking progress

The most technically interesting and challenging part of timely dataflow is identifying graph structures that effect more traditional control structures from imperative programming. How do we implement a for loop in dataflow; just with a cycle? It is more complicated than that, because someone somewhere needs to notice when we "go around" the loop.

A good deal of the work in timely dataflow is tracking and reporting the "progress" of the computation. As data circulate around loops and other graph structures, operators may need to know that the computation has logically proceeded to the next iteration. This happens partly by annotating data with its associated iteration, but also by clearly notifying operators when all data for some iteration has been received (even if no data was received).

Of course, this is more complicated than just running around a cycle. The graph structures may have nested cycles and other control flow constructs represented as graphs. We'll need to be more general.

Some details (this is all a work in progress) can be found on the page on tracking progress.

Clone this wiki locally