FlumeJava: Easy, Efficient Data-Parallel Pipelines [pdf]
Chambers, Craig, et al. "FlumeJava: easy, efficient data-parallel pipelines." ACM SIGPLAN Notices. Vol. 45. No. 6. ACM, 2010.
Flume is a Google library for building and executing pipelines of MapReduce jobs. Often, a single MapReduce job is not sufficient to compute everything. Chaining several MapReduce jobs requires connecting them, cleaning up the intermediate results and implementing logic for handling failures. Flume abstracts away all of these details. It automatically optimizes pipelines and evaluates them lazily. Conceptually, the interface and idea are similar to those of Spark.
- Many tasks can be expressed as MapReduce jobs, but often a sequence of such jobs is necessary
- Additional low-level logic is required to execute these pipelines:
- Logic for connecting the jobs
- Deleting the intermediate results at the end
- Handling failures
- All of this makes it harder to change the pipeline
- Flume is meant to support the development of such data-parallel pipelines
- There are classes for representing parallel collections. These support parallel operations
- (Note: This seems to be similar to Spark's RDDs)
- Flume abstracts away the details of how the data is stored (in memory, in a file, in a database)
- Similarly, Flume abstracts away the details of how the parallel operations are executed
- For development, everything can be executed and debugged locally
- For good performance, evaluation is deferred. Invoking a parallel operation does not execute anything but just records the operation internally
- Once the execution plan is constructed, Flume optimizes it
- Chains of parallel operations are fused together
- Depending on the size of the input data, parts of the pipelines are either executed locally or on a cluster
- After the execution, Flume cleans up intermediate results
- Performance is on par with MapReduce programs hand-optimized by experienced users
- Flume builds on the MapReduce abstraction, which has three phases, each using the output of the previous one:
- `map` invokes a user-defined function on each record independently and in parallel. Records consist of (key, value)-pairs. For each such input, the user-defined function emits zero or more new (key, value)-pairs that are written to an output file
- The shuffle phase groups all pairs by key and, for each key, outputs a stream of its values. The user configures nothing in this phase
- `reduce` takes each key and its stream, and emits new values for the respective key. Often it reduces the stream down to a single value. If this function is commutative and associative, it can already be executed as a combiner in the map phase, which reduces the amount of data sent over the wire
- `map` and `reduce` can be the identity function
- These phases are designed so that they can be executed in a massively parallel way
- The MapReduce library deals with the low-level tasks of managing the intermediate data, distributing the computation and dealing with failures
- It's implemented in C++
- The map phase supports embarrassingly parallel, element-wise computations
- Using MapReduce means figuring out how to express computations in the map and reduce framework
- High-level tasks like "count the number of occurrences" need to be hand-compiled into low-level map and reduce tasks
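- For instance, a minimal sketch of word count hand-compiled into the two phases. The `Emitter` interface here is hypothetical, standing in for whatever output collector the MapReduce library provides:

```java
import java.util.Iterator;

// Hypothetical output collector; real MapReduce libraries provide an equivalent.
interface Emitter<K, V> {
  void emit(K key, V value);
}

class WordCount {
  // map: invoked once per input record, emits zero or more (key, value)-pairs
  static void map(String docName, String contents, Emitter<String, Integer> out) {
    for (String word : contents.split("\\s+")) {
      out.emit(word, 1);
    }
  }

  // reduce: invoked once per key with the stream of its values.
  // Summing is commutative and associative, so it could also run as a combiner.
  static void reduce(String word, Iterator<Integer> counts, Emitter<String, Integer> out) {
    int total = 0;
    while (counts.hasNext()) {
      total += counts.next();
    }
    out.emit(word, total);
  }
}
```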
- Data
- `PCollection<T>` is an immutable bag (multi-set) of elements, with a well-defined type
- Optionally, collections can have order. In that case, we refer to them as sequences
- Collections can be created from data in memory or by reading an input file
- Datasets are often sharded into several files. Flume can read all of them at once
- `PTable<K, V>` represents a (key, value)-table. It is a subclass of `PCollection<Pair<K, V>>` with some additional operations. In other languages, it might just be an alias for a collection
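- As a sketch, mirroring the examples in the paper (the file paths are placeholders):

```java
// Create a PCollection from a text file, one String per line:
PCollection<String> lines =
    readTextFileCollection("/gfs/data/shakes/hamlet.txt");

// Read a sharded dataset in one call; recordsOf() describes the encoding:
PCollection<DocInfo> docInfos =
    readRecordFileCollection("/gfs/webdocinfo/part-*",
                             recordsOf(DocInfo.class));
```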
- Operations
- Flume defines a few primitive data-parallel operations. All other operations are implemented using these building blocks
- `parallelDo` applies a function element-wise to a `PCollection<T>` to convert it to a `PCollection<S>`. Each value is mapped to zero or more new values
- There are subclasses of `DoFn` for common tasks, such as mapping and filtering. There is also a version that can produce multiple output `PCollection`s using a single pass over the data
- `parallelDo` can be used to implement both `map` and `reduce`
- Ideally, this function is pure and does not access global, mutable state
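- For example, splitting lines into words with `parallelDo`, adapted from the paper (`splitIntoWords` is a user-defined helper):

```java
PCollection<String> words =
    lines.parallelDo(new DoFn<String, String>() {
      void process(String line, EmitFn<String> emitFn) {
        for (String word : splitIntoWords(line)) {
          emitFn.emit(word);  // zero or more outputs per input element
        }
      }
    }, collectionOf(strings()));
```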
- `groupByKey` converts a `PTable<K, V>` to a `PTable<K, Collection<V>>` by grouping the values belonging to each key. The value list is a plain Java collection. This corresponds to the shuffle phase in MapReduce and has to be used explicitly. There is a variant that allows specifying the exact sorting order
- `combineValues` takes a `PTable<K, Collection<V>>` and converts it to a `PTable<K, V>`. Each value collection is reduced to a single value. This operation should be associative. It could be implemented using a `parallelDo` function, but since the function has to be associative, it can be executed faster by using a MapReduce combiner and reducer
- `flatten` takes a list of `PCollection<T>`s and concatenates them. It does not actually copy any items but creates a view of them as one logical `PCollection`
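- Put together, these primitives express word count, essentially as in the paper (`SUM_INTS` is a predefined summing combiner):

```java
// Map each word to the pair (word, 1):
PTable<String, Integer> wordsWithOnes =
    words.parallelDo(new DoFn<String, Pair<String, Integer>>() {
      void process(String word, EmitFn<Pair<String, Integer>> emitFn) {
        emitFn.emit(Pair.of(word, 1));
      }
    }, tableOf(strings(), ints()));

// Shuffle: group the ones by word:
PTable<String, Collection<Integer>> groupedWordsWithOnes =
    wordsWithOnes.groupByKey();

// Reduce each value collection to a single count:
PTable<String, Integer> wordCounts =
    groupedWordsWithOnes.combineValues(SUM_INTS);
```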
- The results of those functions can be combined using normal Java control flow, since the collections are regular Java objects
- Several other operations are built on the primitive functions. They are no different from helper functions that users could write themselves
- `count`: Maps each value so that it becomes a key with a value of 1, applies `groupByKey` and then adds up the counts using `combineValues`
- `join`: Merges two `PTable`s so that values with the same key are grouped together. This could be used as an intermediate result for computing traditional inner and outer joins. The implementation is equivalent to that of a hash-based join
- `top`: Takes a comparison function and a count `N`, and returns the top `N` elements according to the given comparison function
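- With the derived operations, the three-step word count above collapses into a single call:

```java
PTable<String, Integer> wordCounts = words.count();
```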
- Flume executes pipelines in a lazy way, using deferred evaluation
- Each `PCollection` is either marked as deferred or materialized
- When `parallelDo` or similar functions are called, they return a deferred collection
- The result of several such function calls is a directed acyclic graph that we refer to as the execution plan
- The user needs to call `FlumeJava.run()` to actually execute the pipeline. This optimizes the execution plan and then evaluates each node in topological order
- `PObject<T>` is a container that allows us to inspect values during or after the execution. It's a promise for a future value
- For example, `pcollection.asSequentialCollection()` returns a `PObject` that will eventually hold the respective data
- `getValue()` returns the value of the object once it's available. Otherwise it seems to return something falsy
- The `operate` primitive can be used to execute code once a promise is fulfilled
- File-reading operations are overloaded to also be able to read filenames contained in `PObject`s
- `PObject`s can also be passed as side inputs to `parallelDo` calls
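- A sketch of the full lifecycle, building on the word-count example above and following the paper's examples (the `first`/`second` accessors on `Pair` are assumptions):

```java
// Building the plan is cheap; nothing has executed yet:
PTable<String, Integer> wordCounts = words.count();
PObject<Collection<Pair<String, Integer>>> result =
    wordCounts.asSequentialCollection();

// Only now is the plan optimized and executed (possibly as MapReduce jobs):
FlumeJava.run();

// After run(), the promise is fulfilled and the value can be read
// (the Pair field names here are assumed, not taken from the paper):
for (Pair<String, Integer> pair : result.getValue()) {
  System.out.println(pair.first + ": " + pair.second);
}
```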
- The entire optimization process consists of graph rewrites
- If two `ParallelDo`s are composed, they can be fused into a single operation (see the sketch after this list)
- Siblings are also fused into an operation that has multiple outputs
- This also applies to `combineValues` operations
- Combinations of `parallelDo`, `groupByKey`, `combineValues` and `flatten` are transformed into a single `MSCR` operation
- (Note: This seems to be the abstraction for a single MapReduce job)
- Such an operation has M inputs and R outputs
- The rules for the reduce part are less strict: it no longer needs to output values for the same key, since it's just a `parallelDo` internally
- We consider `groupByKey` operations related if they make use of the same inputs
- These related operations form the basis for `MSCR`s, as they define the inputs to the combined operation
- We move all M, S, C, R functions into individual `MSCR` blocks
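- Conceptually, producer-consumer fusion rewrites two chained `parallelDo`s into one. This is illustrative only, not the library's API; `tokenizeFn`, `lengthFn` and `tokenize` are hypothetical:

```java
// Before fusion: two chained parallelDo operations, each its own pass:
PCollection<String>  tokens  = lines.parallelDo(tokenizeFn, collectionOf(strings()));
PCollection<Integer> lengths = tokens.parallelDo(lengthFn, collectionOf(ints()));

// After fusion (conceptually): a single pass that feeds each intermediate
// token straight into the second function, so the intermediate collection
// never needs to be materialized:
PCollection<Integer> fused = lines.parallelDo(new DoFn<String, Integer>() {
  void process(String line, EmitFn<Integer> emitFn) {
    for (String token : tokenize(line)) {  // body of the first DoFn
      emitFn.emit(token.length());         // body of the second DoFn
    }
  }
}, collectionOf(ints()));
```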
- The optimization performs several passes through the graph
- The goal is to minimize the number of `MSCR` operations
- The major steps:
- Sink `flatten`s. If we execute them as late as possible, we can fuse more `parallelDo` operations
- Move `combineValues` inside of `parallelDo`
- Insert fusion blocks. If we have an operation between two `MSCR`s, this marker determines which `MSCR` the operation is merged into. The fundamental idea here is that the output of each `MSCR` should be as small as possible, to minimize network traffic
- Fuse `parallelDo`s
- Create and fuse `MSCR`s
- Just an example of the optimization steps: 16 operations are optimized into just two `MSCR`s
- Code inside of functions is not analyzed or changed
- Traditional techniques, such as inlining code, could be applied here
- Users can provide hints about the size of the data, so that Flume can optimize in a more informed way. Static analysis might improve on this
- Common subexpression elimination could make sense
- Users often include `groupByKey` operations even though they don't need them. In some situations, this is because the next operation implicitly groups by key anyway. The optimizer could detect this
- Once the execution plan is optimized, Flume runs it in a concurrent way
- Independent operations are executed in parallel
- If there's not enough data, Flume runs the computation locally to avoid overhead. Otherwise, it launches a remote MapReduce job
- Flume automatically chooses the number of machines, based on prior analysis of the data and the operations
- Intermediate results are automatically deleted
- Design goal: Developing these pipelines should feel as if the developer was just writing a program that's locally executed. To help with this, Flume redirects all stdout/stderr from workers to the client and also rethrows errors there
- To aid with debugging, Flume can cache subresults. If a job fails, Flume can use these subresults and directly pick up the work that failed last time. This makes it much easier to iterate. Flume automatically detects which parts of the job changed and need to be re-executed
- As of this paper, Flume only supports batch-mode and no streaming jobs
- The optimizer is extremely useful. In one example, it reduced 207 operations down to a single one
- It's on par with hand-optimization by experienced users
- The optimizer uses many ideas from compiler optimization
- Before FlumeJava, there was a similar system with its own interface language
- That system was never widely adopted because it increased the effort necessary to understand everything (and some other reasons)
- Flume provides simple abstractions on top of MapReduce
- It uses a form of deferred evaluation to enable dynamic optimization
- The executor can choose among several implementation strategies (local, distributed)
- I'm not sure why this paper is so Java-centric. It seems like the paper would be much nicer to read if it were just about Flume, not FlumeJava.
- Maybe the Lumberjack background story is the reason for the excessive use of Java
- Still, the Java generics make it easier to understand some things
- The related work section is absurdly long