Add doc page on 'Why xarray-beam'
shoyer committed Aug 9, 2021
1 parent e66a38e commit f191fd3
Showing 9 changed files with 279 additions and 252 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -1,4 +1,5 @@
*.egg-info
.DS_Store
build
dist
docs/.ipynb_checkpoints
61 changes: 10 additions & 51 deletions README.md
@@ -17,65 +17,23 @@ multi-dimensional labeled arrays, such as:
- Calculating statistics (e.g., "climatology") across distributed datasets
with arbitrary groups.

Xarray-Beam is implemented as a _thin layer_ on top of existing libraries for
working with large-scale Xarray datasets. For example, it leverages
[Dask](https://dask.org/) for describing lazy arrays and for executing
multi-threaded computation on a single machine.
For more about our approach and how to get started,
**[read the documentation](https://xarray-beam.readthedocs.io/)**!

**🚨 Warning: Xarray-Beam is new and unpolished 🚨**

Expect sharp edges 🔪 and performance cliffs 🧗, particularly related to the
management of lazy data with Dask and reading/writing data with Zarr. We have
used it to efficiently process 5 TB datasets. We _expect_ it to scale to PB size
datasets but that's easier said than done. We welcome feedback and contributions
from early adopters, and hope to have it ready for wider audience soon.
used it to efficiently process ~25 TB datasets. We _expect_ it to scale to PB
size datasets but that's easier said than done. We welcome feedback and
contributions from early adopters, and hope to have it ready for wider audience
soon.

## How does Xarray-Beam compare to Dask?

We love Dask! Xarray-Beam explores a different part of the design space for
distributed data pipelines than Xarray's built-in Dask integration:

- Xarray-Beam is built around explicit manipulation of `(xarray_beam.Key,
xarray.Dataset)` pairs to perform operations on distributed datasets, where
`Key` is an immutable dict keeping track of the offsets from the origin for
a small contiguous "chunk" of a larger distributed dataset. This requires
more boilerplate but is also more robust than generating distributed
computation graphs in Dask using Xarray's built-in API. The user is expected
to have a mental model for how their data pipeline is distributed across
many machines.
- Xarray-Beam distributes datasets by splitting them into many
`xarray.Dataset` chunks, rather than the chunks of NumPy arrays typically
used by Xarray with Dask (unless using
[xarray.map_blocks](http://xarray.pydata.org/en/stable/user-guide/dask.html#automatic-parallelization-with-apply-ufunc-and-map-blocks)).
Chunks of datasets is a more convenient data-model for writing ad-hoc whole
dataset transformations, but is potentially a bit less efficient.
- Beam ([like Spark](https://docs.dask.org/en/latest/spark.html)) was designed
around a higher-level model for distributed computation than Dask (although
Dask has been making
[progress in this direction](https://coiled.io/blog/dask-under-the-hood-scheduler-refactor/)).
Roughly speaking, this trade-off favors scalability over flexibility.
- Beam allows for executing distributed computation using multiple runners,
notably including Google Cloud Dataflow and Apache Spark. These runners are
more mature than Dask, and in many cases are supported as a service by major
commercial cloud providers.
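
The `(Key, Dataset)` data model described in the list above can be sketched in plain Python. This is only an illustration of the idea, not Xarray-Beam's API: a tuple of `(dimension, offset)` pairs stands in for `xarray_beam.Key`, a list of floats stands in for an `xarray.Dataset` chunk, and the helper names `split_into_chunks` and `chunk_mean` are hypothetical.

```python
from typing import Iterator, List, Tuple

# Hypothetical stand-in for xarray_beam.Key: an immutable, hashable record of
# each dimension's offset from the origin of the full dataset.
Offsets = Tuple[Tuple[str, int], ...]
Pair = Tuple[Offsets, List[float]]


def split_into_chunks(values: List[float], dim: str, chunk_size: int) -> Iterator[Pair]:
    """Yield (key, chunk) pairs covering `values` in contiguous pieces."""
    for start in range(0, len(values), chunk_size):
        key: Offsets = ((dim, start),)  # offset of this chunk along `dim`
        yield key, values[start:start + chunk_size]


def chunk_mean(pair: Pair) -> Tuple[Offsets, float]:
    """A per-chunk transformation: the key travels with the result."""
    key, chunk = pair
    return key, sum(chunk) / len(chunk)


# Stand-in for a large dataset with a single "time" dimension.
values = [float(v) for v in range(10)]
pairs = list(split_into_chunks(values, "time", chunk_size=4))

# In a real pipeline each pair could be processed on a different worker;
# here the transformation is simply mapped over the pairs locally.
means = dict(chunk_mean(pair) for pair in pairs)
```

Because every chunk carries its own offsets, a per-chunk result can be placed back into the larger dataset without any global computation graph, which is the explicitness (and the extra boilerplate) that the comparison refers to.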

![Xarray-Beam datamodel vs Xarray-Dask](./static/xarray-beam-vs-xarray-dask.png)

These design choices are not set in stone. In particular, in the future we
_could_ imagine writing a high-level `xarray_beam.Dataset` that emulates the
`xarray.Dataset` API, similar to the popular high-level DataFrame APIs in Beam,
Spark and Dask. This could be built on top of the lower-level transformations
currently in Xarray-Beam, or alternatively could use a "chunks of NumPy arrays"
representation similar to that used by dask.array.

## Getting started
## Installation

Xarray-Beam requires recent versions of immutabledict, xarray, dask, rechunker
and zarr. It needs the latest release of Apache Beam (2.31.0 or later). For good
performance when writing Zarr files, we strongly recommend patching Xarray with
[this pull request](https://github.com/pydata/xarray/pull/5252).

TODO(shoyer): write a tutorial here! For now, see the test suite for examples.
and zarr, and the *latest* release of Apache Beam (2.31.0 or later). For best
performance when writing Zarr files, use Xarray 0.19.0 or later.

## Disclaimer

@@ -93,3 +51,4 @@ Contributors:
- Stephan Hoyer
- Jason Hickey
- Cenk Gazen
- Alex Merose
File renamed without changes.
File renamed without changes
File renamed without changes