Add doc page on 'Why xarray-beam'

alxmrs · Aug 9, 2021 · f191fd3 · f191fd3
1 parent e66a38e
commit f191fd3
Show file tree

Hide file tree

Showing 9 changed files with 279 additions and 252 deletions.
diff --git a/.gitignore b/.gitignore
@@ -1,4 +1,5 @@
 *.egg-info
+.DS_Store
 build
 dist
 docs/.ipynb_checkpoints

diff --git a/README.md b/README.md
@@ -17,65 +17,23 @@ multi-dimensional labeled arrays, such as:
 -   Calculating statistics (e.g., "climatology") across distributed datasets
     with arbitrary groups.
 
-Xarray-Beam is implemented as a _thin layer_ on top of existing libraries for
-working with large-scale Xarray datasets. For example, it leverages
-[Dask](https://dask.org/) for describing lazy arrays and for executing
-multi-threaded computation on a single machine.
+For more about our approach and how to get started,
+**[read the documentation](https://xarray-beam.readthedocs.io/)**!
 
 **🚨 Warning: Xarray-Beam is new and unpolished 🚨**
 
 Expect sharp edges 🔪 and performance cliffs 🧗, particularly related to the
 management of lazy data with Dask and reading/writing data with Zarr. We have
-used it to efficiently process 5 TB datasets. We _expect_ it to scale to PB size
-datasets but that's easier said than done. We welcome feedback and contributions
-from early adopters, and hope to have it ready for wider audience soon.
+used it to efficiently process ~25 TB datasets. We _expect_ it to scale to PB
+size datasets but that's easier said than done. We welcome feedback and
+contributions from early adopters, and hope to have it ready for wider audience
+soon.
 
-## How does Xarray-Beam compare to Dask?
-
-We love Dask! Xarray-Beam explores a different part of the design space for
-distributed data pipelines than Xarray's built-in Dask integration:
-
--   Xarray-Beam is built around explicit manipulation of `(xarray_beam.Key,
-    xarray.Dataset)` pairs to perform operations on distributed datasets, where
-    `Key` is an immutable dict keeping track of the offsets from the origin for
-    a small contiguous "chunk" of a larger distributed dataset. This requires
-    more boilerplate but is also more robust than generating distributed
-    computation graphs in Dask using Xarray's built-in API. The user is expected
-    to have a mental model for how their data pipeline is distributed across
-    many machines.
--   Xarray-Beam distributes datasets by splitting them into many
-    `xarray.Dataset` chunks, rather than the chunks of NumPy arrays typically
-    used by Xarray with Dask (unless using
-    [xarray.map_blocks](http://xarray.pydata.org/en/stable/user-guide/dask.html#automatic-parallelization-with-apply-ufunc-and-map-blocks)).
-    Chunks of datasets is a more convenient data-model for writing ad-hoc whole
-    dataset transformations, but is potentially a bit less efficient.
--   Beam ([like Spark](https://docs.dask.org/en/latest/spark.html)) was designed
-    around a higher-level model for distributed computation than Dask (although
-    Dask has been making
-    [progress in this direction](https://coiled.io/blog/dask-under-the-hood-scheduler-refactor/)).
-    Roughly speaking, this trade-off favors scalability over flexibility.
--   Beam allows for executing distributed computation using multiple runners,
-    notably including Google Cloud Dataflow and Apache Spark. These runners are
-    more mature than Dask, and in many cases are supported as a service by major
-    commercial cloud providers.
-
-![Xarray-Beam datamodel vs Xarray-Dask](./static/xarray-beam-vs-xarray-dask.png)
-
-These design choices are not set in stone. In particular, in the future we
-_could_ imagine writing a high-level `xarray_beam.Dataset` that emulates the
-`xarray.Dataset` API, similar to the popular high-level DataFrame APIs in Beam,
-Spark and Dask. This could be built on top of the lower-level transformations
-currently in Xarray-Beam, or alternatively could use a "chunks of NumPy arrays"
-representation similar to that used by dask.array.
-
-## Getting started
+## Installation
 
 Xarray-Beam requires recent versions of immutabledict, xarray, dask, rechunker
-and zarr. It needs the latest release of Apache Beam (2.31.0 or later). For good
-performance when writing Zarr files, we strongly recommend patching Xarray with
-[this pull request](https://github.com/pydata/xarray/pull/5252).
-
-TODO(shoyer): write a tutorial here! For now, see the test suite for examples.
+and zarr, and the *latest* release of Apache Beam (2.31.0 or later). For best
+performance when writing Zarr files, use Xarray 0.19.0 or later.
 
 ## Disclaimer
 
@@ -93,3 +51,4 @@ Contributors:
 -   Stephan Hoyer
 -   Jason Hickey
 -   Cenk Gazen
+-   Alex Merose
diff --git a/static/README.md → docs/_static/README.md b/static/README.md → docs/_static/README.md
diff --git a/static/xarray-beam-logo.png → docs/_static/xarray-beam-logo.png b/static/xarray-beam-logo.png → docs/_static/xarray-beam-logo.png
diff --git a/static/xarray-beam-vs-xarray-dask.png → docs/_static/xarray-beam-vs-xarray-dask.png b/static/xarray-beam-vs-xarray-dask.png → docs/_static/xarray-beam-vs-xarray-dask.png