Fixes per review
shoyer committed Aug 5, 2021
1 parent 72191d5 commit 9b91089
Showing 7 changed files with 78 additions and 52 deletions.
2 changes: 1 addition & 1 deletion .gitignore
@@ -4,5 +4,5 @@ dist
docs/.ipynb_checkpoints
docs/_build
docs/_autosummary
docs/my-data.zarr
docs/*.zarr
__pycache__
4 changes: 0 additions & 4 deletions .read-the-docs.yaml
@@ -1,6 +1,3 @@
<<<<<<< HEAD
# TODO
=======
# .readthedocs.yaml
# Read the Docs configuration file
# See https://docs.readthedocs.io/en/stable/config-file/v2.html for details
@@ -23,4 +20,3 @@ python:
- method: pip
path: .
system_packages: false
>>>>>>> f2470a9 (Initial sphinx docs.)
2 changes: 1 addition & 1 deletion docs/conf.py
@@ -80,4 +80,4 @@
nb_output_stderr = "remove-warn"

# https://stackoverflow.com/a/66295922/809705
autodoc_typehints = "description"
autodoc_typehints = "description"
6 changes: 3 additions & 3 deletions docs/data-model.ipynb
@@ -53,7 +53,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Making a `Key` from scratch is simple:"
"Making a {py:class}`~xarray_beam.Key` from scratch is simple:"
]
},
{
@@ -81,7 +81,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Or given an existing `Key`, you can easily modify it with `replace()` or `with_offsets()`:"
"Or given an existing {py:class}`~xarray_beam.Key`, you can easily modify it with `replace()` or `with_offsets()`:"
]
},
{
@@ -128,7 +128,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"`Key` objects don't do very much. They are just simple structs with two attributes, along with various special methods required to use them as `dict` keys or as keys in Beam pipelines. You can find a more examples of manipulating keys in the docstring."
"{py:class}`~xarray_beam.Key` objects don't do very much. They are just simple structs with two attributes, along with various special methods required to use them as `dict` keys or as keys in Beam pipelines. You can find a more examples of manipulating keys {py:class}`in its docstring <xarray_beam.Key>`."
]
},
{
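The cells above cover essentially the whole `Key` surface area. A minimal sketch of how those pieces fit together, with illustrative offsets and variable names that are not taken from the notebook:

```python
# Minimal sketch of the Key manipulations described above; the offsets and
# variable names are illustrative only.
import xarray_beam as xbeam

# A Key pairs integer offsets (one per dimension) with an optional set of
# variable names.
key = xbeam.Key({'x': 0, 'y': 0})

# with_offsets() returns a new Key with some offsets updated.
shifted = key.with_offsets(x=8)

# replace() swaps attributes wholesale, e.g. attaching a set of variable names.
named = key.replace(vars={'foo'})

# Keys are hashable, so they work as dict keys and as keys in Beam pipelines.
chunks = {key: None, shifted: None, named: None}
print(list(chunks))
```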
28 changes: 14 additions & 14 deletions docs/read-write.ipynb
@@ -27,7 +27,7 @@
{
"cell_type": "code",
"execution_count": 42,
"id": "5923b201",
"id": "93d8abec",
"metadata": {},
"outputs": [],
"source": [
@@ -37,7 +37,7 @@
{
"cell_type": "code",
"execution_count": 39,
"id": "bc5bfdc0",
"id": "bb9f5306",
"metadata": {
"tags": [
"hide-input"
@@ -100,7 +100,7 @@
},
{
"cell_type": "markdown",
"id": "2f0e5efb",
"id": "32acc6f5",
"metadata": {},
"source": [
"Importantly, xarray datasets fed into `DatasetToChunks` **can be lazy**, with data not already loaded eagerly into NumPy arrays. When you feed lazy datasets into `DatasetToChunks`, each individual chunk will be indexed and evaluated separately on Beam workers.\n",
@@ -113,7 +113,7 @@
{
"cell_type": "code",
"execution_count": 47,
"id": "8a0d0091",
"id": "e9d9df0e",
"metadata": {},
"outputs": [
{
@@ -149,15 +149,15 @@
},
{
"cell_type": "markdown",
"id": "de622acb",
"id": "dee1e6e1",
"metadata": {},
"source": [
"`chunks=None` tells Xarray to use its builtin lazy indexing machinery, instead of using Dask. This is advantageous because datasets using Xarray's lazy indexing are serialized much more compactly (via [pickle](https://docs.python.org/3/library/pickle.html)) when passed into Beam transforms."
]
},
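Putting the two cells above together, here is a hedged sketch of feeding a lazily opened Zarr store into `DatasetToChunks`; the store path, the `'time'` dimension, and the chunk size are placeholders, not values from the notebook:

```python
# Sketch only: the Zarr path and chunk sizes below are placeholders.
import apache_beam as beam
import xarray
import xarray_beam as xbeam

# chunks=None keeps Xarray's built-in lazy indexing (no Dask), which pickles
# compactly when chunks are shipped to Beam workers.
ds = xarray.open_zarr('my-data.zarr', chunks=None)

with beam.Pipeline() as p:
    (
        p
        # Each emitted (Key, xarray.Dataset) pair is indexed and loaded
        # separately on a worker.
        | xbeam.DatasetToChunks(ds, chunks={'time': 100})
        | beam.MapTuple(lambda key, chunk: print(key, chunk.sizes))
    )
```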
{
"cell_type": "markdown",
"id": "e4dc8c82",
"id": "6006a96e",
"metadata": {},
"source": [
"Alternatively, you can pass in lazy datasets [using dask](http://xarray.pydata.org/en/stable/user-guide/dask.html). In this case, you don't need to explicitly supply `chunks` to `DatasetToChunks`:"
@@ -166,7 +166,7 @@
{
"cell_type": "code",
"execution_count": 49,
"id": "b61440aa",
"id": "62563df5",
"metadata": {},
"outputs": [
{
@@ -198,7 +198,7 @@
},
{
"cell_type": "markdown",
"id": "d73c6398",
"id": "fe8869a6",
"metadata": {},
"source": [
"Dask's lazy evaluation system is much more general than Xarray's lazy indexing, so as long as resulting dataset can be independently evaluated in each chunk this can be a very convenient way to setup computation for Xarray-Beam.\n",
@@ -208,7 +208,7 @@
},
{
"cell_type": "markdown",
"id": "4c4dfd42",
"id": "2dd6e083",
"metadata": {},
"source": [
"```{note}\n",
@@ -226,7 +226,7 @@
},
{
"cell_type": "markdown",
"id": "2f415ceb",
"id": "672aaea7",
"metadata": {},
"source": [
"[Zarr](https://zarr.readthedocs.io/) is the preferred file format for reading and writing data with Xarray-Beam, due to its excellent scalability and support inside Xarray.\n",
@@ -239,7 +239,7 @@
{
"cell_type": "code",
"execution_count": 50,
"id": "9c6efa33",
"id": "8fd3fede",
"metadata": {},
"outputs": [
{
@@ -257,7 +257,7 @@
},
{
"cell_type": "markdown",
"id": "04c0f50b",
"id": "c035aee7",
"metadata": {},
"source": [
"By default, `ChunksToZarr` needs to evaluate and combine the entire distributed dataset in order to determine overall Zarr metadata (e.g., array names, shapes, dtypes and attributes). This is fine for relatively small datasets, but can entail significant additional communication and storage costs for large datasets.\n",
@@ -270,7 +270,7 @@
{
"cell_type": "code",
"execution_count": 55,
"id": "993191db",
"id": "63bc75e1",
"metadata": {},
"outputs": [],
"source": [
@@ -280,7 +280,7 @@
},
{
"cell_type": "markdown",
"id": "e70cd961",
"id": "711026c7",
"metadata": {},
"source": [
"Xarray operations like indexing and expand dimensions (see {py:meth}`xarray.Dataset.expand_dims`) are entirely lazy on this dataset, which makes it relatively straightforward to build up a Dataset with the required variables and dimensions, e.g., as used in the [ERA5 climatology example](https://github.com/google/xarray-beam/blob/main/examples/era5_climatology.py)."
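To make the template idea above concrete, here is a hedged sketch of writing chunks back out with `ChunksToZarr` without evaluating the full dataset first; the paths and chunk sizes are placeholders, and constructing the template via `xarray.zeros_like(ds.chunk())` is an assumed recipe rather than the notebook's exact code:

```python
# Hedged sketch: the paths and chunk sizes are placeholders, and building the
# template with xarray.zeros_like(ds.chunk()) is an assumed recipe, not code
# from the notebook.
import apache_beam as beam
import xarray
import xarray_beam as xbeam

ds = xarray.open_zarr('my-data.zarr', chunks=None)

# A lazy dataset with the right variables, shapes and dtypes acts as the
# template, so ChunksToZarr can set up Zarr metadata without gathering the
# whole distributed dataset.
template = xarray.zeros_like(ds.chunk())

with beam.Pipeline() as p:
    (
        p
        | xbeam.DatasetToChunks(ds, chunks={'time': 100})
        | xbeam.ChunksToZarr('my-output.zarr', template=template)
    )
```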
80 changes: 51 additions & 29 deletions docs/rechunking.ipynb
@@ -11,43 +11,31 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Rechunking lets us re-distribute how datasets are split between variables and chunks."
"Rechunking lets us re-distribute how datasets are split between variables and chunks across a Beam PCollection."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We'll recreate our dummy data from the data model tutorial:"
"To get started we'll recreate our dummy data from the data model tutorial:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/Users/shoyer/miniconda3/envs/xarray-beam/lib/python3.9/site-packages/apache_beam/__init__.py:79: UserWarning: This version of Apache Beam has not been sufficiently tested on Python 3.9. You may encounter bugs or missing features.\n",
" warnings.warn(\n"
]
}
],
"execution_count": 2,
"metadata": {
"tags": [
"hide-inputs"
]
},
"outputs": [],
"source": [
"import apache_beam as beam\n",
"import numpy as np\n",
"import xarray_beam as xbeam\n",
"import xarray"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"import xarray\n",
"\n",
"def create_records():\n",
" for offset in [0, 4]:\n",
" key = xbeam.Key({'x': offset, 'y': 0})\n",
@@ -65,7 +53,24 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### Adjusting variables"
"## Choosing chunks\n",
"\n",
"Chunking can be essential for some operations. Some operations are very hard or impossible to perform with certain chunking schemes. For example, to make a plot all the data needs to come toether on a single machine. Other calculations such as calculating a median are _possible_ to perform on distributed data, but require tricky algorithms and/or approximation.\n",
"\n",
"More broadly, chunking can have critical performance implications, similar to [those for Xarray and Dask](http://xarray.pydata.org/en/stable/user-guide/dask.html#chunking-and-performance). As a rule of thumb, chunk sizes of 10-100 MB work well. The optimal chunk size is a balance among a number of considerations, adapted here [from Dask docs](https://docs.dask.org/en/latest/array-chunks.html):\n",
"\n",
"1. Chunks should be small enough to fit comfortably into memory on a single machine. As an upper limit, chunks over roughly 2 GB in size will not fit into the protocol buffers Beam uses to pass data between workers. \n",
"2. There should be enough chunks for Beam runners (like Cloud Dataflow) to elastically shard work over many workers.\n",
"3. Chunks should be large enough to amortize the overhead of networking and the Python interpreter, which starts to become noticeable for arrays with fewer than 1 million elements.\n",
"\n",
"The `nbytes` attribute on both NumPy arrays and `xarray.Dataset` objects is a good easy way to figure out how larger chunks are."
]
},
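A quick back-of-the-envelope check along the lines of the cell above, using `nbytes` on a made-up array:

```python
# Back-of-the-envelope chunk-size check with nbytes; the array shape and the
# candidate chunk here are made up.
import numpy as np
import xarray

ds = xarray.Dataset({'foo': (('x', 'y'), np.zeros((4000, 5000)))})
chunk = ds.isel(x=slice(0, 1000))  # one candidate chunk along 'x'

print(f'{chunk.nbytes / 1e6:.0f} MB')  # aim for roughly 10-100 MB per chunk
```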
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Adjusting variables"
]
},
{
@@ -168,7 +173,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### Adjusting chunks"
"## Adjusting chunks"
]
},
{
@@ -178,14 +183,21 @@
"You can also adjust _chunks_ in a dataset to distribute arrays of different sizes. Here you have two choices of API:\n",
"\n",
"1. The lower level {py:class}`~xarray_beam.SplitChunks` and {py:class}`~xarray_beam.ConsolidateChunks`. These transformations apply a single splitting (with indexing) or consolidation (with {py:function}`xarray.concat`) function to array elements.\n",
"2. The high level {py:class}`~xarray_beam.Rechunk`, which uses a pipeline of multiple split/consolidate steps to efficient rechunk a dataset."
"2. The high level {py:class}`~xarray_beam.Rechunk`, which uses a pipeline of multiple split/consolidate steps (as needed) to efficiently rechunk a dataset.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Low level rechunking"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For minor adjustments, the more explicit `SplitChunks()` and `ConsolidateChunks()` are good options. They take a dict of _desired_ chunk sizes as a parameter, which can also be `-1` to indicate \"no chunking\" along a dimension:"
"For minor adjustments (e.g., mostly along a single dimension), the more explicit `SplitChunks()` and `ConsolidateChunks()` are good options. They take a dict of _desired_ chunk sizes as a parameter, which can also be `-1` to indicate \"no chunking\" along a dimension:"
]
},
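A hedged sketch of these low-level transforms; the dummy data below is a stand-in with the same layout as the notebook's `create_records()`, and the chunk sizes are illustrative:

```python
# Sketch of the low-level split/consolidate transforms; the dummy data is a
# stand-in, and the chunk sizes are illustrative.
import apache_beam as beam
import numpy as np
import xarray
import xarray_beam as xbeam


def make_records():
    # Two chunks along 'x' (offsets 0 and 4), mirroring create_records().
    for offset in [0, 4]:
        key = xbeam.Key({'x': offset, 'y': 0})
        data = np.arange(offset * 3, offset * 3 + 12).reshape(4, 3)
        yield key, xarray.Dataset({'foo': (('x', 'y'), data)})


with beam.Pipeline() as p:
    (
        p
        | beam.Create(list(make_records()))
        | xbeam.SplitChunks({'x': 1})         # split into size-1 chunks along 'x'
        | xbeam.ConsolidateChunks({'x': -1})  # -1: merge 'x' back into one chunk
        | beam.MapTuple(lambda key, chunk: print(key, chunk.sizes))
    )
```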
{
@@ -335,7 +347,14 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Alternatively, `Rechunk()` applies multiple split and consolidate steps based on the [Rechunker](https://github.com/pangeo-data/rechunker) algorithm:"
"### High level rechunking"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Alternatively, the high-level `Rechunk()` method applies multiple split and consolidate steps based on the [Rechunker](https://github.com/pangeo-data/rechunker) algorithm:"
]
},
{
@@ -387,11 +406,14 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"`Rechunk` requires specifying a few more parameters, but based on that information it can be _much_ more efficient for more complex rechunking tasks, particular in cases where data needs to be distributed into a very different shape (e.g., distributing a matrix across rows vs. columns). A naive \"splitting\" approach in such cases could divide datasets into extremely small tasks corresponding to individual array elements, which adds a huge amount of overhead."
"`Rechunk` requires specifying a few more parameters, but based on that information **it can be _much_ more efficient for more complex rechunking tasks**, particular in cases where data needs to be distributed into a very different shape (e.g., distributing a matrix across rows vs. columns).\n",
"\n",
"The naive \"splitting\" approach in such cases may divide datasets into extremely small tasks corresponding to individual array elements, which adds a huge amount of overhead."
]
}
],
"metadata": {
"celltoolbar": "Tags",
"interpreter": {
"hash": "aef148d7ea0dbd1f91630322dd5bc9e24a2135d95f24fe1a9dab9696856be2b9"
},
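And a hedged sketch of the high-level transform described in this notebook; the dimension sizes, source/target chunks, and `itemsize` below are illustrative assumptions, not values from the notebook:

```python
# Hedged sketch of Rechunk; all sizes below are illustrative assumptions.
import apache_beam as beam
import numpy as np
import xarray
import xarray_beam as xbeam


def make_records():
    # Two chunks of size 4 along 'x' (offsets 0 and 4), with 'y' unchunked.
    for offset in [0, 4]:
        key = xbeam.Key({'x': offset, 'y': 0})
        data = np.arange(offset * 3, offset * 3 + 12).reshape(4, 3)
        yield key, xarray.Dataset({'foo': (('x', 'y'), data)})


with beam.Pipeline() as p:
    (
        p
        | beam.Create(list(make_records()))
        | xbeam.Rechunk(
            dim_sizes={'x': 8, 'y': 3},
            source_chunks={'x': 4, 'y': -1},
            target_chunks={'x': -1, 'y': 1},
            itemsize=8,  # bytes per array element (int64 here)
        )
        | beam.MapTuple(lambda key, chunk: print(key, chunk.sizes))
    )
```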
8 changes: 8 additions & 0 deletions setup.py
@@ -24,6 +24,13 @@
'zarr',
'xarray',
]
docs_requires = [
'myst-nb',
'myst-parser',
'sphinx',
'sphinx_rtd_theme',
'scipy',
]
tests_requires = [
'absl-py',
'pandas',
@@ -39,6 +46,7 @@
install_requires=base_requires,
extras_require={
'tests': tests_requires,
'docs': docs_requires,
},
url='https://github.com/google/xarray-beam',
packages=setuptools.find_packages(),
