Fixes per review
shoyer committed Aug 5, 2021
1 parent 72191d5 commit 9b91089
Showing 7 changed files with 78 additions and 52 deletions.
2 changes: 1 addition & 1 deletion .gitignore
@@ -4,5 +4,5 @@ dist
docs/.ipynb_checkpoints
docs/_build
docs/_autosummary
docs/my-data.zarr
docs/*.zarr
__pycache__
4 changes: 0 additions & 4 deletions .read-the-docs.yaml
@@ -1,6 +1,3 @@
<<<<<<< HEAD
# TODO
=======
# .readthedocs.yaml
# Read the Docs configuration file
# See https://docs.readthedocs.io/en/stable/config-file/v2.html for details
@@ -23,4 +20,3 @@ python:
- method: pip
path: .
system_packages: false
>>>>>>> f2470a9 (Initial sphinx docs.)
2 changes: 1 addition & 1 deletion docs/conf.py
@@ -80,4 +80,4 @@
nb_output_stderr = "remove-warn"

# https://stackoverflow.com/a/66295922/809705
autodoc_typehints = "description"
autodoc_typehints = "description"
6 changes: 3 additions & 3 deletions docs/data-model.ipynb
@@ -53,7 +53,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Making a `Key` from scratch is simple:"
"Making a {py:class}`~xarray_beam.Key` from scratch is simple:"
]
},
{
@@ -81,7 +81,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Or given an existing `Key`, you can easily modify it with `replace()` or `with_offsets()`:"
"Or given an existing {py:class}`~xarray_beam.Key`, you can easily modify it with `replace()` or `with_offsets()`:"
]
},
{
@@ -128,7 +128,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"`Key` objects don't do very much. They are just simple structs with two attributes, along with various special methods required to use them as `dict` keys or as keys in Beam pipelines. You can find a more examples of manipulating keys in the docstring."
"{py:class}`~xarray_beam.Key` objects don't do very much. They are just simple structs with two attributes, along with various special methods required to use them as `dict` keys or as keys in Beam pipelines. You can find a more examples of manipulating keys {py:class}`in its docstring <xarray_beam.Key>`."
]
},
{
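The cells above cover essentially the whole `Key` surface area. A minimal sketch of how those pieces fit together, with illustrative offsets and variable names that are not taken from the notebook:

```python
# Minimal sketch of the Key manipulations described above; the offsets and
# variable names are illustrative only.
import xarray_beam as xbeam

# A Key pairs integer offsets (one per dimension) with an optional set of
# variable names.
key = xbeam.Key({'x': 0, 'y': 0})

# with_offsets() returns a new Key with some offsets updated.
shifted = key.with_offsets(x=8)

# replace() swaps attributes wholesale, e.g. attaching a set of variable names.
named = key.replace(vars={'foo'})

# Keys are hashable, so they work as dict keys and as keys in Beam pipelines.
chunks = {key: None, shifted: None, named: None}
print(list(chunks))
```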
28 changes: 14 additions & 14 deletions docs/read-write.ipynb
@@ -27,7 +27,7 @@
{
"cell_type": "code",
"execution_count": 42,
"id": "5923b201",
"id": "93d8abec",
"metadata": {},
"outputs": [],
"source": [
@@ -37,7 +37,7 @@
{
"cell_type": "code",
"execution_count": 39,
"id": "bc5bfdc0",
"id": "bb9f5306",
"metadata": {
"tags": [
"hide-input"
@@ -100,7 +100,7 @@
},
{
"cell_type": "markdown",
"id": "2f0e5efb",
"id": "32acc6f5",
"metadata": {},
"source": [
"Importantly, xarray datasets fed into `DatasetToChunks` **can be lazy**, with data not already loaded eagerly into NumPy arrays. When you feed lazy datasets into `DatasetToChunks`, each individual chunk will be indexed and evaluated separately on Beam workers.\n",
@@ -113,7 +113,7 @@
{
"cell_type": "code",
"execution_count": 47,
"id": "8a0d0091",
"id": "e9d9df0e",
"metadata": {},
"outputs": [
{
@@ -149,15 +149,15 @@
},
{
"cell_type": "markdown",
"id": "de622acb",
"id": "dee1e6e1",
"metadata": {},
"source": [
"`chunks=None` tells Xarray to use its builtin lazy indexing machinery, instead of using Dask. This is advantageous because datasets using Xarray's lazy indexing are serialized much more compactly (via [pickle](https://docs.python.org/3/library/pickle.html)) when passed into Beam transforms."
]
},
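Putting the two cells above together, here is a hedged sketch of feeding a lazily opened Zarr store into `DatasetToChunks`; the store path, the `'time'` dimension, and the chunk size are placeholders, not values from the notebook:

```python
# Sketch only: the Zarr path and chunk sizes below are placeholders.
import apache_beam as beam
import xarray
import xarray_beam as xbeam

# chunks=None keeps Xarray's built-in lazy indexing (no Dask), which pickles
# compactly when chunks are shipped to Beam workers.
ds = xarray.open_zarr('my-data.zarr', chunks=None)

with beam.Pipeline() as p:
    (
        p
        # Each emitted (Key, xarray.Dataset) pair is indexed and loaded
        # separately on a worker.
        | xbeam.DatasetToChunks(ds, chunks={'time': 100})
        | beam.MapTuple(lambda key, chunk: print(key, chunk.sizes))
    )
```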
{
"cell_type": "markdown",
"id": "e4dc8c82",
"id": "6006a96e",
"metadata": {},
"source": [
"Alternatively, you can pass in lazy datasets [using dask](http://xarray.pydata.org/en/stable/user-guide/dask.html). In this case, you don't need to explicitly supply `chunks` to `DatasetToChunks`:"
@@ -166,7 +166,7 @@
{
"cell_type": "code",
"execution_count": 49,
"id": "b61440aa",
"id": "62563df5",
"metadata": {},
"outputs": [
{
@@ -198,7 +198,7 @@
},
{
"cell_type": "markdown",
"id": "d73c6398",
"id": "fe8869a6",
"metadata": {},
"source": [
"Dask's lazy evaluation system is much more general than Xarray's lazy indexing, so as long as resulting dataset can be independently evaluated in each chunk this can be a very convenient way to setup computation for Xarray-Beam.\n",
@@ -208,7 +208,7 @@
},
{
"cell_type": "markdown",
"id": "4c4dfd42",
"id": "2dd6e083",
"metadata": {},
"source": [
"```{note}\n",
@@ -226,7 +226,7 @@
},
{
"cell_type": "markdown",
"id": "2f415ceb",
"id": "672aaea7",
"metadata": {},
"source": [
"[Zarr](https://zarr.readthedocs.io/) is the preferred file format for reading and writing data with Xarray-Beam, due to its excellent scalability and support inside Xarray.\n",
@@ -239,7 +239,7 @@
{
"cell_type": "code",
"execution_count": 50,
"id": "9c6efa33",
"id": "8fd3fede",
"metadata": {},
"outputs": [
{
@@ -257,7 +257,7 @@
},
{
"cell_type": "markdown",
"id": "04c0f50b",
"id": "c035aee7",
"metadata": {},
"source": [
"By default, `ChunksToZarr` needs to evaluate and combine the entire distributed dataset in order to determine overall Zarr metadata (e.g., array names, shapes, dtypes and attributes). This is fine for relatively small datasets, but can entail significant additional communication and storage costs for large datasets.\n",
@@ -270,7 +270,7 @@
{
"cell_type": "code",
"execution_count": 55,
"id": "993191db",
"id": "63bc75e1",
"metadata": {},
"outputs": [],
"source": [
@@ -280,7 +280,7 @@
},
{
"cell_type": "markdown",
"id": "e70cd961",
"id": "711026c7",
"metadata": {},
"source": [
"Xarray operations like indexing and expand dimensions (see {py:meth}`xarray.Dataset.expand_dims`) are entirely lazy on this dataset, which makes it relatively straightforward to build up a Dataset with the required variables and dimensions, e.g., as used in the [ERA5 climatology example](https://github.com/google/xarray-beam/blob/main/examples/era5_climatology.py)."
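To make the template idea above concrete, here is a hedged sketch of writing chunks back out with `ChunksToZarr` without evaluating the full dataset first; the paths and chunk sizes are placeholders, and constructing the template via `xarray.zeros_like(ds.chunk())` is an assumed recipe rather than the notebook's exact code:

```python
# Hedged sketch: the paths and chunk sizes are placeholders, and building the
# template with xarray.zeros_like(ds.chunk()) is an assumed recipe, not code
# from the notebook.
import apache_beam as beam
import xarray
import xarray_beam as xbeam

ds = xarray.open_zarr('my-data.zarr', chunks=None)

# A lazy dataset with the right variables, shapes and dtypes acts as the
# template, so ChunksToZarr can set up Zarr metadata without gathering the
# whole distributed dataset.
template = xarray.zeros_like(ds.chunk())

with beam.Pipeline() as p:
    (
        p
        | xbeam.DatasetToChunks(ds, chunks={'time': 100})
        | xbeam.ChunksToZarr('my-output.zarr', template=template)
    )
```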
80 changes: 51 additions & 29 deletions docs/rechunking.ipynb
@@ -11,43 +11,31 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Rechunking lets us re-distribute how datasets are split between variables and chunks."
"Rechunking lets us re-distribute how datasets are split between variables and chunks across a Beam PCollection."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We'll recreate our dummy data from the data model tutorial:"
"To get started we'll recreate our dummy data from the data model tutorial:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/Users/shoyer/miniconda3/envs/xarray-beam/lib/python3.9/site-packages/apache_beam/__init__.py:79: UserWarning: This version of Apache Beam has not been sufficiently tested on Python 3.9. You may encounter bugs or missing features.\n",
" warnings.warn(\n"
]
}
],
"execution_count": 2,
"metadata": {
"tags": [
"hide-inputs"
]
},
"outputs": [],
"source": [
"import apache_beam as beam\n",
"import numpy as np\n",
"import xarray_beam as xbeam\n",
"import xarray"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"import xarray\n",
"\n",
"def create_records():\n",
" for offset in [0, 4]:\n",
" key = xbeam.Key({'x': offset, 'y': 0})\n",
@@ -65,7 +53,24 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### Adjusting variables"
"## Choosing chunks\n",
"\n",
"Chunking can be essential for some operations. Some operations are very hard or impossible to perform with certain chunking schemes. For example, to make a plot all the data needs to come toether on a single machine. Other calculations such as calculating a median are _possible_ to perform on distributed data, but require tricky algorithms and/or approximation.\n",
"\n",
"More broadly, chunking can have critical performance implications, similar to [those for Xarray and Dask](http://xarray.pydata.org/en/stable/user-guide/dask.html#chunking-and-performance). As a rule of thumb, chunk sizes of 10-100 MB work well. The optimal chunk size is a balance among a number of considerations, adapted here [from Dask docs](https://docs.dask.org/en/latest/array-chunks.html):\n",
"\n",
"1. Chunks should be small enough to fit comfortably into memory on a single machine. As an upper limit, chunks over roughly 2 GB in size will not fit into the protocol buffers Beam uses to pass data between workers. \n",
"2. There should be enough chunks for Beam runners (like Cloud Dataflow) to elastically shard work over many workers.\n",
"3. Chunks should be large enough to amortize the overhead of networking and the Python interpreter, which starts to become noticeable for arrays with fewer than 1 million elements.\n",
"\n",
"The `nbytes` attribute on both NumPy arrays and `xarray.Dataset` objects is a good easy way to figure out how larger chunks are."
]
},
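A quick back-of-the-envelope check along the lines of the cell above, using `nbytes` on a made-up array:

```python
# Back-of-the-envelope chunk-size check with nbytes; the array shape and the
# candidate chunk here are made up.
import numpy as np
import xarray

ds = xarray.Dataset({'foo': (('x', 'y'), np.zeros((4000, 5000)))})
chunk = ds.isel(x=slice(0, 1000))  # one candidate chunk along 'x'

print(f'{chunk.nbytes / 1e6:.0f} MB')  # aim for roughly 10-100 MB per chunk
```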
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Adjusting variables"
]
},
{
@@ -168,7 +173,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### Adjusting chunks"
"## Adjusting chunks"
]
},
{
@@ -178,14 +183,21 @@
"You can also adjust _chunks_ in a dataset to distribute arrays of different sizes. Here you have two choices of API:\n",
"\n",
"1. The lower level {py:class}`~xarray_beam.SplitChunks` and {py:class}`~xarray_beam.ConsolidateChunks`. These transformations apply a single splitting (with indexing) or consolidation (with {py:function}`xarray.concat`) function to array elements.\n",
"2. The high level {py:class}`~xarray_beam.Rechunk`, which uses a pipeline of multiple split/consolidate steps to efficient rechunk a dataset."
"2. The high level {py:class}`~xarray_beam.Rechunk`, which uses a pipeline of multiple split/consolidate steps (as needed) to efficiently rechunk a dataset.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Low level rechunking"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For minor adjustments, the more explicit `SplitChunks()` and `ConsolidateChunks()` are good options. They take a dict of _desired_ chunk sizes as a parameter, which can also be `-1` to indicate \"no chunking\" along a dimension:"
"For minor adjustments (e.g., mostly along a single dimension), the more explicit `SplitChunks()` and `ConsolidateChunks()` are good options. They take a dict of _desired_ chunk sizes as a parameter, which can also be `-1` to indicate \"no chunking\" along a dimension:"
]
},
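A hedged sketch of these low-level transforms; the dummy data below is a stand-in with the same layout as the notebook's `create_records()`, and the chunk sizes are illustrative:

```python
# Sketch of the low-level split/consolidate transforms; the dummy data is a
# stand-in, and the chunk sizes are illustrative.
import apache_beam as beam
import numpy as np
import xarray
import xarray_beam as xbeam


def make_records():
    # Two chunks along 'x' (offsets 0 and 4), mirroring create_records().
    for offset in [0, 4]:
        key = xbeam.Key({'x': offset, 'y': 0})
        data = np.arange(offset * 3, offset * 3 + 12).reshape(4, 3)
        yield key, xarray.Dataset({'foo': (('x', 'y'), data)})


with beam.Pipeline() as p:
    (
        p
        | beam.Create(list(make_records()))
        | xbeam.SplitChunks({'x': 1})         # split into size-1 chunks along 'x'
        | xbeam.ConsolidateChunks({'x': -1})  # -1: merge 'x' back into one chunk
        | beam.MapTuple(lambda key, chunk: print(key, chunk.sizes))
    )
```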
{
@@ -335,7 +347,14 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Alternatively, `Rechunk()` applies multiple split and consolidate steps based on the [Rechunker](https://github.com/pangeo-data/rechunker) algorithm:"
"### High level rechunking"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Alternatively, the high-level `Rechunk()` method applies multiple split and consolidate steps based on the [Rechunker](https://github.com/pangeo-data/rechunker) algorithm:"
]
},
{
@@ -387,11 +406,14 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"`Rechunk` requires specifying a few more parameters, but based on that information it can be _much_ more efficient for more complex rechunking tasks, particular in cases where data needs to be distributed into a very different shape (e.g., distributing a matrix across rows vs. columns). A naive \"splitting\" approach in such cases could divide datasets into extremely small tasks corresponding to individual array elements, which adds a huge amount of overhead."
"`Rechunk` requires specifying a few more parameters, but based on that information **it can be _much_ more efficient for more complex rechunking tasks**, particular in cases where data needs to be distributed into a very different shape (e.g., distributing a matrix across rows vs. columns).\n",
"\n",
"The naive \"splitting\" approach in such cases may divide datasets into extremely small tasks corresponding to individual array elements, which adds a huge amount of overhead."
]
}
],
"metadata": {
"celltoolbar": "Tags",
"interpreter": {
"hash": "aef148d7ea0dbd1f91630322dd5bc9e24a2135d95f24fe1a9dab9696856be2b9"
},
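And a hedged sketch of the high-level transform described in this notebook; the dimension sizes, source/target chunks, and `itemsize` below are illustrative assumptions, not values from the notebook:

```python
# Hedged sketch of Rechunk; all sizes below are illustrative assumptions.
import apache_beam as beam
import numpy as np
import xarray
import xarray_beam as xbeam


def make_records():
    # Two chunks of size 4 along 'x' (offsets 0 and 4), with 'y' unchunked.
    for offset in [0, 4]:
        key = xbeam.Key({'x': offset, 'y': 0})
        data = np.arange(offset * 3, offset * 3 + 12).reshape(4, 3)
        yield key, xarray.Dataset({'foo': (('x', 'y'), data)})


with beam.Pipeline() as p:
    (
        p
        | beam.Create(list(make_records()))
        | xbeam.Rechunk(
            dim_sizes={'x': 8, 'y': 3},
            source_chunks={'x': 4, 'y': -1},
            target_chunks={'x': -1, 'y': 1},
            itemsize=8,  # bytes per array element (int64 here)
        )
        | beam.MapTuple(lambda key, chunk: print(key, chunk.sizes))
    )
```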
8 changes: 8 additions & 0 deletions setup.py
@@ -24,6 +24,13 @@
'zarr',
'xarray',
]
docs_requires = [
'myst-nb',
'myst-parser',
'sphinx',
'sphinx_rtd_theme',
'scipy',
]
tests_requires = [
'absl-py',
'pandas',
@@ -39,6 +46,7 @@
install_requires=base_requires,
extras_require={
'tests': tests_requires,
'docs': docs_requires,
},
url='https://github.com/google/xarray-beam',
packages=setuptools.find_packages(),
