replace references to collection with catalog (#457)

* replace references to collection with catalog
* [pre-commit.ci] auto fixes from pre-commit.com hooks

  for more information, see https://pre-commit.ci
* don't break links to collection-spec
* Delete sandbox.ipynb
* missing one col
* collection to cat in new tutorial sample cats
* new google cmip cat
* update index.md in tutorials to use new module
* [pre-commit.ci] auto fixes from pre-commit.com hooks

  for more information, see https://pre-commit.ci
* Update README to use tutorial.py

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
intake · Mar 24, 2022 · 84f945b
1 parent 5d26293

Showing 35 changed files with 192 additions and 168 deletions.
diff --git a/README.md b/README.md
@@ -33,48 +33,50 @@ providing necessary functionality for searching, discovering, data access/loadin
 
 `intake-esm` is a data cataloging utility built on top of [intake](https://github.com/intake/intake), [pandas](https://pandas.pydata.org/), and [xarray](https://xarray.pydata.org/en/stable/), and it's pretty awesome!
 
-- Opening an ESM collection definition file: An ESM (Earth System Model) collection file is a JSON file that conforms
-  to the [ESM Collection Specification](https://github.com/NCAR/esm-collection-spec). When provided a link/path to an esm collection file, `intake-esm` establishes
+- Opening an ESM catalog definition file: An ESM (Earth System Model) catalog file is a JSON file that conforms
+  to the [ESM Collection Specification](https://github.com/NCAR/esm-collection-spec). When provided a link/path to an esm catalog file, `intake-esm` establishes
   a link to a database (CSV file) that contains data assets locations and associated metadata
-  (i.e., which experiment, model, the come from). The collection JSON file can be stored on a local filesystem
+  (i.e., which experiment and model they come from). The catalog JSON file can be stored on a local filesystem
   or can be hosted on a remote server.
 
   ```python
 
   In [1]: import intake
 
-  In [2]: col_url = "https://gist.githubusercontent.com/andersy005/7f416e57acd8319b20fc2b88d129d2b8/raw/987b4b336d1a8a4f9abec95c23eed3bd7c63c80e/pangeo-gcp-subset.json"
+  In [2]: import intake_esm
 
-  In [3]: col = intake.open_esm_datastore(col_url)
+  In [3]: cat_url = intake_esm.tutorial.get_url("google_cmip6")
 
-  In [4]: col
-  Out[4]: <pangeo-cmip6 catalog with 4287 dataset(s) from 282905 asset(s)>
+  In [4]: cat = intake.open_esm_datastore(cat_url)
+
+  In [5]: cat
+  Out[5]: <GOOGLE-CMIP6 catalog with 4 dataset(s) from 261 asset(s)>
   ```
 
 - Search and Discovery: `intake-esm` provides functionality to execute queries against the catalog:
 
   ```python
-  In [5]: col_subset = col.search(
+  In [5]: cat_subset = cat.search(
      ...:     experiment_id=["historical", "ssp585"],
      ...:     table_id="Oyr",
      ...:     variable_id="o2",
      ...:     grid_label="gn",
      ...: )
 
-  In [6]: col_subset
-  Out[6]: <pangeo-cmip6 catalog with 18 dataset(s) from 138 asset(s)>
+  In [6]: cat_subset
+  Out[6]: <GOOGLE-CMIP6 catalog with 4 dataset(s) from 261 asset(s)>
   ```
 
 - Access: when the user is satisfied with the results of their query, they can ask `intake-esm`
   to load data assets (netCDF/HDF files and/or Zarr stores) into xarray datasets:
 
   ```python
 
-    In [7]: dset_dict = col_subset.to_dataset_dict(zarr_kwargs={"consolidated": True})
+    In [7]: dset_dict = cat_subset.to_dataset_dict(zarr_kwargs={"consolidated": True})
 
     --> The keys in the returned dictionary of datasets are constructed as follows:
             'activity_id.institution_id.source_id.experiment_id.table_id.grid_label'
-    |███████████████████████████████████████████████████████████████| 100.00% [18/18 00:10<00:00]
+    |███████████████████████████████████████████████████████████████| 100.00% [2/2 00:18<00:00]
   ```
 
 See [documentation](https://intake-esm.readthedocs.io/en/latest/) for more information.
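
The key format quoted in the README hunk above can be illustrated offline. This is a minimal sketch, assuming the key is just the listed column values joined with `.`; the helper `build_key` and the sample row are hypothetical, with values read off the Google Cloud Storage path quoted in the tutorial diff further down. It is not taken from the intake-esm source.

```python
# Columns named in the README message, in the order they appear there.
KEY_COLUMNS = [
    "activity_id",
    "institution_id",
    "source_id",
    "experiment_id",
    "table_id",
    "grid_label",
]


def build_key(row, columns=KEY_COLUMNS, sep="."):
    """Join the chosen column values of one catalog row into a dataset key.

    Hypothetical helper for illustration only.
    """
    return sep.join(str(row[column]) for column in columns)


# Sample row (assumption): values match the first tutorial asset.
row = {
    "activity_id": "CMIP",
    "institution_id": "IPSL",
    "source_id": "IPSL-CM6A-LR",
    "experiment_id": "historical",
    "table_id": "Amon",
    "grid_label": "gr",
    "variable_id": "va",  # not part of the key
}

key = build_key(row)
print(key)
```

Columns left out of the key, such as `variable_id` here, are the ones that would end up combined inside a single dataset under this reading.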

diff --git a/...source/explanation/esm-collection-spec.md → docs/source/explanation/esm-catalog-spec.md b/...source/explanation/esm-collection-spec.md → docs/source/explanation/esm-catalog-spec.md
@@ -1,12 +1,12 @@
-# ESM Collection Specification
+# ESM Catalog Specification
 
 ```{note}
 This document mirrors the [ESM Collection Specification](https://github.com/NCAR/esm-collection-spec/blob/master/collection-spec/collection-spec.md) and is updated as the specification evolves.
 ```
 
-- [ESM Collection Specification](#esm-collection-specification)
+- [ESM Catalog Specification](#esm-catalog-specification)
   - [Overview](#overview)
-    - [Collection Specification](#collection-specification)
+    - [Catalog Specification](#catalog-specification)
     - [Catalog](#catalog)
     - [Assets (Data Files)](#assets-data-files)
   - [Catalog fields](#catalog-fields)
@@ -17,22 +17,22 @@ This documents mirrors the [ESM Collection Specification](https://github.com/NCA
 
 ## Overview
 
-This document explains the structure and content of an ESM Collection.
-A collection provides metadata about the catalog, telling us what we expect to find inside and how to open it.
-The collection is described is a single json file, inspired by the STAC spec.
+This document explains the structure and content of an ESM Catalog.
+A catalog describes its own contents with metadata, telling us what we expect to find inside and how to open it.
+The catalog is described in a single JSON file, inspired by the STAC spec.
 
-The ESM Collection specification consists of three parts:
+The ESM Catalog specification consists of three parts:
 
-### Collection Specification
+### Catalog Specification
 
-The _collection_ specification provides metadata about the catalog, telling us what we expect to find inside and how to open it.
+The _catalog_ specification provides metadata about the catalog, telling us what we expect to find inside and how to open it.
 The descriptor is a single json file, inspired by the [STAC spec](https://github.com/radiantearth/stac-spec).
 
 ```json
 {
   "esmcat_version": "0.1.0",
   "id": "sample",
-  "description": "This is a very basic sample ESM collection.",
+  "description": "This is a very basic sample ESM catalog.",
   "catalog_file": "sample_catalog.csv",
   "attributes": [
     {
@@ -70,17 +70,17 @@ They should be either [URIs](https://en.wikipedia.org/wiki/Uniform_Resource_Iden
 
 ## Catalog fields
 
-| Element             | Type                                                      | Description                                                                                                                                                               |
-| ------------------- | --------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| esmcat_version      | string                                                    | **REQUIRED.** The ESM Catalog version the collection implements.                                                                                                          |
-| id                  | string                                                    | **REQUIRED.** Identifier for the collection.                                                                                                                              |
-| title               | string                                                    | A short descriptive one-line title for the collection.                                                                                                                    |
-| description         | string                                                    | **REQUIRED.** Detailed multi-line description to fully explain the collection. [CommonMark 0.28](http://commonmark.org/) syntax MAY be used for rich text representation. |
-| catalog_file        | string                                                    | **REQUIRED.** Path to a the CSV file with the catalog contents.                                                                                                           |
-| catalog_dict        | array                                                     | If specified, it is mutually exclusive with `catalog_file`. An array of dictionaries that represents the data that would otherwise be in the csv.                         |
-| attributes          | [[Attribute Object](#attribute-object)]                   | **REQUIRED.** A list of attribute columns in the data set.                                                                                                                |
-| assets              | [Assets Object](#assets-object)                           | **REQUIRED.** Description of how the assets (data files) are referenced in the CSV catalog file.                                                                          |
-| aggregation_control | [Aggregation Control Object](#aggregation-control-object) | **OPTIONAL.** Description of how to support aggregation of multiple assets into a single xarray data set.                                                                 |
+| Element             | Type                                                      | Description                                                                                                                                                            |
+| ------------------- | --------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| esmcat_version      | string                                                    | **REQUIRED.** The ESM Catalog version the catalog implements.                                                                                                          |
+| id                  | string                                                    | **REQUIRED.** Identifier for the catalog.                                                                                                                              |
+| title               | string                                                    | A short descriptive one-line title for the catalog.                                                                                                                    |
+| description         | string                                                    | **REQUIRED.** Detailed multi-line description to fully explain the catalog. [CommonMark 0.28](http://commonmark.org/) syntax MAY be used for rich text representation. |
+| catalog_file        | string                                                    | **REQUIRED.** Path to the CSV file with the catalog contents.                                                                                                          |
+| catalog_dict        | array                                                     | If specified, it is mutually exclusive with `catalog_file`. An array of dictionaries that represents the data that would otherwise be in the csv.                      |
+| attributes          | [[Attribute Object](#attribute-object)]                   | **REQUIRED.** A list of attribute columns in the data set.                                                                                                             |
+| assets              | [Assets Object](#assets-object)                           | **REQUIRED.** Description of how the assets (data files) are referenced in the CSV catalog file.                                                                       |
+| aggregation_control | [Aggregation Control Object](#aggregation-control-object) | **OPTIONAL.** Description of how to support aggregation of multiple assets into a single xarray data set.                                                              |
 
 ### Attribute Object
 

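The catalog fields table in the hunk above lends itself to a quick sanity check. The following is a rough sketch using only the standard library, not intake-esm's own validation; `check_catalog` is a hypothetical helper, the contents of `attributes` and `assets` in the sample are placeholders, and treating `catalog_file`/`catalog_dict` as "exactly one of the two" is an interpretation of the table's "mutually exclusive" wording.

```python
# Fields the table marks as REQUIRED, apart from the catalog_file/catalog_dict pair.
REQUIRED_FIELDS = ("esmcat_version", "id", "description", "attributes", "assets")


def check_catalog(cat):
    """Return a list of human-readable problems found in a catalog dict.

    Hypothetical helper: an empty list means no problem was detected.
    """
    problems = [
        f"missing required field: {name}"
        for name in REQUIRED_FIELDS
        if name not in cat
    ]
    has_file = "catalog_file" in cat
    has_dict = "catalog_dict" in cat
    if has_file and has_dict:
        problems.append("catalog_file and catalog_dict are mutually exclusive")
    elif not (has_file or has_dict):
        # Assumption: one of the two must be present to locate the data.
        problems.append("one of catalog_file or catalog_dict is needed")
    return problems


sample = {
    "esmcat_version": "0.1.0",
    "id": "sample",
    "description": "This is a very basic sample ESM catalog.",
    "catalog_file": "sample_catalog.csv",
    "attributes": [{"column_name": "activity_id"}],  # placeholder content
    "assets": {"column_name": "path", "format": "netcdf"},  # placeholder content
}

print(check_catalog(sample))
print(check_catalog({"id": "incomplete"}))
```
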
diff --git a/docs/source/explanation/index.md b/docs/source/explanation/index.md
@@ -4,5 +4,5 @@
 ---
 maxdepth: 1
 ---
-esm-collection-spec.md
+esm-catalog-spec.md
 ```
diff --git a/docs/source/how-to/enforce-search-query-criteria-via-require-all-on.md b/docs/source/how-to/enforce-search-query-criteria-via-require-all-on.md
@@ -25,7 +25,7 @@ dataframe column or a list of dataframe columns across which all elements must
 satisfy the query criteria. The `require_all_on` argument is best explained with
 the following example.
 
-Let’s define a query for our collection that requests multiple variable_ids and
+Let’s define a query for our catalog that requests multiple variable_ids and
 multiple experiment_ids from the Omon table_id, all from 3 different source_ids:
 
 ```{code-cell} ipython3
@@ -38,7 +38,7 @@ query = dict(
 )
 ```
 
-Now, let’s use this query to search for all assets in the collection that
+Now, let’s use this query to search for all assets in the catalog that
 satisfy any combination of these requests (i.e., with `require_all_on=None`,
 which is the default):
 

diff --git a/...rce/how-to/multi-variable-collection.json → ...source/how-to/multi-variable-catalog.json b/...rce/how-to/multi-variable-collection.json → ...source/how-to/multi-variable-catalog.json
@@ -1,7 +1,7 @@
 {
   "esmcat_version": "0.1.0",
   "id": "sample-multi-variable-cesm1-lens",
-  "description": "This is a sample ESM collection emulating multi variable/history files for CESM1-LENS",
+  "description": "This is a sample ESM catalog emulating multi variable/history files for CESM1-LENS",
   "catalog_file": "multi-variable-catalog.csv",
   "attributes": [
     {

diff --git a/docs/source/how-to/use-catalogs-with-assets-containing-multiple-variables.md b/docs/source/how-to/use-catalogs-with-assets-containing-multiple-variables.md
@@ -35,7 +35,7 @@ import intake
 import ast
 
 cat = intake.open_esm_datastore(
-    "multi-variable-collection.json",
+    "multi-variable-catalog.json",
     read_csv_kwargs={"converters": {"variable": ast.literal_eval}},
 )
 cat
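
The `converters` entry in the hunk above maps a column name to a function applied to each raw cell, which is how a stringified list in the CSV becomes a real Python list. The effect can be reproduced with the standard library alone; in this sketch the CSV text, file names, and variable names are made up for illustration, and the loop only imitates what a per-column converter does.

```python
import ast
import csv
import io

# Hypothetical catalog CSV: the "variable" column stores a list as text.
csv_text = """path,variable
file_a.nc,"['SHF', 'TEMP']"
file_b.nc,"['FLNS']"
"""

# Same shape as the converters mapping passed through read_csv_kwargs.
converters = {"variable": ast.literal_eval}

rows = []
for raw in csv.DictReader(io.StringIO(csv_text)):
    # Apply the column's converter if one exists, otherwise keep the string.
    rows.append(
        {column: converters.get(column, str)(value) for column, value in raw.items()}
    )

for row in rows:
    print(row["path"], row["variable"], type(row["variable"]).__name__)
```

`ast.literal_eval` only evaluates Python literals, so it is a safer choice than `eval` for parsing such cells.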

diff --git a/docs/source/tutorials/index.md b/docs/source/tutorials/index.md
@@ -27,22 +27,43 @@ import intake
 
 At import time, intake-esm plugin is available in intake’s registry as
 `esm_datastore` and can be accessed with `intake.open_esm_datastore()` function.
+Use the `intake_esm.tutorial.get_url()` method to access smaller subsetted catalogs for tutorial purposes.
+
+```{code-cell} ipython3
+
+import intake_esm
+url = intake_esm.tutorial.get_url('google_cmip6')
+print(url)
+```
 
 ```{code-cell} ipython3
 
-url = "https://gist.githubusercontent.com/andersy005/7f416e57acd8319b20fc2b88d129d2b8/raw/987b4b336d1a8a4f9abec95c23eed3bd7c63c80e/pangeo-gcp-subset.json"
 cat = intake.open_esm_datastore(url)
 cat
 ```
 
-The summary above tells us that this catalog contains over 268,000 data assets.
+The summary above tells us that this catalog contains 261 data assets.
 We can get more information on the individual data assets contained in the
 catalog by looking at the underlying dataframe created when we load the catalog:
 
 ```{code-cell} ipython3
 cat.df.head()
 ```
 
+The first data asset listed in the catalog contains:
+
+- the Northward Wind (variable_id='va'), as a function of latitude, longitude, time,
+
+- the latest version of the IPSL climate model (source_id='IPSL-CM6A-LR'),
+
+- hindcasts initialized from observations with historical forcing (experiment_id='historical'),
+
+- developed by the Institut Pierre Simon Laplace (institution_id='IPSL'),
+
+- run as part of the Coupled Model Intercomparison Project (activity_id='CMIP')
+
+And is located in Google Cloud Storage at 'gs://cmip6/CMIP6/CMIP/IPSL/IPSL-CM6A-LR/historical/r2i1p1f1/Amon/va/gr/v20180803/'.
+
 ## Finding unique entries
 
 To get unique values for given columns in the catalog, intake-esm provides a
@@ -88,12 +109,10 @@ In the example below, we are are going to search for the following:
 
 - variable_id: `o2` which stands for
   `mole_concentration_of_dissolved_molecular_oxygen_in_sea_water`
-- experiment_id: `['historical', 'ssp585']`:
-  - `historical`: all forcing of the recent past.
-  - `ssp585`: emission-driven
-    [RCP8.5](https://en.wikipedia.org/wiki/Representative_Concentration_Pathway)
-    based on SSP5.
-- table_id: `Oyr` which stands for annual mean variables on the ocean grid.
+- experiments: ['historical', 'ssp585']:
+  - historical: all forcing of the recent past.
+  - ssp585: emission-driven RCP8.5 based on SSP5.
+- table_id: `Oyr` which stands for annual mean variables on the ocean grid.
 - grid_label: `gn` which stands for data reported on a model's native grid.
 
 ```{note}
@@ -133,6 +152,9 @@ returns a dictionary of aggregate xarray datasets as the name hints.
 dset_dict = cat_subset.to_dataset_dict(
     xarray_open_kwargs={"consolidated": True, "decode_times": True, "use_cftime": True}
 )
+```
+
+```{code-cell} ipython3
 [key for key in dset_dict.keys()]
 ```
 

diff --git a/intake_esm/cat.py b/intake_esm/cat.py
@@ -188,7 +188,7 @@ def save(
             json_kwargs.update(json_dump_kwargs or {})
             json.dump(data, outfile, **json_kwargs)
 
-        print(f'Successfully wrote ESM collection json file to: {json_file_name}')
+        print(f'Successfully wrote ESM catalog json file to: {json_file_name}')
 
     @classmethod
     def load(