replace references to collection with catalog (#457)
* replace references to collection with catalog

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* don't break links to collection-spec

* Delete sandbox.ipynb

* missing one col

* collection to cat in new tutorial sample cats

* new google cmip cat

* update index.md in tutorials to use new module

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update README to use tutorial.py

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
jukent and pre-commit-ci[bot] authored Mar 24, 2022
1 parent 5d26293 commit 84f945b
Showing 35 changed files with 192 additions and 168 deletions.
26 changes: 14 additions & 12 deletions README.md
@@ -33,48 +33,50 @@ providing necessary functionality for searching, discovering, data access/loadin

`intake-esm` is a data cataloging utility built on top of [intake](https://github.com/intake/intake), [pandas](https://pandas.pydata.org/), and [xarray](https://xarray.pydata.org/en/stable/), and it's pretty awesome!

- Opening an ESM collection definition file: An ESM (Earth System Model) collection file is a JSON file that conforms
to the [ESM Collection Specification](https://github.com/NCAR/esm-collection-spec). When provided a link/path to an esm collection file, `intake-esm` establishes
- Opening an ESM catalog definition file: An ESM (Earth System Model) catalog file is a JSON file that conforms
to the [ESM Collection Specification](https://github.com/NCAR/esm-collection-spec). When provided a link/path to an esm catalog file, `intake-esm` establishes
a link to a database (CSV file) that contains data asset locations and associated metadata
(i.e., which experiment and model they come from). The collection JSON file can be stored on a local filesystem
(i.e., which experiment and model they come from). The catalog JSON file can be stored on a local filesystem
or can be hosted on a remote server.

```python

In [1]: import intake

In [2]: col_url = "https://gist.githubusercontent.com/andersy005/7f416e57acd8319b20fc2b88d129d2b8/raw/987b4b336d1a8a4f9abec95c23eed3bd7c63c80e/pangeo-gcp-subset.json"
In [2]: import intake_esm

In [3]: col = intake.open_esm_datastore(col_url)
In [3]: cat_url = intake_esm.tutorial.get_url("google_cmip6")

In [4]: col
Out[4]: <pangeo-cmip6 catalog with 4287 dataset(s) from 282905 asset(s)>
In [4]: cat = intake.open_esm_datastore(cat_url)

In [5]: cat
Out[5]: <GOOGLE-CMIP6 catalog with 4 dataset(s) from 261 asset(s)>
```

- Search and Discovery: `intake-esm` provides functionality to execute queries against the catalog:

```python
In [5]: col_subset = col.search(
In [5]: cat_subset = cat.search(
...: experiment_id=["historical", "ssp585"],
...: table_id="Oyr",
...: variable_id="o2",
...: grid_label="gn",
...: )

In [6]: col_subset
Out[6]: <pangeo-cmip6 catalog with 18 dataset(s) from 138 asset(s)>
In [6]: cat_subset
Out[6]: <GOOGLE-CMIP6 catalog with 4 dataset(s) from 261 asset(s)>
```

- Access: when the user is satisfied with the results of their query, they can ask `intake-esm`
to load data assets (netCDF/HDF files and/or Zarr stores) into xarray datasets:

```python

In [7]: dset_dict = col_subset.to_dataset_dict(zarr_kwargs={"consolidated": True})
In [7]: dset_dict = cat_subset.to_dataset_dict(zarr_kwargs={"consolidated": True})

--> The keys in the returned dictionary of datasets are constructed as follows:
'activity_id.institution_id.source_id.experiment_id.table_id.grid_label'
|███████████████████████████████████████████████████████████████| 100.00% [18/18 00:10<00:00]
|███████████████████████████████████████████████████████████████| 100.00% [2/2 00:18<00:00]
```
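
The key pattern printed above can be illustrated with a small sketch. It assumes a dataset is identified by the six attribute columns named in that message and that a key is simply those values joined with a dot. The helper `make_key` and the sample row are hypothetical and are not part of the `intake-esm` API.

```python
# Hypothetical helper, NOT part of intake-esm: it only illustrates the
# dotted key pattern shown in the progress message above.
KEY_COLUMNS = [
    "activity_id",
    "institution_id",
    "source_id",
    "experiment_id",
    "table_id",
    "grid_label",
]


def make_key(row, columns=KEY_COLUMNS, sep="."):
    """Join the values of `columns` in `row` into a single dataset key."""
    return sep.join(str(row[column]) for column in columns)


# Illustrative row; columns that are not part of the key are ignored.
row = {
    "activity_id": "CMIP",
    "institution_id": "IPSL",
    "source_id": "IPSL-CM6A-LR",
    "experiment_id": "historical",
    "table_id": "Amon",
    "grid_label": "gr",
    "variable_id": "va",
}

key = make_key(row)
print(key)
```

Because `variable_id` is not one of the key columns in this sketch, assets that differ only in their variable would share the same key.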

See [documentation](https://intake-esm.readthedocs.io/en/latest/) for more information.
@@ -1,12 +1,12 @@
# ESM Collection Specification
# ESM Catalog Specification

```{note}
This document mirrors the [ESM Collection Specification](https://github.com/NCAR/esm-collection-spec/blob/master/collection-spec/collection-spec.md) and is updated as the specification evolves.
```

- [ESM Collection Specification](#esm-collection-specification)
- [ESM Catalog Specification](#esm-catalog-specification)
- [Overview](#overview)
- [Collection Specification](#collection-specification)
- [Catalog Specification](#catalog-specification)
- [Catalog](#catalog)
- [Assets (Data Files)](#assets-data-files)
- [Catalog fields](#catalog-fields)
@@ -17,22 +17,22 @@ This document mirrors the [ESM Collection Specification](https://github.com/NCA

## Overview

This document explains the structure and content of an ESM Collection.
A collection provides metadata about the catalog, telling us what we expect to find inside and how to open it.
The collection is described in a single JSON file, inspired by the STAC spec.
This document explains the structure and content of an ESM Catalog.
A catalog provides metadata about the catalog, telling us what we expect to find inside and how to open it.
The catalog is described in a single JSON file, inspired by the STAC spec.

The ESM Collection specification consists of three parts:
The ESM Catalog specification consists of three parts:

### Collection Specification
### Catalog Specification

The _collection_ specification provides metadata about the catalog, telling us what we expect to find inside and how to open it.
The _catalog_ specification provides metadata about the catalog, telling us what we expect to find inside and how to open it.
The descriptor is a single json file, inspired by the [STAC spec](https://github.com/radiantearth/stac-spec).

```json
{
"esmcat_version": "0.1.0",
"id": "sample",
"description": "This is a very basic sample ESM collection.",
"description": "This is a very basic sample ESM catalog.",
"catalog_file": "sample_catalog.csv",
"attributes": [
{
@@ -70,17 +70,17 @@ They should be either [URIs](https://en.wikipedia.org/wiki/Uniform_Resource_Iden

## Catalog fields

| Element | Type | Description |
| ------------------- | --------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| esmcat_version | string | **REQUIRED.** The ESM Catalog version the collection implements. |
| id | string | **REQUIRED.** Identifier for the collection. |
| title | string | A short descriptive one-line title for the collection. |
| description | string | **REQUIRED.** Detailed multi-line description to fully explain the collection. [CommonMark 0.28](http://commonmark.org/) syntax MAY be used for rich text representation. |
| catalog_file | string | **REQUIRED.** Path to the CSV file with the catalog contents. |
| catalog_dict | array | If specified, it is mutually exclusive with `catalog_file`. An array of dictionaries that represents the data that would otherwise be in the csv. |
| attributes | [[Attribute Object](#attribute-object)] | **REQUIRED.** A list of attribute columns in the data set. |
| assets | [Assets Object](#assets-object) | **REQUIRED.** Description of how the assets (data files) are referenced in the CSV catalog file. |
| aggregation_control | [Aggregation Control Object](#aggregation-control-object) | **OPTIONAL.** Description of how to support aggregation of multiple assets into a single xarray data set. |
| Element | Type | Description |
| ------------------- | --------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| esmcat_version | string | **REQUIRED.** The ESM Catalog version the catalog implements. |
| id | string | **REQUIRED.** Identifier for the catalog. |
| title | string | A short descriptive one-line title for the catalog. |
| description | string | **REQUIRED.** Detailed multi-line description to fully explain the catalog. [CommonMark 0.28](http://commonmark.org/) syntax MAY be used for rich text representation. |
| catalog_file | string | **REQUIRED.** Path to the CSV file with the catalog contents. |
| catalog_dict | array | If specified, it is mutually exclusive with `catalog_file`. An array of dictionaries that represents the data that would otherwise be in the csv. |
| attributes | [[Attribute Object](#attribute-object)] | **REQUIRED.** A list of attribute columns in the data set. |
| assets | [Assets Object](#assets-object) | **REQUIRED.** Description of how the assets (data files) are referenced in the CSV catalog file. |
| aggregation_control | [Aggregation Control Object](#aggregation-control-object) | **OPTIONAL.** Description of how to support aggregation of multiple assets into a single xarray data set. |
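
The table above can be turned into a quick sanity check. The following is a minimal sketch using only the standard library; `check_descriptor` is a hypothetical helper, not an `intake-esm` function. It reads the table as "exactly one of `catalog_file` or `catalog_dict` must be present", which is an interpretation of the mutual-exclusivity note rather than something the specification states in these words. The empty `attributes` and `assets` values are placeholders.

```python
import json

# Fields marked **REQUIRED** in the table above. `catalog_file` is handled
# separately because the table says it is mutually exclusive with
# `catalog_dict`.
REQUIRED_FIELDS = ("esmcat_version", "id", "description", "attributes", "assets")


def check_descriptor(descriptor):
    """Return a list of problems found; an empty list means none were found."""
    problems = [
        f"missing required field: {name}"
        for name in REQUIRED_FIELDS
        if name not in descriptor
    ]
    has_file = "catalog_file" in descriptor
    has_dict = "catalog_dict" in descriptor
    if has_file and has_dict:
        problems.append("catalog_file and catalog_dict are mutually exclusive")
    elif not (has_file or has_dict):
        problems.append("one of catalog_file or catalog_dict is needed")
    return problems


# Placeholder descriptor: `attributes` and `assets` are left empty on purpose.
sample = json.loads(
    """
    {
      "esmcat_version": "0.1.0",
      "id": "sample",
      "description": "This is a very basic sample ESM catalog.",
      "catalog_file": "sample_catalog.csv",
      "attributes": [],
      "assets": {}
    }
    """
)

problems = check_descriptor(sample)
print(problems)
```

The check only looks at the presence of top-level keys; it does not validate the nested attribute, assets, or aggregation control objects.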

### Attribute Object

2 changes: 1 addition & 1 deletion docs/source/explanation/index.md
@@ -4,5 +4,5 @@
---
maxdepth: 1
---
esm-collection-spec.md
esm-catalog-spec.md
```
@@ -25,7 +25,7 @@ dataframe column or a list of dataframe columns across which all elements must
satisfy the query criteria. The `require_all_on` argument is best explained with
the following example.

Let’s define a query for our collection that requests multiple variable_ids and
Let’s define a query for our catalog that requests multiple variable_ids and
multiple experiment_ids from the Omon table_id, all from 3 different source_ids:

```{code-cell} ipython3
@@ -38,7 +38,7 @@ query = dict(
)
```

Now, let’s use this query to search for all assets in the collection that
Now, let’s use this query to search for all assets in the catalog that
satisfy any combination of these requests (i.e., with `require_all_on=None`,
which is the default):
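
To make the contrast with `require_all_on` concrete, here is a minimal sketch in plain Python. It is not intake-esm's implementation: the `search` function, the toy rows, and the model names are all hypothetical, and query values are assumed to be lists. The idea it illustrates is that, after the ordinary filtering step, rows are grouped by the `require_all_on` columns and a group is kept only if it contains every combination of the remaining requested values.

```python
from itertools import product


def search(rows, query, require_all_on=None):
    """Toy search over a list of dict rows (a stand-in for the catalog dataframe)."""
    # Default behaviour: keep rows whose value is among the requested ones
    # for every queried column.
    matches = [
        row
        for row in rows
        if all(row[col] in wanted for col, wanted in query.items())
    ]
    if require_all_on is None:
        return matches

    # Every combination of the other requested values must be present
    # within each group.
    other_cols = [col for col in query if col not in require_all_on]
    needed = set(product(*(query[col] for col in other_cols)))

    # Group the matching rows by the `require_all_on` columns.
    groups = {}
    for row in matches:
        group_key = tuple(row[col] for col in require_all_on)
        groups.setdefault(group_key, []).append(row)

    kept = []
    for group_rows in groups.values():
        present = {tuple(row[col] for col in other_cols) for row in group_rows}
        if needed <= present:
            kept.extend(group_rows)
    return kept


# Hypothetical catalog: model-A has every combination, model-B lacks ssp585.
rows = [
    {"source_id": "model-A", "variable_id": "o2", "experiment_id": "historical"},
    {"source_id": "model-A", "variable_id": "o2", "experiment_id": "ssp585"},
    {"source_id": "model-A", "variable_id": "no3", "experiment_id": "historical"},
    {"source_id": "model-A", "variable_id": "no3", "experiment_id": "ssp585"},
    {"source_id": "model-B", "variable_id": "o2", "experiment_id": "historical"},
    {"source_id": "model-B", "variable_id": "no3", "experiment_id": "historical"},
]
query = dict(
    variable_id=["o2", "no3"],
    experiment_id=["historical", "ssp585"],
    source_id=["model-A", "model-B"],
)

any_match = search(rows, query)
strict = search(rows, query, require_all_on=["source_id"])
print(len(any_match), len(strict))
```

In this toy data the default search returns rows from both models, while the search with `require_all_on=["source_id"]` drops `model-B` because it has no `ssp585` rows.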

@@ -1,7 +1,7 @@
{
"esmcat_version": "0.1.0",
"id": "sample-multi-variable-cesm1-lens",
"description": "This is a sample ESM collection emulating multi variable/history files for CESM1-LENS",
"description": "This is a sample ESM catalog emulating multi variable/history files for CESM1-LENS",
"catalog_file": "multi-variable-catalog.csv",
"attributes": [
{
@@ -35,7 +35,7 @@ import intake
import ast
cat = intake.open_esm_datastore(
"multi-variable-collection.json",
"multi-variable-catalog.json",
read_csv_kwargs={"converters": {"variable": ast.literal_eval}},
)
cat
38 changes: 30 additions & 8 deletions docs/source/tutorials/index.md
@@ -27,22 +27,43 @@ import intake

At import time, the intake-esm plugin is available in intake’s registry as
`esm_datastore` and can be accessed with the `intake.open_esm_datastore()` function.
Use the `intake_esm.tutorial.get_url()` method to access smaller subsetted catalogs for tutorial purposes.

```{code-cell} ipython3
import intake_esm
url = intake_esm.tutorial.get_url('google_cmip6')
print(url)
```

```{code-cell} ipython3
url = "https://gist.githubusercontent.com/andersy005/7f416e57acd8319b20fc2b88d129d2b8/raw/987b4b336d1a8a4f9abec95c23eed3bd7c63c80e/pangeo-gcp-subset.json"
cat = intake.open_esm_datastore(url)
cat
```

The summary above tells us that this catalog contains over 268,000 data assets.
The summary above tells us that this catalog contains 261 data assets.
We can get more information on the individual data assets contained in the
catalog by looking at the underlying dataframe created when we load the catalog:

```{code-cell} ipython3
cat.df.head()
```

The first data asset listed in the catalog contains:

- the Northward Wind (variable_id='va'), as a function of latitude, longitude, time,

- the latest version of the IPSL climate model (source_id='IPSL-CM6A-LR'),

- hindcasts initialized from observations with historical forcing (experiment_id='historical'),

- developed by the Institut Pierre Simon Laplace (institution_id='IPSL'),

- run as part of the Coupled Model Intercomparison Project (activity_id='CMIP')

And is located in Google Cloud Storage at 'gs://cmip6/CMIP6/CMIP/IPSL/IPSL-CM6A-LR/historical/r2i1p1f1/Amon/va/gr/v20180803/'.

## Finding unique entries

To get unique values for given columns in the catalog, intake-esm provides a
@@ -88,12 +109,10 @@ In the example below, we are going to search for the following:

- variable_id: `o2` which stands for
`mole_concentration_of_dissolved_molecular_oxygen_in_sea_water`
- experiment_id: `['historical', 'ssp585']`:
- `historical`: all forcing of the recent past.
- `ssp585`: emission-driven
[RCP8.5](https://en.wikipedia.org/wiki/Representative_Concentration_Pathway)
based on SSP5.
- table_id: `Oyr` which stands for annual mean variables on the ocean grid.
- experiments: ['historical', 'ssp585']:
- historical: all forcing of the recent past.
- ssp585: emission-driven RCP8.5 based on SSP5.
- table_id: `Oyr` which stands for annual mean variables on the ocean grid.
- grid_label: `gn` which stands for data reported on a model's native grid.

```{note}
@@ -133,6 +152,9 @@ returns a dictionary of aggregate xarray datasets as the name hints.
dset_dict = cat_subset.to_dataset_dict(
xarray_open_kwargs={"consolidated": True, "decode_times": True, "use_cftime": True}
)
```

```{code-cell} ipython3
[key for key in dset_dict.keys()]
```

2 changes: 1 addition & 1 deletion intake_esm/cat.py
@@ -188,7 +188,7 @@ def save(
json_kwargs.update(json_dump_kwargs or {})
json.dump(data, outfile, **json_kwargs)

print(f'Successfully wrote ESM collection json file to: {json_file_name}')
print(f'Successfully wrote ESM catalog json file to: {json_file_name}')

@classmethod
def load(