naming conventions (closes ecmwf/anemoi-datasets#78)
floriankrb committed Oct 15, 2024
1 parent 00e499e commit 41908c3
Showing 2 changed files with 93 additions and 24 deletions.
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -17,6 +17,7 @@ Keep it human-readable, your future self will thank you!
- Support for "anemoi-datasets publish"
- Added set from file (python only)
- Force full paths when registering
- Added naming conventions


### Changed
116 changes: 92 additions & 24 deletions docs/naming-conventions.rst
@@ -4,29 +4,97 @@
Dataset naming conventions
############################

A dataset name is a string used to identify a dataset. It is designed to
be human readable and is *not* designed to be parsed and split into
parts.

To ensure consistency, a dataset name should follow these rules:

- All lower case.
- Only letters, numbers and dashes ``-`` are allowed.
- No underscores ``_``, no dots ``.``, no upper-case letters and
  no other special characters (``@``, ``#``, ``*``, etc.).
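
The character rules above can be checked mechanically. The following is a minimal sketch only: ``follows_character_rules`` is a hypothetical helper and is not part of anemoi-datasets. It checks the allowed characters and nothing else; in line with the statement that names are not designed to be parsed, it does not try to split the name into parts.

```python
import re

# Hypothetical helper, not part of anemoi-datasets.
# Allowed characters per the rules above: lower-case letters, digits, dashes.
_ALLOWED_CHARACTERS = re.compile(r"[a-z0-9-]+")


def follows_character_rules(name: str) -> bool:
    """Return True if `name` uses only lower-case letters, digits and '-'."""
    return _ALLOWED_CHARACTERS.fullmatch(name) is not None


# A name from the examples in this document passes the check.
print(follows_character_rules("aifs-od-an-oper-0001-mars-o96-1979-2022-1h-v5"))
# Underscores, dots and upper-case letters are rejected.
print(follows_character_rules("AIFS_od.an"))
```
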

Additionally, a dataset name is built from different parts joined with
``-`` as follows (each part can contain additional ``-``):

.. code::

   purpose-content-source-resolution-start-year-end-year-frequency-version[-extra-str]

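
As an illustration of the template above, the sketch below joins the parts with ``-``. ``build_dataset_name`` and its parameter names are assumptions made for this example and are not an anemoi-datasets API. The ``v`` prefix is added in front of the version number, as the convention requires.

```python
# Hypothetical helper, not part of anemoi-datasets.
def build_dataset_name(purpose, content, source, resolution,
                       start_year, end_year, frequency, version,
                       extra_str=None):
    """Join the parts of a dataset name with '-', in the documented order."""
    parts = [
        purpose,
        content,            # may itself contain '-', e.g. "od-an-oper-0001"
        source,
        resolution,
        str(start_year),
        str(end_year),
        frequency,
        f"v{version}",      # the "v" is not part of the version number
    ]
    if extra_str:
        parts.append(extra_str)  # optional, may also contain '-'
    return "-".join(parts)


name = build_dataset_name("aifs", "od-an-oper-0001", "mars", "o96",
                          1979, 2022, "1h", 5)
print(name)
```

Because each part can itself contain dashes, building a name this way is straightforward, while recovering the parts from a finished name is ambiguous.
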
.. note::

This is a draft proposal for naming conventions for datasets in the
Anemoi registry. It will need to be updated and adapted as more
datasets are added. The part <purpose> is especially difficult to
define for some dataset and may be revisited.

A dataset name is built as follows:

   <purpose>-<content>-<source>-<resolution>-<start_year>-<end-year>-<frequency>-v<version>[-<optional-string>].zarr

The <content> is built from several parts, separated with '-'. All lower
case. Uses "-", letters and numbers. No underscores "_", no dots ".".

Example: aifs-od-an-oper-0001-mars-o96-1979-2022-1h-v5

- <purpose> = aifs because the data is used to train the AIFS model.
- <content>: the content of the dataset CAN have four parts, such as
  <class>-<type>-<stream>-<expver>:

  - <class> = od, Operational archive ("class" is a MARS keyword)
  - <type> = an, Analysis ("type" is a MARS keyword)
  - <stream> = oper, Atmospheric model ("stream" is a MARS keyword)
  - <expver> = 0001 (operational model)

- <source> = mars (data is from MARS), could be "opendap" or other.
- <resolution> = o96 (other: n320, 0p2 for 0.2 degree)
- <start_year>-<end-year> = 1979-2022 (or 2020-2020 if all the data is
  included in the year 2020)
- <frequency> = 1h (other: 6h)
- <version> = the version of the content of the dataset, e.g. which
  variables, levels, etc. This is not the version of the format.
- <optional-string> = experimental datasets can have additional text in
  the name.
These are the current naming conventions for datasets in the Anemoi
registry. They will need to be updated and adapted as more datasets are
added. The part **purpose** is especially difficult to define for
some datasets and may be revisited.

The tables below provide more details and some examples.

.. list-table:: Dataset naming conventions
   :widths: 20 80
   :header-rows: 1

   -  -  Component
      -  Description

   -  -  **purpose**

      -  Can be ``aifs`` because the data is used to train the AIFS
         model. Is also sometimes ``metno`` for data from the Norwegian
         Meteorological Institute. This definition may need to be
         revisited.

   -  -  **content**

      -  The content of the dataset CAN have four parts, such as:
         *class-type-stream-expver*

         -  **class**: od Operational archive (*class* is a MARS
            keyword)
         -  **type**: an Analysis (*type* is a MARS keyword)
         -  **stream**: oper Atmospheric model (*stream* is a MARS
            keyword)
         -  **expver**: 0001 (operational model)

   -  -  **source**
      -  mars (when data is from MARS), could be *opendap* or other.

   -  -  **resolution**
      -  o96 (could be: n320, 0p2 for 0.2 degree)

   -  -  **start-year**
      -  1979 if the first validity time is in 1979.

   -  -  **end-year**

      -  2020 if the last validity time is in 2020. Note that if the
         dataset is from 18.04.2020 to 19.07.2020, the start-year and
         end-year are both 2020. For instance:
         aifs-od-an-oper-0001-mars-o96-2020-2020-6h-v5

   -  -  **frequency**
      -  1h (could be: 6h)

   -  -  **version**

      -  This is the version of the content of the dataset, e.g. which
         variables, levels, etc. This is not the version of the format.
         There must be a "v" before the version number. The "v" is not
         part of the version number. For instance ...-v5 is the fifth
         version of the content of the dataset.

   -  -  **extra-str**

      -  Experimental datasets can have additional text in the name.
         This extra string can contain additional ``-``. It provides
         additional information about the content of the dataset.

.. list-table:: Examples
   :widths: 100

   -  -  aifs-od-an-oper-0001-mars-o96-1979-2022-1h-v5
   -  -  aifs-ea-an-oper-0001-mars-o96-1979-2022-6h-v6
   -  -  aifs-ea-an-enda-0001-mars-o96-1979-2022-6h-v6-recentered-on-oper
   -  -  aifs-ea-an-oper-0001-mars-n320-1979-2022-6h-v4
   -  -  inca-an-oper-0001-gridefix-1km-2023-2024-10m-v1
