Revise scope of the ML Model Extension to focus on model card and inference #15

README.md: 56 changes (18 additions, 38 deletions)
@@ -5,10 +5,10 @@
- **Field Name Prefix:** ml-model
- **Scope:** Item, Collection
- **Extension [Maturity Classification](https://github.com/radiantearth/stac-spec/tree/master/extensions/README.md#extension-maturity):** Proposal
-- **Owner**: @duckontheweb
+- **Owner**: @rbavery

This document explains the ML Model Extension to the [SpatioTemporal Asset
Catalog](https://github.com/radiantearth/stac-spec) (STAC) specification.

- Examples:
- [Item example](examples/dummy/item.json): Shows the basic usage of the extension in a STAC Item
@@ -18,9 +18,9 @@ Catalog](https://github.com/radiantearth/stac-spec) (STAC) specification.
## Scope & Vision

The goal of the STAC ML Model Extension is to provide a way of cataloging machine learning (ML) models that operate on Earth observation (EO) data
-described as a STAC catalog. The metadata related to machine learning models and their related artifacts (e.g. training data, performance metrics,
+described as a STAC catalog. The metadata related to machine learning models and their related artifacts (e.g. training data, performance metrics, inference requirements,
etc.) can be extremely broad. This extension limits its scope to ML model metadata that aids in the discoverability and usability/reusability of
-these models for the following types of use-cases:
+these models for inference (model prediction). The spec also emphasizes providing model card information for each model, so that users understand the model's performance and limitations. Therefore, the main use case for this extension is as follows:

- **Adoption of Models in Analytic Pipelines**

@@ -29,27 +29,7 @@
classification). Consider the example of a global non-profit organization that wants to use ML to track deforestation. A data engineer from this
organization might be interested in discovering segmentation models that accurately produce land cover classes over parts of South America using
Sentinel 2 imagery. The STAC ML Model Extension aims to support this use-case by describing metadata related to the recommended area over which
-the model may be used, a description of the model architecture and type of input data it requires, and links to containerized model images or
-model files that can be used to run the model to generate inferences.

-- **Re-training of Existing Models in New Contexts**

-The process of training ML models on Earth observation data can be extremely time-consuming and costly due to the volume of data required.
-Providing tools that ease the discovery of existing models and training data will make ML models more accessible by reducing this training
-burden. Suppose the non-profit from the previous example found a model that generated the kind of predictions they were interested in, but was
-not applicable to their region of interest. Rather than creating a new model from scratch, the organization might be interested in using transfer
-learning to re-train the existing model on data from their area of interest. In this case, they would need enough information about the training
-environment and model architecture to reproduce the model weights and continue training the model using new data. The STAC ML Model Extensions
-aims to support this use-case by providing links to serialized versions of the model (e.g. a PyTorch checkpoint file) as well as enough detail
-about the training environment that a data scientist could reasonably implement transfer learning using new data.

-- **Reproducibility of ML Experiments**

-The ability to reproduce published ML experiments is crucial for verifying and building upon previous ML research. Increasingly, individuals and
-institutions are making an effort to publish code and examples along with academic publications to enable this kind of reproducibility. However,
-the quality and usability of this code and related documentation can vary widely and there are currently no standards that ensure that a new
-researcher could reproduce a given set of published results from the documentation. The STAC ML Model Extension aims to address this issue by
-providing a detailed description of the training data and environment used in a ML model experiment.
+the model may be used, a description of the model architecture and type of input data it requires, and links to model files that can be used to run the model to generate inferences.

## Item Properties

@@ -66,35 +46,35 @@

#### ml-model:learning_approach

Describes the learning approach used to train the model. It is STRONGLY RECOMMENDED that you use one of the
following values, but other values are allowed.

- `"supervised"`
- `"unsupervised"`
- `"semi-supervised"`
- `"self-supervised"`
- `"reinforcement-learning"`

#### ml-model:prediction_type

Describes the type of predictions made by the model. It is STRONGLY RECOMMENDED that you use one of the
following values, but other values are allowed. Note that not all Prediction Type values are valid
for a given [Learning Approach](#ml-modellearning_approach).

- `"object-detection"`
- `"classification"`
- `"segmentation"`
- `"regression"`
- `"multi-modal"`

-### ml-model:training-os
+### ml-model:training-architecture

-It is STRONGLY RECOMMENDED that one of the following operating system identifiers (taken from the Python [`sys.platform`
-values](https://docs.python.org/3/library/sys.html#sys.platform) be used whenever possible:
+It is STRONGLY RECOMMENDED that one of the architecture identifiers used by [Docker and Golang](https://go.dev/doc/install/source#environment) be used whenever possible. For example:

+- `linux-amd64` - Linux x86-64
+- `windows-amd64` - Windows x86-64
+- `darwin-arm64` - macOS Apple Silicon
+- `darwin-amd64` - macOS x86-64
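
A sketch of how this field might look alongside the other properties, assuming a model trained on a Linux x86-64 host (values are illustrative):

```json
{
  "properties": {
    "ml-model:learning_approach": "supervised",
    "ml-model:prediction_type": "segmentation",
    "ml-model:training-architecture": "linux-amd64"
  }
}
```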

## Asset Objects

@@ -120,7 +100,7 @@ While the Compose file defines nearly all of the parameters required to run the
directory containing input data should be mounted to the container and to which host directory the output predictions should be written. The Compose
file MUST define volume mounts for input and output data using the Compose
[Interpolation syntax](https://github.com/compose-spec/compose-spec/blob/master/spec.md#interpolation). The input data volume MUST be defined by an
`INPUT_DATA` variable and the output data volume MUST be defined by an `OUTPUT_DATA` variable.
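
A minimal sketch of a Compose file satisfying these requirements might look like the following; the service name, image reference, and container paths are placeholders, not values mandated by this extension:

```yaml
services:
  model:
    image: registry.example.com/ml-model:latest  # hypothetical container image
    volumes:
      # host directory containing input data, supplied via the INPUT_DATA variable
      - "${INPUT_DATA}:/var/data/input"
      # host directory where predictions are written, supplied via the OUTPUT_DATA variable
      - "${OUTPUT_DATA}:/var/data/output"
```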

For example, the following Compose file snippet would mount the host input directory to `/var/data/input` in the container and would mount the host
output data directory to `/var/data/output` in the container. In this contrived example, the script to run the model takes 2 arguments: the
@@ -208,10 +188,10 @@ extension, please open a PR to include it in the `examples` directory. Here are

### Running tests

The same checks that run on PRs are part of the repository and can be run locally to verify that changes are valid.
To run tests locally, you'll need `npm`, which is a standard part of any [node.js installation](https://nodejs.org/en/download/).

First you'll need to install everything with npm once. Just navigate to the root of this repository and on
your command line run:
```bash
npm install
```