update README to reflect v1 #328

Merged
merged 3 commits into from
Jul 8, 2019
165 changes: 102 additions & 63 deletions README.md
@@ -1,79 +1,129 @@
# Expression Matrix Service

[![Production Health Check](https://status.data.humancellatlas.org/service/matrix-health-check-prod.svg)](https://matrix.data.humancellatlas.org/)
[![Master Deployment Status](https://status.data.humancellatlas.org/build/HumanCellAtlas/matrix-service/prod.svg)](https://allspark.dev.data.humancellatlas.org/HumanCellAtlas/matrix-service/pipelines)
Contributor

Did the allspark build badges stop working? I remember looking into fixing this a while ago but couldn't figure it out.

Collaborator Author

Hm I think the badge we were using was hosted on @mweiden's service status dashboard which made them public. Not sure what happened, but those badges don't work anymore. GitLab badges still work, but you have to authenticate to see them which doesn't work in a README:

https://allspark-prod.data.humancellatlas.org/HumanCellAtlas/matrix-service/badges/prod/pipeline.svg

[![Test Coverage](https://codecov.io/gh/HumanCellAtlas/matrix-service/branch/master/graph/badge.svg)](https://codecov.io/gh/HumanCellAtlas/matrix-service)

## Overview

The Matrix Service (MS) provides an interface to aggregate, query and access gene expression matrices stored in the
[Human Cell Atlas](https://staging.data.humancellatlas.org/) [Data Coordination
Platform](https://www.humancellatlas.org/data-sharing) (DCP). Expression data are read from the
[DCP Data Store](https://github.com/HumanCellAtlas/data-store), processed in [AWS Lambda](https://aws.amazon.com/lambda/)
and [AWS Batch](https://aws.amazon.com/batch/) and the results are stored in [Amazon S3](https://aws.amazon.com/s3/)
buckets. The service exposes a [REST API](https://matrix.staging.data.humancellatlas.org) for querying and retrieving
expression matrix results with support for the following [file formats](#file-formats).

### Components

The logical flow of an expression matrix request is illustrated in the diagram below
[[LucidChart](https://www.lucidchart.com/invitations/accept/cdb424df-a72f-4391-9549-e83364c7234c)].
A description of each component follows.

![alt text](matrix_architecture.svg)

#### Matrix API
The Matrix Service consumes data from the [HCA](https://prod.data.humancellatlas.org/)
[Data Store](https://github.com/HumanCellAtlas/data-store) to dynamically generate cell by gene expression matrices.
Users can select cells to include in their matrix by specifying metadata and expression value filters via the API.
Matrices also include per-cell metadata; the fields to include can be specified in the POST request. For a quick
example to get started, try this
[Jupyter Notebook vignette](https://github.com/HumanCellAtlas/matrix-service/blob/master/docs/HCA%20Matrix%20Service%20to%20Scanpy.ipynb).

For information on the technical architecture of the service, please see
[Matrix Service Technical Architecture](https://allspark.dev.data.humancellatlas.org/HumanCellAtlas/matrix-service/wikis/Technical-Architecture).

## API: https://matrix.data.humancellatlas.org

The complete API documentation is available [here](https://matrix.data.humancellatlas.org).

### Requesting a matrix
Expression matrices are generated asynchronously, and results are retrieved by polling.
To request the generation of a matrix, submit a POST request to `/v1/matrix` and receive a job ID. Use this ID to poll
`/v1/matrix/<ID>` to retrieve the status and results of your request.

When requesting a matrix, users are required to select cells by specifying [metadata/expression data filters](#Filter).
Optionally, they may also specify which [metadata fields](#Fields) to include in the matrix, the
[output format](#Format) and the [feature type](#Feature) to describe. These 4 fields are supplied in the body of the
POST request:
```json
{
  "filter": {},
  "fields": [
    "string"
  ],
  "format": "string",
  "feature": "string"
}
```
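
The request-and-poll flow described above can be sketched in Python as follows. This is a minimal illustration, not an official client: the response keys `request_id` and `status`, the pending status value `In Progress`, and the helper names `http_json` and `request_matrix` are assumptions that this README does not confirm, so check the API documentation for the actual names. The HTTP transport is passed in as a function so that the polling logic can be exercised without network access.

```python
import json
import time
import urllib.request

API_ROOT = "https://matrix.data.humancellatlas.org/v1"


def http_json(method, url, body=None):
    """Send a JSON request with the standard library and decode the JSON reply."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    req = urllib.request.Request(
        url,
        data=data,
        method=method,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))


def request_matrix(body, send=http_json, poll_seconds=5.0, max_polls=120,
                   id_key="request_id", status_key="status",
                   pending_status="In Progress"):
    """POST the body to /v1/matrix, then poll /v1/matrix/<ID>.

    id_key, status_key and pending_status are ASSUMED response conventions.
    Returns the last polled response once the status is no longer pending.
    """
    job = send("POST", f"{API_ROOT}/matrix", body)
    job_id = job[id_key]
    for _ in range(max_polls):
        result = send("GET", f"{API_ROOT}/matrix/{job_id}")
        if result[status_key] != pending_status:
            return result
        time.sleep(poll_seconds)
    raise TimeoutError(
        f"matrix request {job_id} still pending after {max_polls} polls"
    )
```

Calling `request_matrix({"filter": {...}})` would submit the request and block until the job leaves the pending state or the poll limit is reached.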
#### Filter

The REST API is a [Chalice](https://github.com/aws/chalice) app that adopts [Swagger/OpenAPI](https://swagger.io/)'s
approach to specification driven development and leverages [Connexion](https://github.com/zalando/connexion) for
input parameter validation. The Chalice app is deployed with [Amazon API Gateway](https://aws.amazon.com/api-gateway/)
and [AWS Lambda](https://aws.amazon.com/lambda/). The full API documentation can be found
[here](https://matrix.staging.data.humancellatlas.org).
To select cells, the API supports a simple yet expressive language for specifying complex metadata and expression data
filters capable of representing nested AND/OR structures as a JSON object. There are two types of filter objects to
achieve this:

#### Lambdas
*Comparison filter*
```
{
  "op": one of [ =, !=, >, <, >=, <=, in ],
  "field": a metadata filter,
  "value": string or int or list
}
```

The preparation of an expression matrix occurs in the following five stages: the driver, mapper, worker, reducer and
the converter. The first four stages are deployed in AWS Lambda and are collectively responsible for preparing a
[zarr file structure](https://zarr.readthedocs.io/en/stable/) representing the resultant expression matrix. The
following table provides a description of each lambda:
*Logical filter*
```
{
  "op": one of [ and, or, not ],
  "value": array of 2 filter objects if op==and|or, filter object if op==not
}
```

| **Lambda** | **Description** |
|---|---|
| Driver | Initializes the matrix request in DynamoDB tables responsible for tracking the request's progress and invokes N mapper lambdas distributing the load of input bundles. |
| Mapper | For each input bundle, reads its metadata to retrieve chunking boundaries (i.e. a subset of matrix rows) used in parallel processing; invokes M worker lambdas distributing determined chunks. |
| Worker | For each chunk, applies the user-supplied query and writes the matched rows to the resultant expression matrix in S3. The last worker to complete invokes the reducer lambda. |
| Reducer | Finalizes the resultant expression matrix's zarr structure in S3. If the user-requested file format is not ``zarr``, invokes an AWS Batch job to convert the zarr to the desired file format. Otherwise, completes the request. |
These filter types can be recursively nested via the `value` field of a logical filter.
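
Because `and`/`or` take exactly two filter objects, combining three or more conditions requires nesting. The following Python sketch shows one way to build such objects as plain dicts; the helper names `comparison`, `negate` and `combine` are hypothetical and not part of the service.

```python
from functools import reduce

COMPARISON_OPS = {"=", "!=", ">", "<", ">=", "<=", "in"}


def comparison(field, op, value):
    """Build a comparison filter with the keys op, field and value."""
    if op not in COMPARISON_OPS:
        raise ValueError(f"unsupported comparison op: {op!r}")
    return {"op": op, "field": field, "value": value}


def negate(flt):
    """Build a logical 'not' filter whose value is a single filter object."""
    return {"op": "not", "value": flt}


def combine(op, *filters):
    """Fold any number of filters into nested two-element 'and'/'or' filters."""
    if op not in ("and", "or"):
        raise ValueError(f"unsupported logical op: {op!r}")
    if not filters:
        raise ValueError("at least one filter is required")
    # reduce() nests left to right: combine(op, a, b, c) -> op(op(a, b), c)
    return reduce(lambda left, right: {"op": op, "value": [left, right]}, filters)
```

For example, `combine("and", a, b, c)` yields an `and` filter whose first element is itself an `and` of `a` and `b`, which keeps every logical filter at two operands.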

#### File Converter
*Filter object examples*

A file conversion job deployed on AWS Batch is used to support multiple output [file formats](#file-formats). This job
converts ``.zarr`` expression matrices to the desired file format and writes the result to S3.
Select all full length cells:
```
...
"filter": {
  "op": "=",
  "value": "full length",
  "field": "library_preparation_protocol.end_bias"
}
...
```

#### DynamoDB
Select all cells from the "Single cell transcriptome analysis of human pancreas" project with at least 3000 genes
detected:
```
...
"filter": {
  "op": "and",
  "value": [
    {
      "op": "=",
      "value": "Single cell transcriptome analysis of human pancreas",
      "field": "project.project_core.project_short_name"
    },
    {
      "op": ">=",
      "value": 3000,
      "field": "genes_detected"
    }
  ]
}
...
```
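
To make the intended semantics of nested filters concrete, the sketch below evaluates a filter object against one cell's metadata held in a plain Python dict. It is an illustration only: the service evaluates filters server-side, this README does not describe how, and both the `matches` function and the flat metadata dict are hypothetical.

```python
import operator

# Map each comparison op from the filter language to a Python predicate.
_COMPARE = {
    "=": operator.eq,
    "!=": operator.ne,
    ">": operator.gt,
    "<": operator.lt,
    ">=": operator.ge,
    "<=": operator.le,
    "in": lambda actual, allowed: actual in allowed,
}


def matches(flt, cell):
    """Return True if the cell metadata dict satisfies the filter object."""
    op = flt["op"]
    if op == "not":
        return not matches(flt["value"], cell)
    if op == "and":
        return all(matches(sub, cell) for sub in flt["value"])
    if op == "or":
        return any(matches(sub, cell) for sub in flt["value"])
    # Otherwise it is a comparison filter: compare the cell's field value.
    return _COMPARE[op](cell[flt["field"]], flt["value"])
```

Under this reading, a cell from the pancreas project with 3500 genes detected satisfies the `and` filter shown above, while one with 100 genes detected does not.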

DynamoDB tables are used to track the state and progress of a request. The following is a description of the tables:
The available filter names are listed at `/v1/filters`. To retrieve more information about a specific filter,
GET `/v1/filters/<filter>`.

| **Table name** | **Description** |
|---|---|
| Cache table | Caches requests by a hash of their input parameters. |
| State table | Tracks the progress of a request. |
| Output table | Stores output values of the request (e.g. file format, errors). |
| Lock table | Manages locks across distributed nodes. |
#### Fields

Users can specify a list of metadata fields to be exported with an expression matrix. The available metadata
fields are listed at `/v1/fields`. More information about a specific field is available at
`/v1/fields/<field>`.

### File formats
#### Format

The DCP MS enables users to prepare expression matrices in several formats by supplying the `format` parameter in the
POST request to the `/matrix` endpoint. The following is a list of supported file formats:
The Matrix Service supports generating matrices in the following 3 formats:

- [.zarr](https://zarr.readthedocs.io/en/stable/) (default)
- [.loom](http://loompy.org/)
- [.loom](http://loompy.org/) (default)
- [.csv](https://en.wikipedia.org/wiki/Comma-separated_values)
- [.mtx](https://math.nist.gov/MatrixMarket/formats.html)

The API also makes this information available via the `/matrix/formats` endpoint.
This list is also available at `/v1/formats` with additional information for a specific format available at
`/v1/formats/<format>`.

#### Feature

## Getting Started
The Matrix Service also supports generating cell by transcript matrices in addition to cell by gene matrices. To select
the feature type, specify either `gene` (default) or `transcript` in the POST request. The available features are
listed at `/v1/features` with additional information for a specific feature available at `/v1/features/<feature>`.
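
The four discovery endpoints (`/v1/filters`, `/v1/fields`, `/v1/formats`, `/v1/features`) share one URL pattern: the bare path lists the available names, and appending a name returns details for that item. A small Python sketch of building those URLs is shown here; the helper name `discovery_url` is hypothetical, and the shape of the responses is not specified in this README, so the sketch stops at the URL.

```python
from urllib.parse import quote

API_ROOT = "https://matrix.data.humancellatlas.org/v1"
RESOURCES = ("filters", "fields", "formats", "features")


def discovery_url(resource, name=None):
    """Return the listing URL for a resource, or the detail URL for one item."""
    if resource not in RESOURCES:
        raise ValueError(f"unknown resource: {resource!r}")
    url = f"{API_ROOT}/{resource}"
    if name is not None:
        # Percent-encode the name so it is safe as a single path segment.
        url += "/" + quote(name, safe="")
    return url
```

For instance, `discovery_url("fields")` gives the listing URL, and `discovery_url("fields", "genes_detected")` gives the detail URL for that field.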

## Developer Getting Started

### Requirements

@@ -120,14 +170,3 @@ cd chalice
make build && cd ..
./scripts/matrix-service-api.py
```

#### Logs

All API logs, AWS Lambda logs and AWS Batch logs can be found in
[Amazon CloudWatch Logs](https://console.aws.amazon.com/cloudwatch/home?region=us-east-1) under the following prefixes:

| **Component** | **Log Group Prefix** | **Log Stream Prefix** |
|---|---|---|
| Matrix API | _/aws/lambda/matrix-service-api-_ | - |
| Lambdas | _/aws/lambda/dcp-matrix-service-_ | - |
| File converter | _/aws/batch/job_ | _dcp-matrix-converter-job-definition-_ |
2 changes: 1 addition & 1 deletion matrix_architecture.svg