cellarium-ai/cellarium-neu-challenge


NEU | Cellarium Challenge

Description

Cell Annotation Service (CAS) is a tool for rapidly searching cells based on their raw count matrices. Analysts can use this service to annotate and search their cells.

Data

The Raw Count Matrix in scRNA-seq data is a sparse, high-dimensional array where each column represents a gene, and each row represents a cell from the tissue. Each value in the matrix indicates the count of a particular gene expressed in a specific cell during the experiment, based on the RNA molecules captured by the sequencing machine. The scRNA-seq data is structured as follows:

[Image: example raw count matrix]

Typically, these matrices have 30,000 to 40,000 columns and a varying number of rows, sometimes reaching several million. Most of the values are 0; only a small portion of the matrix is non-zero.

Data Flow of Cell Annotation

  1. Dimensionality reduction
    First, scRNA-seq data comes from the client and goes through dimensionality reduction (PCA). PCA outputs a 512-dimensional vector for each cell.
  2. Nearest Neighbor Search
    The 512-dimensional representations of the input cells then go to a Nearest Neighbor Search engine, which returns an array of cells that are close to the query cell (potentially meaningful biological context). Suppose that for our input example (Table 1) we get an output like this:
[
 {
   "query_id": "Cell 1",
   "matches": [
     {"id": 12321, "cell_type": "T cell", "distance": 0.789},
     {"id": 123145, "cell_type": "lymphocyte", "distance": 0.790},
     {"id": 1231, "cell_type": "alpha-beta T cell", "distance": 0.80}
   ]
 },
 {
   "query_id": "Cell 2",
   "matches": [
     {"id": 113543, "cell_type": "MHC-II-negative non-classical monocyte", "distance": 0.812},
     {"id": 1908, "cell_type": "native cell", "distance": 0.701},
     {"id": 12, "cell_type": "leukocyte", "distance": 0.67}
   ]
 },
 {
   "query_id": "Cell 3",
   "matches": [
     {"id": 1012342, "cell_type": "MHC-II-negative non-classical monocyte", "distance": 0.93},
     {"id": 56753,"cell_type": "Gr1-low non-classical monocyte", "distance": 0.82},
     {"id": 623456, "cell_type": "leukocyte", "distance": 0.710221}
   ]
 }
]

This would help us to annotate the query cell with the cell type returned from the Nearest Neighbor engine.

  3. Summarize context based on Nearest Neighbor Search output (the step that needs action)
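To make step 2 concrete, here is a minimal sketch of a brute-force nearest neighbor search that returns a response in the shape shown above. The reference cells, their 3-dimensional embeddings, and the function name `nearest_neighbors` are hypothetical. The source does not state which distance metric the real engine uses, so Euclidean distance (smaller means closer) is an assumption.

```python
import math

# Hypothetical reference set: (id, cell_type, embedding). Real embeddings
# would have 512 dimensions; 3 are used here to keep the example readable.
REFERENCE = [
    (12321, "T cell", (0.9, 0.1, 0.0)),
    (123145, "lymphocyte", (0.8, 0.2, 0.1)),
    (12, "leukocyte", (0.1, 0.9, 0.3)),
    (1908, "native cell", (0.0, 0.2, 0.9)),
]


def nearest_neighbors(query_id, query_vec, reference, k=3):
    """Return the k closest reference cells for one query cell."""
    scored = [
        {
            "id": ref_id,
            "cell_type": cell_type,
            # Assumption: Euclidean distance, smaller means more similar.
            "distance": round(math.dist(query_vec, vec), 3),
        }
        for ref_id, cell_type, vec in reference
    ]
    scored.sort(key=lambda match: match["distance"])
    return {"query_id": query_id, "matches": scored[:k]}


result = nearest_neighbors("Cell 1", (1.0, 0.0, 0.0), REFERENCE, k=2)
print(result)
```

A production engine would likely use an approximate index rather than a full scan, but the response structure that step 3 has to summarize stays the same.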

Problem

Cell type is a categorical variable, and the hierarchy of cell types is represented by an ontology. You can find the specific ontology used for the hierarchy of the data we work with here. The main problem of this challenge lies in summarizing the Nearest Neighbor Search responses.

A response that includes all the neighbors is difficult to interpret for annotation purposes, because it sometimes contains multiple reasonable matches. Our reference datasets are annotated at different levels of the cell ontology, so a similarity query returns a mixture of annotations at different granularities.

Note: the complexity of the problem lies in the fact that a specific (lower-level) cell type can belong to multiple more general (higher-level) cell types in the hierarchy, and those parents can themselves have different parents and no connection to each other.

Example

Let’s take a look at CD86-positive plasmablast in the ontology graph. It has different branches of parents:

[Image: ontology graph showing the parent branches of CD86-positive plasmablast]

It can be a leukocyte and a motile cell, and both of these will be true. If we go to a deeper level, we can see that it is also an antibody-secreting cell, a B cell, and a lymphocyte of B lineage. A result predicting any of these classes would be meaningful. When the nearest neighbor search returns results with various cell types, we need to ensure that we aren't using cell types that are too granular, as this risks deviating too far from the ground truth. Similarly, we should avoid generalizing too much with a parent cell type that doesn't accurately represent the cell.
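The multi-parent structure described above can be sketched as a child-to-parents map that is walked upward. The edges in `TOY_PARENTS` are hand-written for illustration using the cell type names from the example; they are an assumption and are not copied from the actual ontology, which a real solution would load from the ontology file.

```python
# Hypothetical, simplified child -> parents map (NOT the real ontology edges).
TOY_PARENTS = {
    "CD86-positive plasmablast": ["antibody-secreting cell", "B cell"],
    "antibody-secreting cell": ["leukocyte"],
    "B cell": ["lymphocyte of B lineage"],
    "lymphocyte of B lineage": ["leukocyte", "motile cell"],
}


def ancestors(cell_type, parents):
    """Return every term reachable by following parent links (excluding cell_type)."""
    seen = set()
    stack = list(parents.get(cell_type, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            # A node can have several parents, so all branches are followed.
            stack.extend(parents.get(node, []))
    return seen


plasmablast_ancestors = ancestors("CD86-positive plasmablast", TOY_PARENTS)
print(sorted(plasmablast_ancestors))
```

Because a term can be reached through more than one branch, the `seen` set prevents counting the same ancestor twice.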

Challenge

Develop an algorithm that, based on the response from the nearest neighbor search engine, can return a reasonable aggregation of cell types while ranking them. You are required to propose the algorithm and describe it in detail. While the code for the prototype is not mandatory, it would be a valuable addition. Please include the resources you used, such as links to papers or articles. The most important criteria that we will look at are the approach you propose for the problem's solution and how well you can use online resources to develop one.
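As a starting point only, one possible baseline is sketched below; it is not the expected solution. Each neighbor votes for its own cell type and all of its ancestors, and the ranking puts the most specific term that still has majority support first. The ontology edges in `TOY_PARENTS`, the weight `1 / (1 + distance)`, the `min_support` threshold, and all function names are assumptions made for this illustration. It also assumes that a smaller distance means a closer neighbor, which the source does not state.

```python
from collections import defaultdict

# Simplified, hand-written child -> parents map for illustration only.
TOY_PARENTS = {
    "alpha-beta T cell": ["T cell"],
    "T cell": ["lymphocyte"],
    "lymphocyte": ["leukocyte"],
    "leukocyte": ["native cell"],
    "native cell": [],
}


def with_ancestors(cell_type, parents):
    """Return cell_type together with all of its ancestors."""
    seen, stack = set(), [cell_type]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, []))
    return seen


def depth(cell_type, parents):
    """Longest path from cell_type up to a root; deeper means more specific."""
    direct = parents.get(cell_type, [])
    return 0 if not direct else 1 + max(depth(p, parents) for p in direct)


def rank_cell_types(matches, parents, min_support=0.5):
    """Rank ontology terms for one query cell from its neighbor matches."""
    support = defaultdict(float)
    total = 0.0
    for match in matches:
        weight = 1.0 / (1.0 + match["distance"])  # hypothetical weighting
        total += weight
        # A neighbor labeled "T cell" is also evidence for "lymphocyte", etc.
        for term in with_ancestors(match["cell_type"], parents):
            support[term] += weight
    ranked = [
        {"cell_type": t, "support": s / total, "depth": depth(t, parents)}
        for t, s in support.items()
    ]
    # Well-supported terms first; among those, the most specific on top.
    ranked.sort(
        key=lambda r: (r["support"] >= min_support, r["depth"], r["support"]),
        reverse=True,
    )
    return ranked


cell_1_matches = [
    {"id": 12321, "cell_type": "T cell", "distance": 0.789},
    {"id": 123145, "cell_type": "lymphocyte", "distance": 0.790},
    {"id": 1231, "cell_type": "alpha-beta T cell", "distance": 0.80},
]
ranking = rank_cell_types(cell_1_matches, TOY_PARENTS)
for row in ranking:
    print(row)
```

With these assumptions, "T cell" ranks first for Cell 1: two of the three neighbors support it, while "alpha-beta T cell" is backed by only one neighbor and "lymphocyte" is less specific. The threshold is the knob that trades granularity against agreement.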

The challenge is provided by the Cellarium team at the Data Sciences Platform, Broad Institute, and Professor Nik Bear Brown from Northeastern University's College of Engineering.

Submission

Send your submission to:

[email protected]

and CC Prof Nik Bear Brown:

[email protected]

before June 6, 2024, 11:59 PM.

Please include "NEU-Cellarium-Challenge" in the email subject line.

Feel free to use the same email if you have any questions regarding the challenge task.

Materials

Please find a Jupyter notebook attached with examples of the data. You can use the notebook (NEU-Broad-Challenge.ipynb) as a starting point for the challenge.

Single-cell RNA sequencing (scRNA-seq) is a powerful technique used to analyze gene expression at the single-cell level. A detailed explanation follows.

Overview of scRNA-seq Data

scRNA-seq allows researchers to examine the transcriptome of individual cells, providing insights into cellular functions, states, and interactions that are not possible with bulk RNA sequencing, which averages the gene expression across many cells.

Key Components of scRNA-seq Data

  1. Cells:

    • Each cell in a sample is isolated, and its RNA is captured and sequenced individually.
    • The data is collected for thousands to millions of individual cells in a single experiment.
  2. Genes:

    • Each column in the scRNA-seq data matrix represents a gene.
    • The genes are the same across all cells in the dataset, typically including all known genes for the organism being studied.
  3. Raw Count Matrix:

    • The raw count matrix is a high-dimensional array where rows represent individual cells and columns represent genes.
    • Each value in the matrix indicates the number of RNA molecules (transcripts) for a particular gene observed in a specific cell.
    • The matrix is typically sparse, meaning that most values are zero, as not all genes are expressed in every cell at any given time.

Example of a Raw Count Matrix

Cell/Gene   Gene1   Gene2   Gene3   ...   GeneN
Cell1       0       2       0       ...   5
Cell2       3       0       1       ...   0
Cell3       0       1       4       ...   2
...         ...     ...     ...     ...   ...
CellM       1       0       0       ...   3
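The sparsity of such a matrix can be illustrated with a small sketch that stores only the non-zero counts. It uses just the visible entries of the example table above (the elided "..." rows and columns are left out), and the helper name `to_sparse` is hypothetical. Real datasets would more likely use a sparse matrix type such as those in `scipy.sparse`.

```python
# Visible entries of the example table; the elided rows/columns are omitted.
GENES = ["Gene1", "Gene2", "Gene3", "GeneN"]
COUNTS = {
    "Cell1": [0, 2, 0, 5],
    "Cell2": [3, 0, 1, 0],
    "Cell3": [0, 1, 4, 2],
    "CellM": [1, 0, 0, 3],
}


def to_sparse(counts, genes):
    """Keep only non-zero counts as {(cell, gene): count}."""
    return {
        (cell, gene): value
        for cell, row in counts.items()
        for gene, value in zip(genes, row)
        if value != 0
    }


sparse = to_sparse(COUNTS, GENES)
n_total = len(COUNTS) * len(GENES)
zero_fraction = 1 - len(sparse) / n_total
print(f"{len(sparse)} of {n_total} entries stored, {zero_fraction:.0%} zeros")
```

In this toy table fewer than half of the entries are zero; in real scRNA-seq matrices the zero fraction is far higher, which is why sparse storage matters.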

Data Processing and Analysis

  1. Dimensionality Reduction:

    • Due to the high dimensionality of the data (tens of thousands of genes), techniques like Principal Component Analysis (PCA) are used to reduce the number of dimensions while retaining most of the important information.
    • This helps in visualizing and analyzing the data more effectively.
  2. Clustering:

    • Cells with similar gene expression profiles are grouped together into clusters.
    • Each cluster often represents a distinct cell type or cell state.
  3. Annotation:

    • Clusters are annotated based on known markers for specific cell types.
    • This step involves identifying which genes are highly expressed in each cluster and matching these patterns to known cell types.
  4. Differential Expression Analysis:

    • Identifying genes that are differentially expressed between different clusters or conditions.
    • This helps in understanding the functional differences between cell types or states.
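The dimensionality-reduction step in the list above can be sketched in plain Python. This minimal example finds only the first principal component using power iteration on the covariance matrix; it is an illustration under simplifying assumptions, not the pipeline's implementation. A real workflow would keep many components (512 in this challenge) and would typically rely on a library such as scikit-learn. The function names and the toy data are hypothetical.

```python
import math


def first_principal_component(rows, iterations=100):
    """Return (column means, unit vector of greatest variance) for rows.

    Assumes the data has non-zero variance; otherwise the norm below is zero.
    """
    n, d = len(rows), len(rows[0])
    # Center each column (gene) at zero mean.
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in rows]
    # Sample covariance matrix of the centered data.
    cov = [
        [sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]
    # Power iteration converges to the dominant eigenvector of cov.
    vec = [1.0] * d
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    return means, vec


def project(row, means, component):
    """Coordinate of one cell along the component (a 1-D embedding)."""
    return sum((x - m) * c for x, m, c in zip(row, means, component))


# Toy data: four "cells" measured on two "genes", lying on the line y = 2x.
cells = [[0.0, 0.0], [1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
means, component = first_principal_component(cells)
embedding = [project(cell, means, component) for cell in cells]
print(component, embedding)
```

Here every cell collapses from two values to one coordinate without losing information, because all the variance lies along a single direction.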

Applications of scRNA-seq

  • Cell Type Identification: Discovering new cell types and understanding the cellular composition of tissues.
  • Developmental Biology: Studying how cells differentiate and develop over time.
  • Disease Research: Identifying cellular changes associated with diseases such as cancer, neurological disorders, and infections.
  • Immune Response: Understanding how individual immune cells respond to pathogens.

Summary

scRNA-seq provides a detailed view of gene expression at the single-cell level, enabling researchers to uncover the complexity and heterogeneity of biological systems. The data generated from scRNA-seq experiments is processed and analyzed to identify and annotate different cell types, understand cellular functions, and explore how cells interact and change in response to various conditions.

