Remove genes.tsv.gz from mtx format #424

Closed
wants to merge 2 commits into from
2 changes: 1 addition & 1 deletion docker-images/matrix-converter/VERSION
@@ -1 +1 @@
-36
+37
21 changes: 16 additions & 5 deletions matrix/common/constants.py
@@ -459,7 +459,7 @@ class MetadataSchemaName(Enum):
MatrixFormat.MTX.value: """
Review comment (Collaborator, Author):
Does this content need to be updated somewhere else too?

<h2>HCA Matrix Service MTX Output</h2>
<p>The mtx-formatted output from the matrix service is a zip archive that contains
-three files:</p>
+four files:</p>
<table class="table table-striped table-bordered">
<thead>
<tr>
@@ -477,11 +477,18 @@ class MetadataSchemaName(Enum):
<td>Cell metadata</td>
</tr>
<tr>
-<td>&lt;directory_name&gt;/genes.tsv.gz</td>
+<td>&lt;directory_name&gt;/features.tsv.gz</td>
<td>Gene (or transcript) metadata</td>
</tr>
+<tr>
+<td>&lt;directory_name&gt;/barcodes.tsv.gz</td>
+<td>Cell barcodes</td>
+</tr>
</tbody>
</table>
+<p>For 10x experiments, this format adheres to the Cell Ranger
+<a href="https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/output/matrices">
+feature-barcode matrix</a> specification.</p>

<h3><code>matrix.mtx.gz</code></h3>
<p>This file contains expression values in the
@@ -494,8 +501,8 @@ class MetadataSchemaName(Enum):
<p>The expression values are meant to be a "raw" count, so for SmartSeq2 experiments, this
is the <code>expected_count</code> field from
<a href="http://deweylab.biostat.wisc.edu/rsem/rsem-calculate-expression.html#output">RSEM
-output</a>. For 10x experiments analyzed with Cell Ranger, this is read from the
-<code>matrix.mtx</code> file that Cell Ranger produces as its filtered feature-barcode matrix.</p>
+output</a>. For 10x experiments analyzed with Optimus, this is read from the
+<a href="https://zarr.readthedocs.io/en/stable">zarr</a> array produced by the pipeline.</p>

<h3><code>cells.tsv.gz</code></h3>
<p>Each row of the cell metadata table represents a cell, and each column is a different metadata
@@ -504,10 +511,14 @@ class MetadataSchemaName(Enum):
fields, <code>genes_detected</code> for example, are calculated during secondary analysis.
Full descriptions of those fields are forthcoming.</p>

-<h3><code>genes.tsv.gz</code></h3>
+<h3><code>features.tsv.gz</code></h3>
<p>The gene metadata contains basic information about the genes in the count matrix.
Each row is a gene, and each row corresponds to the same row in the expression mtx file.
Note that <code>featurename</code> is not unique.</p>

+<h3><code>barcodes.tsv.gz</code></h3>
+<p>A list of cell barcodes corresponding to the columns found in matrix.mtx.gz.
+Note that barcodes may not be unique.</p>
Review comment (Collaborator, Author):
@mckinsel What do you recommend we use instead to ensure uniqueness across projects? Will storing something other than barcodes in this file cause confusion?

"""
}
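
For readers of the documentation string above, the following is a minimal sketch of how the four-file archive might be consumed using only the Python standard library. The function names `read_mtx_archive` and `_read_gz_text` are hypothetical and not part of the matrix service. The feature column names are taken from the functional test in this PR, and the sketch assumes that `matrix.mtx.gz` is in Matrix Market coordinate format with 1-based indices, rows aligned to `features.tsv.gz` and columns aligned to `barcodes.tsv.gz`.

```python
import csv
import gzip
import io
import zipfile

# Column order as listed in this PR's functional test; features.tsv.gz has no header row.
FEATURE_COLUMNS = [
    "featurekey", "featurename", "featuretype", "featuretype_10x",
    "chromosome", "featurestart", "featureend", "isgene", "genus_species",
]


def _read_gz_text(archive, member):
    """Return the decompressed text of a gzipped member of an open ZipFile."""
    return gzip.decompress(archive.read(member)).decode()


def read_mtx_archive(zip_bytes, directory_name):
    """Return (features, barcodes, entries) for an mtx zip archive.

    entries is a list of (featurekey, barcode, value) triples.
    Hypothetical helper for illustration only.
    """
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        features = list(csv.DictReader(
            io.StringIO(_read_gz_text(archive, f"{directory_name}/features.tsv.gz")),
            delimiter="\t",
            fieldnames=FEATURE_COLUMNS,
        ))
        barcodes = _read_gz_text(
            archive, f"{directory_name}/barcodes.tsv.gz").splitlines()
        matrix_text = _read_gz_text(archive, f"{directory_name}/matrix.mtx.gz")

    # Drop the Matrix Market banner and comment lines, which start with '%'.
    lines = [ln for ln in matrix_text.splitlines() if ln and not ln.startswith("%")]

    entries = []
    # lines[0] is the size line (rows, columns, stored entries); data follows it.
    for line in lines[1:]:
        row, col, value = line.split()
        # Matrix Market indices are 1-based: row -> feature, column -> barcode.
        feature = features[int(row) - 1]
        entries.append((feature["featurekey"], barcodes[int(col) - 1], float(value)))
    return features, barcodes, entries
```

Because `featurename` and barcodes are documented as possibly non-unique, the sketch keys each entry on `featurekey` and keeps barcodes as a positional list rather than a dictionary.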

5 changes: 2 additions & 3 deletions matrix/docker/matrix_converter.py
@@ -176,7 +176,7 @@ def _grouper(iterable, n):
yield cells_df

def _to_mtx(self):
-"""Write a zip file with an mtx and two metadata tsvs from Redshift query
+"""Write a zip file with an mtx and three metadata tsvs from Redshift query
manifests.

Returns:
@@ -219,11 +219,10 @@ def _to_mtx(self):
cell_count += pivoted.shape[1]
cellkeys.extend(pivoted.columns.to_list())

-self._write_out_gene_dataframe(results_dir, "genes.tsv.gz", compression=True)
self._write_out_cell_dataframe(results_dir, "cells.tsv.gz", cell_df, cellkeys, compression=True)
self._write_out_barcode_dataframe(results_dir, "barcodes.tsv.gz", cell_df, cellkeys)

-file_names = ["features.tsv.gz", "genes.tsv.gz", "matrix.mtx.gz", "cells.tsv.gz", "barcodes.tsv.gz"]
+file_names = ["features.tsv.gz", "matrix.mtx.gz", "cells.tsv.gz", "barcodes.tsv.gz"]
zip_path = self._zip_up_matrix_output(results_dir, file_names)
return zip_path
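
The body of `_zip_up_matrix_output` is not shown in this diff. As a rough sketch only, a helper of this kind could place each named file under a single top-level directory, which would produce member names such as `test.mtx/matrix.mtx.gz` as seen in the functional test. The name `zip_matrix_output` and its signature are assumptions made for illustration, not the converter's actual code.

```python
import os
import zipfile


def zip_matrix_output(results_dir, file_names, zip_path):
    """Zip the named files from results_dir under one top-level directory.

    Hypothetical stand-in for _zip_up_matrix_output, whose real
    implementation is not part of this diff.
    """
    # Use the results directory's own name as the prefix inside the archive.
    prefix = os.path.basename(os.path.normpath(results_dir))
    # The members are already gzipped, so store them without recompressing.
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_STORED) as archive:
        for name in file_names:
            # A file that was never written raises FileNotFoundError here,
            # so a stale entry in file_names fails loudly.
            archive.write(os.path.join(results_dir, name),
                          arcname=f"{prefix}/{name}")
    return zip_path
```

Under this sketch, dropping the `genes.tsv.gz` writer without also dropping it from `file_names` would raise an error at zip time, which is why the change above removes both together.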

2 changes: 1 addition & 1 deletion terraform/modules/matrix-service/infra/converter_batch.tf
@@ -222,7 +222,7 @@ resource "aws_batch_job_definition" "converter_job_def" {
container_properties = <<CONTAINER_PROPERTIES
{
"command": [],
-"image": "humancellatlas/matrix-converter:36",
+"image": "humancellatlas/matrix-converter:37",
"memory": 8192,
"vcpus": 2,
"jobRoleArn": "${aws_iam_role.converter_job_role.arn}",
18 changes: 14 additions & 4 deletions tests/functional/test_conversions.py
@@ -209,11 +209,10 @@ def test_mtx(self, mock_upload_method):
# Check the components of the zip file
members = mtx_output.namelist()
self.assertIn("test.mtx/matrix.mtx.gz", members)
-self.assertIn("test.mtx/genes.tsv.gz", members)
self.assertIn("test.mtx/cells.tsv.gz", members)
self.assertIn("test.mtx/features.tsv.gz", members)
self.assertIn("test.mtx/barcodes.tsv.gz", members)
-self.assertEqual(len(members), 5)
+self.assertEqual(len(members), 4)

# Read in the cell and gene tables. We need both for mtx files
# since the mtx itself is just numbers and indices.
@@ -223,8 +222,19 @@ def test_mtx(self, mock_upload_method):
mtx_cells[row["cellkey"]] = row

mtx_genes = collections.OrderedDict()
-for row in csv.DictReader(io.StringIO(gzip.GzipFile(fileobj=io.BytesIO(
-mtx_output.read("test.mtx/genes.tsv.gz"))).read().decode()), delimiter='\t'):
+for row in csv.DictReader(
+io.StringIO(gzip.GzipFile(fileobj=io.BytesIO(
+mtx_output.read("test.mtx/features.tsv.gz"))).read().decode()),
+delimiter='\t',
+fieldnames=["featurekey",
Review comment (Collaborator, Author):
Required since features.tsv.gz, per the specification, does not include headers. Will this be confusing to users, given that we're removing genes.tsv.gz?

+"featurename",
+"featuretype",
+"featuretype_10x",
+"chromosome",
+"featurestart",
+"featureend",
+"isgene",
+"genus_species"]):
mtx_genes[row["featurekey"]] = row

# Read the expression values. This is supposed to be aligned with
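
The review comment in this hunk concerns `features.tsv.gz` having no header row. The sketch below illustrates the pitfall a user could hit: when `csv.DictReader` is not given `fieldnames`, it treats the first data row as the header and silently drops one feature. The helper name `read_tsv_gz` is hypothetical, and the two-column layout in the usage note is shortened for illustration.

```python
import csv
import gzip
import io


def read_tsv_gz(gz_bytes, fieldnames=None):
    """Parse gzipped TSV bytes into a list of dicts.

    With fieldnames=None, csv.DictReader takes the first row as the header,
    which is wrong for a headerless file such as features.tsv.gz.
    Hypothetical helper for illustration only.
    """
    # gzip.open accepts a file object; mode "rt" decodes to text on the fly.
    with gzip.open(io.BytesIO(gz_bytes), mode="rt",
                   encoding="utf-8", newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t", fieldnames=fieldnames))
```

For a two-row headerless file, `read_tsv_gz(data, fieldnames=["featurekey", "featurename"])` returns both rows, while `read_tsv_gz(data)` returns a single row whose keys are the values from the first feature.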
3 changes: 0 additions & 3 deletions tests/unit/docker/test_matrix_converter.py
@@ -385,9 +385,6 @@ def test__to_mtx(self, mock_parse_manifest, mock_load_cell_results, mock_write_g
test_data["genes_df"].to_csv(os.path.join(results_dir, "features.tsv.gz"),
index_label="featurekey",
sep="\t", compression="gzip")
-test_data["genes_df"].to_csv(os.path.join(results_dir, "genes.tsv.gz"),
-index_label="featurekey",
-sep="\t", compression="gzip")
self.matrix_converter.local_output_filename = "unit_test__to_mtx.zip"
zip_path = self.matrix_converter._to_mtx()
