File format for AIRR Clone object? #421

bcorrie · 2020-06-25T19:33:08Z

Have we come to a consensus as to what the file format for the Clone object should be? I know @eharkins has done some validation on the AIRR Clone data model; did you use the JSON format for the object in that validation? I think so, as I recall some discussion about JSON in #323. This is also relevant to validation in the Python and R libraries in #328.

I suppose I am wondering whether we intend to support a TSV file format for Clones as well, as a compact/simple format. I believe the Clone object, at least, is a very flat, simple object, much like rearrangements. Are we intending to support a Clone TSV file? Did I miss such a discussion? 8-)

scharch · 2020-06-25T20:02:55Z

I can't find it now, but I'm pretty sure we decided it would be JSON/YAML because each Clone can reference multiple sequence_ids and multiple Trees, and we don't like nesting in TSVs :)

javh · 2020-06-25T20:13:05Z

That's my recollection as well.

Though, I do think we talked briefly about a TSV without the sequences field and with a single Newick string in the tree column (one consensus tree, with nodes identified by sequence_id) as a simplified format. There would be no nesting required, but you'll need the accompanying Rearrangement data to get the sequence info at the Node level.
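
A minimal sketch of what such a simplified TSV might look like, purely for illustration. The column names (`clone_id`, `repertoire_id`, `germline_alignment`, `tree`), the helper `clones_to_tsv`, and the sample record are assumptions, not part of the AIRR spec or the AIRR libraries. The nested `sequences` field is dropped and `tree` holds a single Newick string:

```python
import csv
import io

# Hypothetical flat columns; the nested `sequences` field is deliberately left out.
COLUMNS = ["clone_id", "repertoire_id", "germline_alignment", "tree"]

def clones_to_tsv(clones, handle):
    """Write clone dicts as tab-separated rows, keeping only the flat columns."""
    writer = csv.DictWriter(
        handle,
        fieldnames=COLUMNS,
        delimiter="\t",
        extrasaction="ignore",  # silently skip keys such as `sequences`
        lineterminator="\n",
    )
    writer.writeheader()
    for clone in clones:
        writer.writerow(clone)

# Made-up record: one clone with a single Newick tree whose leaves are sequence_ids.
clones = [
    {
        "clone_id": "C1",
        "repertoire_id": "REP1",
        "germline_alignment": "ACGT",
        "tree": "(seq1,seq2);",
        "sequences": ["seq1", "seq2"],
    },
]

buffer = io.StringIO()
clones_to_tsv(clones, buffer)
tsv_text = buffer.getvalue()
```

The sequence-level information for `seq1` and `seq2` would still have to come from the accompanying Rearrangement data, joined on `sequence_id`.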

eharkins · 2020-06-25T23:29:26Z

@bcorrie yes, we use JSON in Olmsted; e.g.

bcorrie · 2020-06-26T15:23:19Z

OK - we are looking at loading clone data into the iReceptor repositories so they are searchable via the ADC API (see #422) - so loading JSON/YAML it is...

So one related question. This would imply that we are assuming (hoping?) that most software tools that produce clones would support this format as a data interchange format?

eharkins · 2020-06-28T20:50:32Z

I'm not sure if this helps/answers, but the most common (the only one for now) pipeline that includes Olmsted also includes a Python utility to read clone info in the format in which it is produced (by the clone inference tools) and shape it into the Olmsted schema in JSON format.

scharch · 2020-06-29T13:40:15Z

When I eventually get around to implementing this in SONAR, it will produce the AIRR-compatible YAML file, yes.

bcorrie · 2020-06-29T17:02:22Z

> So one related question. This would imply that we are assuming (hoping?) that most software tools that produce clones would support this format as a data interchange format?

I guess another way of asking this question is, what are the most widely used Clone (in the sense of the AIRR Clone data model) generating software tools? Since we don't support clones yet, I am far from being an expert on this.

For rearrangements we support loading AIRR TSV, IMGT/V-QUEST, IgBLAST (through its AIRR TSV output), and MiXCR.

If we only support AIRR JSON Clone files, I think that will be very limiting in the short term. What other tools should I be considering from a data loading perspective?

bcorrie · 2020-06-29T17:04:51Z

> I'm not sure if this helps/answers, but the most common (the only one for now) pipeline that includes Olmsted also includes a Python utility to read clone info in the format in which it is produced (by the clone inference tools) and shape it into the Olmsted schema in JSON format.

@eharkins what pipeline would that be? If there is something that reads output from the clone inference tools, that would probably be very helpful.

eharkins · 2020-06-29T18:36:35Z

@bcorrie clone inference tool (partis) -> clone processing and phylogenetics pipeline (CFT) -> utility to convert into olmsted schema format (olmsted) -> olmsted

bcorrie · 2020-06-29T21:17:31Z

@schristley suggested using a JSON object similar to that used in Repertoire, essentially something like this:

  clone_response:
    type: object
    properties:
      Info:
        $ref: '#/definitions/info_object'
      Clone:
        $ref: '#/definitions/clone_list'

Where clone_list above is just an Array of JSON Clone objects from the AIRR Spec. The above is essentially what the proposed Clone ADC API returns, which would be ideal...

So an AIRR Clone file would look like:

{
"Info":{STANDARD AIRR INFO OBJECT},
"Clone":
  [
    {"clone_id":"REP1-ID1", "germline_alignment":"...", "repertoire_id":"REP1","data_processing_id":"DP1",...},
    {"clone_id":"REP1-ID2", "germline_alignment":"...", "repertoire_id":"REP1","data_processing_id":"DP1",...},
    ...
    {"clone_id":"REP2-ID1", "germline_alignment":"...", "repertoire_id":"REP2","data_processing_id":"DP1",...},
    ...
    {"clone_id":"REP3-ID1", "germline_alignment":"...", "repertoire_id":"REP3","data_processing_id":"DP1",...},
    ...
  ]
}

In this case the clone_id is unique within a Repertoire, or perhaps more precisely within a repertoire_id/data_processing_id pair.
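
As a rough sketch of how a loader might read a file in this layout and check that uniqueness rule: the function names `load_clone_file` and `duplicate_clone_keys` are hypothetical, the sample values are made up, and treating `Info` as optional is an assumption carried over from how Repertoire files work.

```python
import json
from collections import Counter

def load_clone_file(text):
    """Parse a Clone JSON document laid out as {"Info": ..., "Clone": [...]}."""
    data = json.loads(text)
    info = data.get("Info")         # assumed optional, as in Repertoire files
    clones = data.get("Clone", [])  # list of Clone objects
    return info, clones

def duplicate_clone_keys(clones):
    """Return (repertoire_id, data_processing_id, clone_id) keys seen more than once."""
    counts = Counter(
        (c.get("repertoire_id"), c.get("data_processing_id"), c.get("clone_id"))
        for c in clones
    )
    return [key for key, n in counts.items() if n > 1]

# Made-up example: the same clone_id is fine across repertoires,
# but the third record repeats the first one's full key.
example = json.dumps({
    "Info": {"title": "Example clone file", "version": "1.3"},
    "Clone": [
        {"clone_id": "ID1", "repertoire_id": "REP1", "data_processing_id": "DP1"},
        {"clone_id": "ID1", "repertoire_id": "REP2", "data_processing_id": "DP1"},
        {"clone_id": "ID1", "repertoire_id": "REP1", "data_processing_id": "DP1"},
    ],
})

info, clones = load_clone_file(example)
duplicates = duplicate_clone_keys(clones)
```

Here `duplicates` contains only the `("REP1", "DP1", "ID1")` key, since the `REP2` record shares the `clone_id` but not the repertoire.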

Does that make sense?

bcorrie · 2020-06-30T16:39:21Z

@schristley points out in #422 that from a repository perspective, we would want clone_id to be like sequence_id and repertoire_id (and probably cell_id) in that they are unique within the context of the analysis being performed.

@schristley mentioned this in relation to both cell_id and clone_id here: #409 (comment). That is, these are ephemeral things that can be modified as needed to ensure uniqueness and link records as required. Does that make sense to people?

bcorrie · 2020-07-10T16:50:16Z

Any further thoughts on the file format suggested in: #421 (comment)?

We are about to embark on writing a data loader into the iReceptor Turnkey repository for the AIRR Clone format, and I would like to try to get some basic agreement before we go too far... 8-)

If I assume the above for now, am I going to regret it later?

I think the uniqueness discussion of clone_id is independent of the file format.

scharch · 2020-07-12T19:16:40Z

It's not clear to me that the clone file is supposed to have its own Info object; I thought it would be linked to the Repertoire metadata. OTOH, that probably depends a bit on how we decided to define a repertoire and how to handle Clones defined across multiple repertoires....

bussec · 2020-07-13T18:12:28Z

Will there be any possibility to group Cells into a Clone?

scharch · 2020-07-13T18:20:23Z

@bussec: Eventually. See #317

bussec · 2020-07-13T18:31:09Z

@scharch: Thanks, then I will move that part of the discussion there.

bcorrie · 2020-07-14T01:09:02Z

> It's not clear to me that the clone file is supposed to have its own Info object; I thought it would be linked to the Repertoire metadata. OTOH, that probably depends a bit on how we decided to define a repertoire and how to handle Clones defined across multiple repertoires....

My question is whether a Clone file is likely to exist as a standalone thing and, if so, whether the Info section would be useful. The Info object is mostly about defining version information for the AIRR Spec that is used to define the rest of the file. As such, I think it is essentially independent of use (I don't think there is anything Repertoire-specific in it). From the docs section for Repertoire files:

> The file can (optionally) contain an Info object, at the beginning of the file, based upon the Info schema in the OpenAPI V2 specification. If provided, version in Info should reference the version of the AIRR schema for the file.

Given that we are using JSON and therefore structured text, the Info block gives a way of ensuring that two files use the same AIRR version (e.g. v1.3), which will be potentially important for tools in the future. It provides a mechanism for one to take a set of Repertoire, Clone, and Rearrangement files and do a sanity check to ensure that the files are compatible.
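
To make that sanity check concrete, here is a small sketch under some assumptions: the files have already been parsed into dicts, "compatible" is taken to mean identical `version` strings in `Info`, and the names `info_versions` and `versions_compatible` are hypothetical rather than part of any AIRR library.

```python
def info_versions(documents):
    """Map each document name to the version in its Info block (None if absent)."""
    return {
        name: (doc.get("Info") or {}).get("version")
        for name, doc in documents.items()
    }

def versions_compatible(documents):
    """True only if every document has an Info version and they are all identical."""
    versions = set(info_versions(documents).values())
    return len(versions) == 1 and None not in versions

# Made-up parsed files, keyed by a hypothetical file name.
matching = {
    "repertoires.json": {"Info": {"version": "1.3"}, "Repertoire": []},
    "clones.json": {"Info": {"version": "1.3"}, "Clone": []},
}
mismatched = {
    "repertoires.json": {"Info": {"version": "1.3"}, "Repertoire": []},
    "clones.json": {"Info": {"version": "1.2"}, "Clone": []},
}
missing_info = {
    "repertoires.json": {"Info": {"version": "1.3"}, "Repertoire": []},
    "clones.json": {"Clone": []},
}

matching_ok = versions_compatible(matching)
mismatched_ok = versions_compatible(mismatched)
missing_ok = versions_compatible(missing_info)
```

A file without an Info block is treated as failing the check here; a real tool might prefer to warn instead, since Info is optional.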

If this is unlikely or unimportant, then it is not required. With that said, it doesn't result in much overhead and does provide some utility...

scharch · 2020-07-14T16:54:07Z

@bcorrie ok, yeah, thanks

bcorrie · 2020-07-22T19:10:11Z

Anyone object to me closing this issue? We are starting to work on the data loader, and I would like to get consensus before moving forward... It seems like we have it, with the caveat that if the fields change, the format changes. By consensus I meant that we agree that the file format is JSON, and at least for now we are OK with having the Info block as part of the format?

scharch · 2020-07-22T19:14:54Z

ok with me

bcorrie · 2020-07-27T22:05:01Z

Closing this issue

franasa mentioned this issue Jun 26, 2020

File formats for a MIAIRR singe cell exporter #409

Closed

schristley mentioned this issue Jun 30, 2020

Clone api #422

Merged

bcorrie closed this as completed Jul 27, 2020
