File format for AIRR Clone object? #421

Closed
bcorrie opened this issue Jun 25, 2020 · 21 comments

Comments

@bcorrie
Contributor

bcorrie commented Jun 25, 2020

Have we come to a consensus as to what the file format for the Clone object should be? I know @eharkins has done some validation on the AIRR Clone data model; did you use the JSON format for the object in that validation? I think so, as I recall some discussion about JSON in #323. This is also relevant to validation in the Python and R libraries in #328.

I suppose I am wondering whether we intend to support a TSV file format for Clones as well, as a compact/simple format. I believe the Clone object, at least, is a very flat, simple object, much like rearrangements. Did I miss such a discussion? 8-)

@scharch
Contributor

scharch commented Jun 25, 2020

I can't find it now, but I'm pretty sure we decided it would be JSON/YAML because each Clone can reference multiple sequence_ids and multiple Trees, and we don't like nesting in TSVs :)

@javh
Contributor

javh commented Jun 25, 2020

That's my recollection as well.

Though, I do think we talked briefly about a TSV without the sequences field and with a single Newick string in the tree column (one consensus tree, with nodes identified by sequence_id) as a simplified format. There would be no nesting required, but you'll need the accompanying Rearrangement data to get the sequence info at the Node level.
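As a rough illustration of that simplified layout, here is a minimal sketch that writes and reads back one flat row per clone, with a single Newick string in a `tree` column. The column subset, the identifier values, and the Newick string are assumptions chosen for illustration; this is not an agreed AIRR format.

```python
import csv
import io

# Hypothetical column subset: the Clone fields minus the nested sequences list,
# plus one Newick string whose leaves are sequence_id values.
COLUMNS = ["clone_id", "repertoire_id", "germline_alignment", "tree"]

rows = [
    {
        "clone_id": "ID1",
        "repertoire_id": "REP1",
        "germline_alignment": "...",
        "tree": "((seq1,seq2),seq3);",
    },
]

# Write the rows as tab-separated text into an in-memory buffer.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=COLUMNS, delimiter="\t", lineterminator="\n")
writer.writeheader()
writer.writerows(rows)

# Read the text back to confirm the row round-trips without any nesting.
buffer.seek(0)
parsed = list(csv.DictReader(buffer, delimiter="\t"))
print(parsed[0]["tree"])
```

Sequence-level details for `seq1`, `seq2`, and `seq3` would still have to come from the accompanying Rearrangement data, joined on `sequence_id`.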

@eharkins
Contributor

@bcorrie yes, we use JSON in Olmsted; e.g.

@bcorrie
Contributor Author

bcorrie commented Jun 26, 2020

OK - we are looking at loading clone data into the iReceptor repositories so they are searchable via the ADC API (see #422) - so loading JSON/YAML it is...

So one related question. This would imply that we are assuming (hoping?) that most software tools that produce clones would support this format as a data interchange format?

@eharkins
Contributor

I'm not sure if this helps/answers but the most common (the only for now) pipeline that includes Olmsted also includes a python utility to read clone info in the format it is produced (by the clone inference tools) and shape it into the Olmsted schema in JSON format.

@scharch
Contributor

scharch commented Jun 29, 2020

When I eventually get around to implementing this in SONAR, it will produce the AIRR-compatible YAML file, yes.

@bcorrie
Contributor Author

bcorrie commented Jun 29, 2020

> So one related question. This would imply that we are assuming (hoping?) that most software tools that produce clones would support this format as a data interchange format?

I guess another way of asking this question is, what are the most widely used Clone (in the sense of the AIRR Clone data model) generating software tools? Since we don't support clones yet, I am far from being an expert on this.

For rearrangements we support loading AIRR TSV, IMGT/V-QUEST, IgBLAST (through its AIRR TSV output), and MiXCR.

If we only support AIRR JSON Clone files, I think that will be very limiting in the short term. What other tools should I be considering from a data loading perspective?

@bcorrie
Contributor Author

bcorrie commented Jun 29, 2020

> I'm not sure if this helps/answers but the most common (the only for now) pipeline that includes Olmsted also includes a python utility to read clone info in the format it is produced (by the clone inference tools) and shape it into the Olmsted schema in JSON format.

@eharkins what pipeline would that be? If there is something that reads output from the clone inference tools, that would probably be very helpful.

@eharkins
Contributor

@bcorrie clone inference tool (partis) -> clone processing and phylogenetics pipeline (CFT) -> utility to convert into olmsted schema format (olmsted) -> olmsted

@bcorrie
Contributor Author

bcorrie commented Jun 29, 2020

@schristley suggested using a JSON object similar to that used in Repertoire, essentially something like this:

  clone_response:
    type: object
    properties:
      Info:
        $ref: '#/definitions/info_object'
      Clone:
        $ref: '#/definitions/clone_list'

Where clone_list above is just an Array of JSON Clone objects from the AIRR Spec. The above is essentially what the proposed Clone ADC API returns, which would be ideal...

So an AIRR Clone file would look like:

{
"Info":{STANDARD AIRR INFO OBJECT},
"Clone":
  [
    {"clone_id":"REP1-ID1", "germline_alignment":"...", "repertoire_id":"REP1","data_processing_id":"DP1",...},
    {"clone_id":"REP1-ID2", "germline_alignment":"...", "repertoire_id":"REP1","data_processing_id":"DP1",...},
    ...
    {"clone_id":"REP2-ID1", "germline_alignment":"...", "repertoire_id":"REP2","data_processing_id":"DP1",...},
    ...
    {"clone_id":"REP3-ID1", "germline_alignment":"...", "repertoire_id":"REP3","data_processing_id":"DP1",...},
    ...
  ]
}

In this case the clone_id is unique within a Repertoire, or perhaps more precisely within a repertoire_id/data_processing_id pair.
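To make the uniqueness rule concrete, here is a minimal sketch that parses a file in the layout above and reports any `clone_id` repeated within one `repertoire_id`/`data_processing_id` pair. The helper name `duplicate_clone_keys` and the sample values are hypothetical; only the `Info`/`Clone` layout and the three identifier fields come from the example above.

```python
import json
from collections import Counter


def duplicate_clone_keys(clone_file):
    """Return (repertoire_id, data_processing_id, clone_id) keys seen more than once."""
    counts = Counter(
        (c.get("repertoire_id"), c.get("data_processing_id"), c.get("clone_id"))
        for c in clone_file.get("Clone", [])
    )
    return [key for key, n in counts.items() if n > 1]


# Hypothetical file content, abbreviated to the identifier fields only.
text = json.dumps({
    "Info": {"version": "1.3"},
    "Clone": [
        {"clone_id": "ID1", "repertoire_id": "REP1", "data_processing_id": "DP1"},
        # Same clone_id in a different repertoire: allowed under this rule.
        {"clone_id": "ID1", "repertoire_id": "REP2", "data_processing_id": "DP1"},
        # Same clone_id in the same repertoire/data processing pair: a clash.
        {"clone_id": "ID1", "repertoire_id": "REP1", "data_processing_id": "DP1"},
    ],
})

clone_file = json.loads(text)
duplicates = duplicate_clone_keys(clone_file)
print(duplicates)
```

A loader could run a check like this before inserting records and reject or rename the clashing entries.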

Does that make sense?

@schristley schristley mentioned this issue Jun 30, 2020
@bcorrie
Contributor Author

bcorrie commented Jun 30, 2020

@schristley points out in #422 that from a repository perspective, we would want clone_id to be like sequence_id and repertoire_id (and probably cell_id) in that they are unique within the context of the analysis being performed.

@schristley mentioned this in relation to both cell_id and clone_id here: #409 (comment). That is, these are ephemeral identifiers that can be modified as needed to ensure uniqueness and to link records as required. Does that make sense to people?

@bcorrie
Contributor Author

bcorrie commented Jul 10, 2020

Any further thoughts on the file format suggested in: #421 (comment)?

We are about to embark on writing a data loader into the iReceptor Turnkey repository for the AIRR Clone format, and I would like to try and get some basic agreement before we go too far... 8-)

If I assume the above for now, am I going to regret it later?

I think the uniqueness discussion of clone_id is independent of the file format.

@scharch
Contributor

scharch commented Jul 12, 2020

It's not clear to me that the clone file is supposed to have its own Info object; I thought it would be linked to the Repertoire metadata. OTOH, that probably depends a bit on how we decided to define a repertoire and how to handle Clones defined across multiple repertoires....

@bussec
Member

bussec commented Jul 13, 2020

Will there be any possibility to group Cells into a Clone?

@scharch
Contributor

scharch commented Jul 13, 2020

@bussec: Eventually. See #317

@bussec
Member

bussec commented Jul 13, 2020

@scharch: Thanks, then I will move that part of the discussion there.

@bcorrie
Contributor Author

bcorrie commented Jul 14, 2020

> It's not clear to me that the clone file is supposed to have its own Info object; I thought it would be linked to the Repertoire metadata. OTOH, that probably depends a bit on how we decided to define a repertoire and how to handle Clones defined across multiple repertoires....

My question would be whether there would likely be a Clone file as a standalone thing and, if so, whether the Info section would be useful. The Info object is mostly about defining version information for the AIRR Spec that is used to define the rest of the file. As such, I think it is essentially independent of use (I don't think there is anything Repertoire-specific in it). From the docs section for Repertoire files:

> The file can (optionally) contain an Info object, at the beginning of the file, based upon the Info schema in the OpenAPI V2 specification. If provided, version in Info should reference the version of the AIRR schema for the file.

Given that we are using JSON and therefore structured text, the Info block gives a way of ensuring that two files use the same AIRR version (e.g. v1.3), which will be potentially important for tools in the future. It provides a mechanism for one to take a set of Repertoire, Clone, and Rearrangement files and do a sanity check to ensure that the files are compatible.
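A minimal sketch of such a sanity check follows, assuming each file has already been parsed into a dict and that Info, when present, carries a `version` field. The function names and file names are hypothetical, and files without an Info block are simply skipped because the block is optional.

```python
def info_versions(files):
    """Map each file name to the version in its optional Info block (None if absent)."""
    return {
        name: (data.get("Info") or {}).get("version")
        for name, data in files.items()
    }


def versions_compatible(files):
    """True when every file that declares a version declares the same one."""
    declared = {str(v) for v in info_versions(files).values() if v is not None}
    return len(declared) <= 1


# Hypothetical parsed files; only the Info block matters for this check.
matching = {
    "repertoires.json": {"Info": {"version": "1.3"}, "Repertoire": []},
    "clones.json": {"Info": {"version": "1.3"}, "Clone": []},
    "no_info.json": {"Clone": []},
}
mismatched = {
    "repertoires.json": {"Info": {"version": "1.3"}, "Repertoire": []},
    "clones.json": {"Info": {"version": "1.2"}, "Clone": []},
}

print(versions_compatible(matching))
print(versions_compatible(mismatched))
```

Exact string equality is the simplest possible rule; a real tool might instead treat some versions as mutually compatible.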

If this is unlikely or unimportant, then it is not required. With that said, it doesn't result in much overhead and does provide some utility...

@scharch
Contributor

scharch commented Jul 14, 2020

@bcorrie ok, yeah, thanks

@bcorrie
Contributor Author

bcorrie commented Jul 22, 2020

Does anyone object to me closing this issue? We are starting to work on the data loader, and I would like to get consensus before moving forward... It seems like we have it, with the caveat that if the fields change, the format changes. By consensus I mean that we agree that the file format is JSON, and that at least for now we are OK with having the Info block as part of the format.

@scharch
Contributor

scharch commented Jul 22, 2020

ok with me

@bcorrie
Contributor Author

bcorrie commented Jul 27, 2020

Closing this issue

@bcorrie bcorrie closed this as completed Jul 27, 2020
5 participants