File format for AIRR Clone object? #421
Comments
I can't find it now, but I'm pretty sure we decided it would be JSON/YAML because each
That's my recollection as well. Though, I do think we talked briefly about a TSV without the
OK - we are looking at loading clone data into the iReceptor repositories so they are searchable via the ADC API (see #422) - so loading JSON/YAML it is... So one related question: this would imply that we are assuming (hoping?) that most software tools that produce clones would support this format as a data interchange format?
I'm not sure if this helps/answers, but the most common (and for now the only) pipeline that includes Olmsted also includes a Python utility that reads clone info in the format in which it is produced (by the clone inference tools) and shapes it into the Olmsted schema in JSON format.
When I eventually get around to implementing this in SONAR, it will produce the AIRR-compatible YAML file, yes.
I guess another way of asking this question is: what are the most widely used software tools that generate Clones (in the sense of the AIRR Clone data model)? Since we don't support clones yet, I am far from being an expert on this. For rearrangements we support loading AIRR TSV, IMGT VQuest, IgBLAST (through its AIRR TSV output), and MiXCR. If we only support AIRR JSON Clone files, that will be very limiting in the short term, I think. What other tools should I be considering from a data loading perspective?
@eharkins what pipeline would that be? If there is something that reads output from the clone inference tools, that would probably be very helpful.
@bcorrie clone inference tool (partis) -> clone processing and phylogenetics pipeline (CFT) -> utility to convert into olmsted schema format (olmsted) -> olmsted
@schristley suggested using a JSON object similar to that used in
Where clone_list above is just an Array of JSON So an AIRR Clone file would look like:
In this case the clone_id is unique within a Does that make sense?
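The JSON from the comment above did not survive in this copy of the thread, so the following is only a rough sketch of the general shape being discussed: a top-level object holding an Info block plus an array of clone objects. The top-level key name `Clone`, the `title` text, the helper name `build_clone_file`, and the example field values are assumptions made for illustration, not the format that was actually proposed.

```python
import json


def build_clone_file(clone_list, airr_version="1.3"):
    """Wrap a list of clone dicts in a top-level object with an Info block.

    Hypothetical helper: the "Clone" key and the Info fields used here are
    assumptions for illustration, not the agreed AIRR format.
    """
    return {
        "Info": {
            "title": "AIRR Clone data (illustrative)",
            "version": airr_version,
        },
        "Clone": list(clone_list),
    }


# clone_list is just an array of JSON objects, one per clone.
# The field values below are made up for the example.
clone_list = [
    {"clone_id": "clone_1", "v_call": "IGHV1-2*02"},
    {"clone_id": "clone_2", "v_call": "IGHV3-23*01"},
]

clone_file = build_clone_file(clone_list)
text = json.dumps(clone_file, indent=2)
print(text)
```

A loader could then read the file with `json.loads` and iterate over the clone array, whatever the final key name turns out to be.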
@schristley points out in #422 that from a repository perspective, we would want @schristley mentioned this in relation to both cell_id and clone_id here: #409 (comment). That is, these are ephemeral things that can be modified as needed to ensure uniqueness and link records as required. Does that make sense to people?
Any further thoughts on the file format suggested in: #421 (comment)? We are about to embark on writing a data loader into the iReceptor Turnkey repository for the AIRR Clone format, and I would like to try and get some basic agreement before we go too far... 8-) If I assume the above for now, am I going to regret it later? I think the uniqueness discussion of clone_id is independent of the file format.
It's not clear to me that the clone file is supposed to have its own
Will there be any possibility to group
@scharch: Thanks, then I will move that part of the discussion there.
My question would be whether or not there would likely be a Clone file as a standalone thing and, if so, whether the Info section would be useful. The Info object is mostly about defining version information for the AIRR Spec that is used to define the rest of the file. As such, I think it is essentially independent of use (I don't think there is anything Repertoire specific in it). From the docs section for Repertoire files:
Given that we are using JSON and therefore structured text, the Info block gives a way of ensuring that two files use the same AIRR version (e.g. v1.3), which will potentially be important for tools in the future. It provides a mechanism for one to take a set of Repertoire, Clone, and Rearrangement files and do a sanity check to ensure that the files are compatible. If this is unlikely or unimportant, then it is not required. With that said, it doesn't result in much overhead and does provide some utility...
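As a rough sketch of the sanity check described above, a tool could compare the version recorded in the Info block of each JSON file it is given. The function names, the file names, and the assumption that the version lives at `Info.version` are illustrative only; Rearrangement TSV files are not covered by this sketch.

```python
def info_versions(documents):
    """Map each document name to the version in its Info block (None if absent)."""
    return {
        name: (doc.get("Info") or {}).get("version")
        for name, doc in documents.items()
    }


def check_compatible(documents):
    """Report whether all documents declare the same version in Info.

    Hypothetical helper: assumes each document is an already-parsed JSON
    object with an optional top-level "Info" block holding "version".
    """
    versions = info_versions(documents)
    missing = sorted(name for name, v in versions.items() if v is None)
    distinct = sorted({str(v) for v in versions.values() if v is not None})
    return {
        "compatible": not missing and len(distinct) <= 1,
        "missing_info": missing,
        "versions": distinct,
    }


# Made-up parsed files for illustration.
docs = {
    "repertoires.json": {"Info": {"version": "1.3"}, "Repertoire": []},
    "clones.json": {"Info": {"version": "1.3"}, "Clone": []},
}
result_ok = check_compatible(docs)

docs_mixed = dict(docs, **{"old_clones.json": {"Info": {"version": "1.2"}, "Clone": []}})
result_mixed = check_compatible(docs_mixed)

print(result_ok["compatible"], result_mixed["compatible"])
```

Files without an Info block are reported separately rather than silently treated as compatible, which is one possible design choice among several.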
@bcorrie ok, yeah, thanks
Anyone object to me closing this issue? We are starting to work on the data loader, and I would like to get consensus before moving forward... It seems like we have it, with the caveat that if the fields change, the format changes. By consensus I meant that we agree that the file format is JSON, and at least for now we are OK with having the Info block as part of the format?
ok with me
Closing this issue
Have we come to a consensus as to what the file format for the Clone object should be? I know @eharkins has done some validation on the AIRR Clone data model; did you use the JSON format for the object in that validation? I think so, as I recall some discussion about JSON in #323. This is also relevant to validation in the Python and R libraries in #328.
I suppose I am wondering if we intend to support a TSV file format for Clones as well, as a compact/simple format. I believe the Clone object, at least, is a very flat, simple object, much like rearrangements. Are we intending to support a Clone TSV file? Did I miss such a discussion? 8-)