Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FASTA descriptions #857

Open
theosanderson opened this issue Jul 10, 2024 · 6 comments
Open

FASTA descriptions #857

theosanderson opened this issue Jul 10, 2024 · 6 comments

Comments

@theosanderson
Copy link

theosanderson commented Jul 10, 2024

(Assuming that it doesn't) it would be nice if LAPIS could support returning FASTAs with "descriptions", e.g. https://www.ncbi.nlm.nih.gov/nuccore/NC_045512.2?report=fasta has both an accession and then a description, separated by a space.

image

There are various implementations possible, from hardcoding the description field to allowing it to be selected in the query.

loculus-project/loculus#2284 (comment)

@corneliusroemer
Copy link
Contributor

corneliusroemer commented Jul 10, 2024

I'll write a primer on Fasta header specs/grammar, so that whoever picks this up doesn't need to do the research themself.

Per the BioPython docs, it appears that id is the beginning of the header string until the first space. The rest is parsed into description. name seems to not be present in fasta files:

from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.Alphabet import IUPAC
record = SeqRecord(Seq("MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF",
                        IUPAC.protein),
                   id="YP_025292.1", name="HokC",
                   description="toxic membrane protein")

print(record.format("fasta"))
>YP_025292.1 toxic membrane protein
MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF

From https://biopython.org/wiki/SeqRecord

The most authoritative "reference implementation" seems to be Bio::SeqIO::fasta: https://metacpan.org/pod/Bio::SeqIO::fasta

Relevant implementation:
https://github.com/bioperl/bioperl-live/blob/9ce0d304f42ef3a6808e7e94be42cb56bb51068d/lib/Bio/SeqIO/fasta.pm#L281-L307

This is the grammar I (with the help of ChatGPT, but verifying correctness) deduced for fasta headers:

fasta_header ::= ">" identifier ( " " description )? "\n"
identifier    ::= ( printable_char_no_space )+
description   ::= ( printable_char - "\n" )*

printable_char_no_space ::= ? all printable ASCII characters except space ?
printable_char          ::= ? all printable ASCII characters ?

Biopython uses the same grammar, as far as I can tell: https://github.com/biopython/biopython/blob/af00cf6c79887d80df8673e5cacde0786415ce34/Bio/SeqIO/FastaIO.py

@fengelniederhammer
Copy link
Contributor

In my impression, FASTA headers are more or less arbitrary. If we want to implement this feature, I would propose to aim for a general solution that is flexible and allows the use case that Loculus aims for.

Maybe the following approach makes sense:

  • add parameters to /sample/alignedNucleotideSequences, /sample/unalignedNucleotideSequences and /sample/alignedAminoAcidSequences
    • fields. That would create a unified API with /details and /aggregated. Default: [<primaryKey>].
    • delimiter. Default: /.
  • Pass on the fields parameter to the corresponding SILO actions.
  • SILO then returns the sequences + the requested fields (in the usual NDJSON format)
  • LAPIS maps the JSON lines to fasta format, computing the FASTA headers from the fields (in the user-supplied order) and concatenating them with delimiter.

Open question: What happens if the primary key is not in the fields?

  1. silently add it (to the beginning? to the end?)
  2. ignore it and leave the user with potentially useless fasta files
  3. throw an error

Probably 2. is more reasonable, since multiple metadata fields might be unique enough to be useful. That would give the user the flexibility to query for other unique fields that are not the primary key.

@corneliusroemer
Copy link
Contributor

corneliusroemer commented Jul 10, 2024

As I've commented elsewhere, in order to allow semantic parsing of the header into fasta id and description, there needs to be a whitespace after the id (and the id must not contain whitespace).

There are two options here:

  • a) Try to keep the semantics of fasta headers, so that people parsing with BioPerl or BioPython are happy: this would mean one can't just concatenate fields as proposed above. One option would be to always output the primary key as the id, and make fields be the description
  • b) Don't bother with fasta specs re id and description, and treat the header as a single line of csv with customizable separator

I think @fengelniederhammer proposes option b, but we should probably consciously decide between the two, as option a has the advantage of being more spec compliant.

To make it more concrete:
Using the following query: /sample/unalignedNucleotideSequence?fields=displayName,releaseDate&fieldSeparator=| looks like this
for option a):

>LOC0123.1 |Germany/LOC0123.1/2023-01-04|2024-06-14
or
>LOC0123.1 Germany/LOC0123.1/2023-01-04|2024-06-14

for option b):

>LOC0123.1|Germany/LOC0123.1/2023-01-04|2024-06-14

Option b could result in surprising parse results using Bio(Python|Perl) in some cases, e.g. when there's spaces in the fields:

>LOC0123.1|Germany/LOC0123.1/2023-01-04|Homo sapiens

would parse as:

id='LOC0123.1|Germany/LOC0123.1/2023-01-04|Homo'
description=sapiens

Whereas option a:

>LOC0123.1 |Germany/LOC0123.1/2023-01-04|Homo sapiens
or
>LOC0123.1 Germany/LOC0123.1/2023-01-04|Homo sapiens

would parse as one would expect:

id='LOC0123.1'
description='|Germany/LOC0123.1/2023-01-04|Homo sapiens'
or
description='Germany/LOC0123.1/2023-01-04|Homo sapiens'

@chaoran-chen
Copy link
Member

Thank you very much for the suggestion! I fully agree that this would be a useful feature.

Timeline-wise: I think that this is not trivial to implement, at least not if it should be efficient (which is one of the core values for LAPIS). Given that our focus at the moment is to polish SILO and prepare the official public launch, I don't think that we will work on this issue in the upcoming 2-3 months.

@corneliusroemer
Copy link
Contributor

Gentle reminder that this would be really good to have

@fengelniederhammer
Copy link
Contributor

LAPIS is now able to return sequences as JSON and NDJSON (#971, as of 0.3.6). Maybe this works as a workaround in the meantime? You could fetch the sequences and the /details as JSON in Loculus and build the FASTA header yourself?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants