FASTA descriptions #857

theosanderson · 2024-07-10T15:37:20Z

(Assuming that it doesn't) it would be nice if LAPIS could support returning FASTAs with "descriptions", e.g. https://www.ncbi.nlm.nih.gov/nuccore/NC_045512.2?report=fasta has both an accession and then a description, separated by a space.

There are various implementations possible, from hardcoding the description field to allowing it to be selected in the query.

loculus-project/loculus#2284 (comment)

corneliusroemer · 2024-07-10T16:43:17Z

I'll write a primer on Fasta header specs/grammar, so that whoever picks this up doesn't need to do the research themself.

Per the BioPython docs, it appears that id is the beginning of the header string until the first space. The rest is parsed into description. name seems to not be present in fasta files:

from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.Alphabet import IUPAC
record = SeqRecord(Seq("MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF",
                        IUPAC.protein),
                   id="YP_025292.1", name="HokC",
                   description="toxic membrane protein")

print(record.format("fasta"))

>YP_025292.1 toxic membrane protein
MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF

From https://biopython.org/wiki/SeqRecord

The most authoritative "reference implementation" seems to be Bio::SeqIO::fasta: https://metacpan.org/pod/Bio::SeqIO::fasta

Relevant implementation:
https://github.com/bioperl/bioperl-live/blob/9ce0d304f42ef3a6808e7e94be42cb56bb51068d/lib/Bio/SeqIO/fasta.pm#L281-L307

This is the grammar I (with the help of ChatGPT, but verifying correctness) deduced for fasta headers:

fasta_header ::= ">" identifier ( " " description )? "\n"
identifier    ::= ( printable_char_no_space )+
description   ::= ( printable_char - "\n" )*

printable_char_no_space ::= ? all printable ASCII characters except space ?
printable_char          ::= ? all printable ASCII characters ?

Biopython uses the same grammar, as far as I can tell: https://github.com/biopython/biopython/blob/af00cf6c79887d80df8673e5cacde0786415ce34/Bio/SeqIO/FastaIO.py

fengelniederhammer · 2024-07-10T17:45:54Z

In my impression, FASTA headers are more or less arbitrary. If we want to implement this feature, I would propose to aim for a general solution that is flexible and allows the use case that Loculus aims for.

Maybe the following approach makes sense:

add parameters to /sample/alignedNucleotideSequences, /sample/unalignedNucleotideSequences and /sample/alignedAminoAcidSequences
- fields. That would create a unified API with /details and /aggregated. Default: [<primaryKey>].
- delimiter. Default: /.
Pass on the fields parameter to the corresponding SILO actions.
SILO then returns the sequences + the requested fields (in the usual NDJSON format)
LAPIS maps the JSON lines to fasta format, computing the FASTA headers from the fields (in the user-supplied order) and concatenating them with delimiter.

Open question: What happens if the primary key is not in the fields?

silently add it (to the beginning? to the end?)
ignore it and leave the user with potentially useless fasta files
throw an error

Probably 2. is more reasonable, since multiple metadata fields might be unique enough to be useful. That would give the user the flexibility to query for other unique fields that are not the primary key.

corneliusroemer · 2024-07-10T17:55:01Z

As I've commented elsewhere, in order to allow semantic parsing of the header into fasta id and description, there needs to be a whitespace after the id (and the id must not contain whitespace).

There are two options here:

a) Try to keep the semantics of fasta headers, so that people parsing with BioPerl or BioPython are happy: this would mean one can't just concatenate fields as proposed above. One option would be to always output the primary key as the id, and make fields be the description
b) Don't bother with fasta specs re id and description, and treat the header as a single line of csv with customizable separator

I think @fengelniederhammer proposes option b, but we should probably consciously decide between the two, as option a has the advantage of being more spec compliant.

To make it more concrete:
Using the following query: /sample/unalignedNucleotideSequence?fields=displayName,releaseDate&fieldSeparator=| looks like this
for option a):

>LOC0123.1 |Germany/LOC0123.1/2023-01-04|2024-06-14
or
>LOC0123.1 Germany/LOC0123.1/2023-01-04|2024-06-14

for option b):

>LOC0123.1|Germany/LOC0123.1/2023-01-04|2024-06-14

Option b could result in surprising parse results using Bio(Python|Perl) in some cases, e.g. when there's spaces in the fields:

>LOC0123.1|Germany/LOC0123.1/2023-01-04|Homo sapiens

would parse as:

id='LOC0123.1|Germany/LOC0123.1/2023-01-04|Homo'
description=sapiens

Whereas option a:

>LOC0123.1 |Germany/LOC0123.1/2023-01-04|Homo sapiens
or
>LOC0123.1 Germany/LOC0123.1/2023-01-04|Homo sapiens

would parse as one would expect:

id='LOC0123.1'
description='|Germany/LOC0123.1/2023-01-04|Homo sapiens'
or
description='Germany/LOC0123.1/2023-01-04|Homo sapiens'

chaoran-chen · 2024-07-11T11:46:08Z

Thank you very much for the suggestion! I fully agree that this would be a useful feature.

Timeline-wise: I think that this is not trivial to implement, at least not if it should be efficient (which is one of the core values for LAPIS). Given that our focus at the moment is to polish SILO and prepare the official public launch, I don't think that we will work on this issue in the upcoming 2-3 months.

corneliusroemer · 2024-10-21T13:13:50Z

Gentle reminder that this would be really good to have

fengelniederhammer · 2024-10-21T13:28:09Z

LAPIS is now able to return sequences as JSON and NDJSON (#971, as of 0.3.6). Maybe this works as a workaround in the meantime? You could fetch the sequences and the /details as JSON in Loculus and build the FASTA header yourself?

corneliusroemer mentioned this issue Aug 22, 2024

ENH: Add display name after sequence id in downloads, or allow customization of fasta id, so it can contain metadata or display name loculus-project/loculus#2284

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

FASTA descriptions #857

FASTA descriptions #857

theosanderson commented Jul 10, 2024 •

edited

Loading

corneliusroemer commented Jul 10, 2024 •

edited

Loading

fengelniederhammer commented Jul 10, 2024

corneliusroemer commented Jul 10, 2024 •

edited

Loading

chaoran-chen commented Jul 11, 2024

corneliusroemer commented Oct 21, 2024

fengelniederhammer commented Oct 21, 2024

FASTA descriptions #857

FASTA descriptions #857

Comments

theosanderson commented Jul 10, 2024 • edited Loading

corneliusroemer commented Jul 10, 2024 • edited Loading

fengelniederhammer commented Jul 10, 2024

corneliusroemer commented Jul 10, 2024 • edited Loading

chaoran-chen commented Jul 11, 2024

corneliusroemer commented Oct 21, 2024

fengelniederhammer commented Oct 21, 2024

theosanderson commented Jul 10, 2024 •

edited

Loading

corneliusroemer commented Jul 10, 2024 •

edited

Loading

corneliusroemer commented Jul 10, 2024 •

edited

Loading