FASTA descriptions #857
I'll write a primer on FASTA header specs/grammar, so that whoever picks this up doesn't need to do the research themselves. Per the BioPython docs, it appears that
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
# Note: Bio.Alphabet was removed in Biopython 1.78; on newer versions,
# drop this import and the IUPAC.protein argument below.
from Bio.Alphabet import IUPAC
record = SeqRecord(
    Seq("MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF", IUPAC.protein),
    id="YP_025292.1",
    name="HokC",
    description="toxic membrane protein",
)
# Prints the record as FASTA: a ">" header line followed by the sequence
print(record.format("fasta"))
From https://biopython.org/wiki/SeqRecord
The most authoritative "reference implementation" seems to be
Relevant implementation:
This is the grammar I deduced for FASTA headers (with the help of ChatGPT, but verifying correctness):
fasta_header ::= ">" identifier ( " " description )? "\n"
identifier ::= ( printable_char_no_space )+
description ::= ( printable_char - "\n" )*
printable_char_no_space ::= ? all printable ASCII characters except space ?
printable_char ::= ? all printable ASCII characters ?
Biopython uses the same grammar, as far as I can tell: https://github.com/biopython/biopython/blob/af00cf6c79887d80df8673e5cacde0786415ce34/Bio/SeqIO/FastaIO.py
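As a rough illustration of the grammar above, here is a minimal Python sketch that splits one header line into identifier and description. The name `parse_fasta_header` is made up for this example, and this is not Biopython's code; it only mirrors the grammar as written in this comment.

```python
import re

# identifier: one or more printable ASCII characters except space (0x21-0x7E)
# description: zero or more printable ASCII characters (0x20-0x7E)
_HEADER_RE = re.compile(r">([\x21-\x7e]+)(?: ([\x20-\x7e]*))?")


def parse_fasta_header(line):
    """Return (identifier, description) for one FASTA header line.

    Hypothetical helper; description is "" when the header has none.
    """
    match = _HEADER_RE.fullmatch(line.rstrip("\n"))
    if match is None:
        raise ValueError(f"not a valid FASTA header: {line!r}")
    identifier, description = match.groups()
    return identifier, description or ""


if __name__ == "__main__":
    print(parse_fasta_header(">YP_025292.1 toxic membrane protein\n"))
```

Only the first space separates identifier from description, so any further spaces end up inside the description.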
My impression is that FASTA headers are more or less arbitrary. If we want to implement this feature, I would propose aiming for a general solution that is flexible and supports the use case that Loculus is aiming for. Maybe the following approach makes sense:
Open question: What happens if the primary key is not in the
Probably 2. is more reasonable, since multiple metadata fields might be unique enough to be useful. That would give the user the flexibility to query for other unique fields that are not the primary key.
As I've commented elsewhere, in order to allow semantic parsing of the header into fasta There are two options here:
I think @fengelniederhammer proposes option b, but we should probably consciously decide between the two, as option a has the advantage of being more spec compliant. To make it more concrete:
for option b):
Option b could result in surprising parse results using Bio(Python|Perl) in some cases, e.g. when there are spaces in the fields:
would parse as:
Whereas option a:
would parse as one would expect:
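To make the pitfall with spaces concrete, here is a small sketch of two header layouts. The field values and the `|` separator are made up for illustration (they are not the examples from the comment above), and `split_header` is a hypothetical helper that only mimics the split at the first space described by the grammar earlier in this thread; it does not call Bio(Python|Perl).

```python
def split_header(header):
    """Split a header (without the leading '>') at the first space."""
    identifier, _, description = header.partition(" ")
    return identifier, description


# Hypothetical metadata fields; the second value contains a space.
fields = ["ACC_0001", "South Africa", "2021-03-04"]

# Layout 1: all fields packed into one delimited string.
packed = "|".join(fields)
print(split_header(packed))
# -> ('ACC_0001|South', 'Africa|2021-03-04'), the identifier is cut mid-field

# Layout 2: primary key as identifier, everything else after the first space.
separated = fields[0] + " " + "|".join(fields[1:])
print(split_header(separated))
# -> ('ACC_0001', 'South Africa|2021-03-04')
```

In the second layout, spaces inside metadata values are harmless because everything after the first space already belongs to the description.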
Thank you very much for the suggestion! I fully agree that this would be a useful feature. Timeline-wise: I think that this is not trivial to implement, at least not if it should be efficient (which is one of the core values for LAPIS). Given that our focus at the moment is to polish SILO and prepare the official public launch, I don't think that we will work on this issue in the upcoming 2-3 months.
Gentle reminder that this would be really good to have
LAPIS is now able to return sequences as JSON and NDJSON (#971, as of 0.3.6). Maybe this works as a workaround in the meantime? You could fetch the sequences and the
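A client-side version of this workaround could look roughly like the sketch below, which turns NDJSON text into FASTA with descriptions. It makes no request to LAPIS, and the field names (`accession`, `sequence`, `country`) are assumptions for illustration only; the actual response fields may differ. `ndjson_to_fasta` is a hypothetical helper, not part of LAPIS.

```python
import json


def ndjson_to_fasta(ndjson_text, id_field, sequence_field, description_field):
    """Build FASTA text from NDJSON records; field names are caller-supplied."""
    lines = []
    for raw in ndjson_text.splitlines():
        if not raw.strip():
            continue  # skip blank lines
        record = json.loads(raw)
        header = ">" + str(record[id_field])
        description = record.get(description_field)
        if description:
            # identifier and description are separated by a single space
            header += " " + str(description)
        lines.append(header)
        lines.append(record[sequence_field])
    return "".join(line + "\n" for line in lines)


# Made-up records; real LAPIS responses may use different field names.
example = (
    '{"accession": "ACC_0001", "sequence": "ACGT", "country": "Switzerland"}\n'
    '{"accession": "ACC_0002", "sequence": "TTGA", "country": null}\n'
)
print(ndjson_to_fasta(example, "accession", "sequence", "country"), end="")
```

Records without a value for the description field simply get a header with the identifier only.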
(Assuming that it doesn't) it would be nice if LAPIS could support returning FASTAs with "descriptions", e.g. https://www.ncbi.nlm.nih.gov/nuccore/NC_045512.2?report=fasta has both an accession and then a description, separated by a space.
There are various implementations possible, from hardcoding the description field to allowing it to be selected in the query.
loculus-project/loculus#2284 (comment)