New source: SILVA taxonomy #348
Codecov Report
Coverage diff, main vs. #348 (base values unavailable):
Coverage: 51.95%
Files: 187
Lines: 12168
Branches: 1857
Hits: 6322
Misses: 5607
Partials: 239
src/pyobo/sources/silva.py
Outdated
    reference=Reference(prefix="ena.embl", identifier=accession, name=organism)
)
# Do NOT annotate the new term with a rank (leave it unranked).
new_term.append_parent(Reference(prefix=PREFIX, identifier=species_taxon_id))
Don't ENA terms represent nucleotide sequences derived from experiments? Can they also represent projects?
From what I understand, they aren't actually themselves representing taxa. Therefore this parent/child relationship doesn't make sense.
The hard work of making a PyOBO source is really understanding what relationship SILVA means when it mentions its internal taxonomy and ENA sequences. I can't do this hard work for you in detail, but from a high level it seems like the sequence was derived from an individual of the taxon.
Then, there are two options:
- Find an existing RO relationship that is appropriate for this. Maybe http://purl.obolibrary.org/obo/RO_0001001, even though it's not a perfect ontological fit. Maybe OBI is a better place to look.
- Mint an ad-hoc one yourself within the scope of this file, e.g., like in
pyobo/src/pyobo/sources/clinicaltrials.py
Line 23 in ada760b
HAS_INTERVENTION = TypeDef(
If you go the second route, make sure that you do a good job describing what the relationship means (in a concise way).
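To make the distinction concrete, here is a minimal, library-agnostic sketch of linking a sequence record to a taxon through a typed relation instead of a parent/child link. It deliberately does not use PyOBO's API: `Ref`, `Record`, the relation label, and both identifiers are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass(frozen=True)
class Ref:
    """A minimal CURIE-like reference (hypothetical, not PyOBO's Reference)."""

    prefix: str
    identifier: str

    @property
    def curie(self) -> str:
        return f"{self.prefix}:{self.identifier}"


@dataclass
class Record:
    """A term whose is-a parents are kept separate from other typed relations."""

    ref: Ref
    parents: List[Ref] = field(default_factory=list)
    relations: List[Tuple[str, Ref]] = field(default_factory=list)


# Hypothetical relation label; a real typedef would need a concise definition.
DERIVED_FROM_TAXON = "sequence derived from individual of taxon"

taxon = Ref("silva.taxon", "12345")  # made-up identifier
sequence = Record(ref=Ref("ena.embl", "AB000001"))  # made-up accession

# Link the sequence to the taxon with a typed relation, not as a child taxon.
sequence.relations.append((DERIVED_FROM_TAXON, taxon))
```

With this shape the sequence never appears in the taxonomic hierarchy (`sequence.parents` stays empty), which is the point of the objection above.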
Thank you for your detailed feedback. I completely understand where the hard work lies, and I truly appreciate the guidance you provided. Your suggestions, either reusing an existing RO relationship (like RO_0001001) or minting an ad-hoc one (as in clinicaltrials.py), are exactly the direction I was hoping for.
I’ll explore those options further. Alternatively, I might start by representing only down to the genus level (as shown in the taxonomy files) until I fully understand the nuances of the lower levels.
Thanks again for steering this work in the right direction!
I added an example ad-hoc typedef that you can fill in (or duplicate) for your purposes.
src/pyobo/sources/silva.py
Outdated
logger.setLevel(logging.WARNING)

TYPEDEF = TypeDef(
    reference=default_reference(PREFIX, "fixme", name="fixme"),
fixme here
Summary of Changes
Summary
This pull request implements the SILVA taxonomy as an ontology converter. The module converts SILVA small subunit (SSU) taxonomy data into OBO (and OWL) format and is organized in a style similar to the GTDB module.
Key Decisions and Implementation Details
Internal SILVA Taxonomy IDs:
SSU vs. LSU Taxonomy:
"""Convert SILVA small subunit (ssu) taxonomy to OBO format."""
Handling ENA Accession Numbers:
[...] TAXRANK:0000006 (species) to all ENA entries.
Version Introduced Information:
Code Organization
Module-Level Structure:
PREFIX is defined as "silva.taxon".
Rank Mapping:
SILVA_RANK_TO_TAXRANK covers all SILVA taxonomic ranks.
Main Processing Steps:
Main Taxonomy File:
[...] ";" (ignoring empty strings). [...] "Bacteria" or "Actinomycetota"), and the parent is determined by joining all but the last element.
Taxmap File:
[...] ena.embl: [...].
I welcome any feedback or suggestions on URL handling, inclusion of the version introduced field, or any other aspect of the implementation.
Please let me know if further details are needed.
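For reference, the main taxonomy file step (split the path on ";", ignore empty strings, take the last element as the name, join the rest to get the parent) can be sketched as follows. This is an illustration and not the code in silva.py: the function name `split_lineage` is hypothetical, and re-joining the parent path with ";" and returning `None` for a root are assumptions.

```python
from typing import Optional, Tuple


def split_lineage(path: str) -> Tuple[str, Optional[str]]:
    """Return (name, parent_path) for a ';'-delimited lineage string."""
    # Split on ";" and drop empty strings, e.g. from a trailing delimiter.
    parts = [part for part in path.split(";") if part]
    if not parts:
        raise ValueError(f"empty lineage: {path!r}")
    name = parts[-1]
    # Assumption: the parent key is the remaining elements re-joined with ";".
    parent = ";".join(parts[:-1]) or None  # None for a root such as "Bacteria"
    return name, parent


print(split_lineage("Bacteria;"))  # ('Bacteria', None)
print(split_lineage("Bacteria;Actinomycetota;"))  # ('Actinomycetota', 'Bacteria')
```

Keying parents by the joined path rather than by name alone avoids collisions if the same name ever appears under two different lineages.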