New source: SILVA taxonomy #348

jplfaria · 2025-02-12T06:35:47Z

Summary

This pull request adds a converter for the SILVA taxonomy. The module converts SILVA small subunit (SSU) taxonomy data into OBO (and OWL) format and follows the organization of the GTDB module.

Key Decisions and Implementation Details

Internal SILVA Taxonomy IDs:
- As discussed in #1306, the internal SILVA taxonomy ID URLs do not resolve externally.
- For example, I used URLs such as:
```
https://www.arb-silva.de/no_cache/download/archive/current/Exports/taxonomy/10081
```
  with the idea that navigating to:
```
https://www.arb-silva.de/no_cache/download/archive/current/Exports/taxonomy/
```
  will allow users to locate the files by internal IDs.
- Note: I am open to alternative suggestions (e.g. using URLs that resolve directly to bioregistry.io).
SSU vs. LSU Taxonomy:
- This module is specific to the SILVA small subunit (SSU) taxonomy.
- The large subunit (LSU) taxonomy is distinct and should be handled separately.
- The module’s docstring clearly states:
```
"""Convert SILVA small subunit (ssu) taxonomy to OBO format."""
```
Handling ENA Accession Numbers:
- At the lowest taxonomy level, SILVA uses accession numbers that resolve to ENA entries.
- Initial Approach:
  - I initially implemented these as cross-references (xrefs) at the genus level (i.e., each ENA accession was added as an xref within the genus term).
  - Example (OBO):
```
[Term]
id: silva.taxon:58060
name: Angustibacter
xref: ena.embl:AB234237 ! uncultured bacterium
xref: ena.embl:AB512285 ! Angustibacter luteus
is_a: silva.taxon:58059
property_value: TAXRANK:1000000 TAXRANK:0000005 ! has rank genus
```
- Revised Approach:
  - I then decided to represent each ENA accession as its own term, with a parent association determined using the taxmap file (see below).
  - Example (OWL snippet):
```
<owl:Class rdf:about="https://www.ebi.ac.uk/ena/browser/view/KP324679">
    <rdfs:subClassOf rdf:resource="https://www.arb-silva.de/fileadmin/silva_databases/current/Exports/taxonomy/47493"/>
    <oboInOwl:id>ena.embl:KP324679</oboInOwl:id>
    <rdfs:label>Fanellia korema</rdfs:label>
</owl:Class>
```
- Rank Annotation for ENA Terms:
  - Originally, I assigned TAXRANK:0000006 (species) to all ENA entries.
  - However, because an ENA entry may represent either a species or a strain, I decided not to assign any rank to the ENA-derived terms.
Version Introduced Information:
- The main SILVA taxonomy file includes a column with the version in which a given taxon ID was introduced.
- Example in OBO Format:
```
[Term]
id: silva.taxon:58060
name: Angustibacter
property_value: version_introduced "138.2"
...
```
- I am open to discussion on whether to include this field by default: it could be useful, but it may also confuse users.

Code Organization

Module-Level Structure:
- A global constant PREFIX is defined as "silva.taxon".
Rank Mapping:
- The dictionary SILVA_RANK_TO_TAXRANK covers all SILVA taxonomic ranks.
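
As a rough illustration of the rank mapping described above, here is a minimal sketch. It shows only a subset of ranks, not the module's full table. Only the genus entry (TAXRANK:0000005) and the TAXRANK:1000000 "has rank" property appear in this thread; the other identifiers and the helper `rank_property` are assumptions for illustration and should be checked against TAXRANK.

```python
from typing import Optional

# Illustrative subset only; the real SILVA_RANK_TO_TAXRANK covers all SILVA ranks.
# Only the genus entry is confirmed by the OBO example in this PR description.
SILVA_RANK_TO_TAXRANK = {
    "phylum": "TAXRANK:0000001",
    "class": "TAXRANK:0000002",
    "order": "TAXRANK:0000003",
    "family": "TAXRANK:0000004",
    "genus": "TAXRANK:0000005",
}


def rank_property(rank: str) -> Optional[str]:
    """Return an OBO property_value line for a SILVA rank (hypothetical helper).

    Returns None when the rank has no mapping, so the caller can leave the
    term unranked instead of guessing.
    """
    key = rank.strip().lower()
    taxrank = SILVA_RANK_TO_TAXRANK.get(key)
    if taxrank is None:
        return None
    return f"property_value: TAXRANK:1000000 {taxrank} ! has rank {key}"


print(rank_property("genus"))
```

Returning `None` for unmapped ranks mirrors the decision above to leave ENA-derived terms unranked.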
Main Processing Steps:
1. Main Taxonomy File:
  - Each row is split on ";" (ignoring empty strings).
  - The term’s name is set to the last element (e.g., "Bacteria" or "Actinomycetota"), and the parent is determined by joining all but the last element.
2. Taxmap File:
  - A new term is created for each ENA accession with prefix ena.embl:.
  - These terms are linked as children of the corresponding main taxonomy term, but no rank is assigned to them.
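
Step 1 above can be sketched roughly as follows. This is not the code in silva.py. It assumes a tab-separated row whose columns are path, taxon ID, rank, remark, and release; that column order, the `TaxonRow` type, the function name, and the taxon IDs in the usage lines are all assumptions for illustration.

```python
from typing import NamedTuple, Optional


class TaxonRow(NamedTuple):
    """One parsed row of the main taxonomy file (hypothetical structure)."""

    name: str
    parent_path: Optional[str]
    taxon_id: str
    rank: str
    version_introduced: Optional[str]


def parse_taxonomy_line(line: str) -> TaxonRow:
    """Split a row into its name, parent path, ID, rank, and release."""
    # Assumed column order: path, taxon ID, rank, remark, release.
    columns = line.rstrip("\n").split("\t")
    columns += [""] * (5 - len(columns))  # pad rows with missing trailing columns
    path, taxon_id, rank, _remark, release = columns[:5]

    # Split on ";" and ignore empty strings (paths end with a trailing ";").
    parts = [part for part in path.split(";") if part]
    if not parts:
        raise ValueError(f"row has an empty taxonomy path: {line!r}")

    name = parts[-1]  # the term's name is the last element
    parent_path = ";".join(parts[:-1]) or None  # None for a root taxon
    return TaxonRow(name, parent_path, taxon_id, rank, release or None)


# The taxon IDs below are made up for the example.
print(parse_taxonomy_line("Bacteria;\t3\tdomain\t\t"))
print(parse_taxonomy_line("Bacteria;Actinomycetota;\t11\tphylum\t\t138.2"))
```

Looking up the parent by its joined path requires a path-to-ID dictionary built from the rows already parsed, so parents need to be seen before their children or resolved in a second pass.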

I welcome any feedback or suggestions on URL handling, inclusion of the version introduced field, or any other aspect of the implementation.

Please let me know if further details are needed.

codecov · 2025-02-12T06:38:48Z

Codecov Report

Attention: Patch coverage is 33.82353% with 45 lines in your changes missing coverage. Please review.

Please upload report for BASE (main@ada760b).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/pyobo/sources/silva.py | 32.83% | 45 Missing ⚠️ |

Additional details and impacted files

@@           Coverage Diff           @@
##             main     #348   +/-   ##
=======================================
  Coverage        ?   51.95%           
=======================================
  Files           ?      187           
  Lines           ?    12168           
  Branches        ?     1857           
=======================================
  Hits            ?     6322           
  Misses          ?     5607           
  Partials        ?      239

cthoyt · 2025-02-13T11:59:08Z

src/pyobo/sources/silva.py

+                reference=Reference(prefix="ena.embl", identifier=accession, name=organism)
+            )
+            # Do NOT annotate the new term with a rank (leave it unranked).
+            new_term.append_parent(Reference(prefix=PREFIX, identifier=species_taxon_id))


Don't ENA terms represent nucleotide sequences derived from experiments? Can they also represent projects?

From what I understand, they aren't actually themselves representing taxa. Therefore this parent/child relationship doesn't make sense.

The hard work of making a PyOBO source is really understanding what relationship SILVA means when it links its internal taxonomy to ENA sequences. I can't do this hard work for you in detail, but from a high level it seems like the sequence was derived from an individual of the taxon.

Then, there are two options:

1. Find an existing RO relationship that is appropriate for this. Maybe http://purl.obolibrary.org/obo/RO_0001001, even though it's not a perfect ontological fit. Maybe OBI is a better place to look.
2. Mint an ad-hoc one yourself within the scope of this file, e.g., like in pyobo/src/pyobo/sources/clinicaltrials.py (line 23 in ada760b):

HAS_INTERVENTION = TypeDef(

If you go the second route, make sure that you do a good job describing what the relationship means (in a concise way).

jplfaria · 2025-02-13

Thank you for your detailed feedback. I understand where the hard work lies, and I appreciate the guidance. Both of your suggestions (reusing an existing RO relationship such as RO_0001001, or minting an ad-hoc one as in clinicaltrials.py) are the kind of direction I was hoping for.

I’ll explore those options further. Alternatively, I might start by representing only down to the genus level (as shown in the taxonomy files) until I fully understand the nuances of the lower levels.

Thanks again for steering this work in the right direction!

cthoyt

I added an example ad-hoc typedef that you can fill in (or duplicate) for your purposes

cthoyt · 2025-02-16T11:12:47Z

src/pyobo/sources/silva.py

+logger.setLevel(logging.WARNING)
+
+TYPEDEF = TypeDef(
+    reference=default_reference(PREFIX, "fixme", name="fixme"),


jplfaria · 2025-02-20T04:46:51Z

Summary of Changes

New TypeDef Added:
Introduced HAS_TAXONOMIC_CLASSIFICATION to capture the relationship between an ENA accession (representing a genome sequence) and the taxonomic classification assigned by SILVA.
Rationale:
A parent/child relationship implies a fixed hierarchical level. The new typedef instead reflects that SILVA can classify sequences to varying levels (often down to genus, but sometimes only to higher ranks), which better represents the flexible nature of SILVA's taxonomic assignments.

Implementation Details:
The new typedef is defined as follows:

HAS_TAXONOMIC_CLASSIFICATION = TypeDef(
    reference=default_reference(PREFIX, "has_taxonomic_classification", name="has taxonomic classification"),
    definition="Indicates that the genome sequence represented by an ENA accession is classified under this taxon by SILVA.",
    is_metadata_tag=True,
)
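
To make the change concrete, here is a minimal sketch of what an ENA-derived stanza could look like once the `is_a` parent is replaced by the new relation. It is plain string formatting, not PyOBO's writer, and the exact identifier PyOBO emits for the typedef is not shown in this thread, so `RELATION` and `render_sequence_stanza` are placeholders. The accession, organism name, and taxon ID are taken from the earlier xref example.

```python
from typing import Optional

# Placeholder for the typedef's identifier; PyOBO's actual serialization may differ.
RELATION = "has_taxonomic_classification"


def render_sequence_stanza(
    accession: str, taxon_id: str, organism: Optional[str] = None
) -> str:
    """Render an OBO-like stanza linking an ENA accession to a SILVA taxon.

    The link is written as a property_value line (the typedef is declared
    with is_metadata_tag=True) rather than as an is_a parent, so the
    sequence is not asserted to be a subclass of the taxon.
    """
    lines = ["[Term]", f"id: ena.embl:{accession}"]
    if organism:
        lines.append(f"name: {organism}")
    lines.append(f"property_value: {RELATION} silva.taxon:{taxon_id}")
    return "\n".join(lines)


print(render_sequence_stanza("AB512285", "58060", "Angustibacter luteus"))
```

Because the target can be a taxon of any rank, the same stanza shape works whether SILVA classified the sequence to genus or only to a higher level.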

adding module for silva taxonomy

d267ea4

cthoyt added 3 commits February 13, 2025 12:45

Run ruff

205fad3

Update silva.py

b99644a

Merge branch 'main' into pr/348

9ff0766

cthoyt reviewed Feb 13, 2025

cthoyt changed the title from "adding module for silva taxonomy" to "New source: SILVA taxonomy" Feb 13, 2025

cthoyt added the "Nomenclature Data" label (nomenclature to convert to OBO) Feb 13, 2025

Add typedef example

a159f4a

cthoyt requested changes Feb 16, 2025

cthoyt added 2 commits February 16, 2025 10:22

Update silva.py

09a07e9

Update __init__.py

68ca239

cthoyt reviewed Feb 16, 2025

Replace parent relationship with HAS_TAXONOMIC_CLASSIFICATION typedef

b4fa6e3
