New source: SILVA taxonomy #348

Open: 8 commits to be merged into base: main


jplfaria
Contributor

Summary

This pull request adds a new source that converts the SILVA small subunit (SSU) taxonomy into OBO (and OWL) format. The module is organized in a style similar to the GTDB module.


Key Decisions and Implementation Details

  • Internal SILVA Taxonomy IDs:

    • As discussed in #1306, the internal SILVA taxonomy ID URLs do not resolve externally.
    • For example, I used URLs such as:
      https://www.arb-silva.de/no_cache/download/archive/current/Exports/taxonomy/10081
      
      with the idea that navigating to:
      https://www.arb-silva.de/no_cache/download/archive/current/Exports/taxonomy/
      
      will allow users to locate the files by internal IDs.
    • Note: I am open to alternative suggestions (e.g. using URLs that resolve directly to bioregistry.io).
  • SSU vs. LSU Taxonomy:

    • This module is specific to the SILVA small subunit (SSU) taxonomy.
    • The large subunit (LSU) taxonomy is distinct and should be handled separately.
    • The module’s docstring clearly states:
      """Convert SILVA small subunit (ssu) taxonomy to OBO format."""
  • Handling ENA Accession Numbers:

    • At the lowest taxonomy level, SILVA uses accession numbers that resolve to ENA entries.
    • Initial Approach:
      • I initially implemented these as cross-references (xrefs) at the genus level (e.g., each ENA accession was added as an xref within the genus term).
      • Example (OBO):
        [Term]
        id: silva.taxon:58060
        name: Angustibacter
        xref: ena.embl:AB234237 ! uncultured bacterium
        xref: ena.embl:AB512285 ! Angustibacter luteus
        is_a: silva.taxon:58059
        property_value: TAXRANK:1000000 TAXRANK:0000005 ! has rank genus
        
    • Revised Approach:
      • I then decided to represent each ENA accession as its own term, with a parent association determined using the taxmap file (see below).
      • Example (OWL snippet):
        <owl:Class rdf:about="https://www.ebi.ac.uk/ena/browser/view/KP324679">
            <rdfs:subClassOf rdf:resource="https://www.arb-silva.de/fileadmin/silva_databases/current/Exports/taxonomy/47493"/>
            <oboInOwl:id>ena.embl:KP324679</oboInOwl:id>
            <rdfs:label>Fanellia korema</rdfs:label>
        </owl:Class>
    • Rank Annotation for ENA Terms:
      • Originally, I was assigning TAXRANK:0000006 (species) to all ENA entries.
      • However, because these ENA entries may represent either species or strain, I decided not to assign any rank to the ENA-derived terms.
  • Version Introduced Information:

    • The main SILVA taxonomy file includes a column with the version in which a given taxon ID was introduced.
    • Example in OBO Format:
      [Term]
      id: silva.taxon:58060
      name: Angustibacter
      property_value: version_introduced "138.2"
      ...
      
    • Whether to include this field by default is open for discussion: it could be useful, but it may also confuse users.
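
To make the trade-off around the version field concrete, here is a minimal sketch of rendering a term stanza with the field switched on or off. It uses plain string formatting rather than the PyOBO API; the function `render_term` and its `include_version` flag are hypothetical names introduced only for illustration, and the actual serialization is whatever PyOBO emits.

```python
from typing import Optional


def render_term(
    term_id: str,
    name: str,
    parent_id: Optional[str] = None,
    rank_id: Optional[str] = None,
    version: Optional[str] = None,
    include_version: bool = False,
) -> str:
    """Render a simplified OBO [Term] stanza (hypothetical helper, not PyOBO)."""
    lines = ["[Term]", f"id: {term_id}", f"name: {name}"]
    if parent_id:
        lines.append(f"is_a: {parent_id}")
    if rank_id:
        # TAXRANK:1000000 is the "has rank" property used in the examples above.
        lines.append(f"property_value: TAXRANK:1000000 {rank_id}")
    if include_version and version:
        # Only emitted when explicitly requested.
        lines.append(f'property_value: version_introduced "{version}"')
    return "\n".join(lines)


# Values taken from the Angustibacter example in this description.
with_version = render_term(
    "silva.taxon:58060",
    "Angustibacter",
    parent_id="silva.taxon:58059",
    rank_id="TAXRANK:0000005",
    version="138.2",
    include_version=True,
)
without_version = render_term(
    "silva.taxon:58060",
    "Angustibacter",
    parent_id="silva.taxon:58059",
    rank_id="TAXRANK:0000005",
    version="138.2",
)
print(with_version)
```

Leaving `rank_id` unset produces an unranked stanza, which mirrors the decision not to assign a rank to ENA-derived terms.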

Code Organization

  • Module-Level Structure:

    • A global constant PREFIX is defined as "silva.taxon".
  • Rank Mapping:

    • The dictionary SILVA_RANK_TO_TAXRANK covers all SILVA taxonomic ranks.
  • Main Processing Steps:

    1. Main Taxonomy File:

      • Each row is split on ";" (ignoring empty strings).
      • The term’s name is set to the last element (e.g., "Bacteria" or "Actinomycetota"), and the parent is determined by joining all but the last element.
    2. Taxmap File:

      • A new term is created for each ENA accession with prefix ena.embl:.
      • These terms are linked as children of the corresponding main taxonomy term, but no rank is assigned to them.
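
The lineage handling described for the main taxonomy file can be sketched as follows. This is a minimal illustration in plain Python, not the code in the PR: the helper names (`split_lineage`, `name_and_parent_path`, `parent_id`) and the toy taxon IDs are assumptions made up for the example.

```python
from typing import Optional


def split_lineage(path: str) -> list[str]:
    """Split a lineage path on ';', ignoring empty strings."""
    return [part for part in path.split(";") if part]


def name_and_parent_path(path: str) -> tuple[str, Optional[str]]:
    """Return the term name (last element) and the parent's joined path."""
    parts = split_lineage(path)
    if not parts:
        raise ValueError(f"empty lineage path: {path!r}")
    parent_path = ";".join(parts[:-1]) or None  # None for top-level taxa
    return parts[-1], parent_path


# Toy rows of (lineage path, taxon ID); the IDs are invented for illustration.
TOY_ROWS = [
    ("Bacteria;", "1"),
    ("Bacteria;Actinomycetota;", "2"),
]

# Index every taxon by its normalized lineage path so parents can be looked up.
path_to_id = {";".join(split_lineage(path)): taxon_id for path, taxon_id in TOY_ROWS}


def parent_id(path: str) -> Optional[str]:
    """Look up the taxon ID of the parent lineage, if there is one."""
    _, parent_path = name_and_parent_path(path)
    return path_to_id.get(parent_path) if parent_path else None


for path, taxon_id in TOY_ROWS:
    name, _ = name_and_parent_path(path)
    print(taxon_id, name, parent_id(path))
```

Keying the lookup on the joined path rather than on the name alone avoids collisions when the same name appears under different lineages.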

I welcome any feedback or suggestions on URL handling, inclusion of the version introduced field, or any other aspect of the implementation.

Please let me know if further details are needed.

codecov bot commented Feb 12, 2025

Codecov Report

Attention: Patch coverage is 33.82353% with 45 lines in your changes missing coverage. Please review.

Please upload report for BASE (main@ada760b). Learn more about missing BASE report.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/pyobo/sources/silva.py | 32.83% | 45 Missing ⚠️ |
Additional details and impacted files
@@           Coverage Diff           @@
##             main     #348   +/-   ##
=======================================
  Coverage        ?   51.95%           
=======================================
  Files           ?      187           
  Lines           ?    12168           
  Branches        ?     1857           
=======================================
  Hits            ?     6322           
  Misses          ?     5607           
  Partials        ?      239           


reference=Reference(prefix="ena.embl", identifier=accession, name=organism)
)
# Do NOT annotate the new term with a rank (leave it unranked).
new_term.append_parent(Reference(prefix=PREFIX, identifier=species_taxon_id))
Member

@cthoyt cthoyt Feb 13, 2025

Don't ENA terms represent nucleotide sequences derived from experiments? Can they also represent projects?

From what I understand, they aren't actually themselves representing taxa. Therefore this parent/child relationship doesn't make sense.

The hard work of making a PyOBO source is really understanding what relationship SILVA intends between its internal taxonomy and ENA sequences. I can't do this hard work for you in detail, but from a high level it seems like the sequence was derived from an individual belonging to the taxon.

Then there are two options:

  1. Find an existing RO relationship that is appropriate for this. Maybe http://purl.obolibrary.org/obo/RO_0001001, even though it's not a perfect ontological fit. OBI may be a better place to look.
  2. Mint an ad-hoc one yourself within the scope of this file, e.g., like in
    HAS_INTERVENTION = TypeDef(

If you go the second route, make sure to describe concisely what the relationship means.

Contributor Author

Thank you for your detailed feedback. I completely understand where the hard work lies, and I truly appreciate the guidance you provided. Your suggestions—either reusing an existing RO relationship (like RO_0001001) or minting an ad-hoc one (as in clinicaltrials.py)—are exactly the direction I was hoping for.

I’ll explore those options further. Alternatively, I might start by representing only down to the genus level (as shown in the taxonomy files) until I fully understand the nuances of the lower levels.

Thanks again for steering this work in the right direction!

@cthoyt cthoyt changed the title adding module for silva taxonomy New source: SILVA taxonomy Feb 13, 2025
@cthoyt cthoyt added the Nomenclature Data Nomenclature to convert to OBO label Feb 13, 2025
Member

@cthoyt cthoyt left a comment

I added an example ad-hoc typedef that you can fill in (or duplicate) for your purposes

logger.setLevel(logging.WARNING)

TYPEDEF = TypeDef(
reference=default_reference(PREFIX, "fixme", name="fixme"),
Member

fixme here

@jplfaria
Contributor Author

Summary of Changes

  • New TypeDef Added:
    Introduced HAS_TAXONOMIC_CLASSIFICATION to capture the relationship between an ENA accession (representing a genome sequence) and the taxonomic classification assigned by SILVA.

  • Rationale:
    Instead of using a parent/child relationship—which implies a fixed hierarchical level—this new typedef reflects that SILVA can classify sequences to varying levels (often down to genus, but sometimes only to higher ranks). This approach better represents the flexible nature of the taxonomic assignments provided by SILVA.

  • Implementation Details:
    The new typedef is defined as follows:

    HAS_TAXONOMIC_CLASSIFICATION = TypeDef(
        reference=default_reference(PREFIX, "has_taxonomic_classification", name="has taxonomic classification"),
        definition="Indicates that the genome sequence represented by an ENA accession is classified under this taxon by SILVA.",
        is_metadata_tag=True,
    )
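
To illustrate the difference from a parent/child link, here is a minimal sketch of an ENA record stanza that points at its SILVA taxon through a named relation instead of `is_a`. It is plain string formatting, not the PyOBO API. The relation identifier below is a hypothetical placeholder (the real CURIE is whatever `default_reference` produces), and using a `relationship:` line is itself an assumption: how PyOBO serializes a typedef with `is_metadata_tag=True` may differ.

```python
# Hypothetical placeholder for the relation's identifier; the real one comes
# from default_reference(PREFIX, "has_taxonomic_classification", ...).
RELATION_ID = "silva.taxon:has_taxonomic_classification"


def render_sequence_term(
    accession: str,
    organism: str,
    taxon_id: str,
    relation_id: str = RELATION_ID,
) -> str:
    """Render a simplified stanza for an ENA record (hypothetical helper)."""
    return "\n".join(
        [
            "[Term]",
            f"id: ena.embl:{accession}",
            f"name: {organism}",
            # A named relation to the taxon, deliberately not an is_a line.
            f"relationship: {relation_id} silva.taxon:{taxon_id}",
        ]
    )


# Accession, organism and taxon ID are taken from the Angustibacter example
# in the pull request description.
stanza = render_sequence_term("AB512285", "Angustibacter luteus", "58060")
print(stanza)
```

The same function would work unchanged for records that SILVA classifies only to a higher rank, since the target is just whichever taxon ID the taxmap assigns.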
