Extract individual information from NCBI #3

Open
NoopDog opened this issue Jan 28, 2025 · 5 comments

@NoopDog
Collaborator

NoopDog commented Jan 28, 2025

How can we do this?

@Smeds
Collaborator

Smeds commented Jan 29, 2025

In the VGP hub (https://hgdownload.soe.ucsc.edu/hubs/VGP/), the 'NCBI assembly' column lets you see the individual. Take GCA_009819605.1_bSteHir1.pri: the relevant part is bSteHir1, where bSteHir is the Tree of Life ID (ToLID) and the number that follows should be unique to an individual.
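A minimal sketch of pulling those parts out of a hub entry, assuming the layout `<accession>_<ToLID><number><rest>`. The regular expression and the names `HUB_NAME`, `HubName`, and `parse_hub_name` are illustrative only; the ToLID pattern (one or two lowercase clade letters followed by two capitalised name fragments) is a guess based on bSteHir and bTaeGut, not an official specification.

```python
import re
from typing import NamedTuple, Optional

# Assumed layout: <accession>_<ToLID prefix><individual number><rest>.
# The ToLID part of the pattern is inferred from the examples in this thread.
HUB_NAME = re.compile(
    r"^(?P<accession>GC[AF]_\d+\.\d+)_"
    r"(?P<tolid>[a-z]{1,2}(?:[A-Z][a-z]+){2})"
    r"(?P<individual>\d+)"
    r"(?P<rest>.*)$"
)


class HubName(NamedTuple):
    accession: str
    tolid: str
    individual: int
    rest: str


def parse_hub_name(name: str) -> Optional[HubName]:
    """Split a hub 'NCBI assembly' entry; return None if it does not fit."""
    match = HUB_NAME.match(name)
    if match is None:
        return None
    return HubName(
        match["accession"],
        match["tolid"],
        int(match["individual"]),
        match["rest"],
    )


print(parse_hub_name("GCA_009819605.1_bSteHir1.pri"))
```

Returning `None` instead of raising keeps the caller in charge of what to do with entries that do not follow the scheme.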

The NCBI REST API endpoint /genome/accession/{accessions}/dataset_report should make it possible to find the individual information. Here's an example for zebra finch.

GCF_003957565.2

"assembly_name": "bTaeGut1.4.pri"

{
  "reports": [
    {
      "accession": "GCF_003957565.2",
      "current_accession": "GCF_003957565.2",
      "paired_accession": "GCA_003957565.4",
      "source_database": "SOURCE_DATABASE_REFSEQ",
      "organism": {
        "tax_id": 59729,
        "organism_name": "Taeniopygia guttata",
        "common_name": "zebra finch",
        "infraspecific_names": {
          "isolate": "Blue55",
          "sex": "female"
        }
      },
      "assembly_info": {
        "assembly_level": "Chromosome",
        "assembly_status": "current",
        "paired_assembly": {
          "accession": "GCA_003957565.4",
          "status": "current",
          "changed": "chromosome MT."
        },
        "assembly_name": "bTaeGut1.4.pri",
        "assembly_type": "haploid"
      }
    }
  ]
}
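A rough sketch of how the fields above could be read programmatically. The base URL `https://api.ncbi.nlm.nih.gov/datasets/v2` is an assumption about where the endpoint is served, and `fetch_dataset_report` and `summarize_report` are hypothetical helper names. The example at the bottom runs offline on values copied from the response above, so nothing here depends on the network.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Assumption: base URL of the NCBI Datasets v2 REST API.
BASE_URL = "https://api.ncbi.nlm.nih.gov/datasets/v2"


def fetch_dataset_report(accession: str) -> dict:
    """Fetch the dataset report for one accession (needs network access)."""
    url = f"{BASE_URL}/genome/accession/{quote(accession)}/dataset_report"
    with urlopen(url, timeout=30) as response:
        return json.load(response)


def summarize_report(report: dict) -> dict:
    """Pick out the fields discussed in this thread; missing keys become None."""
    organism = report.get("organism", {})
    names = organism.get("infraspecific_names", {})
    info = report.get("assembly_info", {})
    return {
        "accession": report.get("accession"),
        "organism_name": organism.get("organism_name"),
        "isolate": names.get("isolate"),
        "sex": names.get("sex"),
        "assembly_name": info.get("assembly_name"),
    }


# Offline example with values copied from the response shown above.
sample = {
    "reports": [
        {
            "accession": "GCF_003957565.2",
            "organism": {
                "organism_name": "Taeniopygia guttata",
                "infraspecific_names": {"isolate": "Blue55", "sex": "female"},
            },
            "assembly_info": {"assembly_name": "bTaeGut1.4.pri"},
        }
    ]
}
summary = summarize_report(sample["reports"][0])
print(summary)
```

With network access, `summarize_report(fetch_dataset_report("GCF_003957565.2")["reports"][0])` would be the equivalent live call, provided the assumed base URL is right.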

GCF_008822105.2

"assembly_name": "bTaeGut2.pat.W.v2",

@Smeds
Collaborator

Smeds commented Jan 29, 2025

I have added some information about extracting the individual, @NoopDog.

@NoopDog
Collaborator Author

NoopDog commented Jan 29, 2025

Thanks, @Smeds! Are you saying that the individual name is encoded in the assembly name and is not explicitly represented in the API response?

Also, what is

"infraspecific_names": {
  "isolate": "Blue55",
  "sex": "female"
}

Is Blue55 also an ID for the individual? Or is this something else?

Cheers
D

@Smeds
Collaborator

Smeds commented Jan 29, 2025

It could be that NCBI doesn't really work with individual information; they have their different assemblies instead. It looks like the isolate (Blue55) is shared across the organism: bTaeGut2 has the same value.

{
  "reports": [
    {
      "accession": "GCF_008822105.2",
      "current_accession": "GCF_008822105.2",
      "paired_accession": "GCA_008822105.2",
      "source_database": "SOURCE_DATABASE_REFSEQ",
      "organism": {
        "tax_id": 59729,
        "organism_name": "Taeniopygia guttata",
        "common_name": "zebra finch",
        "infraspecific_names": {
          "isolate": "Blue55",
          "sex": "female"
        }
      },
      "assembly_info": {
        "assembly_level": "Chromosome",
        "assembly_status": "suppressed",
        "paired_assembly": {
          "accession": "GCA_008822105.2",
          "status": "current",
          "changed": "chromosome MT."
        },
        "assembly_name": "bTaeGut2.pat.W.v2"
      }
    }
  ]
}

If you look at the Genome Ark v1 page, they use the ToLID with a number to differentiate between individuals: https://www.genomeark.org/vgp-all/Taeniopygia_guttata.html
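To make the comparison concrete, here is a small sketch that groups the two zebra finch reports once by isolate and once by the ToLID-plus-number prefix of `assembly_name`. The regular expression and the helper `individual_id` are assumptions inferred from the names in this thread; the input values are copied from the two responses above.

```python
import re
from collections import defaultdict
from typing import Optional

# Assumed ToLID shape: 1-2 lowercase letters, two capitalised fragments, a number.
TOLID = re.compile(r"^[a-z]{1,2}(?:[A-Z][a-z]+){2}\d+")


def individual_id(assembly_name: str) -> Optional[str]:
    """Return the leading ToLID plus number (e.g. 'bTaeGut1'), or None."""
    match = TOLID.match(assembly_name)
    return match.group(0) if match else None


# Values copied from the two dataset reports in this thread.
reports = [
    {"accession": "GCF_003957565.2", "isolate": "Blue55",
     "assembly_name": "bTaeGut1.4.pri"},
    {"accession": "GCF_008822105.2", "isolate": "Blue55",
     "assembly_name": "bTaeGut2.pat.W.v2"},
]

by_individual = defaultdict(list)
by_isolate = defaultdict(list)
for report in reports:
    by_individual[individual_id(report["assembly_name"])].append(report["accession"])
    by_isolate[report["isolate"]].append(report["accession"])

print(dict(by_individual))
print(dict(by_isolate))
```

Grouping by isolate puts both accessions in one bucket, while grouping by the assembly-name prefix separates them, which matches the observation that the isolate is not what tells these two assemblies apart.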

@Smeds
Collaborator

Smeds commented Jan 29, 2025

Also wanted to add a note: there are assemblies that don't have correct Tree of Life IDs, like GCF_011100685.1_UU_Cfam_GSD_1.0.
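One way to cope with such names is to separate entries that carry a recognisable ToLID from those that need manual handling. This is only a sketch: the pattern is the same guess at the ToLID shape as elsewhere in this thread, and `split_by_tolid` is a hypothetical helper, so it may miss valid IDs or accept look-alikes.

```python
import re
from typing import Dict, List, Tuple

# Assumed ToLID shape, allowed at the start of the name or after an underscore.
TOLID_IN_NAME = re.compile(r"(?:^|_)(?P<id>[a-z]{1,2}(?:[A-Z][a-z]+){2}\d+)")


def split_by_tolid(names: List[str]) -> Tuple[Dict[str, str], List[str]]:
    """Map each name to its ToLID-with-number; collect names that have none."""
    with_tolid: Dict[str, str] = {}
    without_tolid: List[str] = []
    for name in names:
        match = TOLID_IN_NAME.search(name)
        if match:
            with_tolid[name] = match["id"]
        else:
            without_tolid.append(name)
    return with_tolid, without_tolid


found, needs_review = split_by_tolid([
    "GCA_009819605.1_bSteHir1.pri",
    "GCF_011100685.1_UU_Cfam_GSD_1.0",
])
print(found)
print(needs_review)
```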
