Extract individual information from NCBI #3
Comments
From the VGP hub (https://hgdownload.soe.ucsc.edu/hubs/VGP/) you can identify the individual from the 'NCBI assembly' column, for example GCA_009819605.1_bSteHir1.pri. Here bSteHir is the Tree of Life ID (ToLID), and the number that follows it should be unique for an individual. The /genome/accession/{accessions}/dataset_report endpoint of the NCBI REST API should make it possible to find the individual information. Here is an example for the zebra finch, GCF_003957565.2, with "assembly_name": "bTaeGut1.4.pri":
{
"reports": [
{
"accession": "GCF_003957565.2",
"current_accession": "GCF_003957565.2",
"paired_accession": "GCA_003957565.4",
"source_database": "SOURCE_DATABASE_REFSEQ",
"organism": {
"tax_id": 59729,
"organism_name": "Taeniopygia guttata",
"common_name": "zebra finch",
"infraspecific_names": {
"isolate": "Blue55",
"sex": "female"
}
},
"assembly_info": {
"assembly_level": "Chromosome",
"assembly_status": "current",
"paired_assembly": {
"accession": "GCA_003957565.4",
"status": "current",
"changed": "chromosome MT."
},
"assembly_name": "bTaeGut1.4.pri",
"assembly_type": "haploid"
}
}
]
}
GCF_008822105.2 has "assembly_name": "bTaeGut2.pat.W.v2".
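For reference, here is a minimal Python sketch of pulling the fields discussed above out of a dataset_report response. The base URL `https://api.ncbi.nlm.nih.gov/datasets/v2` is an assumption (the issue only gives the path), and the function names `fetch_dataset_report` and `summarize_reports` are hypothetical, chosen for illustration only.

```python
import json
import urllib.request

# Assumed base URL for the NCBI Datasets REST API; the issue only names the path.
BASE_URL = "https://api.ncbi.nlm.nih.gov/datasets/v2"


def fetch_dataset_report(accession: str) -> dict:
    """Download the dataset report for one assembly accession (needs network access)."""
    url = f"{BASE_URL}/genome/accession/{accession}/dataset_report"
    request = urllib.request.Request(url, headers={"Accept": "application/json"})
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def summarize_reports(payload: dict) -> list[dict]:
    """Keep only the fields that might identify an individual.

    Missing keys become None, because not every report carries
    "infraspecific_names" or an isolate.
    """
    rows = []
    for report in payload.get("reports", []):
        organism = report.get("organism", {})
        infraspecific = organism.get("infraspecific_names", {})
        assembly_info = report.get("assembly_info", {})
        rows.append(
            {
                "accession": report.get("accession"),
                "assembly_name": assembly_info.get("assembly_name"),
                "isolate": infraspecific.get("isolate"),
                "sex": infraspecific.get("sex"),
            }
        )
    return rows
```

Calling `summarize_reports(fetch_dataset_report("GCF_003957565.2"))` should then give one row per report, with the assembly name and isolate side by side so they can be compared across assemblies.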
@NoopDog, I have added some information about extracting the individual.
Thanks, @Smeds! Are you saying that the individual name is encoded in the assembly name and is not explicitly represented in the API response? Also, what is "infraspecific_names"? Is Blue55 also an ID for the individual, or is this something else? Cheers
It could be that NCBI doesn't really work with individual information and only tracks the different assemblies. It looks like the isolate (Blue55) is something shared across the organism: bTaeGut2 has the same value.
{
"reports": [
{
"accession": "GCF_008822105.2",
"current_accession": "GCF_008822105.2",
"paired_accession": "GCA_008822105.2",
"source_database": "SOURCE_DATABASE_REFSEQ",
"organism": {
"tax_id": 59729,
"organism_name": "Taeniopygia guttata",
"common_name": "zebra finch",
"infraspecific_names": {
"isolate": "Blue55",
"sex": "female"
}
},
"assembly_info": {
"assembly_level": "Chromosome",
"assembly_status": "suppressed",
"paired_assembly": {
"accession": "GCA_008822105.2",
"status": "current",
"changed": "chromosome MT."
},
"assembly_name": "bTaeGut2.pat.W.v2"
}
}
]
}
If you look at the Genome Ark v1 page, they use the ToLID with a number to differentiate between individuals: https://www.genomeark.org/vgp-all/Taeniopygia_guttata.html
Also wanted to add a NOTE: there are assemblies that don't have correct Tree of Life IDs, like GCF_011100685.1_UU_Cfam_GSD_1.0.
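One possible way to handle this, sketched in Python: split the hub's 'NCBI assembly' value into accession and assembly name, then try to read a ToLID plus individual number from the start of the assembly name, returning None when the name does not look like a ToLID. The regular expression is an assumption based on the examples in this thread (one or two lowercase clade letters, three genus letters, three or four species letters, then the number), not an official ToLID grammar, and the names `split_hub_name` and `parse_individual` are hypothetical.

```python
import re
from typing import NamedTuple, Optional, Tuple

# "GCA_009819605.1_bSteHir1.pri" -> accession + assembly name
HUB_NAME = re.compile(r"^(GC[AF]_\d+\.\d+)_(.+)$")

# Assumed ToLID shape, inferred from names such as bTaeGut1 and bSteHir1.
TOLID = re.compile(r"^([a-z]{1,2}[A-Z][a-z]{2}[A-Z][a-z]{2,3})(\d+)")


class Individual(NamedTuple):
    tolid_prefix: str  # e.g. "bTaeGut"
    number: int        # e.g. 1, should be unique per individual


def split_hub_name(name: str) -> Tuple[Optional[str], str]:
    """Split a hub entry into (accession, assembly_name); accession is None if absent."""
    match = HUB_NAME.match(name)
    if match is None:
        return None, name
    return match.group(1), match.group(2)


def parse_individual(assembly_name: str) -> Optional[Individual]:
    """Return the ToLID prefix and individual number, or None for non-ToLID names."""
    match = TOLID.match(assembly_name)
    if match is None:
        return None
    return Individual(match.group(1), int(match.group(2)))


if __name__ == "__main__":
    for entry in ("GCA_009819605.1_bSteHir1.pri", "GCF_011100685.1_UU_Cfam_GSD_1.0"):
        accession, assembly_name = split_hub_name(entry)
        print(accession, assembly_name, parse_individual(assembly_name))
```

Entries that come back as None, such as UU_Cfam_GSD_1.0, would need another source for the individual, for example the isolate field, or manual curation.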
How can we do this?