index WikiData taxa #58
Comments
As part of an effort to establish bi-directional links between GloBI and WikiData (see globalbioticinteractions/globalbioticinteractions#209), I've prepared some flat taxon files (a "cache" and a "map") with a shallow hierarchy (child-parent taxa only). I am hoping you could index this, specifically the taxonCache.tsv.gz contained in https://depot.globalbioticinteractions.org/datasets/org/globalbioticinteractions/taxon/0.4.1/taxon-0.4.1-wikidata.zip . @rdmpage @dimus @diatomsRcool @jhammock Curious to hear your thoughts on the value of getting access to WikiData via global names resolver / GloBI. |
@jhpoelen, sounds good, I'll import it, give me a couple of days |
@jhpoelen how did you make your wikidata dump at the https://depot.globalbioticinteractions.org/datasets/org/globalbioticinteractions/taxon/0.4.1/taxon-0.4.1-wikidata.zip link? is there a way to get just the subset of wikidata that has taxonomy data? or did you download the full wikidata dump and filter out the taxonomy data? |
@sckott here's a preprint that should answer your questions: https://peerj.com/preprints/26951 (thanks @diatomsRcool for sharing it!) . If you still have questions, comments, or concerns, please do holler. |
great, thanks much @jhpoelen |
@jhpoelen trying to download https://zenodo.org/record/1211767 and it seems like it's broken - just spinning in download progress regardless of internet speed. Is that file the taxonomy subset of wikidata or all of wikidata? |
Interesting. Do you mind reporting the apparent download issues to zenodo? Perhaps a bug on their end. About the data: the dataset is a copy of a dump of all of wikidata. The copy was published because wikidata seems to erase their snapshots, and I wanted to be able to reproduce the extraction of taxonomic data. If you'd like to re-do or improve the extraction of taxonomic data, I'd suggest grabbing a recent snapshot of wikidata at https://www.wikidata.org/wiki/Wikidata:Database_download#JSON_dumps_(recommended) . Alternatively, you can use https://zenodo.org/record/1213477/files/wikidata-taxon-info20171227.tsv.gz from http://doi.org/10.5281/zenodo.1213477. This archive contains a minimal taxon graph extracted from the json dump using the methods described in the preprint. Curious to hear what you'll end up using and what your use case is. |
Thanks @jhpoelen Will report it to them. Was thinking of using wikidata as a data source in our R package https://github.com/ropensci/taxizedb - it's an interface to SQL databases of taxonomic data, currently COL, ITIS, Plantlist, NCBI, GBIF backbone. For wikidata, I can take your Do you keep an updated run of |
is there some description of the fields in the |
I've just added some more description to http://doi.org/10.5281/zenodo.1213477 . Please let me know if that helps. At some point, I am hoping to publish machine-readable schemas along with the tsv files.
I think that keeping up-to-date copies would be pretty valuable. I didn't get a chance to do that yet, but I did publish instructions on how to rebuild the file here: https://github.com/bio-guoda/guoda-datasets/tree/master/wikidata . Curious to see how you'd be going about it. Instead of using apache spark (needs infrastructure + learning curve), perhaps a brute force approach with jq , the > 20GB wikidata dump, a bunch of internet bandwidth and some bash scripting would also work. |
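For illustration, the brute-force idea above can also be sketched in Python rather than jq/bash. This is only a minimal sketch under stated assumptions, not the method used in the preprint or the guoda-datasets instructions: it assumes the gzipped Wikidata JSON dump layout (one entity per line inside a JSON array), treats any entity with a taxon name (P225) as a taxon, and takes its first parent taxon (P171). The function names and the three-column output are hypothetical and do not match the layout of taxonCache.tsv.gz.

```python
import gzip
import json


def iter_entities(lines):
    """Yield one entity dict per line of a Wikidata JSON dump."""
    for line in lines:
        # Each entity sits on its own line, followed by a comma;
        # the first and last lines are the array brackets.
        line = line.strip().rstrip(",")
        if line in ("", "[", "]"):
            continue
        yield json.loads(line)


def first_value(entity, prop):
    """Return the first main value of a property, or None if absent."""
    for claim in entity.get("claims", {}).get(prop, []):
        snak = claim.get("mainsnak", {})
        if snak.get("snaktype") == "value":
            return snak["datavalue"]["value"]
    return None


def taxon_rows(lines):
    """Yield (entity id, taxon name, parent id) for entities with a P225 claim."""
    for entity in iter_entities(lines):
        name = first_value(entity, "P225")  # P225: taxon name
        if name is None:
            continue
        parent = first_value(entity, "P171")  # P171: parent taxon
        parent_id = parent.get("id", "") if isinstance(parent, dict) else ""
        yield entity["id"], name, parent_id


def extract(dump_path, out_path):
    """Stream a gzipped dump and write a child-parent TSV (hypothetical layout)."""
    with gzip.open(dump_path, "rt", encoding="utf-8") as src, \
            open(out_path, "w", encoding="utf-8") as out:
        for row in taxon_rows(src):
            out.write("\t".join(row) + "\n")
```

Because the dump is read line by line, memory use stays small even for the full dump; the cost is a single long pass over the file.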
Thanks, that metadata works for me. Right, I've no experience in spark or java/scala, so i might try the jq/bash approach - at least I'm pretty familiar with jq. |
@jhpoelen that Zenodo file should be fixed now |
@sckott thanks for sharing. Do you happen to know the root cause of the accessibility issue with the wikidata dump on Zenodo? |
Thanks for reporting/sharing. I believe this might have been the related commit - zenodo/zenodo@76db02d . |
Have you considered indexing WikiData taxa? This would help me to retrieve associated metadata like images, common names and associated taxon ids.
E.g., https://www.wikidata.org/wiki/Q19537 .