index WikiData taxa #58

Open
jhpoelen opened this issue Jan 2, 2018 · 15 comments

@jhpoelen

jhpoelen commented Jan 2, 2018

Have you considered indexing WikiData taxa? This would help me retrieve associated metadata such as images, common names, and associated taxon IDs.

E.g., https://www.wikidata.org/wiki/Q19537 .

@jhpoelen
Author

jhpoelen commented Mar 5, 2018

As part of an effort to establish bi-directional links between GloBI and WikiData (see globalbioticinteractions/globalbioticinteractions#209), I've prepared some flat taxon files (a "cache" and a "map") with a shallow hierarchy (child-parent taxa only). I am hoping you could index this, specifically the taxonCache.tsv.gz contained in https://depot.globalbioticinteractions.org/datasets/org/globalbioticinteractions/taxon/0.4.1/taxon-0.4.1-wikidata.zip .

@rdmpage @dimus @diatomsRcool @jhammock Curious to hear your thoughts on the value of getting access to WikiData via global names resolver / GloBI.

@dimus
Member

dimus commented Mar 5, 2018

@jhpoelen, sounds good, I'll import it. Give me a couple of days.

@sckott

sckott commented May 22, 2018

@jhpoelen how did you make your wikidata dump at the https://depot.globalbioticinteractions.org/datasets/org/globalbioticinteractions/taxon/0.4.1/taxon-0.4.1-wikidata.zip link? Is there a way to get just the subset of wikidata that has taxonomy data? Or did you download the full wikidata dump and extract the taxonomy data from it?

@jhpoelen
Author

@sckott here's a preprint that should answer your questions: https://peerj.com/preprints/26951 (thanks @diatomsRcool for sharing it!). If you still have questions, comments, or concerns, please do holler.

@sckott

sckott commented May 22, 2018

great, thanks much @jhpoelen

@sckott

sckott commented May 26, 2018

@jhpoelen I'm trying to download https://zenodo.org/record/1211767 and it seems to be broken: the download progress just spins, regardless of internet speed. Is that file the taxonomy subset of wikidata or all of wikidata?

@jhpoelen
Author

Interesting. Do you mind reporting the apparent download issues to zenodo? Perhaps a bug on their end.

About the data: the dataset is a copy of a dump of all of wikidata. I published the copy because wikidata seems to erase their snapshots, and I wanted to be able to reproduce the extraction of taxonomic data. If you'd like to redo or improve the extraction of taxonomic data, I'd suggest grabbing a recent snapshot of wikidata at https://www.wikidata.org/wiki/Wikidata:Database_download#JSON_dumps_(recommended) .

Alternatively, you can use https://zenodo.org/record/1213477/files/wikidata-taxon-info20171227.tsv.gz from http://doi.org/10.5281/zenodo.1213477. This archive contains a minimal taxon graph extracted from the JSON dump using the methods described in the preprint.

Curious to hear what you'll end up using and what your use case is.

@sckott

sckott commented May 29, 2018

Thanks @jhpoelen

Will report it to them.

I was thinking of using wikidata as a data source in our R package https://github.com/ropensci/taxizedb - it's an interface to SQL databases of taxonomic data, currently COL, ITIS, Plantlist, NCBI, and the GBIF backbone. For wikidata, I can take your wikidata-taxon-info20171227.tsv file and drop it into sqlite.
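
For readers who want to try the "drop into sqlite" step, here is a minimal sketch in Python (taxizedb itself is an R package, so this is an illustration rather than how taxizedb does it). The function name, the table name, and the column names are hypothetical: the caller has to supply the real column names from the field description on the Zenodo record, and the sketch assumes the file has no header row.

```python
import csv
import gzip
import sqlite3


def load_tsv_into_sqlite(tsv_gz_path, db_path, table, columns):
    """Load a gzipped TSV file into a SQLite table, storing every field as TEXT.

    `columns` comes from the caller; nothing here knows the real schema
    of the Zenodo file. Returns the number of rows in the table afterwards.
    """
    def quote(name):
        # Quote an identifier so unusual column names do not break the SQL.
        return '"{}"'.format(name.replace('"', '""'))

    column_defs = ", ".join("{} TEXT".format(quote(c)) for c in columns)
    placeholders = ", ".join("?" for _ in columns)
    width = len(columns)

    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "CREATE TABLE IF NOT EXISTS {} ({})".format(quote(table), column_defs)
            )
            with gzip.open(tsv_gz_path, "rt", encoding="utf-8", newline="") as handle:
                reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
                rows = (
                    (row + [None] * width)[:width]  # pad or trim ragged rows
                    for row in reader
                    if row  # skip blank lines
                )
                conn.executemany(
                    "INSERT INTO {} VALUES ({})".format(quote(table), placeholders),
                    rows,
                )
        return conn.execute(
            "SELECT COUNT(*) FROM {}".format(quote(table))
        ).fetchone()[0]
    finally:
        conn.close()
```

A call might look like `load_tsv_into_sqlite("wikidata-taxon-info20171227.tsv.gz", "wikidata.sqlite", "taxa", ["col1", "col2", "col3"])`, where `col1` to `col3` are placeholders for the real field names. Everything is stored as TEXT to avoid guessing types; indexes can be added once the schema is known.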

Do you keep an updated run of wikidata-taxon-info20171227.tsv going somewhere? if not, I might think about doing that so our users can get an updated set of wikidata data as they like

@sckott

sckott commented May 29, 2018

is there some description of the fields in the wikidata-taxon-info20171227.tsv somewhere?

@jhpoelen
Author

jhpoelen commented May 29, 2018

> is there some description of the fields in the wikidata-taxon-info20171227.tsv somewhere?

I've just added some more description to http://doi.org/10.5281/zenodo.1213477 . Please let me know if that helps. At some point, I am hoping to publish machine-readable schemas along with the tsv files.

> Do you keep an updated run of wikidata-taxon-info20171227.tsv going somewhere? if not, I might think about doing that so our users can get an updated set of wikidata data as they like

I think that keeping up-to-date copies would be pretty valuable. I haven't had a chance to do that yet, but I did publish instructions on how to rebuild the file here: https://github.com/bio-guoda/guoda-datasets/tree/master/wikidata . Curious to see how you'd go about it. Instead of using Apache Spark (needs infrastructure + learning curve), perhaps a brute-force approach with jq, the > 20GB wikidata dump, a bunch of internet bandwidth, and some bash scripting would work too.
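
To make the brute-force idea concrete, here is a rough sketch of the same kind of streaming filter written in Python instead of jq/bash. It is not the Spark method from the preprint or the guoda-datasets instructions. It assumes the Wikidata JSON dump layout of one entity per line inside a single JSON array, and it uses the Wikidata properties P225 (taxon name) and P171 (parent taxon); the function names are made up for illustration.

```python
import bz2
import gzip
import json

TAXON_NAME = "P225"    # Wikidata property "taxon name"
PARENT_TAXON = "P171"  # Wikidata property "parent taxon"


def first_value(entity, prop):
    """Return the first concrete value of a property on an entity, or None."""
    for claim in entity.get("claims", {}).get(prop, []):
        snak = claim.get("mainsnak", {})
        if snak.get("snaktype") == "value":
            return snak["datavalue"]["value"]
    return None


def iter_taxa(lines):
    """Yield (id, taxon name, parent id) for each entity that has a taxon name."""
    for line in lines:
        # Each entity sits on its own line, followed by a comma; the first
        # and last lines of the dump are the array brackets.
        line = line.strip().rstrip(",")
        if line in ("", "[", "]"):
            continue
        entity = json.loads(line)
        name = first_value(entity, TAXON_NAME)
        if name is None:
            continue  # not a taxon
        parent = first_value(entity, PARENT_TAXON)
        # Assumption: item values carry an "id" key such as "Q123".
        parent_id = parent.get("id") if isinstance(parent, dict) else None
        yield entity["id"], name, parent_id


def open_dump(path):
    """Open a .bz2 or .gz dump as text without decompressing it to disk."""
    opener = bz2.open if path.endswith(".bz2") else gzip.open
    return opener(path, "rt", encoding="utf-8")
```

Usage would be along the lines of `with open_dump(path) as lines: for qid, name, parent in iter_taxa(lines): ...`, writing each tuple out as a TSV row. The dump is read line by line, so memory use stays small, but a full pass over a dump of that size will still take a long time.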

@sckott

sckott commented May 29, 2018

Thanks, that metadata works for me.

Right, I have no experience with Spark or Java/Scala, so I might try the jq/bash approach - at least I'm pretty familiar with jq.

@sckott

sckott commented May 30, 2018

@jhpoelen that Zenodo file should be fixed now

@jhpoelen
Author

@sckott thanks for sharing. Do you happen to know the root cause of the accessibility issue with the wikidata dump on Zenodo?

@sckott

sckott commented May 30, 2018

a bug in our system that caused the browser to try to actually display the binary file in a new page

@jhpoelen
Author

Thanks for reporting/sharing. I believe this might have been the related commit - zenodo/zenodo@76db02d .

3 participants