index WikiData taxa #58

jhpoelen · 2018-01-02T19:54:35Z

Have you considered indexing WikiData taxa? This would help me retrieve associated metadata like images, common names, and associated taxon ids.

E.g., https://www.wikidata.org/wiki/Q19537 .

jhpoelen · 2018-03-05T19:03:13Z

As part of an effort to establish bi-directional links between GloBI and WikiData (see globalbioticinteractions/globalbioticinteractions#209), I've prepared some flat taxon files (a "cache" and a "map") with a shallow hierarchy (child-parent taxa only). I am hoping you could index these, specifically the taxonCache.tsv.gz contained in https://depot.globalbioticinteractions.org/datasets/org/globalbioticinteractions/taxon/0.4.1/taxon-0.4.1-wikidata.zip .

@rdmpage @dimus @diatomsRcool @jhammock Curious to hear your thoughts on the value of getting access to WikiData via global names resolver / GloBI.

dimus · 2018-03-05T19:16:38Z

@jhpoelen, sounds good, I'll import it, give me a couple of days

sckott · 2018-05-22T17:17:40Z

@jhpoelen how did you make your wikidata dump at the https://depot.globalbioticinteractions.org/datasets/org/globalbioticinteractions/taxon/0.4.1/taxon-0.4.1-wikidata.zip link? Is there a way to get just the subset of wikidata that has taxonomy data? Or did you download the full wikidata dump and filter out the taxonomy data?

jhpoelen · 2018-05-22T18:06:59Z

@sckott here's a preprint that should answer your questions: https://peerj.com/preprints/26951 (thanks @diatomsRcool for sharing it!). If you still have questions, comments, or concerns, please do holler.

sckott · 2018-05-22T19:20:29Z

great, thanks much @jhpoelen

sckott · 2018-05-26T16:29:07Z

@jhpoelen I'm trying to download https://zenodo.org/record/1211767 and it seems to be broken: the download progress just spins, regardless of internet speed. Is that file the taxonomy subset of wikidata or all of wikidata?

jhpoelen · 2018-05-26T17:08:52Z

Interesting. Do you mind reporting the apparent download issues to zenodo? Perhaps a bug on their end.

About the data: the dataset is a copy of a dump of all of wikidata. The copy was published because wikidata seems to erase their snapshots, and I wanted to be able to reproduce the extraction of taxonomic data. If you'd like to redo or improve the extraction of taxonomic data, I'd suggest grabbing a recent snapshot of wikidata at https://www.wikidata.org/wiki/Wikidata:Database_download#JSON_dumps_(recommended) .

Alternatively, you can use https://zenodo.org/record/1213477/files/wikidata-taxon-info20171227.tsv.gz from http://doi.org/10.5281/zenodo.1213477. This archive contains a minimal taxon graph extracted from the JSON dump using the methods described in the preprint.

Curious to hear what you'll end up using and what your use case is.

sckott · 2018-05-29T16:39:13Z

Thanks @jhpoelen

Will report it to them.

Was thinking of using wikidata as a data source in our R package https://github.com/ropensci/taxizedb - it's an interface to SQL databases of taxonomic data, currently COL, ITIS, Plantlist, NCBI, GBIF backbone. For wikidata, I can take your wikidata-taxon-info20171227.tsv file and drop it into sqlite.
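For illustration, here is a minimal sketch of that "drop into sqlite" step, written in Python rather than R. It assumes the file is plain gzipped TSV without quoting and that the caller supplies the column names; the function name, the table name, and the column names in the usage note are hypothetical and are not taken from the published file description.

```python
import csv
import gzip
import sqlite3


def load_tsv_into_sqlite(tsv_gz_path, db_path, table, columns):
    """Stream a gzipped TSV file into a SQLite table with TEXT columns.

    `columns` are caller-supplied names (hypothetical here). Rows whose
    field count does not match are skipped; the number skipped is returned.
    """
    col_defs = ", ".join(f'"{c}" TEXT' for c in columns)
    placeholders = ", ".join("?" for _ in columns)
    skipped = 0

    conn = sqlite3.connect(db_path)
    try:
        conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({col_defs})')
        with gzip.open(tsv_gz_path, "rt", encoding="utf-8", newline="") as handle:
            # QUOTE_NONE: treat quote characters as ordinary data.
            reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)

            def valid_rows():
                nonlocal skipped
                for row in reader:
                    if len(row) == len(columns):
                        yield row
                    else:
                        skipped += 1

            # executemany consumes the generator lazily, so the file is
            # never held in memory as a whole.
            conn.executemany(
                f'INSERT INTO "{table}" VALUES ({placeholders})', valid_rows()
            )
        conn.commit()
    finally:
        conn.close()
    return skipped
```

A call might look like `load_tsv_into_sqlite("wikidata-taxon-info20171227.tsv.gz", "wikidata.sqlite", "taxon", ["id", "name", "rank"])`, where the three column names are placeholders to be replaced with the real field list.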

Do you keep an updated run of wikidata-taxon-info20171227.tsv going somewhere? If not, I might think about doing that so our users can get an updated set of wikidata data whenever they like.

sckott · 2018-05-29T16:47:59Z

Is there some description of the fields in the wikidata-taxon-info20171227.tsv somewhere?

jhpoelen · 2018-05-29T18:39:19Z

> is there some description of the fields in the wikidata-taxon-info20171227.tsv somewhere?

I've just added some more description to http://doi.org/10.5281/zenodo.1213477 . Please let me know if that helps. At some point, I am hoping to publish machine-readable schemas along with the tsv files.

> Do you keep an updated run of wikidata-taxon-info20171227.tsv going somewhere? if not, I might think about doing that so our users can get an updated set of wikidata data as they like

I think that keeping up-to-date copies would be pretty valuable. I didn't get a chance to do that yet, but I did publish instructions on how to rebuild the file here: https://github.com/bio-guoda/guoda-datasets/tree/master/wikidata . Curious to see how you'd go about it. Instead of using Apache Spark (needs infrastructure + learning curve), perhaps a brute-force approach with jq, the > 20GB wikidata dump, a bunch of internet bandwidth, and some bash scripting would also work.
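As a rough illustration of the brute-force idea, here is the same kind of line-by-line filter sketched in Python instead of jq/bash. It is not the Spark-based method from the instructions above. It assumes the dump keeps the documented layout of one JSON array with one entity per line, and that a taxon can be recognized by a "taxon name" (P225) claim, with "parent taxon" (P171) giving the child-parent link. The function names and the three-column output are hypothetical and do not reproduce the layout of wikidata-taxon-info20171227.tsv.

```python
import bz2
import json

TAXON_NAME = "P225"    # Wikidata property "taxon name"
PARENT_TAXON = "P171"  # Wikidata property "parent taxon"


def claim_values(entity, prop):
    """Yield the datavalue of each claim for `prop`, skipping valueless snaks."""
    for claim in entity.get("claims", {}).get(prop, []):
        datavalue = claim.get("mainsnak", {}).get("datavalue")
        if datavalue is not None:
            yield datavalue["value"]


def iter_taxa(lines):
    """Yield (id, taxon name, parent ids) for each entity with a taxon name."""
    for line in lines:
        # Entity lines end with a comma; the array brackets sit on their own lines.
        line = line.strip().rstrip(",")
        if line in ("", "[", "]"):
            continue
        entity = json.loads(line)
        names = list(claim_values(entity, TAXON_NAME))
        if not names:
            continue  # not a taxon under the assumption stated above
        parents = [
            value["id"]
            for value in claim_values(entity, PARENT_TAXON)
            if isinstance(value, dict) and "id" in value
        ]
        yield entity["id"], names[0], parents


def write_taxa_tsv(dump_path, out_path):
    """Stream a bz2-compressed JSON dump and write one TSV row per taxon."""
    with bz2.open(dump_path, "rt", encoding="utf-8") as dump, \
            open(out_path, "w", encoding="utf-8") as out:
        for qid, name, parents in iter_taxa(dump):
            out.write(f"{qid}\t{name}\t{'|'.join(parents)}\n")
```

Because each line is parsed and discarded in turn, memory use stays flat regardless of dump size; the cost is a full pass over the compressed file on every rebuild.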

sckott · 2018-05-29T18:42:34Z

Thanks, that metadata works for me.

Right, I've no experience with Spark or Java/Scala, so I might try the jq/bash approach - at least I'm pretty familiar with jq.

sckott · 2018-05-30T16:50:53Z

@jhpoelen that Zenodo file should be fixed now

jhpoelen · 2018-05-30T16:56:51Z

@sckott thanks for sharing. Do you happen to know the root cause of the accessibility issue with the wikidata dump on Zenodo?

sckott · 2018-05-30T17:02:19Z

a bug in our system that caused the browser to try to actually display the binary file in a new page

jhpoelen · 2018-05-30T17:13:09Z

Thanks for reporting/sharing. I believe this might have been the related commit - zenodo/zenodo@76db02d .

bjonnh mentioned this issue Aug 15, 2018

add wikidata to the source list #53

Open
