A CKAN extension to generate the /data.json file and to harvest data sources from a remote /data.json file according to the U.S. Project Open Data metadata specification (http://project-open-data.github.io/).
This plugin creates a new view at /data.json (or other configurable path) that outputs the contents of the data catalog in the Project Open Data JSON metadata format. It also creates a view at /data.jsonld which outputs the same in JSON-LD format.
The plugin also provides a harvester to import datasets from other remote /data.json files.
This module assumes metadata is stored in CKAN in the way we do it on http://hub.healthdata.gov. If you're storing metadata under different key names, you'll have to revise ckanext/datajson/plugin.py accordingly.
To install, activate your CKAN virtualenv and then install the module in develop mode, which just puts the directory in your Python path.
. path/to/pyenv/bin/activate
python setup.py develop
Then in your CKAN .ini file, add "datajson" to your ckan.plugins line:
ckan.plugins = (other plugins here...) datajson
That's the plugin for /data.json output. To make the harvester available, also add "harvest" and "datajson_harvest" to the same line:
ckan.plugins = (other plugins here...) harvest datajson_harvest
If you're running CKAN via WSGI, we found a strange Python dependency bug. It might only affect development environments. The fix was to revise wsgi.py and add:
import ckanext
before
from paste.deploy import loadapp
Then restart your server and check out:
http://yourdomain.com/data.json
and
http://yourdomain.com/data.jsonld
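If you'd rather check the output from a script than from a browser, here is a minimal Python 3 sketch. It is not part of this extension. It assumes the catalog is either a bare JSON list of datasets or an object with a "dataset" list, and that every dataset should have a "title" and an "identifier"; adjust both assumptions to the version of the Project Open Data schema you publish. The function name summarize_catalog is made up for illustration.

```python
import json


def summarize_catalog(text):
    """Parse data.json text; return (dataset_count, indexes_of_incomplete_datasets)."""
    catalog = json.loads(text)  # raises ValueError if the output is not valid JSON
    # Assumption: either a bare list of datasets or an object with a "dataset" list.
    if isinstance(catalog, list):
        datasets = catalog
    else:
        datasets = catalog.get("dataset", [])
    # Assumption: "title" and "identifier" are the fields worth spot-checking.
    incomplete = [
        i for i, d in enumerate(datasets)
        if not d.get("title") or not d.get("identifier")
    ]
    return len(datasets), incomplete


# Example use against a running site (the URL is a placeholder):
#   import urllib.request
#   body = urllib.request.urlopen("http://yourdomain.com/data.json").read()
#   print(summarize_catalog(body.decode("utf-8")))
```

A dataset count of zero, or a long list of incomplete entries, usually means the metadata key names in your CKAN instance don't match what the plugin expects.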
If you're deploying inside Apache, some caching is a good idea because generating the /data.json file can take a while. Enable the cache modules (the commands below use the Apache 2.2 module name; on Apache 2.4 the disk cache module is called cache_disk):
a2enmod cache
a2enmod disk_cache
And then in your Apache configuration add:
CacheEnable disk /data.json
CacheRoot /tmp/apache_cache
CacheDefaultExpire 120
CacheIgnoreCacheControl On
CacheIgnoreNoLastMod On
CacheStoreNoStore On
And be sure to create /tmp/apache_cache and make it writable by the Apache process.
Because generating this file is slow, an alternative to caching is to generate it periodically (e.g. in a cron job) and serve the saved copy. In that case, change the path at which CKAN serves the file to something other than /data.json. In your CKAN .ini file, in the app:main section, add:
ckanext.datajson.path = /internal/data.json
Now create a crontab file ("mycrontab") to download this URL to a file on disk every ten minutes:
0-59/10 * * * * wget -qO /path/to/static/data.json http://localhost/internal/data.json
And activate your crontab like so:
crontab mycrontab
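One thing to be aware of with the wget line above: wget -O truncates the target file as soon as it starts, so a slow or failed request can briefly leave an empty or partial /data.json on disk. If that matters to you, a small script that downloads to a temporary file and then renames it over the old one avoids the problem. Here is a minimal Python 3 sketch, not part of this extension; the function name refresh_static_copy and the paths in the comments are placeholders.

```python
import os
import shutil
import tempfile
import urllib.request


def refresh_static_copy(url, dest):
    """Download url to dest, replacing dest only after the download completes."""
    dest_dir = os.path.dirname(os.path.abspath(dest))
    # The temporary file lives in the same directory as dest so that the
    # final rename stays on one filesystem and is atomic.
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp, urllib.request.urlopen(url) as resp:
            shutil.copyfileobj(resp, tmp)
        # mkstemp creates the file readable only by its owner; open it up
        # so the web server can read it.
        os.chmod(tmp_path, 0o644)
        os.replace(tmp_path, dest)
    except Exception:
        # On any failure (including an HTTP error), keep the previous copy.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


# Example call (placeholder paths, matching the cron line above):
#   refresh_static_copy("http://localhost/internal/data.json",
#                       "/path/to/static/data.json")
```

If you save this with the example call uncommented as, say, /path/to/refresh_datajson.py, the crontab entry would run that script with python3 in place of the wget command.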
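In Apache, we'll want to block outside access to the "internal" URL, and also map the URL /data.json to the static file. In your httpd.conf, add the following (the Order/Allow/Deny lines are Apache 2.2 syntax; on Apache 2.4 the equivalent is "Require ip 127.0.0.1"):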
Alias /data.json /path/to/static/data.json
<Location /internal/>
Order deny,allow
Allow from 127.0.0.1
Deny from all
</Location>
Then restart Apache. Wait for the cron job to run once, then check that /data.json loads (it should be fast!). Also double check that http://yourdomain.com/internal/data.json returns a 403 Forbidden error when accessed from any machine other than the server itself.
You can customize the URLs at which the data.json and data.jsonld output is served by setting these options in your CKAN .ini file:
ckanext.datajson.path = /data.json
ckanext.datajsonld.path = /data.jsonld
ckanext.datajsonld.id = http://www.youragency.gov/data.json
If ckanext.datajsonld.path is omitted, it defaults to your ckanext.datajson.path value with ".json" replaced by ".jsonld", so it probably won't need to be specified.
The option ckanext.datajsonld.id is the @id value used to identify the data catalog itself. If not given, it defaults to ckan.site_url.
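To make the defaulting rules concrete, here is a small Python sketch of how the three options resolve. It is an illustration of the behavior described above, not the plugin's actual code: it assumes a plain string replacement of ".json" with ".jsonld" and a default of /data.json for ckanext.datajson.path. The function name resolve_datajson_settings is made up.

```python
def resolve_datajson_settings(config):
    """Return the effective paths and catalog @id for a dict of .ini options."""
    # Assumption: /data.json is the default when no path is configured.
    path = config.get("ckanext.datajson.path", "/data.json")
    # Assumption: the JSON-LD default is a plain ".json" -> ".jsonld" replacement.
    ld_path = config.get("ckanext.datajsonld.path",
                         path.replace(".json", ".jsonld"))
    # The catalog @id falls back to ckan.site_url when not given.
    ld_id = config.get("ckanext.datajsonld.id", config.get("ckan.site_url"))
    return {"path": path, "ld_path": ld_path, "ld_id": ld_id}


settings = resolve_datajson_settings({
    "ckanext.datajson.path": "/internal/data.json",
    "ckan.site_url": "http://yourdomain.com",
})
print(settings["ld_path"])  # /internal/data.jsonld
```

Note the consequence for the cron job setup: if you move ckanext.datajson.path to /internal/data.json and leave ckanext.datajsonld.path unset, the JSON-LD view moves to /internal/data.jsonld as well, so set it explicitly if you want it to stay public.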
You'll also need to set up the CKAN harvester extension. See the CKAN harvester README at https://github.com/okfn/ckanext-harvest for how to do that. You'll set some configuration variables and then initialize the CKAN harvester plugin using:
paster --plugin=ckanext-harvest harvester initdb --config=/path/to/ckan.ini
Now you can set up a new DataJson harvester by visiting:
http://yourdomain.com/harvest
And when configuring the data source, just choose "/data.json" as the source type.
Written by the HealthData.gov team.
As a work of the United States Government, the files in this repository are in the U.S. public domain. Additionally, we waive copyright and neighboring rights worldwide through the Creative Commons CC0 Public Domain Dedication (http://creativecommons.org/publicdomain/zero/1.0/).