Commit
feat: clean up data organization/add basic fetching scripts (#558)
* Use a centralized location for data in user space, rather than storing everything within a subdirectory of the repo. This is really more of a convenience for me because it means keeping fewer copies of the latest ChEMBL release around. It reflects a strategy we've used in our lil data management tool.
* Add a script to download all DGIdb-relevant files, including the latest versions of updateable stuff. I know we'd previously maintained some updater modules for this purpose, but I'd prefer something that reuses the data management tool mentioned above; it's a little less maintenance for me.
* Put all DGIdb static files into an S3 bucket. By default this won't be accessible to non-NCH users, but we could probably arrange something if necessary. Hypothetically, some of this data is supposed to be non-public, so I don't want to put it on a public bucket for now.
* Assign version values dynamically where possible. Previously, dgidb generally expected files to be named "claims.tsv" or some variation thereof, but we've used a file naming pattern in our other projects along the lines of <source>_<the kind of stuff>_<version>.<filetype>, since not every file includes its version as part of the data. So now the file downloader is in charge of identifying the version of the data and naming the file accordingly, and the DGI importers can just read version values from filenames.
* Remove some unused/no-longer-functional/now-redundant code related to updating/downloading.

The workflow for updating would go like:

1. rake db:reset
2. fetch/update data with python3 scripts/download_files.py
3. run imports
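
To make the <source>_<the kind of stuff>_<version>.<filetype> naming pattern described above concrete, here is a minimal Python sketch of how a downloader might build such a filename and how a reader might recover the version from it. This is not the code from this commit: `build_filename`, `parse_filename`, and `DataFileName` are hypothetical names, the "chembl"/"claims" values are only illustrative, and the parsing assumes the source and version contain no underscores and the file type contains no dots.

```python
import re
from pathlib import Path
from typing import NamedTuple


class DataFileName(NamedTuple):
    """Parts of a <source>_<kind>_<version>.<filetype> filename (hypothetical helper)."""
    source: str
    kind: str
    version: str
    filetype: str


# Assumption: source and version have no underscores; filetype has no dots.
# The "kind" segment may itself contain underscores (e.g. "drug_claims").
_FILENAME_PATTERN = re.compile(
    r"^(?P<source>[^_]+)_(?P<kind>.+)_(?P<version>[^_]+)\.(?P<filetype>[^.]+)$"
)


def build_filename(source: str, kind: str, version: str, filetype: str) -> str:
    """Name a downloaded file so that its version is carried in the filename."""
    return f"{source}_{kind}_{version}.{filetype}"


def parse_filename(path: str) -> DataFileName:
    """Recover source, kind, version, and filetype from a file path."""
    match = _FILENAME_PATTERN.match(Path(path).name)
    if match is None:
        raise ValueError(f"Filename does not follow the expected pattern: {path}")
    return DataFileName(**match.groupdict())


if __name__ == "__main__":
    name = build_filename("chembl", "claims", "33", "tsv")
    print(name)                          # chembl_claims_33.tsv
    print(parse_filename(name).version)  # 33
```

With a convention like this, an importer does not need to open the file or rely on a separate lookup to know which release it is loading; it only needs the path it was handed.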