feat: clean up data organization/add basic fetching scripts (#558)

* Use a centralized location for data in user space, rather than storing everything within a subdirectory of the repo. This is really more of a convenience for me because it means keeping fewer copies of the latest ChEMBL release around. It reflects a strategy we've used in our lil data management tool.
* Add a script to download all DGIdb-relevant files, including the latest versions of updateable stuff. I know we'd previously maintained some updater modules for this purpose, but I'd prefer something that reuses the data management tool mentioned above; it's a little less maintenance for me.
* Put all DGIdb static files into an S3 bucket. By default this won't be accessible to non-NCH users, but we could probably arrange something if necessary. Hypothetically some of this data is supposed to be non-public, so I don't want to put it on a public bucket for now.
* Assign version values dynamically where possible. Previously, DGIdb generally expected files to be named "claims.tsv" or some variation thereof, but we've used a file naming pattern in our other projects along the lines of `<source>_<the kind of stuff>_<version>.<filetype>`, since not every file includes its version as part of the data. So now, the file downloader is in charge of identifying the version of the data and naming the file as needed, and then DGI importers can just read version values from filenames.
* Remove some unused/no-longer-functional/now-redundant code related to updating/downloading.

The workflow for updating would go like:

* `rake db:reset`
* fetch/update data with `python3 scripts/download_files.py`
* run imports
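
To make the filename convention concrete, here is a minimal Python sketch of how a version could be read back out of a name shaped like `<source>_<the kind of stuff>_<version>.<filetype>`. It is not the code from this commit: the function name `parse_data_filename`, the `VersionedFile` type, and the example filenames are hypothetical. It assumes that the version itself contains no underscores and that the caller already knows the source name (source names such as `guide_to_pharmacology` contain underscores, so the name alone is ambiguous).

```python
from pathlib import Path
from typing import NamedTuple


class VersionedFile(NamedTuple):
    """Parts of a `<source>_<kind>_<version>.<filetype>` filename (hypothetical type)."""

    source: str
    kind: str
    version: str
    filetype: str


def parse_data_filename(filename: str, source: str) -> VersionedFile:
    """Split a data filename into source, kind, version, and filetype.

    Assumption: the version is the last underscore-delimited token before the
    file extension, and `source` is supplied by the caller.
    """
    path = Path(filename)
    filetype = path.suffix.lstrip(".")
    stem = path.stem

    prefix = f"{source}_"
    if not stem.startswith(prefix):
        raise ValueError(f"{filename!r} does not start with source {source!r}")

    # Everything after the source prefix is "<kind>_<version>".
    kind, separator, version = stem[len(prefix):].rpartition("_")
    if not separator or not kind or not version:
        raise ValueError(f"{filename!r} has no '<kind>_<version>' component")

    return VersionedFile(source, kind, version, filetype)


if __name__ == "__main__":
    # Example filenames are made up for illustration.
    print(parse_data_filename("chembl_claims_34.db", "chembl"))
    print(
        parse_data_filename(
            "guide_to_pharmacology_interactions_2024.1.csv",
            "guide_to_pharmacology",
        )
    )
```

Passing the source name explicitly keeps the split unambiguous; an importer that only needs the version can read the `version` field and ignore the rest.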
dgidb · Dec 20, 2024 · 31dcf81
1 parent 5737396

Showing 20 changed files with 675 additions and 554 deletions.
diff --git a/README.md b/README.md
@@ -21,11 +21,11 @@ cd dgidb-v5
 
 ### Server setup
 
-First, you may need to switch your Ruby version with RVM to match the version declared in the first few lines of the [Gemfile](server/Gemfile). For example, to switch to version 3.0.4:
+First, you may need to switch your Ruby version with RVM to match the version declared in the first few lines of the [Gemfile](server/Gemfile). For example, to switch to version 3.1.0:
 
 ```shell
-rvm install 3.0.4
-rvm 3.0.4
+rvm install 3.1.0
+rvm 3.1.0
 ```
 
 From the repo root, enter the [server subdirectory](server/):
@@ -55,22 +55,10 @@ pg_ctl -D /opt/homebrew/var/postgres start
 # on older macs you may need to use a different path instead, eg "pg_ctl -D /usr/local/var/postgres start"
 ```
 
-Database initialization utilities are in-progress, so for now, the easiest way to get a working database is to manually create it using the `psql` command. First, enter the psql console:
+The database must be constructed manually. This command will also vary, but it should be something like this:
 
 ```
-psql -d postgres  # if you are opening psql for the first time, you'll need to connect to the database 'postgres'
-# should produce a prompt like the following:
-# psql (14.2)
-# Type "help" for help.
-#
-# jss009=#
-```
-
-Within the psql console, create the DGIdb database, then quit:
-
-```
-CREATE DATABASE dgidb;
-\q
+createdb -U postgres dgidb
 ```
 
 Next, back in the main shell, import a database dump file (ask on Slack if you need the latest file):
@@ -95,100 +83,34 @@ To perform a data load from scratch, first run the `reset` task to provide a cle
 rake db:reset
 ```
 
-Most DGIdb data comes from static files, typically called `claims.tsv`. The data loader classes expect `server/lib/data/` to contain the following files:
-
-```
-lib/data
-├── bader_lab
-│   └── claims.tsv
-├── cancer_commons
-│   └── claims.tsv
-├── caris_molecular_intelligence
-│   └── claims.tsv
-├── cgi
-│   └── claims.tsv
-├── chembl
-│   └── chembl.db
-├── clearity_foundation_biomarkers
-│   └── claims.tsv
-├── clearity_foundation_clinical_trial
-│   └── claims.tsv
-├── cosmic
-│   └── claims.csv
-├── dgene
-│   └── claims.tsv
-├── drugbank
-│   └── claims.xml
-├── dtc
-│   └── claims.csv
-├── ensembl
-│   └── claims.tsv
-├── entrez
-│   └── claims.tsv
-├── fda
-│   └── claims.tsv
-├── foundation_one_genes
-│   └── claims.tsv
-├── go
-│   └── targets.tsv
-├── guide_to_pharmacology
-│   ├── interactions.csv
-│   └── targets_and_families.csv
-├── hingorani_casas
-│   └── claims.tsv
-├── hopkins_groom
-│   └── claims.tsv
-├── human_protein_atlas
-│   └── claims.tsv
-├── idg
-│   ├── claims.json
-│   └── claims.tsv
-├── msk_impact
-│   └── claims.tsv
-├── my_cancer_genome
-│   └── claims.tsv
-├── my_cancer_genome_clinical_trial
-│   └── claims.tsv
-├── nci
-│   ├── claims.tsv
-│   └── claims.xml
-├── oncokb
-│   ├── drug_claim.csv
-│   ├── gene_claim.csv
-│   ├── gene_claim_aliases.csv
-│   ├── interaction_claim.csv
-│   ├── interaction_claim_attributes.csv
-│   └── interaction_claim_links.csv
-├── oncomine
-│   └── claims.tsv
-├── pharmgkb
-│   └── claims.tsv
-├── russ_lampel
-│   └── claims.tsv
-├── talc
-│   └── claims.tsv
-├── tdg_clinical_trial
-│   ├── claims.tsv
-├── tempus
-│   └── claims.tsv
-├── tend
-│   └── claims.tsv
-└── ttd
-    └── claims.csv
-```
-
-First, load claims:
+Some Python libraries are required for importing data. From the repo root, create a Python virtual environment and install required dependencies:
+
+```shell
+python3 -m venv .venv
+source .venv/bin/activate
+pip install -r scripts/requirements.txt
+```
+
+A Python script is supplied to ensure that primary source data is available. This can also be used to acquire new versions of data that supply discrete releases (like ChEMBL):
+
+```
+python3 scripts/download_files.py
+```
+
+Then, load claims:
 
 ```shell
 rake dgidb:import:all
 ```
 
-Then, run grouping. By default, the groupers will expect a normalizer service to be running locally on port 8000; use the `THERAPY_HOSTNAME` and `GENE_HOSTNAME` environment variables to specify alternate hosts:
+Then, run grouping. See documentation for the [therapy](https://github.com/cancervariants/therapy-normalization) and [gene](https://gene-normalizer.readthedocs.io/stable/) normalizers for more.
+
+By default, the groupers will expect a normalizer service to be running locally on port 8000; use the `THERAPY_HOSTNAME` and `GENE_HOSTNAME` environment variables to specify alternate hosts:
 
 ```shell
-export THERAPY_HOSTNAME=http://localhost:7999  # no trailing backslash
+export THERAPY_HOSTNAME=http://localhost:7999
 rake dgidb:group:drugs
-export GENE_HOSTNAME=http://localhost:7998  # no trailing backslash
+export GENE_HOSTNAME=http://localhost:7998
 rake dgidb:group:genes
 rake dgidb:group:interactions
 ```
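
The hostname override described in the README change above can be sketched in Python as follows. This is an illustration only, not the groupers' actual implementation: the helper name `normalizer_base_url` is hypothetical. The default of `http://localhost:8000` comes from the README text; stripping a trailing slash is a precaution added here, since the README does not say whether the groupers tolerate one.

```python
import os

# Default taken from the README: a normalizer service running locally on port 8000.
DEFAULT_NORMALIZER_URL = "http://localhost:8000"


def normalizer_base_url(env_var: str, default: str = DEFAULT_NORMALIZER_URL) -> str:
    """Return the base URL from `env_var`, falling back to `default`.

    Hypothetical helper: an unset or blank variable falls back to the default,
    and any trailing slash is removed so paths can be appended safely.
    """
    value = os.environ.get(env_var, "").strip()
    return (value or default).rstrip("/")


if __name__ == "__main__":
    os.environ["THERAPY_HOSTNAME"] = "http://localhost:7999/"
    print(normalizer_base_url("THERAPY_HOSTNAME"))
    print(normalizer_base_url("GENE_HOSTNAME"))
```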