feat: clean up data organization/add basic fetching scripts (#558)
* Use a centralized location for data in user space, rather than storing everything within a subdirectory of the repo. This is really more of a convenience for me because it means keeping fewer copies of the latest ChEMBL release around. It reflects a strategy we've used in our lil data management tool.
* Add a script to download all DGIdb-relevant files, including the latest versions of updateable stuff. I know we'd previously maintained some updater modules for this purpose, but I'd prefer something that reuses the data management tool mentioned above; it's a little less maintenance for me.
* Put all DGIdb static files into an S3 bucket. By default this won't be accessible to non-NCH users but we could probably arrange something if necessary. Hypothetically some of this data is supposed to be non-public, so I don't want to put it on a public bucket for now.
* Assign version values dynamically where possible. Previously, dgidb generally liked it when files were named "claims.tsv" or some variation thereof, but we've used a file naming pattern in our other projects along the lines of <source>_<the kind of stuff>_<version>.<filetype>, since not every file includes its version as part of the data. So now, the file downloader is in charge of identifying the version of the data and naming the file as needed, and then DGI importers can just read version values from filenames.
* Remove some unused/no-longer-functional/now-redundant code related to updating/downloading
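
The centralized data location and the <source>_<the kind of stuff>_<version>.<filetype> naming scheme described above could look roughly like the following Python sketch. This is not the code in scripts/download_files.py: the `DGIDB_DATA_DIR` variable, the default directory, and all function names are hypothetical, and the sketch assumes the version is the last underscore-separated token of the file name and that files have a single extension.

```python
import os
from pathlib import Path

# Hypothetical names: the commit does not say which environment variable or
# default directory the real tooling uses.
DATA_DIR_ENV = "DGIDB_DATA_DIR"
DEFAULT_DATA_DIR = Path.home() / ".local" / "share" / "dgidb"


def data_dir() -> Path:
    """Return the user-space data directory, honoring an optional override."""
    override = os.environ.get(DATA_DIR_ENV)
    return Path(override).expanduser() if override else DEFAULT_DATA_DIR


def versioned_filename(source: str, kind: str, version: str, filetype: str) -> str:
    """Build a name following <source>_<kind>_<version>.<filetype>."""
    return f"{source}_{kind}_{version}.{filetype}"


def version_from_filename(filename: str) -> str:
    """Read the version back out of a name built by versioned_filename.

    Assumes the version is the last underscore-separated token of the stem
    and that the file has a single extension (so not ".tar.gz").
    """
    stem = Path(filename).stem  # drops only the final extension
    prefix, sep, version = stem.rpartition("_")
    if not sep or not prefix or not version:
        raise ValueError(f"no version found in file name: {filename!r}")
    return version


if __name__ == "__main__":
    # Illustrative values only; not real DGIdb file names.
    name = versioned_filename("examplesource", "claims", "20241220", "tsv")
    print(data_dir() / name)
    print(version_from_filename(name))
```

Taking the version from the last underscore-separated token avoids having to guess where a multi-word source name such as guide_to_pharmacology ends, at the cost of forbidding underscores inside the version itself.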

The workflow for updating would go like:

1. rake db:reset
2. fetch/update data with python3 scripts/download_files.py
3. run imports
jsstevenson authored Dec 20, 2024
1 parent 5737396 commit 31dcf81
Showing 20 changed files with 675 additions and 554 deletions.
128 changes: 25 additions & 103 deletions README.md
@@ -21,11 +21,11 @@ cd dgidb-v5

### Server setup

-First, you may need to switch your Ruby version with RVM to match the version declared in the first few lines of the [Gemfile](server/Gemfile). For example, to switch to version 3.0.4:
+First, you may need to switch your Ruby version with RVM to match the version declared in the first few lines of the [Gemfile](server/Gemfile). For example, to switch to version 3.1.0:

```shell
-rvm install 3.0.4
-rvm 3.0.4
+rvm install 3.1.0
+rvm 3.1.0
```

From the repo root, enter the [server subdirectory](server/):
@@ -55,22 +55,10 @@ pg_ctl -D /opt/homebrew/var/postgres start
# on older macs you may need to use a different path instead, eg "pg_ctl -D /usr/local/var/postgres start"
```

-Database initialization utilities are in-progress, so for now, the easiest way to get a working database is to manually create it using the `psql` command. First, enter the psql console:
+The database must be constructed manually. This command will also vary, but it should be something like this:

```
-psql -d postgres # if you are opening psql for the first time, you'll need to connect to the database 'postgres'
-# should produce a prompt like the following:
-# psql (14.2)
-# Type "help" for help.
-#
-# jss009=#
-```
-
-Within the psql console, create the DGIdb database, then quit:
-
-```
-CREATE DATABASE dgidb;
-\q
+createdb -U postgres dgidb
```

Next, back in the main shell, import a database dump file (ask on Slack if you need the latest file):
@@ -95,100 +83,34 @@ To perform a data load from scratch, first run the `reset` task to provide a cle
rake db:reset
```

-Most DGIdb data comes from static files, typically called `claims.tsv`. The data loader classes expect `server/lib/data/` to contain the following files:
-
-```
-lib/data
-├── bader_lab
-│ └── claims.tsv
-├── cancer_commons
-│ └── claims.tsv
-├── caris_molecular_intelligence
-│ └── claims.tsv
-├── cgi
-│ └── claims.tsv
-├── chembl
-│ └── chembl.db
-├── clearity_foundation_biomarkers
-│ └── claims.tsv
-├── clearity_foundation_clinical_trial
-│ └── claims.tsv
-├── cosmic
-│ └── claims.csv
-├── dgene
-│ └── claims.tsv
-├── drugbank
-│ └── claims.xml
-├── dtc
-│ └── claims.csv
-├── ensembl
-│ └── claims.tsv
-├── entrez
-│ └── claims.tsv
-├── fda
-│ └── claims.tsv
-├── foundation_one_genes
-│ └── claims.tsv
-├── go
-│ └── targets.tsv
-├── guide_to_pharmacology
-│ ├── interactions.csv
-│ └── targets_and_families.csv
-├── hingorani_casas
-│ └── claims.tsv
-├── hopkins_groom
-│ └── claims.tsv
-├── human_protein_atlas
-│ └── claims.tsv
-├── idg
-│ ├── claims.json
-│ └── claims.tsv
-├── msk_impact
-│ └── claims.tsv
-├── my_cancer_genome
-│ └── claims.tsv
-├── my_cancer_genome_clinical_trial
-│ └── claims.tsv
-├── nci
-│ ├── claims.tsv
-│ └── claims.xml
-├── oncokb
-│ ├── drug_claim.csv
-│ ├── gene_claim.csv
-│ ├── gene_claim_aliases.csv
-│ ├── interaction_claim.csv
-│ ├── interaction_claim_attributes.csv
-│ └── interaction_claim_links.csv
-├── oncomine
-│ └── claims.tsv
-├── pharmgkb
-│ └── claims.tsv
-├── russ_lampel
-│ └── claims.tsv
-├── talc
-│ └── claims.tsv
-├── tdg_clinical_trial
-│ ├── claims.tsv
-├── tempus
-│ └── claims.tsv
-├── tend
-│ └── claims.tsv
-└── ttd
-  └── claims.csv
-```
-
-First, load claims:
+Some Python libraries are required for importing data. From the repo root, create a Python virtual environment and install required dependencies:
+
+```shell
+python3 -m venv .venv
+source .venv/bin/activate
+pip install -r scripts/requirements.txt
+```
+
+A Python script is supplied to ensure that primary source data is available. This can also be used to acquire new versions of data that supply discrete releases (like ChEMBL):
+
+```
+python3 scripts/download_files.py
+```
+
+Then, load claims:

```shell
rake dgidb:import:all
```

-Then, run grouping. By default, the groupers will expect a normalizer service to be running locally on port 8000; use the `THERAPY_HOSTNAME` and `GENE_HOSTNAME` environment variables to specify alternate hosts:
+Then, run grouping. See documentation for the [therapy](https://github.com/cancervariants/therapy-normalization) and [gene](https://gene-normalizer.readthedocs.io/stable/) normalizers for more.
+
+By default, the groupers will expect a normalizer service to be running locally on port 8000; use the `THERAPY_HOSTNAME` and `GENE_HOSTNAME` environment variables to specify alternate hosts:

```shell
-export THERAPY_HOSTNAME=http://localhost:7999 # no trailing backslash
+export THERAPY_HOSTNAME=http://localhost:7999
 rake dgidb:group:drugs
-export GENE_HOSTNAME=http://localhost:7998 # no trailing backslash
+export GENE_HOSTNAME=http://localhost:7998
 rake dgidb:group:genes
 rake dgidb:group:interactions
rake dgidb:group:genes
rake dgidb:group:interactions
```
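
The host override described in the README text above (fall back to a local normalizer on port 8000 unless `THERAPY_HOSTNAME` or `GENE_HOSTNAME` is set) can be illustrated with a short Python sketch. The groupers themselves are rake tasks, so this is not their implementation: the function name is hypothetical, the `http://` scheme on the default is an assumption, and stripping a trailing slash is a defensive choice rather than documented behavior.

```python
import os

# Default taken from the README text (local service on port 8000); the
# "http://" scheme is an assumption.
DEFAULT_NORMALIZER_HOST = "http://localhost:8000"


def normalizer_host(env_var: str, default: str = DEFAULT_NORMALIZER_HOST) -> str:
    """Return the host named by env_var, or the local default if unset or empty."""
    value = os.environ.get(env_var, "").strip()
    # Dropping a trailing slash is a defensive choice, not documented behavior.
    return (value or default).rstrip("/")


if __name__ == "__main__":
    print(normalizer_host("THERAPY_HOSTNAME"))
    print(normalizer_host("GENE_HOSTNAME"))
```

Treating an empty variable the same as an unset one means a stray `export GENE_HOSTNAME=` does not silently point the grouper at an empty host.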