Benchmarking test sets #9

poleksic · 2020-03-04T00:27:12Z

Hi Daniel,
My question concerns the benchmarking data sets in Fig. 3 of the paper "Systematic integration of biomedical knowledge prioritizes drugs for repurposing". Are those available for download? I tried to compile the test data myself using DrugCentral and the other datasets you made available as part of the project. However, I can't get the number of non-indications to match those in Fig 3.

I believe I understand how you compile non-indications for the "Disease Modifying" dataset. Basically, 208,413 = (1552 - 14) * (137 - 1) - 755 (where 1552 is #compounds, 14 is #disconnected compounds, 137 is #diseases, 1 is #disconnected diseases and 755 is #DM indications). But how do you compute the set of non-indications for DrugCentral? In particular, where does 207,572 (Fig. 3) come from? Same for the Clinical Trials and Symptomatic data sets.
Thank you.

dhimmel · 2020-03-04T19:27:16Z

This first comment will contain helpful resources. I'll make a second comment to answer more of the questions.

My question concerns the benchmarking data sets in Fig. 3 of the paper. Are those available for download?

Embedding the figure below for reference:

Are those available for download?

Yes, I believe prediction/predictions/probabilities.tsv is the best place to see the full set of compound-disease pairs, which can be filtered to generate the various benchmark sets.

compound_id	compound_name	disease_id	disease_name	prior_prob	prediction	training_prediction	compound_percentile	disease_percentile
DB01048	Abacavir	DOID:10652	Alzheimer's disease	0.004753	0.000930405137780005	0.00112945581330063	0.125	0.154746423927178
DB05812	Abiraterone	DOID:10652	Alzheimer's disease	0.004753	0.00379528958481219	0.00460442828313575	0.757352941176471	0.842652795838752
DB00659	Acamprosate	DOID:10652	Alzheimer's disease	0.004753	0.0162300916490301	0.0196380147334522	0.985294117647059	0.988296488946684
DB00284	Acarbose	DOID:10652	Alzheimer's disease	0.004753	0.00146927328449796	0.00178340350395021	0.595588235294118	0.368660598179454
DB01193	Acebutolol	DOID:10652	Alzheimer's disease	0.004753	0.00177375424093999	0.00215284205236242	0.772058823529412	0.472041612483745

dhimmel · 2020-03-04T19:39:42Z

In the probabilities.tsv table above:

status is the true labels column for "Disease Modifying" in the figure
status_trials is the true labels column for "Clinical Trial"
status_drugcentral is the true labels column for "DrugCentral"

You'll see that the numbers of positives and negatives in the figure match the n_pos and n_neg values in the table in cell 14 of prediction/4-predictr.ipynb.

I created a Python notebook that shows that the columns in probabilities.tsv have the same counts of positives and negatives as Figure 3A. It also computes a status_sym column which equates to "Symptomatic" in the figure.
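As a rough sketch of that kind of check (not the notebook itself), the snippet below tallies positives, negatives, and missing labels for each status column. It uses only the Python standard library and a tiny made-up table standing in for probabilities.tsv. The 1/0/empty encoding of the labels and the helper name count_labels are assumptions for illustration.

```python
import csv
import io

# Toy stand-in for prediction/predictions/probabilities.tsv (made-up rows).
# Assumption: labels are 1 (positive), 0 (negative), or empty (missing).
TOY_TSV = (
    "compound_id\tdisease_id\tstatus\tstatus_trials\tstatus_drugcentral\n"
    "C1\tD1\t1\t\t\n"
    "C2\tD1\t0\t1\t0\n"
    "C3\tD1\t0\t0\t1\n"
    "C4\tD1\t0\t0\t0\n"
)


def count_labels(rows, column):
    """Return (n_pos, n_neg, n_missing) for one label column."""
    n_pos = n_neg = n_missing = 0
    for row in rows:
        value = row[column].strip()
        if value == "":
            n_missing += 1
        elif float(value) == 1:
            n_pos += 1
        else:
            n_neg += 1
    return n_pos, n_neg, n_missing


rows = list(csv.DictReader(io.StringIO(TOY_TSV), delimiter="\t"))
for column in ("status", "status_trials", "status_drugcentral"):
    print(column, count_labels(rows, column))
```

On the real file, the same tally per column would be compared against the counts reported in Figure 3A. If the file marks missing labels with a token such as NA rather than an empty field, the missing-value test would need adjusting.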

For Symptomatic, positives are compound-disease pairs with a palliates edge in Hetionet. Negatives are all other compound-disease pairs excluding disease-modifying indications. See #7.
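A minimal sketch of that labeling rule, assuming the palliates pairs and the disease-modifying pairs are already available as sets of (compound, disease) tuples. The function name label_symptomatic and the toy identifiers are hypothetical. Disease-modifying pairs are checked first, so they are dropped from the Symptomatic benchmark entirely.

```python
def label_symptomatic(pairs, palliates, disease_modifying):
    """Map each compound-disease pair to 1, 0, or None (excluded)."""
    labels = {}
    for pair in pairs:
        if pair in disease_modifying:
            labels[pair] = None  # excluded from the Symptomatic benchmark
        elif pair in palliates:
            labels[pair] = 1  # positive: palliates edge
        else:
            labels[pair] = 0  # negative: every other pair
    return labels


# Toy data (made up for illustration).
pairs = [("C1", "D1"), ("C2", "D1"), ("C3", "D1"), ("C4", "D1")]
palliates = {("C2", "D1")}
disease_modifying = {("C1", "D1")}

labels = label_symptomatic(pairs, palliates, disease_modifying)
n_pos = sum(1 for value in labels.values() if value == 1)
n_neg = sum(1 for value in labels.values() if value == 0)
print(n_pos, n_neg)
```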

Figure 3A was made in the notebook prediction/6-vizr.ipynb. The relevant code here is:

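# Reshape to one row per compound-disease pair per benchmark context
grouped_df = prob_df %>%
  # Derive DM and SYM labels from the category column
  dplyr::mutate(DM = category %in% 'DM', SYM = category %in% 'SYM') %>%
  dplyr::rename(net_status=status) %>%
  tidyr::gather(context, status, DM, SYM, status_trials, status_drugcentral) %>%
  # Outside the DM context, keep only pairs that are not DM indications
  dplyr::filter(context == 'DM' | net_status == 0) %>%
  # Drop pairs that have no label in the given context
  dplyr::filter(!is.na(status))
# (excerpt ends here; the original snippet continues with a further %>% step)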
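For readers more comfortable with Python, here is a hedged translation of the reshaping and filtering logic in the R excerpt above, using plain lists of dicts. The function name to_long_format, the "pair" key, and the toy rows are made up for illustration. Missing labels are represented as None, and the output order is row by row rather than context by context as tidyr::gather would produce.

```python
CONTEXTS = ("DM", "SYM", "status_trials", "status_drugcentral")


def to_long_format(rows):
    """Emit one record per (pair, context), applying both filters."""
    long_rows = []
    for row in rows:
        wide = {
            "DM": int(row["category"] == "DM"),
            "SYM": int(row["category"] == "SYM"),
            "status_trials": row["status_trials"],
            "status_drugcentral": row["status_drugcentral"],
        }
        for context in CONTEXTS:
            status = wide[context]
            # Outside the DM context, drop pairs that are DM indications.
            keep = context == "DM" or row["status"] == 0
            # Drop pairs with no label in this context.
            if keep and status is not None:
                long_rows.append(
                    {"pair": row["pair"], "context": context, "status": status}
                )
    return long_rows


# Toy rows (made up): "status" is the disease-modifying label.
rows = [
    {"pair": "A", "category": "DM", "status": 1,
     "status_trials": None, "status_drugcentral": None},
    {"pair": "B", "category": "SYM", "status": 0,
     "status_trials": 1, "status_drugcentral": None},
    {"pair": "C", "category": None, "status": 0,
     "status_trials": 0, "status_drugcentral": 1},
]

long_rows = to_long_format(rows)
for record in long_rows:
    print(record)
```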

dhimmel · 2020-03-04T20:16:14Z

I believe I understand how you compile non-indications for the "Disease Modifying" dataset. Basically, 208,413 = (1552 - 14) * (137 - 1) - 755 (where 1552 is #compounds, 14 is #disconnected compounds, 137 is #diseases, 1 is #disconnected diseases and 755 is #DM indications).

Yes!

But how do you compute the set of non-indications for DrugCentral? In particular, where does 207,572 (Fig. 3) come from? Same for the Clinical Trials and Symptomatic data sets.

207,572 is the number of DrugCentral negatives, and 208 is the number of DrugCentral positives. The remaining 1388 compound-disease pairs have a missing value for status_drugcentral; these are all the indications in PharmacotherapyDB (including DM, SYM, and NOT treatments). I don't recall why we remove NOT treatments from the negatives, but it's a small number of observations.
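The counts quoted in this thread can be reconciled with a few lines of arithmetic. The sketch below uses only numbers stated in the thread. The variable names are mine, and the implied Symptomatic count is derived here rather than quoted, on the assumption that the 1388 missing values are exactly the DM, SYM, and NOT pairs.

```python
# Numbers quoted in this thread.
n_compounds, n_disconnected_compounds = 1552, 14
n_diseases, n_disconnected_diseases = 137, 1
n_dm = 755               # disease-modifying indications
n_not = 243              # NOT pairs in PharmacotherapyDB
n_dc_missing = 1388      # pairs with missing status_drugcentral
n_dc_pos, n_dc_neg = 208, 207572

# All connected compound-disease pairs.
n_pairs = (n_compounds - n_disconnected_compounds) * (
    n_diseases - n_disconnected_diseases
)

# Disease Modifying negatives: every pair that is not a DM indication.
n_dm_neg = n_pairs - n_dm

# DrugCentral: pairs left once the PharmacotherapyDB pairs are set to missing.
n_dc_labeled = n_pairs - n_dc_missing

# Derived, not quoted in the thread: SYM pairs implied by the 1388 total.
implied_sym = n_dc_missing - n_dm - n_not

print(n_pairs, n_dm_neg, n_dc_labeled, implied_sym)
```

The labeled DrugCentral pairs (209,168 - 1388 = 207,780) split into the 208 positives and 207,572 negatives shown in the figure.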

Does that answer everything?

poleksic · 2020-03-04T22:25:47Z

Thank you for the thorough explanation and for pointing to the probabilities.tsv file! On a side note, removing NOT treatments from negatives in DrugCentral is justified as it (similar to the removal of DMs from positives) helps prevent an overfitting effect on performance (your classifier is trained to recognize negatives too). Does this make sense (or am I perhaps missing something simple)?

dhimmel · 2020-03-09T17:01:55Z

We define NOT treatments as:

non-indication meaning a drug that neither therapeutically changes the underlying or downstream biology nor treats a significant symptom of the disease.

So compound-disease pairs labeled NOT in PharmacotherapyDB are more like negative observations compared to positives. However, because they were included in upstream indication resources, these NOT pairs also may have some properties of treatments, but not enough to meet our definition of DM or SYM according to the curators.

removing NOT treatments from negatives in DrugCentral is justified as it (similar to the removal of DMs from positives) helps prevent an overfitting effect on performance

There are only 243 NOT compound-disease pairs in PharmacotherapyDB. Since we're using essentially all non-positives as negatives, with the exclusions detailed above, the impact on training and performance of including or excluding these 243 pairs as negatives is trivial.

dhimmel mentioned this issue Mar 4, 2020

Can't find the Symptomatic validation dataset #7

Closed
