Benchmarking test sets #9
This first comment will contain helpful resources. I'll make a second comment to answer more of the questions.
Embedding the figure below for reference:
Yes, I believe
In the
You'll see the numbers of positives and negatives in the figure will match the
I created a Python notebook that shows that the columns in
For Symptomatic, positives are compound-disease pairs with a palliates edge in Hetionet. Negatives are all other compound-disease pairs excluding disease-modifying indications. See #7.
Figure 3A was made in the notebook
grouped_df = prob_df %>%
  dplyr::mutate(DM = category %in% 'DM', SYM = category %in% 'SYM') %>%
  dplyr::rename(net_status=status) %>%
  tidyr::gather(context, status, DM, SYM, status_trials, status_drugcentral) %>%
  dplyr::filter(context == 'DM' | net_status == 0) %>%
  dplyr::filter(!is.na(status)) %>%
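The Symptomatic labeling rule described above (palliates pairs are positives, disease-modifying indications are dropped, everything else is a negative) can be sketched in Python roughly as follows. This is only an illustration: the function name `symptomatic_labels`, the tuple layout, and the assumption that a `category` value of `'SYM'` corresponds to a palliates edge and `'DM'` to a disease-modifying indication are mine, not taken from the project's notebooks.

```python
def symptomatic_labels(pairs):
    """Label compound-disease pairs for a Symptomatic-style test set.

    `pairs` is an iterable of (compound, disease, category) tuples, where
    category is 'DM', 'SYM', or None (hypothetical layout for illustration).
    Returns a dict mapping (compound, disease) -> 1 (positive) or 0 (negative).
    """
    labels = {}
    for compound, disease, category in pairs:
        if category == "DM":
            # Disease-modifying indications are excluded from the negatives.
            continue
        labels[(compound, disease)] = 1 if category == "SYM" else 0
    return labels


# Toy input with made-up identifiers.
example = [
    ("C1", "D1", "DM"),
    ("C1", "D2", "SYM"),
    ("C2", "D1", None),
]
result = symptomatic_labels(example)
```

In this toy input the DM pair is dropped, the SYM pair becomes a positive, and the remaining pair becomes a negative.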
Yes!
207,572 is the number of DrugCentral negatives. 208 is the number of DrugCentral positives. 1,388 compound-disease pairs have a missing value for
Does that answer everything?
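As a quick sanity check, the three DrugCentral counts quoted here appear to partition the full set of connected compound-disease pairs, using the compound and disease counts from the original question (1552 compounds with 14 disconnected, 137 diseases with 1 disconnected). The variable names are just for illustration:

```python
# Counts quoted in this thread.
n_compounds = 1552 - 14   # connected compounds
n_diseases = 137 - 1      # connected diseases
total_pairs = n_compounds * n_diseases

dc_negatives = 207_572
dc_positives = 208
dc_missing = 1_388

# If the three groups cover every pair exactly once, the sums agree.
dc_sum = dc_negatives + dc_positives + dc_missing
print(total_pairs, dc_sum)
```

Both values come out to 209,168, which is consistent with the 1,388 pairs with a missing value being the ones excluded from the DrugCentral negatives.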
Thank you for the thorough explanation and for pointing to the probabilities.tsv file! On a side note, removing NOT treatments from the DrugCentral negatives seems justified because, like removing DMs from the positives, it helps keep overfitting from inflating the measured performance (your classifier is trained to recognize negatives too). Does this make sense, or am I perhaps missing something simple?
We define NOT treatments as:
So compound-disease pairs labeled
There are only 243
Hi Daniel,
My question concerns the benchmarking data sets in Fig. 3 of the paper "Systematic integration of biomedical knowledge prioritizes drugs for repurposing". Are those available for download? I tried to compile the test data myself using DrugCentral and the other datasets you made available as part of the project. However, I can't get the numbers of non-indications to match those in Fig. 3.
I believe I understand how you compile non-indications for the "Disease Modifying" dataset. Basically, 208,413 = (1552 - 14) * (137 - 1) - 755 (where 1552 is #compounds, 14 is #disconnected compounds, 137 is #diseases, 1 is #disconnected diseases and 755 is #DM indications). But how do you compute the set of non-indications for DrugCentral? In particular, where does 207,572 (Fig. 3) come from? Same for the Clinical Trials and Symptomatic data sets.
Thank you.
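For reference, the Disease Modifying arithmetic in the question can be reproduced with a few lines of Python. The function name and parameter names below are hypothetical and exist only to spell out the formula as stated:

```python
def dm_non_indications(n_compounds, n_disconnected_compounds,
                       n_diseases, n_disconnected_diseases,
                       n_dm_indications):
    """Count connected compound-disease pairs that are not DM indications."""
    connected_pairs = ((n_compounds - n_disconnected_compounds)
                       * (n_diseases - n_disconnected_diseases))
    return connected_pairs - n_dm_indications


# Numbers taken from the question above.
dm_negatives = dm_non_indications(1552, 14, 137, 1, 755)
print(dm_negatives)
```

With the numbers from the question this gives 208,413, matching the Disease Modifying non-indication count cited from Fig. 3.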