In MultitaskScaffoldSplitter, with certain datasets, you often see warning messages saying "two scaffolds match exactly?!?". This happens when the minimum Tanimoto distance between pairs of compounds from the two scaffolds is zero. One would think this means that the same compound somehow wound up in two different scaffold sets, but it doesn't.
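The condition behind the warning can be sketched as follows. This is a minimal illustration, not AMPL's actual code: the function names, the representation of a fingerprint as a set of "on" bit indices, and the toy scaffold data are all assumptions made for the example.

```python
from itertools import product
from typing import FrozenSet, Sequence

Fingerprint = FrozenSet[int]  # hypothetical representation: indices of the "on" bits


def tanimoto_distance(a: Fingerprint, b: Fingerprint) -> float:
    """Tanimoto distance = 1 - |A & B| / |A | B|; defined as 0.0 for two empty sets."""
    union = len(a | b)
    if union == 0:
        return 0.0
    return 1.0 - len(a & b) / union


def min_scaffold_distance(scaffold_a: Sequence[Fingerprint],
                          scaffold_b: Sequence[Fingerprint]) -> float:
    """Smallest distance over all compound pairs drawn from the two scaffold sets."""
    return min(tanimoto_distance(x, y) for x, y in product(scaffold_a, scaffold_b))


# Made-up fingerprints: scaffold_1 and scaffold_2 each contain a compound
# whose fingerprint is {1, 5, 9}, so their minimum distance is zero.
scaffold_1 = [frozenset({1, 5, 9}), frozenset({1, 5, 12})]
scaffold_2 = [frozenset({2, 7}), frozenset({1, 5, 9})]
scaffold_3 = [frozenset({3, 4, 8})]

if min_scaffold_distance(scaffold_1, scaffold_2) == 0.0:
    print("two scaffolds match exactly?!?")
```

A zero minimum distance only says that two fingerprints are identical; it does not by itself say that the underlying compounds are the same.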
The actual issue is that we generally compute Tanimoto distances with radius 2 (ECFP4) fingerprints, and two compounds with very different scaffolds can have the same ECFP4 fingerprint. Here are some examples of pairs of compounds that led to the above warning message:
[Compound pairs and structure images not reproduced here.]
The compound on the left has a 7-membered rather than a 6-membered ring, and therefore a completely different scaffold. However, because an ECFP fingerprint simply represents a bag of chemical substructures found within a specified radius of each atom, the two molecules have exactly the same fingerprint at radius 2. You have to increase the radius to 3 to get different fingerprints for these two pairs of examples.
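The effect can be reproduced with a toy model of circular fingerprints. The sketch below is not RDKit's ECFP implementation and does not use the compounds from this issue: it treats every atom as carbon, describes each atom only by its degree, and uses two hypothetical test molecules (a 6-membered and a 7-membered ring, each with a single one-atom substituent). The names `substituted_ring`, `toy_fingerprint` and `tanimoto_distance` are made up for the example.

```python
from typing import Dict, FrozenSet, List

Graph = Dict[int, List[int]]  # atom index -> neighbor indices (all atoms treated as carbon)


def substituted_ring(ring_size: int) -> Graph:
    """Ring of `ring_size` atoms with one substituent atom attached to ring atom 0."""
    graph: Graph = {i: [(i - 1) % ring_size, (i + 1) % ring_size]
                    for i in range(ring_size)}
    graph[0].append(ring_size)   # substituent gets index `ring_size`
    graph[ring_size] = [0]
    return graph


def toy_fingerprint(graph: Graph, radius: int) -> FrozenSet[tuple]:
    """Bag (set) of atom environments for every radius from 0 up to `radius`."""
    # Iteration 0: an atom is described only by its degree.
    ids = {atom: (0, len(nbrs)) for atom, nbrs in graph.items()}
    features = set(ids.values())
    for k in range(1, radius + 1):
        # Iteration k: combine the atom's previous identifier with the
        # sorted previous identifiers of its neighbors.
        ids = {atom: (k, ids[atom], tuple(sorted(ids[n] for n in nbrs)))
               for atom, nbrs in graph.items()}
        features.update(ids.values())
    return frozenset(features)


def tanimoto_distance(a: FrozenSet[tuple], b: FrozenSet[tuple]) -> float:
    return 1.0 - len(a & b) / len(a | b)


six_ring = substituted_ring(6)
seven_ring = substituted_ring(7)

for radius in (2, 3):
    dist = tanimoto_distance(toy_fingerprint(six_ring, radius),
                             toy_fingerprint(seven_ring, radius))
    print(f"radius {radius}: Tanimoto distance = {dist:.3f}")
```

In this toy model the two rings yield the same set of environments at radius 2, because no environment reaches far enough around the ring to notice the extra atom. At radius 3 the environment centered on the ring atom farthest from the substituent differs between the two rings, so the distance becomes nonzero.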
The solution for MultitaskScaffoldSplitter is simply to increase the radius used for computing the scaffold-scaffold distance matrices. We may want to do likewise in the other AMPL modules where fingerprints are commonly used for measuring and visualizing chemical diversity: chem_diversity, diversity_plots, compare_splits_plots and rdkit_easy. This is also something to think about when using ECFP features in AMPL models.