ECFP4 fingerprints fail to distinguish some very different molecules #311

Open
mcloughlin2 opened this issue Jun 13, 2024 · 1 comment
@mcloughlin2
Collaborator

With certain datasets, MultitaskScaffoldSplitter frequently emits the warning "two scaffolds match exactly?!?". The warning is triggered when the minimum Tanimoto distance between pairs of compounds drawn from two different scaffolds is zero. One would expect this to mean that the same compound somehow ended up in two different scaffold sets, but that is not the case.
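To make the triggering condition concrete, here is a minimal sketch of a minimum cross-scaffold Tanimoto distance, with fingerprints represented as plain sets of on-bit indices. The function names and the toy bit sets are made up for illustration; this is not AMPL's implementation.

```python
def tanimoto_distance(bits_a, bits_b):
    """Tanimoto distance between two fingerprints given as sets of on-bit indices."""
    union = bits_a | bits_b
    if not union:
        # Convention chosen for this sketch: two empty fingerprints are identical.
        return 0.0
    return 1.0 - len(bits_a & bits_b) / len(union)


def min_cross_distance(scaffold_a, scaffold_b):
    """Smallest distance between any compound in scaffold_a and any in scaffold_b."""
    return min(tanimoto_distance(a, b) for a in scaffold_a for b in scaffold_b)


# Toy bit sets (not real fingerprints): one compound in each scaffold set
# happens to have exactly the same on-bits, so the minimum distance is zero.
scaffold_a = [{1, 5, 9}, {2, 5}]
scaffold_b = [{1, 5, 9}, {3, 7}]
print(min_cross_distance(scaffold_a, scaffold_b))
```

The distance is zero exactly when the two bit sets are equal, so the warning only tells us that two fingerprints coincide, not that two compounds do.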

The actual issue is that we generally compute Tanimoto distances with radius 2 (ECFP4) fingerprints, and two compounds with very different scaffolds can have the same ECFP4 fingerprint. Here are some examples of pairs of compounds that led to the above warning message:

SMILES 1: CC(=O)N(C)[C@@H](Cc1ccccc1)C(=O)N(C)[C@@H](Cc1ccccc1)C(N)=O
SMILES 2: CC(=O)N(C)[C@@H](Cc1ccccc1)C(=O)N(C)[C@@H](Cc1ccccc1)C(=O)N(C)[C@@H](Cc1ccccc1)C(=O)N(C)[C@@H](Cc1ccccc1)C(N)=O

which have structures:
[images: structures of SMILES 1 and SMILES 2]

SMILES 3: CC(CN1CCCCCC1)NC(=O)/C=N/O
SMILES 4: CC(CN1CCCCC1)NC(=O)/C=N/O

with structures:
[images: structures of SMILES 3 (left) and SMILES 4 (right)]
SMILES 3 has a 7-membered rather than a 6-membered ring, and therefore a completely different scaffold. However, because an ECFP fingerprint simply records which chemical substructures occur within a specified radius of each atom, the two molecules in each pair have exactly the same fingerprint at radius 2. You have to increase the radius to 3 to get different fingerprints for both of these example pairs.
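A quick way to check this is to compute the Tanimoto distance for each pair at radius 2 and radius 3 with RDKit. This is only a sketch: the helper name `ecfp_distance` and the 2048-bit fingerprint length are assumptions for illustration (the bit length used by the splitter is not stated here), and it uses RDKit's `rdFingerprintGenerator` API with default settings (bit vectors, chirality ignored).

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import rdFingerprintGenerator

# The two example pairs from this issue.
PAIRS = {
    "SMILES 1 vs 2": (
        "CC(=O)N(C)[C@@H](Cc1ccccc1)C(=O)N(C)[C@@H](Cc1ccccc1)C(N)=O",
        "CC(=O)N(C)[C@@H](Cc1ccccc1)C(=O)N(C)[C@@H](Cc1ccccc1)C(=O)N(C)"
        "[C@@H](Cc1ccccc1)C(=O)N(C)[C@@H](Cc1ccccc1)C(N)=O",
    ),
    "SMILES 3 vs 4": (
        "CC(CN1CCCCCC1)NC(=O)/C=N/O",
        "CC(CN1CCCCC1)NC(=O)/C=N/O",
    ),
}


def ecfp_distance(smiles_a, smiles_b, radius, fp_size=2048):
    """Tanimoto distance between Morgan bit-vector fingerprints of two SMILES.

    fp_size=2048 is an assumption made for this sketch.
    """
    gen = rdFingerprintGenerator.GetMorganGenerator(radius=radius, fpSize=fp_size)
    fp_a = gen.GetFingerprint(Chem.MolFromSmiles(smiles_a))
    fp_b = gen.GetFingerprint(Chem.MolFromSmiles(smiles_b))
    return 1.0 - DataStructs.TanimotoSimilarity(fp_a, fp_b)


for name, (smi_a, smi_b) in PAIRS.items():
    for radius in (2, 3):
        dist = ecfp_distance(smi_a, smi_b, radius)
        print(f"{name}: radius {radius} -> Tanimoto distance {dist:.3f}")
```

Consistent with the observation above, the radius 2 distance should come out as zero for both pairs and the radius 3 distance as nonzero.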

The solution for MultitaskScaffoldSplitter is simply to increase the radius used for computing the scaffold-scaffold distance matrices. We may want to do likewise in the other AMPL modules where fingerprints are commonly used for measuring and visualizing chemical diversity: chem_diversity, diversity_plots, compare_splits_plots and rdkit_easy. This is also something to think about when using ECFP features in AMPL models.
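As a rough illustration of the proposed change, the sketch below builds a scaffold-scaffold minimum-distance matrix with the fingerprint radius exposed as a parameter. The function `scaffold_min_distance_matrix`, its arguments, and the 2048-bit length are hypothetical and are not the actual MultitaskScaffoldSplitter code; each scaffold set is assumed to be a list of SMILES strings.

```python
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import rdFingerprintGenerator


def scaffold_min_distance_matrix(scaffold_sets, radius=3, fp_size=2048):
    """Minimum Tanimoto distance between every pair of scaffold sets.

    scaffold_sets: list of lists of SMILES, one inner list per scaffold.
    radius and fp_size are illustrative defaults, not AMPL's settings.
    """
    gen = rdFingerprintGenerator.GetMorganGenerator(radius=radius, fpSize=fp_size)
    fps = [
        [gen.GetFingerprint(Chem.MolFromSmiles(smi)) for smi in smiles_list]
        for smiles_list in scaffold_sets
    ]
    n = len(fps)
    dmat = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            # The highest similarity over all cross-set pairs gives the minimum distance.
            max_sim = max(
                max(DataStructs.BulkTanimotoSimilarity(fp, fps[j])) for fp in fps[i]
            )
            dmat[i, j] = dmat[j, i] = 1.0 - max_sim
    return dmat


# The azepane and piperidine compounds from this issue, as two one-compound scaffold sets.
sets = [["CC(CN1CCCCCC1)NC(=O)/C=N/O"], ["CC(CN1CCCCC1)NC(=O)/C=N/O"]]
for r in (2, 3):
    print(r, scaffold_min_distance_matrix(sets, radius=r)[0, 1])
```

Exposing the radius this way would also make it easy to compare the resulting splits at radius 2 and radius 3 before changing any defaults.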

@paulsonak
Collaborator

Perhaps we should change the default to radius 3 in all the places it's used?
