Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Correct atom configuration prior fingerprint calculation #94

Open
bachi55 opened this issue Jan 17, 2019 · 5 comments
Open

Correct atom configuration prior fingerprint calculation #94

bachi55 opened this issue Jan 17, 2019 · 5 comments

Comments

@bachi55
Copy link
Contributor

bachi55 commented Jan 17, 2019

Hello,

I was wondering about the correct usage of do.typing(), do.aromaticity() and do.isotopes() before calculating the fingerprints of a molecules parsed from its SMILES using parse.smiles().

Let me go through a few examples.

MACCS
The documentation of get.fingerprints contains:

smiles <- c('CCC', 'CCN', 'CCN(C)(C)', 'c1ccccc1Cc1ccccc1','C1CCC1CC(CN(C)(C))CC(=O)CC')
mols <- parse.smiles(smiles)
fps <- lapply(mols, get.fingerprint, type='maccs')

There is no information, that the aromaticity might need to be perceived first as otherwise some SMARTS are not properly matched (?). In the CDK tests for the MACCSFingerprinter at least there is atom-typing and aromaticity-detection done.

Pubchem
For this fingerprinter the CDK tests indicate that implicit hydrogens should be converted to explicit ones, i.e. convert.implicit.to.explicit.

Klekota and Roth
Here the CDK tests indicate, that no "additional" function, e.g. typing or aromaticity detection, needs to be applied.

So I could continue the list of different fingerprinters, which seem to expect different "modifications" done to the parsed molecular structure. Some fingerprinters seems to take care about it internally (within the class).

I would like to know, how others are dealing with this issue? Which transformations you perform, when you calculate different fingerprints (I could believe that problem continues for descriptors as well)? Am I overthinking the problem here? Can a "to much modifications" (I called do.typing if it is not needed, e.g.) harm my calculation, i.e. wrong fingerprints?

I believe the problem should somehow be solved in CDK, but how can we maybe in the rcdk side make the documentation more precise?

Best regards,

Eric

@schymane
Copy link
Contributor

@bachi55 this is a great point and it's a problem we are encountering beyond the fingerprints (look at e.g. #73 and the massfuncs branch where we are trying to capture this for the mass calculations). Historically in RMassBank we patched this by doing the typing etc our side in wrapper functions to ensure consistency. What concerns me is that they are also order dependent (change the order of the commands and you get different results) and this is should not happen (or if it does, then users need guidance what order is "right").

We've had conversation (email) with @rajarshi about updating documentation to roxygen anyway and not only to they need roxygenising but also a revamp to deliver additional clarity. If we start to sum up the time I'm investing into clarifying these aspects to colleagues via email, it'd start adding up to the time it would take to upgrade the docs and the unit tests to ensure we're doing it right. Just need this magic thing called time ;-)

@rajarshi any idea how best to coordinate? I am hiring and hope to soon have either own capacity or team member capacity to potentially contribute to a joint effort as it's impacting what we need to achieve. My logical start point would be on the masses but we're still stuck with the NPEs ...

@bachi55 my approach for the masses was to build unit tests for all the "edge cases" I would expect and I would use those also to build the documentation to clarify (in examples) what to do and what not to do ... once we have them as unit tests we can then run them and interact with the CDK guys to check rcdk is doing what it should ...

@bachi55
Copy link
Contributor Author

bachi55 commented Jan 17, 2019

Hei Emma,

thanks for your comments. It's "good" to hear, that the problem is somehow known.

I am at the moment using your approach to wrap rcdk, when calculating the fingerprints to have at least consistent computations. And, I will update my code so that it can test for some "edge cases". Do you have a list of those for fingerprints as well?

Actually, I think it would be good, if the CDK framework would directly take care about the correct configurations, depending on the fingerprint used.

Best regards,

Eric

@schymane
Copy link
Contributor

schymane commented Jan 17, 2019 via email

@rajarshi
Copy link
Collaborator

rajarshi commented Jan 20, 2019

So, yes, this is a pain, but also a bit surprising. I had thought that the SMARTS parser would have performed the necessary configuration, but apparently not.

library(rcdk)

m1 <- parse.smiles("c1ccccc1")[[1]] # m1 will have all atoms aromatic
m2 <- parse.smiles("C1=CC=CC=C1")[[1]] # m2 will have no atoms aromatic

fp1 <- get.fingerprint(m1, type='maccs')
fp2 <- get.fingerprint(m2, type='maccs')
fp1@bits == fp2@bits

The last equality check will fail. A quick test suggests that the sequence

do.aromaticity()
do.typing()
do.isotopes()

will appropriately configure the molecule for all fingerprints. AFAICT, "over configuration" is not an issue. So applying the above sequence should not create an issue

But not that this doesn't change implicit H to explicit, yet the Pubchem FP seems to work OK (gives the same input for the same molecule input in different forms). So I guess the Pubchem FP is handling the conversion.

m1 <- parse.smiles("c1ccccc1C")[[1]] # m1 will have all atoms aromatic
m2 <- parse.smiles("[CH]1=[CH][CH]=[CH][CH]=C1C([H])([H])([H])")[[1]] # m2 will have no atoms aromatic

do.aromaticity(m1)
do.typing(m1)
do.isotopes(m1)
do.aromaticity(m2)
do.typing(m2)
do.isotopes(m2)

length(get.atoms(m1)) == 7 # No explicit H
length(get.atoms(m2)) == 10 # All explicit H

fp1 <- get.fingerprint(m1, type='pubchem')
fp2 <- get.fingerprint(m2, type='pubchem')
fp1@bits == fp2@bits

I agree with both of you that updating the docs would go far to clarifying usage and best practice. I had started on roxygenizing the docs, but now that the holidays are over, it's slowed down.

I would agree that if a fingerprinter requires config, it could do the configuration but at the cost of redoing something that has already been done (but the latest updates may avoid the performance hit by skipping config if already done). Probably a good idea to check on cdk-user

FWIW, the CDKDescUI tool explicitly performs all configuration when computing fingerprints.

@bachi55
Copy link
Contributor Author

bachi55 commented Jan 24, 2019

@rajarshi thank you for your comments!

I played around with your example an I think for PubChem fingerprints one still needs to apply convert.implicit.to.explicit to the molecule.

In your example I believe m2's SMILES is including explicit hydrogens whereas m1 does not. What I would expect is, that the PubChem bits 0 (>= 4 H) and 1 (>= 8 H) are on (TRUE), but they are not:

m1 <- parse.smiles("c1ccccc1C")[[1]] # m1 will have all atoms aromatic
m2 <- parse.smiles("[CH]1=[CH][CH]=[CH][CH]=C1C([H])([H])([H])")[[1]] # m2 will have no atoms aromatic

do.aromaticity(m1)
do.typing(m1)
do.isotopes(m1)

do.aromaticity(m2)
do.typing(m2)
do.isotopes(m2)

fp1 <- get.fingerprint(m1, type='pubchem')
fp2 <- get.fingerprint(m2, type='pubchem')

print(paste("m1: >= 4 H:", 1 %in% fp1@bits))  # PubChem bits are zero based but R not
print(paste("m2: >= 4 H:", 1 %in% fp2@bits))

print(paste("m1: >= 8 H:", 2 %in% fp1@bits)) 
print(paste("m2: >= 8 H:", 2 %in% fp2@bits))

Output:

[1] "m1: >= 4 H: FALSE"
[1] "m2: >= 4 H: FALSE"
[1] "m1: >= 8 H: FALSE"
[1] "m2: >= 8 H: FALSE"

If I add convert.implicit.to.explicit:

m1 <- parse.smiles("c1ccccc1C")[[1]] # m1 will have all atoms aromatic
m2 <- parse.smiles("[CH]1=[CH][CH]=[CH][CH]=C1C([H])([H])([H])")[[1]] # m2 will have no atoms aromatic

do.aromaticity(m1)
do.typing(m1)
do.isotopes(m1)
convert.implicit.to.explicit(m1)

do.aromaticity(m2)
do.typing(m2)
do.isotopes(m2)
convert.implicit.to.explicit(m2)

fp1 <- get.fingerprint(m1, type='pubchem')
fp2 <- get.fingerprint(m2, type='pubchem')

print(paste("m1: >= 4 H:", 1 %in% fp1@bits))  # PubChem bits are zero based but R not
print(paste("m2: >= 4 H:", 1 %in% fp2@bits))

print(paste("m1: >= 8 H:", 2 %in% fp1@bits)) 
print(paste("m2: >= 8 H:", 2 %in% fp2@bits))

We get the "desired" output:

[1] "m1: >= 4 H: TRUE"
[1] "m2: >= 4 H: TRUE"
[1] "m1: >= 8 H: TRUE"
[1] "m2: >= 8 H: TRUE"

At the moment I use the CDK unittests and the source code as information-source to decide about the correct configuration. However, as you said, if over-configuration is not an issue, it is more "save" to simply apply it to all molecules by default.

Best regards,

Eric

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants