Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mark atoms inside rings #12

Open
wants to merge 6 commits into
base: main
Choose a base branch
from
Open

mark atoms inside rings #12

wants to merge 6 commits into from

Conversation

ZontaNicola
Copy link
Contributor

adding functions to mark atoms inside rings so that we don't match molecules that have substitutions to those atoms.

note that I'm doing something wrong, because the query does not show up in the smiles, but the atoms are correctly identified (i.e. the atomic number is changed to 4...)

PLEASE DO NOT MODIFY template_smiles.h DIRECTLY

Add your templates to the templates.smi file in this repository. Add one CXSMILES per line, and please do not insert any blank lines.
Commit your changes and create the PR. The CI workflow will run on the new templates and validate them, and once they are approved, they will be automatically added to the header.

For more information please check https://github.com/rdkit/molecular_templates/blob/main/README.md

if aidx in inner_atoms:
# this atom cannot have substituents, so we substitute it with a query with a fixed degree
atom = mol.GetAtomWithIdx(aidx)
query_atom = Chem.AtomFromSmarts(f"[#4&D{atom.GetDegree()}]")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was there a reason for #4? I think dummy atoms/#0 would make sense but I remember you mentioning there being a reason for having specific atom types

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh that's because the &D2 part does not show up in CXSMILES (I need your help for this! It was working from C++), so I temporarily put #4 to make sure the atoms are correct... #0 works for me, #6 also works and maybe leads to slightly more readable SMILES. I saw in the code that there's a plan to use wildcards on the bonds too, that also makes sense to me

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

D2 is a SMARTS feature, so it's not going to show up in output SMILES.
You could try generating CXSMARTS by calling Chem.MolToCXSmarts()

@rachelnwalker
Copy link
Collaborator

rachelnwalker commented Nov 14, 2024

To get the github actions bot to show pictures of the new/updated SMILES, you have to run this script and rewrite templates.smi -- since this is a change to the script directly, could you show a couple before/after resulting SMILES?

@ZontaNicola
Copy link
Contributor Author

sure, as soon as I can get the wildcards to show up in the SMILES...

@rachelnwalker
Copy link
Collaborator

sure, as soon as I can get the wildcards to show up in the SMILES...

I think you could also create the 'gallery' using this script and attach it manually to the PR https://github.com/rdkit/molecular_templates/blob/main/src/update_gallery.py

I'm not sure how great the rdkit image generation will be though with the SMARTS

@ZontaNicola
Copy link
Contributor Author

ZontaNicola commented Nov 18, 2024

after talking with @rachelnwalker I reverted the changes to the header as @ricrogz suggested. Once this code is approved we can update the .smi file in a second review.

Here's how the last three smiles look with the new code

[#0&D2]1-[!#1]-[#0&D3]2-[#0&D3]3-[!#1]-[#0&D2]-[#0&D3]4-[#0&D3]-1-[#0&D3]1-[!#1]-[!#1]-[#0&D3]-2-[#0&D3]2-[#0&D2]-[!#1]-[!#1]-1-[!#1]1-[!#1]-[#0&D2]-[#0&D3]-2-[#0&D3]-3-[!#1]-[!#1]-[#0&D3]-4-1 |(1.72266,-1.8102,;1.72266,-5.53747,;0.674242,-4.23843,;-0.825758,-4.23843,;-1.94049,-5.53747,;-1.94049,-1.8102,;-0.825758,-0.511159,;0.674242,-0.511159,;1.42424,0.787879,;2.92424,0.787879,;2.92424,-2.93939,;1.42424,-2.93939,;0.674242,-1.64036,;1.72266,-0.341318,;1.72266,3.38596,;0.674242,2.08692,;-0.825758,2.08692,;-1.94049,3.38596,;-1.94049,-0.341318,;-0.825758,-1.64036,;-1.57576,-2.93939,;-3.07576,-2.93939,;-3.07576,0.787879,;-1.57576,0.787879,)|

[!#1]1-[!#1]2-[#0&D3]3-[!#1]-[!#1]4-[!#1]5-[!#1]-[#0&D3]-3-[!#1]-1-[#0&D3]-5-[#0&D3]-2-4 |(-1.68833,-1.6056,;-0.474804,-0.723925,;-1.13662,-0.251107,;0.360708,1.45266,;-0.93833,2.20266,;-2.43833,2.20266,;-3.73737,1.45266,;-2.19718,-0.251107,;-2.90185,-0.723925,;-2.43833,0.70266,;-0.93833,0.70266,)|

[!#1]1-[!#1]-[!#1]2-[!#1]-[!#1]-1-[#0&D2]-2 |(0.562,0.77,;0.562,-0.77,;-0.9027,-1.2459,;-1.8079,0,;-0.9027,1.2459,;-0.5912,0,)|

@ZontaNicola
Copy link
Contributor Author

@greglandrum @rachelnwalker the new template atoms are marked as [!#1], which is fine by my, but I think will match the atoms we mark as outside of a ring system here https://github.dev/rdkit/rdkit/blob/eaf544ab6f7df12dd0f738474e44f9792d97d971/Code/GraphMol/Depictor/EmbeddedFrag.cpp#L429-L433.

Maybe [!#200] would be better? What do you think is the best solution in terms of speed/template readability?

@rachelnwalker
Copy link
Collaborator

@greglandrum @rachelnwalker the new template atoms are marked as [!#1], which is fine by my, but I think will match the atoms we mark as outside of a ring system here https://github.dev/rdkit/rdkit/blob/eaf544ab6f7df12dd0f738474e44f9792d97d971/Code/GraphMol/Depictor/EmbeddedFrag.cpp#L429-L433.

Maybe [!#200] would be better? What do you think is the best solution in terms of speed/template readability?

Yeah I think [!#200] is the best solution. Could we even mark the degree-specific atom queries with !#200 instead of #0? I guess it probably wouldn't make that much of a difference...

@ZontaNicola
Copy link
Contributor Author

implemented [!#200] for every atom

Copy link
Collaborator

@rachelnwalker rachelnwalker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, but it would be helpful to see the gallery of template depictions if possible.

src/header_generation.py Outdated Show resolved Hide resolved
src/header_generation.py Outdated Show resolved Hide resolved
src/header_generation.py Show resolved Hide resolved
@ZontaNicola
Copy link
Contributor Author

Here's a few random examples of the new templates:
Screenshot 2024-12-03 at 06 52 01
Screenshot 2024-12-03 at 06 51 32
Screenshot 2024-12-03 at 06 51 11
Screenshot 2024-12-03 at 06 50 58
Screenshot 2024-12-03 at 06 50 08

@ZontaNicola
Copy link
Contributor Author

I've looked at a lot more and they all seem correct. The only thing is that centers with 3 substituents at 120° sometimes get marked as "internal" even if they could actually get a substitution: due to the symmetry of the system the average of the substituents falls exactly on the atom, so its direction is undetermined. I don't think that in general we would ever want a substitution there... maybe we can mark ALL centers with more than 2 bonds as [!#200&D3] so they never get substitutions. @rachelnwalker what do you think?

@rachelnwalker
Copy link
Collaborator

The third example you have has an atom with three substituents that we probably want to be able to attach to, so I don't think we should mark all of the atoms. I think its OK to not get a substitution with atoms with 3 equally spaced neighbors, though.

This lgtm though! In order to force the header update, I think we will need to make a second PR after this one (maybe add a template?) or otherwise trigger the GH action that generates the header and makes the PR

@ZontaNicola
Copy link
Contributor Author

sounds good, thanks

Copy link
Member

@greglandrum greglandrum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One requested change

I would also suggest running yapf or black or some kind of python code formatter across this in order to cleanup the formatting and oddly indented comments. That's not a huge deal, but I do think it helps with maintainability to have consistently formatted code

src/header_generation.py Outdated Show resolved Hide resolved
@greglandrum
Copy link
Member

@ZontaNicola just a quick ping in case you missed the notification of my review

@ZontaNicola
Copy link
Contributor Author

@ZontaNicola just a quick ping in case you missed the notification of my review

thanks, I'm aware of it. I should be able to get back to this case soon

@ZontaNicola
Copy link
Contributor Author

@greglandrum I ran yapf and removed the unused function

Copy link
Member

@greglandrum greglandrum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code changes look good to me.
@rachelnwalker or @ricrogz will need to confirm, but I think we should not be modifying the template_smiles.h directly in a PR
(I know you included the changes here in order to show what will happen, but I think that needs to be reverted?)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants