Predictive GFM 2025 #318

Open
wants to merge 33 commits into
base: Predictive_GFM_2025

allaffa
Collaborator

@allaffa allaffa commented Jan 12, 2025

Main changes:

  1. Consistency check across all large-scale datasets for the format of the torch_geometric.data.Data objects
              data_object = Data(
                   dataset_name="dataset_name",
                   natoms=natoms,
                   pos=pos,
                   cell=None,  # even if not needed, cell needs to be defined because ADIOS requires consistency across datasets
                   pbc=None,  # even if not needed, pbc needs to be defined because ADIOS requires consistency across datasets
                   edge_shifts=None,  # even if not needed, edge_shifts needs to be defined because ADIOS requires consistency across datasets
                   atomic_numbers=atomic_numbers,  # Reshaping atomic_numbers to Nx1 tensor
                   chemical_composition=chemical_composition,
                   smiles_string=smiles_string,
                   x=x,
                   energy=energy,
                   energy_per_atom=energy_per_atom,
                   force=forces,
               )
  2. Apply graphgps_transform to compute structural and positional Laplacian encodings.
  3. Allow a parsed input argument to set compute_grad_energy. The default value is False.
  4. The default value of energy_per_atom is set to False, because we do not need to normalize for machine learning force fields.
  5. Added chemical composition as a one-dimensional vector with 118 entries. Each entry counts the number of atoms of that chemical species in the atomistic structure.
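
The first and last items of the list above can be illustrated without PyTorch. The following is only a sketch: plain dictionaries stand in for the Data objects, the helper names `with_uniform_schema` and `chemical_composition` are hypothetical, and the layout of the 118-entry vector (index Z - 1 holds the count for atomic number Z) is an assumption, since the description does not state the indexing.

```python
# Sketch only: plain dicts stand in for torch_geometric Data objects.
NUM_ELEMENTS = 118  # one slot per chemical element

# Attribute names copied from the Data(...) call in the PR description.
REQUIRED_KEYS = (
    "dataset_name", "natoms", "pos", "cell", "pbc", "edge_shifts",
    "atomic_numbers", "chemical_composition", "smiles_string",
    "x", "energy", "energy_per_atom", "force",
)


def chemical_composition(atomic_numbers):
    """Count atoms per element.

    Assumed layout: index Z - 1 holds the count for atomic number Z.
    """
    counts = [0] * NUM_ELEMENTS
    for z in atomic_numbers:
        if not 1 <= z <= NUM_ELEMENTS:
            raise ValueError(f"atomic number out of range: {z}")
        counts[z - 1] += 1
    return counts


def with_uniform_schema(record):
    """Return a copy with every required key present.

    Attributes the dataset does not provide are set to None, so that
    all datasets expose the same set of fields.
    """
    unknown = set(record) - set(REQUIRED_KEYS)
    if unknown:
        raise KeyError(f"unexpected attributes: {sorted(unknown)}")
    return {key: record.get(key) for key in REQUIRED_KEYS}


# A water molecule: one oxygen (Z=8) and two hydrogens (Z=1).
water = with_uniform_schema({
    "dataset_name": "example",
    "natoms": 3,
    "atomic_numbers": [8, 1, 1],
    "chemical_composition": chemical_composition([8, 1, 1]),
})
```

Here `water["cell"]`, `water["pbc"]` and `water["edge_shifts"]` exist but are None, which mirrors how the PR defines unused attributes so that every dataset has the same fields.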

@allaffa allaffa added the enhancement New feature or request label Jan 12, 2025
@allaffa allaffa self-assigned this Jan 12, 2025
@allaffa allaffa changed the title from "Predictive gfm 2025" to "Predictive GFM 2025" Jan 12, 2025
@allaffa allaffa requested a review from RylieWeaver January 12, 2025 22:08
examples/qm7x/train.py (outdated review thread, resolved)
@allaffa allaffa requested a review from RylieWeaver January 13, 2025 14:05
@allaffa allaffa requested a review from RylieWeaver January 16, 2025 17:28
@allaffa
Collaborator Author

allaffa commented Jan 16, 2025

@RylieWeaver @ArCho48 @zachfox
I added smiles_string as an attribute to each Data object. This attribute is set to None for inorganic compounds, for which a SMILES representation does not make sense, and also for those organic molecules for which xyz2mol struggles to reconstruct the nature of the chemical bonds between atoms. Resolving those cases would require running quantum mechanical calculations, which is clearly impractical in this context.
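
A minimal sketch of that fallback, assuming a hypothetical `converter` callable that takes atomic numbers and positions and either returns a SMILES string or raises. The helper name, the `is_organic` flag, and the converter signature are all illustrative and are not the interface used in this PR.

```python
def smiles_or_none(atomic_numbers, positions, converter, is_organic=True):
    """Return a SMILES string, or None when no meaningful SMILES is available.

    `converter` is a hypothetical callable (atomic_numbers, positions) -> str
    standing in for whatever xyz-to-SMILES routine is actually used.
    """
    if not is_organic:
        # SMILES is not meaningful for inorganic compounds.
        return None
    try:
        smiles = converter(atomic_numbers, positions)
    except Exception:
        # Broad on purpose in this sketch; narrow it to the errors the
        # real converter raises when bond reconstruction fails.
        return None
    # Treat an empty result the same as a failed conversion.
    return smiles if smiles else None


def _always_water(atomic_numbers, positions):
    """Stand-in converter that always succeeds."""
    return "O"


def _always_fails(atomic_numbers, positions):
    """Stand-in converter that cannot reconstruct the bonds."""
    raise ValueError("could not assign bond orders")


coords = [(0.0, 0.0, 0.0), (0.96, 0.0, 0.0), (-0.24, 0.93, 0.0)]
ok = smiles_or_none([8, 1, 1], coords, _always_water)
failed = smiles_or_none([8, 1, 1], coords, _always_fails)
inorganic = smiles_or_none([11, 17], coords[:2], _always_water, is_organic=False)
```

Returning None in every failure path keeps the smiles_string attribute present on each Data object, consistent with the other placeholder attributes.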

@allaffa allaffa requested a review from pzhanggit January 16, 2025 17:40
@allaffa
Collaborator Author

allaffa commented Jan 16, 2025

@pzhanggit
This PR corresponds to the branch where we will try to perform our imbalanced, multi-source work.
Please take a look at the structure of the Data objects.

@allaffa allaffa requested a review from zachfox January 16, 2025 17:41
@allaffa
Collaborator Author

allaffa commented Jan 16, 2025

@zachfox
When we move ahead with the conditional DM, please take a look at the Data structures in this PR.

@pzhanggit
Collaborator

pzhanggit commented Jan 23, 2025

> @pzhanggit This PR corresponds to the branch where we will try to perform our imbalanced, multi-source work. Please take a look at the structure of the Data objects.

Thank you, Max @allaffa. Introducing a dataset_name ID looks good for our multi-source work. Let me know when you complete the dataset generation, and I'll start the multi-source model training.

About the changes in hydragnn/utils/descriptors_and_embeddings/smiles_utils.py, I suggest we move them to a standalone file.
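
As a rough illustration of why a dataset_name ID helps with imbalanced, multi-source data, the sketch below counts samples per source and derives inverse-frequency sample weights. Plain dictionaries stand in for the Data objects, the dataset names and helper names are made up, and inverse-frequency weighting is only one common option, not necessarily what will be used in this work.

```python
from collections import Counter


def samples_per_dataset(records):
    """Count how many samples come from each source dataset."""
    return Counter(record["dataset_name"] for record in records)


def inverse_frequency_weights(records):
    """Give each sample the weight 1 / (size of its source dataset).

    With these weights every source contributes the same total weight,
    regardless of how many samples it has.
    """
    counts = samples_per_dataset(records)
    return [1.0 / counts[record["dataset_name"]] for record in records]


# Made-up names: three samples from one source, one from another.
records = [
    {"dataset_name": "source_a"},
    {"dataset_name": "source_a"},
    {"dataset_name": "source_a"},
    {"dataset_name": "source_b"},
]
counts = samples_per_dataset(records)
weights = inverse_frequency_weights(records)
```

Such weights could, for example, be passed to a weighted sampler so that small sources are not drowned out by large ones.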

Collaborator

@pzhanggit pzhanggit left a comment

See my replies to Max's comment

@allaffa
Collaborator Author

allaffa commented Jan 23, 2025

> > @pzhanggit This PR corresponds to the branch where we will try to perform our imbalanced, multi-source work. Please take a look at the structure of the Data objects.
>
> Thank you, Max @allaffa. Introducing a dataset_name ID looks good for our multi-source work. Let me know when you complete the dataset generation, and I'll start the multi-source model training.
>
> About the changes in hydragnn/utils/descriptors_and_embeddings/smiles_utils.py, I suggest we move them to a standalone file.

@pzhanggit
Thanks, I moved the functionality into a separate xyz2mol.py script.

@allaffa allaffa requested a review from pzhanggit January 23, 2025 18:10
3 participants