
Implement get_stats for Dataset and print it before training #251

Merged: 12 commits into main, Jun 13, 2024

Conversation

frostedoyster
Collaborator

@frostedoyster frostedoyster commented Jun 12, 2024

This implements a get_stats() for Dataset. Closes #205.

Contributor (creator of pull-request) checklist

  • Tests updated (for new features and bugfixes)?
  • Documentation updated (for new features)?
  • Issue referenced (for PRs that solve an issue)?

📚 Documentation preview 📚: https://metatrain--251.org.readthedocs.build/en/251/

@frostedoyster
Collaborator Author

frostedoyster commented Jun 12, 2024

TODO: gradients (edit: done)

@PicoCentauri the main issue here is that we often don't use our Dataset directly but rather torch Subsets, which don't have our __repr__. If we want to keep the __repr__ idea, we will have to make our own Subset that inherits from the torch one. Otherwise, we can extract the current __repr__ as a standalone function that takes a Dataset or a Subset.

@frostedoyster frostedoyster changed the title Implement a __repr__ for Dataset and print datasets before training Implement get_stats for Dataset and print it before training Jun 12, 2024
@frostedoyster
Collaborator Author

It's a bit cumbersome at the moment: having it as a method forces us to inherit from torch's Subset and also to modify their train_test_split function, which would otherwise return one of their Subsets rather than ours. These complications could be avoided if we made get_stats a standalone function.
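The standalone-function idea could look roughly like this (a minimal sketch, not the PR's actual code; the summary format and accumulation step are assumptions):

```python
def get_stats(dataset) -> str:
    """Return a human-readable summary of a dataset.

    Hypothetical sketch: as a standalone function this works on anything
    that supports len() and iteration, so it accepts both our Dataset and
    the plain torch Subsets returned by train_test_split, with no need to
    subclass torch's Subset.
    """
    n_samples = len(dataset)
    stats = f"Dataset containing {n_samples} structures"
    if n_samples == 0:
        return stats
    # ...accumulate per-target sums over `dataset` here to report
    # means and standard deviations for each target...
    return stats
```

Because the function only relies on `len()` and iteration, a torch Subset (or any sequence) can be passed directly, which is the main advantage over the method-based design.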

@frostedoyster frostedoyster marked this pull request as ready for review June 12, 2024 09:17
class Dataset:
"""A version of the `metatensor.learn.Dataset` class that allows for
the use of `mtm::` prefixes in the keys of the dictionary. See
https://github.com/lab-cosmo/metatensor/issues/621.
Contributor

Perhaps not for this PR, but I think this is merged now so should allow for these prefixes

Member

This is not yet released, but we should do a patch release with this

src/metatrain/utils/data/dataset.py (outdated)

def get_stats(self, dataset_info: DatasetInfo) -> str:
if hasattr(self, "_cached_stats"):
return self._cached_stats # type: ignore
Contributor

To please codecov you should call get_stats twice. But is it really necessary to cache this? It shouldn't take very long to compute, should it?

Collaborator Author

You're right, I over-optimized. It will be removed.

:param dict: A dictionary with the data to be stored in the dataset.
"""

def __init__(self, dict: Dict):
Contributor

should we keep a __repr__ that is also useful without the DatasetInfo?

Collaborator Author

I don't see the point if we don't use it...

if dataset_len == 0:
return stats

target_names = []
Contributor

I thought the target_names are in the DatasetInfo?

Collaborator Author
@frostedoyster frostedoyster Jun 12, 2024

Yes, but these are different: they also include the gradients. The variable name can be changed if you think that's a good idea, e.g. target_names_with_gradients.

Contributor

Name is fine but maybe add a comment that they are different.
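The distinction discussed here could be sketched as follows (all names are hypothetical illustrations, not the PR's code; in metatrain the gradient parameters would come from the targets' TensorMap blocks):

```python
def collect_target_names_with_gradients(sample_targets):
    """Build a list of target names that also includes gradient entries.

    Hypothetical sketch: `sample_targets` maps each target name to the
    list of its gradient parameter names. Unlike the plain target names
    in DatasetInfo, the result has one extra entry per gradient, so the
    statistics loop can accumulate sums for gradients too.
    """
    names = []
    for target, gradient_params in sample_targets.items():
        names.append(target)
        for parameter in gradient_params:
            # e.g. an "energy" target with a "positions" gradient
            # contributes an entry for the forces
            names.append(f"{target}_{parameter}_gradients")
    return names
```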

means = {key: sums[key] / n_elements[key] for key in target_names}

sum_of_squared_residuals = {key: 0.0 for key in target_names}
for sample in dataset:
Contributor

Why do you sum twice over the dataset?

Collaborator Author

Two iterations: one for the mean, one for the std (for the std you already need to know the mean)

Contributor

No, you don't. You save a sum and a sum of squares and compute the mean and the standard deviation afterwards.

Collaborator Author

You're right
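The single-pass approach suggested in the review could be sketched like this (a generic illustration, not the PR's actual code):

```python
import math


def mean_and_std(values):
    """Compute the mean and population standard deviation in one pass
    by accumulating the sum and the sum of squares.

    Note: this textbook one-pass formula can lose precision when the
    values are large relative to their spread; Welford's algorithm is
    the numerically stable alternative.
    """
    n = 0
    total = 0.0
    total_sq = 0.0
    for v in values:
        n += 1
        total += v
        total_sq += v * v
    mean = total / n
    # E[x^2] - E[x]^2; clamp tiny negative results caused by rounding
    variance = max(total_sq / n - mean * mean, 0.0)
    return mean, math.sqrt(variance)
```

This removes the second iteration over the dataset: the same accumulation loop that produces the sums for the means also produces the sums of squares needed for the standard deviations.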

@frostedoyster frostedoyster merged commit dfc44f9 into main Jun 13, 2024
10 of 11 checks passed
@frostedoyster frostedoyster deleted the dataset-repr branch June 13, 2024 06:35

Successfully merging this pull request may close these issues.

write the std deviation of the energy forces in the output log
4 participants