
Implement get_stats for Dataset and print it before training #251

Merged: 12 commits into main, Jun 13, 2024

Conversation

frostedoyster
Collaborator

@frostedoyster frostedoyster commented Jun 12, 2024

This implements a get_stats() for Dataset. Closes #205.

Contributor (creator of pull-request) checklist

  • Tests updated (for new features and bugfixes)?
  • Documentation updated (for new features)?
  • Issue referenced (for PRs that solve an issue)?

📚 Documentation preview 📚: https://metatrain--251.org.readthedocs.build/en/251/

@frostedoyster
Collaborator Author

frostedoyster commented Jun 12, 2024

TODO: gradients (edit: done)

@PicoCentauri the main issue here is that we often don't use our Dataset directly but rather torch Subsets, which don't have our __repr__. If we want to keep the __repr__ idea, we will have to make our own Subset that inherits from the torch one. Otherwise, we can extract the current __repr__ as a standalone function that takes a Dataset or a Subset.

@frostedoyster frostedoyster changed the title Implement a __repr__ for Dataset and print datasets before training Implement get_stats for Dataset and print it before training Jun 12, 2024
@frostedoyster
Collaborator Author

It's a bit cumbersome at the moment: having it as a method forces us to inherit from torch's Subset and also to modify their train_test_split function, which would otherwise return one of their Subsets rather than ours. These complications could be avoided if we made get_stats a standalone function.
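The standalone-function idea could look roughly like this (a minimal sketch, not the PR's actual code; the summary format and accumulation step are assumptions):

```python
def get_stats(dataset) -> str:
    """Return a human-readable summary of a dataset.

    Hypothetical sketch: as a standalone function this works on anything
    that supports len() and iteration, so it accepts both our Dataset and
    the plain torch Subsets returned by train_test_split, with no need to
    subclass torch's Subset.
    """
    n_samples = len(dataset)
    stats = f"Dataset containing {n_samples} structures"
    if n_samples == 0:
        return stats
    # ...accumulate per-target sums over `dataset` here to report
    # means and standard deviations for each target...
    return stats
```

Because the function only relies on `len()` and iteration, a torch Subset (or any sequence) can be passed directly, which is the main advantage over the method-based design.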

@frostedoyster frostedoyster marked this pull request as ready for review June 12, 2024 09:17
class Dataset:
"""A version of the `metatensor.learn.Dataset` class that allows for
the use of `mtm::` prefixes in the keys of the dictionary. See
https://github.com/lab-cosmo/metatensor/issues/621.
Contributor

Perhaps not for this PR, but I think this is merged now so should allow for these prefixes

Member

This is not yet released, but we should do a patch release with this

src/metatrain/utils/data/dataset.py (outdated)

def get_stats(self, dataset_info: DatasetInfo) -> str:
if hasattr(self, "_cached_stats"):
return self._cached_stats # type: ignore
Contributor

To please codecov you should call get_stats twice. But is it really necessary to cache this? It shouldn't take very long to compute, should it?

Collaborator Author

You're right, I over-optimized. It will be removed.

:param dict: A dictionary with the data to be stored in the dataset.
"""

def __init__(self, dict: Dict):
Contributor

should we keep a __repr__ that is also useful without the DatasetInfo?

Collaborator Author

I don't see the point if we don't use it...

if dataset_len == 0:
return stats

target_names = []
Contributor

I thought the target_names are in the DatasetInfo?

Collaborator Author
@frostedoyster frostedoyster Jun 12, 2024

Yes, but these are different: they also include the gradients. The variable name can be changed if you think that's a good idea, e.g. target_names_with_gradients.

Contributor

Name is fine but maybe add a comment that they are different.
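The distinction discussed here could be sketched as follows (all names are hypothetical illustrations, not the PR's code; in metatrain the gradient parameters would come from the targets' TensorMap blocks):

```python
def collect_target_names_with_gradients(sample_targets):
    """Build a list of target names that also includes gradient entries.

    Hypothetical sketch: `sample_targets` maps each target name to the
    list of its gradient parameter names. Unlike the plain target names
    in DatasetInfo, the result has one extra entry per gradient, so the
    statistics loop can accumulate sums for gradients too.
    """
    names = []
    for target, gradient_params in sample_targets.items():
        names.append(target)
        for parameter in gradient_params:
            # e.g. an "energy" target with a "positions" gradient
            # contributes an entry for the forces
            names.append(f"{target}_{parameter}_gradients")
    return names
```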

means = {key: sums[key] / n_elements[key] for key in target_names}

sum_of_squared_residuals = {key: 0.0 for key in target_names}
for sample in dataset:
Contributor

Why do you sum twice over the dataset?

Collaborator Author

Two iterations: one for the mean, one for the std (for the std you already need to know the mean)

Contributor

No, you don't. You save a sum and a sum of squares and compute the mean and the standard deviation afterwards.

Collaborator Author

You're right
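The single-pass approach suggested in the review could be sketched like this (a generic illustration, not the PR's actual code):

```python
import math


def mean_and_std(values):
    """Compute the mean and population standard deviation in one pass
    by accumulating the sum and the sum of squares.

    Note: this textbook one-pass formula can lose precision when the
    values are large relative to their spread; Welford's algorithm is
    the numerically stable alternative.
    """
    n = 0
    total = 0.0
    total_sq = 0.0
    for v in values:
        n += 1
        total += v
        total_sq += v * v
    mean = total / n
    # E[x^2] - E[x]^2; clamp tiny negative results caused by rounding
    variance = max(total_sq / n - mean * mean, 0.0)
    return mean, math.sqrt(variance)
```

This removes the second iteration over the dataset: the same accumulation loop that produces the sums for the means also produces the sums of squares needed for the standard deviations.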

@frostedoyster frostedoyster merged commit dfc44f9 into main Jun 13, 2024
10 of 11 checks passed
@frostedoyster frostedoyster deleted the dataset-repr branch June 13, 2024 06:35

Successfully merging this pull request may close these issues.

write the std deviation of the energy forces in the output log
4 participants