
Fix combined dataloader bug #163

Merged · 10 commits · Apr 4, 2024
Conversation

@frostedoyster (Collaborator) commented Apr 1, 2024:

Before this change, the combined dataloader returned repeated samples within the same epoch.


📚 Documentation preview 📚: https://metatensor-models--163.org.readthedocs.build/en/163/

@PicoCentauri (Contributor) left a comment:

The new CombinedDataLoader class, which merges the old CombinedIterableDataset class and the combine_dataloaders function, seems useful.

I have some questions regarding its logic. Explanations and unit tests are missing, but we can add those once it works as we want.

Comment on lines 10 to 18
Combines multiple dataloaders into a single iterable dataset.
This is useful for combining multiple dataloaders into a single
dataloader. The new dataloader can be shuffled or not.

:param dataloaders: list of dataloaders to combine
:param shuffle: whether to shuffle the combined dataloader

:return: combined dataloader
"""
Contributor:

I think the docstring was useful.

Collaborator (author):

My bad

Comment on lines 37 to 38
def __len__(self):
    return sum(len(dl) for dl in self.dataloaders)
Contributor:

Doesn't this change the behavior from before? Isn't it weird that asking for the len of a combined dataloader gives you the summed length of all the elements? Usually the length of a list of arrays is the number of arrays, not the sum of the lengths of the arrays...

Collaborator (author):

Yes, it's just a matter of definitions. In this case, I think it should return the number of batches that the dataloader returns in one epoch, to match the behavior of torch dataloaders. The number of batches it returns is the sum of the numbers of batches in the individual dataloaders.
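To make the proposed convention concrete, here is a small illustration, with plain lists of lists standing in for the individual dataloaders (a real `torch.utils.data.DataLoader` likewise reports its number of batches via `len()`):

```python
# Hypothetical stand-ins: each inner list is one batch.
loader_a = [[0, 1], [2, 3]]          # 2 batches
loader_b = [[4, 5], [6, 7], [8, 9]]  # 3 batches

dataloaders = [loader_a, loader_b]

# Proposed semantics: len() of the combined loader is the number of
# batches per epoch, i.e. the sum of the parts' batch counts.
num_batches = sum(len(dl) for dl in dataloaders)
print(num_batches)  # 5 (batches per epoch), not 2 (dataloaders)
```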

Contributor:

I am not sure. But then we have to at least document this.

Collaborator (author):

Added the comment

Member:

I agree that a sum is the behavior we want (the length of a torch dataset is always the total number of elements inside it that you can index) if we decide to hide the fact that this is a combined dataset, i.e. if the user should use it as

for batch in dataset:
    # do stuff

However, my understanding was that the batches would not mix samples from different datasets, is that right? If so, what happens when combining a dataset of size 7 with one of size 11 and trying to use a batch size of 10?

Collaborator (author):

This is a combined dataloader, not a dataset. It takes many dataloaders and returns their individual batches (shuffled or not, depending on what the caller wants).
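A minimal sketch of that idea, assuming simplified single-epoch logic and hypothetical names (the actual implementation is in the PR's diff, not reproduced here):

```python
import random


class CombinedDataLoader:
    """Sketch: yield every batch of every wrapped dataloader exactly once
    per epoch, in shuffled order if requested."""

    def __init__(self, dataloaders, shuffle):
        self.dataloaders = dataloaders
        # One entry per batch, so no batch is repeated within an epoch.
        self.batches = [batch for dl in dataloaders for batch in dl]
        if shuffle:
            random.shuffle(self.batches)
        self.index = 0

    def __iter__(self):
        return self

    def __next__(self):
        if self.index >= len(self.batches):
            raise StopIteration
        batch = self.batches[self.index]
        self.index += 1
        return batch

    def __len__(self):
        # Number of batches produced in one epoch (see discussion above),
        # matching the convention of torch dataloaders.
        return sum(len(dl) for dl in self.dataloaders)
```

Note that each batch stays intact: shuffling only permutes the order in which whole batches are returned, so batches never mix samples from different datasets.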

Member:

Right, and then the length of the combined dataloader is how many batches it will produce. Makes sense.


def __iter__(self):
    return self
Contributor:

Is this working? Shouldn't you return the next dataloader here instead of the full instance?

@frostedoyster (Collaborator, author) commented Apr 2, 2024:

It does work. It comes from ChatGPT, but intuitively it makes sense: by iterating, you effectively call iterable = iter(dataloader) and then next(iterable) (a bunch of times). Since __next__ is defined on the class itself, it makes sense for __iter__ to return self.
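This is the standard Python iterator protocol: when `__next__` is defined on the class itself, the object can act as its own iterator. A hypothetical `Countdown` example to illustrate:

```python
class Countdown:
    """An object that is its own iterator."""

    def __init__(self, start):
        self.current = start

    def __iter__(self):
        # A for-loop calls iter() first; since __next__ lives on this
        # same class, the object can serve as its own iterator.
        return self

    def __next__(self):
        if self.current <= 0:
            raise StopIteration
        value = self.current
        self.current -= 1
        return value


print(list(Countdown(3)))  # [3, 2, 1]
```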

Contributor:

Okay, then please add a test for the iterator.

Collaborator (author):

It's already there, carried over from the old function + class.

@frostedoyster (Collaborator, author):

The tests are the same as those that were there before, but I've added an additional case that catches the bug. I'll add the docstring.

@PicoCentauri (Contributor):

Are you able to fix the tests? Once that is done, I'll take another look at the code.

@frostedoyster (Collaborator, author):

Yes, at some point. It's just the regression tests, which always break (in this case due to the different batches being created by the new dataloader).

@PicoCentauri (Contributor) left a comment:

Good, just two minor doc comments.

I still don't understand why the regression tests are this fragile...

Comment on lines 16 to 24
"""Creates the combined dataloader.

:param dataloaders: list of dataloaders to combine
:param shuffle: whether to shuffle the combined dataloader (this does not
act on the individual batches, but it shuffles the order in which
they are returned)

:return: the combined dataloader
"""
Contributor:

There should be no docstring on the init. Everything should go in the class docstring.

Comment on lines 55 to 57
# this returns the total number of batches in all dataloaders
# (as opposed to the total number of samples or the number of
# individual dataloaders)
Contributor:

Maybe this should mostly be visible to users -> move it to the docstring.

When I am reading the code, I don't need this information, because the lines below already tell me this.

@frostedoyster merged commit 6f02805 into main on Apr 4, 2024 (11 checks passed).
@frostedoyster deleted the dataloader-bug branch on April 4, 2024 at 16:00.
3 participants