
Fix calculation of dataset size #3723

Merged
merged 2 commits into from
Oct 17, 2023

Conversation

Infernaught
Contributor

We previously calculated a dataset's size from the length of whatever was passed in as the "dataset" parameter. However, this parameter could be a filepath, in which case dataset.size would be set to the length of the filepath string rather than the number of rows, potentially causing index-out-of-bounds errors whenever the actual dataset was shorter than the filepath string.

This PR fixes this problem by calculating the size after the filepath has been processed into a dataset.
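The shape of the bug and the fix can be sketched as follows. This is a minimal illustration with hypothetical names (`load_dataset`, `build_dataset`), not Ludwig's actual code:

```python
import csv

def load_dataset(dataset):
    """Accept either an in-memory list of rows or a CSV filepath; return rows."""
    if isinstance(dataset, str):
        # `dataset` is a filepath here, so len(dataset) would be the length
        # of the path string, not the number of rows in the file.
        with open(dataset, newline="") as f:
            return list(csv.DictReader(f))
    return dataset

def build_dataset(dataset):
    # Before the fix: size = len(dataset)  -> wrong when `dataset` is a filepath
    rows = load_dataset(dataset)
    size = len(rows)  # after the fix: size comes from the processed dataset
    return rows, size
```

If the filepath string happens to be longer than the dataset, the stale size leads downstream code to index past the end of the data.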

@github-actions

github-actions bot commented Oct 12, 2023

Unit Test Results

6 files ±0    6 suites ±0    ⏱️ 20m 4s (−37m 28s)
12 tests −2 785    9 ✔️ −2 775    3 💤 −9    0 ❌ −1
60 runs −5 546    42 ✔️ −5 532    18 💤 −12    0 ❌ −2

Results for commit 797c003. ± Comparison against base commit 92d0e0c.

♻️ This comment has been updated with latest results.

@arnavgarg1
Contributor

Really nice work figuring this one out @Infernaught! If it doesn't balloon the scope, can we add some tests to this PR (or as a fast follow) to make sure we catch these kinds of errors in the future in case we make changes to the batcher? Some ideas: check that the number of rows returned by the batcher matches what's in the dataset when using a Pandas DataFrame, a Dask DataFrame, and a file path as the dataset.

Contributor

@arnavgarg1 arnavgarg1 left a comment

You may need to get rid of the print statement!

@justinxzhao
Contributor

(Quoting @arnavgarg1's comment above about adding batcher tests.)

+1. Great find. Another test we should add is one that verifies that one pass through the batcher up to last_batch() actually uses each individual example exactly once.
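The two suggested tests could be sketched roughly like this, using a toy `Batcher` class as a stand-in for the real batcher (hypothetical names and API, not Ludwig's actual batcher):

```python
class Batcher:
    """Toy stand-in for a dataset batcher (hypothetical, not Ludwig's API)."""

    def __init__(self, rows, batch_size):
        self.rows = rows
        self.batch_size = batch_size
        self.index = 0

    def last_batch(self):
        # True once every example has been handed out.
        return self.index >= len(self.rows)

    def next_batch(self):
        batch = self.rows[self.index:self.index + self.batch_size]
        self.index += len(batch)
        return batch

def collect_all_batches(batcher):
    """Drain the batcher up to last_batch() and return every example seen."""
    seen = []
    while not batcher.last_batch():
        seen.extend(batcher.next_batch())
    return seen
```

A test would then assert that `collect_all_batches` returns exactly as many rows as the dataset holds, and that each individual example appears exactly once, including when the dataset length is not a multiple of the batch size.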

@Infernaught Infernaught reopened this Oct 17, 2023
Contributor

@justinxzhao justinxzhao left a comment

Thanks for adding the test!

@justinxzhao justinxzhao merged commit c711d0a into master Oct 17, 2023
17 checks passed
@justinxzhao justinxzhao deleted the fix_dataset_size branch October 17, 2023 20:31