
Fix calculation of dataset size #3723

Merged
merged 2 commits into from
Oct 17, 2023

Conversation

Infernaught
Contributor

We previously calculated a dataset's size from the length of whatever was passed in as the "dataset" parameter. However, this parameter could be a filepath, in which case dataset.size would be set to the length of the filepath string rather than the number of rows, potentially causing index-out-of-bounds errors whenever the actual dataset was shorter than the filepath string.

This PR fixes this problem by calculating the size after the filepath has been processed into a dataset.
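The shape of the bug and the fix can be sketched as follows. This is a minimal illustration with hypothetical names (`load_dataset`, `build_dataset`), not Ludwig's actual code:

```python
import csv

def load_dataset(dataset):
    """Accept either an in-memory list of rows or a CSV filepath; return rows."""
    if isinstance(dataset, str):
        # `dataset` is a filepath here, so len(dataset) would be the length
        # of the path string, not the number of rows in the file.
        with open(dataset, newline="") as f:
            return list(csv.DictReader(f))
    return dataset

def build_dataset(dataset):
    # Before the fix: size = len(dataset)  -> wrong when `dataset` is a filepath
    rows = load_dataset(dataset)
    size = len(rows)  # after the fix: size comes from the processed dataset
    return rows, size
```

If the filepath string happens to be longer than the dataset, the stale size leads downstream code to index past the end of the data.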

@github-actions

github-actions bot commented Oct 12, 2023

Unit Test Results

6 files ±0    6 suites ±0    ⏱️ 20m 4s (−37m 28s)
12 tests −2 785    9 ✔️ −2 775    3 💤 −9    0 ❌ −1
60 runs −5 546    42 ✔️ −5 532    18 💤 −12    0 ❌ −2

Results for commit 797c003. ± Comparison against base commit 92d0e0c.

♻️ This comment has been updated with latest results.

@arnavgarg1
Contributor

Really nice work figuring this one out @Infernaught! If it doesn't balloon the scope, can we add some tests to this PR (or as a fast follow) to make sure we catch these kinds of errors in the future in case we make changes to the batcher? Some ideas: check that the number of rows returned by the batcher matches what's in the dataset when using a Pandas DataFrame, a Dask DataFrame, and a file path as the dataset.

Contributor

@arnavgarg1 arnavgarg1 left a comment

You may need to get rid of the print statement!

@justinxzhao
Contributor

(Quoting @arnavgarg1's comment above about adding batcher tests.)

+1. Great find. Another test we should add is one that verifies that one pass through the batcher up to last_batch() actually uses each individual example exactly once.
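The two suggested tests could be sketched roughly like this, using a toy `Batcher` class as a stand-in for the real batcher (hypothetical names and API, not Ludwig's actual batcher):

```python
class Batcher:
    """Toy stand-in for a dataset batcher (hypothetical, not Ludwig's API)."""

    def __init__(self, rows, batch_size):
        self.rows = rows
        self.batch_size = batch_size
        self.index = 0

    def last_batch(self):
        # True once every example has been handed out.
        return self.index >= len(self.rows)

    def next_batch(self):
        batch = self.rows[self.index:self.index + self.batch_size]
        self.index += len(batch)
        return batch

def collect_all_batches(batcher):
    """Drain the batcher up to last_batch() and return every example seen."""
    seen = []
    while not batcher.last_batch():
        seen.extend(batcher.next_batch())
    return seen
```

A test would then assert that `collect_all_batches` returns exactly as many rows as the dataset holds, and that each individual example appears exactly once, including when the dataset length is not a multiple of the batch size.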

@Infernaught Infernaught reopened this Oct 17, 2023
Contributor

@justinxzhao justinxzhao left a comment

Thanks for adding the test!

@justinxzhao justinxzhao merged commit c711d0a into master Oct 17, 2023
17 checks passed
@justinxzhao justinxzhao deleted the fix_dataset_size branch October 17, 2023 20:31