Fix calculation of dataset size #3723
Conversation
Really nice work figuring this one out @Infernaught! If it doesn't balloon the scope, can we add some tests to this PR (or even as a fast follow) to make sure we catch these kinds of errors in the future in case we make changes to the batcher? Some ideas: check that the number of rows returned by the batcher matches what's in the dataset when using a pandas DataFrame, a Dask DataFrame, and a file path as the dataset.
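A test along those lines might look like the sketch below. Here `load_as_dataframe` and `iterate_batches` are hypothetical stand-ins for the project's actual loading and batching code; a real test would exercise the batcher under test instead of these helpers.

```python
import pandas as pd
import dask.dataframe as dd
import pytest


def load_as_dataframe(dataset):
    """Hypothetical loader: accepts a pandas DataFrame, a Dask DataFrame,
    or a CSV file path and returns a pandas DataFrame."""
    if isinstance(dataset, str):
        return pd.read_csv(dataset)
    if isinstance(dataset, dd.DataFrame):
        return dataset.compute()
    return dataset


def iterate_batches(dataset, batch_size=3):
    """Hypothetical batcher: the size is measured *after* loading,
    which is the behavior this PR establishes."""
    df = load_as_dataframe(dataset)
    size = len(df)  # not len(dataset), which could be a path string
    for start in range(0, size, batch_size):
        yield df.iloc[start:start + batch_size]


@pytest.fixture
def df():
    return pd.DataFrame({"x": range(10), "y": range(10)})


def test_row_count_pandas(df):
    assert sum(len(b) for b in iterate_batches(df)) == len(df)


def test_row_count_dask(df):
    ddf = dd.from_pandas(df, npartitions=2)
    assert sum(len(b) for b in iterate_batches(ddf)) == len(df)


def test_row_count_file_path(df, tmp_path):
    path = str(tmp_path / "data.csv")
    df.to_csv(path, index=False)
    assert sum(len(b) for b in iterate_batches(path)) == len(df)
```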
You may need to get rid of the print statement!
+1. Great find. Another test we should add is one that verifies that one pass through the batcher up to …
Force-pushed from d336e00 to cb1f7d9
Thanks for adding the test!
We previously calculated a dataset's size from the length of whatever was passed in as the `dataset` parameter. However, this parameter could be a file path, meaning that `dataset.size` would be set to the length of the path string rather than the number of rows, potentially causing index out of bounds errors whenever the actual dataset had fewer rows than the path had characters.
This PR fixes the problem by calculating the size after the file path has been loaded into a dataset.
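In pseudocode, the change amounts to the following (the function and the loading logic are illustrative, not the actual names from the diff):

```python
import pandas as pd


def build_dataset(dataset):
    # Before: size taken from the raw argument. If `dataset` is a path
    # like "data/train.csv", len(dataset) is the length of the *string*
    # (14 here), not the number of rows.
    # size = len(dataset)

    # After: resolve the argument into an actual dataframe first, then
    # measure its length.
    df = pd.read_csv(dataset) if isinstance(dataset, str) else dataset
    size = len(df)
    return df, size
```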