to_pytorch: enable prefetching #664

Merged on Dec 6, 2024 (1 commit)
@skshetry (Member) commented Dec 5, 2024

#653 did not actually enable prefetching. Prefetch was implemented only for map(), so the example gave me the false impression that prefetching was working when it was not.

to_pytorch now uses AsyncMapper to prefetch the data. The number of prefetch workers defaults to 2 and can be changed through the `prefetch` setting.

For me, this cut the time to download the data by roughly 90%, from ~300s to ~35s.
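
To illustrate why a small number of prefetch workers can help so much when loading is latency-bound, here is a minimal standard-library sketch. It is not DataChain's AsyncMapper: `prefetch_iter` and `fetch` are hypothetical names, and a thread pool is swapped in for whatever AsyncMapper does internally. Only the default of 2 is taken from the description above.

```python
from collections import deque
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Deque, Iterable, Iterator, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def prefetch_iter(
    items: Iterable[T], fetch: Callable[[T], R], prefetch: int = 2
) -> Iterator[R]:
    """Yield fetch(item) in input order while up to `prefetch` fetches run ahead.

    Hypothetical helper for illustration only; not part of DataChain.
    """
    if prefetch < 1:
        # Prefetching disabled: fetch one item at a time, on demand.
        for item in items:
            yield fetch(item)
        return

    with ThreadPoolExecutor(max_workers=prefetch) as pool:
        pending: Deque[Future] = deque()
        for item in items:
            pending.append(pool.submit(fetch, item))
            # Once more than `prefetch` fetches are queued, hand the oldest
            # result to the consumer; the remaining ones keep running.
            if len(pending) > prefetch:
                yield pending.popleft().result()
        # Drain whatever is still in flight, preserving input order.
        while pending:
            yield pending.popleft().result()
```

While the consumer is busy with one item, the next `prefetch` items are already being fetched, so per-item download latency overlaps with consumption instead of adding to it.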

Closes #631.

@skshetry skshetry requested a review from a team December 5, 2024 07:54

Deploying datachain-documentation with Cloudflare Pages

Latest commit: 9bbecac
Status: ✅ Deploy successful!
Preview URL: https://3ad70991.datachain-documentation.pages.dev
Branch Preview URL: https://prefetch-pytorch.datachain-documentation.pages.dev

@@ -31,6 +32,8 @@ def label_to_int(value: str, classes: list) -> int:


class PytorchDataset(IterableDataset):
    prefetch: int = 2

@skshetry (Member Author) commented:

Keeping it the same as the default.
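
A minimal sketch of what a class-level default like this permits, assuming a hypothetical `SketchDataset` class, a hypothetical `DEFAULT_PREFETCH` constant, and a hypothetical constructor argument (none of these are the actual `PytorchDataset` signature):

```python
from typing import Optional

# Hypothetical name for the library-wide default; the value 2 is from the PR.
DEFAULT_PREFETCH = 2


class SketchDataset:
    # Class-level default, shared by every instance that does not override it.
    prefetch: int = DEFAULT_PREFETCH

    def __init__(self, prefetch: Optional[int] = None) -> None:
        # Only shadow the class attribute when the caller asks for a
        # different value; otherwise the class default stays in effect.
        if prefetch is not None:
            self.prefetch = prefetch


default_ds = SketchDataset()
tuned_ds = SketchDataset(prefetch=8)
```

With this pattern `default_ds.prefetch` stays in step with the shared default, while `tuned_ds.prefetch` carries its own value without affecting other instances.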


        total_rank, total_workers = self.get_rank_and_workers()
    def _rows_iter(self, total_rank: int, total_workers: int):
        catalog = self._get_catalog()
        session = Session("PyTorch", catalog=catalog)

@skshetry (Member Author) commented Dec 5, 2024:

We need to create a new session and a catalog here, since the AsyncMapper runs this on a separate thread.
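
The general pattern of creating thread-bound resources inside the code that runs on the worker thread can be sketched with the standard library. This is an analogy, not DataChain's Session or Catalog: it uses `sqlite3`, whose connections by default refuse to be used from a thread other than the one that created them. The function names and the `items` table are hypothetical.

```python
import queue
import sqlite3
import threading
from typing import Iterator

_DONE = object()  # sentinel marking the end of the stream


def rows_iter(db_path: str) -> Iterator[tuple]:
    # The connection is opened here, inside the generator body, so it is
    # created by whichever thread actually iterates the generator.
    conn = sqlite3.connect(db_path)
    try:
        yield from conn.execute("SELECT id, name FROM items ORDER BY id")
    finally:
        conn.close()


def iterate_in_background(db_path: str) -> Iterator[tuple]:
    out: queue.Queue = queue.Queue(maxsize=2)

    def worker() -> None:
        try:
            for row in rows_iter(db_path):
                out.put(row)
        finally:
            # Always signal completion. Note that this sketch does not
            # forward worker exceptions to the consumer.
            out.put(_DONE)

    threading.Thread(target=worker, daemon=True).start()
    while (item := out.get()) is not _DONE:
        yield item
```

Had the connection been opened in the caller and passed into `worker`, the first query on the worker thread would fail under the default `check_same_thread=True`, which is the same kind of constraint the comment above describes.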

@skshetry skshetry changed the title to_pytorch: enable prefetching to_pytorch: enable prefetching Dec 5, 2024
codecov bot commented Dec 5, 2024

Codecov Report

Attention: Patch coverage is 74.35897% with 10 lines in your changes missing coverage. Please review.

Project coverage is 87.32%. Comparing base (911c22f) to head (9bbecac).
Report is 1 commit behind head on main.

Files with missing lines        Patch %   Lines
src/datachain/lib/pytorch.py    74.35%    5 Missing and 5 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #664      +/-   ##
==========================================
- Coverage   87.32%   87.32%   -0.01%     
==========================================
  Files         113      113              
  Lines       10717    10727      +10     
  Branches     1469     1469              
==========================================
+ Hits         9359     9367       +8     
- Misses        985      986       +1     
- Partials      373      374       +1     
Flag        Coverage Δ
datachain   87.26% <74.35%> (-0.01%) ⬇️


@dreadatour (Contributor) left a comment:

Looks good to me! 👍

@skshetry skshetry merged commit 7c9d193 into main Dec 6, 2024
37 of 38 checks passed
@skshetry skshetry deleted the prefetch-pytorch branch December 6, 2024 02:25
Linked issue: to_pytorch prefetching