`to_pytorch`: enable prefetching #664

skshetry · 2024-12-05T07:54:08Z

#653 did not actually enable prefetching. Prefetch was only implemented for `map()`, so the example gave me the false impression that prefetching was working when it was not.

Now, `to_pytorch` uses `AsyncMapper` to prefetch the data. The number of workers defaults to 2 and can be changed by setting `prefetch` in the settings.

For me, this cut the time to download the data by roughly 90%, from ~300s to ~35s.

Closes #631.
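
For readers unfamiliar with the idea, here is a minimal sketch of prefetching with a bounded look-ahead. It is not datachain's `AsyncMapper`; it only uses the standard library, and `prefetch_map` and `slow_fetch` are hypothetical names made up for this illustration. The point is that fetches for upcoming items start while the consumer is still busy with the current one.

```python
import time
from collections import deque
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def prefetch_map(
    fetch: Callable[[T], R], items: Iterable[T], prefetch: int = 2
) -> Iterator[R]:
    """Yield fetch(item) in input order while later fetches run ahead.

    Hypothetical helper, not part of datachain. Roughly `prefetch`
    fetches stay in flight while the consumer handles the current result.
    """
    if prefetch < 1:
        raise ValueError("prefetch must be at least 1")
    with ThreadPoolExecutor(max_workers=prefetch) as pool:
        pending = deque()
        for item in items:
            pending.append(pool.submit(fetch, item))
            # Hand over the oldest result once the look-ahead window is full.
            if len(pending) > prefetch:
                yield pending.popleft().result()
        # Drain whatever is still in flight after the input is exhausted.
        while pending:
            yield pending.popleft().result()


def slow_fetch(n: int) -> int:
    """Stand-in for a download; the sleep simulates network latency."""
    time.sleep(0.01)
    return n * 2
```

Usage would look like `for value in prefetch_map(slow_fetch, range(100), prefetch=2): ...`. Results come back in input order, so the consumer does not need to change; only the waiting overlaps.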

cloudflare-workers-and-pages · 2024-12-05T07:55:11Z

Deploying datachain-documentation with Cloudflare Pages

Latest commit:	`9bbecac`
Status:	✅ Deploy successful!
Preview URL:	https://3ad70991.datachain-documentation.pages.dev
Branch Preview URL:	https://prefetch-pytorch.datachain-documentation.pages.dev

skshetry · 2024-12-05T07:55:52Z

src/datachain/lib/pytorch.py

@@ -31,6 +32,8 @@ def label_to_int(value: str, classes: list) -> int:


 class PytorchDataset(IterableDataset):
+    prefetch: int = 2


Keeping it the same as the default:

datachain/src/datachain/lib/udf.py

Line 293 in 911c22f

prefetch: int = 2

skshetry · 2024-12-05T07:56:23Z

src/datachain/lib/pytorch.py

-        total_rank, total_workers = self.get_rank_and_workers()
+    def _rows_iter(self, total_rank: int, total_workers: int):
+        catalog = self._get_catalog()
+        session = Session("PyTorch", catalog=catalog)


We need to create a new session and a catalog here, since the AsyncMapper runs this on a separate thread.
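
To illustrate why per-thread setup can be needed, here is a small sketch using SQLite from the standard library. This is an assumed stand-in: the comment above does not say which resource inside the session or catalog is thread-bound, and `run_in_thread`, `use_shared`, and `use_own` are hypothetical names. By default, a `sqlite3` connection refuses to be used from a thread other than the one that created it, so the worker has to open its own.

```python
import sqlite3
import threading


def run_in_thread(fn):
    """Run fn() on a new thread and return (result, exception)."""
    box = {}

    def target():
        try:
            box["result"] = fn()
        except Exception as exc:  # captured so the caller can inspect it
            box["error"] = exc

    thread = threading.Thread(target=target)
    thread.start()
    thread.join()
    return box.get("result"), box.get("error")


# Connection created on the main thread (check_same_thread defaults to True).
shared = sqlite3.connect(":memory:")


def use_shared():
    # Reuses the main thread's connection from the worker thread.
    return shared.execute("SELECT 1").fetchone()


def use_own():
    # Opens a fresh connection inside the worker thread instead.
    conn = sqlite3.connect(":memory:")
    try:
        return conn.execute("SELECT 1").fetchone()
    finally:
        conn.close()


shared_result, shared_error = run_in_thread(use_shared)
own_result, own_error = run_in_thread(use_own)
```

Here `shared_error` is a `sqlite3.ProgrammingError`, while `use_own` succeeds. Creating the resource inside the function that runs on the worker thread follows the same pattern as creating the session and catalog inside `_rows_iter`.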

codecov · 2024-12-05T08:07:23Z

Codecov Report

Attention: Patch coverage is 74.35897% with 10 lines in your changes missing coverage. Please review.

Project coverage is 87.32%. Comparing base (911c22f) to head (9bbecac).
Report is 1 commits behind head on main.

Files with missing lines	Patch %	Lines
src/datachain/lib/pytorch.py	74.35%	5 Missing and 5 partials ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #664      +/-   ##
==========================================
- Coverage   87.32%   87.32%   -0.01%     
==========================================
  Files         113      113              
  Lines       10717    10727      +10     
  Branches     1469     1469              
==========================================
+ Hits         9359     9367       +8     
- Misses        985      986       +1     
- Partials      373      374       +1

Flag	Coverage Δ
datachain	`87.26% <74.35%> (-0.01%)`	⬇️


dreadatour

Looks good to me! 👍

skshetry requested a review from a team December 5, 2024 07:54

skshetry commented Dec 5, 2024

skshetry mentioned this pull request Dec 5, 2024

prefetch examples to showcase performance gain #635

Open

skshetry changed the title ~~to_pytorch: enable prefetching~~ to_pytorch: enable prefetching Dec 5, 2024

dreadatour approved these changes Dec 5, 2024

skshetry merged commit 7c9d193 into main Dec 6, 2024
37 of 38 checks passed

skshetry deleted the prefetch-pytorch branch December 6, 2024 02:25
