
prefetch: use a separate temporary cache for prefetching #730

Merged: 13 commits merged into main from prefetch-cache on Jan 7, 2025

Conversation

@skshetry (Member) commented Dec 23, 2024

When prefetch= is set but cache is not, this PR uses a separate temporary cache for prefetching that lives in a .datachain/tmp/prefetch-<random> directory.
The temporary directory is deleted automatically once prefetching is done.

With cache=True, the regular cache is reused for prefetching and is not deleted.
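
For context, a minimal sketch of the two modes (the bucket path and the UDF are hypothetical, and exact import paths may differ; the settings are passed via .settings()):

```python
from datachain import DataChain, File

def file_size(file: File) -> int:
    # hypothetical UDF; with prefetch set, the bytes are already local
    return len(file.read())

# prefetch only: files land in a throwaway .datachain/tmp/prefetch-<random>
# directory, which is removed once prefetching for this UDF finishes.
DataChain.from_storage("s3://my-bucket/images/").settings(prefetch=4).map(
    size=file_size
).save("sizes")

# cache=True: the regular cache is reused for prefetching and kept afterwards.
DataChain.from_storage("s3://my-bucket/images/").settings(
    cache=True, prefetch=4
).map(size=file_size).save("sizes_cached")
```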

Please note that auto-cleanup does not work for PyTorch datasets because there is no way to invoke cleanup from the Dataset side. The DataLoader may still have cached data or rows even after the Dataset instance has finished iterating. As a result, values associated with a catalog/cache instance can outlive the Dataset instance.

One potential solution is to implement a custom dataloader or provide a user-facing API.
In this PR, I have implemented the latter. The PytorchDataset now includes a close() method, which can be used to clean up the temporary prefetch cache.


Update: I ended up using weakref.finalize to clean up the temporary cache directory during garbage collection, but calling close() is still the recommended approach for PytorchDataset.
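
In sketch form, the cleanup scheme looks roughly like this (a simplified illustration, not the actual PytorchDataset code):

```python
import shutil
import tempfile
import weakref

class PrefetchingDataset:
    def __init__(self) -> None:
        self._tmp_dir = tempfile.mkdtemp(prefix="prefetch-")
        # Runs shutil.rmtree at garbage collection if close() was never
        # called; the callback must not hold a reference to `self`.
        self._finalizer = weakref.finalize(
            self, shutil.rmtree, self._tmp_dir, ignore_errors=True
        )

    def close(self) -> None:
        # Invoking the finalizer runs the cleanup at most once and
        # detaches it, so GC won't trigger it again later.
        self._finalizer()

ds = PrefetchingDataset()
try:
    pass  # iterate over the dataset here
finally:
    ds.close()  # deterministic cleanup; GC is only the fallback
```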

codecov bot commented Dec 24, 2024

Codecov Report

Attention: Patch coverage is 91.34615% with 18 lines in your changes missing coverage. Please review.

Project coverage is 87.43%. Comparing base (6862726) to head (d8f8f39).
Report is 2 commits behind head on main.

Files with missing lines          Patch %   Missing coverage
src/datachain/query/dataset.py    82.60%    7 missing, 1 partial ⚠️
src/datachain/lib/file.py         60.00%    2 missing, 2 partials ⚠️
src/datachain/progress.py         78.57%    3 missing ⚠️
src/datachain/lib/pytorch.py      94.44%    1 missing, 1 partial ⚠️
src/datachain/cache.py            95.00%    0 missing, 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #730      +/-   ##
==========================================
+ Coverage   87.33%   87.43%   +0.09%     
==========================================
  Files         128      128              
  Lines       11222    11329     +107     
  Branches     1522     1529       +7     
==========================================
+ Hits         9801     9905     +104     
- Misses       1045     1046       +1     
- Partials      376      378       +2     
Flag Coverage Δ
datachain 87.36% <91.34%> (+0.09%) ⬆️


@skshetry skshetry enabled auto-merge (squash) January 7, 2025 15:18
@skshetry skshetry merged commit da7d38f into main Jan 7, 2025
38 checks passed
@skshetry skshetry deleted the prefetch-cache branch January 7, 2025 15:46
@shcheklein (Member)

@skshetry are we creating a single prefetch cache for the whole session, or one per UDF? E.g. if I'm processing a very large dataset at once (it doesn't fit on my machine) - how will it work?

@skshetry (Member, Author) commented Jan 7, 2025

> E.g. if I'm processing a very large dataset at once (it doesn't fit on my machine) - how will it work?

We are creating a prefetch cache per UDF. For the whole session, the default cache can be repurposed by using cache=True, prefetch=N.

> E.g. if I'm processing a very large dataset at once (it doesn't fit on my machine) - how will it work?

Prefetching doesn't really help with this at the moment; we don't have cache pruning yet.

@shcheklein (Member)

> Prefetching doesn't really help with this at the moment; we don't have cache pruning yet.

Hmm, how is this implementation different from what we had before? Correct me if I'm wrong, but I think we wanted to decouple cache and prefetch specifically because we didn't want to keep a full copy of the data - or are there other use cases for this?

@skshetry (Member, Author) commented Jan 7, 2025

> > Prefetching doesn't really help with this at the moment; we don't have cache pruning yet.
>
> Hmm, how is this implementation different from what we had before? Correct me if I'm wrong, but I think we wanted to decouple cache and prefetch specifically because we didn't want to keep a full copy of the data - or are there other use cases for this?

Before this PR, prefetching only worked when cache=True was set, and it used the default cache directory, so it was not enabled by default.

This PR is a fix for #647: it enables prefetching by default and uses a temporary cache directory, i.e. it decouples the cache and prefetch settings.

@shcheklein (Member)

But essentially it is still using the cache underneath, right? Again, I'm thinking more in terms of the use case, not the API. My concern is that we are still caching pretty much the whole dataset.

@skshetry (Member, Author) commented Jan 7, 2025

> But essentially it is still using the cache underneath, right? Again, I'm thinking more in terms of the use case, not the API. My concern is that we are still caching pretty much the whole dataset.

Yes, it is still using the cache underneath, although I think of that as an implementation detail.

I understand your use case, and we've discussed this in relation to other training dataloader tools that have LRU cache pruning. IIRC, we did not decide on implementing it right away.

See #635 (comment), where keeping objects for the lifetime of the UDF was discussed. But #647 is more of a motivation for this PR.

Implementation-wise, cache pruning could be a next iteration of this: we can still use the cache for prefetching; we only need to figure out how to preserve last-used information for pruning (or whatever alternative strategy we end up using).
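
For illustration only, pruning along those lines might look like the following hypothetical sketch (nothing like this is in the PR), which evicts least-recently-used files once the cache exceeds a size budget:

```python
from pathlib import Path

def prune_lru(cache_dir: str, max_bytes: int) -> None:
    # Hypothetical LRU pruning: delete least-recently-used cache files
    # until the directory fits within max_bytes.
    files = [p for p in Path(cache_dir).rglob("*") if p.is_file()]
    total = sum(p.stat().st_size for p in files)
    # Oldest access time first; a real implementation would need to
    # preserve last-used information, e.g. by bumping atime on cache hits.
    for path in sorted(files, key=lambda p: p.stat().st_atime):
        if total <= max_bytes:
            break
        total -= path.stat().st_size
        path.unlink(missing_ok=True)
```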

@shcheklein (Member)

Yep, my only question here is: could we have used just the same cache (decouple the settings, but keep using the same cache location)? It would be simpler, right? And it would work the same in those scenarios, unless I'm missing something. We could have asked people, for now, to delete the .datachain directory after the run, or to put the cache on some ephemeral cluster storage ...

Let me rephrase this - it seems like a complicated solution (at least the PR makes it look like one), while we don't fully solve the issue. I wonder if we could have solved most of what this PR solves faster ...

@skshetry (Member, Author) commented Jan 7, 2025

> Yep, my only question here is: could we have used just the same cache (decouple the settings, but keep using the same cache location)? It would be simpler, right?

Yes, that was my first question too in #635 (comment).

> Let me rephrase this - it seems like a complicated solution (at least the PR makes it look like one), while we don't fully solve the issue. I wonder if we could have solved most of what this PR solves faster ...

Most of the changes in this PR are about making things correct and cleaning things up properly.
I know it looks complicated, but it's work we need regardless of whether we implement a separate prefetching cache, and I needed all of it to handle the lifecycle of the cache properly.
Also, I ran into lots of issues with asyncio where coroutines would not get cancelled and would just continue running in the background. :(
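
(The failure mode in generic sketch form - not the PR's actual code: task.cancel() only requests cancellation, so a task that is never awaited afterwards can keep running in the background.)

```python
import asyncio
import contextlib

async def prefetch_worker(queue: asyncio.Queue) -> None:
    while True:
        item = await queue.get()  # pretend this downloads `item`

async def main() -> None:
    task = asyncio.create_task(prefetch_worker(asyncio.Queue()))
    try:
        await asyncio.sleep(0)  # consume results here
    finally:
        task.cancel()  # only *requests* cancellation...
        with contextlib.suppress(asyncio.CancelledError):
            await task  # ...awaiting is what actually lets it finish

asyncio.run(main())
```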

But I agree that it took a lot of time.

@dmpetrov (Member) commented Jan 7, 2025

@skshetry it looks like most of the issues are related to PyTorch, where we have limited control over deleting files / closing the session.

How about regular UDFs? If a user runs a regular UDF with prefetch and without caching, what prevents us from caching and then deleting the data properly, right after a file has been processed?
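
In sketch form, the behavior being asked about might look like this hypothetical windowed mode (fetch, udf, and the windowing are illustrative, not current datachain code):

```python
import os
from collections import deque
from typing import Callable, Iterable, Iterator

def process_with_window(
    remote_paths: Iterable[str],
    fetch: Callable[[str], str],   # downloads to cache, returns local path
    udf: Callable[[str], object],
    prefetch: int = 4,
) -> Iterator[object]:
    # Keep at most `prefetch` files cached; delete each local copy
    # right after the UDF has processed it.
    window: deque[str] = deque()
    for remote in remote_paths:
        window.append(fetch(remote))
        if len(window) >= prefetch:
            local = window.popleft()
            yield udf(local)
            os.remove(local)  # free cache space immediately
    while window:
        local = window.popleft()
        yield udf(local)
        os.remove(local)
```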
