
prefetch: use a separate temporary cache for prefetching #730

Merged: 13 commits merged into main from prefetch-cache on Jan 7, 2025

Conversation

@skshetry (Member) commented Dec 23, 2024

When prefetch= is set but cache is not, this PR uses a separate temporary cache for prefetching that lives in a .datachain/tmp/prefetch-<random> directory.
The temporary directory is deleted automatically once prefetching is done.

With cache=True, the regular cache is reused for prefetching and is not deleted.
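
For context, a minimal sketch of the two modes (the bucket path and the UDF are hypothetical, and exact import paths may differ; the settings are passed via .settings()):

```python
from datachain import DataChain, File

def file_size(file: File) -> int:
    # hypothetical UDF; with prefetch set, the bytes are already local
    return len(file.read())

# prefetch only: files land in a throwaway .datachain/tmp/prefetch-<random>
# directory, which is removed once prefetching for this UDF finishes.
DataChain.from_storage("s3://my-bucket/images/").settings(prefetch=4).map(
    size=file_size
).save("sizes")

# cache=True: the regular cache is reused for prefetching and kept afterwards.
DataChain.from_storage("s3://my-bucket/images/").settings(
    cache=True, prefetch=4
).map(size=file_size).save("sizes_cached")
```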

Please note that auto-cleanup does not work for PyTorch datasets because there is no way to invoke cleanup from the Dataset side. The DataLoader may still have cached data or rows even after the Dataset instance has finished iterating. As a result, values associated with a catalog/cache instance can outlive the Dataset instance.

One potential solution is to implement a custom dataloader or provide a user-facing API.
In this PR, I have implemented the latter. The PytorchDataset now includes a close() method, which can be used to clean up the temporary prefetch cache.


Update: I ended up using weakref.finalize to clean up the temporary cache directory during garbage collection, but calling close() is still the recommended approach for PytorchDataset.
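
In sketch form, the cleanup scheme looks roughly like this (a simplified illustration, not the actual PytorchDataset code):

```python
import shutil
import tempfile
import weakref

class PrefetchingDataset:
    def __init__(self) -> None:
        self._tmp_dir = tempfile.mkdtemp(prefix="prefetch-")
        # Runs shutil.rmtree at garbage collection if close() was never
        # called; the callback must not hold a reference to `self`.
        self._finalizer = weakref.finalize(
            self, shutil.rmtree, self._tmp_dir, ignore_errors=True
        )

    def close(self) -> None:
        # Invoking the finalizer runs the cleanup at most once and
        # detaches it, so GC won't trigger it again later.
        self._finalizer()

ds = PrefetchingDataset()
try:
    pass  # iterate over the dataset here
finally:
    ds.close()  # deterministic cleanup; GC is only the fallback
```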

codecov bot commented Dec 24, 2024

Codecov Report

Attention: Patch coverage is 91.34615% with 18 lines in your changes missing coverage. Please review.

Project coverage is 87.43%. Comparing base (6862726) to head (d8f8f39).
Report is 2 commits behind head on main.

Files with missing lines          Patch %   Missing coverage
src/datachain/query/dataset.py    82.60%    7 missing, 1 partial ⚠️
src/datachain/lib/file.py         60.00%    2 missing, 2 partials ⚠️
src/datachain/progress.py         78.57%    3 missing ⚠️
src/datachain/lib/pytorch.py      94.44%    1 missing, 1 partial ⚠️
src/datachain/cache.py            95.00%    0 missing, 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #730      +/-   ##
==========================================
+ Coverage   87.33%   87.43%   +0.09%     
==========================================
  Files         128      128              
  Lines       11222    11329     +107     
  Branches     1522     1529       +7     
==========================================
+ Hits         9801     9905     +104     
- Misses       1045     1046       +1     
- Partials      376      378       +2     
Flag Coverage Δ
datachain 87.36% <91.34%> (+0.09%) ⬆️


@skshetry skshetry enabled auto-merge (squash) January 7, 2025 15:18
@skshetry skshetry merged commit da7d38f into main Jan 7, 2025
38 checks passed
@skshetry skshetry deleted the prefetch-cache branch January 7, 2025 15:46
@shcheklein (Member)

@skshetry are we creating a single prefetch cache for the whole session, or one per UDF? E.g. if I'm processing a very large dataset at once (it doesn't fit on my machine) - how will it work?

@skshetry (Member, Author) commented Jan 7, 2025

> E.g. if I'm processing a very large dataset at once (it doesn't fit on my machine) - how will it work?

We are creating a prefetch cache per UDF. For the whole session, the default cache can be repurposed by using cache=True, prefetch=N.

> E.g. if I'm processing a very large dataset at once (it doesn't fit on my machine) - how will it work?

Prefetching doesn't really help with this at the moment; we don't have cache pruning yet.

@shcheklein (Member)

> Prefetching doesn't really help with this at the moment; we don't have cache pruning yet.

Hmm, how is this implementation different from what we had before? Correct me if I'm wrong, but I think we wanted to decouple cache and prefetch specifically because we didn't want to keep a full copy of the data - or are there other use cases for this?

@skshetry (Member, Author) commented Jan 7, 2025

> > Prefetching doesn't really help with this at the moment; we don't have cache pruning yet.
>
> Hmm, how is this implementation different from what we had before? Correct me if I'm wrong, but I think we wanted to decouple cache and prefetch specifically because we didn't want to keep a full copy of the data - or are there other use cases for this?

Before this PR, prefetching only worked when cache=True was set, and it used the default cache directory, so it was not enabled by default.

This PR is a fix for #647: it enables prefetching by default and uses a temporary cache directory, i.e. it decouples the cache and prefetch settings.

@shcheklein (Member)

But essentially it is still using the cache underneath, right? Again, I'm thinking more in terms of the use case, not the API. My concern is that we are still caching pretty much the whole dataset.

@skshetry (Member, Author) commented Jan 7, 2025

> But essentially it is still using the cache underneath, right? Again, I'm thinking more in terms of the use case, not the API. My concern is that we are still caching pretty much the whole dataset.

Yes, it is still using the cache underneath, although I think of that as an implementation detail.

I understand your use case, and we've discussed this in relation to other training dataloader tools that have LRU cache pruning. IIRC, we did not decide on implementing it right away.

See #635 (comment), where keeping objects for the lifetime of the UDF was discussed. But #647 is more of a motivation for this PR.

Implementation-wise, cache pruning could be a next iteration of this: we can still use the cache for prefetching; we only need to figure out how to preserve last-used information for pruning (or whatever alternative strategy we end up using).
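
For illustration only, pruning along those lines might look like the following hypothetical sketch (nothing like this is in the PR), which evicts least-recently-used files once the cache exceeds a size budget:

```python
from pathlib import Path

def prune_lru(cache_dir: str, max_bytes: int) -> None:
    # Hypothetical LRU pruning: delete least-recently-used cache files
    # until the directory fits within max_bytes.
    files = [p for p in Path(cache_dir).rglob("*") if p.is_file()]
    total = sum(p.stat().st_size for p in files)
    # Oldest access time first; a real implementation would need to
    # preserve last-used information, e.g. by bumping atime on cache hits.
    for path in sorted(files, key=lambda p: p.stat().st_atime):
        if total <= max_bytes:
            break
        total -= path.stat().st_size
        path.unlink(missing_ok=True)
```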

@shcheklein (Member)

Yep, my only question here is: could we have used just the same cache (decouple the settings, but keep using the same cache location)? It would be simpler, right? And it would work the same in those scenarios, unless I'm missing something. We could have asked people, for now, to delete the .datachain directory after the run, or to put the cache on some ephemeral cluster storage ...

Let me rephrase this - it seems like a complicated solution (at least the PR makes it look like one), while we don't fully solve the issue. I wonder if we could have solved most of what this PR solves faster ...

@skshetry (Member, Author) commented Jan 7, 2025

> Yep, my only question here is: could we have used just the same cache (decouple the settings, but keep using the same cache location)? It would be simpler, right?

Yes, that was my first question too in #635 (comment).

> Let me rephrase this - it seems like a complicated solution (at least the PR makes it look like one), while we don't fully solve the issue. I wonder if we could have solved most of what this PR solves faster ...

Most of the changes in this PR are about making things correct and cleaning things up properly.
I know it looks complicated, but it's work we need regardless of whether we implement a separate prefetching cache, and I needed all of it to handle the lifecycle of the cache properly.
Also, I ran into lots of issues with asyncio where coroutines would not get cancelled and would just continue running in the background. :(
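
(The failure mode in generic sketch form - not the PR's actual code: task.cancel() only requests cancellation, so a task that is never awaited afterwards can keep running in the background.)

```python
import asyncio
import contextlib

async def prefetch_worker(queue: asyncio.Queue) -> None:
    while True:
        item = await queue.get()  # pretend this downloads `item`

async def main() -> None:
    task = asyncio.create_task(prefetch_worker(asyncio.Queue()))
    try:
        await asyncio.sleep(0)  # consume results here
    finally:
        task.cancel()  # only *requests* cancellation...
        with contextlib.suppress(asyncio.CancelledError):
            await task  # ...awaiting is what actually lets it finish

asyncio.run(main())
```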

But I agree that it took a lot of time.

@dmpetrov (Member) commented Jan 7, 2025

@skshetry it looks like most of the issues are related to PyTorch, where we have limited control over deleting files / closing the session.

How about regular UDFs? If a user runs a regular UDF with prefetch and without caching, what prevents us from caching and then deleting the data properly, right after a file has been processed?
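
In sketch form, the behavior being asked about might look like this hypothetical windowed mode (fetch, udf, and the windowing are illustrative, not current datachain code):

```python
import os
from collections import deque
from typing import Callable, Iterable, Iterator

def process_with_window(
    remote_paths: Iterable[str],
    fetch: Callable[[str], str],   # downloads to cache, returns local path
    udf: Callable[[str], object],
    prefetch: int = 4,
) -> Iterator[object]:
    # Keep at most `prefetch` files cached; delete each local copy
    # right after the UDF has processed it.
    window: deque[str] = deque()
    for remote in remote_paths:
        window.append(fetch(remote))
        if len(window) >= prefetch:
            local = window.popleft()
            yield udf(local)
            os.remove(local)  # free cache space immediately
    while window:
        local = window.popleft()
        yield udf(local)
        os.remove(local)
```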
