prefetch: use a separate temporary cache for prefetching #730
Conversation
Codecov Report — Attention: patch coverage is …

@@            Coverage Diff             @@
##             main     #730      +/-   ##
==========================================
+ Coverage   87.33%   87.43%   +0.09%
==========================================
  Files         128      128
  Lines       11222    11329     +107
  Branches     1522     1529       +7
==========================================
+ Hits         9801     9905     +104
- Misses       1045     1046       +1
- Partials      376      378       +2

Flags with carried-forward coverage won't be shown.
@skshetry are we creating a single prefetch cache for the whole session or per UDF? E.g. if I'm processing a very large dataset at once (one that doesn't fit on my machine), how will it work?
We are creating the prefetch cache per UDF. For the whole session, the default cache can be repurposed by using `cache=True`.
Prefetching doesn't really help with this at the moment. We don't have cache pruning yet.
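To make the per-UDF behavior concrete, here is a minimal sketch assuming the `prefetch=`/`cache=` settings discussed in this PR; the bucket URI and the `embed` UDF are placeholders, not part of the change:

```python
# Hedged sketch of per-UDF prefetching. The URI and UDF body are
# placeholders; `prefetch=` and `cache=` are the settings under discussion.
from datachain import DataChain, File

def embed(file: File) -> int:
    # Placeholder UDF; prefetching warms the temporary cache
    # before each row reaches this function.
    return len(file.read())

chain = (
    DataChain.from_storage("s3://my-bucket/images/")  # placeholder URI
    .settings(prefetch=10)  # per-UDF temporary prefetch cache
    .map(size=embed)
)

# With cache=True instead, the default (persistent) cache is reused:
# .settings(cache=True, prefetch=10)
```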
hmm, how is this implementation different from what we had before? Correct me if I'm wrong, but I think we wanted to decouple cache and prefetch specifically because we didn't want to keep a full copy of the data. Or are there any other use cases for this?
Before this PR, prefetching only worked when `cache=True` was set. This PR is a fix for #647, which enables prefetching by default and uses a temporary cache directory, i.e. it decouples the cache and prefetch settings.
But essentially it is still using the cache underneath, right? Again, I'm thinking more in terms of the use case, not the API. My concern is that we are still caching pretty much the whole dataset.
Yes, it is still using the cache underneath, although I think of that as an implementation detail. I understand your use case, and we've discussed this in relation to other training dataloader tools that have LRU cache pruning. IIRC, we did not decide on implementing it right away. See #635 (comment), where it was discussed to keep objects around for the lifetime of the UDF. But #647 is more of a motivation for this PR. Implementation-wise, cache pruning could be considered as a next iteration of this: we can still use the cache for prefetching; we only need to figure out how to preserve …
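For illustration only, LRU-style pruning along the lines discussed (not part of this PR) could look roughly like the sketch below; the `prune_lru` helper, its signature, and the size budget are all hypothetical:

```python
# Hypothetical LRU pruning sketch (not in this PR): evict the least
# recently accessed files once the prefetch cache exceeds a byte budget.
from pathlib import Path

def prune_lru(cache_dir: str, max_bytes: int) -> None:
    # Oldest-accessed files first.
    files = sorted(
        (p for p in Path(cache_dir).rglob("*") if p.is_file()),
        key=lambda p: p.stat().st_atime,
    )
    total = sum(p.stat().st_size for p in files)
    for p in files:
        if total <= max_bytes:
            break
        total -= p.stat().st_size
        p.unlink()
```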
yep, my only question here is: could we have just used the same cache (decouple the settings, but keep using the same cache location)? It would be simpler, right? And it would work the same in those scenarios, unless I'm missing something. We could have asked people, for now, to delete the .datachain directory after the run, or to put the cache on some ephemeral cluster storage … Let me rephrase this: it seems like a complicated solution (at least the PR makes it look that way), while we don't fully solve the issue. I wonder if we could have solved most of what this PR solves faster …
Yes, that was my first question too in #635 (comment).
Most of the changes in this PR are for making things correct and cleaning things up properly. But I agree that it took a lot of time.
@skshetry it looks like most of the issues are related to PyTorch, where we have limited control over deleting files / closing the session. How about regular UDFs? If a user runs a regular UDF with prefetch and without caching, what prevents us from caching and then deleting data properly, right after a file has been processed?
This PR will use a separate temporary cache for prefetching that resides in the `.datachain/tmp/prefetch-<random>` directory when `prefetch=` is set but `cache` is not. The temporary directory will be automatically deleted after the prefetching is done. For `cache=True`, the cache will be reused and won't be deleted.

Please note that auto-cleanup does not work for PyTorch datasets, because there is no way to invoke cleanup from the `Dataset` side. The `DataLoader` may still have cached data or rows even after the `Dataset` instance has finished iterating. As a result, values associated with a `catalog`/`cache` instance can outlive the `Dataset` instance.

One potential solution is to implement a custom dataloader or to provide a user-facing API. In this PR, I have implemented the latter: `PytorchDataset` now includes a `close()` method, which can be used to clean up the temporary prefetch cache.

Ended up using `weakref.finalize` to clean up the temporary cache directory during garbage collection. But calling `close()` should still be the recommended approach for `PytorchDataset`.
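A hedged usage sketch of the cleanup contract described above: the `to_pytorch()` call and training loop are illustrative placeholders, while `close()` and the `weakref.finalize` fallback are the mechanisms this PR describes:

```python
# Sketch of explicit cleanup for a PytorchDataset. `chain` is a
# DataChain like the one sketched earlier in the thread.
from torch.utils.data import DataLoader

dataset = chain.to_pytorch()  # returns a PytorchDataset
loader = DataLoader(dataset, batch_size=16)

try:
    for batch in loader:
        ...  # training step (placeholder)
finally:
    # Explicitly remove the temporary prefetch cache. If this is
    # skipped, weakref.finalize still deletes the directory when the
    # dataset is garbage-collected, but calling close() is preferred.
    dataset.close()
```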