Check the Last-Modified header of the remote file #19
base: master
Conversation
Hi, I see an issue filed about mistakenly treating a file as incomplete when it was actually complete, and I see here that you are trying to ascertain whether a file download is complete or not, plus a 'help wanted' tag...

I understand that you currently describe grab as stateless, but that makes it tricky to work out what is arguably the most important piece of state: whether a file is already a completed download. One simple piece of state might help here: while a file is still in the process of downloading, its filename could have '.part' appended to it, so you know for sure whether it's complete or not. I think this would solve all of the issues you mention in #18 (comment). I suspect that relying purely on local timestamps for all parts of this logic is just a bug-magnet, as you already detail.

I'm not sure how you can solve your third bullet, 'resumed downloads are never corrupted by remote file changes', without keeping some kind of state/metadata. Just knowing the current part-downloaded file's size and timestamps doesn't seem enough to know whether it's the same file or not. As you state in #18, perhaps that should be the responsibility of the parent/caller, but perhaps grab should provide a hook for this.
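To illustrate the '.part' idea, here is a minimal caller-side sketch (not grab's own behaviour), assuming grab's documented NewRequest/Do/Err API and the github.com/cavaliercoder/grab import path; downloadAtomic is a hypothetical helper name:

```go
package main

import (
	"log"
	"os"

	"github.com/cavaliercoder/grab"
)

// downloadAtomic writes the transfer to "<dst>.part" and renames it to dst
// only after the download finishes, so a file with the final name is always
// known to be complete.
func downloadAtomic(client *grab.Client, url, dst string) error {
	part := dst + ".part"

	req, err := grab.NewRequest(part, url)
	if err != nil {
		return err
	}

	resp := client.Do(req)
	if err := resp.Err(); err != nil {
		return err // keep the .part file so a later attempt can resume it
	}

	// Transfer complete: promote the partial file to its final name.
	return os.Rename(part, dst)
}

func main() {
	if err := downloadAtomic(grab.NewClient(), "http://example.com/file.bin", "file.bin"); err != nil {
		log.Fatal(err)
	}
}
```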
I agree, storing state might be the only sane way to solve these issues. I could also store ETags in the state file to see if the remote file has changed. Thanks for your feedback.

The problem is that this gives grab a much heavier footprint and means it starts writing files to people's disks without their say-so. I intended it to be an easy way to monitor the progress of multiple parallel downloads and quickly resume if things go sideways.

As you say, a hook for the caller to manage this problem might be the answer. I did recently add Request.BeforeCopy. Could this suit your needs?
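As a rough sketch of that direction, a caller could use Request.BeforeCopy to refuse a resume when the remote content appears to have changed. This assumes BeforeCopy accepts a func(*grab.Response) error and that Response.HTTPResponse exposes the received headers; savedETag stands in for caller-managed state:

```go
package main

import (
	"fmt"
	"log"

	"github.com/cavaliercoder/grab"
)

func main() {
	// savedETag is hypothetical caller-managed state, e.g. recorded in a
	// database when the partial download was first started.
	savedETag := `"abc123"`

	req, err := grab.NewRequest("file.bin", "http://example.com/file.bin")
	if err != nil {
		log.Fatal(err)
	}

	// Veto the copy when the server reports a different ETag than the one we
	// recorded, so a stale partial file is never resumed against new content.
	req.BeforeCopy = func(resp *grab.Response) error {
		remote := resp.HTTPResponse.Header.Get("ETag")
		if remote != "" && remote != savedETag {
			return fmt.Errorf("remote file changed (ETag %s != %s)", remote, savedETag)
		}
		return nil
	}

	resp := grab.NewClient().Do(req)
	if err := resp.Err(); err != nil {
		log.Fatal(err)
	}
	log.Println("saved to", resp.Filename)
}
```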
I could also add a
I think it certainly requires more thought/discussion, but I currently think hooks are probably the way to go. It's possible that the existing BeforeCopy might work for me, I've not fully investigated, but would that alone help us solve the outstanding issues to do with resuming partial downloads?

I have an intended use case which involves caching ~200k files, plus various metadata (and refreshing/updating the cache later). But I think it should probably not be the responsibility of grab to manage all of that metadata, because of idiomatic design, separation of concerns, fewer dependencies, it probably not being everyone's use case, etc. Plus, implementing state/metadata persistence using a single file won't scale up in that scenario; that data belongs in a database, probably owned by the parent app. Obviously, I want to build on something robust and fault tolerant, so I'm happy to do whatever it takes to ensure there are no corrupted files due to bad resumes. Hence the feedback! 👍

I still think using a specific file extension/suffix for partially downloaded files could be a good thing. In its favour: it solves several issues easily and quite sanely, and it's somewhat familiar behaviour (many browsers do something similar, although I'd be more than surprised if they didn't also store some metadata/state). On the other hand, it's not a full solution to the outstanding problems; specifically, it does nothing to ensure 'resumed downloads are never corrupted by remote file changes', which, albeit an edge case, would still result in buggy behaviour. Another consideration is whether implementing such renaming/moving would interfere in any way with implementing 'proper' state/metadata management, or whether it would perhaps be helpful. Food for thought.
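A sketch of the caller-owned approach described above: before asking grab to resume, the parent app could issue its own HEAD request and drop the stale partial file if the remote Last-Modified is newer than the local copy. staleLocalCopy is a hypothetical helper, it uses only the standard library, and the "fail open when headers are missing" policy is an illustrative choice:

```go
package main

import (
	"log"
	"net/http"
	"os"
)

// staleLocalCopy reports whether the partially downloaded file at path looks
// older than the remote resource, based on a HEAD request.
func staleLocalCopy(url, path string) (bool, error) {
	fi, err := os.Stat(path)
	if os.IsNotExist(err) {
		return false, nil // nothing local, nothing to invalidate
	} else if err != nil {
		return false, err
	}

	resp, err := http.Head(url)
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()

	lm, err := http.ParseTime(resp.Header.Get("Last-Modified"))
	if err != nil {
		return false, nil // no usable header; leave the decision to the caller
	}
	return lm.After(fi.ModTime()), nil
}

func main() {
	stale, err := staleLocalCopy("http://example.com/file.bin", "file.bin.part")
	if err != nil {
		log.Fatal(err)
	}
	if stale {
		// The remote file changed since the partial download started, so a
		// resume would splice two different versions together: start over.
		_ = os.Remove("file.bin.part")
	}
}
```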
I wonder if your use case with 200k files is similar to the one that birthed this project, namely cloning YUM package repositories? Grab should offer some convenience by managing the concurrency and batching of your 200k requests, while providing thread-safe status updates. You can technically guarantee eventual consistency using checksums, or reduce the risk (though without a guarantee) by disabling resume. I've closed the related issues.
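A sketch of the checksum and no-resume options mentioned above, assuming a Request.SetChecksum(hash.Hash, []byte, deleteOnError bool) method and a Request.NoResume field; the URL and checksum value are placeholders:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"log"

	"github.com/cavaliercoder/grab"
)

func main() {
	req, err := grab.NewRequest("file.bin", "http://example.com/file.bin")
	if err != nil {
		log.Fatal(err)
	}

	// Known-good checksum published alongside the file (hex value is a
	// placeholder). With deleteOnError set, a corrupted or badly resumed
	// download is removed instead of being left on disk.
	sum, _ := hex.DecodeString("9f86d081884c7d659a2faea0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08")
	req.SetChecksum(sha256.New(), sum, true)

	// Alternatively (or additionally), disable resume so every transfer
	// starts from byte zero and can never splice two remote versions.
	req.NoResume = true

	resp := grab.NewClient().Do(req)
	if err := resp.Err(); err != nil {
		log.Fatal(err)
	}
	log.Println("verified download saved to", resp.Filename)
}
```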
FWIW I have the same requirements as @jimsmart and approached the solution using the
2d742c7 to 3c0c4ad (Compare)
This PR aims to achieve the following outcomes:

- Sends `If-Modified-Since` for completed downloads and checks for a `304` response
- Compares `Last-Modified` headers with local timestamps when deciding to resume or overwrite incomplete downloads
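For context, the conditional-request mechanics the PR relies on can be sketched with the standard library alone; this illustrates the HTTP behaviour, not the PR's implementation, and the URL and filenames are placeholders:

```go
package main

import (
	"fmt"
	"log"
	"net/http"
	"os"
	"time"
)

func main() {
	const url = "http://example.com/file.bin"

	fi, err := os.Stat("file.bin")
	if err != nil {
		log.Fatal(err)
	}

	// Ask the server to send the body only if it changed after our local copy
	// was written; an unchanged file comes back as 304 Not Modified.
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		log.Fatal(err)
	}
	req.Header.Set("If-Modified-Since", fi.ModTime().UTC().Format(http.TimeFormat))

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	switch resp.StatusCode {
	case http.StatusNotModified: // 304: local copy is already complete and current
		fmt.Println("local file is up to date")
	case http.StatusOK:
		// 200: server has newer content; compare Last-Modified with the local
		// mtime to decide whether an incomplete file can be resumed safely.
		lm, _ := http.ParseTime(resp.Header.Get("Last-Modified"))
		fmt.Println("remote changed at", lm.Format(time.RFC1123))
	}
}
```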