Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Check the Last-Modified header of the remote file #19

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

cavaliercoder
Copy link
Collaborator

This PR aims to achieve the following outcomes:

  • grab sets If-Modified-Since for completed downloads and checks for a 304 response
  • grab compares Last-Modified headers with local timestamps when deciding to resume or overwrite incomplete downloads
  • resumed downloads are never corrupted by remote file changes

@jimsmart
Copy link

jimsmart commented Jan 14, 2018

Hi,

I see an issue filed to do with mistakenly treating a file as incomplete when it was actually complete, and I see here you are trying to ascertain whether a file download is complete or not — and I see a 'help wanted' tag...

I understand that you currently describe grab as stateless, but of course then it gets tricky trying to actually work out what's arguably the most important state: is a file already a completed download.

Perhaps one simple piece of state might help here: perhaps whilst a file is still in the process of downloading its filename could have '.part' appended to it, then you know for sure if it's complete or not.

I think this would solve all of the issues you mention in #18 (comment)

I suspect that relying purely on local timestamps for all parts of this logic is just a bug-magnet, as you already detail.

— I'm not sure how you can solve your third bullet, 'resumed downloads are never corrupted by remote file changes', without keeping some kind of state/metadata. Just knowing the current part-downloaded file's size and timestamps doesn't seem enough to know if its the same file or not. As you state in #18, perhaps that should be the responsibility of the parent/caller. But perhaps grab should provide a hook for this.

@cavaliercoder
Copy link
Collaborator Author

I agree, storing state might be the only sane way to solve these issues. I could also store E-Tags in the state file to see if the remote file has changed. Thanks for your feedback.

The problem is this gives grab such a heavier footprint and starts writing files to people disks without their say so. I intended for it to be an easy way to monitor the progress of multiple parallel downloads and quickly resume if things go sideways. As you say, a hook for the caller to manage this problem might be the answer.

I did recently add Request.BeforeCopy. Could this suit your needs?

@cavaliercoder
Copy link
Collaborator Author

I could also add a StoreState and RestoreState that are only called if the caller explicitly wants to take advantage of this feature.

@jimsmart
Copy link

I think it certainly requires more thought / discussion, but I currently think hooks are probably the way to go. It's possible that the existing BeforeCopy might work for me, I've not fully investigated — but would that alone help us solve the outstanding issues to do with resuming partial downloads?

I have an intended use case, which involves caching ~200k files, plus various metadata (and refreshing/updating the cache later). But I think it should probably not be the responsibility of grab to manage all of that metadata — because: idiomacy, separation of concerns, less deps, probably not everyone's use case, etc. Plus implementing state/metadata persistence using a single file won't scale up in that scenario. (That data belongs in a database, probably owned by the parent app). Obviously, I want to build on something robust and fault tolerant, so I'm happy to do whatever it takes to ensure there are no corrupted files due to bad resumes. Hence the feedback! 👍

I still think using a specific file extension/suffix for partially downloaded files could probably be good thing — In its favour: it solves several issues easily and quite sanely, and is somewhat familiar behaviour (because many browsers do similar, although I'd be more than surprised if they didn't also store some metadata/state); On the other hand, it's not a full solution to the outstanding problems, specifically it does nothing to ensure 'resumed downloads are never corrupted by remote file changes', which, albeit an edge case, would still result in buggy behaviour. Another viewpoint/consideration is whether implementing such renaming/moving would interfere in any way with implementing 'proper' state/metadata management, or would it perhaps be helpful?

Food for thought.

@cavaliercoder
Copy link
Collaborator Author

I wonder if your use-case with 200k files is similar to the one that birthed this project - namely, cloning YUM package repositories?

Grab should offer some convenience by way of managing the concurrency and batching of your 200k requests, while providing thread-safe status updates. You can technically guarantee eventual consistency using checksums, or (while not guaranteed) disabling resume.

I've closed the related issues wontfix, but I'll keep this open as an experiment to see how it would look if we implemented this kind of guarantee and state persistence in grab. It certainly seems at least to be in high demand.

@cavaliercoder
Copy link
Collaborator Author

Relates to #21 and #18.

@justinfx
Copy link

justinfx commented Jul 3, 2020

FWIW I have the same requirements as @jimsmart and approached the solution using the BeforeCopy hook. I hit one issue which I just fixed via #80 (thanks for merging).
So at this point I hook in before the copy and consult my local datastore for the hash from the previous download, to compare with a hash that is being delivered in the current response headers. It allows me to cancel if I conclude that I already have the complete same file. I use the AfterCopy hook to dynamically read the hash from the header and do SetCheckSum with it just before it moves to that state.
I don't think grab should learn about state, but I do think the more hooks we have into the request and response lifecycle, the better. Because the user knows the details of the state and not grab. If there were earlier hooks before the file is opened, I could have solved my issue there as well without a fix in #80.
Grab is a great library so far and it would remove the burden of responsibility and make it even more flexible for users to have more hooks.

@cavaliercoder cavaliercoder force-pushed the master branch 2 times, most recently from 2d742c7 to 3c0c4ad Compare January 8, 2022 02:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants