[#157319228] implement SearchgovDomain#delay method #49
Conversation
This LGTM except that the requests to fetch robots.txt files are over HTTP instead of HTTPS. Is that because not all sites support HTTPS? Would it be too crazy to try HTTPS first and fall back on HTTP for sites that aren't HTTPS yet?
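For illustration, the fallback I have in mind would be something like this (just a sketch; the method name and error handling are illustrative and not part of this PR):

```ruby
require 'net/http'
require 'uri'

# Try to fetch robots.txt over HTTPS first; fall back to plain HTTP
# only if the HTTPS request fails (e.g. the site doesn't support TLS).
def fetch_robots_txt(domain)
  %w[https http].each do |scheme|
    begin
      response = Net::HTTP.get_response(URI("#{scheme}://#{domain}/robots.txt"))
      return response.body if response.is_a?(Net::HTTPSuccess)
    rescue StandardError
      next # HTTPS unreachable or handshake failed; try the next scheme
    end
  end
  nil
end
```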
I should add that the reason I'm wondering whether we should be using HTTPS is not because I think it's important for those requests to be encrypted but because I'm wondering whether it's important that we verify the authenticity of the site that we're sending requests to. If someone somehow hijacks DNS and is able to therefore trick one of our jobs to fetch and process their own robots.txt file, is there a way they could somehow make that robots.txt file dangerous for us to process? I looked at the robotex gem, and I can't see a way that someone could make a robots.txt file malicious to it. I'm thinking that maybe the difficulty of hijacking DNS combined with the fact that it might not be possible to craft a malicious robots.txt file makes it not worth the effort to first try HTTPS and then fall back on HTTP.
I think I've convinced myself that using HTTP instead of HTTPS doesn't pose a danger to us, but I've come across a separate issue with the robotex gem and fetching robots.txt over HTTP instead of HTTPS. For sites that automatically redirect HTTP requests to HTTPS (which I'm assuming a lot of .gov sites do), the robotex gem doesn't report the correct crawl delay when using HTTP requests. This is because the HTTP request to fetch robots.txt gets a redirect as a response, which robotex does not follow. Since robotex can't find any crawl delay in that response, it returns nil. As an example, look at https://www.gsa.gov/robots.txt and see that it specifies a crawl delay of 10 seconds. If you use robotex with HTTPS, then you get the correct crawl delay:
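Something along these lines, for example (the user agent string is just illustrative):

```ruby
require 'robotex'

robotex = Robotex.new('usasearch')
robotex.delay('https://www.gsa.gov/')
# => 10
```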
But if you use robotex with HTTP you don't get the correct crawl delay:
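Over HTTP, robotex only sees the redirect response and finds no Crawl-delay, so the delay comes back as nil (again a sketch, same illustrative user agent):

```ruby
require 'robotex'

robotex = Robotex.new('usasearch')
robotex.delay('http://www.gsa.gov/')
# => nil
```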
@noremmie, the issue you address is fixed by chriskite/robotex@51c7c12. There's a PR for that in that repo (chriskite/robotex#8), but considering that I've gotten no response from the repo owner for months, maybe you could review that in my branch: I should also add that I've considered adding a
@MothOnMars, I took a look at your Robotex change in your branch. I think my initial reaction to seeing HTTP instead of HTTPS was dogmatic, and now that I've thought it through more, I don't think it's dangerous for us to fetch robots.txt over HTTP.
No worries. I should have had you review that anyway. I'm going to update the Gemfile and add a spec for the issue I fixed in Robotex.
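For example, the Gemfile could point at the forked branch until chriskite/robotex#8 is merged upstream (the fork location and branch name below are assumptions):

```ruby
# Gemfile: use a fork of robotex that follows the HTTP -> HTTPS redirect
# when fetching robots.txt. Fork owner and branch name are placeholders.
gem 'robotex', github: 'MothOnMars/robotex', branch: 'follow_redirects'
```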
The branch was force-pushed from e3708e1 to 2d245b0.
LGTM 👍