Robots.txt not respected if first page is redirected #9
@MothOnMars makes sense! Google's robots.txt documentation says this:
Now, since robotex seems to be discontinued as well, I would argue that you could either try to fix this in Medusa or replace robotex with another working gem that does the same. I like both options!
Thanks for the feedback, @brutuscat. Other robots parser gems I've looked at also don't have much recent dev activity, so first I'll put in a PR to fix this in Robotex. If there's no response at that point, I'll look into swapping out the gem in Medusa.
PR for Robotex: chriskite/robotex#8
@MothOnMars given that this is a problem with robotex, couldn't you just bundle update the robotex gem to your GitHub branch in your project? We do not ship a Gemfile.lock, and the current constraint on the gem is on
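For anyone who needs that workaround before a fix ships upstream, a Gemfile entry along these lines would do it. This is a minimal sketch; the fork and branch names below are placeholders, not a real published fix:

```ruby
# Gemfile — override the released robotex gem with a patched fork.
# 'someuser/robotex' and 'follow-redirects' are hypothetical names.
gem 'robotex', github: 'someuser/robotex', branch: 'follow-redirects'
```

After editing the Gemfile, `bundle update robotex` picks up the fork in place of the released gem.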
Yeah, that's how I resolved the issue in my own repo. I was just adding the Robotex PR/issue info here for visibility, in case other Medusa users encounter the issue. I'll close this up.
@MothOnMars great, thank you. Going forward I will be removing or revamping the abandoned @chriskite gems, either by replacing them with gems that are actively supported or by forking and modernising their code. BTW, will you be open to (or have some time this year, or the next, for) giving me feedback on the upcoming changes to the Medusa gem? As you can see in #14, some changes are coming that I expect to be somewhat disruptive, and I would like to understand how I can help you and other users migrate to them.
Sure, I'd be happy to. Ruby is short on good crawlers, and Medusa has been the best I've found. I'd love to see it become an official gem. I can also take a look at how the
@MothOnMars please do! I also look forward to publishing the gem; it's just that right now I don't consider it v1. Once I'm done replacing all the "stalled" or old gems, it will be ready.
If you set Medusa to crawl http://www.foo.com, which is redirected to https://www.foo.com, Medusa will successfully crawl the site, but it will not respect robots.txt. This appears to happen because Robotex attempts to pull the robots.txt file from http://www.foo.com/robots.txt without following the redirect, which results in no robot rules for the domain www.foo.com.

Example:
In https://www.yelp.com/robots.txt:
Disallow: /biz_link
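To make the symptom concrete, here is a sketch using Robotex's public `allowed?` API with the URL from the example above. The commented return value reflects the buggy behaviour being reported, not the correct one:

```ruby
require 'robotex'

robotex = Robotex.new('My User Agent')

# Robotex requests http://www.yelp.com/robots.txt, does not follow the
# redirect to HTTPS, and ends up with no rules for the host, so the
# disallowed path is incorrectly reported as crawlable.
robotex.allowed?('http://www.yelp.com/biz_link')
#=> true (should be false, since /biz_link is disallowed)
```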
I'd be happy to put in a PR to resolve this, but I've been going back and forth about whether the fix should be done in Robotex or Medusa.