Robots.txt not respected if first page is redirected #9
@MothOnMars makes sense! Google's robots.txt documentation says this:
Now, since robotex seems to be discontinued as well, I would argue that you could either try to fix this in Medusa or replace robotex with another working gem that does the same. I like both options!
Thanks for the feedback, @brutuscat. Other robots parser gems I've looked at also don't have much recent dev activity, so first I'll put in a PR to fix this in Robotex. If there's no response at that point, I'll look into swapping out the gem in Medusa.
PR for Robotex: chriskite/robotex#8
@MothOnMars given that this is a problem with robotex, couldn't you just bundle update the robotex gem to your GitHub branch in your project? We do not ship a Gemfile.lock, and the current constraint on the gem is on
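For anyone who needs that workaround before a fix ships upstream, a Gemfile entry along these lines would do it. This is a minimal sketch; the fork and branch names below are placeholders, not a real published fix:

```ruby
# Gemfile — override the released robotex gem with a patched fork.
# 'someuser/robotex' and 'follow-redirects' are hypothetical names.
gem 'robotex', github: 'someuser/robotex', branch: 'follow-redirects'
```

After editing the Gemfile, `bundle update robotex` picks up the fork in place of the released gem.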
Yeah, that's how I resolved the issue in my own repo. I was just adding the Robotex PR/issue info here for visibility, in case other Medusa users encounter the issue. I'll close this up.
@MothOnMars great, thank you. Going forward I will be removing or revamping the abandoned @chriskite gems, either by replacing them with gems that are actively supported or by forking and modernising their code. BTW, will you be open to (or have some time this year, or the next, for) giving me feedback on the upcoming changes to the Medusa gem? As you can see in #14, some changes are coming that I expect to be somewhat disruptive, and I would like to understand how I can help you and other users migrate to them.
Sure, I'd be happy to. Ruby is short on good crawlers, and Medusa has been the best I've found. I'd love to see it become an official gem. I can also take a look at how the
@MothOnMars please do! I also look forward to publishing the gem; it's just that right now I don't consider it v1. Once I'm done replacing all the "stalled" or old gems, it will be ready.
If you set Medusa to crawl http://www.foo.com, which is redirected to https://www.foo.com, Medusa will successfully crawl the site, but it will not respect robots.txt. This appears to happen because Robotex attempts to pull the robots.txt file from http://www.foo.com/robots.txt without following the redirect, which results in no robot rules for the domain www.foo.com.

Example:
In https://www.yelp.com/robots.txt:
Disallow: /biz_link
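To make the symptom concrete, here is a sketch using Robotex's public `allowed?` API with the URL from the example above. The commented return value reflects the buggy behaviour being reported, not the correct one:

```ruby
require 'robotex'

robotex = Robotex.new('My User Agent')

# Robotex requests http://www.yelp.com/robots.txt, does not follow the
# redirect to HTTPS, and ends up with no rules for the host, so the
# disallowed path is incorrectly reported as crawlable.
robotex.allowed?('http://www.yelp.com/biz_link')
#=> true (should be false, since /biz_link is disallowed)
```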
I'd be happy to put in a PR to resolve this, but I've been going back and forth about whether the fix should be done in Robotex or Medusa.