
Combination of Crawl-delay and badbot Disallow results in blocking of Googlebot #51

Open
mojmirdurik opened this issue May 27, 2022 · 3 comments


mojmirdurik commented May 27, 2022

For example, Googlebot gets blocked by the following robots.txt (you can check this in Google's robots.txt testing tool):

# Slow down bots
User-agent: *
Crawl-delay: 10

# Disallow: Badbot
User-agent: badbot
Disallow: /

# allow explicitly all other bots
User-agent: *
Disallow:

If you remove the Crawl-delay directive, Googlebot passes. This works:

# Disallow: Badbot
User-agent: badbot
Disallow: /

# allow explicitly all other bots
User-agent: *
Disallow:

And this too:

# Disallow: Badbot
User-agent: badbot
Disallow: /

If you would like to use the Crawl-delay directive without blocking Googlebot, you must add an Allow directive:

# Slow down bots
User-agent: *
Crawl-delay: 10

# Disallow: Badbot
User-agent: badbot
Disallow: /

# allow explicitly all other bots
User-agent: *
Disallow:

# allow explicitly all other bots (supported only by google and bing)
User-agent: *
Allow: /

Both Crawl-delay and Allow are unofficial directives. Crawl-delay is widely supported (except by Googlebot). Allow is supported only by Googlebot and Bingbot (AFAIK). Normally Googlebot should be allowed by all of the robots.txt examples above. E.g. if you choose AdsBot-Google in the mentioned Google tool, it passes for all of them, while all other Google bots fail in the same way. We first noticed this unexpected behaviour at the end of 2021.

Is this a mistake in Googlebot's parsing of robots.txt, or am I just missing something?

@mojmirdurik mojmirdurik changed the title Combination of Crawl-delay and badbot disallow results in blocking of googlebot Combination of Crawl-delay and badbot Disallow results in blocking of Googlebot May 27, 2022
@garyillyes (Contributor)

Hi Mojmir, and thanks for opening this issue. Custom lines such as Crawl-delay and Sitemap should indeed not affect the parsing of other lines; in fact, they should be ignored, as also stipulated in the REP internet draft. For example, from the perspective of Googlebot these two robots.txt snippets are equivalent:

User-agent: *
Crawl-delay: 10

User-agent: badbot
Disallow: /

vs.

User-agent: *
Some other unsupported line in plain text

User-agent: badbot
Disallow: /

This means that User-agent: * is merged together with User-agent: badbot, essentially disallowing everything for the global (*) user-agent.

Not ignoring custom lines such as Crawl-delay was a bug, and has been fixed in the following commit: c8ac4b1

Unfortunately, the testing tool in Google Search Console, unlike Googlebot, does not use this library, so we haven't gotten around to fixing this obscure bug there yet.
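
For anyone who wants to reproduce the merging behaviour described above with the parser in this repository rather than with the Search Console tool, here is a minimal sketch. It assumes the googlebot::RobotsMatcher API shown in the repository README (robots.h, OneAgentAllowedByRobots); the example URL is a placeholder, and the printed result simply reflects what the checked-out version of the library does with the reported robots.txt.

#include <iostream>
#include <string>

#include "robots.h"

int main() {
  // The robots.txt from the original report: Crawl-delay under "*",
  // then a badbot group, then an empty Disallow for "*".
  const std::string robots_txt =
      "# Slow down bots\n"
      "User-agent: *\n"
      "Crawl-delay: 10\n"
      "\n"
      "# Disallow: Badbot\n"
      "User-agent: badbot\n"
      "Disallow: /\n"
      "\n"
      "# allow explicitly all other bots\n"
      "User-agent: *\n"
      "Disallow:\n";

  googlebot::RobotsMatcher matcher;
  const bool allowed = matcher.OneAgentAllowedByRobots(
      robots_txt, "Googlebot", "https://example.com/page");
  std::cout << "Googlebot allowed: " << (allowed ? "yes" : "no") << std::endl;
  return 0;
}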


mojmirdurik commented May 31, 2022

Hi Gary, thank you for your answer.

I didn't know that; the syntax of robots.txt is really a bit tricky. Unofficial rules (e.g. Crawl-delay) can result in different evaluations by different bots, because if an unofficial rule is ignored by a bot, the two groups are merged into one and the meaning of the robots.txt can change dramatically.

With this in mind, it is better to put unofficial rules at the end of the file, especially if they are used with User-agent: *. So Googlebot will be blocked by a robots.txt like this (ignoring Crawl-delay):

User-agent: *
Crawl-delay: 10

User-agent: badbot
Disallow: /

...but not by a robots.txt like this:

User-agent: badbot
Disallow: /

User-agent: *
Crawl-delay: 10

...even though both seem to do the same thing.
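
A minimal sketch of that ordering check against the library in this repository, again assuming the RobotsMatcher API from the README; the user-agent string and example URL are placeholders.

#include <iostream>
#include <string>

#include "robots.h"

// Returns whether "Googlebot" may fetch an example URL under the given robots.txt body.
static bool GooglebotAllowed(const std::string& robots_txt) {
  googlebot::RobotsMatcher matcher;
  return matcher.OneAgentAllowedByRobots(robots_txt, "Googlebot",
                                         "https://example.com/page");
}

int main() {
  // Crawl-delay group first: with the custom line ignored, the empty "*"
  // group can merge with the badbot group.
  const std::string crawl_delay_first =
      "User-agent: *\n"
      "Crawl-delay: 10\n"
      "\n"
      "User-agent: badbot\n"
      "Disallow: /\n";

  // Crawl-delay group last: the badbot group is closed before "*" starts.
  const std::string crawl_delay_last =
      "User-agent: badbot\n"
      "Disallow: /\n"
      "\n"
      "User-agent: *\n"
      "Crawl-delay: 10\n";

  std::cout << "Crawl-delay first: "
            << (GooglebotAllowed(crawl_delay_first) ? "allowed" : "blocked") << "\n";
  std::cout << "Crawl-delay last:  "
            << (GooglebotAllowed(crawl_delay_last) ? "allowed" : "blocked") << "\n";
  return 0;
}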

@garyillyes (Contributor)

You're correct: lines that are not supported by Googlebot but are otherwise part of a group, like Crawl-delay in your examples, should ideally be at the end of the file.
