Dynamic robots.txt to reduce search space #361

Open
tleb opened this issue Dec 19, 2024 · 0 comments

tleb commented Dec 19, 2024

Currently, our robots.txt is:

User-Agent: *
Allow: /
Crawl-Delay: 5

The Crawl-Delay is not respected by the biggest crawlers: we've seen days where a single bot User-Agent averages 5 req/s. We can, however, hope they respect Allow rules, which we could generate dynamically so that crawlers are no longer faced with a search space they can never finish indexing.

We should limit indexing to a few versions of each project. E.g., for Linux, that could be a few old releases (like the latest v2.6 tag), the LTS releases, and the last N versions.
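
As a rough sketch for the Linux part of the file (version names are only illustrative, and this assumes the usual /<project>/<version>/ URL layout), the generated rules could look like:

User-Agent: *
Disallow: /linux/
Allow: /linux/v2.6.39.4/
Allow: /linux/v6.6.70/
Allow: /linux/v6.12/
Crawl-Delay: 5

Major crawlers treat the more specific Allow as overriding the broader Disallow, so everything outside the listed versions would be excluded.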

For example, for Linux, we would go from 6843 tags down to roughly 10-30. Doing so would mean that crawlers could actually finish indexing Elixir and stop once done; given the current theoretical page count, no crawler can ever be done.
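
A minimal sketch of how the selection and generation could work, assuming the tag list is already sorted oldest to newest (function names and the LTS prefixes below are illustrative, not part of the Elixir codebase):

# Hypothetical tag-selection policy: a few old releases, the latest tag of
# each LTS series, and the most recent N tags.
def select_indexable_tags(all_tags, recent_count=10,
                          lts_prefixes=("v5.4.", "v5.10.", "v5.15.", "v6.1.", "v6.6.")):
    keep = set()

    # Last historical v2.6 release, if any.
    old = [t for t in all_tags if t.startswith("v2.6")]
    if old:
        keep.add(old[-1])

    # Latest tag of each LTS series.
    for prefix in lts_prefixes:
        series = [t for t in all_tags if t.startswith(prefix)]
        if series:
            keep.add(series[-1])

    # Most recent N tags overall (all_tags assumed sorted oldest -> newest).
    keep.update(all_tags[-recent_count:])
    return sorted(keep)

def make_robots_txt(project, all_tags):
    # Disallow the whole project, then re-allow only the selected versions.
    lines = ["User-Agent: *", f"Disallow: /{project}/"]
    lines += [f"Allow: /{project}/{tag}/" for tag in select_indexable_tags(all_tags)]
    lines.append("Crawl-Delay: 5")
    return "\n".join(lines) + "\n"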

The top crawlers in terms of requests per second label themselves properly in their User-Agent, so we can hope they respect robots.txt.
