Dynamic robots.txt to reduce search space #361

Open
tleb opened this issue Dec 19, 2024 · 0 comments

tleb commented Dec 19, 2024

Currently, our robots.txt is:

User-Agent: *
Allow: /
Crawl-Delay: 5

The Crawl-Delay is not respected by the biggest crawlers: we've seen days where a single bot User-Agent averages 5 req/s. We can, however, hope they respect Allow rules, which we could generate dynamically so that crawlers are no longer faced with a search space they can never finish indexing.

We should limit indexing to a few versions of each project. E.g., for Linux, that could be a few old releases (like the latest v2.6 tag), the LTS releases, and the last N versions.
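
As a rough sketch for the Linux part of the file (version names are only illustrative, and this assumes the usual /<project>/<version>/ URL layout), the generated rules could look like:

User-Agent: *
Disallow: /linux/
Allow: /linux/v2.6.39.4/
Allow: /linux/v6.6.70/
Allow: /linux/v6.12/
Crawl-Delay: 5

Major crawlers treat the more specific Allow as overriding the broader Disallow, so everything outside the listed versions would be excluded.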

For example, for Linux, we would go from 6843 tags down to roughly 10-30. Doing so would mean that crawlers could actually finish indexing Elixir and stop once done; given the current theoretical page count, no crawler can ever be done.
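
A minimal sketch of how the selection and generation could work, assuming the tag list is already sorted oldest to newest (function names and the LTS prefixes below are illustrative, not part of the Elixir codebase):

# Hypothetical tag-selection policy: a few old releases, the latest tag of
# each LTS series, and the most recent N tags.
def select_indexable_tags(all_tags, recent_count=10,
                          lts_prefixes=("v5.4.", "v5.10.", "v5.15.", "v6.1.", "v6.6.")):
    keep = set()

    # Last historical v2.6 release, if any.
    old = [t for t in all_tags if t.startswith("v2.6")]
    if old:
        keep.add(old[-1])

    # Latest tag of each LTS series.
    for prefix in lts_prefixes:
        series = [t for t in all_tags if t.startswith(prefix)]
        if series:
            keep.add(series[-1])

    # Most recent N tags overall (all_tags assumed sorted oldest -> newest).
    keep.update(all_tags[-recent_count:])
    return sorted(keep)

def make_robots_txt(project, all_tags):
    # Disallow the whole project, then re-allow only the selected versions.
    lines = ["User-Agent: *", f"Disallow: /{project}/"]
    lines += [f"Allow: /{project}/{tag}/" for tag in select_indexable_tags(all_tags)]
    lines.append("Crawl-Delay: 5")
    return "\n".join(lines) + "\n"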

The top crawlers in terms of requests per second label themselves properly in their User-Agent, so we can hope they respect robots.txt.
