Check cross-submits for sitemaps #32
Comments
Further scenario: a news site redirects one of its news articles to a page on another site as a kind of advertisement. We need to check the robots.txt of the target site, of course. But we should ignore the sitemap directives.
@sebastian-nagel I made a draft PR #68 which contains a fix for the issue, but I have a question. Sitemap 1️⃣ Should we consider this as a cross-submit?
Maybe we can make this configurable and optionally allow submits on the level of the registered (pay-level) domain or the private domain?
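A relaxed, registered-domain-level check could look roughly like the sketch below. The class name, the helper, and the tiny suffix set are illustrative assumptions only; real code would consult the full Public Suffix List (e.g. via crawler-commons' `EffectiveTldFinder`) rather than a hardcoded set.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.Set;

public class CrossSubmitCheck {

    // Tiny stand-in for the Public Suffix List (assumption, for illustration
    // only; real code would use EffectiveTldFinder or an equivalent).
    private static final Set<String> PUBLIC_SUFFIXES = Set.of("com", "org", "co.uk");

    /** Registered (pay-level) domain: the public suffix plus one more label. */
    static String registeredDomain(String host) {
        String[] labels = host.toLowerCase().split("\\.");
        for (int i = 1; i < labels.length; i++) {
            String suffix = String.join(".", Arrays.copyOfRange(labels, i, labels.length));
            if (PUBLIC_SUFFIXES.contains(suffix)) {
                return labels[i - 1] + "." + suffix;
            }
        }
        return host; // no known suffix: fall back to the full host
    }

    /**
     * True if the sitemap URL points outside the submitting host. With
     * allowSameRegisteredDomain set, submits within the same registered
     * (pay-level) domain are accepted.
     */
    static boolean isCrossSubmit(String robotsHost, String sitemapUrl,
                                 boolean allowSameRegisteredDomain) {
        String sitemapHost = URI.create(sitemapUrl).getHost();
        if (robotsHost.equalsIgnoreCase(sitemapHost)) {
            return false;
        }
        if (allowSameRegisteredDomain) {
            return !registeredDomain(robotsHost).equals(registeredDomain(sitemapHost));
        }
        return true;
    }
}
```

With the flag enabled, `news.example.com` submitting a sitemap on `www.example.com` would be accepted, while a submit to a different registered domain would still be rejected.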
Maybe one more comment about cross-submits: there is a long-standing open issue crawler-commons/crawler-commons#85 for this. It could be worth implementing it there, too, which would make it possible to simplify the code here in the future.
Sitemaps are automatically detected in the robots.txt but not checked for cross-submits. From time to time this leads to spam-like injections of URLs not matching the news genre. Recently, a publishing company "injected", via one of its periodicals, its entire publishing program, including landing pages for books and other media. The same happened with real estate ads before.
Note that the sitemaps must follow the news sitemap format, which is the barrier for most cross-submits, but not always.
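For context, the automatic detection mentioned above boils down to collecting `Sitemap:` directives from robots.txt (crawler-commons' robots.txt parser exposes these as well). A minimal stdlib-only sketch of that extraction step, with the class and method names chosen here for illustration:

```java
import java.util.ArrayList;
import java.util.List;

public class SitemapDirectives {

    /**
     * Extract Sitemap: directives from a robots.txt body. The field name is
     * matched case-insensitively and trailing '#' comments are stripped.
     */
    static List<String> extractSitemaps(String robotsTxt) {
        List<String> sitemaps = new ArrayList<>();
        for (String line : robotsTxt.split("\\r?\\n")) {
            String trimmed = line.trim();
            if (trimmed.regionMatches(true, 0, "sitemap:", 0, 8)) {
                String url = trimmed.substring(8).trim();
                int hash = url.indexOf('#');
                if (hash >= 0) {
                    url = url.substring(0, hash).trim();
                }
                if (!url.isEmpty()) {
                    sitemaps.add(url);
                }
            }
        }
        return sitemaps;
    }
}
```

Each extracted URL would then be passed through the cross-submit check before the sitemap is fetched and validated against the news sitemap format.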