
Check for cross submits #68

Status: Open. silentninja wants to merge 6 commits into the 2.x base branch.

Conversation

silentninja

Fixes #32

Skips unverified sitemap links by checking them against the sitemap URLs listed in the robots.txt of the target link's host.
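To illustrate the idea, here is a minimal sketch of such a check (not necessarily the PR's exact logic): a sitemap outlink is only accepted if the robots.txt of its host explicitly lists it. BaseRobotRules and getSitemaps() are from crawler-commons; the class and method names below are hypothetical.

```java
import java.net.URL;
import java.util.List;

import crawlercommons.robots.BaseRobotRules;

// Hypothetical helper illustrating the check described above.
public class SitemapCrossSubmitCheck {

    /**
     * Returns true if the candidate sitemap URL is declared in the
     * robots.txt rules fetched for the candidate's host.
     */
    public static boolean isVerifiedSitemap(URL candidate, BaseRobotRules rules) {
        List<String> declared = rules.getSitemaps();
        if (declared == null || declared.isEmpty()) {
            return false;
        }
        // Exact match against the sitemap URLs declared in robots.txt
        return declared.contains(candidate.toExternalForm());
    }
}
```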

sebastian-nagel (Collaborator) left a comment:

Hi @silentninja, thanks for the contribution! The code looks good and the accompanying unit test is much appreciated.

I see that the PR is still marked as a draft, but here is one comment up front which might require a deeper change:

How are sitemap indexes handled? For example, in the following setup:

robots.txt -> sitemap-index.xml -> sitemap-news.xml.gz
                                -> sitemap-video.xml.gz
                                -> sitemap-books.xml.gz

In that case, looking at the robots.txt alone is not enough. A recursive lookup into a sitemap to check whether it is an index seems too expensive, especially because the robots.txt is likely cached while sitemaps definitely are not.

What about keeping traces in the status index recording from which robots.txt a sitemap was detected? A similar feature is already available in StormCrawler to trace the seed origin of URLs, see metadata.track.path in MetadataTransfer. Of course, it would be sufficient to track the original robots.txt host name(s). Note: it is not uncommon for a sitemap to be referenced from multiple hosts. This way it would not even be necessary to fetch any robots.txt in case it is not found in the cache. (A rough sketch of this tracking idea follows the example below.)

One real example of such a "(news) sitemap detection chain":

https://www.anews.com.tr/robots.txt
  -> https://www.anews.com.tr/sitemap/index.xml
     -> https://www.anews.com.tr/sitemap/news.xml
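
As a rough sketch of the tracking idea mentioned above (illustration only: a plain map stands in for the per-URL metadata kept in the status index, and the key name sitemap.robots.host is invented here):

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Sketch only: records the host of the robots.txt in which a sitemap was
// discovered, so that sub-sitemaps of a sitemap index can be verified without
// fetching robots.txt again. "sitemap.robots.host" is an invented key name.
public class SitemapOriginTracker {

    public static void recordOrigin(Map<String, Set<String>> metadata, URL robotsTxt) {
        metadata.computeIfAbsent("sitemap.robots.host", k -> new HashSet<>())
                .add(robotsTxt.getHost().toLowerCase());
    }

    public static boolean sameOrigin(Map<String, Set<String>> metadata, URL subSitemap) {
        Set<String> origins = metadata.getOrDefault("sitemap.robots.host", Set.of());
        return origins.contains(subSitemap.getHost().toLowerCase());
    }

    public static void main(String[] args) throws MalformedURLException {
        Map<String, Set<String>> metadata = new HashMap<>();
        recordOrigin(metadata, new URL("https://www.anews.com.tr/robots.txt"));
        // index.xml -> news.xml: same host as the robots.txt, no extra fetch needed
        System.out.println(sameOrigin(metadata, new URL("https://www.anews.com.tr/sitemap/news.xml")));
    }
}
```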

silentninja (Author) commented on Feb 26, 2025:

Thanks for the review, @sebastian-nagel.

What about keeping traces in the status index recording from which robots.txt a sitemap was detected? A similar feature is already available in StormCrawler to trace the seed origin of URLs, see metadata.track.path in MetadataTransfer.

Why do we need to keep traces in the status index if metadata.track.path already contains the trace?

How are sitemap indexes handled

Sitemap indexes are tricky, especially for sitemaps with an incomplete trace. For example, if the seed is https://www.anews.com.tr/sitemap/news.xml in the anews example, we would still have to recursively fetch the sitemap index to find out the lineage in case the domains are different.

silentninja (Author) commented on Feb 26, 2025:

@sebastian-nagel I made some commits to check the path using MetadataTransfer, as you suggested. Could you check whether that makes sense?
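
For context, a path-based check could look roughly like the sketch below. The url.path key is how MetadataTransfer appears to record the hops when metadata.track.path is enabled (verify against the actual class); the helper itself is illustrative and not the code in these commits.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Sketch: accept a sitemap outlink only if some hop on its discovery path is a
// robots.txt on the same host. The hop URLs are assumed to be the values that
// MetadataTransfer records under "url.path" when metadata.track.path is enabled.
public class TrackedPathCheck {

    public static boolean discoveredViaRobotsOnSameHost(String[] trackedPath, URL sitemapUrl)
            throws MalformedURLException {
        if (trackedPath == null) {
            return false;
        }
        String host = sitemapUrl.getHost().toLowerCase();
        for (String hop : trackedPath) {
            URL hopUrl = new URL(hop);
            if (hopUrl.getPath().equals("/robots.txt")
                    && hopUrl.getHost().toLowerCase().equals(host)) {
                return true;
            }
        }
        return false;
    }
}
```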

silentninja marked this pull request as ready for review on March 4, 2025, 16:20.
silentninja (Author) commented:

cross-submits within the pay-level domain are definitely not safe for large hosting domains (blogspot.com, github.io, etc.) and would allow spam links to be injected

@sebastian-nagel I addressed most of your concerns except for filtering out the large hosting domains. The Apache HttpClient, which we use to get the root domain, does not differentiate these private domains. We would need to build a separate list of these large hosting domains from the Public Suffix List and filter them out. I will create an issue to track it.

sebastian-nagel (Collaborator) commented:

The Apache HttpClient, which we use to get the root domain, does not differentiate these private domains. We would need to build a separate list of these large hosting domains from the Public Suffix List and filter them out.

This is already implemented in crawler-commons' EffectiveTldFinder. It's already a dependency, but we should upgrade it.

Commit: …ttp.conn.util.PublicSuffixMatcher to get the hostnames when checking for cross submit
silentninja (Author) commented:

The Apache HttpClient, which we use to get the root domain, does not differentiate these private domains. We would need to build a separate list of these large hosting domains from the Public Suffix List and filter them out.

This is already implemented in crawler-commons' EffectiveTldFinder. It's already a dependency, but we should upgrade it.

Neat! I made the changes to the PR based on your suggestion, thanks!
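
For reference, a minimal sketch of a pay-level-domain comparison with crawler-commons; it assumes the crawlercommons.domains.EffectiveTldFinder class from a recent release and that the bundled public suffix list includes the private section, so that foo.blogspot.com and bar.blogspot.com get different assigned domains. Illustration only, not the code in this PR.

```java
import crawlercommons.domains.EffectiveTldFinder;

// Illustration: compare the "assigned" (pay-level) domains of two hosts.
// With the private section of the public suffix list in effect, blogspot.com
// and similar hosting domains act as suffixes, so two of their subdomains do
// not share a pay-level domain and a cross-submit between them is rejected.
public class PayLevelDomainCheck {

    public static boolean samePayLevelDomain(String sitemapHost, String robotsHost) {
        String d1 = EffectiveTldFinder.getAssignedDomain(sitemapHost);
        String d2 = EffectiveTldFinder.getAssignedDomain(robotsHost);
        return d1 != null && d1.equals(d2);
    }

    public static void main(String[] args) {
        // true: both resolve to example.com
        System.out.println(samePayLevelDomain("news.example.com", "www.example.com"));
        // expected false: blogspot.com is a private suffix
        System.out.println(samePayLevelDomain("foo.blogspot.com", "bar.blogspot.com"));
    }
}
```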
