Websites / departments in my organisation usually have a robots.txt with the following simple entry:
I am not sure how to deal with it when crawling with Heritrix 3.4. I tend to set
Replies: 1 comment
Heritrix does not currently support sitemaps (although there's a draft pull request adding it: #262) and does not support wildcards in Disallow lines (feature request #250). I haven't tested it but I would guess the rule
Disallow: /*?*
will be interpreted as matching paths that actually start with the literal string `/*?`. It will not match `/index.html?foo`.

Update (2022): Sitemaps are now supported.
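
To make the difference between the two interpretations of the Disallow rule concrete, here is a minimal Python sketch. It is not Heritrix code: the function names `literal_prefix_match` and `wildcard_match` are hypothetical, and the wildcard semantics (`*` as any run of characters, a trailing `$` as end of path) are assumed from the common convention used by major search engines and described in RFC 9309. It also assumes the query string is compared as part of the path.

```python
import re


def literal_prefix_match(rule_path: str, url_path: str) -> bool:
    """No wildcard support: the rule value is treated as a plain prefix."""
    return url_path.startswith(rule_path)


def wildcard_match(rule_path: str, url_path: str) -> bool:
    """Wildcard support (assumed semantics): '*' matches any run of
    characters and a trailing '$' anchors the rule to the end of the path."""
    anchored = rule_path.endswith("$")
    if anchored:
        rule_path = rule_path[:-1]
    # Escape everything except '*', which becomes the regex '.*'.
    pattern = ".*".join(re.escape(part) for part in rule_path.split("*"))
    if anchored:
        pattern += "$"
    # re.match anchors at the start of the path, so this is still a prefix test.
    return re.match(pattern, url_path) is not None


rule = "/*?*"
for path in ["/index.html?foo", "/index.html", "/*?*literal"]:
    print(path, literal_prefix_match(rule, path), wildcard_match(rule, path))
```

Under the literal reading, the rule only blocks a path that really begins with the characters `/*?*`, so `/index.html?foo` stays crawlable. Under the wildcard reading, any path containing a `?` is blocked.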