Full crawl by default, reword strategies #40
One concern that Konstantin has raised is that this might put the spotlight on another issue: duplicate product URLs being crawled with varying URL query parameters. In any case, we already have a POC for it that would work without any configuration or intervention at all.
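To make the duplicate-URL concern concrete, here is a minimal sketch of how the same product page can show up under several URLs. This is not the POC mentioned above, which reportedly needs no configuration; the `IGNORED_PARAMS` set, the `canonicalize` helper, and the example URLs are all hypothetical and only illustrate the problem.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical list of query parameters assumed not to change which product is shown.
IGNORED_PARAMS = {"utm_source", "utm_medium", "ref", "sort"}


def canonicalize(url: str) -> str:
    """Drop ignored query parameters, sort the rest, and strip the fragment."""
    parts = urlsplit(url)
    kept = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in IGNORED_PARAMS
    )
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


# Three URLs that a full crawl could discover for one product (made-up examples).
urls = [
    "https://example.com/product/123?ref=home",
    "https://example.com/product/123?utm_source=mail&ref=nav",
    "https://example.com/product/123",
]
unique = {canonicalize(u) for u in urls}
```

With these inputs, the three URLs collapse into a single canonical one. A hand-maintained parameter list like this is exactly the kind of configuration the POC is said to avoid.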
Another concern that Konstantin has raised is whether we might switch to yet another default `crawl_strategy` in the future, perhaps once the subCategory link misclassification has been addressed. That could cause another set of changes to the spider.

I think `crawl_strategy=full` as the default would be okay here (moving forward), since users are most likely to use the homepage as the seed URL input. The current default, `crawl_strategy=navigation`, would most likely cause mistakes in the intended crawl behavior.

If they want to crawl a specific category, I think users would be more careful and aware of the `crawl_strategy`, since they would specifically pick the category URL they want to crawl, putting more attention on it.