-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Full crawl by default, reword strategies #40
Conversation
Codecov Report
❗ Your organization needs to install the Codecov GitHub app to enable full functionality. Additional details and impacted files@@ Coverage Diff @@
## main #40 +/- ##
==========================================
+ Coverage 98.61% 98.81% +0.19%
==========================================
Files 12 12
Lines 506 506
==========================================
+ Hits 499 500 +1
+ Misses 7 6 -1
|
|
||
|
||
class EcommerceSpiderParams(BaseSpiderParams): | ||
crawl_strategy: EcommerceCrawlStrategy = Field( | ||
title="Crawl strategy", | ||
description="Determines how the start URL and follow-up URLs are crawled.", | ||
default=EcommerceCrawlStrategy.navigation, | ||
default=EcommerceCrawlStrategy.full, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One concern that Konstantin has raised is that this might put the spotlight in another issue which is the duplicate product URLs crawled with varying URL Query parameters. In any case, we already have a POC made for it which would work without any configuration/intervention at all.
|
||
|
||
class EcommerceSpiderParams(BaseSpiderParams): | ||
crawl_strategy: EcommerceCrawlStrategy = Field( | ||
title="Crawl strategy", | ||
description="Determines how the start URL and follow-up URLs are crawled.", | ||
default=EcommerceCrawlStrategy.navigation, | ||
default=EcommerceCrawlStrategy.full, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another concern that Konstantin has raised would be if we might change to another default crawl_strategy
in the future. Perhaps when the subCategory link misclassification has been addressed. This could cause another set of changes to the spider.
I think crawl_strategy=full
as the default would be okay here (moving forward) since users are most likely to use the homepage as the seed URL input. The current crawl_strategy=navigation
as the default would most likely cause some mistakes on the intended crawl behavior.
If they want to crawl a specific category, I think users would be more careful and aware of the crawl_strategy
since they would specifically pick the category URL they want to crawl, putting more attention on it.
No description provided.