Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Full crawl by default, reword strategies #40

Merged
merged 3 commits into from
Feb 15, 2024

Conversation

Gallaecio
Copy link
Contributor

No description provided.

@Gallaecio Gallaecio requested review from kmike, wRAR and BurnzZ February 15, 2024 12:23
@codecov-commenter
Copy link

codecov-commenter commented Feb 15, 2024

Codecov Report

Merging #40 (46651ee) into main (f326a39) will increase coverage by 0.19%.
The diff coverage is 100.00%.

❗ Current head 46651ee differs from pull request most recent head ae16216. Consider uploading reports for the commit ae16216 to get more accurate results

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@            Coverage Diff             @@
##             main      #40      +/-   ##
==========================================
+ Coverage   98.61%   98.81%   +0.19%     
==========================================
  Files          12       12              
  Lines         506      506              
==========================================
+ Hits          499      500       +1     
+ Misses          7        6       -1     
Files Coverage Δ
zyte_spider_templates/spiders/ecommerce.py 100.00% <100.00%> (+1.20%) ⬆️



class EcommerceSpiderParams(BaseSpiderParams):
crawl_strategy: EcommerceCrawlStrategy = Field(
title="Crawl strategy",
description="Determines how the start URL and follow-up URLs are crawled.",
default=EcommerceCrawlStrategy.navigation,
default=EcommerceCrawlStrategy.full,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One concern that Konstantin has raised is that this might put the spotlight in another issue which is the duplicate product URLs crawled with varying URL Query parameters. In any case, we already have a POC made for it which would work without any configuration/intervention at all.



class EcommerceSpiderParams(BaseSpiderParams):
crawl_strategy: EcommerceCrawlStrategy = Field(
title="Crawl strategy",
description="Determines how the start URL and follow-up URLs are crawled.",
default=EcommerceCrawlStrategy.navigation,
default=EcommerceCrawlStrategy.full,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another concern that Konstantin has raised would be if we might change to another default crawl_strategy in the future. Perhaps when the subCategory link misclassification has been addressed. This could cause another set of changes to the spider.

I think crawl_strategy=full as the default would be okay here (moving forward) since users are most likely to use the homepage as the seed URL input. The current crawl_strategy=navigation as the default would most likely cause some mistakes on the intended crawl behavior.

If they want to crawl a specific category, I think users would be more careful and aware of the crawl_strategy since they would specifically pick the category URL they want to crawl, putting more attention on it.

@Gallaecio Gallaecio merged commit 5fcb399 into zytedata:main Feb 15, 2024
10 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants