-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Define standard params #34
Conversation
This now has conflicts. |
description="ISO 3166-1 alpha-2 2-character string specified in " | ||
"https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/geolocation.", | ||
default=None, | ||
json_schema_extra={ | ||
"enumMeta": { | ||
code: { | ||
"title": GEOLOCATION_OPTIONS_WITH_CODE[code], | ||
} | ||
for code in Geolocation | ||
} | ||
}, | ||
) | ||
max_requests: Optional[int] = Field( | ||
description=( | ||
"The maximum number of Zyte API requests allowed for the crawl.\n" | ||
"\n" | ||
"Requests with error responses that cannot be retried or exceed " | ||
"their retry limit also count here, but they incur in no costs " | ||
"and do not increase the request count in Scrapy Cloud." | ||
), | ||
default=100, | ||
json_schema_extra={ | ||
"widget": "request-limit", | ||
}, | ||
) | ||
crawl_strategy: CrawlStrategy = Field( | ||
title="Crawl strategy", | ||
description="Determines how the start URL and follow-up URLs are crawled.", | ||
default=CrawlStrategy.navigation, | ||
json_schema_extra={ | ||
"enumMeta": { | ||
CrawlStrategy.full: { | ||
"title": "Full", | ||
"description": "Follow most links within the domain of URL in an attempt to discover and extract as many products as possible.", | ||
}, | ||
CrawlStrategy.navigation: { | ||
"title": "Navigation", | ||
"description": "Follow pagination, subcategories, and product detail pages.", | ||
}, | ||
CrawlStrategy.pagination_only: { | ||
"title": "Pagination Only", | ||
"description": ( | ||
"Follow pagination and product detail pages. SubCategory links are ignored. " | ||
"Use this when some subCategory links are misidentified by ML-extraction." | ||
), | ||
}, | ||
}, | ||
}, | ||
) | ||
extract_from: Optional[ExtractFrom] = Field( | ||
title="Extraction source", | ||
description=( | ||
"Whether to perform extraction using a browser request " | ||
"(browserHtml) or an HTTP request (httpResponseBody)." | ||
), | ||
default=None, | ||
json_schema_extra={ | ||
"enumMeta": { | ||
ExtractFrom.browserHtml: { | ||
"title": "browserHtml", | ||
"description": "Use browser rendering. Often provides the best quality.", | ||
}, | ||
ExtractFrom.httpResponseBody: { | ||
"title": "httpResponseBody", | ||
"description": "Use HTTP responses. Cost-efficient and fast extraction method, which works well on many websites.", | ||
}, | ||
}, | ||
}, | ||
) | ||
|
||
|
||
def make_params( | ||
cls_name, | ||
params, | ||
*, | ||
default=None, | ||
required=None, | ||
set_args=None, | ||
): | ||
fields = {} | ||
default = default or {} | ||
required = set(required) if required else set() | ||
for param in params: | ||
field = AllParams.model_fields[param] | ||
if field in required: | ||
field.default = PydanticUndefined | ||
else: | ||
try: | ||
field.default = default[param] | ||
except KeyError: | ||
pass | ||
fields[param] = (field.annotation, field) | ||
model = create_model( | ||
cls_name, | ||
__config__=ConfigDict(extra="forbid"), | ||
**fields, | ||
) | ||
if set_args: | ||
model.set_args = set_args | ||
return model |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Before further progress, I would like to decide if this is the way we want to approach this, i.e. a single model with all param definitions and a function to create a model with a subset of that.
@kmike Regarding taking field reuse further in #38, I tried removing the annotation of Looking at their error handling code, I think using a type annotation is a hard requirement.
|
@kmike Regarding the option of subclassing the However, I am not convinced about the approach, i.e. forcing every subclass to declare every field they want to hide, as opposed to defining the fields they want to have. I imagine with time the number of hidden fields would be larger than the number of kept fields. And when a new (optional) field is added upstream, it will show up in the UI after updating the library until hidden through code. |
Closing in favor of #46. |
To do:
CrawlStrategy
? Do we define different “AllParams” per vertical?