Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Define standard params #34

Closed
wants to merge 1 commit into from
Closed

Conversation

Gallaecio
Copy link
Contributor

@Gallaecio Gallaecio commented Feb 6, 2024

To do:

@wRAR
Copy link
Member

wRAR commented Feb 9, 2024

This now has conflicts.

Comment on lines +39 to +147
description="ISO 3166-1 alpha-2 2-character string specified in "
"https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/geolocation.",
default=None,
json_schema_extra={
"enumMeta": {
code: {
"title": GEOLOCATION_OPTIONS_WITH_CODE[code],
}
for code in Geolocation
}
},
)
max_requests: Optional[int] = Field(
description=(
"The maximum number of Zyte API requests allowed for the crawl.\n"
"\n"
"Requests with error responses that cannot be retried or exceed "
"their retry limit also count here, but they incur in no costs "
"and do not increase the request count in Scrapy Cloud."
),
default=100,
json_schema_extra={
"widget": "request-limit",
},
)
crawl_strategy: CrawlStrategy = Field(
title="Crawl strategy",
description="Determines how the start URL and follow-up URLs are crawled.",
default=CrawlStrategy.navigation,
json_schema_extra={
"enumMeta": {
CrawlStrategy.full: {
"title": "Full",
"description": "Follow most links within the domain of URL in an attempt to discover and extract as many products as possible.",
},
CrawlStrategy.navigation: {
"title": "Navigation",
"description": "Follow pagination, subcategories, and product detail pages.",
},
CrawlStrategy.pagination_only: {
"title": "Pagination Only",
"description": (
"Follow pagination and product detail pages. SubCategory links are ignored. "
"Use this when some subCategory links are misidentified by ML-extraction."
),
},
},
},
)
extract_from: Optional[ExtractFrom] = Field(
title="Extraction source",
description=(
"Whether to perform extraction using a browser request "
"(browserHtml) or an HTTP request (httpResponseBody)."
),
default=None,
json_schema_extra={
"enumMeta": {
ExtractFrom.browserHtml: {
"title": "browserHtml",
"description": "Use browser rendering. Often provides the best quality.",
},
ExtractFrom.httpResponseBody: {
"title": "httpResponseBody",
"description": "Use HTTP responses. Cost-efficient and fast extraction method, which works well on many websites.",
},
},
},
)


def make_params(
cls_name,
params,
*,
default=None,
required=None,
set_args=None,
):
fields = {}
default = default or {}
required = set(required) if required else set()
for param in params:
field = AllParams.model_fields[param]
if field in required:
field.default = PydanticUndefined
else:
try:
field.default = default[param]
except KeyError:
pass
fields[param] = (field.annotation, field)
model = create_model(
cls_name,
__config__=ConfigDict(extra="forbid"),
**fields,
)
if set_args:
model.set_args = set_args
return model
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before further progress, I would like to decide if this is the way we want to approach this, i.e. a single model with all param definitions and a function to create a model with a subset of that.

@Gallaecio Gallaecio mentioned this pull request Feb 15, 2024
4 tasks
@Gallaecio
Copy link
Contributor Author

Gallaecio commented Feb 19, 2024

@kmike Regarding taking field reuse further in #38, I tried removing the annotation of extract_source from the models into the Field instance (as annotation), but it still triggers: Field 'extract_from' requires a type annotation.

Looking at their error handling code, I think using a type annotation is a hard requirement.

All field definitions, including overrides, require a type annotation.

@Gallaecio
Copy link
Contributor Author

Gallaecio commented Feb 19, 2024

@kmike Regarding the option of subclassing the AllParams class and hiding fields instead of the make_params option, it seems technically doable: we can redefine fields, and we can make it so that they are excluded from the schema, it seems.

However, I am not convinced about the approach, i.e. forcing every subclass to declare every field they want to hide, as opposed to defining the fields they want to have. I imagine with time the number of hidden fields would be larger than the number of kept fields. And when a new (optional) field is added upstream, it will show up in the UI after updating the library until hidden through code.

@Gallaecio
Copy link
Contributor Author

Closing in favor of #46.

@Gallaecio Gallaecio closed this Apr 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants