Define standard params #34

Gallaecio · 2024-02-06T11:03:25Z

To do:

- Agree on the approach.
  - How do we handle CrawlStrategy? Do we define different “AllParams” per vertical?
- Complete, merge and release “Support a set_args method in param models” (scrapy-plugins/scrapy-spider-metadata#11).
- Tests.
- Documentation.
- MyPy.

wRAR · 2024-02-09T08:53:48Z

This now has conflicts.

Gallaecio · 2024-02-09T08:58:32Z

zyte_spider_templates/params.py

+        description="ISO 3166-1 alpha-2 2-character string specified in "
+        "https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/geolocation.",
+        default=None,
+        json_schema_extra={
+            "enumMeta": {
+                code: {
+                    "title": GEOLOCATION_OPTIONS_WITH_CODE[code],
+                }
+                for code in Geolocation
+            }
+        },
+    )
+    max_requests: Optional[int] = Field(
+        description=(
+            "The maximum number of Zyte API requests allowed for the crawl.\n"
+            "\n"
+            "Requests with error responses that cannot be retried or exceed "
+            "their retry limit also count here, but they incur no costs "
+            "and do not increase the request count in Scrapy Cloud."
+        ),
+        default=100,
+        json_schema_extra={
+            "widget": "request-limit",
+        },
+    )
+    crawl_strategy: CrawlStrategy = Field(
+        title="Crawl strategy",
+        description="Determines how the start URL and follow-up URLs are crawled.",
+        default=CrawlStrategy.navigation,
+        json_schema_extra={
+            "enumMeta": {
+                CrawlStrategy.full: {
+                    "title": "Full",
+                    "description": "Follow most links within the domain of URL in an attempt to discover and extract as many products as possible.",
+                },
+                CrawlStrategy.navigation: {
+                    "title": "Navigation",
+                    "description": "Follow pagination, subcategories, and product detail pages.",
+                },
+                CrawlStrategy.pagination_only: {
+                    "title": "Pagination Only",
+                    "description": (
+                        "Follow pagination and product detail pages. SubCategory links are ignored. "
+                        "Use this when some subCategory links are misidentified by ML-extraction."
+                    ),
+                },
+            },
+        },
+    )
+    extract_from: Optional[ExtractFrom] = Field(
+        title="Extraction source",
+        description=(
+            "Whether to perform extraction using a browser request "
+            "(browserHtml) or an HTTP request (httpResponseBody)."
+        ),
+        default=None,
+        json_schema_extra={
+            "enumMeta": {
+                ExtractFrom.browserHtml: {
+                    "title": "browserHtml",
+                    "description": "Use browser rendering. Often provides the best quality.",
+                },
+                ExtractFrom.httpResponseBody: {
+                    "title": "httpResponseBody",
+                    "description": "Use HTTP responses. Cost-efficient and fast extraction method, which works well on many websites.",
+                },
+            },
+        },
+    )
+
+
+def make_params(
+    cls_name,
+    params,
+    *,
+    default=None,
+    required=None,
+    set_args=None,
+):
+    fields = {}
+    default = default or {}
+    required = set(required) if required else set()
+    for param in params:
+        field = AllParams.model_fields[param]
+        if param in required:
+            field.default = PydanticUndefined
+        else:
+            try:
+                field.default = default[param]
+            except KeyError:
+                pass
+        fields[param] = (field.annotation, field)
+    model = create_model(
+        cls_name,
+        __config__=ConfigDict(extra="forbid"),
+        **fields,
+    )
+    if set_args:
+        model.set_args = set_args
+    return model


Before further progress, I would like to decide if this is the way we want to approach this, i.e. a single model with all param definitions and a function to create a model with a subset of that.
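To make the proposal concrete, here is a minimal, self-contained sketch of the "one model with every parameter, plus a factory for subsets" idea. It assumes Pydantic 2. The trimmed AllParams, its three fields, and the ProductParams name are illustrative stand-ins rather than the definitions from this PR, and set_args is left out. The sketch also builds fresh Field objects instead of changing the ones stored on AllParams, so that one generated model cannot affect another.

```python
from typing import Optional

from pydantic import BaseModel, ConfigDict, Field, ValidationError, create_model
from pydantic_core import PydanticUndefined


class AllParams(BaseModel):
    # Illustrative subset of parameters, not the real definitions.
    url: str = Field(title="URL", description="Start URL of the crawl.")
    max_requests: Optional[int] = Field(
        default=100,
        description="The maximum number of requests allowed for the crawl.",
    )
    geolocation: Optional[str] = Field(
        default=None,
        description="ISO 3166-1 alpha-2 country code.",
    )


def make_params(cls_name, params, *, default=None, required=None):
    """Create a model that exposes only the named AllParams fields."""
    default = default or {}
    required = set(required or ())
    fields = {}
    for name in params:
        base = AllParams.model_fields[name]
        if name in required:
            # No default at all means the field is required.
            new_default = PydanticUndefined
        else:
            # Use the override if given, else keep the shared default.
            new_default = default.get(name, base.default)
        fields[name] = (
            base.annotation,
            Field(
                default=new_default,
                title=base.title,
                description=base.description,
                json_schema_extra=base.json_schema_extra,
            ),
        )
    return create_model(
        cls_name,
        __config__=ConfigDict(extra="forbid"),
        **fields,
    )


ProductParams = make_params(
    "ProductParams",
    ["url", "max_requests"],
    default={"max_requests": 10},
)
params = ProductParams(url="https://example.com")

# geolocation was not selected, so extra="forbid" rejects it.
try:
    ProductParams(url="https://example.com", geolocation="US")
    rejected = False
except ValidationError:
    rejected = True
```

With this shape, each spider lists the parameters it wants, and a parameter added to AllParams later stays invisible until a spider opts in.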

Gallaecio · 2024-02-19T09:00:43Z

@kmike Regarding taking field reuse further in #38, I tried moving the type annotation of extract_from out of the models and into the Field instance (as annotation), but it still triggers: Field 'extract_from' requires a type annotation.

Looking at Pydantic's error handling code, I think using a type annotation is a hard requirement. The error for overridden fields reads:

All field definitions, including overrides, require a type annotation.
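For reference, a small reproduction of that requirement, assuming Pydantic 2. The Base, Standalone, Override and AnnotatedOverride classes are made up for this example, and extract_from is typed as a plain optional string to keep it self-contained. Both unannotated definitions are rejected while the class is being created; only the annotated override goes through.

```python
from typing import Optional

from pydantic import BaseModel, Field, PydanticUserError


class Base(BaseModel):
    extract_from: Optional[str] = Field(default=None, title="Extraction source")


messages = []

# Case 1: a new field declared through Field() but without an annotation.
try:
    class Standalone(BaseModel):
        extract_from = Field(default=None)
except PydanticUserError as exc:
    messages.append(str(exc))

# Case 2: overriding an inherited field without repeating the annotation.
try:
    class Override(Base):
        extract_from = Field(default="httpResponseBody")
except PydanticUserError as exc:
    messages.append(str(exc))


# Repeating the annotation is what makes the override acceptable.
class AnnotatedOverride(Base):
    extract_from: Optional[str] = Field(default="httpResponseBody")


for message in messages:
    print(message)
```

So even when a field is only being tweaked in a subclass, the annotation has to be written out again next to it.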

Gallaecio · 2024-02-19T09:26:55Z

@kmike Regarding the option of subclassing the AllParams class and hiding fields, instead of using make_params: it seems technically doable. We can redefine fields, and it looks like we can exclude them from the schema.

However, I am not convinced by that approach, i.e. forcing every subclass to declare every field it wants to hide, as opposed to declaring the fields it wants to have. I imagine that over time the number of hidden fields would exceed the number of kept fields. And when a new (optional) field is added upstream, it would show up in the UI after a library update until it is hidden through code.
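A rough sketch of what the subclass-and-hide alternative could look like, to illustrate the maintenance concern. It assumes Pydantic 2 and uses a callable json_schema_extra in the model config to drop properties from the generated schema. That is only one possible mechanism and not necessarily the one considered here, and the hide helper, the trimmed AllParams, and both spider classes are hypothetical.

```python
from typing import Optional

from pydantic import BaseModel, ConfigDict, Field


class AllParams(BaseModel):
    # Illustrative subset of parameters, not the real definitions.
    url: str = Field(description="Start URL of the crawl.")
    geolocation: Optional[str] = Field(default=None)
    max_requests: Optional[int] = Field(default=100)


def hide(*names):
    """Return a schema hook that removes the named properties (hypothetical helper)."""

    def _strip(schema: dict) -> None:
        properties = schema.get("properties", {})
        for name in names:
            properties.pop(name, None)

    return _strip


class HidingSpiderParams(AllParams):
    # Every unwanted field has to be listed explicitly.
    model_config = ConfigDict(json_schema_extra=hide("geolocation"))


class PlainSpiderParams(AllParams):
    # Nothing hidden: every upstream field is exposed, including future ones.
    pass


hidden_view = sorted(HidingSpiderParams.model_json_schema()["properties"])
plain_view = sorted(PlainSpiderParams.model_json_schema()["properties"])
```

Here hidden_view lacks geolocation while plain_view lists all three fields. Any field later added to AllParams would appear in both views until each subclass extends its hide list, which is the opt-out behavior questioned above.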

Gallaecio · 2024-04-05T12:19:42Z

Closing in favor of #46.

Initial draft

c2f15e8

Gallaecio requested review from kmike, wRAR, BurnzZ, VMRuiz, PyExplorer and proway2 February 6, 2024 11:03

Gallaecio mentioned this pull request Feb 15, 2024

Add an input URL list parameter #38

Merged

Gallaecio mentioned this pull request Mar 20, 2024

Define standard parameters through individual mixins #46

Merged

Gallaecio closed this Apr 5, 2024
