feat: ✨ add conversion from resource properties to Pandera schema #1051

martonvago · 2025-02-14T19:56:17Z

Description

This PR adds the ability to convert a set of resource properties into a Pandera schema. The resulting schema can then be used to validate a Polars data frame.

This PR needs an in-depth review.

Checklist

Added or updated tests
Updated documentation
Ran just run-all

martonvago · 2025-02-14T19:59:21Z

seedcase_sprout/core/sprout_checks/data/get_polars_data_type.py

+
+    If the Frictionless field type has formatting constraints that are not included
+    in any specialised polars data type, the mapping is to string. The formatting
+    constraints are then checked without polars.


The other option would have been to define custom data types for each of these, but pandera with polars doesn't support this (yet).

martonvago · 2025-02-14T20:00:36Z

seedcase_sprout/core/sprout_checks/data/get_polars_data_type.py

+        The polars data type the field is mapped to.
+
+    Raises:
+        NotImplementedError: If Sprout doesn't yet support the Frictionless field type.


I.e. I haven't written any check functions for it yet :P

martonvago · 2025-02-14T20:07:46Z

seedcase_sprout/core/sprout_checks/data/resource_properties_to_pandera_schema.py

+            dtype=get_polars_data_type(field.type),
+            checks=get_pandera_checks(field),
+            nullable=not get_nested_attr(field, "constraints.required", default=False),
+            coerce=True,


First coerce the value to dtype, then try to apply any validation checks.
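
To illustrate that order of operations, here is a minimal plain-Python sketch. It is not Pandera's implementation, only a model of the behaviour described above: coercion happens first, and the checks only see the coerced value. The names validate_value and is_non_negative are hypothetical and do not exist in Sprout or Pandera.

```python
from typing import Any, Callable, Optional


def validate_value(
    raw: Optional[str],
    coerce: Callable[[str], Any],
    checks: list[Callable[[Any], bool]],
    nullable: bool,
) -> list[str]:
    """Return the error messages for one raw value (empty list if valid)."""
    if raw is None:
        # Mirrors nullable=not required: a missing value is only an
        # error when the field is required.
        return [] if nullable else ["value is required"]
    try:
        value = coerce(raw)
    except (TypeError, ValueError):
        # Coercion failed, so the checks are never run.
        return [f"cannot coerce {raw!r}"]
    # Checks run on the coerced value, not on the raw input.
    return [
        f"check {check.__name__} failed for {value!r}"
        for check in checks
        if not check(value)
    ]


def is_non_negative(value: int) -> bool:
    """Hypothetical check used only for this illustration."""
    return value >= 0


errors_ok = validate_value("5", int, [is_non_negative], nullable=False)
errors_coerce = validate_value("abc", int, [is_non_negative], nullable=False)
errors_check = validate_value("-1", int, [is_non_negative], nullable=False)
```

In this sketch a value that cannot be coerced reports only the coercion failure, while a value that coerces cleanly can still fail a check.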

martonvago · 2025-02-14T20:10:49Z

seedcase_sprout/core/sprout_checks/data/resource_properties_to_pandera_schema.py

+        for field in fields
+    }
+
+    return pap.DataFrameSchema(columns, strict=True)


strict=True: the data will need to have exactly the same columns as defined in the schema. This is the Frictionless default, but there are customisation options in the Data Package Standard.
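
As a rough illustration of what "exactly the same columns" means, the following plain-Python sketch compares the column names in the data with those in the schema. It is not how Pandera implements strict=True; the function names are hypothetical, and the sketch ignores column order.

```python
def find_column_mismatches(
    schema_columns: list[str], data_columns: list[str]
) -> dict[str, list[str]]:
    """Compare the columns in the data with the columns in the schema."""
    return {
        # Declared in the schema but absent from the data.
        "missing": [c for c in schema_columns if c not in data_columns],
        # Present in the data but not declared in the schema.
        "unexpected": [c for c in data_columns if c not in schema_columns],
    }


def columns_match_strictly(
    schema_columns: list[str], data_columns: list[str]
) -> bool:
    """Return True only if the data has no missing or unexpected columns."""
    mismatches = find_column_mismatches(schema_columns, data_columns)
    return not mismatches["missing"] and not mismatches["unexpected"]


mismatches = find_column_mismatches(["id", "name"], ["id", "age"])
```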

martonvago · 2025-02-15T15:32:27Z

seedcase_sprout/core/sprout_checks/data/get_polars_data_type.py

+        case _:
+            return pl.String


This catches the "any" field type, which is also the default field type. I chose pl.String for it because it has no validation constraints and can contain anything that Polars can read.
There's also a pl.Unknown type, which looks great, but Pandera gives a warning if I use it, saying it's not supported (not that it needs to do anything in particular...).

lwjohnst86 · 2025-02-17T11:55:15Z

seedcase_sprout/core/sprout_checks/data/get_polars_data_type.py

+        case _:
+            return pl.String


I think converting to string is a very valid argument.

lwjohnst86 · 2025-02-17T11:57:48Z

seedcase_sprout/core/sprout_checks/data/get_polars_data_type.py

+        case (
+            "string"
+            | "boolean"
+            | "datetime"
+            | "date"
+            | "time"
+            | "year"
+            | "yearmonth"
+            | "duration"
+            | "list"
+            | "array"
+            | "object"
+            | "geopoint"
+        ):


Hmm, Polars has nearly all of these data types (e.g. https://docs.pola.rs/api/python/stable/reference/datatypes.html). What's the reasoning for not using them? And Frictionless has them too. Is this a limitation of Pandera?

That's true, but unfortunately they are not perfect matches, so we cannot fall back on their validation mechanisms.

Frictionless object and array are explicitly JSON, but Polars Object and Array are any Python object and any Python array.

For the date/time related ones, they describe the same things but have different formatting constraints: Frictionless wants to validate against the XML format, Polars against something different. So we cannot safely cast e.g. a datetime value to a Polars datetime because some valid XML datetimes are invalid in Polars and some Polars datetimes are invalid in XML.

Even booleans are problematic, because Polars will cast a number like 123 to True, but Frictionless wants to flag that as invalid.

As all we want to do with the data frame is to check it against the resource properties, it's easiest to keep these as strings and have a check function for each. If we wanted to do operations on the values, it would be better to read them in as more specific data types.
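
A minimal sketch of what one such check function could look like for booleans. The value stays a string and is compared against the allowed literals, so 123 is flagged instead of being cast to True. The literals below are assumed to be the default trueValues and falseValues from the Table Schema specification, and the name is_frictionless_boolean is hypothetical, not the function used in Sprout.

```python
# Default boolean literals, assumed from the Table Schema specification.
TRUE_VALUES = {"true", "True", "TRUE", "1"}
FALSE_VALUES = {"false", "False", "FALSE", "0"}


def is_frictionless_boolean(value: str) -> bool:
    """Check that a string is one of the allowed boolean literals."""
    return value in TRUE_VALUES | FALSE_VALUES


values = ["true", "0", "123", "yes"]
# Collect the values that would be flagged as invalid booleans.
invalid = [v for v in values if not is_frictionless_boolean(v)]
```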

Gotcha. That's odd. I'll add a comment to the code then to explain this.

seedcase_sprout/core/sprout_checks/data/get_polars_data_type.py

lwjohnst86 · 2025-02-17T13:08:41Z

tests/core/sprout_checks/data/test_resource_properties_to_pandera_schema.py

I'll review this after we've discussed my comments above.

lwjohnst86

One small comment that I'll commit, otherwise, very nice 🌠

seedcase_sprout/core/sprout_checks/get_polars_data_type.py

martonvago · 2025-02-19T13:20:24Z

~~Hmm, this is not under src now~~ ✔️

The base branch was changed.

signekb

Looks good 👍 🌟 Just a couple of questions

signekb · 2025-02-24T09:44:16Z

src/seedcase_sprout/core/sprout_checks/get_polars_data_type.py

+            | "datetime"
+            | "date"
+            | "time"
+            | "year"
+            | "yearmonth"
+            | "duration"


This might be my ignorance, and could be something you’ve discussed last week, but how come we don’t use something like pl.Datetime for datetime and pl.Date for date?

Good question!
Original answer

Since then, for me, the question has become more about mapping between Frictionless data/field types and parquet data types, as that is the final form of the data. More details here

src/seedcase_sprout/core/sprout_checks/resource_properties_to_pandera_schema.py

tests/core/sprout_checks/test_resource_properties_to_pandera_schema.py

martonvago added 7 commits February 14, 2025 13:19

feat: ✨ add pandera checks

9c8472b

test: ✅ add test for unknown XML type

53485fe

test: ✅ add geopoint test

9fc5c74

fix: 🐛 fix import type

1e5e766

refactor: ♻️ use xmlschema's base64 validator

102e3d8

feat: ✨ add get_polars_data_type

20c5efd

feat: ✨ add resource_properties_to_pandera_schema

516e700

martonvago self-assigned this Feb 14, 2025

martonvago commented Feb 14, 2025

refactor: ♻️ change any data type to string

91f2587

martonvago commented Feb 15, 2025

martonvago marked this pull request as ready for review February 17, 2025 08:29

martonvago requested a review from a team as a code owner February 17, 2025 08:29

martonvago added 3 commits February 17, 2025 12:21

refactor: ♻️ remove data folder

ab5c3a5

refactor: ♻️ rename file

f7c5c42

refactor: ♻️ rename functions

5130710

lwjohnst86 reviewed Feb 17, 2025

martonvago and others added 4 commits February 17, 2025 17:16

refactor: ♻️ update error messages

483f107

Merge branch 'feat/pandera-data-type-checks' into feat/pandera-schema

b0f8b0b

docs: 📝 capitalise Polars

1c10299

docs: ✏️ consistency with wording in docstrings

da18351

lwjohnst86 previously approved these changes Feb 19, 2025

docs: ✏️ add explanation on casing of types to string

581768b

martonvago commented Feb 19, 2025

martonvago and others added 3 commits February 19, 2025 10:35

tweak comment

6fb0fc8

Merge branch 'main' of https://github.com/seedcase-project/seedcase-s…

a2588d5

…prout into feat/pandera-data-type-checks

Merge branch 'main' into feat/pandera-data-type-checks

483e835

Merge branch 'feat/pandera-data-type-checks' into feat/pandera-schema

907dcaa

refactor: ♻️ move files under src

c4c0304

Base automatically changed from feat/pandera-data-type-checks to main February 19, 2025 18:27

Merge remote-tracking branch 'origin/main' into feat/pandera-schema

2661d75

lwjohnst86 previously approved these changes Feb 21, 2025

signekb requested changes Feb 24, 2025

refactor: ♻️ review markups

5f70a47

martonvago dismissed lwjohnst86’s stale review via 5f70a47 February 24, 2025 14:23

martonvago requested a review from signekb February 24, 2025 14:24

lwjohnst86 merged commit 798bab0 into main Feb 24, 2025
3 checks passed

lwjohnst86 deleted the feat/pandera-schema branch February 24, 2025 16:08
