Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: ✨ add conversion from resource properties to Pandera schema #1051

Merged
merged 23 commits into from
Feb 24, 2025

Conversation

martonvago
Copy link
Contributor

Description

This PR adds the ability to convert a set of resource properties into a Pandera schema. This schema will be able to validate a polars dataframe against itself.

This PR needs an in-depth review.

Checklist

  • Added or updated tests
  • Updated documentation
  • Ran just run-all

@martonvago martonvago self-assigned this Feb 14, 2025

If the Frictionless field type has formatting constraints that are not included
in any specialised polars data type, the mapping is to string. The formatting
constraints are then checked without polars.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The other option would have been to define custom data types for each of these, but pandera with polars doesn't support this (yet).

The polars data type the field is mapped to.

Raises:
NotImplementedError: If Sprout doesn't yet support the Frictionless field type.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I.e. I haven't written any check functions for it yet :P

dtype=get_polars_data_type(field.type),
checks=get_pandera_checks(field),
nullable=not get_nested_attr(field, "constraints.required", default=False),
coerce=True,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First coerce the value to dtype, then try to apply any validation checks.

for field in fields
}

return pap.DataFrameSchema(columns, strict=True)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

strict=True: the data will need to have exactly the same columns as defined in the schema. This is the Frictionless default, but there are customisation options in the Data Package Standard.

Comment on lines 44 to 45
case _:
return pl.String
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This catches the any field type, which is also the default field type. I chose pl.String for it because it has no validation constraints and can contain anything that polars can read.
There's also a pl.Unknown type, which looks great, but Pandera gives a warning if I use it, saying it's not supported (not that it needs to do anything in particular...).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think converting to string is a very valid argument.

@martonvago martonvago marked this pull request as ready for review February 17, 2025 08:29
@martonvago martonvago requested a review from a team as a code owner February 17, 2025 08:29
Comment on lines 44 to 45
case _:
return pl.String
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think converting to string is a very valid argument.

Comment on lines 25 to 38
case (
"string"
| "boolean"
| "datetime"
| "date"
| "time"
| "year"
| "yearmonth"
| "duration"
| "list"
| "array"
| "object"
| "geopoint"
):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, Polars has nearly all of these data types (e.g. https://docs.pola.rs/api/python/stable/reference/datatypes.html). What's the reasoning for not using them? And Frictionless has them too. Is this a limitation of Pandera?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's true, but unfortunately they are not perfect matches, so we cannot default on their validation mechanisms.

  • Frictionless object and array are explicitly JSON, but Polars Object and Array are any Python object and any Python array.
  • For the date/time related ones, they describe the same things but have different formatting constraints: Frictionless wants to validate against the XML format, Polars against something different. So we cannot safely cast e.g. a datetime value to a Polars datetime because some valid XML datetimes are invalid in Polars and some Polars datetimes are invalid in XML.
  • Even booleans are problematic, because Polars will cast a number like 123 to True, but Frictionless wants to flag that as invalid.

As all we want to do with the data frame is to check it against the resource properties, it's easiest to keep these as strings and have a check function for each. If we wanted to do operations on the values, it would be better to read them in as more specific data types.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gotcha. That's odd. I'll add a comment to the code then to explain this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll review this after we've discussed by comments above.

lwjohnst86
lwjohnst86 previously approved these changes Feb 19, 2025
Copy link
Member

@lwjohnst86 lwjohnst86 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One small comment that I'll commit, otherwise, very nice 🌠

@martonvago
Copy link
Contributor Author

martonvago commented Feb 19, 2025

Hmm, this is not under src now ✔️

Base automatically changed from feat/pandera-data-type-checks to main February 19, 2025 18:27
@lwjohnst86 lwjohnst86 dismissed their stale review February 19, 2025 18:27

The base branch was changed.

lwjohnst86
lwjohnst86 previously approved these changes Feb 21, 2025
Copy link
Member

@signekb signekb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good 👍 🌟 Just a couple of questions

Comment on lines +35 to +40
| "datetime"
| "date"
| "time"
| "year"
| "yearmonth"
| "duration"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might be my ignorance - and could be something you’ve discussed last week - but how come we don’t use something like pl.datetime for datetime and pl.date for date?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question!
Original answer

Since then, for me, the question has become more about mapping between Frictionless data/field types and parquet data types, as that is the final form of the data. More details here

@martonvago martonvago requested a review from signekb February 24, 2025 14:24
@lwjohnst86 lwjohnst86 merged commit 798bab0 into main Feb 24, 2025
3 checks passed
@lwjohnst86 lwjohnst86 deleted the feat/pandera-schema branch February 24, 2025 16:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

3 participants