-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: ✨ add conversion from resource properties to Pandera schema #1051
Conversation
|
||
If the Frictionless field type has formatting constraints that are not included | ||
in any specialised polars data type, the mapping is to string. The formatting | ||
constraints are then checked without polars. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The other option would have been to define custom data types for each of these, but pandera with polars doesn't support this (yet).
The polars data type the field is mapped to. | ||
|
||
Raises: | ||
NotImplementedError: If Sprout doesn't yet support the Frictionless field type. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I.e. I haven't written any check functions for it yet :P
dtype=get_polars_data_type(field.type), | ||
checks=get_pandera_checks(field), | ||
nullable=not get_nested_attr(field, "constraints.required", default=False), | ||
coerce=True, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First coerce the value to dtype
, then try to apply any validation checks.
for field in fields | ||
} | ||
|
||
return pap.DataFrameSchema(columns, strict=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
strict=True
: the data will need to have exactly the same columns as defined in the schema. This is the Frictionless default, but there are customisation options in the Data Package Standard.
case _: | ||
return pl.String |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This catches the any
field type, which is also the default field type. I chose pl.String
for it because it has no validation constraints and can contain anything that polars can read.
There's also a pl.Unknown
type, which looks great, but Pandera gives a warning if I use it, saying it's not supported (not that it needs to do anything in particular...).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think converting to string is a very valid argument.
seedcase_sprout/core/sprout_checks/data/get_polars_data_type.py
Outdated
Show resolved
Hide resolved
seedcase_sprout/core/sprout_checks/data/get_polars_data_type.py
Outdated
Show resolved
Hide resolved
case _: | ||
return pl.String |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think converting to string is a very valid argument.
case ( | ||
"string" | ||
| "boolean" | ||
| "datetime" | ||
| "date" | ||
| "time" | ||
| "year" | ||
| "yearmonth" | ||
| "duration" | ||
| "list" | ||
| "array" | ||
| "object" | ||
| "geopoint" | ||
): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, Polars has nearly all of these data types (e.g. https://docs.pola.rs/api/python/stable/reference/datatypes.html). What's the reasoning for not using them? And Frictionless has them too. Is this a limitation of Pandera?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's true, but unfortunately they are not perfect matches, so we cannot default on their validation mechanisms.
- Frictionless
object
andarray
are explicitly JSON, but PolarsObject
andArray
are any Python object and any Python array. - For the date/time related ones, they describe the same things but have different formatting constraints: Frictionless wants to validate against the XML format, Polars against something different. So we cannot safely cast e.g. a datetime value to a Polars datetime because some valid XML datetimes are invalid in Polars and some Polars datetimes are invalid in XML.
- Even booleans are problematic, because Polars will cast a number like 123 to True, but Frictionless wants to flag that as invalid.
As all we want to do with the data frame is to check it against the resource properties, it's easiest to keep these as strings and have a check function for each. If we wanted to do operations on the values, it would be better to read them in as more specific data types.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gotcha. That's odd. I'll add a comment to the code then to explain this.
seedcase_sprout/core/sprout_checks/data/get_polars_data_type.py
Outdated
Show resolved
Hide resolved
seedcase_sprout/core/sprout_checks/data/get_polars_data_type.py
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll review this after we've discussed by comments above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One small comment that I'll commit, otherwise, very nice 🌠
…prout into feat/pandera-data-type-checks
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good 👍 🌟 Just a couple of questions
| "datetime" | ||
| "date" | ||
| "time" | ||
| "year" | ||
| "yearmonth" | ||
| "duration" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This might be my ignorance - and could be something you’ve discussed last week - but how come we don’t use something like pl.datetime
for datetime
and pl.date
for date
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question!
Original answer
Since then, for me, the question has become more about mapping between Frictionless data/field types and parquet data types, as that is the final form of the data. More details here
Description
This PR adds the ability to convert a set of resource properties into a Pandera schema. This schema will be able to validate a polars dataframe against itself.
This PR needs an in-depth review.
Checklist
just run-all