fix(rust,python): make python schema_overrides information available to the rust-side inference code when initialising from records/dicts #12045
Closes #12032.
We weren't passing the information available in the user-provided "schema_overrides" param down into the Rust-side functions where the initial type inference occurs. Consequently, large UInt64 values were first inferred as Float64 and then cast back to UInt64, losing precision along the way.
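To illustrate why going through Float64 is lossy, here is a minimal pure-Python sketch (it does not call Polars). The helper name `roundtrip_via_float64` is hypothetical and only simulates the "infer as float, then cast back to integer" path described above, relying on Python's `float` being an IEEE 754 double.

```python
def roundtrip_via_float64(value: int) -> int:
    """Simulate inferring an integer as a 64-bit float, then casting it back."""
    return int(float(value))


# A 64-bit float has a 53-bit significand, so integers up to 2**53 survive intact.
exact = 2**53
# The next integer up cannot be represented and gets rounded to a neighbour.
inexact = 2**53 + 1

print(roundtrip_via_float64(exact) == exact)      # True
print(roundtrip_via_float64(inexact) == inexact)  # False
print(inexact - roundtrip_via_float64(inexact))   # 1
```

Any UInt64 value above 2**53 is at risk of this kind of silent change, which is why establishing the target dtype directly matters.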
This PR enhances the Rust-side `read_dicts` and `finish_from_rows` methods, allowing them to integrate the additional/optional information so that they can establish the correct dtypes for overridden columns without first mediating via an inferred type.

Before
After
Note that if "schema_overrides" is not provided, the column is loaded as Float64, so the user should see that they need to provide an explicit override, where applicable.
🚀 As well as fixing this edge case, we should also get a little extra performance when loading from dicts where only partial type information is provided (via "schema_overrides"): the affected columns will no longer trigger a post-init cast, as they will be loaded directly as the correct type.
👀 Note: if the whole schema is given up-front (via the "schema" param) then none of these issues apply and everything is loaded correctly/efficiently.