Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(rust,python): make python schema_overrides information available to the rust-side inference code when initialising from records/dicts #12045

Conversation

alexander-beedie
Copy link
Collaborator

@alexander-beedie alexander-beedie commented Oct 26, 2023

Closes #12032.

  • We weren't passing information available in the user-provided "schema_overrides" param down into the Rust-side functions where the initial type inference occurs; consequently large UInt64 values would first get inferred as Float64 before being cast back to UInt64, thereby losing accuracy.

  • This PR enhances the Rust-side read_dicts and finish_from_rows methods, allowing them to integrate the additional/optional information so that they can establish the correct dtypes for overridden columns without first mediating via an inferred type.

Before

import polars as pl
recs = [
    {"id":9187643043065364490, "misc":333},
    {"id":9223671840084328467, "misc":666},
    {"id":9187643043065364505, "misc":999},
]
df = pl.from_records( recs, schema_overrides={"id":pl.UInt64} )
df
# shape: (3, 2)
# ┌─────────────────────┬──────┐
# │ id                  ┆ misc │
# │ ---                 ┆ ---  │
# │ u64                 ┆ i64  │
# ╞═════════════════════╪══════╡
# │ 9187643043065364480 ┆ 333  │   << all "id" values are wrong as
# │ 9223671840084328448 ┆ 666  │      they were cast back from f64
# │ 9187643043065364480 ┆ 999  │
# └─────────────────────┴──────┘

After

df
# shape: (3, 2)
# ┌─────────────────────┬──────┐
# │ id                  ┆ misc │
# │ ---                 ┆ ---  │
# │ u64                 ┆ i64  │
# ╞═════════════════════╪══════╡
# │ 9187643043065364490 ┆ 333  │   << all "id" values are now correct
# │ 9223671840084328467 ┆ 666  │      (avoided intermediate type/cast)
# │ 9187643043065364505 ┆ 999  │
# └─────────────────────┴──────┘

Note that if "schema_overrides" is not provided, the column is loaded as Float64, so the user should see that they need to provide an explicit override, where applicable.


🚀 As well as fixing this edge case, we should also get a little extra performance when loading from dicts where only partial type information is provided (via "schema_overrides"), as the affected columns will no longer trigger a post-init cast - they will be directly loaded as the correct type.

👀 Note: if the whole schema is given up-front (via the "schema" param) then none of these issues apply and everything is loaded correctly/efficiently.

@github-actions github-actions bot added fix Bug fix python Related to Python Polars rust Related to Rust Polars labels Oct 26, 2023
@alexander-beedie alexander-beedie force-pushed the improve-schema-inference-with-overrides branch 2 times, most recently from 7381c24 to d6cf3cc Compare October 26, 2023 10:05
py-polars/src/dataframe.rs Outdated Show resolved Hide resolved
py-polars/src/dataframe.rs Outdated Show resolved Hide resolved
py-polars/src/dataframe.rs Outdated Show resolved Hide resolved
py-polars/src/dataframe.rs Outdated Show resolved Hide resolved
@ritchie46 ritchie46 merged commit 29539fb into pola-rs:main Nov 1, 2023
33 checks passed
@alexander-beedie alexander-beedie deleted the improve-schema-inference-with-overrides branch November 1, 2023 08:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
fix Bug fix python Related to Python Polars rust Related to Rust Polars
Projects
None yet
Development

Successfully merging this pull request may close these issues.

schema_overrides loses precision on uint too large for float
2 participants