fix(rust,python): make python `schema_overrides` information available to the rust-side inference code when initialising from records/dicts #12045

alexander-beedie · 2023-10-26T08:37:30Z

Closes #12032.

We weren't passing information available in the user-provided "schema_overrides" param down into the Rust-side functions where the initial type inference occurs; consequently large UInt64 values would first get inferred as Float64 before being cast back to UInt64, thereby losing accuracy.
This PR enhances the Rust-side read_dicts and finish_from_rows methods, allowing them to integrate the additional/optional information so that they can establish the correct dtypes for overridden columns without first mediating via an inferred type.

Before

import polars as pl
recs = [
    {"id":9187643043065364490, "misc":333},
    {"id":9223671840084328467, "misc":666},
    {"id":9187643043065364505, "misc":999},
]
df = pl.from_records( recs, schema_overrides={"id":pl.UInt64} )

df
# shape: (3, 2)
# ┌─────────────────────┬──────┐
# │ id                  ┆ misc │
# │ ---                 ┆ ---  │
# │ u64                 ┆ i64  │
# ╞═════════════════════╪══════╡
# │ 9187643043065364480 ┆ 333  │   << all "id" values are wrong as
# │ 9223671840084328448 ┆ 666  │      they were cast back from f64
# │ 9187643043065364480 ┆ 999  │
# └─────────────────────┴──────┘

After

df
# shape: (3, 2)
# ┌─────────────────────┬──────┐
# │ id                  ┆ misc │
# │ ---                 ┆ ---  │
# │ u64                 ┆ i64  │
# ╞═════════════════════╪══════╡
# │ 9187643043065364490 ┆ 333  │   << all "id" values are now correct
# │ 9223671840084328467 ┆ 666  │      (avoided intermediate type/cast)
# │ 9187643043065364505 ┆ 999  │
# └─────────────────────┴──────┘

Note that if "schema_overrides" is not provided, the column is loaded as Float64, so the user should see that they need to provide an explicit override, where applicable.

🚀 As well as fixing this edge case, we should also get a little extra performance when loading from dicts where only partial type information is provided (via "schema_overrides"), as the affected columns will no longer trigger a post-init cast - they will be directly loaded as the correct type.

👀 Note: if the whole schema is given up-front (via the "schema" param) then none of these issues apply and everything is loaded correctly/efficiently.

py-polars/src/dataframe.rs

…e to the rust-side inference code when initialising from records/dicts

alexander-beedie requested review from ritchie46 and stinodego as code owners October 26, 2023 08:37

github-actions bot added fix Bug fix python Related to Python Polars rust Related to Rust Polars labels Oct 26, 2023

alexander-beedie force-pushed the improve-schema-inference-with-overrides branch 2 times, most recently from 7381c24 to d6cf3cc Compare October 26, 2023 10:05

ritchie46 requested changes Oct 26, 2023

View reviewed changes

py-polars/src/dataframe.rs Outdated Show resolved Hide resolved

py-polars/src/dataframe.rs Outdated Show resolved Hide resolved

py-polars/src/dataframe.rs Outdated Show resolved Hide resolved

ritchie46 requested changes Oct 27, 2023

View reviewed changes

py-polars/src/dataframe.rs Outdated Show resolved Hide resolved

alexander-beedie and others added 3 commits October 27, 2023 13:40

fix(rust,python): make python schema_overrides information availabl…

a03bb4c

…e to the rust-side inference code when initialising from records/dicts

improve some naming/typing

f3e3a26

avoid realloc; mutable schema dtype iter/assign

e0f2537

alexander-beedie force-pushed the improve-schema-inference-with-overrides branch from a6258d2 to e0f2537 Compare October 27, 2023 14:21

alexander-beedie requested a review from orlp as a code owner October 27, 2023 14:21

alexander-beedie requested a review from ritchie46 October 27, 2023 16:40

ritchie46 approved these changes Nov 1, 2023

View reviewed changes

ritchie46 merged commit 29539fb into pola-rs:main Nov 1, 2023
33 checks passed

alexander-beedie deleted the improve-schema-inference-with-overrides branch November 1, 2023 08:39

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(rust,python): make python `schema_overrides` information available to the rust-side inference code when initialising from records/dicts #12045

fix(rust,python): make python `schema_overrides` information available to the rust-side inference code when initialising from records/dicts #12045

alexander-beedie commented Oct 26, 2023 •

edited

Loading

fix(rust,python): make python schema_overrides information available to the rust-side inference code when initialising from records/dicts #12045

fix(rust,python): make python schema_overrides information available to the rust-side inference code when initialising from records/dicts #12045

Conversation

alexander-beedie commented Oct 26, 2023 • edited Loading

Before

After

fix(rust,python): make python `schema_overrides` information available to the rust-side inference code when initialising from records/dicts #12045

fix(rust,python): make python `schema_overrides` information available to the rust-side inference code when initialising from records/dicts #12045

alexander-beedie commented Oct 26, 2023 •

edited

Loading