Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(rust, python): prevent re-ordering of dict keys inside .apply #10172

Merged
merged 1 commit into from
Jul 31, 2023
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
10 changes: 8 additions & 2 deletions py-polars/src/apply/mod.rs
Original file line number Diff line number Diff line change
Expand Up @@ -54,6 +54,10 @@ fn iterator_to_struct<'a>(
// ]
let mut struct_fields: BTreeMap<&str, Vec<AnyValue>> = BTreeMap::new();

// as a BTreeMap sorts its keys, we also need to track the original
Copy link
Member

@ritchie46 ritchie46 Jul 31, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can use PlIndexMap if insertion order needs to be maintained.

I think the other changes can be reverted after that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, thanks @ritchie46 - my searching did find https://doc.rust-lang.org/stable/nightly-rustc/rustc_data_structures/sorted_map/struct.SortedIndexMultiMap.html but not PlIndexMap

Trying PlIndexMap gives me this back from the compiler, I'm not sure how to proceed.

error[E0599]: the method `par_iter` exists for struct `IndexMap<&str, Vec<AnyValue<'_>>, RandomState>`, but its trait bounds were not satisfied
   --> src/apply/mod.rs:114:14
    |
113 | /             struct_fields
114 | |             .par_iter()
    | |_____________-^^^^^^^^
    |
   ::: /Users/user/.cargo/registry/src/index.crates.io-6f17d22bba15001f/indexmap-2.0.0/src/map.rs:81:1
    |
81  |   pub struct IndexMap<K, V, S = RandomState> {
    |   ------------------------------------------ doesn't satisfy `_: IntoParallelRefIterator<'_>`
    |
    = note: the following trait bounds were not satisfied:
            `&indexmap::map::IndexMap<&str, Vec<polars_rs::prelude::AnyValue<'_>>, ahash::RandomState>: polars_rs::export::rayon::iter::IntoParallelIterator`
            which is required by `indexmap::map::IndexMap<&str, Vec<polars_rs::prelude::AnyValue<'_>>, ahash::RandomState>: polars_rs::export::rayon::iter::IntoParallelRefIterator<'_>`

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah right. Then this solution is fine. Thank you! 👍

// order of the field names
let mut field_names_ordered = Vec::with_capacity(flds.len());

// use the first value and the known null count to initialize the buffers
// if we find a new key later on, we make a new entry in the BTree
for (value, fld) in vals.into_iter().zip(flds) {
Expand All @@ -62,6 +66,7 @@ fn iterator_to_struct<'a>(
buf.push(AnyValue::Null);
}
buf.push(value);
field_names_ordered.push(fld.name() as &str);
struct_fields.insert(fld.name(), buf);
}

Expand All @@ -86,6 +91,7 @@ fn iterator_to_struct<'a>(
for (key, val) in dict.iter() {
let key = key.str().unwrap().to_str().unwrap();
let buf = struct_fields.entry(key).or_insert_with(|| {
field_names_ordered.push(key);
let mut buf = Vec::with_capacity(capacity);
for _ in 0..(init_null_count + current_len) {
buf.push(AnyValue::Null);
Expand All @@ -110,9 +116,9 @@ fn iterator_to_struct<'a>(
}

let fields = POOL.install(|| {
struct_fields
field_names_ordered
.par_iter()
.map(|(name, avs)| Series::new(name, avs))
.map(|name| Series::new(name, struct_fields.get(name).unwrap()))
.collect::<Vec<_>>()
});

Expand Down
5 changes: 5 additions & 0 deletions py-polars/tests/unit/operations/test_apply.py
Original file line number Diff line number Diff line change
Expand Up @@ -381,3 +381,8 @@ def test_apply_shifted_chunks() -> None:
"column_0": ["test", "test123", "tests"],
"column_1": [None, "test", "test123"],
}


def test_apply_dict_order_10128() -> None:
df = pl.select(pl.lit("").apply(lambda x: {"c": 1, "b": 2, "a": 3}))
assert df.to_dict(False) == {"literal": [{"c": 1, "b": 2, "a": 3}]}