Meet and Beat Pandas' Support for Nested DataFrames and Arrays #18743
Comments
No it isn't. Pandas just applies the Python add operation recursively to the inner objects. This is terribly expensive, not optimizable, and not eligible for multithreading or vectorization. Performance-wise this isn't much better than using Python dictionaries. It only works because it relies on the Python interpreter.
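To make the point above concrete, here is a minimal pure-Python sketch of what adding two object columns amounts to: an interpreted loop that calls `+` on each cell, which in turn calls `+` on each inner value. `NestedFrame`, `add_object_columns`, and the column names `name` and `value` are hypothetical stand-ins made up for illustration. This is not pandas' actual code path, only the shape of the work it has to do.

```python
import operator


class NestedFrame:
    """Hypothetical stand-in for a DataFrame stored inside a single cell."""

    def __init__(self, columns):
        # dict mapping column name -> list of values
        self.columns = columns

    def __add__(self, other):
        # Per column, add value by value; each `+` is a Python-level dispatch.
        return NestedFrame({
            name: [operator.add(a, b) for a, b in zip(values, other.columns[name])]
            for name, values in self.columns.items()
        })


def add_object_columns(left, right):
    """Add two object columns cell by cell, as an interpreted loop."""
    return [operator.add(a, b) for a, b in zip(left, right)]


df1 = NestedFrame({"name": ["A", "B"], "value": [18, 22]})
df2 = NestedFrame({"name": ["F", "G"], "value": [10, 12]})

column = [df1, df2]  # one nested frame per "cell"
result = add_object_columns(column, column)
print(result[0].columns)  # {'name': ['AA', 'BB'], 'value': [36, 44]}
```

Every addition here goes through the interpreter one object at a time, so there is no contiguous typed buffer for an engine to vectorize or split across threads.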
closing then, but thanks for the request!
If you did use structs in your example:
df_all = pl.DataFrame({'dfs': map(pl.DataFrame.to_struct, [df1, df2, df3])})
# shape: (3, 1)
# ┌─────────────────────────────────┐
# │ dfs                             │
# │ ---                             │
# │ list[struct[2]]                 │
# ╞═════════════════════════════════╡
# │ [{"A",18}, {"B",22}, … {"E",30… │
# │ [{"F",10}, {"G",12}, … {"J",19… │
# │ [{"K",41}, {"L",22}, … {"O",18… │
# └─────────────────────────────────┘
But you can't add lists of structs:
df_all + df_all
# InvalidOperationError: `add` operation not supported for dtype `list[struct[2]]`
I think currently, only structs (and Arrays?) can be "added"?
df_all.explode(pl.all()) + df_all.explode(pl.all())
# shape: (15, 1)
# ┌───────────┐
# │ dfs       │
# │ ---       │
# │ struct[2] │
# ╞═══════════╡
# │ {"AA",36} │
# │ {"BB",44} │
# │ {"CC",38} │
# │ {"DD",28} │
# │ {"EE",60} │
# │ …         │
# │ {"KK",82} │
# │ {"LL",44} │
# │ {"MM",56} │
# │ {"NN",50} │
# │ {"OO",36} │
# └───────────┘
Yes, list arithmetic is close to landing: #17823
Description
Currently Polars is no match for Pandas for operations involving nested dataframes.
Pandas
Output:
Polars
Output - polars can't handle this:
This also applies to arrays:
Pandas:
Polars:
Discussion:
Although Polars has a struct type, this is different, as it doesn't compact data down to one row and column. The pandas option supports compact operations on nested data, e.g. an entire customer's dataset, within one 'cell'. This is really good.
Feature request:
What if Polars could do this, and also offer the ability to add, subtract, multiply, etc., including on a per-row basis? This would open up the ability to work compactly with powerful datasets. E.g., each customer's entire sales record history in column one (each customer in their own row), their entire company contact history in column two, etc., so basically providing not just the aggregate results, but the ability for non-aggregate results to appear under each entry. This could, for example, include a dictionary or list of operations/expressions to apply to the respective columns, and/or lazy frames.
This would be hectic, and help bring us into the matrix.
Please consider it.
Thanks,
Pete