
Meet and Beat Pandas' Support for Nested DataFrames and Arrays #18743

Closed
pbower opened this issue Sep 14, 2024 · 4 comments
Labels
enhancement New feature or an improvement of an existing feature

Comments

@pbower

pbower commented Sep 14, 2024

Description

Currently, Polars is no match for Pandas for operations involving nested DataFrames.

Pandas

import pandas as pd
#create first DataFrame
df1 = pd.DataFrame({'item': ['A', 'B', 'C', 'D', 'E'],
                    'sales': [18, 22, 19, 14, 30]})

#create second DataFrame
df2 = pd.DataFrame({'item': ['F', 'G', 'H', 'I', 'J'],
                    'sales': [10, 12, 13, 13, 19]})

#create third DataFrame
df3 = pd.DataFrame({'item': ['K', 'L', 'M', 'N', 'O'],
                    'sales': [41, 22, 28, 25, 18]})

df_all = pd.DataFrame({'dfs':[df1, df2, df3]})

df_all + df_all

Output:
[screenshot: each cell of the result contains the element-wise sum of the nested DataFrame with itself]

Polars

import polars as pl

df1 = pl.DataFrame(df1)
df2 = pl.DataFrame(df2)
df3 = pl.DataFrame(df3)

df_all = pl.DataFrame({'dfs': [df1, df2, df3]})
df_all + df_all

Output - Polars can't handle this:
[screenshot: Polars raises an error instead of producing a nested result]

This also applies to arrays:

Pandas:

import numpy as np

array1 = np.arange(6).reshape(2, 3)
array2 = np.arange(6, 12).reshape(2, 3)

df_with_arrays = pd.DataFrame({
    'arrays': [array1, array2]
})

print(df_with_arrays)

                     arrays
0    [[0, 1, 2], [3, 4, 5]]
1  [[6, 7, 8], [9, 10, 11]]

Polars:

import numpy as np
import polars as pl

array1 = np.arange(6).reshape(2, 3)
array2 = np.arange(6, 12).reshape(2, 3)

df_with_arrays = pl.DataFrame({
    'arrays': [array1, array2]
})

print(df_with_arrays)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
...
ValueError: cannot parse numpy data type dtype('O') into Polars data type
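A workaround sketch, assuming conversion to plain Python lists is acceptable: Polars can store the same data as a nested List column, it just cannot ingest the object-dtype numpy values directly.

import numpy as np
import polars as pl

array1 = np.arange(6).reshape(2, 3)
array2 = np.arange(6, 12).reshape(2, 3)

# Workaround sketch: convert the numpy arrays to plain Python lists so Polars
# infers a nested List dtype instead of receiving object-dtype numpy values.
df_with_arrays = pl.DataFrame({'arrays': [array1.tolist(), array2.tolist()]})
print(df_with_arrays)  # the 'arrays' column has dtype list[list[i64]]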
 

Discussion:

Although Polars has a struct type, this is different, as it doesn't compact the data down to one row and column. The pandas approach supports compact operations on nested data, e.g. an entire customer's dataset within one 'cell'. This is really good.

Feature request:

What if Polars could do this, but also offer the ability to add, subtract, multiply, etc., including on a per-row basis? This would open up the ability to work compactly with powerful datasets. E.g., each customer's entire sales history in column one (each customer in their respective row), their entire company contact history in column two, etc., so providing not just aggregate results, but the ability for non-aggregate results to appear under each entry. This could, for example, take a dictionary or list of operations/expressions to apply to the respective columns, and/or lazy frames.
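For illustration only, a rough sketch of what such an API might look like; the `map_nested` method and the idea of a cell holding a whole frame are invented for this sketch and are not part of the Polars API:

import polars as pl

# Purely hypothetical: `map_nested` does not exist in Polars today.
df_all = pl.DataFrame({'dfs': [df1, df2, df3]})  # imagined: each cell holds a whole frame

result = df_all.with_columns(
    # apply an expression to every nested frame, row by row
    pl.col('dfs').map_nested(lambda inner: inner.with_columns(pl.col('sales') * 2))
)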

This would be hectic, and help bring us into the matrix.

Please consider it.

Thanks,
Pete

@pbower pbower added the enhancement New feature or an improvement of an existing feature label Sep 14, 2024
@pbower pbower changed the title from "Meet and Beat Pandas' Support for Nested DataFrames" to "Meet and Beat Pandas' Support for Nested DataFrames and Arrays" Sep 14, 2024
@ritchie46
Member

> Although Polars has a struct type, this is different, as it doesn't compact the data down to one row and column. The pandas approach supports compact operations on nested data, e.g. an entire customer's dataset within one 'cell'. This is really good.

No, it isn't. Pandas just applies the Python add operation recursively to the inner objects. This is terribly expensive, not optimizable, and not eligible for multithreading or vectorization. Performance-wise this isn't much better than using Python dictionaries.

It only works because it relies on the Python interpreter.
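For concreteness, a minimal sketch (assuming object-dtype columns like the `dfs` column above, and not pandas' actual source) of what that recursive per-cell dispatch amounts to:

import pandas as pd

# Minimal sketch: each `+` is dispatched per cell through the Python interpreter,
# so every row pays for a full DataFrame.__add__ call rather than one vectorized
# kernel over the whole column.
def object_column_add(left: pd.Series, right: pd.Series) -> pd.Series:
    return pd.Series(
        [l + r for l, r in zip(left, right)],  # plain Python loop over the cells
        index=left.index,
        dtype=object,
    )

# e.g. object_column_add(df_all["dfs"], df_all["dfs"]) behaves like
# (df_all + df_all)["dfs"] from the example above.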

@MarcoGorelli
Collaborator

Closing then, but thanks for the request!

@cmdlineluser
Contributor

If you did use structs in your example:

df_all = pl.DataFrame({'dfs': map(pl.DataFrame.to_struct, [df1, df2, df3])})
# shape: (3, 1)
# ┌─────────────────────────────────┐
# │ dfs                             │
# │ ---                             │
# │ list[struct[2]]                 │
# ╞═════════════════════════════════╡
# │ [{"A",18}, {"B",22}, … {"E",30… │
# │ [{"F",10}, {"G",12}, … {"J",19… │
# │ [{"K",41}, {"L",22}, … {"O",18… │
# └─────────────────────────────────┘

But you can't add lists of structs:

df_all + df_all
# InvalidOperationError: `add` operation not supported for dtype `list[struct[2]]`

I think currently, only structs (and Arrays?) can be "added"?

df_all.explode(pl.all()) + df_all.explode(pl.all())
# shape: (15, 1)
# ┌───────────┐
# │ dfs       │
# │ ---       │
# │ struct[2] │
# ╞═══════════╡
# │ {"AA",36} │
# │ {"BB",44} │
# │ {"CC",38} │
# │ {"DD",28} │
# │ {"EE",60} │
# │ …         │
# │ {"KK",82} │
# │ {"LL",44} │
# │ {"MM",56} │
# │ {"NN",50} │
# │ {"OO",36} │
# └───────────┘
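A follow-up sketch, assuming `df_all` is the list[struct[2]] frame from above and that every nested frame has exactly 5 rows: the exploded sums can be grouped back so each original row again holds one list of structs.

import polars as pl

# Sketch only: re-nest the exploded sums. The `row` helper column and the
# hard-coded 5 rows per nested frame are assumptions for this toy example.
exploded_sum = df_all.explode(pl.all()) + df_all.explode(pl.all())

renested = (
    exploded_sum
    .with_columns((pl.int_range(pl.len()) // 5).alias("row"))
    .group_by("row", maintain_order=True)
    .agg("dfs")
    .drop("row")
)
# renested: shape (3, 1), dtype list[struct[2]]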

@ritchie46
Member

Yes, list arithmetic is close to landing: #17823
