
Meet and Beat Pandas' Support for Nested DataFrames and Arrays #18743

Closed
pbower opened this issue Sep 14, 2024 · 4 comments
Labels
enhancement New feature or an improvement of an existing feature

Comments

@pbower

pbower commented Sep 14, 2024

Description

Currently, Polars is no match for Pandas for operations involving nested DataFrames.

Pandas

import pandas as pd
#create first DataFrame
df1 = pd.DataFrame({'item': ['A', 'B', 'C', 'D', 'E'],
                    'sales': [18, 22, 19, 14, 30]})

#create second DataFrame
df2 = pd.DataFrame({'item': ['F', 'G', 'H', 'I', 'J'],
                    'sales': [10, 12, 13, 13, 19]})

#create third DataFrame
df3 = pd.DataFrame({'item': ['K', 'L', 'M', 'N', 'O'],
                    'sales': [41, 22, 28, 25, 18]})

df_all = pd.DataFrame({'dfs':[df1, df2, df3]})

df_all + df_all

Output:
[screenshot: each cell of the result contains the element-wise sum of the nested DataFrame with itself]

Polars

import polars as pl

df1 = pl.DataFrame(df1)
df2 = pl.DataFrame(df2)
df3 = pl.DataFrame(df3)

df_all = pl.DataFrame({'dfs': [df1, df2, df3]})
df_all + df_all

Output - Polars can't handle this:
[screenshot: Polars raises an error instead of producing a nested result]

This also applies to arrays:

Pandas:

import numpy as np

array1 = np.arange(6).reshape(2, 3)
array2 = np.arange(6, 12).reshape(2, 3)

df_with_arrays = pd.DataFrame({
    'arrays': [array1, array2]
})

print(df_with_arrays)

                     arrays
0    [[0, 1, 2], [3, 4, 5]]
1  [[6, 7, 8], [9, 10, 11]]

Polars:

import numpy as np
import polars as pl

array1 = np.arange(6).reshape(2, 3)
array2 = np.arange(6, 12).reshape(2, 3)

df_with_arrays = pl.DataFrame({
    'arrays': [array1, array2]
})

print(df_with_arrays)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
...
ValueError: cannot parse numpy data type dtype('O') into Polars data type
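A workaround sketch, assuming conversion to plain Python lists is acceptable: Polars can store the same data as a nested List column, it just cannot ingest the object-dtype numpy values directly.

import numpy as np
import polars as pl

array1 = np.arange(6).reshape(2, 3)
array2 = np.arange(6, 12).reshape(2, 3)

# Workaround sketch: convert the numpy arrays to plain Python lists so Polars
# infers a nested List dtype instead of receiving object-dtype numpy values.
df_with_arrays = pl.DataFrame({'arrays': [array1.tolist(), array2.tolist()]})
print(df_with_arrays)  # the 'arrays' column has dtype list[list[i64]]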
 

Discussion:

Although Polars has a struct type, this is different, as it doesn't compact the data down to one row and column. The pandas approach supports compact operations on nested data, e.g. an entire customer's dataset within one 'cell'. This is really good.

Feature request:

What if Polars could do this, but also offer the ability to add, subtract, multiply, etc., including on a per-row basis? This would open up the ability to work compactly with powerful datasets. E.g., each customer's entire sales history in column one (each customer in their respective row), their entire company contact history in column two, etc., so providing not just aggregate results, but the ability for non-aggregate results to appear under each entry. This could, for example, take a dictionary or list of operations/expressions to apply to the respective columns, and/or lazy frames.
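For illustration only, a rough sketch of what such an API might look like; the `map_nested` method and the idea of a cell holding a whole frame are invented for this sketch and are not part of the Polars API:

import polars as pl

# Purely hypothetical: `map_nested` does not exist in Polars today.
df_all = pl.DataFrame({'dfs': [df1, df2, df3]})  # imagined: each cell holds a whole frame

result = df_all.with_columns(
    # apply an expression to every nested frame, row by row
    pl.col('dfs').map_nested(lambda inner: inner.with_columns(pl.col('sales') * 2))
)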

This would be hectic, and help bring us into the matrix.

Please consider it.

Thanks,
Pete

@pbower pbower added the enhancement New feature or an improvement of an existing feature label Sep 14, 2024
@pbower pbower changed the title from "Meet and Beat Pandas' Support for Nested DataFrames" to "Meet and Beat Pandas' Support for Nested DataFrames and Arrays" Sep 14, 2024
@ritchie46
Member

> Although Polars has a struct type, this is different, as it doesn't compact the data down to one row and column. The pandas approach supports compact operations on nested data, e.g. an entire customer's dataset within one 'cell'. This is really good.

No, it isn't. Pandas just applies the Python add operation recursively to the inner objects. This is terribly expensive, not optimizable, and not eligible for multithreading or vectorization. Performance-wise this isn't much better than using Python dictionaries.

It only works because it relies on the Python interpreter.
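For concreteness, a minimal sketch (assuming object-dtype columns like the `dfs` column above, and not pandas' actual source) of what that recursive per-cell dispatch amounts to:

import pandas as pd

# Minimal sketch: each `+` is dispatched per cell through the Python interpreter,
# so every row pays for a full DataFrame.__add__ call rather than one vectorized
# kernel over the whole column.
def object_column_add(left: pd.Series, right: pd.Series) -> pd.Series:
    return pd.Series(
        [l + r for l, r in zip(left, right)],  # plain Python loop over the cells
        index=left.index,
        dtype=object,
    )

# e.g. object_column_add(df_all["dfs"], df_all["dfs"]) behaves like
# (df_all + df_all)["dfs"] from the example above.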

@MarcoGorelli
Collaborator

Closing then, but thanks for the request!

@cmdlineluser
Contributor

If you did use structs in your example:

df_all = pl.DataFrame({'dfs': map(pl.DataFrame.to_struct, [df1, df2, df3])})
# shape: (3, 1)
# ┌─────────────────────────────────┐
# │ dfs                             │
# │ ---                             │
# │ list[struct[2]]                 │
# ╞═════════════════════════════════╡
# │ [{"A",18}, {"B",22}, … {"E",30… │
# │ [{"F",10}, {"G",12}, … {"J",19… │
# │ [{"K",41}, {"L",22}, … {"O",18… │
# └─────────────────────────────────┘

But you can't add lists of structs:

df_all + df_all
# InvalidOperationError: `add` operation not supported for dtype `list[struct[2]]`

I think currently, only structs (and Arrays?) can be "added"?

df_all.explode(pl.all()) + df_all.explode(pl.all())
# shape: (15, 1)
# ┌───────────┐
# │ dfs       │
# │ ---       │
# │ struct[2] │
# ╞═══════════╡
# │ {"AA",36} │
# │ {"BB",44} │
# │ {"CC",38} │
# │ {"DD",28} │
# │ {"EE",60} │
# │ …         │
# │ {"KK",82} │
# │ {"LL",44} │
# │ {"MM",56} │
# │ {"NN",50} │
# │ {"OO",36} │
# └───────────┘
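A follow-up sketch, assuming `df_all` is the list[struct[2]] frame from above and that every nested frame has exactly 5 rows: the exploded sums can be grouped back so each original row again holds one list of structs.

import polars as pl

# Sketch only: re-nest the exploded sums. The `row` helper column and the
# hard-coded 5 rows per nested frame are assumptions for this toy example.
exploded_sum = df_all.explode(pl.all()) + df_all.explode(pl.all())

renested = (
    exploded_sum
    .with_columns((pl.int_range(pl.len()) // 5).alias("row"))
    .group_by("row", maintain_order=True)
    .agg("dfs")
    .drop("row")
)
# renested: shape (3, 1), dtype list[struct[2]]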

@ritchie46
Member

Yes, list arithmetic is close to landing: #17823
