Add a decimals int = 2 parameter in describe #10188

Closed
stevenlis opened this issue Jul 31, 2023 · 6 comments
Labels
enhancement New feature or an improvement of an existing feature

Comments

@stevenlis

Problem description

import polars as pl

df = pl.DataFrame(
    {"A": [1.0, 2.0, 3.0, 4.0, 6.0, 8.0],
     "B": [1.4, 2.0, 3.3, 4.6, 6.0, 8.0],
     "C": ['a', 'b', 'c', 'c', 'a', 'd']}
)
df.describe()

The mean and standard deviation results always have many decimal places.

shape: (9, 4)
┌────────────┬──────────┬──────────┬──────┐
│ describe   ┆ A        ┆ B        ┆ C    │
│ ---        ┆ ---      ┆ ---      ┆ ---  │
│ str        ┆ f64      ┆ f64      ┆ str  │
╞════════════╪══════════╪══════════╪══════╡
│ count      ┆ 6.0      ┆ 6.0      ┆ 6    │
│ null_count ┆ 0.0      ┆ 0.0      ┆ 0    │
│ mean       ┆ 4.0      ┆ 4.216667 ┆ null │
│ std        ┆ 2.607681 ┆ 2.503131 ┆ null │
│ min        ┆ 1.0      ┆ 1.4      ┆ a    │
│ max        ┆ 8.0      ┆ 8.0      ┆ d    │
│ median     ┆ 3.5      ┆ 3.95     ┆ null │
│ 25%        ┆ 2.0      ┆ 2.0      ┆ null │
│ 75%        ┆ 6.0      ┆ 6.0      ┆ null │
└────────────┴──────────┴──────────┴──────┘

This could be fixed with the existing Expr.round, so I don't think there will be any performance drawbacks.

df.describe().with_columns(
    pl.col(pl.Float64).round(2)
)

shape: (9, 4)
┌────────────┬──────┬──────┬──────┐
│ describe   ┆ A    ┆ B    ┆ C    │
│ ---        ┆ ---  ┆ ---  ┆ ---  │
│ str        ┆ f64  ┆ f64  ┆ str  │
╞════════════╪══════╪══════╪══════╡
│ count      ┆ 6.0  ┆ 6.0  ┆ 6    │
│ null_count ┆ 0.0  ┆ 0.0  ┆ 0    │
│ mean       ┆ 4.0  ┆ 4.22 ┆ null │
│ std        ┆ 2.61 ┆ 2.5  ┆ null │
│ min        ┆ 1.0  ┆ 1.4  ┆ a    │
│ max        ┆ 8.0  ┆ 8.0  ┆ d    │
│ median     ┆ 3.5  ┆ 3.95 ┆ null │
│ 25%        ┆ 2.0  ┆ 2.0  ┆ null │
│ 75%        ┆ 6.0  ┆ 6.0  ┆ null │
└────────────┴──────┴──────┴──────┘

We could add a parameter with a default value, such as decimals: int = 2 (or 3), so that users no longer have to call Expr.round themselves.

@stevenlis added the enhancement label Jul 31, 2023
@ritchie46
Member

You shouldn't round floats for formatting purposes. This is a formatting issue, the underlying data should not change.

I think you should discuss this in a thread about floating-point formatting, but I don't think it belongs in describe. It belongs in the config settings instead.

@stevenlis
Author

stevenlis commented Jul 31, 2023

@ritchie46 The issue is that a Config option would have a global impact. But I want to be able to see the actual decimal places in my main dataframe so that I know if I need to round those float columns or not. So it would make sense to control results in describe separately. I don't think it makes much sense to use a Config context manager to format the describe results every time you run it.

#7475

@stevenlis
Author

@ritchie46
Maybe we can bring float_precision into describe somehow: #7475 (comment)

@ritchie46
Member

> Config context manager to format the describe results every time you run it.

I think it does. Formatting shouldn't be in keyword arguments.

@stevenlis
Author

stevenlis commented Jul 31, 2023

@ritchie46 Sure, let me rephrase that. A describe returns a summary of calculation results. We should have a means to control the precision of the calculations, right? So, it's not formatting but precision.

@ritchie46
Member

Rounding floating point values decreases the precision.
