Improvement for describe
#8093
Comments
I will raise a PR to fix this. Also noticed the ordering is not consistent between the two.
This is a question about formatting dataframes in the output in the terminal, not specific to
This will be a challenge, as it means that different data types all need to be represented as the same data type. Note that
I appreciate the suggestion, but incorporating nice-to-haves without considering the downsides is not the way to go for a library. Just because you think these specific stats are important does not mean everyone thinks that way. Another person may think those are useless stats (for his or her use case) and has four other stats that are "very important". Yet another person comes along and complains the output is too complicated, and yet another person complains that on his or her very large dataframe the stats take too long to compute. Adding flags to turn things on leads to a debate about default values, more complexity, etc. You get the point. I am not saying we should never add or change statistics, but I would prefer to do this through a more thorough write-up discussing both pros and cons. Feel free to make such a proposal. In the meantime, I think you are best off writing your own
FYI: as for (2), this PR (in progress) would allow the number of displayed decimal places to be set frame-wide: #7475
@zundertj Thanks for the suggestions.
I'm not quite sure how this is related to dataframe formatting in general. Do you mean there should be a round method for dataframes that automatically excludes string variables? I remember @ritchie46 mentioned that he doesn't want to add too many methods to the dataframe.
If you have a string variable, it will become a string column indeed. I think it's fine and at least you can format some of the columns like
I'm not quite sure how I should interpret your arguments here; I believe those are basic points that could be used against any idea. I haven't seen any survey data on what kind of people are using polars, or on what would and would not count as an "average use case". One can literally argue
I found another issue: for a Boolean column where all values are None, the mean is NaN rather than null (column d in the output below).
import polars as pl
pl.DataFrame(
{'a': [None, None, None],
'b': [None, None, None],
'c': [None, None, None],
'd': [None, None, None]},
schema={'a': pl.Float32,
'b': pl.Int64,
'c': pl.Utf8,
'd': pl.Boolean}
).describe()
shape: (7, 5)
┌────────────┬──────┬──────┬──────┬──────┐
│ describe ┆ a ┆ b ┆ c ┆ d │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ f64 ┆ f64 ┆ str ┆ f64 │
╞════════════╪══════╪══════╪══════╪══════╡
│ count ┆ 3.0 ┆ 3.0 ┆ 3 ┆ 3.0 │
│ null_count ┆ 3.0 ┆ 3.0 ┆ 3 ┆ 3.0 │
│ mean ┆ null ┆ null ┆ null ┆ NaN │
│ std ┆ null ┆ null ┆ null ┆ null │
│ min ┆ null ┆ null ┆ null ┆ null │
│ max ┆ null ┆ null ┆ null ┆ null │
│ median ┆ null ┆ null ┆ null ┆ null │
└────────────┴──────┴──────┴──────┴──────┘ |
And also found that
import polars as pl
pl.DataFrame(
{'a': [None, float('nan'), 1, 2, 4, 5],
'b': [None, None, 1, 2, 4, 5],
'c': [float('nan'), float('nan'), 1, 2, 4, 5]}
).describe()
shape: (7, 4)
┌────────────┬─────┬──────────┬─────┐
│ describe   ┆ a   ┆ b        ┆ c   │
│ ---        ┆ --- ┆ ---      ┆ --- │
│ str        ┆ f64 ┆ f64      ┆ f64 │
╞════════════╪═════╪══════════╪═════╡
│ count      ┆ 6.0 ┆ 6.0      ┆ 6.0 │
│ null_count ┆ 1.0 ┆ 2.0      ┆ 0.0 │
│ mean       ┆ NaN ┆ 3.0      ┆ NaN │
│ std        ┆ NaN ┆ 1.825742 ┆ NaN │
│ min        ┆ 1.0 ┆ 1.0      ┆ 1.0 │
│ max        ┆ 5.0 ┆ 5.0      ┆ 5.0 │
│ median     ┆ 4.0 ┆ 3.0      ┆ 4.5 │
└────────────┴─────┴──────────┴─────┘
So this is about the formatting of the outputted dataframe, right? I.e. I can see that per your example it is annoying that for some stats there are many decimals, cluttering the result. But as
On the final discussion point: I think everyone agrees that
Let's try to be constructive on this. I'll start: Pandas (and yes, we should not just copy Pandas, I have made that point myself often enough ...) has for numeric data:
@zundertj Thanks for getting back on this. I do think a precision/format/decimals param should affect only the describe output dataframe. But I guess a global setting for float formatting, as @alexander-beedie mentioned, should work fine. I will try it out once the PR is merged. I agree with everything you said. I think at least 25% and 75% should be added; along with the mean and median, they give you a sense of the skewness of your data. I just found that polars already has an expr.skew... Btw, for some reason the current output is also not sorted: the median comes after max, instead of min, median, max, which looks a bit odd to me. I'd also like your opinion on the NaN behaviors I posted above.
FYI: seems the percentiles were already available on the Rust-side equivalent, and just got added to Python: #8169. I'm going to fix-up the typing on that commit (bit too restrictive with
Thanks for adding the 25% and 75% to the describe method, very useful. Any thoughts on adding a unique count as well? It is quite nice to have for categorical columns.
Problem description
Series.describe() but only DataFrame.
Add a precision/format/decimals param in describe. Since describe returns a DataFrame, which does not have a .round method, and the first column is a string, it perhaps makes sense to add a param to the function instead.
Add an orientation param so that each stat becomes a column (like what you usually see in journal papers). When you have many variables to describe, it becomes hard to scroll horizontally and read the results, and the current workaround with transpose doesn't look so nice. Since each stat is then a column, it also allows you to format them easily, such as converting count to an int instead of a float. (This could perhaps be solved by first_column_as_header in transpose #8095.)